Optimize XUI/XML parsing: LLXMLNode, LLStringTable, LLInitParam, LLXUIParser #313

Open
RyeMutt wants to merge 9 commits into develop from rye/llxml-parse-opt
@RyeMutt RyeMutt commented Jun 14, 2026

Member

Description

A focused performance and modernization pass over the XUI/XML parsing path and the LLInitParam param-block system that backs it. No user-facing behavior change — the same XUI/XML parses to the same widget trees and values; these are allocation/copy reductions, a handful of latent-defect fixes, and C++ modernization. 8 commits, grouped by area:

XML node parsing (llxml)

  • Eliminate O(n²) value accumulation in the parser. The expat character-data callback copied the node's entire accumulated value out (getValue()) and back (setValue()) on every chunk, so any text delivered in multiple chunks (large bodies, entity refs, expat buffer splits) was quadratic. Now appends in place via a new appendValue(string_view) / setValue(string&&).
  • Fewer per-attribute allocations in StartXMLNode — attribute names inspected as string_view (the expat atts array is already NUL-terminated), values moved into the node once, dead per-attribute dedup lookup removed (expat already rejects duplicate attribute names). EndXMLNode no longer copies the value to test it for whitespace; setBoolValue's quadratic llformat accumulator switched to append. Output is byte-for-byte identical.
  • Modernize LLXMLNode construction — default member initializers replace three duplicated init lists (also fixes a latent -Wreorder in the copy ctor); drop an unused <lldir.h> include; tidy escapeXML.
  • LLXmlTreeParser::characterData — append the char range directly instead of building a per-chunk std::string.
  • Unit tests — stand up the previously-empty llxml test target with 7 TUT cases covering the touched parse paths.
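The quadratic-versus-linear accumulation described above can be sketched as follows. This is a minimal illustration, not the actual llxml code: `SketchNode` and the two callback functions are hypothetical stand-ins, and only the `appendValue(string_view)` / `setValue(string&&)` shapes are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a parsed node; not the real LLXMLNode.
class SketchNode
{
public:
    const std::string& getValue() const { return mValue; }
    void setValue(const std::string& value) { mValue = value; }
    void setValue(std::string&& value) { mValue = std::move(value); }
    // Appends a chunk in place; amortized cost is proportional to the chunk.
    void appendValue(std::string_view chunk) { mValue.append(chunk); }

private:
    std::string mValue;
};

// Old shape: copy the whole accumulated value out, extend it, copy it back.
// Each chunk costs O(total length so far), so n chunks cost O(n^2) overall.
inline void onCharacterDataQuadratic(SketchNode& node, const char* data, std::size_t len)
{
    std::string value = node.getValue();
    value.append(data, len);
    node.setValue(value);
}

// New shape: append only the new bytes, so total work is linear in the text.
inline void onCharacterDataLinear(SketchNode& node, const char* data, std::size_t len)
{
    node.appendValue(std::string_view(data, len));
}
```

Both callbacks produce the same accumulated value; only the amount of copying per chunk differs.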

String table (llcommon/llstringtable)

  • Remove the dead STRING_TABLE_HASH_MAP branches (only ever active on VS2002/2003 via long-gone hash_multimap), modernize the surviving list-bucket implementation, and delete LLStringTableEntry's implicit copy ctor — it owns a raw char*, so a shallow copy is a double-free hazard (it's never copied in practice).
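As a rough sketch of the single-owner invariant, assuming an entry type that owns a heap-allocated C string: `SketchEntry` below is hypothetical and is not the real LLStringTableEntry. It only shows how deleting the copy operations turns an accidental shallow copy into a compile error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical entry that owns a raw char* buffer.
class SketchEntry
{
public:
    explicit SketchEntry(const char* str)
    {
        const std::size_t len = std::strlen(str) + 1; // include the terminator
        mString = new char[len];
        std::memcpy(mString, str, len);
    }

    ~SketchEntry() { delete[] mString; }

    // An implicit copy would duplicate the pointer, and both destructors
    // would then free the same buffer. Deleting the copies forbids that.
    SketchEntry(const SketchEntry&) = delete;
    SketchEntry& operator=(const SketchEntry&) = delete;

    const char* c_str() const { return mString; }

private:
    char* mString = nullptr;
};

static_assert(!std::is_copy_constructible_v<SketchEntry>,
              "entries must not be copyable");
```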

Param blocks (llcommon/llinitparam)

  • Defects: Multiple::isValid() now uses inclusive bounds to match validate() (an AtLeast<1> with exactly one element no longer reports not-provided); getPossibleValues() no longer grows a function-static vector unboundedly on the inspect path.
  • Copy/alloc reductions: getValueName()/calcValueName() return by const ref; the unnamed-handle set is built once for O(1) dup checks; non-block Multiple deserializes directly into the freshly-added element.
  • LazyValue reimplemented over std::unique_ptr (replaces hand-rolled rule-of-three, adds move support); assorted emplace_back/move adoptions.
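The bounds defect can be illustrated with a small sketch. The helper names below are hypothetical and the real check lives inside `Multiple::isValid()`; the point is only that a strict comparison rejects a count sitting exactly on its minimum or maximum.

```cpp
#include <cassert>
#include <cstddef>

// Strict comparison: a count equal to min or max is treated as out of range.
constexpr bool countInRangeStrict(std::size_t count, std::size_t min, std::size_t max)
{
    return min < count && count < max;
}

// Inclusive comparison: a count equal to min or max is accepted.
constexpr bool countInRangeInclusive(std::size_t count, std::size_t min, std::size_t max)
{
    return min <= count && count <= max;
}

// An "at least 1" constraint holding exactly one element (max chosen arbitrarily).
static_assert(!countInRangeStrict(1, 1, 8), "strict bounds reject the boundary");
static_assert(countInRangeInclusive(1, 1, 8), "inclusive bounds accept it");
```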

XUI parser (llui/llxuiparser)

  • Defects: readXUI() dereferenced node before its null check and could deref a stale mCurReadNode; ScopedFile called fclose(NULL) on open failure. Read hot path: compute getSanitizedValue() once per node, find('.') over find("."), emplace_back.
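A minimal sketch of the guarded close, assuming a simple RAII wrapper around `std::FILE*`. `SketchScopedFile` is a hypothetical name; the real `ScopedFile` in llxuiparser.cpp may differ in shape.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical RAII wrapper around a C stdio file handle.
class SketchScopedFile
{
public:
    SketchScopedFile(const char* filename, const char* mode)
        : mFile(std::fopen(filename, mode))
    {
    }

    ~SketchScopedFile()
    {
        // fclose on a null stream is undefined behavior, so close only
        // when the open actually succeeded.
        if (mFile)
        {
            std::fclose(mFile);
        }
    }

    SketchScopedFile(const SketchScopedFile&) = delete;
    SketchScopedFile& operator=(const SketchScopedFile&) = delete;

    bool isOpen() const { return mFile != nullptr; }
    std::FILE* get() const { return mFile; }

private:
    std::FILE* mFile;
};
```

Callers would check `isOpen()` before reading; the destructor is then safe on both the success and the failure path.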

Related Issues

No tracking issue — internal optimization/cleanup pass on the XUI/XML load path.

Issue Link: N/A


Checklist

  • I have provided a clear title and detailed description for this pull request.
  • If useful, I have included media such as screenshots and video to show off my changes.
  • I have tested the changes locally and verified they work as intended.
  • All new and existing tests pass.
  • Code follows the project's style guidelines.
  • Documentation has been updated if needed.
  • Any dependent changes have been merged and published in downstream modules
  • I have reviewed the contributing guidelines.

Additional Notes

  • Verification: llxml, llcommon, and llui build clean (RelWithDebInfo, VS2026); the new llxml unit tests pass 7/7. The parser changes are behavior-preserving (byte-for-byte output, including the escaped-string edge cases) — full in-viewer runtime verification is still recommended before merge.
  • Suggested review order: the llxml commits are behavior-preserving refactors; the llcommon/llui commits additionally fix the latent defects called out above (param-block min/max validity, unbounded inspect-path vector, null-deref in readXUI, fclose(NULL), string-table double-free).

RyeMutt and others added 8 commits June 14, 2026 00:17
Replace the triplicated constructor init lists with default member
initializers in the header; the constructors now only set what differs
per overload. This also fixes a latent -Wreorder in the copy constructor
(mParser/mParent were initialized out of declaration order).

Drop the unused <lldir.h> include (nothing references LLDir), and give
escapeXML a reserve() plus a range-for. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The character-data callback (XMLData) read the node's entire accumulated
value out via getValue() and wrote it back via setValue() on every chunk,
making multi-chunk text (anything over expat's buffer, entity refs, etc.)
quadratic. Add appendValue(string_view) and a setValue(string&&) overload
and append in place instead - linear in the text length.

StartXMLNode now inspects attribute names as string_view (the expat atts
array is already NUL-terminated), moves each attribute value into its node
once, and drops the dead per-attribute dedup lookup (expat already rejects
duplicate attribute names) plus an unused local. EndXMLNode no longer
copies the value to test it for whitespace.

setBoolValue built its string with llformat("%s ...", val.c_str(), ...),
re-formatting the whole accumulator each element; switch to append. All
output is byte-for-byte identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
characterData built a std::string for every character-data chunk even
when not dumping, then appended that copy to the node contents. Take a
(const char*, len) overload of appendContents and append the chunk
directly; only materialize a string in the (rarely taken) dump branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stand up the previously-empty llxml unit-test target with TUT coverage
for the parse path that the preceding commits touched: attribute/value
round-trip, escaped-string unescaping, multi-chunk text accumulation
(via parseStream's 1KB chunks), typed float arrays, bool serialization,
escapeXML, and deepCopy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
STRING_TABLE_HASH_MAP is only ever defined under _MSC_VER 1300-1399
(VS2002/2003), and its branches use std::hash_multimap / __gnu_cxx::
hash_multimap, neither of which exists in modern stdlib. Remove the dead
branches from the header and from check/add/removeStringEntry and the
destructor, leaving the live list-bucket implementation. Also drop the
#if 0 alternate hash in hash_my_string, value-initialize the bucket
array, inline the now-dead ret_val locals, and switch NULL -> nullptr.

LLStringTableEntry owns a raw char* with a user-declared destructor but
had an implicit copy ctor (shallow copy -> double free). It is never
copied (always heap-allocated and held by pointer), so delete the copy
ctor/assignment to make the single-owner invariant explicit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
readXUI() dereferenced node (node->getName()) before its isNull() check,
and the warning branch further dereferenced a null/stale mCurReadNode;
check for null first and bail out. ScopedFile's destructor called
fclose() even when the file failed to open (fclose(NULL) is UB); guard
it.

readXUIImpl() called getSanitizedValue() up to twice per node; compute it
once and reuse. Use find('.') instead of find(".") for the single-char
scan, and emplace_back name-stack/output-stack entries instead of
push_back(make_pair(...)).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the hand-rolled new/delete and manual rule-of-three with a
std::unique_ptr<T> member. Deep-copy value semantics and lazy heap
allocation are preserved -- heap storage (rather than std::optional)
keeps a Lazy<> param small when T is a large, usually-absent block --
and move ctor/assignment are added (the type was previously copy-only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
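One way to picture the unique_ptr-backed lazy value described in the commit message above is the sketch below. `SketchLazy` is a hypothetical simplification, not the real `LazyValue<T>`; it keeps deep-copy semantics and lazy allocation and gets move support from the defaulted move operations.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical lazily allocated value with deep-copy semantics.
template <typename T>
class SketchLazy
{
public:
    SketchLazy() = default;

    // Copies clone the pointee instead of sharing or shallow-copying it.
    SketchLazy(const SketchLazy& other) : mPtr(clone(other.mPtr)) {}
    SketchLazy& operator=(const SketchLazy& other)
    {
        if (this != &other)
        {
            mPtr = clone(other.mPtr);
        }
        return *this;
    }

    // Moves just transfer ownership of the heap object.
    SketchLazy(SketchLazy&&) noexcept = default;
    SketchLazy& operator=(SketchLazy&&) noexcept = default;

    bool empty() const { return mPtr == nullptr; }

    void set(const T& value)
    {
        if (mPtr)
        {
            *mPtr = value;
        }
        else
        {
            mPtr = std::make_unique<T>(value);
        }
    }

    // Allocates a default-constructed T on first access.
    const T& get() const
    {
        if (!mPtr)
        {
            mPtr = std::make_unique<T>();
        }
        return *mPtr;
    }

private:
    static std::unique_ptr<T> clone(const std::unique_ptr<T>& src)
    {
        if (src)
        {
            return std::make_unique<T>(*src);
        }
        return nullptr;
    }

    mutable std::unique_ptr<T> mPtr; // mutable so const get() can allocate
};
```

An empty instance costs one pointer, which is the size argument the commit message gives for preferring heap storage over std::optional.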
Defects:
- Multiple::isValid() used strict < while the registered validate() uses
  <=, so a Multiple at exactly its min/max count (e.g. AtLeast<1> with a
  single element) reported not-provided. Make both bounds inclusive.
- getPossibleValues() appended to a function-static vector with no guard,
  so the inspect path (XSD/RNG writers) grew it unboundedly with
  duplicate entries. Populate it once.

Copy/alloc reductions:
- getValueName()/calcValueName() return const std::string& instead of by
  value, removing string copies in the serialize loops.
- serializeBlock()/inspectBlock() build the unnamed-handle set once for
  O(1) duplicate checks instead of an O(named*unnamed) rescan per param.
- Non-block Multiple deserialize parses directly into a freshly-added
  element (popping on failure) instead of into a local that is then
  copied in, matching how block-Multiple already works.

Moves:
- emplace_back over push_back(make_pair(...)); mValues = value over
  std::copy + back_inserter; add() uses emplace_back; rvalue
  set(container_t&&) plus matching Multiple assignment/call operators.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
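The build-once set for duplicate checks mentioned above might look roughly like this. `ParamHandle` and `namedOnlyHandles` are hypothetical names for illustration; the real serializeBlock()/inspectBlock() code works on the param block's own descriptor types.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using ParamHandle = std::size_t; // hypothetical handle type

// Returns the named handles that do not also appear in the unnamed list.
// The set is built once, so each membership test is O(1) on average
// instead of a linear rescan of the unnamed list per named param.
inline std::vector<ParamHandle> namedOnlyHandles(const std::vector<ParamHandle>& unnamed,
                                                 const std::vector<ParamHandle>& named)
{
    const std::unordered_set<ParamHandle> unnamed_set(unnamed.begin(), unnamed.end());

    std::vector<ParamHandle> result;
    result.reserve(named.size());
    for (ParamHandle handle : named)
    {
        if (unnamed_set.count(handle) == 0)
        {
            result.emplace_back(handle);
        }
    }
    return result;
}
```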
coderabbitai Bot commented Jun 14, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 447bee39-2838-4666-8c53-e26c1594e6e4

📥 Commits

Reviewing files that changed from the base of the PR and between 927c92e and 3a00b95.

📒 Files selected for processing (2)
  • indra/llxml/llxmlnode.cpp
  • indra/llxml/llxmltree.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • indra/llxml/llxmltree.cpp
  • indra/llxml/llxmlnode.cpp

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved robustness in XML parsing (better null-safety and whitespace handling).
    • Corrected handling of multi-value parameter count boundaries during validation/serialization.
  • Refactor

    • Improved performance by reducing redundant work in parameter and XML value processing.
    • Simplified internal string table behavior for more consistent lookups.
  • Tests

    • Added unit tests covering XML node parsing, escaping/unescaping, round-trips, and large/streamed character data.

Walkthrough

The PR modernizes three core areas of the codebase: LLStringTable removes its conditional hash-map implementation path and makes entries non-copyable; LLInitParam adopts std::unique_ptr, move semantics, const-ref returns, emplace_back, and std::unordered_set for O(1) named-param duplicate detection; LLXMLNode, LLXmlTree, and LLXUIParser gain rvalue/string_view APIs, simplified parsing logic, and a comprehensive 7-case unit test suite.

Changes

LLStringTable Simplification

| Layer | File(s) | Summary |
|---|---|---|
| Non-copyable entry + hash-map removal | indra/llcommon/llstringtable.h | Deletes copy constructor and copy assignment from LLStringTableEntry, removes the STRING_TABLE_HASH_MAP preprocessor block, and collapses LLStringTable to unconditionally declare list-based storage types. |
| Bucket-list constructor, destructor, and lookup/mutate | indra/llcommon/llstringtable.cpp | Rewrites constructor with value-initialized bucket array allocation, destructor with unconditional cleanup, hash_my_string with bitmask return, and checkStringEntry/addStringEntry/removeString to use only mStringList with push_front insertion and mUniqueEntries tracking. |

LLInitParam Modernization

| Layer | File(s) | Summary |
|---|---|---|
| TypeValues const-ref returns and getPossibleValues guard | indra/llcommon/llinitparam.h | Changes getValueName/calcValueName to return const std::string& backed by static empty strings in both TypeValues<T> and TypeValuesHelper, and adds a static bool sInitialized guard to getPossibleValues() to prevent duplicate appends on repeated calls. |
| LazyValue smart-pointer refactor | indra/llcommon/llinitparam.h | Replaces raw T* with mutable std::unique_ptr<T> in LazyValue<T>, updates deep-copy assignment behavior, and moves allocation to std::make_unique in set() and ensureInstance(). |
| Multiple-param bounds, in-place deserialization, and move APIs | indra/llcommon/llinitparam.h | Switches scalar and block multiple-param validity to inclusive bounds (<=), reworks scalar deserialization to emplace directly into mValues with pop_back on parse failure, adds set(container_t&&) move overloads, uses emplace_back for insertion, and adds move operator=/operator() to Block<>::Multiple. |
| O(1) unnamed-handle set in serializeBlock/inspectBlock | indra/llcommon/llinitparam.cpp, indra/llcommon/llinitparam.h | Adds <unordered_set> header, updates addParam to use emplace_back, and precomputes unnamed-handle sets in serializeBlock and inspectBlock replacing nested linear scans with O(1) membership tests. |
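The populate-once guard on getPossibleValues() summarized above could be sketched as below. The function name and the three example values are hypothetical and chosen only for illustration; the guard variable name follows the walkthrough's `sInitialized`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical accessor that fills a function-static vector exactly once.
// Without the guard, every call would append the same entries again and
// the vector would grow without bound on repeated inspection.
inline const std::vector<std::string>& getPossibleValuesSketch()
{
    static std::vector<std::string> sValues;
    static bool sInitialized = false;
    if (!sInitialized)
    {
        sInitialized = true;
        sValues.emplace_back("left");   // illustrative values only
        sValues.emplace_back("center");
        sValues.emplace_back("right");
    }
    return sValues;
}
```

As written, the flag is not synchronized, so this sketch assumes single-threaded use.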

LLXMLNode, LLXmlTree, and LLXUIParser Modernization

| Layer | File(s) | Summary |
|---|---|---|
| LLXMLNode new API declarations and default member initializers | indra/llxml/llxmlnode.h | Declares setValue(std::string&&) rvalue overload and appendValue(std::string_view) helper, and adds default member initializers for all parsing-related fields including mParser, mLineNumber, and type/version/length/precision/encoding enums. |
| LLXMLNode parsing and value-method implementation | indra/llxml/llxmlnode.cpp | Adds <string_view>, refactors StartXMLNode to use string_view for attribute parsing and sscanf for numeric attributes (id, version, size/length, precision), simplifies EndXMLNode whitespace clearing via find_first_not_of, rewrites XMLData to use appendValue, implements setValue(std::string&&) and appendValue(string_view), and optimizes escapeXML/setBoolValue with reserve and move semantics. |
| LLXmlTreeNode::appendContents pointer+length refactor | indra/llxml/llxmltree.h, indra/llxml/llxmltree.cpp | Replaces appendContents(const std::string&) with appendContents(const char*, size_type) in declaration and implementation; updates endElement with null guard, characterData with input validation and early returns, and content appending to use the new pointer+length signature directly. |
| LLXUIParser emplace_back, early null-check, and find char | indra/llui/llxuiparser.cpp | Replaces parserWarning+branch with early LL_WARNS return in readXUI, computes sanitized text once in readXUIImpl, replaces push_back(make_pair(...)) with emplace_back throughout both LLXUIParser and LLSimpleXUIParser, guards ScopedFile::fclose with null check, and changes find(".") to find('.') for character search. |
| LLXMLNode unit test suite and CMake wiring | indra/llxml/CMakeLists.txt, indra/llxml/tests/llxmlnode_test.cpp | Adds llxmlnode_test.cpp with seven test cases covering element/attribute parsing, escaped-string unescaping, stream-based character accumulation, typed float array parsing, boolean array serialization round-trip, XML special-character escaping, and deepCopy preservation; updates CMakeLists.txt to include llxmlnode.cpp in test sources with ll::expat dependency. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hop, hop through the code so fine,
Smart pointers now keep mem in line,
No more hash-map, just lists in a row,
emplace_back makes the stack vectors glow,
string_view whispers, "no copy for me!"
The rabbit refactors, wild and free! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes across the four modified components (XUI/XML parsing, LLStringTable, LLInitParam, LLXUIParser), accurately reflecting the focus of the changeset. |
| Description check | ✅ Passed | The PR description is comprehensive, including a detailed breakdown of changes by component, rationale for optimizations, defect fixes, modernization efforts, verification steps, and a review order. It aligns well with the template structure and provides sufficient context. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
indra/llxml/llxmlnode.cpp (1)

368-385: ⚠️ Potential issue | 🟠 Major

Fix sscanf format specifiers for unsigned targets.

These calls parse into U32 (which is unsigned int), but use %d (which expects signed int*). That format/type mismatch is undefined behavior. Use %u for U32.

💡 Proposed fix
-            if (sscanf(atts[pos + 1], "%d.%d", &version_major, &version_minor) > 0)
+            if (sscanf(atts[pos + 1], "%u.%u", &version_major, &version_minor) > 0)
@@
-            if (sscanf(atts[pos + 1], "%d", &length) > 0)
+            if (sscanf(atts[pos + 1], "%u", &length) > 0)
@@
-            if (sscanf(atts[pos + 1], "%d", &precision) > 0)
+            if (sscanf(atts[pos + 1], "%u", &precision) > 0)
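A standalone sketch of the matched specifier, assuming `U32` is a typedef for `unsigned int` as the comment states. `parseVersion` is a hypothetical helper for illustration and is not part of llxmlnode.cpp.

```cpp
#include <cassert>
#include <cstdio>

using U32 = unsigned int; // assumption taken from the review comment

// Parses "major.minor" (or just "major") into unsigned targets.
// %u expects unsigned int*, which matches U32*; %d would expect int*.
inline bool parseVersion(const char* text, U32& major, U32& minor)
{
    major = 0;
    minor = 0;
    // Returns true when at least the major component was converted.
    return std::sscanf(text, "%u.%u", &major, &minor) > 0;
}
```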
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@indra/llxml/llxmlnode.cpp` around lines 368 - 385, The sscanf format
specifiers use %d (signed int) but parse into U32 (unsigned int) variables,
creating undefined behavior. Replace all %d format specifiers with %u in the
three sscanf calls: the version_major and version_minor parsing, the length
variable parsing, and the precision variable parsing. This ensures the format
specifiers match the target variable types.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@indra/llxml/llxmltree.cpp`:
- Around line 624-627: The characterData() method dereferences mCurrent without
a NULL check before calling appendContents(), and endElement() has the same
vulnerability when accessing mCurrent->mContents.empty(). Expat can invoke these
callbacks for text before the root element exists or after it closes, leaving
mCurrent as NULL. Add a guard check to verify that mCurrent is not NULL before
dereferencing it in both the characterData() method (where the appendContents
call occurs) and the endElement() method (where mContents.empty() is accessed),
ensuring the code only proceeds if mCurrent points to a valid element.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6452996f-f9b8-4aa9-b5d2-bf477cb30912

📥 Commits

Reviewing files that changed from the base of the PR and between 753e934 and 927c92e.

📒 Files selected for processing (11)
  • indra/llcommon/llinitparam.cpp
  • indra/llcommon/llinitparam.h
  • indra/llcommon/llstringtable.cpp
  • indra/llcommon/llstringtable.h
  • indra/llui/llxuiparser.cpp
  • indra/llxml/CMakeLists.txt
  • indra/llxml/llxmlnode.cpp
  • indra/llxml/llxmlnode.h
  • indra/llxml/llxmltree.cpp
  • indra/llxml/llxmltree.h
  • indra/llxml/tests/llxmlnode_test.cpp

Comment thread indra/llxml/llxmltree.cpp
CodeRabbit review on #313:
- LLXmlTreeParser::characterData/endElement dereferenced mCurrent without
  a null check. It is structurally non-null for well-formed input (expat
  routes prolog/epilog text to the default handler, and endElement only
  fires for an opened element), but guard both as cheap defensive hardening.
- StartXMLNode parsed version/length/precision into U32 with %d (expects
  int*); switch to %u to match the target type. Output is unchanged for
  the non-negative values these attributes hold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
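The defensive guard described in this commit message can be sketched as below. `SketchTreeNode` and `SketchTreeParser` are hypothetical stand-ins for the llxmltree types; the sketch only shows the early return when there is no current element or no usable input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical node holding accumulated character data.
struct SketchTreeNode
{
    std::string mContents;
};

// Hypothetical parser tracking the element currently being filled.
class SketchTreeParser
{
public:
    void setCurrent(SketchTreeNode* node) { mCurrent = node; }

    // Character-data callback shape: pointer plus length, not NUL-terminated.
    void characterData(const char* s, int len)
    {
        // Bail out early instead of dereferencing a null current node
        // or appending from an invalid range.
        if (!mCurrent || !s || len <= 0)
        {
            return;
        }
        mCurrent->mContents.append(s, static_cast<std::size_t>(len));
    }

private:
    SketchTreeNode* mCurrent = nullptr;
};
```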