Refactor iXBRL parser for improved HTML handling and memory efficiency by jfrench9 · Pull Request #579 · RoboFinSystems/robosystems

jfrench9 · 2026-04-03T06:13:10Z

Summary

Refactors the iXBRLParser in the SEC adapter to improve HTML table parsing, enhance memory efficiency, and provide more robust handling of iXBRL document structures. This is a focused refactor of a single module (ixbrl_parser.py) with significant internal improvements.

Key Accomplishments

Improved HTML handling: Enhanced the parser's ability to process and extract data from HTML table structures within iXBRL filings, resulting in more reliable and accurate parsing of SEC document content
Memory efficiency improvements: Optimized internal data handling to reduce memory footprint during parsing operations, which is particularly beneficial when processing large iXBRL documents
Expanded functionality: Added ~138 new lines of logic while refactoring ~26 existing lines, indicating both new capabilities and streamlined existing code paths
Cleaner architecture: Restructured the parser internals for better maintainability and readability, aligning with broader refactoring goals (as indicated by the branch name referencing markdown/table parsing improvements)

Breaking Changes

None expected. The refactor is internal to the iXBRLParser class. However, consumers of the parser's output should verify that returned data structures remain consistent, particularly if downstream code depends on specific table parsing output formats.

Testing Notes

Verify iXBRL document parsing against a representative set of SEC filings to ensure output correctness is maintained
Test with both small and large iXBRL filings to validate memory efficiency improvements
Pay particular attention to edge cases in HTML table structures (nested tables, merged cells, irregular row/column spans)
Confirm that existing integration points with the SEC adapter continue to function as expected

Infrastructure Considerations

No new dependencies introduced
No configuration changes required
The changes are confined to the SEC adapter layer and should not impact other system components
Consider monitoring memory usage in environments where large batches of SEC filings are processed to quantify the efficiency gains

🤖 Generated with Claude Code

Branch Info:

Source: refactor/ixbrl-table-parser-md
Target: main
Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Introduced string-find based matching for extracting ix:nonNumeric TextBlock elements to prevent catastrophic backtracking on large filings. - Added a new method, _find_textblocks, to encapsulate the logic for finding TextBlock elements. - Updated _strip_html to convert HTML tables to markdown format before stripping tags, preserving financial data structure. - Enhanced _build_continuation_map to use string-find matching, improving memory usage during processing.

…/ixbrl-table-parser-md

jfrench9 · 2026-04-03T06:13:11Z

@claude please review this PR

claude · 2026-04-03T06:13:40Z

Claude finished @jfrench9's task in 3m 3s —— View job

PR Review: Refactor iXBRL parser for improved HTML handling and memory efficiency

Tasks:

Gather context and understand the request
Read the changed file and diff
Analyze code for bugs, security, performance, and style issues
Post review feedback

Summary

This is a solid, well-motivated refactor. The core change — replacing catastrophic re.DOTALL + (.*?) regex patterns on full 5MB+ documents with string-find based nesting-aware matching — is correct and the real-world motivation (7GB+ allocation on 5MB files) is clearly documented. The HTML table-to-markdown addition is a nice improvement for preserving financial data structure. Test coverage is good overall.

A few issues worth addressing before merge:

Issues

1. Missing test for new `_strip_html` table conversion path

tests/adapters/sec/test_ixbrl_parser.py has a TestStripHtml class with 5 tests, but none cover the new branch added at ixbrl_parser.py:76-77. The table-to-markdown conversion is an observable behavior change — a test should confirm that table content is preserved in structured form rather than collapsed.

def test_converts_table_to_markdown():
    html = "<table><tr><th>Period</th><th>Revenue</th></tr><tr><td>Q1</td><td>1000</td></tr></table>"
    result = _strip_html(html)
    assert "Period" in result
    assert "Revenue" in result
    assert "Q1" in result

Fix this →

2. Duplicated nesting-find logic

_find_textblocks (ixbrl_parser.py:192) and _build_continuation_map (ixbrl_parser.py:258) share ~40 lines of near-identical nesting-aware tag-matching logic. The only differences are what gets extracted (element name/content vs continuation id/content) and the early-exit filter.

This isn't a correctness issue now, but if a bug is found in the nesting logic (e.g., handling of self-closing edge cases) it would need to be fixed in two places. Consider extracting a shared _find_nested_tags(html, open_tag, close_tag) -> Iterator[tuple[str, str]] helper that yields (attrs, inner_html) tuples.

Fix this →

3. `attrs` extraction inconsistency between the two methods

In _find_textblocks (ixbrl_parser.py:215):

attrs = html[start + len(open_tag) : tag_end]  # excludes tag name

In _build_continuation_map (ixbrl_parser.py:279):

attrs = html[start:tag_end]  # includes "<ix:continuation" prefix

Not a bug — the downstream regexes work either way — but it's a subtle inconsistency that makes the two methods harder to refactor together. If the nesting logic is extracted (see #2), this should be normalized.

4. `_strip_html` table detection is not fully case-insensitive

ixbrl_parser.py:76:

if "<table" in html or "<TABLE" in html:

Mixed-case variants like <Table> would bypass the conversion. The html_tables_to_markdown function itself handles case-insensitively via re.IGNORECASE, so the only harm is a missed optimization path. Prefer:

if "<table" in html.lower():

Low risk in practice (SEC filings use <table> or <TABLE>), but trivially fixed.

Fix this →

5. `_find_textblocks` and `_build_continuation_map` don't access `self`

Both methods are instance methods but use no instance state. They could be @staticmethod or module-level functions (consistent with the existing _strip_html, _label_from_element_name, _extract_elements_from_block pattern). Minor style point.

What's Good

Core fix is correct: The string-find + nesting-counter approach correctly handles nested ix:nonNumeric tags without the O(n²) backtracking. The visited-set cycle protection for continuation chains is also correct.
Documentation is excellent: The inline comment about "7GB+ allocation on 5MB files" gives the right level of context for future readers.
Early-exit optimization: Skipping non-TextBlock elements before doing the full close-tag search is a good O(n) optimization.
html_table.py design: Using HTMLParser for cell text extraction rather than regex is the right call for handling malformed/nested HTML in cells.
Test coverage: The existing tests cover the main logic paths well, including the cycle-protection test.

|

… review command - Introduced a new method, _find_nested_blocks, to efficiently find nested HTML elements while avoiding memory issues. - Updated the iXBRLParser to utilize _find_nested_blocks for processing ix:nonNumeric and ix:continuation elements. - Enhanced the PR review command documentation to clarify valid JSON fields and improve command usage examples.

jfrench9 added 3 commits April 3, 2026 00:21

Merge commit 'ab71b19989dbf2a4456c3466884fbbae030a9904' into refactor…

c3ee0aa

…/ixbrl-table-parser-md

Merge commit 'e7e46c647b0141b3e53a45b0004c36e9f207aa1e' into refactor…

1a73015

…/ixbrl-table-parser-md

jfrench9 merged commit e025e6b into main Apr 3, 2026
7 checks passed

jfrench9 deleted the refactor/ixbrl-table-parser-md branch April 3, 2026 06:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor iXBRL parser for improved HTML handling and memory efficiency#579

Refactor iXBRL parser for improved HTML handling and memory efficiency#579
jfrench9 merged 4 commits into
mainfrom
refactor/ixbrl-table-parser-md

jfrench9 commented Apr 3, 2026

Uh oh!

jfrench9 commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Apr 3, 2026

Summary

Key Accomplishments

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Apr 3, 2026

Uh oh!

claude Bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Refactor iXBRL parser for improved HTML handling and memory efficiency

Summary

Issues

1. Missing test for new _strip_html table conversion path

2. Duplicated nesting-find logic

3. attrs extraction inconsistency between the two methods

4. _strip_html table detection is not fully case-insensitive

5. _find_textblocks and _build_continuation_map don't access self

What's Good

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 3, 2026 •

edited

Loading

1. Missing test for new `_strip_html` table conversion path

3. `attrs` extraction inconsistency between the two methods

4. `_strip_html` table detection is not fully case-insensitive

5. `_find_textblocks` and `_build_continuation_map` don't access `self`