Skip to content

Refactor iXBRL parser for improved HTML handling and memory efficiency#579

Merged
jfrench9 merged 4 commits into
mainfrom
refactor/ixbrl-table-parser-md
Apr 3, 2026
Merged

Refactor iXBRL parser for improved HTML handling and memory efficiency#579
jfrench9 merged 4 commits into
mainfrom
refactor/ixbrl-table-parser-md

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Apr 3, 2026

Summary

Refactors the iXBRLParser in the SEC adapter to improve HTML table parsing, enhance memory efficiency, and provide more robust handling of iXBRL document structures. This is a focused refactor of a single module (ixbrl_parser.py) with significant internal improvements.

Key Accomplishments

  • Improved HTML handling: Enhanced the parser's ability to process and extract data from HTML table structures within iXBRL filings, resulting in more reliable and accurate parsing of SEC document content
  • Memory efficiency improvements: Optimized internal data handling to reduce memory footprint during parsing operations, which is particularly beneficial when processing large iXBRL documents
  • Expanded functionality: Added ~138 new lines of logic while refactoring ~26 existing lines, indicating both new capabilities and streamlined existing code paths
  • Cleaner architecture: Restructured the parser internals for better maintainability and readability, aligning with broader refactoring goals (as indicated by the branch name referencing markdown/table parsing improvements)

Breaking Changes

  • None expected. The refactor is internal to the iXBRLParser class. However, consumers of the parser's output should verify that returned data structures remain consistent, particularly if downstream code depends on specific table parsing output formats.

Testing Notes

  • Verify iXBRL document parsing against a representative set of SEC filings to ensure output correctness is maintained
  • Test with both small and large iXBRL filings to validate memory efficiency improvements
  • Pay particular attention to edge cases in HTML table structures (nested tables, merged cells, irregular row/column spans)
  • Confirm that existing integration points with the SEC adapter continue to function as expected

Infrastructure Considerations

  • No new dependencies introduced
  • No configuration changes required
  • The changes are confined to the SEC adapter layer and should not impact other system components
  • Consider monitoring memory usage in environments where large batches of SEC filings are processed to quantify the efficiency gains

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/ixbrl-table-parser-md
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

jfrench9 added 3 commits April 3, 2026 00:21
- Introduced string-find based matching for extracting ix:nonNumeric TextBlock elements to prevent catastrophic backtracking on large filings.
- Added a new method, _find_textblocks, to encapsulate the logic for finding TextBlock elements.
- Updated _strip_html to convert HTML tables to markdown format before stripping tags, preserving financial data structure.
- Enhanced _build_continuation_map to use string-find matching, improving memory usage during processing.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Apr 3, 2026

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Apr 3, 2026

Claude finished @jfrench9's task in 3m 3s —— View job


PR Review: Refactor iXBRL parser for improved HTML handling and memory efficiency

Tasks:

  • Gather context and understand the request
  • Read the changed file and diff
  • Analyze code for bugs, security, performance, and style issues
  • Post review feedback

Summary

This is a solid, well-motivated refactor. The core change — replacing catastrophic re.DOTALL + (.*?) regex patterns on full 5MB+ documents with string-find based nesting-aware matching — is correct and the real-world motivation (7GB+ allocation on 5MB files) is clearly documented. The HTML table-to-markdown addition is a nice improvement for preserving financial data structure. Test coverage is good overall.

A few issues worth addressing before merge:


Issues

1. Missing test for new _strip_html table conversion path

tests/adapters/sec/test_ixbrl_parser.py has a TestStripHtml class with 5 tests, but none cover the new branch added at ixbrl_parser.py:76-77. The table-to-markdown conversion is an observable behavior change — a test should confirm that table content is preserved in structured form rather than collapsed.

def test_converts_table_to_markdown():
    html = "<table><tr><th>Period</th><th>Revenue</th></tr><tr><td>Q1</td><td>1000</td></tr></table>"
    result = _strip_html(html)
    assert "Period" in result
    assert "Revenue" in result
    assert "Q1" in result

Fix this →


2. Duplicated nesting-find logic

_find_textblocks (ixbrl_parser.py:192) and _build_continuation_map (ixbrl_parser.py:258) share ~40 lines of near-identical nesting-aware tag-matching logic. The only differences are what gets extracted (element name/content vs continuation id/content) and the early-exit filter.

This isn't a correctness issue now, but if a bug is found in the nesting logic (e.g., handling of self-closing edge cases) it would need to be fixed in two places. Consider extracting a shared _find_nested_tags(html, open_tag, close_tag) -> Iterator[tuple[str, str]] helper that yields (attrs, inner_html) tuples.

Fix this →


3. attrs extraction inconsistency between the two methods

In _find_textblocks (ixbrl_parser.py:215):

attrs = html[start + len(open_tag) : tag_end]  # excludes tag name

In _build_continuation_map (ixbrl_parser.py:279):

attrs = html[start:tag_end]  # includes "<ix:continuation" prefix

Not a bug — the downstream regexes work either way — but it's a subtle inconsistency that makes the two methods harder to refactor together. If the nesting logic is extracted (see #2), this should be normalized.


4. _strip_html table detection is not fully case-insensitive

ixbrl_parser.py:76:

if "<table" in html or "<TABLE" in html:

Mixed-case variants like <Table> would bypass the conversion. The html_tables_to_markdown function itself handles case-insensitively via re.IGNORECASE, so the only harm is a missed optimization path. Prefer:

if "<table" in html.lower():

Low risk in practice (SEC filings use <table> or <TABLE>), but trivially fixed.

Fix this →


5. _find_textblocks and _build_continuation_map don't access self

Both methods are instance methods but use no instance state. They could be @staticmethod or module-level functions (consistent with the existing _strip_html, _label_from_element_name, _extract_elements_from_block pattern). Minor style point.


What's Good

  • Core fix is correct: The string-find + nesting-counter approach correctly handles nested ix:nonNumeric tags without the O(n²) backtracking. The visited-set cycle protection for continuation chains is also correct.
  • Documentation is excellent: The inline comment about "7GB+ allocation on 5MB files" gives the right level of context for future readers.
  • Early-exit optimization: Skipping non-TextBlock elements before doing the full close-tag search is a good O(n) optimization.
  • html_table.py design: Using HTMLParser for cell text extraction rather than regex is the right call for handling malformed/nested HTML in cells.
  • Test coverage: The existing tests cover the main logic paths well, including the cycle-protection test.

|

… review command

- Introduced a new method, _find_nested_blocks, to efficiently find nested HTML elements while avoiding memory issues.
- Updated the iXBRLParser to utilize _find_nested_blocks for processing ix:nonNumeric and ix:continuation elements.
- Enhanced the PR review command documentation to clarify valid JSON fields and improve command usage examples.
@jfrench9 jfrench9 merged commit e025e6b into main Apr 3, 2026
7 checks passed
@jfrench9 jfrench9 deleted the refactor/ixbrl-table-parser-md branch April 3, 2026 06:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant