Refactor iXBRL parser for improved HTML handling and memory efficiency #579
- Introduced string-find based matching for extracting ix:nonNumeric TextBlock elements to prevent catastrophic backtracking on large filings.
- Added a new method, _find_textblocks, to encapsulate the logic for finding TextBlock elements.
- Updated _strip_html to convert HTML tables to markdown format before stripping tags, preserving financial data structure.
- Enhanced _build_continuation_map to use string-find matching, improving memory usage during processing.
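For readers who want a concrete picture of the table-to-markdown step mentioned in the commit above, here is a minimal sketch. It is not the PR's `_strip_html`; the names `_TableCollector` and `table_to_markdown` are hypothetical, and it assumes a single, non-nested table and treats the first row as the header.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collect cell text row by row (hypothetical helper, not from the PR)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse whitespace and escape pipes so the cell fits on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):  # drop rows whose cells are all empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_markdown(table_html: str) -> str:
    """Render one HTML table as a markdown table; returns "" if no rows found."""
    collector = _TableCollector()
    collector.feed(table_html)
    collector.close()
    rows = collector.rows
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


sample = (
    "<table><tr><th>Item</th><th>2023</th></tr>"
    "<tr><td>Revenue</td><td><b>1,200</b></td></tr></table>"
)
print(table_to_markdown(sample))
```

Running the conversion before tags are stripped keeps row and column boundaries that a plain tag-stripping pass would flatten into a single run of text. Column spans and nested tables are ignored in this sketch.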
…/ixbrl-table-parser-md
@claude please review this PR
Claude finished @jfrench9's task in 3m 3s
PR Review: Refactor iXBRL parser for improved HTML handling and memory efficiency
Summary
This is a solid, well-motivated refactor. The core change, replacing catastrophic …
A few issues worth addressing before merge:
Issues
1. Missing test for new …
… review command
- Introduced a new method, _find_nested_blocks, to efficiently find nested HTML elements while avoiding memory issues.
- Updated the iXBRLParser to utilize _find_nested_blocks for processing ix:nonNumeric and ix:continuation elements.
- Enhanced the PR review command documentation to clarify valid JSON fields and improve command usage examples.
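To illustrate the idea behind string-find matching for nested elements, here is a rough sketch that tracks nesting depth with `str.find` instead of a regular expression. It is an assumption about the general technique, not the PR's `_find_nested_blocks`; the names `find_blocks` and `_find_open` are hypothetical, and self-closing tags and tags inside comments or CDATA are not handled.

```python
def _find_open(html: str, open_tok: str, start: int) -> int:
    """Find the next opening tag, skipping longer tag names that share the prefix."""
    pos = html.find(open_tok, start)
    while pos != -1:
        following = html[pos + len(open_tok): pos + len(open_tok) + 1]
        if following in (">", "/") or following.isspace():
            return pos
        pos = html.find(open_tok, pos + 1)
    return -1


def find_blocks(html: str, tag: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each outermost <tag ...>...</tag> block."""
    open_tok = "<" + tag
    close_tok = "</" + tag + ">"
    blocks = []
    pos = _find_open(html, open_tok, 0)
    while pos != -1:
        depth = 1
        cursor = pos + len(open_tok)
        while depth:
            next_open = _find_open(html, open_tok, cursor)
            next_close = html.find(close_tok, cursor)
            if next_close == -1:
                return blocks  # unbalanced markup: stop with what we have
            if next_open != -1 and next_open < next_close:
                depth += 1  # a nested element opens before the next close
                cursor = next_open + len(open_tok)
            else:
                depth -= 1  # the nearest close ends the innermost open element
                cursor = next_close + len(close_tok)
        blocks.append((pos, cursor))
        pos = _find_open(html, open_tok, cursor)
    return blocks


doc = (
    '<p><ix:nonNumeric name="a">x<ix:nonNumeric name="b">y</ix:nonNumeric>z'
    '</ix:nonNumeric> and <ix:nonNumeric name="c">w</ix:nonNumeric></p>'
)
for start, end in find_blocks(doc, "ix:nonNumeric"):
    print(doc[start:end])
```

Because the scan only moves forward through the string and returns offsets, it cannot backtrack the way a lazy `.*?` pattern with a closing-tag anchor can on a multi-megabyte filing, and callers can slice out only the blocks they need.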
Summary
Refactors the iXBRLParser in the SEC adapter to improve HTML table parsing, enhance memory efficiency, and provide more robust handling of iXBRL document structures. This is a focused refactor of a single module (ixbrl_parser.py) with significant internal improvements.
Key Accomplishments
Breaking Changes
… iXBRLParser class. However, consumers of the parser's output should verify that returned data structures remain consistent, particularly if downstream code depends on specific table parsing output formats.
Testing Notes
Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:
refactor/ixbrl-table-parser-md → main
Co-Authored-By: Claude <noreply@anthropic.com>