Refactor XBRL processing with semantic enrichment and taxonomy support #384

Merged
jfrench9 merged 6 commits into main from refactor/xbrl-graph-enrichment on Feb 21, 2026

@jfrench9
Member

Summary

This refactor significantly enhances the XBRL processing pipeline by introducing semantic enrichment capabilities, comprehensive taxonomy support, and improved data processing infrastructure. The changes establish a robust foundation for financial statement analysis with better element resolution and structure classification.

Key Accomplishments

Semantic Enrichment Infrastructure

  • New enrichment engine: Implemented fastembed-based semantic enrichment for XBRL elements, enabling intelligent matching and classification of financial concepts
  • Enhanced element resolution: Added sophisticated tools for resolving XBRL elements with semantic understanding and context awareness
  • Structure classification: Introduced automated classification of financial statement structures (balance sheet, income statement, cash flow)

Taxonomy Management System

  • Comprehensive taxonomy support: Built complete taxonomy modules for major financial statement types:
    • Balance Sheet concepts and structures
    • Income Statement line items and relationships
    • Cash Flow statement components
    • Core concept definitions and mappings
  • Standardized structures: Established consistent data structures for financial taxonomy elements with proper inheritance and validation

Enhanced XBRL Graph Processing

  • Improved period handling: Refactored temporal data processing with better date range management and period classification
  • Robust numeric processing: Enhanced value extraction and normalization for financial data points
  • Better error handling: Added comprehensive validation and error recovery mechanisms

MCP Tool Extensions

  • New resolution tools: Added specialized tools for element and structure resolution with semantic matching capabilities
  • Enhanced query interface: Improved example queries and facts retrieval with better context awareness
  • Extended tool management: Expanded tool registry with new semantic analysis capabilities

Infrastructure Improvements

  • Docker optimization: Updated containerization with extension versioning and improved build processes
  • Rate limiting: Implemented proper rate limiting for external API interactions
  • Connection pooling: Enhanced database connection management with better resource utilization
  • Schema evolution: Extended base schemas with custom field support and validation enhancements

Breaking Changes

  • Modified XBRL graph processor interface - existing integrations may need updates to accommodate new enrichment parameters
  • Updated schema structures for financial concepts - downstream consumers should verify compatibility
  • Changed connection pool configuration - deployment scripts may need adjustment for new pooling parameters

Testing Coverage

  • Comprehensive test suite for new taxonomy modules with concept validation and structure parsing tests
  • Semantic enrichment testing with mock embedding models and classification scenarios
  • MCP tool testing with resolution accuracy and performance benchmarks
  • Updated integration tests for enhanced XBRL processing pipeline
  • Schema validation tests for new custom field implementations

Infrastructure Considerations

  • New dependency on fastembed for semantic processing - ensure model artifacts are available in deployment environment
  • Increased memory requirements for embedding model loading and vector operations
  • Enhanced database connection pooling may require connection limit adjustments
  • Rate limiting implementation may affect throughput for high-volume processing scenarios
  • Docker image size increased due to ML model dependencies - consider image optimization for production deployments

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/xbrl-graph-enrichment
  • Target: main
  • Type: refactor

Co-Authored-By: Claude <noreply@anthropic.com>

…cessor

- Updated the numeric value calculation to store the actual reported value without decimal scaling, clarifying the role of the `decimals` attribute.
- Changed the handling of period types to introduce a distinct `duration_type` for better classification of time contexts, replacing the previous `period_type` values.
- Enhanced documentation to reflect these changes, ensuring clarity on the usage of `period_type` and `duration_type` across the codebase.
- Adjusted related queries and tests to align with the new structure, ensuring consistent behavior in data retrieval and processing.
- Added support for embedding-based canonical concept matching in XBRL processing.
- Introduced a new `SemanticEnricher` class to handle lazy loading of the fastembed model and enrich XBRL elements, labels, and structures.
- Updated the `XBRLGraphProcessor` to enable semantic enrichment, including logging for enrichment processes.
- Defined canonical concepts for financial statements in a new taxonomy structure, enhancing the semantic understanding of XBRL data.
- Updated configuration to enable semantic enrichment features and ensure compatibility with existing processing workflows.
…miting

- Updated the Dockerfile to pin the extension version for ABI compatibility, ensuring consistent paths for LadybugDB extensions.
- Added a new vector extension to the Dockerfile for enhanced functionality.
- Modified execute.py to conditionally apply rate limiting based on environment configuration, improving flexibility in development environments.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Feb 21, 2026

Claude finished @jfrench9's task in 2m 54s


PR Review: Refactor XBRL processing with semantic enrichment and taxonomy support

Tasks

  • Gather context and understand the PR
  • Read key changed files
  • Analyze semantic enrichment implementation
  • Analyze taxonomy modules
  • Analyze MCP tool extensions
  • Analyze infrastructure changes (Docker, connection pool)
  • Provide comprehensive review feedback

This is a significant, well-structured PR that adds meaningful capability to the XBRL processing pipeline. The lazy-loading pattern for the embedding model is correct, the taxonomy is cleanly separated, and the vector index creation post-materialization (not on empty tables) is the right approach. The PR is largely sound but has a few issues worth fixing before merge.
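For readers unfamiliar with the lazy-loading pattern mentioned above, here is a minimal sketch. `LazyEmbeddingModel`, `fake_loader`, and `load_calls` are hypothetical stand-ins, not the PR's `SemanticEnricher` or the fastembed API; the sketch only shows that the expensive load is deferred until first use and then cached.

```python
from typing import Any, Callable


class LazyEmbeddingModel:
    """Defers an expensive model load until the model is first needed."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        # Hypothetical: in a real pipeline this would wrap the model constructor.
        self._loader = loader
        self._model: Any = None

    @property
    def model(self) -> Any:
        # Load on first access only; later accesses reuse the cached object.
        if self._model is None:
            self._model = self._loader()
        return self._model


load_calls = []


def fake_loader() -> object:
    """Stand-in for a slow model load; records each invocation."""
    load_calls.append(1)
    return object()


lazy = LazyEmbeddingModel(fake_loader)
assert load_calls == []  # constructing the wrapper loads nothing
first = lazy.model
second = lazy.model
```

The benefit is that code paths that never touch embeddings pay no load cost.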

Bugs

1. Reversed relationship direction in resolve_element_tool.py:190 — This is the most impactful bug.

# Current (WRONG — direction is reversed):
label_query = (
    f"MATCH (e:Element)<-[:ELEMENT_HAS_LABEL]-(l:Label) "
    ...
)

# Should be (matches the relationship defined in xbrl_graph.py):
label_query = (
    f"MATCH (e:Element)-[:ELEMENT_HAS_LABEL]->(l:Label) "
    ...
)

The label fallback at line 230 already uses the correct direction ((e:Element)-[:ELEMENT_HAS_LABEL]->(node)), which confirms this is a typo. With the reversed direction, the primary label lookup in the fact count enrichment step returns zero results for all elements.


2. Fragile index tracking in enrich_dataframes() (xbrl_graph.py:316)

Using len(canonical_concepts) as a positional index into embeddings is implicit and fragile:

for i, row in self.elements_df.iterrows():
    concept_id, confidence = enricher.match_canonical(
        embeddings[len(canonical_concepts)],  # Implicit positional index
        ...
    )
    canonical_concepts.append(concept_id)

If the DataFrame has non-contiguous integer indices (e.g., after a slice/filter), i wouldn't equal position. This pattern works today because iterrows() iterates in order, but the same pattern appears twice and is easy to misread. Standard Python would use enumerate().
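A minimal sketch of the enumerate() alternative, under stated assumptions: `match_rows` and `toy_match` are hypothetical names, and the `(label, row)` pairs are a plain-Python imitation of what `iterrows()` yields, used here to avoid a pandas dependency. The labels are deliberately non-contiguous to show that the position comes from `enumerate()`, not from the index label.

```python
from typing import Callable, Iterable, Optional, Sequence

Vector = Sequence[float]
Match = tuple[Optional[str], float]


def match_rows(
    labeled_rows: Iterable[tuple[object, dict]],
    embeddings: Sequence[Vector],
    match: Callable[[Vector], Match],
) -> list[Match]:
    """Pair each row with its embedding by explicit position."""
    results = []
    for pos, (_label, _row) in enumerate(labeled_rows):
        # pos is the 0-based position, independent of the index label.
        results.append(match(embeddings[pos]))
    return results


# Labels 10, 42, 7 mimic a filtered DataFrame with a non-contiguous index.
rows = [(10, {"name": "Revenues"}), (42, {"name": "Assets"}), (7, {"name": "Cash"})]
vectors = [[0.9], [0.6], [0.3]]


def toy_match(vec: Vector) -> Match:
    """Toy matcher: treats the first component as the confidence score."""
    return ("concept", vec[0]) if vec[0] >= 0.8 else (None, vec[0])


matched = match_rows(rows, vectors, toy_match)
```

With a real DataFrame the loop header would presumably read `for pos, (_, row) in enumerate(self.elements_df.iterrows()):`, which removes the need to derive the position from the length of an output list.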


Design Issues

3. Thread-safety gap in taxonomy singletons — taxonomy/__init__.py:23

_element_taxonomy: list[CanonicalConcept] | None = None
_structure_taxonomy: list[CanonicalConcept] | None = None

If two worker threads call get_element_taxonomy() simultaneously before the cache is populated, both will compute embeddings (a ~1s operation each). CPython's GIL prevents data corruption, but the double computation wastes memory. A threading.Lock() would prevent this, or at minimum a comment explaining the accepted tradeoff.
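One way to add the lock is double-checked initialization, sketched below. The builder function, its return value, and the `build_calls` counter are hypothetical; the real taxonomy builder computes embeddings and is not shown in this review.

```python
import threading
from typing import Optional

_element_taxonomy: Optional[list] = None
_taxonomy_lock = threading.Lock()
build_calls = 0  # instrumentation for this sketch only


def _build_element_taxonomy() -> list:
    """Hypothetical stand-in for the expensive embedding computation."""
    global build_calls
    build_calls += 1
    return ["Revenue", "NetIncome"]


def get_element_taxonomy() -> list:
    global _element_taxonomy
    if _element_taxonomy is None:  # fast path: no locking once populated
        with _taxonomy_lock:
            # Re-check inside the lock: another thread may have built it
            # while this one was waiting.
            if _element_taxonomy is None:
                _element_taxonomy = _build_element_taxonomy()
    return _element_taxonomy


threads = [threading.Thread(target=get_element_taxonomy) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second `is None` check is what guarantees a single build; the outer check only avoids lock contention after the cache is warm.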

4. Silent over-broad exception catch — manager.py:115

except (AttributeError, Exception):
    pass

AttributeError is already a subclass of Exception, making the tuple redundant. More importantly, catching all exceptions silently will swallow real errors (import failures, config issues). This should be narrowed to expected exceptions (e.g., KeyError, AttributeError) with a debug log.

5. locals() usage in make_period() (xbrl_graph.py:1579)

if "period_data" in locals() and period_data:
    period_identifier = period_data["identifier"]

Checking locals() for variable existence is non-idiomatic and fragile. Initializing period_data = None before the if/elif chain and then checking if period_data is not None is the standard Python approach.
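A minimal sketch of that approach. The function name, the `kind` parameter, and the identifier strings are made up for illustration; only the initialize-then-check shape is taken from the suggestion above.

```python
from typing import Optional


def period_identifier(kind: str) -> Optional[str]:
    """Return a period identifier, or None when no branch applies."""
    period_data = None  # always bound, so no locals() check is needed
    if kind == "instant":
        period_data = {"identifier": "instant-2025-12-31"}  # made-up value
    elif kind == "duration":
        period_data = {"identifier": "duration-2025"}  # made-up value
    if period_data is not None:
        return period_data["identifier"]
    return None
```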

6. XBRL_SEMANTIC_ENRICHMENT = True is not runtime-configurable — config.py

The other feature flags (XBRL_STANDARDIZED_FILENAMES, etc.) follow a pattern of being set to False for toggling. This flag is hardcoded True in a module-level constant — it cannot be disabled without a code change. Consider deriving it from env.* or at minimum documenting it's intentionally always-on.
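If the flag were made togglable, one option is to parse it from the environment with the current value as the default. This sketch reads `os.environ` directly, which is an assumption; the project's own `env.*` helper is not shown in this review and may work differently. `env_flag` and the accepted truthy strings are hypothetical.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(
    name: str,
    default: bool,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    """Read a boolean flag, falling back to `default` when it is unset."""
    source = os.environ if environ is None else environ
    raw = source.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


# Default stays True, matching the current hardcoded behaviour.
XBRL_SEMANTIC_ENRICHMENT = env_flag("XBRL_SEMANTIC_ENRICHMENT", default=True)
```

Passing `environ` explicitly keeps the helper testable without mutating the process environment.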


Schema Breaking Change

7. period_type column split (base.py)

The schema change splits the old period_type column into two new columns (period_type and duration_type). This is a breaking schema change — any existing SEC databases will be missing the duration_type column and have the old semantics in period_type. The PR description mentions this correctly under Breaking Changes, but it should be paired with a migration strategy or documented as a reset-required upgrade.


Threshold Inconsistency (Minor)

match_canonical() uses a threshold of 0.80 (ingestion-time, line 312), while match_canonical_from_query() uses 0.50 (query-time, line 350). The difference is reasonable (query-time should be more permissive), but it's not documented anywhere. A comment explaining the design intent would help future maintainers.
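One way to document the intent is a pair of named constants with comments. The constant names, `accept_match`, and the stated rationale are assumptions for illustration; only the 0.80 and 0.50 values come from the code under review.

```python
from typing import Optional

# Ingestion-time matches are stored with the data, so this sketch assumes
# a stricter bar is wanted there.
INGESTION_MATCH_THRESHOLD = 0.80
# Query-time matches only resolve a user's phrasing, so a more permissive
# bar is assumed to be acceptable.
QUERY_MATCH_THRESHOLD = 0.50


def accept_match(concept_id: str, confidence: float, threshold: float) -> Optional[str]:
    """Return the concept id only when confidence reaches the threshold."""
    return concept_id if confidence >= threshold else None
```

For example, a candidate scored 0.65 is rejected at ingestion time but accepted at query time.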


Test Coverage Observations

  • Weak assertion in test_match_canonical_below_threshold (test_enrichment.py:93): The test only asserts confidence < 0.85 if something matched — it doesn't assert that concept_id is None. A cooking recipe may still fuzzy-match a financial concept at low confidence. The test intent is "should not match", so it should assert concept_id is None.

  • The MCP tool tests in test_resolve_element.py are well-structured using the _query_router pattern. The @pytest.mark.slow markers on model-loading tests are appropriate.

  • The test_match_canonical_below_threshold test was the only semantic test without the slow marker that actually exercises enrichment — it doesn't load the model but tests the threshold logic correctly via mocking. Good separation.
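The difference between the two assertions can be shown with a deliberately broken matcher. `buggy_match`, `weak_check`, and `strict_check` are hypothetical, and 0.42 is a made-up score; the 0.85 bound is the one quoted from the test.

```python
from typing import Callable, Optional

Match = tuple[Optional[str], float]


def buggy_match(_embedding: object) -> Match:
    """A broken matcher that ignores its threshold and always returns a concept."""
    return "Revenue", 0.42


def weak_check(match: Callable[[object], Match]) -> bool:
    """Mirrors the current test: only bounds confidence when something matched."""
    concept_id, confidence = match(None)
    if concept_id is not None:
        return confidence < 0.85
    return True


def strict_check(match: Callable[[object], Match]) -> bool:
    """Mirrors the suggested test: nothing may match below the threshold."""
    concept_id, _confidence = match(None)
    return concept_id is None
```

The weak check accepts the broken matcher, while the strict check rejects it, which is the behaviour the test name promises.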


Infrastructure

  • Dockerfile simplification (single LADYBUG_EXT_VERSION ARG replacing two) is a clear improvement and eliminates the potential ABI mismatch that the old dev + internal version split could cause.
  • Vector extension pre-installation in Docker and the INSTALL/LOAD fallback in pool connection setup is defensive and correct.
  • create_vector_index() correctly uses a no-op when table_name isn't in the index map, and handles the "already exists" error gracefully.

Summary: The core semantic enrichment architecture is solid and the taxonomy design is clean. The relationship direction bug in resolve_element_tool.py needs fixing before merge as it silently breaks label enrichment. The other items are lower priority but the locals() usage and silent exception swallowing should be addressed. The schema split is a known breaking change and is acceptable given the PR scope.

Comment thread robosystems/middleware/mcp/tools/manager.py Fixed
- Simplified the condition for accessing period data in the XBRLGraphProcessor by removing unnecessary local variable checks.
- Improved error handling in GraphMCPTools by adding detailed logging for exceptions during semantic enrichment checks, aiding in debugging and monitoring.
- Corrected the Cypher query syntax in resolve_element_tool.py to ensure proper relationship traversal in graph queries.
- Added new fields `canonical_concept`, `canonical_confidence`, and `embedding` to the accounting data generator for improved data structure.
- Updated `period_type` to `duration` and introduced `duration_type` for better classification of time contexts in accounting data.
- Changed logging level from `info` to `debug` in various modules to reduce verbosity and improve log clarity during debugging processes.
@jfrench9
Member Author

@claude have we fixed the issue? we relaxed some logging to debug to make it easier to read the logs. but we fixed a bunch of issues too

@claude

claude Bot commented Feb 21, 2026

Claude finished @jfrench9's task in 1m 56s


PR Fix Status Review

Here's the current status of each issue from the previous review:

Fixed ✅

Bug #1 — Reversed relationship direction (resolve_element_tool.py:185)
Fixed. The primary label query now correctly reads:

MATCH (e:Element)-[:ELEMENT_HAS_LABEL]->(l:Label)

This was the most impactful bug; it's gone.

Design #4 — Silent over-broad exception catch (manager.py:109-118)
Fixed and improved. The handler is now:

except Exception as exc:
    graph_id = getattr(self.client, "graph_id", "unknown")
    logger.debug(f"Semantic enrichment check failed for {graph_id}: {exc}")
return False

The (AttributeError, Exception) redundancy is removed, the exception is captured and logged at debug level (rather than silently swallowed), and the graph_id is surfaced for traceability. Well done.

Design #5: locals() usage (xbrl_graph.py:1376-1583)
Fixed. period_data = None is now initialized at line 1379 before the if/elif chain, and the check at line 1579 is if period_data: — idiomatic Python.


Not Yet Fixed ❌

Bug #2 — Fragile index tracking (xbrl_graph.py:316-318)
Still present. The pattern embeddings[len(canonical_concepts)] is still used:

for i, row in self.elements_df.iterrows():
    concept_id, confidence = enricher.match_canonical(
        embeddings[len(canonical_concepts)],  # i is unused; implicit positional index
        ...
    )
    canonical_concepts.append(concept_id)

The variable i from iterrows() is assigned but unused. This works because iterrows() always produces in-order rows, but it is misleading: the reader has to mentally verify the invariant. enumerate() makes it explicit. Low risk, but still worth a cleanup.

Design #3 — Thread-safety gap in taxonomy singletons (taxonomy/__init__.py:23-24)
Still not addressed. The module-level globals _element_taxonomy and _structure_taxonomy are still initialized without a threading.Lock. In a multi-threaded WSGI or Dagster context, two concurrent callers could both find _element_taxonomy is None and independently trigger the ~1s embedding computation. CPython's GIL prevents data corruption, but the wasted work is real. The docstring improvement is appreciated, but the actual protection is still missing.

Test #7 — Weak assertion in test_match_canonical_below_threshold (test_enrichment.py:93)
Still not fixed. The test reads:

concept_id, confidence = enricher.match_canonical(...)
if concept_id is not None:
    assert confidence < 0.85

The intent is "cooking recipes should not match a financial concept," but the test passes even if concept_id is returned, as long as confidence < 0.85. The assertion should be assert concept_id is None.


Partially Addressed ⚠️

Design #6: XBRL_SEMANTIC_ENRICHMENT not runtime-configurable (config.py:77)
The flag is still hardcoded True. However, the module now has a clear docstring: "These are processing-specific constants for the SEC/XBRL pipeline. They are not runtime-configurable - change them here if needed." That's acceptable documentation of an intentional design decision. The review concern is considered addressed unless you want it to be togglable via env.*.

Minor — Threshold docstring inconsistency (enrichment.py:266)
The docstring for match_canonical() still says "Threshold: minimum 0.70 confidence" but the code uses 0.80. The 0.80 (ingestion) vs 0.50 (query-time) split is also still undocumented. Not a bug, but the stale comment is misleading.


Summary

| Issue | Status |
| --- | --- |
| Bug #1: Reversed relationship direction | ✅ Fixed |
| Bug #2: Fragile index tracking | ❌ Not fixed |
| Design #3: Thread-safety in taxonomy singletons | ❌ Not fixed |
| Design #4: Silent exception swallowing | ✅ Fixed (logging relaxed to debug as expected) |
| Design #5: locals() usage | ✅ Fixed |
| Design #6: XBRL_SEMANTIC_ENRICHMENT configurability | ⚠️ Documented as intentional |
| Test #7: Weak assertion | ❌ Not fixed |
| Minor: Threshold docstring discrepancy | ❌ Not fixed |

The two critical items from the previous review are both resolved (Bug #1 and the silent exception catch). The remaining open items are lower priority — Bug #2 and the test assertion are the most worthwhile to address before merge.

- Updated the `_write_parquet` method in `03_generate_data.py` for both accounting and element mapping demos to override type inference for columns with all-None values, ensuring correct data types are used when writing to Parquet files.
- Improved compatibility with DuckDB and LadybugDB by explicitly defining the schema for the `embedding` column, preventing type casting issues during data processing.
- Adjusted the default credentials file path in `05_query_graph.py` to reflect a more standardized directory structure for better accessibility across demo examples.
- Refactored the `create_vector_index` method in `manager.py` to allow for dynamic column specification, enhancing flexibility for different table schemas.
jfrench9 merged commit fed593c into main on Feb 21, 2026
7 checks passed
jfrench9 deleted the refactor/xbrl-graph-enrichment branch on February 21, 2026 at 18:57