Refactor SEC indexing with quarterly partitioning and capacity constraints by jfrench9 · Pull Request #520 · RoboFinSystems/robosystems

jfrench9 · 2026-03-21T23:21:24Z

Summary

Refactors the SEC indexing pipeline to introduce quarterly partitioning and address OpenSearch capacity constraints. This change enhances the indexing configurations, jobs, sensors, and core text indexing logic to support a more scalable and resource-aware approach to SEC document ingestion.

Key Accomplishments

Quarterly Partitioning: Introduced quarterly partitioning for SEC indexing operations, enabling more granular control over data processing windows and reducing the volume of data handled in each indexing pass. This directly addresses OpenSearch capacity limitations by breaking large indexing workloads into manageable quarterly chunks.
Enhanced Configuration Management: Expanded and restructured indexing configurations (configs.py) to support the new partitioning scheme, providing clearer and more flexible parameterization for pipeline runs.
Job and Sensor Updates: Updated pipeline jobs and sensors to be partition-aware, ensuring that scheduling and triggering logic correctly aligns with the quarterly data boundaries.
Text Index Refactoring: Significantly refactored the core text indexing module (text_index.py) with ~340 lines added and ~196 removed, improving code structure, readability, and robustness while integrating the quarterly partitioning logic throughout the indexing workflow.

Breaking Changes

Pipeline configuration schema changes: Existing job configurations and any external references to SEC indexing job parameters may need to be updated to account for the new quarterly partitioning fields.
Sensor behavior changes: Sensors now operate on a quarterly partition basis, which may affect existing schedules and downstream dependencies that relied on the previous triggering cadence.

Testing Notes

Verify that quarterly partitions are correctly computed and that indexing jobs process only the expected date ranges for each quarter.
Validate that sensors correctly detect and trigger runs for each quarterly partition boundary.
Test edge cases around quarter boundaries (e.g., filings at the start/end of quarters) to ensure no data is missed or duplicated.
Confirm that OpenSearch cluster resource utilization remains within acceptable thresholds under the new partitioned workload pattern.

Infrastructure Considerations

OpenSearch clusters should be monitored during initial rollout to validate that the quarterly partitioning effectively reduces peak indexing load and stays within capacity constraints.
Existing pipeline schedules and materializations may need to be reset or backfilled to align with the new quarterly partitioning scheme.
Consider a phased rollout, processing recent quarters first, to validate behavior before triggering historical backfills.

🤖 Generated with Claude Code

Branch Info:

Source: refactor/opensearch-capacity-constraints
Target: main
Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Updated SECTextBlockIndexConfig, SECNarrativeIndexConfig, and SECiXBRLIndexConfig to include detailed descriptions and partitioning by quarter. - Modified job definitions for sec_textblocks_index, sec_narratives_index, and sec_ixbrl_index to reflect quarterly partitioning in descriptions and added partition definitions. - Improved sec_post_stage_index_sensor to derive the current quarter for indexing jobs. - Adjusted sec_textblocks_indexed function to process parquet files based on the current quarter, enhancing memory management and backfill capabilities.

jfrench9 · 2026-03-21T23:21:25Z

@claude please review this PR

claude · 2026-03-21T23:21:48Z

Claude finished @jfrench9's task in 2m 4s —— View job

PR Review: Refactor SEC indexing with quarterly partitioning and capacity constraints

Tasks:

Gather context and understand the request
Read changed files
Analyze configs.py changes
Analyze jobs.py changes
Analyze sensors.py changes
Analyze text_index.py changes (core refactor)
Post comprehensive review

Overview

This is a solid, well-motivated refactor. The quarterly partitioning strategy is the right architectural direction for addressing OpenSearch capacity constraints, and the streaming approach in sec_textblocks_indexed (Phase 1/2/3 structure) is a significant improvement over the previous load-everything-then-filter pattern. A few issues are worth addressing before merge.

Issues

1. `_partition_year` is now dead code — `text_index.py:100`

The function at line 100 is no longer called anywhere after this refactor. The new code uses partition_key.split("-Q")[0] inline. Should be removed.

Fix this →

2. Duplicate import from same module — `text_index.py:32,42`

# Line 32 — absolute import
from robosystems.adapters.sec.pipeline.configs import sec_quarter_partitions

# Line 42 — relative import from same module
from .configs import (
  SECiXBRLIndexConfig,
  SECNarrativeIndexConfig,
  SECTextBlockIndexConfig,
)

These should be consolidated into one import statement. Fix this →

3. `iterrows()` on a 1M-row Element table is very slow — `text_index.py:474`

The streaming improvements are great, but the Element table still has ~1M rows and is iterated with iterrows(), which is notoriously slow (~100x slower than vectorized operations). Consider to_dict("records") or vectorized pandas here:

# Instead of:
for _, row in elements_df.iterrows():
    eid = row.get("identifier")
    ...

# Prefer:
element_lookup = elements_df.set_index("identifier")[["qname", "name", "is_textblock"]].to_dict("index")
textblock_element_ids = set(elements_df.loc[elements_df["is_textblock"] == True, "identifier"])

Same pattern applies to the relationship streaming loops (lines ~519, ~555, ~581).

4. Sensor partition key mismatch risk — `sensors.py:693`

partition_key = run_tags.get("dagster/partition")
if not partition_key:
    now = datetime.now(timezone.utc)
    quarter = (now.month - 1) // 3 + 1
    partition_key = f"{now.year}-Q{quarter}"

The upstream staging job (triggering this sensor) doesn't appear to be partitioned itself — it checks mode=incremental. This means run_tags.get("dagster/partition") will almost always be None, falling through to the current-date fallback. This is fine for incremental runs (always index current quarter), but it means there's no way to trigger a specific historical quarter via the sensor chain. If that's intentional, it's worth documenting. If not, consider whether the staging run should propagate a target quarter.

5. Raw ZIP scan scope for narratives/iXBRL only coarsens to year — `text_index.py:842,1132`

partition_year = partition_key.split("-Q")[0]  # "2026-Q1" → "2026"
raw_prefix = f"{get_raw_key(DataSourceType.SEC)}/year={partition_year}"

This correctly narrows the S3 scan to a single year, but then relies on indexed_accessions to skip already-processed filings. This means that for Q1 2026, you'll still scan and open every ZIP for 2026 even if Q2/Q3/Q4 ZIPs exist in the same year=2026 prefix. This is expected behavior since raw ZIPs are annual (not quarterly), but it means the skip-check via indexed_accessions is load-bearing for correctness across quarter reruns. Worth a comment making this explicit.

6. `import` inside function body — `sensors.py:695`

if not partition_key:
    from datetime import datetime, timezone  # ← inside function

This works fine but is inconsistent style. The imports should be at the top of the file.

Positives

Streaming architecture in sec_textblocks_indexed is a significant improvement. The 3-phase structure (load lookups → stream relationships with filtering → build/index documents) is well-structured and the comments explaining memory bounds are valuable.
Column projection in _read_parquets_from_s3 avoids loading embeddings and other large columns — good use of PyArrow's columnar format.
Error tracking fix: Old code had "errors": errors + result["errors"] which double-counted. Fixed correctly.
documents.clear() vs documents = []: More memory-efficient since it reuses the existing list object. Good catch.
del statements for early GC on large lookup tables are good practice here.
ECS resource reduction: The CPU halving (1024→512) and memory reductions for narratives/iXBRL (4096→2048) are well-justified by the streaming approach. The inline comments explaining the memory bound rationale are helpful.
run_key now includes partition_key: f"sec-{job_name}-chain-{partition_key}-{dagster_run.run_id[:8]}" correctly prevents duplicate runs per partition.
BackfillPolicy.single_run() is appropriate for batch assets that process a full partition in one shot.

Minor

The PR description says "~340 lines added and ~196 removed" but the diff shows +285/-166 for text_index.py specifically (+336/-196 total). Minor discrepancy, not a blocker.

Verdict: Approve with suggested fixes — the dead code removal (#1), import consolidation (#2), and the iterrows() performance issue (#3) are worth addressing. Issues #4–6 are lower priority but worth understanding before merging.

- Replaced timezone handling in sec_post_stage_index_sensor to use UTC directly for consistency. - Streamlined data processing in sec_textblocks_indexed by consolidating DataFrame operations and reducing iteration overhead. - Removed the unused _partition_year function and its associated tests to clean up the codebase. - Enhanced readability by formatting DataFrame operations and improving logging messages for better traceability.

jfrench9 merged commit 01be2a4 into main Mar 21, 2026
7 checks passed

jfrench9 deleted the refactor/opensearch-capacity-constraints branch March 21, 2026 23:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor SEC indexing with quarterly partitioning and capacity constraints#520

Refactor SEC indexing with quarterly partitioning and capacity constraints#520
jfrench9 merged 2 commits into
mainfrom
refactor/opensearch-capacity-constraints

jfrench9 commented Mar 21, 2026

Uh oh!

jfrench9 commented Mar 21, 2026

Uh oh!

claude Bot commented Mar 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Mar 21, 2026

Summary

Key Accomplishments

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Mar 21, 2026

Uh oh!

claude Bot commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Refactor SEC indexing with quarterly partitioning and capacity constraints

Overview

Issues

1. _partition_year is now dead code — text_index.py:100

2. Duplicate import from same module — text_index.py:32,42

3. iterrows() on a 1M-row Element table is very slow — text_index.py:474

4. Sensor partition key mismatch risk — sensors.py:693

5. Raw ZIP scan scope for narratives/iXBRL only coarsens to year — text_index.py:842,1132

6. import inside function body — sensors.py:695

Positives

Minor

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Mar 21, 2026 •

edited

Loading

1. `_partition_year` is now dead code — `text_index.py:100`

2. Duplicate import from same module — `text_index.py:32,42`

3. `iterrows()` on a 1M-row Element table is very slow — `text_index.py:474`

4. Sensor partition key mismatch risk — `sensors.py:693`

5. Raw ZIP scan scope for narratives/iXBRL only coarsens to year — `text_index.py:842,1132`

6. `import` inside function body — `sensors.py:695`