Skip to content

Refactor SEC indexing with quarterly partitioning and capacity constraints#520

Merged
jfrench9 merged 2 commits into
mainfrom
refactor/opensearch-capacity-constraints
Mar 21, 2026
Merged

Refactor SEC indexing with quarterly partitioning and capacity constraints#520
jfrench9 merged 2 commits into
mainfrom
refactor/opensearch-capacity-constraints

Conversation

@jfrench9
Copy link
Copy Markdown
Member

Summary

Refactors the SEC indexing pipeline to introduce quarterly partitioning and address OpenSearch capacity constraints. This change enhances the indexing configurations, jobs, sensors, and core text indexing logic to support a more scalable and resource-aware approach to SEC document ingestion.

Key Accomplishments

  • Quarterly Partitioning: Introduced quarterly partitioning for SEC indexing operations, enabling more granular control over data processing windows and reducing the volume of data handled in each indexing pass. This directly addresses OpenSearch capacity limitations by breaking large indexing workloads into manageable quarterly chunks.

  • Enhanced Configuration Management: Expanded and restructured indexing configurations (configs.py) to support the new partitioning scheme, providing clearer and more flexible parameterization for pipeline runs.

  • Job and Sensor Updates: Updated pipeline jobs and sensors to be partition-aware, ensuring that scheduling and triggering logic correctly aligns with the quarterly data boundaries.

  • Text Index Refactoring: Significantly refactored the core text indexing module (text_index.py) with ~340 lines added and ~196 removed, improving code structure, readability, and robustness while integrating the quarterly partitioning logic throughout the indexing workflow.

Breaking Changes

  • Pipeline configuration schema changes: Existing job configurations and any external references to SEC indexing job parameters may need to be updated to account for the new quarterly partitioning fields.
  • Sensor behavior changes: Sensors now operate on a quarterly partition basis, which may affect existing schedules and downstream dependencies that relied on the previous triggering cadence.

Testing Notes

  • Verify that quarterly partitions are correctly computed and that indexing jobs process only the expected date ranges for each quarter.
  • Validate that sensors correctly detect and trigger runs for each quarterly partition boundary.
  • Test edge cases around quarter boundaries (e.g., filings at the start/end of quarters) to ensure no data is missed or duplicated.
  • Confirm that OpenSearch cluster resource utilization remains within acceptable thresholds under the new partitioned workload pattern.

Infrastructure Considerations

  • OpenSearch clusters should be monitored during initial rollout to validate that the quarterly partitioning effectively reduces peak indexing load and stays within capacity constraints.
  • Existing pipeline schedules and materializations may need to be reset or backfilled to align with the new quarterly partitioning scheme.
  • Consider a phased rollout, processing recent quarters first, to validate behavior before triggering historical backfills.

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/opensearch-capacity-constraints
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Updated SECTextBlockIndexConfig, SECNarrativeIndexConfig, and SECiXBRLIndexConfig to include detailed descriptions and partitioning by quarter.
- Modified job definitions for sec_textblocks_index, sec_narratives_index, and sec_ixbrl_index to reflect quarterly partitioning in descriptions and added partition definitions.
- Improved sec_post_stage_index_sensor to derive the current quarter for indexing jobs.
- Adjusted sec_textblocks_indexed function to process parquet files based on the current quarter, enhancing memory management and backfill capabilities.
@jfrench9
Copy link
Copy Markdown
Member Author

@claude please review this PR

@claude
Copy link
Copy Markdown

claude Bot commented Mar 21, 2026

Claude finished @jfrench9's task in 2m 4s —— View job


PR Review: Refactor SEC indexing with quarterly partitioning and capacity constraints

Tasks:

  • Gather context and understand the request
  • Read changed files
  • Analyze configs.py changes
  • Analyze jobs.py changes
  • Analyze sensors.py changes
  • Analyze text_index.py changes (core refactor)
  • Post comprehensive review

Overview

This is a solid, well-motivated refactor. The quarterly partitioning strategy is the right architectural direction for addressing OpenSearch capacity constraints, and the streaming approach in sec_textblocks_indexed (Phase 1/2/3 structure) is a significant improvement over the previous load-everything-then-filter pattern. A few issues are worth addressing before merge.


Issues

1. _partition_year is now dead code — text_index.py:100

The function at line 100 is no longer called anywhere after this refactor. The new code uses partition_key.split("-Q")[0] inline. Should be removed.

Fix this →

2. Duplicate import from same module — text_index.py:32,42

# Line 32 — absolute import
from robosystems.adapters.sec.pipeline.configs import sec_quarter_partitions

# Line 42 — relative import from same module
from .configs import (
  SECiXBRLIndexConfig,
  SECNarrativeIndexConfig,
  SECTextBlockIndexConfig,
)

These should be consolidated into one import statement. Fix this →

3. iterrows() on a 1M-row Element table is very slow — text_index.py:474

The streaming improvements are great, but the Element table still has ~1M rows and is iterated with iterrows(), which is notoriously slow (~100x slower than vectorized operations). Consider to_dict("records") or vectorized pandas here:

# Instead of:
for _, row in elements_df.iterrows():
    eid = row.get("identifier")
    ...

# Prefer:
element_lookup = elements_df.set_index("identifier")[["qname", "name", "is_textblock"]].to_dict("index")
textblock_element_ids = set(elements_df.loc[elements_df["is_textblock"] == True, "identifier"])

Same pattern applies to the relationship streaming loops (lines ~519, ~555, ~581).

4. Sensor partition key mismatch risk — sensors.py:693

partition_key = run_tags.get("dagster/partition")
if not partition_key:
    now = datetime.now(timezone.utc)
    quarter = (now.month - 1) // 3 + 1
    partition_key = f"{now.year}-Q{quarter}"

The upstream staging job (triggering this sensor) doesn't appear to be partitioned itself — it checks mode=incremental. This means run_tags.get("dagster/partition") will almost always be None, falling through to the current-date fallback. This is fine for incremental runs (always index current quarter), but it means there's no way to trigger a specific historical quarter via the sensor chain. If that's intentional, it's worth documenting. If not, consider whether the staging run should propagate a target quarter.

5. Raw ZIP scan scope for narratives/iXBRL only coarsens to year — text_index.py:842,1132

partition_year = partition_key.split("-Q")[0]  # "2026-Q1" → "2026"
raw_prefix = f"{get_raw_key(DataSourceType.SEC)}/year={partition_year}"

This correctly narrows the S3 scan to a single year, but then relies on indexed_accessions to skip already-processed filings. This means that for Q1 2026, you'll still scan and open every ZIP for 2026 even if Q2/Q3/Q4 ZIPs exist in the same year=2026 prefix. This is expected behavior since raw ZIPs are annual (not quarterly), but it means the skip-check via indexed_accessions is load-bearing for correctness across quarter reruns. Worth a comment making this explicit.

6. import inside function body — sensors.py:695

if not partition_key:
    from datetime import datetime, timezone  # ← inside function

This works fine but is inconsistent style. The imports should be at the top of the file.


Positives

  • Streaming architecture in sec_textblocks_indexed is a significant improvement. The 3-phase structure (load lookups → stream relationships with filtering → build/index documents) is well-structured and the comments explaining memory bounds are valuable.
  • Column projection in _read_parquets_from_s3 avoids loading embeddings and other large columns — good use of PyArrow's columnar format.
  • Error tracking fix: Old code had "errors": errors + result["errors"] which double-counted. Fixed correctly.
  • documents.clear() vs documents = []: More memory-efficient since it reuses the existing list object. Good catch.
  • del statements for early GC on large lookup tables are good practice here.
  • ECS resource reduction: The CPU halving (1024→512) and memory reductions for narratives/iXBRL (4096→2048) are well-justified by the streaming approach. The inline comments explaining the memory bound rationale are helpful.
  • run_key now includes partition_key: f"sec-{job_name}-chain-{partition_key}-{dagster_run.run_id[:8]}" correctly prevents duplicate runs per partition.
  • BackfillPolicy.single_run() is appropriate for batch assets that process a full partition in one shot.

Minor

  • The PR description says "~340 lines added and ~196 removed" but the diff shows +285/-166 for text_index.py specifically (+336/-196 total). Minor discrepancy, not a blocker.

Verdict: Approve with suggested fixes — the dead code removal (#1), import consolidation (#2), and the iterrows() performance issue (#3) are worth addressing. Issues #4–6 are lower priority but worth understanding before merging.

- Replaced timezone handling in sec_post_stage_index_sensor to use UTC directly for consistency.
- Streamlined data processing in sec_textblocks_indexed by consolidating DataFrame operations and reducing iteration overhead.
- Removed the unused _partition_year function and its associated tests to clean up the codebase.
- Enhanced readability by formatting DataFrame operations and improving logging messages for better traceability.
@jfrench9 jfrench9 merged commit 01be2a4 into main Mar 21, 2026
7 checks passed
@jfrench9 jfrench9 deleted the refactor/opensearch-capacity-constraints branch March 21, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant