
Refactor SEC pipeline to implement quarterly data partitions #198

Merged
jfrench9 merged 3 commits into main from refactor/sec-pipeline-partitions
Jan 17, 2026


@jfrench9
Member

Summary

This PR refactors the SEC data processing pipeline to implement quarterly partitions, improving data organization, processing efficiency, and maintainability of the ETL workflow.

Key Changes

  • Enhanced SEC Client: Extended the EFTS client with new functionality to support quarterly data partitioning
  • Improved Asset Management: Updated Dagster assets to leverage quarterly partitions for better data organization and incremental processing
  • Job Configuration Updates: Modified SEC job definitions to work with the new partitioned architecture
  • Pipeline Script Enhancements: Refactored the SEC pipeline script to support quarterly processing workflows
  • Comprehensive Testing: Added extensive test coverage for the new EFTS client functionality

Key Accomplishments

  • Implemented quarterly partitioning scheme for SEC data processing
  • Enhanced data processing efficiency through better segmentation
  • Improved pipeline maintainability and scalability
  • Added robust test coverage for new functionality
  • Maintained backward compatibility with existing workflows

Breaking Changes

None. This refactor maintains compatibility with existing interfaces while adding new partitioning capabilities.

Testing Notes

  • Added 105 new lines of test coverage for EFTS client functionality
  • All existing tests continue to pass
  • New quarterly partition logic has been thoroughly tested
  • Pipeline processing has been validated with the new partitioning scheme

Infrastructure Considerations

  • Quarterly partitions will improve query performance and data management
  • Processing workflows can now be executed incrementally by quarter
  • Enhanced monitoring and debugging capabilities through better data segmentation
  • Reduced memory footprint for individual processing jobs

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/sec-pipeline-partitions
  • Target: main
  • Type: refactor

Co-Authored-By: Claude <noreply@anthropic.com>

- Updated the EFTSClient to include a new method, `query_by_quarter`, allowing queries for specific quarters while adhering to the 10k result limit.
- Refactored existing SEC data processing jobs and scripts to utilize quarterly partitions instead of yearly, improving efficiency and compliance with API constraints.
- Adjusted related documentation and tests to reflect the new quarterly processing logic.
- Updated dependencies for boto3 and botocore to their latest versions.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Jan 17, 2026

Claude finished @jfrench9's task in 1m 31s


PR Review: SEC Pipeline Quarterly Partitions

Tasks

  • Read repository guidelines (CLAUDE.md)
  • Review EFTS client changes (efts.py)
  • Review Dagster asset changes (sec.py)
  • Review Dagster job configuration (jobs/sec.py)
  • Review pipeline script changes (sec_pipeline.py)
  • Review test coverage (test_efts.py)
  • Compile comprehensive review feedback

Review Summary

This PR successfully implements quarterly partitioning for SEC data processing. The changes are well-architected, thoroughly tested, and maintain backward compatibility. The code quality is high with clear documentation and appropriate error handling.

✅ Strengths

  1. Well-motivated architecture: Quarterly partitioning addresses the EFTS 10k result limit elegantly (robosystems/adapters/sec/client/efts.py:305-344)
  2. Comprehensive test coverage: 105 new lines covering all quarterly logic, edge cases, and error conditions (tests/adapters/sec/client/test_efts.py)
  3. Clear documentation: Excellent inline comments explaining the rationale for quarterly partitions and EFTS limits
  4. Backward compatibility: Year-based methods remain available with appropriate warnings (robosystems/adapters/sec/client/efts.py:278-303)
  5. Consistent naming: Partition key format follows established patterns across the codebase
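
To make the first strength concrete: splitting a year into quarters bounds each EFTS query to a three-month filing window, which is what keeps a query under the 10k result cap. The sketch below only shows how a `YYYY-Qn` partition key could map to a date window. The function name `quarter_date_range` is hypothetical and is not taken from `efts.py`; the actual `query_by_quarter` may compute its window differently.

```python
import calendar
from datetime import date


def quarter_date_range(quarter_key: str) -> tuple[date, date]:
    """Return the first and last calendar day for a key such as '2024-Q3'.

    Hypothetical helper for illustration; not part of the PR.
    """
    # "2024-Q3" -> ["2024", "3"]; a key without "-Q" fails the unpacking
    # below with a ValueError.
    year_text, quarter_text = quarter_key.split("-Q")
    year, quarter = int(year_text), int(quarter_text)
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"Invalid quarter in key: {quarter_key!r}")

    first_month = 3 * (quarter - 1) + 1  # 1, 4, 7, 10
    last_month = first_month + 2  # 3, 6, 9, 12
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, last_month)[1]
    return date(year, first_month, 1), date(year, last_month, last_day)
```

For example, `quarter_date_range("2024-Q2")` covers April 1 through June 30, 2024.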

🔍 Issues & Suggestions

Critical Issues

None identified.

High Priority

1. Hardcoded quarter range may need updating (robosystems/dagster/assets/sec.py:235)

SEC_QUARTERS = [f"{year}-Q{q}" for year in range(2019, 2026) for q in range(1, 5)]
  • This hardcodes quarters through 2025-Q4
  • Impact: When 2026 arrives, this will silently exclude 2026 data
  • Suggestion: Consider dynamic generation or add a TODO comment about annual updates
  • Alternative: Use datetime.now().year + 1 as the exclusive upper bound of range() so the current year is always included

2. Inconsistent error handling in partition key parsing (robosystems/dagster/assets/sec.py:746-750)

parts = partition_key.split("_", 2)  # Split into 3 parts max
if len(parts) != 3:
    context.log.error(f"Invalid partition key format: {partition_key}")
    return MaterializeResult(...)
  • Returns success-looking MaterializeResult with error metadata instead of raising exception
  • Impact: Invalid partitions may appear to succeed in monitoring dashboards
  • Suggestion: Raise a ValueError instead, or use status="failed" + explicit failure tracking

Medium Priority

3. Potential race condition in submissions caching (robosystems/dagster/assets/sec.py:51, 158)

_sec_submissions_cache: dict[str, dict] = {}
  • Module-level cache shared across concurrent asset executions
  • Impact: In parallel Dagster runs, cache could have stale/inconsistent data
  • Suggestion: Consider using Dagster's built-in caching mechanisms or per-run cache scoping

4. Missing validation for year parameter (robosystems/scripts/sec_pipeline.py:81-92)

def year_to_quarters(year: int | str) -> list[str]:
    y = int(year)
    return [f"{y}-Q{q}" for q in range(1, 5)]
  • No validation that year is reasonable (e.g., 1900-2100 range)
  • Impact: Could create nonsensical partition keys like "-5-Q1" or "999999-Q1"
  • Suggestion: Add basic validation or document expected range

5. Timeout comments are out of date (robosystems/scripts/sec_pipeline.py:103-104)

DEFAULT_DOWNLOAD_TIMEOUT = 7200  # 2 hours per year partition
DEFAULT_MATERIALIZE_TIMEOUT = 14400  # 4 hours for full materialization
  • Comments say "per year partition" but downloads are now quarterly
  • Impact: Comments are misleading after the refactor
  • Suggestion: Update comments to say "per quarter partition" or "per download job"

6. Duplicate partition validation logic (robosystems/scripts/sec_pipeline.py:562, 956)

  • Same partition key parsing logic repeated in _run_parallel_processing and cmd_process_parallel
  • Impact: Maintenance burden, risk of divergence
  • Suggestion: Extract to helper function: parse_partition_key(raw_key: str) -> tuple[str, str, str]
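
A sketch of the shared helper with the signature proposed above. It also raises `ValueError` on a malformed key, in line with the suggestion under issue 2. The only facts taken from the PR are the `split("_", 2)` call and the three-part check; the meaning of each field is not shown in this review, so the helper returns the parts unnamed, and the empty-field check is an added assumption.

```python
def parse_partition_key(raw_key: str) -> tuple[str, str, str]:
    """Split a partition key into its three underscore-separated fields.

    Raises ValueError instead of returning a success-looking result, so a
    malformed key fails loudly at the call site.
    """
    # maxsplit=2 keeps any further underscores inside the third field.
    parts = raw_key.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"Invalid partition key format: {raw_key!r}")
    first, second, third = parts
    # Assumption: an empty field is also treated as invalid.
    if not (first and second and third):
        raise ValueError(f"Empty field in partition key: {raw_key!r}")
    return first, second, third
```

Both `_run_parallel_processing` and `cmd_process_parallel` could then call this one function, which removes the risk of the two copies diverging.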

Low Priority

7. Magic number in error message truncation (robosystems/scripts/sec_pipeline.py:270-276)

if len(stderr) <= 500:
    error = stderr
else:
    error = f"{stderr[:250]}...{stderr[-250:]}"
  • 500/250 are magic numbers
  • Suggestion: Extract as constants: MAX_ERROR_LENGTH = 500, ERROR_TRIM_LENGTH = 250
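
With the constants extracted, the truncation could look like the sketch below. The constant names come from the suggestion above; wrapping the logic in a function named `truncate_error` is a hypothetical addition.

```python
MAX_ERROR_LENGTH = 500  # messages up to this length are kept whole
ERROR_TRIM_LENGTH = 250  # characters kept from each end of a long message


def truncate_error(stderr: str) -> str:
    """Keep the head and tail of a long stderr message, eliding the middle."""
    if len(stderr) <= MAX_ERROR_LENGTH:
        return stderr
    return f"{stderr[:ERROR_TRIM_LENGTH]}...{stderr[-ERROR_TRIM_LENGTH:]}"
```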

8. Unused import or incomplete cleanup (robosystems/dagster/assets/__init__.py:36)

  • The PR description mentions "Updated Dagster assets" but the diff shows only import reordering
  • Verify: Check if the import changes in __init__.py are intentional

9. Test coverage gaps (tests/adapters/sec/client/test_efts.py)

  • No tests for rate limiting (429 response handling) in query_by_quarter
  • No tests for the convenience functions query_efts and query_efts_sync
  • Suggestion: Add integration-style tests for these edge cases

10. Docstring could be clearer (robosystems/adapters/sec/client/efts.py:288)

NOTE: Full years often exceed the EFTS 10k result limit.
Use query_by_quarter for production pipelines.
  • Should quantify "often" (e.g., "Full years typically return 20-30k filings")
  • Suggestion: Add concrete numbers to help users decide when to use which method

📝 Additional Observations

  1. Type hints: Excellent use of modern Python type hints throughout (e.g., list[str] instead of List[str])

  2. Error messages: Clear and actionable error messages with context

  3. Code style: Follows repo standards (Ruff formatting, 88-char lines, self-documenting code)

  4. Performance: Async/await patterns are correctly implemented for I/O-bound operations

  5. Configuration: Properly uses centralized config (robosystems.config.env) per CLAUDE.md guidelines

🎯 Recommendations

Before merging:

  1. Address the hardcoded year range (High Priority #1)
  2. Fix the misleading timeout comments (Medium Priority #5)
  3. Consider the error handling inconsistency (High Priority #2)

Post-merge follow-ups:

  1. Add integration tests for rate limiting scenarios
  2. Extract duplicate partition parsing logic
  3. Document the expected year range for quarter generation

🏆 Overall Assessment

Recommendation: Approve with minor changes

This is a well-executed refactor that addresses a real limitation (EFTS 10k result cap) with a clean solution. The quarterly partitioning is properly integrated throughout the pipeline, and the test coverage demonstrates thoughtful engineering. The issues identified are mostly minor housekeeping items that don't affect core functionality.

The code demonstrates strong adherence to project standards and architectural patterns. Great work!


- Modified the usage instructions for the `sec-download` command to specify downloading the top 10 companies across all quarters.
- Clarified the `sec-load` command description to indicate it chains all steps for a single company.
@jfrench9 jfrench9 merged commit 9731bcd into main Jan 17, 2026
6 checks passed
@jfrench9 jfrench9 deleted the refactor/sec-pipeline-partitions branch January 17, 2026 18:05
