Refactor SEC indexing with quarterly partitioning and capacity constraints #520
- Updated SECTextBlockIndexConfig, SECNarrativeIndexConfig, and SECiXBRLIndexConfig to include detailed descriptions and partitioning by quarter.
- Modified job definitions for sec_textblocks_index, sec_narratives_index, and sec_ixbrl_index to reflect quarterly partitioning in descriptions and added partition definitions.
- Improved sec_post_stage_index_sensor to derive the current quarter for indexing jobs.
- Adjusted sec_textblocks_indexed function to process parquet files based on the current quarter, enhancing memory management and backfill capabilities.
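For readers following the sensor change, here is a minimal sketch of how a current-quarter key might be derived from a timestamp. The function name `quarter_key` and the `YYYY-Qn` key format are assumptions for illustration; the actual helper and partition key format in this PR may differ.

```python
from datetime import datetime, timedelta, timezone


def quarter_key(moment: datetime) -> str:
    """Return a key such as '2024-Q3' for the quarter containing `moment`.

    Hypothetical helper: the 'YYYY-Qn' format is an assumption, not taken
    from the PR. `moment` should be timezone-aware; a naive datetime would
    be interpreted as local time by astimezone().
    """
    # Normalize to UTC so the quarter boundary does not depend on the
    # timezone of the host running the sensor.
    utc_moment = moment.astimezone(timezone.utc)
    # Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on.
    quarter = (utc_moment.month - 1) // 3 + 1
    return f"{utc_moment.year}-Q{quarter}"


# Key for "now", as a sensor tick might compute it.
current_key = quarter_key(datetime.now(timezone.utc))

# A timestamp just after midnight on 1 January in UTC+2 is still in the
# previous year's fourth quarter when viewed in UTC.
edge_case = datetime(2024, 1, 1, 0, 30, tzinfo=timezone(timedelta(hours=2)))
edge_key = quarter_key(edge_case)
```

Normalizing to UTC before computing the quarter keeps the result stable around quarter boundaries, regardless of where the code runs.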
@claude please review this PR
Claude finished @jfrench9's task in 2m 4s
PR Review: Refactor SEC indexing with quarterly partitioning and capacity constraints
Tasks:
Overview
This is a solid, well-motivated refactor. The quarterly partitioning strategy is the right architectural direction for addressing OpenSearch capacity constraints, and the streaming approach in
Issues
1.
- Replaced timezone handling in sec_post_stage_index_sensor to use UTC directly for consistency.
- Streamlined data processing in sec_textblocks_indexed by consolidating DataFrame operations and reducing iteration overhead.
- Removed the unused _partition_year function and its associated tests to clean up the codebase.
- Enhanced readability by formatting DataFrame operations and improving logging messages for better traceability.
Summary
Refactors the SEC indexing pipeline to introduce quarterly partitioning and address OpenSearch capacity constraints. This change enhances the indexing configurations, jobs, sensors, and core text indexing logic to support a more scalable and resource-aware approach to SEC document ingestion.
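As a rough illustration of the quarterly partitioning idea, the sketch below enumerates quarter keys across a range and maps each key to a half-open date window that an indexing pass could use to select its filings. The names `quarter_keys` and `quarter_bounds` and the `YYYY-Qn` key format are hypothetical and not taken from this PR.

```python
from datetime import date


def _parse(key: str) -> tuple[int, int]:
    """Split a hypothetical 'YYYY-Qn' key into (year, quarter)."""
    year_text, quarter_text = key.split("-Q")
    year, quarter = int(year_text), int(quarter_text)
    if not 1 <= quarter <= 4:
        raise ValueError(f"invalid quarter in key: {key!r}")
    return year, quarter


def quarter_bounds(key: str) -> tuple[date, date]:
    """Return the [start, end) date window covered by a quarter key."""
    year, quarter = _parse(key)
    start = date(year, 3 * (quarter - 1) + 1, 1)
    # The window ends on the first day of the next quarter (exclusive).
    end = date(year + 1, 1, 1) if quarter == 4 else date(year, 3 * quarter + 1, 1)
    return start, end


def quarter_keys(first: str, last: str) -> list[str]:
    """List every quarter key from `first` through `last`, inclusive."""
    first_year, first_quarter = _parse(first)
    last_year, last_quarter = _parse(last)
    # Number quarters consecutively so that year rollover is handled by
    # simple integer arithmetic.
    first_index = first_year * 4 + (first_quarter - 1)
    last_index = last_year * 4 + (last_quarter - 1)
    return [f"{i // 4}-Q{i % 4 + 1}" for i in range(first_index, last_index + 1)]


# Example: a backfill over four quarters, one bounded window per pass.
windows = {key: quarter_bounds(key) for key in quarter_keys("2023-Q3", "2024-Q2")}
```

Each window is half-open, so a filing dated exactly on a quarter boundary falls into one partition only.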
Key Accomplishments
Quarterly Partitioning: Introduced quarterly partitioning for SEC indexing operations, enabling more granular control over data processing windows and reducing the volume of data handled in each indexing pass. This directly addresses OpenSearch capacity limitations by breaking large indexing workloads into manageable quarterly chunks.
Enhanced Configuration Management: Expanded and restructured indexing configurations (configs.py) to support the new partitioning scheme, providing clearer and more flexible parameterization for pipeline runs.
Job and Sensor Updates: Updated pipeline jobs and sensors to be partition-aware, ensuring that scheduling and triggering logic correctly aligns with the quarterly data boundaries.
Text Index Refactoring: Significantly refactored the core text indexing module (text_index.py) with ~340 lines added and ~196 removed, improving code structure, readability, and robustness while integrating the quarterly partitioning logic throughout the indexing workflow.
Breaking Changes
Testing Notes
Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:
refactor/opensearch-capacity-constraints
main
Co-Authored-By: Claude noreply@anthropic.com