Refactor SEC processing sensor to improve efficiency with batch processing #151
…ocessing This update removes the deprecated `_check_processed_exists` function and introduces a new method, `_list_processed_partitions`, to efficiently list processed partition keys from S3. The `sec_processing_sensor` function is enhanced to process raw filings in batches, utilizing a cursor for pagination and set lookups for checking processed files, significantly improving performance and reducing timeout risks. Logging has been updated for better tracking of processing status.
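A minimal sketch of the batch-plus-set-lookup idea described in this commit, not the PR's actual code. The function name `next_batch` and its parameters are hypothetical, and for simplicity it assumes raw keys and processed partition keys are the same strings and that `raw_keys` arrives in ascending lexicographic order, as an S3 listing would.

```python
from typing import List, Optional, Set, Tuple


def next_batch(
    raw_keys: List[str],
    processed: Set[str],
    cursor: Optional[str],
    batch_size: int,
) -> Tuple[List[str], Optional[str]]:
    """Examine at most batch_size raw keys after the cursor.

    Returns the keys that still need processing and the cursor to persist.
    A returned cursor of None means the listing was exhausted, so the next
    evaluation starts again from the beginning.
    """
    # Keep only keys strictly after the cursor, then cap the window size.
    window = [k for k in raw_keys if cursor is None or k > cursor][:batch_size]
    # Set membership is O(1) per key, so no per-file existence call is needed.
    todo = [k for k in window if k not in processed]
    # A full window suggests more keys may remain, so remember where we stopped.
    new_cursor = window[-1] if len(window) == batch_size else None
    return todo, new_cursor
```

Each evaluation touches a bounded number of keys, which is what keeps a single sensor tick short regardless of how many filings the bucket holds.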
@claude please review this PR
Claude finished @jfrench9's task in 1m 59s
PR Review: SEC Processing Sensor Batch Optimization
I've reviewed the refactoring of the SEC processing sensor. This is a well-executed performance optimization that addresses a critical scalability bottleneck. Here's my detailed analysis:
✅ Strengths
1. Excellent Performance Optimization
This change alone could reduce sensor evaluation time from minutes to seconds for large filing sets.
2. Smart Cursor-Based Pagination (lines 163-233)
3. Proper Set-Based Deduplication (lines 91-119)
4. Maintains Backward Compatibility
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a22117f329
paginate_kwargs = {"Bucket": raw_bucket, "Prefix": sec_prefix}
if start_after:
    paginate_kwargs["StartAfter"] = start_after
Handle late-arriving keys before current cursor
Because the sensor paginates using StartAfter from the persisted cursor and only resets once it reaches the end of the listing, any new raw ZIPs whose S3 keys sort lexicographically before the current cursor won’t be seen until a full scan completes. In a large bucket, that can delay processing of late-arriving filings for hours or days (e.g., older year/CIK keys uploaded after the cursor has advanced). Consider a strategy that periodically rescans earlier prefixes or avoids a strictly forward-only cursor so new keys are picked up promptly.
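To make the concern concrete, here is a small sketch under stated assumptions. `list_after` emulates S3's `StartAfter` semantics (keys strictly after the given key, in lexicographic order) on an in-memory list, and `choose_start_after` is a hypothetical helper showing one possible mitigation, a periodic restart from the beginning of the listing. Neither name comes from the PR, and the key layout is invented for illustration.

```python
from typing import List, Optional


def list_after(keys: List[str], start_after: Optional[str]) -> List[str]:
    """Emulate an S3 listing: sorted keys strictly after start_after."""
    return [k for k in sorted(keys) if start_after is None or k > start_after]


def choose_start_after(
    cursor: Optional[str], tick: int, rescan_every: int
) -> Optional[str]:
    """Pick the StartAfter value for this sensor evaluation.

    Every rescan_every-th tick ignores the cursor so that keys sorting
    before it are seen again. A rescan_every of 0 disables rescans.
    """
    if rescan_every > 0 and tick % rescan_every == 0:
        return None
    return cursor


# Illustrative keys only: the cursor has advanced into 2024 when an older
# 2023 filing is uploaded.
keys = ["sec/2024/a.zip", "sec/2024/b.zip", "sec/2023/z.zip"]
cursor = "sec/2024/b.zip"

missed = list_after(keys, choose_start_after(cursor, tick=7, rescan_every=10))
found = list_after(keys, choose_start_after(cursor, tick=10, rescan_every=10))
```

With a forward-only cursor the 2023 key is invisible until the listing wraps around; the periodic reset bounds that delay at the cost of re-listing keys already seen, which the set-based processed check would then filter out.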
…g sensor This update renames the test class and methods to reflect the new functionality of the _list_processed_partitions helper. The tests now cover scenarios for finding processed partitions, handling empty buckets, and ignoring non-Entity files. Additionally, the sensor tests have been updated to use the new helper, ensuring accurate validation of processed filings.
This update introduces additional yield statements in the sec_processing_sensor function to provide clearer feedback when no raw filings are found or when all filings in a batch have already been processed. These changes enhance the logging and user experience by explicitly stating the reasons for skipping processing in different scenarios.
…lings and already processed filings This change modifies the assertions in the test cases to reflect the new behavior of the sec_processing_sensor, ensuring that it returns a SkipReason when no raw filings are found or when all filings have already been processed. This enhances the clarity of test outcomes and aligns with recent updates to the sensor's functionality.
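The two skip paths described in these commits can be sketched as a generator. This is an illustration only: `SkipReason` and `RunRequest` below are local stand-in dataclasses rather than the orchestration framework's own types, and `evaluate` with its parameters and messages is a hypothetical name, not the PR's `sec_processing_sensor`.

```python
from dataclasses import dataclass
from typing import Iterator, List, Set, Union


@dataclass
class SkipReason:
    """Stand-in for a framework type that explains why nothing was launched."""
    skip_message: str


@dataclass
class RunRequest:
    """Stand-in for a framework type that requests one processing run."""
    run_key: str


def evaluate(
    raw_keys: List[str], processed: Set[str]
) -> Iterator[Union[SkipReason, RunRequest]]:
    # Case 1: nothing was listed at all.
    if not raw_keys:
        yield SkipReason("No raw filings found")
        return
    pending = [k for k in raw_keys if k not in processed]
    # Case 2: the batch had filings, but every one is already processed.
    if not pending:
        yield SkipReason("All filings in this batch are already processed")
        return
    # Otherwise request one run per unprocessed filing.
    for key in pending:
        yield RunRequest(run_key=key)
```

Yielding an explicit reason in both empty cases lets a test assert on the outcome type instead of inferring it from an empty result.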
Summary
This PR refactors the SEC processing sensor to significantly improve efficiency by implementing batch processing capabilities. The changes optimize how SEC data is processed while maintaining backward compatibility.
Key Accomplishments
Changes Made
Breaking Changes
None. This refactor maintains backward compatibility with existing SEC processing workflows.
Testing Notes
Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:
refactor/sec-process-sensor-batches -> main
Co-Authored-By: Claude noreply@anthropic.com