Implement SEC EFTS-based discovery and downloader with improved logging #134
Conversation
- Introduced a new `downloader.py` module for async downloading of SEC filings using the EFTS API, featuring rate limiting and progress tracking.
- Added `efts.py` for querying the EFTS API, enabling efficient discovery of filings by form type and date.
- Updated `__init__.py` to include lazy imports for the new modules, preventing circular dependencies.
- Refactored the SEC pipeline to use EFTS for filing discovery, replacing the previous per-company iteration approach with O(1) discovery.
- Enhanced Dagster assets and jobs to support the new EFTS-based workflow, including configuration updates for submission fetching and filing downloads.
…improved readability

- Updated logging statements in `downloader.py` and `sec.py` to use multi-line formatting for better clarity and maintainability.
- Ensured consistent formatting across log messages to enhance code readability.
@claude please review this PR

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
Claude finished @jfrench9's task in 2m 13s — View job

PR Review Complete

I've completed a thorough review of this PR. Overall, this is an excellent refactoring that modernizes the SEC pipeline with EFTS-based discovery, async downloads, and intelligent rate limiting. The code quality is high, test coverage is comprehensive, and the architecture is well-designed. Below are my findings organized by category.

✅ Strengths

Architecture & Design

Code Quality

Testing
🔍 Issues Found

1. CRITICAL: Potential Infinite Recursion (High Priority)

Location: Both retry handlers can cause unbounded recursion if rate limiting persists:

```python
if response.status == 429:
    retry_after = int(response.headers.get("Retry-After", 60))
    logger.warning(f"Rate limited, waiting {retry_after}s")
    await asyncio.sleep(retry_after)
    return await self._download_filing(hit, year, bucket)  # Recursive call with no limit
```

Impact: Stack overflow if the SEC API continues returning 429s.

Recommendation:

```python
async def _download_filing(self, hit: EFTSHit, year: int, bucket: str, retry_count: int = 0) -> bool:
    MAX_RETRIES = 3
    # ... existing code ...
    if response.status == 429:
        if retry_count >= MAX_RETRIES:
            logger.error(f"Max retries exceeded for {hit.accession_number}")
            self._stats.failed += 1
            return False
        retry_after = int(response.headers.get("Retry-After", 60))
        logger.warning(f"Rate limited, waiting {retry_after}s (retry {retry_count + 1}/{MAX_RETRIES})")
        await asyncio.sleep(retry_after)
        return await self._download_filing(hit, year, bucket, retry_count + 1)
```

Apply the same pattern to

2. Security: Unvalidated Retry-After Header

Location:

```python
retry_after = int(response.headers.get("Retry-After", 60))
```

Issue: No validation on

Fix:

```python
retry_after = min(int(response.headers.get("Retry-After", 60)), 300)  # Cap at 5 minutes
```

3. Resource Leak Risk: S3 Upload Outside Semaphore

Location:

```python
async with self._semaphore:
    async with self._limiter:
        # Download happens here
        content = await response.read()
# Upload to S3 (OUTSIDE semaphore)
try:
    self._get_s3_client().s3_client.put_object(...)
```

Issue: The S3 upload happens outside the semaphore, meaning you could have unbounded concurrent S3 uploads if many downloads complete simultaneously. This could exhaust file descriptors or network connections.

Impact: Potential resource exhaustion under high concurrency.

Recommendation: Either:
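One way to address this can be sketched as follows: keep the upload inside the same semaphore as the download so total in-flight work stays bounded. This is a minimal, self-contained illustration with made-up names (`process_filing`, `MAX_CONCURRENT`, the `state` dict), not the PR's actual code:

```python
import asyncio

MAX_CONCURRENT = 3  # illustrative stand-in for the downloader's concurrency setting

async def process_filing(sem: asyncio.Semaphore, state: dict) -> None:
    async with sem:
        # Track how many coroutines are inside the semaphore at once.
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.001)  # simulated HTTP download
        await asyncio.sleep(0.001)  # simulated S3 upload, still inside the semaphore
        state["active"] -= 1

async def main() -> int:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(process_filing(sem, state) for _ in range(10)))
    return state["peak"]

peak = asyncio.run(main())
print(peak)  # never exceeds MAX_CONCURRENT
```

The trade-off is that a slot is held through the upload, so downloads wait on upload latency, but concurrency stays strictly bounded.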
4. Dagster Asset Duplication of Code

Location:

Issue: The

Current:

```python
@asset
def sec_raw_filings(...):
    async def run_efts_download():
        # 250+ lines of inline download logic
        # Reimplements rate limiting, semaphores, retry logic, etc.
```

Recommendation: Refactor to use the new classes:

```python
@asset
def sec_raw_filings(...):
    import asyncio
    from robosystems.adapters.sec.client import SECDownloader

    async def run_download():
        async with SECDownloader(
            requests_per_second=config.download_rate,
            max_concurrent=config.download_concurrency,
            skip_existing=config.skip_existing,
        ) as downloader:
            return await downloader.download_year(
                year=year,
                form_types=config.form_types,
                ciks=cik_filter,
            )

    stats = asyncio.run(run_download())
    # Handle submissions separately if needed
```

This would reduce the asset from ~300 lines to ~50 lines and eliminate the duplication.

5. Missing Error Handling: Empty Content Check

Location:

```python
if not content or len(content) == 0:
```

Issue (minor): The check is fine, but the error handling only logs and marks the filing as failed. Consider whether some filings genuinely have empty XBRL (unlikely but possible).

6. Test Gap: No Integration Tests for Full Pipeline

Observation: While unit tests are comprehensive (947 lines), there don't appear to be integration tests that verify:
Recommendation: Add at least one integration test marked with
7. Inconsistent Naming:
- Added retry counts and maximum retry limits to handle rate limiting more effectively in `downloader.py` and `efts.py`.
- Improved logging for rate-limiting scenarios to provide clearer feedback on retry attempts and wait times.
- Updated `__init__.py` to clarify which modules are available via lazy imports, improving module organization and usability.
@claude have we addressed all issues?
Claude finished @jfrench9's task in 1m 56s — View job

Issue Status Review Complete ✅

I've verified which issues from the previous review have been addressed:

✅ Critical Issues Successfully Fixed

1. Infinite Recursion in Retry Logic (CRITICAL)

Status: ✅ FIXED

Both

The retry logic now properly exits after 3 attempts and logs errors appropriately. This prevents stack overflow.

2. Unvalidated Retry-After Header (Security)

Status: ✅ FIXED

Both files now cap the

This prevents DoS attacks from malicious servers sending extreme retry values.

3. S3 Upload Semaphore Placement (Resource Leak)

Status: ✅ FIXED

The S3 upload is now inside the semaphore block:

This prevents unbounded concurrent S3 uploads.

❌ Issues NOT Fixed

4. Code Duplication in Dagster Asset (Should Fix)

Status: ❌ NOT FIXED

The

Current state:

Why this matters:

Recommendation: Refactor the asset to use

5. Missing Integration Tests (Nice to Have)

Status: ❌ NOT ADDED

While unit test coverage is excellent (947 lines of tests), there are no integration tests that verify:

Note: This was marked as "nice to have" in the original review, so it is not critical for merge.
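For reference, the verified behavior (bounded retries plus a capped `Retry-After`) can be expressed as a small self-contained sketch. The names and the iterative loop shape here are illustrative, not the PR's actual implementation (which keeps the recursive form):

```python
import asyncio

MAX_RETRIES = 3
RETRY_AFTER_CAP = 300  # cap at 5 minutes, per issue 2

def parse_retry_after(header_value, default=60):
    """Parse a Retry-After header defensively: fall back on missing or
    non-numeric values, and cap the wait against hostile servers."""
    try:
        value = int(header_value)
    except (TypeError, ValueError):
        value = default
    return max(0, min(value, RETRY_AFTER_CAP))

async def fetch_with_retry(fetch, max_retries=MAX_RETRIES):
    """Iterative retry loop: same behavior as the bounded recursive fix,
    but with no stack growth. Gives up after max_retries 429 responses."""
    for attempt in range(max_retries + 1):
        status, headers, body = await fetch()
        if status != 429:
            return body
        if attempt == max_retries:
            return None  # caller records the failure
        await asyncio.sleep(parse_retry_after(headers.get("Retry-After")))
    return None

# Demo with a stub that rate-limits twice before succeeding.
calls = {"n": 0}

async def stub_fetch():
    calls["n"] += 1
    if calls["n"] <= 2:
        return 429, {"Retry-After": "0"}, None
    return 200, {}, b"filing"

result = asyncio.run(fetch_with_retry(stub_fetch))
print(result)  # b'filing' after two retries
```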
| Issue | Severity | Priority | Fixed? |
|---|---|---|---|
| Infinite recursion in retry handlers | Critical | Must Fix | ✅ YES |
| Unvalidated Retry-After header | Security | Must Fix | ✅ YES |
| S3 upload outside semaphore | High | Must Fix | ✅ YES |
| Code duplication in Dagster asset | Medium | Should Fix | ❌ NO |
| Missing integration tests | Low | Nice to Have | ❌ NO |
| `__init__.py` export mismatch | Low | Static Analysis | N/A (intentional) |
🎯 Final Answer
Yes, all critical issues have been addressed! ✅
The three must-fix issues from the original review are now resolved:
- ✅ Stack overflow prevention (retry limits added)
- ✅ Security hardening (Retry-After capped)
- ✅ Resource leak prevention (S3 upload in semaphore)
Remaining work (not blockers):
- Code duplication in the Dagster asset - recommend addressing in a follow-up PR
- `__init__.py` static analysis warnings - intentional design choice, safe to ignore
The PR is safe to merge from a correctness and security standpoint. Great work fixing the critical issues! 🚀
(Inline diff review threads, anchored on the `s3.client.head_object` skip-existing check for submissions and the `asyncio.as_completed` progress-logging loop; the thread comments were not captured.)
Summary
This PR introduces a new SEC EFTS (EDGAR Full-Text Search) based discovery mechanism and downloader, replacing the previous implementation with a more robust and efficient solution. The changes include comprehensive logging improvements and a modular client architecture for SEC data processing.
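As a rough illustration of the discovery model, the EFTS full-text search endpoint accepts a form type and date range directly, so one query can enumerate filings without iterating companies. The endpoint path and parameter names below reflect the publicly observable API and are an assumption here; the PR's `efts.py` wrapper may differ:

```python
from urllib.parse import urlencode

# Assumed public endpoint; the PR's efts.py presumably wraps something similar.
EFTS_BASE = "https://efts.sec.gov/LATEST/search-index"

def build_efts_url(form_type: str, start: str, end: str) -> str:
    """Build one discovery query covering every filing of a form type in a
    date range -- O(1) discovery versus per-company iteration."""
    params = {"q": "", "forms": form_type, "startdt": start, "enddt": end}
    return f"{EFTS_BASE}?{urlencode(params)}"

url = build_efts_url("10-K", "2023-01-01", "2023-12-31")
print(url)
```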
Key Accomplishments
Breaking Changes
Testing Notes
Infrastructure Considerations
🤖 Generated with Claude Code
Branch Info:
`feature/improved-sec-downloader` → `main`

Co-Authored-By: Claude <noreply@anthropic.com>