Fix S3 credential management and enhance SEC data processing by jfrench9 · Pull Request #276 · RoboFinSystems/robosystems

jfrench9 · 2026-02-01T23:04:37Z

Summary

This bugfix addresses issues with S3 credential management in the LadybugBackend while enhancing SEC submission fetching and Parquet file handling capabilities.

Key Changes

Enhanced S3 Credential Management: Refactored credential fetching logic in LadybugBackend to improve reliability and error handling
Improved SEC Data Processing: Enhanced submission fetching mechanisms with better error handling and data validation
Parquet Handling Upgrades: Strengthened Parquet file processing with improved data integrity checks and optimization
Dagster Asset Updates: Refactored SEC-related assets for better performance and maintainability

Key Accomplishments

✅ Resolved S3 authentication issues that were preventing reliable data access
✅ Improved robustness of SEC data ingestion pipeline
✅ Enhanced Parquet file processing with better error recovery
✅ Updated test coverage to reflect new functionality
✅ Streamlined credential management workflow

Breaking Changes

None. All changes are backward compatible.

Testing Notes

All existing unit tests have been updated and continue to pass
Enhanced test coverage for Parquet processing edge cases
S3 credential fetching has been tested across different authentication scenarios
SEC data processing improvements verified with sample datasets

Infrastructure Considerations

Monitor S3 connection stability after deployment
Verify credential rotation workflows continue to function properly
Watch for any performance improvements in SEC data processing jobs
Ensure Parquet file processing maintains expected throughput

🤖 Generated with Claude Code

Branch Info:

Source: bugfix/lbug-copy-s3
Target: main
Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

- Added support for explicit S3 credentials and environment variable loading. - Implemented fetching of credentials from EC2 Instance Metadata Service (IMDS) for IAM roles. - Improved error handling and logging for credential loading processes. - Configured S3 performance settings consistently across credential modes.

- Simplified credential loading process by distinguishing between production and development environments. - Enhanced error messages for clarity when credentials are unavailable. - Added fallback mechanism to use explicit credentials in development if IMDS fails.

- Added `skip_submissions` option to `SECDownloadConfig` to control fetching of submissions.json files, improving rate limiting management. - Updated `sec_raw_filings` function to conditionally skip submissions fetching based on the new configuration. - Improved ParquetWriter to ensure string columns are explicitly typed using pyarrow-backed string dtype, addressing type inference issues in Parquet files. - Added tests to verify the new string dtype handling for Parquet columns.

jfrench9 · 2026-02-01T23:04:39Z

@claude please review this PR

claude · 2026-02-01T23:04:55Z

Claude finished @jfrench9's task in 1m 48s —— View job

PR Review Complete ✅

I've reviewed all 4 changed files in this PR. Here's my comprehensive analysis:

Tasks

Read changed files to understand modifications
Review parquet.py processor changes
Review SEC Dagster assets refactoring
Review LadybugBackend S3 credential management
Review test file updates
Check for bugs, security issues, and code quality
Verify alignment with repository standards
Provide comprehensive review feedback

Summary

This PR implements important improvements to S3 credential management and SEC data processing. The changes are generally well-designed, but I've identified several critical security issues and code quality concerns that should be addressed before merging.

Critical Issues 🚨

1. SQL Injection Vulnerability in S3 Credential Handling (robosystems/graph_api/backends/lbug.py:188-285)

The credential escaping logic only escapes single quotes but doesn't handle other SQL injection vectors:

escaped_key = s3_credentials["aws_access_key_id"].replace("'", "''")
conn.execute(f"CALL s3_access_key_id = '{escaped_key}'")

Problem: If credentials contain backslashes, double quotes, or other special characters, this could lead to SQL injection or broken queries.

Recommendation: Use parameterized queries if LadybugDB supports them, or implement more robust escaping:

import re
def escape_sql_string(value: str) -> str:
    """Escape string for SQL CALL statement."""
    return value.replace("\\", "\\\\").replace("'", "''")

Fix this →

2. Missing Error Handling for Missing `requests` Library (robosystems/graph_api/backends/lbug.py:220)

The code imports requests inside a try block for IMDS, but requests may not be installed:

try:
    import requests
    # ... IMDS calls
except Exception as imds_err:
    logger.warning(f"Failed to fetch credentials from EC2 IMDS: {imds_err}")

Problem: If requests is not installed, the generic exception handler masks the real issue (ImportError vs network error).

Recommendation: Add explicit ImportError handling and ensure requests is in production dependencies:

try:
    import requests
except ImportError:
    logger.error("requests library required for EC2 IMDS credential fetching in production")
    credentials_loaded = False
else:
    try:
        # IMDS logic here
    except Exception as imds_err:
        logger.warning(f"Failed to fetch credentials from EC2 IMDS: {imds_err}")

Fix this →

3. Incorrect Prefix in Test Assertions (tests/adapters/sec/processors/test_parquet_writer.py:88, 104)

Test assertions expect node__ and rel__ but the actual code produces node_ and rel_:

# Test expects:
assert result == "node__Entity.parquet"  # Line 88
# But code produces:
prefix = "rel_" if is_relationship else "node_"  # parquet.py:283

Problem: Tests are passing with wrong expectations. Either the tests or the code is incorrect.

Recommendation: Fix test assertions to match actual behavior or update the code to use double underscores if that's the intended format.

Fix this →

Major Issues ⚠️

4. Missing Return Value Affects Error Detection (robosystems/graph_api/backends/lbug.py:277-281)

When credentials fail to load, the code logs an error but doesn't return or raise:

if not credentials_loaded:
    logger.error("No S3 credentials available...")
    # Code continues to execute S3 operations anyway

Problem: Subsequent S3 operations will fail with cryptic errors instead of failing fast with a clear credential error.

Recommendation: Raise an exception when credentials can't be loaded:

if not credentials_loaded:
    raise S3IngestionError(
        "No S3 credentials available. In prod/staging, ensure EC2 has IAM role. "
        "In dev, set AWS_S3_ACCESS_KEY_ID/AWS_S3_SECRET_ACCESS_KEY."
    )

Fix this →

5. Inconsistent Column Type Handling (robosystems/adapters/sec/processors/parquet.py:341, 420)

The code calls _convert_to_string_dtype() but doesn't use its return value:

_convert_to_string_dtype(df, col)  # Lines 341, 420

But the function returns the modified DataFrame:

def _convert_to_string_dtype(df: pd.DataFrame, col: str) -> pd.DataFrame:
    df[col] = df[col].astype(pd.ArrowDtype(pa.string()))
    return df  # Return value is ignored!

Problem: While the function mutates the DataFrame in place (so it still works), ignoring the return value is inconsistent with the function signature and could cause confusion.

Recommendation: Either use the return value or change the function to return None:

df = _convert_to_string_dtype(df, col)  # Use return value
# OR
def _convert_to_string_dtype(df: pd.DataFrame, col: str) -> None:  # Change signature

Minor Issues 📝

6. Hardcoded Timeout Values (robosystems/graph_api/backends/lbug.py:225, 231, 234, 241)

IMDS requests have hardcoded 2-second timeouts:

timeout=2,

Recommendation: Extract to a constant for easier tuning:

IMDS_TIMEOUT_SECONDS = 2
# ... later ...
timeout=IMDS_TIMEOUT_SECONDS,

7. Missing Type Hints (robosystems/graph_api/backends/lbug.py:25-31)

The _load_httpfs_extension method lacks a return type annotation:

def _load_httpfs_extension(self, conn) -> None:  # Add -> None

Note: The conn parameter also lacks a type hint, but this may be intentional if the connection type is complex.

8. Confusing Variable Name (robosystems/graph_api/backends/lbug.py:248)

The variable name Token differs from AWS API convention:

session_token = creds.get("Token", "")  # AWS IMDS returns "Token", not "SessionToken"

Note: This is actually correct per AWS IMDS API, but a comment would help clarify since AWS SDKs typically use SessionToken.

Code Quality Observations ✨

Good Practices:

Excellent error recovery - The IMDS → env var fallback pattern is robust
Clear logging - Good distinction between info/warning/error levels
Performance optimization - Using asyncio.to_thread for long-running COPY operations (line 321)
Test coverage - Comprehensive unit tests for ParquetWriter functionality

Architectural Concerns:

IMDS logic belongs in a separate module - The ingest_from_s3 method is now 213 lines long. Consider extracting credential fetching to a dedicated S3CredentialProvider class.
Missing SEC assets diff review - The dagster/assets/sec.py file has 275 changed lines but I couldn't review all changes in detail. Consider breaking large refactors into separate PRs.

Repository Standards Compliance

Per CLAUDE.md:

✅ No os.getenv() usage - correctly uses from robosystems.config import env
✅ No production emojis - only used in logs during development
✅ Type hints present (mostly)
⚠️ Missing docstrings for new IMDS logic - consider adding inline comments

Recommendations for Next Steps

Before merging: Fix critical issues Inital open source release of robosystems #1, Fix MCP tools query results to include units and improve dimensions #2, Enhance SEC pipeline security and add reconsolidation support #3
High priority: Address issue chore: improve documentation and suppress build output #4 (fail fast on missing credentials)
Nice to have: Address issues Remove unused Account and Disclosure transaction nodes from graph schema #5-8
Follow-up PR: Extract credential logic to separate class for maintainability

Testing Recommendations

Before deploying to production:

Test IMDS credential fetching on actual EC2 instance with IAM role
Verify session token handling works (or document if unsupported)
Test with special characters in S3 credentials (especially ' and \)
Verify asyncio.to_thread doesn't cause issues with existing Dagster jobs

Overall Assessment: The PR implements valuable functionality with good architectural decisions (environment-based credential handling, async execution). However, the security and error handling issues must be addressed before merging to prevent potential SQL injection and production failures.

- Modified the sec_raw_filings function to ignore the result of the await expression when fetching submissions, improving clarity and maintaining functionality.

jfrench9 added 3 commits January 31, 2026 21:07

github-code-quality Bot found potential problems Feb 1, 2026

View reviewed changes

Comment thread robosystems/dagster/assets/sec.py Fixed

jfrench9 and others added 2 commits February 1, 2026 17:10

fix: Update submission fetching in sec_raw_filings function

6fe161b

- Modified the sec_raw_filings function to ignore the result of the await expression when fetching submissions, improving clarity and maintaining functionality.

Merge branch 'main' into bugfix/lbug-copy-s3

c88ce29

jfrench9 merged commit 7704094 into main Feb 1, 2026
7 checks passed

jfrench9 deleted the bugfix/lbug-copy-s3 branch February 2, 2026 18:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix S3 credential management and enhance SEC data processing#276

Fix S3 credential management and enhance SEC data processing#276
jfrench9 merged 5 commits into
mainfrom
bugfix/lbug-copy-s3

jfrench9 commented Feb 1, 2026

Uh oh!

jfrench9 commented Feb 1, 2026

Uh oh!

claude Bot commented Feb 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jfrench9 commented Feb 1, 2026

Summary

Key Changes

Key Accomplishments

Breaking Changes

Testing Notes

Infrastructure Considerations

Uh oh!

jfrench9 commented Feb 1, 2026

Uh oh!

claude Bot commented Feb 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review Complete ✅

Tasks

Summary

Critical Issues 🚨

1. SQL Injection Vulnerability in S3 Credential Handling (robosystems/graph_api/backends/lbug.py:188-285)

2. Missing Error Handling for Missing requests Library (robosystems/graph_api/backends/lbug.py:220)

3. Incorrect Prefix in Test Assertions (tests/adapters/sec/processors/test_parquet_writer.py:88, 104)

Major Issues ⚠️

4. Missing Return Value Affects Error Detection (robosystems/graph_api/backends/lbug.py:277-281)

5. Inconsistent Column Type Handling (robosystems/adapters/sec/processors/parquet.py:341, 420)

Minor Issues 📝

6. Hardcoded Timeout Values (robosystems/graph_api/backends/lbug.py:225, 231, 234, 241)

7. Missing Type Hints (robosystems/graph_api/backends/lbug.py:25-31)

8. Confusing Variable Name (robosystems/graph_api/backends/lbug.py:248)

Code Quality Observations ✨

Repository Standards Compliance

Recommendations for Next Steps

Testing Recommendations

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Feb 1, 2026 •

edited

Loading

2. Missing Error Handling for Missing `requests` Library (robosystems/graph_api/backends/lbug.py:220)