
Fix DuckDB materialization robustness with SSE monitoring and deps#212

Merged
jfrench9 merged 5 commits into main from bugfix/duckdb-materialization-robustness on Jan 19, 2026

Conversation

@jfrench9
Member

Summary

This PR enhances the robustness of DuckDB table materialization processes by implementing Server-Sent Events (SSE) monitoring for long-running operations and refactoring the SEC data pipeline to remove staging dependencies.

Key Changes

DuckDB Materialization Improvements

  • SSE Monitoring: Added real-time monitoring capabilities for long-running DuckDB table creation tasks through Server-Sent Events
  • Enhanced Error Handling: Improved error handling and logging throughout the materialization process
  • Task Management: Upgraded task management system with better response formats and monitoring capabilities

SEC Pipeline Refactoring

  • Removed Staging Dependencies: Eliminated DuckDB staging references from the SEC pipeline for simplified architecture
  • Streamlined Processing: Refactored ingestion processors and asset definitions for better maintainability
  • Updated Job Configuration: Modified job and sensor configurations to align with new pipeline structure

Dependency Updates

  • PyArrow: Updated to version 23.0.0 for improved performance and compatibility
  • Soupsieve: Updated to version 2.8.2 for enhanced CSS selector parsing

Technical Improvements

  • Enhanced graph API client with better task monitoring capabilities
  • Improved table management router with comprehensive materialization workflow
  • Updated task models to support new monitoring features
  • Refined sensor logic for better pipeline orchestration

Testing

  • Updated test suites to reflect new task management and response formats
  • Modified sensor tests to align with pipeline changes
  • Enhanced table management tests for improved coverage

Infrastructure Considerations

  • Long-running DuckDB operations now provide real-time progress updates
  • Improved monitoring visibility for database materialization processes
  • Enhanced error tracking and debugging capabilities for production environments
  • No breaking changes to existing API contracts

This update significantly improves the reliability and observability of DuckDB materialization workflows while modernizing the underlying infrastructure dependencies.


🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/duckdb-materialization-robustness
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…pp Runner and Lambda services. Updated comments for clarity on allowed frontend repositories.
…ctor SEC pipeline to remove DuckDB staging references and enhance materialization process. Implement SSE monitoring for long-running DuckDB table creation tasks, improving error handling and logging. Update related tests to reflect changes in task management and response formats.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Jan 19, 2026

Claude finished @jfrench9's task in 1m 33s


PR Review: DuckDB Materialization Robustness

Review Tasks

  • Read and analyze all 15 changed files
  • Review SSE monitoring implementation
  • Review SEC pipeline refactoring
  • Review error handling and logging improvements
  • Review test coverage
  • Provide comprehensive feedback

Summary

This PR introduces solid improvements to DuckDB table materialization with SSE monitoring for long-running operations. The architecture is well-designed with proper separation of concerns. However, there are several issues that should be addressed before merging, ranging from critical bugs to maintainability concerns.

Critical Issues

1. Potential Deadlock in SSE Event Loop (robosystems/graph_api/core/task_sse.py:148)

Severity: High

```python
await asyncio.sleep(2)
```

The SSE event generator has a tight polling loop with 2-second sleeps. While the task is running, this continuously polls Redis every 2 seconds. For a long-running task (30+ minutes), this means 900+ Redis queries. This could:

  • Cause unnecessary load on Redis
  • Potentially starve other async operations if many SSE connections are active
  • Waste resources on rapid polling when no updates are available

Recommendation: Consider implementing a more efficient notification mechanism (e.g., Redis pub/sub) or increase the polling interval to 5-10 seconds for steady-state monitoring.
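A minimal sketch of the pub/sub alternative, assuming a redis-py async client is injected; the channel naming scheme (`lbug:task-events:*`) and helper names are illustrative, not the repo's actual API:

```python
import json

def task_channel(task_id: str) -> str:
    # Channel the task writer publishes progress updates to (hypothetical name)
    return f"lbug:task-events:{task_id}"

async def wait_for_update(redis_client, task_id: str, timeout: float = 30.0):
    """Block until the next update for task_id, or return None on timeout.

    On None the caller can emit an SSE heartbeat instead of re-polling
    Redis every 2 seconds; actual state changes arrive immediately.
    """
    pubsub = redis_client.pubsub()
    await pubsub.subscribe(task_channel(task_id))
    try:
        msg = await pubsub.get_message(
            ignore_subscribe_messages=True, timeout=timeout
        )
        return json.loads(msg["data"]) if msg else None
    finally:
        await pubsub.unsubscribe(task_channel(task_id))
        await pubsub.close()
```

The writer side would pair this with `await redis_client.publish(task_channel(task_id), json.dumps(update))` whenever it updates the task record, so the SSE generator wakes only when there is something to send.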

2. Missing Timeout Handling in Table Creation (robosystems/graph_api/routers/databases/tables/management.py:136-189)

Severity: High

The perform_table_creation background task has no timeout mechanism. If table_manager.create_table(request) hangs indefinitely, the task will never complete, and the SSE stream will keep sending heartbeats forever (or until the client disconnects).

Recommendation: Add a timeout wrapper around the table creation:

```python
try:
    result = await asyncio.wait_for(
        asyncio.to_thread(table_manager.create_table, request),
        timeout=1800  # 30 minutes
    )
except asyncio.TimeoutError:
    await staging_task_manager.update_task(
        task_id,
        status=TaskStatus.FAILED,
        error="Table creation timeout after 30 minutes"
    )
```

3. Incorrect Error Handling in SSE Monitor (robosystems/graph_api/client/client.py:521-525)

Severity: Medium

```python
if current_time - last_heartbeat > 120:
    logger.warning(
        "No heartbeat received for 2 minutes, connection may be stale"
    )
```

The code logs a warning about stale connections but continues processing. This should either:

  • Return an error and exit the loop
  • Attempt to reconnect
  • Or clearly document why it's only a warning

As written, a stale connection will continue indefinitely, wasting resources.

Major Issues

4. Inconsistent Response Format (robosystems/adapters/sec/processors/ingestion.py:328-341)

Severity: Medium

The _create_duckdb_tables method expects SSE response format but inconsistently extracts data:

```python
result = response.get("result", {})
duration = response.get("duration_seconds", result.get("duration_seconds", 0))
```

This double-fallback pattern (response.get() then result.get()) suggests uncertainty about response structure. The response format should be clearly defined in the API contract.

Recommendation: Document the exact response format from create_table and remove ambiguity. Consider using Pydantic models for type safety.
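As a stdlib-only sketch of pinning the contract (the review suggests Pydantic, which would work the same way), a strict parser with assumed field names makes contract drift fail loudly instead of silently reading a 0 fallback:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableCreationResult:
    # Field names are illustrative assumptions about the create_table response
    table_name: str
    duration_seconds: float
    row_count: int = 0

    @classmethod
    def from_response(cls, response: dict) -> "TableCreationResult":
        # One documented shape, parsed strictly: a KeyError here means the
        # API contract changed, rather than a quiet double-fallback.
        result = response["result"]
        return cls(
            table_name=result["table_name"],
            duration_seconds=float(result["duration_seconds"]),
            row_count=int(result.get("row_count", 0)),
        )
```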

5. Missing Error Recovery in Pipeline (robosystems/adapters/sec/processors/ingestion.py:312-345)

Severity: Medium

When table creation fails, the exception is raised immediately, aborting the entire pipeline. This means if table 3 of 10 fails, tables 4-10 are never created. Consider:

```python
failed_tables = []
for table_name, s3_keys in tables_info.items():
    try:
        # ... create table ...
    except Exception as e:
        logger.error(f"Failed to create table {table_name}: {e}")
        failed_tables.append((table_name, str(e)))
        continue  # Process remaining tables

if failed_tables:
    raise RuntimeError(f"Failed to create {len(failed_tables)} tables: {failed_tables}")
```

6. Unbounded Redis Key Storage (robosystems/graph_api/routers/databases/tables/management.py:90-96)

Severity: Medium

Task data is stored in Redis with 24-hour TTL, but there's no cleanup for completed tasks or limits on storage:

```python
await redis_client.setex(
    f"lbug:task:{task_id}",
    86400,  # 24 hours
    json.dumps(task_data),
)
```

In a high-throughput environment, this could accumulate thousands of completed tasks in Redis. Consider:

  • Shorter TTL for completed tasks (1 hour)
  • Move completed task history to PostgreSQL
  • Implement periodic cleanup
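The shorter-TTL option is a small change: pick the TTL from the task's status whenever the record is written, so finished tasks age out quickly while in-flight ones keep the current 24-hour window. A sketch (the status strings are assumptions about the repo's TaskStatus values):

```python
ACTIVE_TASK_TTL = 86400      # 24 h for in-flight tasks, as today
COMPLETED_TASK_TTL = 3600    # 1 h once a task reaches a terminal state

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumed values

def task_ttl(status: str) -> int:
    """Pick the Redis TTL for a task record from its status."""
    return (
        COMPLETED_TASK_TTL
        if status.lower() in TERMINAL_STATUSES
        else ACTIVE_TASK_TTL
    )
```

Usage would be `await redis_client.setex(f"lbug:task:{task_id}", task_ttl(status), json.dumps(task_data))` on every write, including the final status update.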

Code Quality Issues

7. Inconsistent Naming Convention (robosystems/graph_api/routers/tasks.py:119-131)

Severity: Low

```python
def get_task_type(self, task_id: str) -> TaskType:
    if task_id.startswith("ingest") or task_id.startswith("copy"):
        return TaskType.INGESTION
    elif task_id.startswith("backup"):
        return TaskType.BACKUP
    elif task_id.startswith("staging"):
        return TaskType.STAGING
```

This hardcoded prefix matching is fragile. If task ID format changes or new task types are added, this breaks. Consider:

  • Storing task type in Redis task data
  • Using a registry pattern for task types
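A sketch of the registry option, with a mirrored enum (the enum values are assumptions about the repo's `TaskType`); new task types become a one-line registration, and unknown prefixes fail loudly instead of falling through and returning None:

```python
from enum import Enum

class TaskType(str, Enum):  # mirrors the repo's enum; values are assumed
    INGESTION = "ingestion"
    BACKUP = "backup"
    STAGING = "staging"

# Single place to register prefixes
TASK_TYPE_PREFIXES: dict[str, TaskType] = {
    "ingest": TaskType.INGESTION,
    "copy": TaskType.INGESTION,
    "backup": TaskType.BACKUP,
    "staging": TaskType.STAGING,
}

def get_task_type(task_id: str) -> TaskType:
    for prefix, task_type in TASK_TYPE_PREFIXES.items():
        if task_id.startswith(prefix):
            return task_type
    raise ValueError(f"unrecognized task id prefix: {task_id!r}")
```

Storing the type explicitly in the Redis task record at creation time would be more robust still, since it decouples the type from the ID format entirely.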

8. Magic Numbers (robosystems/graph_api/core/task_sse.py:43, 75, 148)

Severity: Low

```python
heartbeat_interval: int = 30
if current_time - last_heartbeat > heartbeat_interval:
```

The heartbeat interval (30s), stale connection threshold (120s), and polling interval (2s) are hardcoded. These should be configurable constants at the module level for easier tuning.
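A sketch of module-level constants with environment overrides; the env var names are assumptions, chosen here for illustration:

```python
import os

# Named, env-overridable tunables replacing the literals 30 / 120 / 2
SSE_HEARTBEAT_INTERVAL_SECONDS = int(os.getenv("SSE_HEARTBEAT_INTERVAL_SECONDS", "30"))
SSE_STALE_THRESHOLD_SECONDS = int(os.getenv("SSE_STALE_THRESHOLD_SECONDS", "120"))
SSE_POLL_INTERVAL_SECONDS = float(os.getenv("SSE_POLL_INTERVAL_SECONDS", "2"))
```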

9. Overly Broad Exception Handling (robosystems/graph_api/routers/databases/tables/management.py:179-188)

Severity: Low

```python
except Exception as e:
    logger.error(f"[Task {task_id}] Failed: {e}")
```

Catching all exceptions makes debugging harder and could hide programming errors. Consider catching specific exceptions and letting others propagate for proper error tracking.
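A sketch of the narrower handling, where `create` and `mark_failed` are stand-ins for the repo's callables (the actual exception set should match the failure modes DuckDB and the task manager can raise):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_table_creation(task_id, create, mark_failed):
    """Record expected failure modes; let programming errors propagate."""
    try:
        return await create()
    except asyncio.TimeoutError:
        await mark_failed(task_id, "table creation timed out")
        raise
    except (OSError, RuntimeError) as e:  # expected operational failures
        logger.error(f"[Task {task_id}] Failed: {e}")
        await mark_failed(task_id, str(e))
        raise
    # TypeError, KeyError, etc. propagate untouched, so genuine bugs reach
    # the error tracker instead of being recorded as ordinary task failures.
```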

10. Test Coverage Gaps

Severity: Low

The updated tests (test_tables_management.py:37-54) only test the happy path for SSE-based table creation. Missing tests for:

  • Timeout scenarios
  • Task failure scenarios
  • SSE connection errors
  • Heartbeat handling
  • Progress updates

Documentation Issues

11. Missing SSE Event Format Documentation

Severity: Low

The SSE event stream format (events: connected, heartbeat, progress, completed, failed) is not documented in the code or API docs. This makes it hard for clients to implement monitoring correctly.

Recommendation: Add comprehensive docstrings to generate_task_sse_events with example event payloads.
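A sketch of what such a docstring could look like; the payload field names are illustrative assumptions, not the stream's confirmed format:

```python
def generate_task_sse_events(task_id: str):
    """Stream lifecycle events for a task over SSE.

    Event types, with illustrative payloads (field names are assumptions):

        event: connected   data: {"task_id": "...", "status": "pending"}
        event: heartbeat   data: {"elapsed_seconds": 60}
        event: progress    data: {"message": "creating table 3/10", "percent": 30}
        event: completed   data: {"result": {...}, "duration_seconds": 812.4}
        event: failed      data: {"error": "Table creation timeout after 30 minutes"}

    The stream ends after a terminal `completed` or `failed` event.
    """
```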

12. Unclear DuckDB Limitations (robosystems/adapters/sec/processors/ingestion.py:18-27)

Severity: Low

The docstring mentions "LIMITATION: This approach currently ALWAYS rebuilds the graph from scratch" but doesn't explain why or what would be needed for incremental updates. This is important architectural context.

Positive Aspects

  • Well-structured SSE implementation: the generic generate_task_sse_events can be reused for all long-running operations
  • Good separation of concerns: task management, SSE streaming, and business logic are properly separated
  • Comprehensive logging: good use of structured logging with task IDs and timing information
  • Proper async/await usage: correctly uses async patterns throughout
  • Clean refactoring: successfully removed staging dependencies from the SEC pipeline
  • Dependency updates: PyArrow 23.0.0 and soupsieve 2.8.2 bring performance improvements

Minor Suggestions

  1. robosystems/dagster/sensors/sec.py:154-158 - Consider making the dev environment skip configurable via env var for testing
  2. robosystems/graph_api/models/tasks.py:30 - Add STAGING to the docstring examples
  3. robosystems/dagster/assets/sec.py - The 77-line deletion suggests significant refactoring; ensure no valuable logic was lost

Recommendations

Before Merging:

  1. Fix critical issue #2 (missing timeout in table creation)
  2. Address issue #3 (stale connection handling)
  3. Add timeout/failure tests to test suite

Post-Merge:

  1. Monitor Redis memory usage with 24-hour task TTL in production
  2. Consider implementing Redis pub/sub for more efficient SSE notifications
  3. Add API documentation for SSE event formats
  4. Add observability metrics for task durations and failure rates

Overall, this is a solid improvement to the codebase that enhances observability and robustness. The SSE monitoring architecture is well-designed. Address the critical timeout issues and consider the recommendations for production readiness.

…and reliability. Update SEC processing job to utilize a 99% Spot and 1% On-Demand capacity provider strategy. Enhance DuckDB table creation process with better error handling and logging, including timeout management for long-running tasks. Return list of successfully created tables and log failures for improved debugging.
@jfrench9 jfrench9 merged commit 6857a38 into main Jan 19, 2026
7 checks passed
@jfrench9 jfrench9 deleted the bugfix/duckdb-materialization-robustness branch January 20, 2026 05:27