
Fix DuckDB materialization robustness with SSE monitoring and deps#212

Merged
jfrench9 merged 5 commits into main from bugfix/duckdb-materialization-robustness on Jan 19, 2026

Conversation

@jfrench9
Member

Summary

This PR enhances the robustness of DuckDB table materialization processes by implementing Server-Sent Events (SSE) monitoring for long-running operations and refactoring the SEC data pipeline to remove staging dependencies.

Key Changes

DuckDB Materialization Improvements

  • SSE Monitoring: Added real-time monitoring capabilities for long-running DuckDB table creation tasks through Server-Sent Events
  • Enhanced Error Handling: Improved error handling and logging throughout the materialization process
  • Task Management: Upgraded task management system with better response formats and monitoring capabilities

SEC Pipeline Refactoring

  • Removed Staging Dependencies: Eliminated DuckDB staging references from the SEC pipeline for simplified architecture
  • Streamlined Processing: Refactored ingestion processors and asset definitions for better maintainability
  • Updated Job Configuration: Modified job and sensor configurations to align with new pipeline structure

Dependency Updates

  • PyArrow: Updated to version 23.0.0 for improved performance and compatibility
  • Soupsieve: Updated to version 2.8.2 for enhanced CSS selector parsing

Technical Improvements

  • Enhanced graph API client with better task monitoring capabilities
  • Improved table management router with comprehensive materialization workflow
  • Updated task models to support new monitoring features
  • Refined sensor logic for better pipeline orchestration

Testing

  • Updated test suites to reflect new task management and response formats
  • Modified sensor tests to align with pipeline changes
  • Enhanced table management tests for improved coverage

Infrastructure Considerations

  • Long-running DuckDB operations now provide real-time progress updates
  • Improved monitoring visibility for database materialization processes
  • Enhanced error tracking and debugging capabilities for production environments
  • No breaking changes to existing API contracts

This update significantly improves the reliability and observability of DuckDB materialization workflows while modernizing the underlying infrastructure dependencies.


🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/duckdb-materialization-robustness
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude noreply@anthropic.com

…pp Runner and Lambda services. Updated comments for clarity on allowed frontend repositories.
…ctor SEC pipeline to remove DuckDB staging references and enhance materialization process. Implement SSE monitoring for long-running DuckDB table creation tasks, improving error handling and logging. Update related tests to reflect changes in task management and response formats.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Jan 19, 2026

Claude finished @jfrench9's task in 1m 33s


PR Review: DuckDB Materialization Robustness

Review Tasks

  • Read and analyze all 15 changed files
  • Review SSE monitoring implementation
  • Review SEC pipeline refactoring
  • Review error handling and logging improvements
  • Review test coverage
  • Provide comprehensive feedback

Summary

This PR introduces solid improvements to DuckDB table materialization with SSE monitoring for long-running operations. The architecture is well-designed with proper separation of concerns. However, there are several issues that should be addressed before merging, ranging from critical bugs to maintainability concerns.

Critical Issues

1. Potential Deadlock in SSE Event Loop (robosystems/graph_api/core/task_sse.py:148)

Severity: High

```python
await asyncio.sleep(2)
```

The SSE event generator has a tight polling loop with 2-second sleeps. While the task is running, this continuously polls Redis every 2 seconds. For a long-running task (30+ minutes), this means 900+ Redis queries. This could:

  • Cause unnecessary load on Redis
  • Potentially starve other async operations if many SSE connections are active
  • Waste resources on rapid polling when no updates are available

Recommendation: Consider implementing a more efficient notification mechanism (e.g., Redis pub/sub) or increase the polling interval to 5-10 seconds for steady-state monitoring.
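A minimal sketch of the pub/sub alternative, assuming a redis-py async client is injected; the channel naming scheme (`lbug:task-events:*`) and helper names are illustrative, not the repo's actual API:

```python
import json

def task_channel(task_id: str) -> str:
    # Channel the task writer publishes progress updates to (hypothetical name)
    return f"lbug:task-events:{task_id}"

async def wait_for_update(redis_client, task_id: str, timeout: float = 30.0):
    """Block until the next update for task_id, or return None on timeout.

    On None the caller can emit an SSE heartbeat instead of re-polling
    Redis every 2 seconds; actual state changes arrive immediately.
    """
    pubsub = redis_client.pubsub()
    await pubsub.subscribe(task_channel(task_id))
    try:
        msg = await pubsub.get_message(
            ignore_subscribe_messages=True, timeout=timeout
        )
        return json.loads(msg["data"]) if msg else None
    finally:
        await pubsub.unsubscribe(task_channel(task_id))
        await pubsub.close()
```

The writer side would pair this with `await redis_client.publish(task_channel(task_id), json.dumps(update))` whenever it updates the task record, so the SSE generator wakes only when there is something to send.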

2. Missing Timeout Handling in Table Creation (robosystems/graph_api/routers/databases/tables/management.py:136-189)

Severity: High

The perform_table_creation background task has no timeout mechanism. If table_manager.create_table(request) hangs indefinitely, the task will never complete, and the SSE stream will keep sending heartbeats forever (or until the client disconnects).

Recommendation: Add a timeout wrapper around the table creation:

```python
try:
    result = await asyncio.wait_for(
        asyncio.to_thread(table_manager.create_table, request),
        timeout=1800  # 30 minutes
    )
except asyncio.TimeoutError:
    await staging_task_manager.update_task(
        task_id,
        status=TaskStatus.FAILED,
        error="Table creation timeout after 30 minutes"
    )
```

3. Incorrect Error Handling in SSE Monitor (robosystems/graph_api/client/client.py:521-525)

Severity: Medium

```python
if current_time - last_heartbeat > 120:
    logger.warning(
        "No heartbeat received for 2 minutes, connection may be stale"
    )
```

The code logs a warning about stale connections but continues processing. This should either:

  • Return an error and exit the loop
  • Attempt to reconnect
  • Or clearly document why it's only a warning

As written, a stale connection will continue indefinitely, wasting resources.

Major Issues

4. Inconsistent Response Format (robosystems/adapters/sec/processors/ingestion.py:328-341)

Severity: Medium

The _create_duckdb_tables method expects SSE response format but inconsistently extracts data:

```python
result = response.get("result", {})
duration = response.get("duration_seconds", result.get("duration_seconds", 0))
```

This double-fallback pattern (response.get() then result.get()) suggests uncertainty about response structure. The response format should be clearly defined in the API contract.

Recommendation: Document the exact response format from create_table and remove ambiguity. Consider using Pydantic models for type safety.
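As a stdlib-only sketch of pinning the contract (the review suggests Pydantic, which would work the same way), a strict parser with assumed field names makes contract drift fail loudly instead of silently reading a 0 fallback:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableCreationResult:
    # Field names are illustrative assumptions about the create_table response
    table_name: str
    duration_seconds: float
    row_count: int = 0

    @classmethod
    def from_response(cls, response: dict) -> "TableCreationResult":
        # One documented shape, parsed strictly: a KeyError here means the
        # API contract changed, rather than a quiet double-fallback.
        result = response["result"]
        return cls(
            table_name=result["table_name"],
            duration_seconds=float(result["duration_seconds"]),
            row_count=int(result.get("row_count", 0)),
        )
```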

5. Missing Error Recovery in Pipeline (robosystems/adapters/sec/processors/ingestion.py:312-345)

Severity: Medium

When table creation fails, the exception is raised immediately, aborting the entire pipeline. This means if table 3 of 10 fails, tables 4-10 are never created. Consider:

```python
failed_tables = []
for table_name, s3_keys in tables_info.items():
    try:
        # ... create table ...
    except Exception as e:
        logger.error(f"Failed to create table {table_name}: {e}")
        failed_tables.append((table_name, str(e)))
        continue  # Process remaining tables

if failed_tables:
    raise RuntimeError(f"Failed to create {len(failed_tables)} tables: {failed_tables}")
```

6. Unbounded Redis Key Storage (robosystems/graph_api/routers/databases/tables/management.py:90-96)

Severity: Medium

Task data is stored in Redis with 24-hour TTL, but there's no cleanup for completed tasks or limits on storage:

```python
await redis_client.setex(
    f"lbug:task:{task_id}",
    86400,  # 24 hours
    json.dumps(task_data),
)
```

In a high-throughput environment, this could accumulate thousands of completed tasks in Redis. Consider:

  • Shorter TTL for completed tasks (1 hour)
  • Move completed task history to PostgreSQL
  • Implement periodic cleanup
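The shorter-TTL option is a small change: pick the TTL from the task's status whenever the record is written, so finished tasks age out quickly while in-flight ones keep the current 24-hour window. A sketch (the status strings are assumptions about the repo's TaskStatus values):

```python
ACTIVE_TASK_TTL = 86400      # 24 h for in-flight tasks, as today
COMPLETED_TASK_TTL = 3600    # 1 h once a task reaches a terminal state

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumed values

def task_ttl(status: str) -> int:
    """Pick the Redis TTL for a task record from its status."""
    return (
        COMPLETED_TASK_TTL
        if status.lower() in TERMINAL_STATUSES
        else ACTIVE_TASK_TTL
    )
```

Usage would be `await redis_client.setex(f"lbug:task:{task_id}", task_ttl(status), json.dumps(task_data))` on every write, including the final status update.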

Code Quality Issues

7. Inconsistent Naming Convention (robosystems/graph_api/routers/tasks.py:119-131)

Severity: Low

```python
def get_task_type(self, task_id: str) -> TaskType:
    if task_id.startswith("ingest") or task_id.startswith("copy"):
        return TaskType.INGESTION
    elif task_id.startswith("backup"):
        return TaskType.BACKUP
    elif task_id.startswith("staging"):
        return TaskType.STAGING
```

This hardcoded prefix matching is fragile. If task ID format changes or new task types are added, this breaks. Consider:

  • Storing task type in Redis task data
  • Using a registry pattern for task types
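A sketch of the registry option, with a mirrored enum (the enum values are assumptions about the repo's `TaskType`); new task types become a one-line registration, and unknown prefixes fail loudly instead of falling through and returning None:

```python
from enum import Enum

class TaskType(str, Enum):  # mirrors the repo's enum; values are assumed
    INGESTION = "ingestion"
    BACKUP = "backup"
    STAGING = "staging"

# Single place to register prefixes
TASK_TYPE_PREFIXES: dict[str, TaskType] = {
    "ingest": TaskType.INGESTION,
    "copy": TaskType.INGESTION,
    "backup": TaskType.BACKUP,
    "staging": TaskType.STAGING,
}

def get_task_type(task_id: str) -> TaskType:
    for prefix, task_type in TASK_TYPE_PREFIXES.items():
        if task_id.startswith(prefix):
            return task_type
    raise ValueError(f"unrecognized task id prefix: {task_id!r}")
```

Storing the type explicitly in the Redis task record at creation time would be more robust still, since it decouples the type from the ID format entirely.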

8. Magic Numbers (robosystems/graph_api/core/task_sse.py:43, 75, 148)

Severity: Low

```python
heartbeat_interval: int = 30
if current_time - last_heartbeat > heartbeat_interval:
```

The heartbeat interval (30s), stale connection threshold (120s), and polling interval (2s) are hardcoded. These should be configurable constants at the module level for easier tuning.
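A sketch of module-level constants with environment overrides; the env var names are assumptions, chosen here for illustration:

```python
import os

# Named, env-overridable tunables replacing the literals 30 / 120 / 2
SSE_HEARTBEAT_INTERVAL_SECONDS = int(os.getenv("SSE_HEARTBEAT_INTERVAL_SECONDS", "30"))
SSE_STALE_THRESHOLD_SECONDS = int(os.getenv("SSE_STALE_THRESHOLD_SECONDS", "120"))
SSE_POLL_INTERVAL_SECONDS = float(os.getenv("SSE_POLL_INTERVAL_SECONDS", "2"))
```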

9. Overly Broad Exception Handling (robosystems/graph_api/routers/databases/tables/management.py:179-188)

Severity: Low

```python
except Exception as e:
    logger.error(f"[Task {task_id}] Failed: {e}")
```

Catching all exceptions makes debugging harder and could hide programming errors. Consider catching specific exceptions and letting others propagate for proper error tracking.
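A sketch of the narrower handling, where `create` and `mark_failed` are stand-ins for the repo's callables (the actual exception set should match the failure modes DuckDB and the task manager can raise):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def run_table_creation(task_id, create, mark_failed):
    """Record expected failure modes; let programming errors propagate."""
    try:
        return await create()
    except asyncio.TimeoutError:
        await mark_failed(task_id, "table creation timed out")
        raise
    except (OSError, RuntimeError) as e:  # expected operational failures
        logger.error(f"[Task {task_id}] Failed: {e}")
        await mark_failed(task_id, str(e))
        raise
    # TypeError, KeyError, etc. propagate untouched, so genuine bugs reach
    # the error tracker instead of being recorded as ordinary task failures.
```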

10. Test Coverage Gaps

Severity: Low

The updated tests (test_tables_management.py:37-54) only test the happy path for SSE-based table creation. Missing tests for:

  • Timeout scenarios
  • Task failure scenarios
  • SSE connection errors
  • Heartbeat handling
  • Progress updates

Documentation Issues

11. Missing SSE Event Format Documentation

Severity: Low

The SSE event stream format (events: connected, heartbeat, progress, completed, failed) is not documented in the code or API docs. This makes it hard for clients to implement monitoring correctly.

Recommendation: Add comprehensive docstrings to generate_task_sse_events with example event payloads.
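A sketch of what such a docstring could look like; the payload field names are illustrative assumptions, not the stream's confirmed format:

```python
def generate_task_sse_events(task_id: str):
    """Stream lifecycle events for a task over SSE.

    Event types, with illustrative payloads (field names are assumptions):

        event: connected   data: {"task_id": "...", "status": "pending"}
        event: heartbeat   data: {"elapsed_seconds": 60}
        event: progress    data: {"message": "creating table 3/10", "percent": 30}
        event: completed   data: {"result": {...}, "duration_seconds": 812.4}
        event: failed      data: {"error": "Table creation timeout after 30 minutes"}

    The stream ends after a terminal `completed` or `failed` event.
    """
```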

12. Unclear DuckDB Limitations (robosystems/adapters/sec/processors/ingestion.py:18-27)

Severity: Low

The docstring mentions "LIMITATION: This approach currently ALWAYS rebuilds the graph from scratch" but doesn't explain why or what would be needed for incremental updates. This is important architectural context.

Positive Aspects

  • Well-structured SSE implementation: the generic generate_task_sse_events can be reused for all long-running operations
  • Good separation of concerns: task management, SSE streaming, and business logic are properly separated
  • Comprehensive logging: good use of structured logging with task IDs and timing information
  • Proper async/await usage: correctly uses async patterns throughout
  • Clean refactoring: successfully removed staging dependencies from the SEC pipeline
  • Dependency updates: PyArrow 23.0.0 and soupsieve 2.8.2 bring performance improvements

Minor Suggestions

  1. robosystems/dagster/sensors/sec.py:154-158 - Consider making the dev environment skip configurable via env var for testing
  2. robosystems/graph_api/models/tasks.py:30 - Add STAGING to the docstring examples
  3. robosystems/dagster/assets/sec.py - The 77-line deletion suggests significant refactoring; ensure no valuable logic was lost

Recommendations

Before Merging:

  1. Fix critical issue #2 (missing timeout in table creation)
  2. Address issue #3 (stale connection handling)
  3. Add timeout/failure tests to test suite

Post-Merge:

  1. Monitor Redis memory usage with 24-hour task TTL in production
  2. Consider implementing Redis pub/sub for more efficient SSE notifications
  3. Add API documentation for SSE event formats
  4. Add observability metrics for task durations and failure rates

Overall, this is a solid improvement to the codebase that enhances observability and robustness. The SSE monitoring architecture is well-designed. Address the critical timeout issues and consider the recommendations for production readiness.

…and reliability. Update SEC processing job to utilize a 99% Spot and 1% On-Demand capacity provider strategy. Enhance DuckDB table creation process with better error handling and logging, including timeout management for long-running tasks. Return list of successfully created tables and log failures for improved debugging.
@jfrench9 jfrench9 merged commit 6857a38 into main Jan 19, 2026
7 checks passed
@jfrench9 jfrench9 deleted the bugfix/duckdb-materialization-robustness branch January 20, 2026 05:27