Skip to content

Refactor: Enable direct graph provisioning with enhanced SSE monitoring#137

Merged
jfrench9 merged 2 commits into
mainfrom
refactor/direct-graph-operation
Jan 5, 2026
Merged

Refactor: Enable direct graph provisioning with enhanced SSE monitoring#137
jfrench9 merged 2 commits into
mainfrom
refactor/direct-graph-operation

Conversation

@jfrench9
Copy link
Copy Markdown
Member

@jfrench9 jfrench9 commented Jan 5, 2026

Summary

This refactor introduces a direct graph provisioning feature that enables real-time graph operations through enhanced Server-Sent Events (SSE) monitoring capabilities. The changes streamline the graph creation and management workflow while improving observability and error handling.

Key Accomplishments

Direct Graph Provisioning

  • New Direct Monitor System: Implemented comprehensive direct monitoring capabilities for graph operations with real-time status tracking
  • Enhanced Graph Assets: Added Dagster assets for graph management and integrated them into the data pipeline
  • Improved Operation Management: Refactored the generic graph service to support direct provisioning workflows
  • Router Enhancements: Updated graph and subgraph routers to leverage the new direct provisioning capabilities

Infrastructure & Configuration

  • Environment Configuration: Extended configuration management with new environment variables and secrets handling
  • AWS Integration: Enhanced AWS setup procedures and S3 configuration documentation
  • Dependency Updates: Updated project dependencies and lock files to support new features

Monitoring & Observability

  • Event Storage Improvements: Refactored event storage mechanisms for better performance and reliability
  • SSE Middleware Enhancement: Expanded SSE capabilities with new monitoring modules and improved event handling
  • Billing Integration: Updated billing jobs to work with the new graph provisioning system
  • Dashboard Updates: Refined AWS Cost and Usage Report dashboard configurations

Testing & Quality Assurance

  • Comprehensive Test Coverage: Added extensive test suites for the new direct monitor functionality
  • Updated Existing Tests: Refactored existing graph and event storage tests to accommodate new features
  • Integration Testing: Enhanced test coverage for graph operations and SSE event handling

Infrastructure Considerations

  • Configuration Updates: Review and update environment variables for direct graph provisioning
  • AWS Permissions: Ensure appropriate AWS permissions are in place for enhanced S3 operations
  • Monitoring Setup: New monitoring capabilities may require observability infrastructure adjustments
  • Database Schema: Event storage changes may require database migration consideration

Breaking Changes

None identified. This refactor maintains backward compatibility while adding new functionality.

Deployment Notes

  • Verify all new environment variables are properly configured
  • Confirm AWS infrastructure permissions support the enhanced S3 operations
  • Monitor initial deployments for proper SSE event flow and graph provisioning performance

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/direct-graph-operation
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

…documentation

- Added `DIRECT_GRAPH_PROVISIONING_ENABLED` flag to `.env.example` and `env.py` to allow direct execution of graph creation, improving performance by reducing latency.
- Updated S3 configuration comments in `.env.example` for clarity, including deprecation notices for old bucket aliases and a new structure for primary bucket configurations.
- Refactored graph creation logic in `main.py`, `subgraphs/main.py`, and `generic_graph_service.py` to support direct execution paths, enhancing the overall graph creation process.
- Adjusted tests to disable direct execution for mocked paths, ensuring consistent behavior during testing.
@jfrench9
Copy link
Copy Markdown
Member Author

jfrench9 commented Jan 5, 2026

@claude please review this PR

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@claude
Copy link
Copy Markdown

claude Bot commented Jan 5, 2026

Claude finished @jfrench9's task in 2m 53s —— View job


PR Review Complete ✅

Reviewed PR #137: Direct Graph Provisioning with Enhanced SSE Monitoring


Executive Summary

This PR introduces a significant architectural improvement by implementing direct graph provisioning that bypasses Dagster orchestration for user-facing operations, reducing latency from 60+ seconds to ~3 seconds. The implementation is well-structured, thoroughly tested, and maintains backward compatibility with the existing Dagster-based workflow.

Overall Assessment: APPROVE with minor recommendations

The code quality is high, error handling is comprehensive, and the feature flag (DIRECT_GRAPH_PROVISIONING_ENABLED) provides safe rollout capability. However, there are several areas that warrant attention before or shortly after merge.


Critical Issues

1. Database Session Management - Potential Resource Leaks

Location: robosystems/middleware/sse/direct_monitor.py:230-293, 340-402, 538-680

Issue: The database session cleanup pattern using generator exhaustion is fragile and may leak connections on exceptions.

# Current pattern (lines 230-293)
db_gen = get_db_session()
db = next(db_gen)
try:
    # ... operations ...
finally:
    try:
        next(db_gen)
    except StopIteration:
        pass

Problem: If an exception occurs before the finally block, or if the generator protocol isn't properly exhausted, database connections may leak. This is especially critical in long-running background tasks.

Recommendation: Use context managers for guaranteed cleanup:

from robosystems.database import get_db_session_context

async with get_db_session_context() as db:
    # ... operations ...
    # Session automatically closed even on exceptions

Risk: HIGH - Connection pool exhaustion in production
Occurrences: run_entity_graph_creation (line 230), run_subgraph_creation (line 340), run_graph_provisioning (line 538), run_repository_provisioning (line 770)


2. Race Condition in Event Storage Metadata Updates

Location: robosystems/middleware/sse/event_storage.py:367-456, 534-620

Issue: While the code uses optimistic locking with Redis WATCH, there's a window where status updates can be lost if multiple events fire simultaneously.

Scenario:

  1. Progress event starts reading metadata (status: RUNNING)
  2. Completion event fires, updates status to COMPLETED
  3. Progress event's update overwrites COMPLETED → RUNNING

Current mitigation: Lines 378-386 skip progress events to reduce race conditions, but this is a band-aid solution.

Better approach: Use Redis MULTI/EXEC transactions with proper conflict resolution:

# Instead of skipping progress events entirely, use atomic operations
pipe.multi()
# Only update if current status priority is lower
pipe.execute()

Risk: MEDIUM - Potential for incorrect final states in operation metadata
Evidence: Comments at lines 375-378 and 544-545 acknowledge this limitation


High Priority Issues

3. Error Handling - Subscription Failure Cleanup

Location: robosystems/middleware/sse/direct_monitor.py:691-718

Issue: When graph provisioning fails, the subscription cleanup logic opens a new database session but doesn't handle the case where the session itself fails.

# Lines 692-717
try:
    db_gen = get_db_session()
    db = next(db_gen)
    try:
        subscription = db.query(BillingSubscription).filter(...).first()
        if subscription:
            subscription.status = "failed"
            # ... metadata update ...
            db.commit()  # ⚠️ No error handling for commit failures
    finally:
        try:
            next(db_gen)
        except StopIteration:
            pass
except Exception as cleanup_error:
    logger.error(f"Failed to mark subscription as failed: {cleanup_error}")

Problem: If db.commit() fails, the subscription remains in "provisioning" state, creating orphaned subscriptions.

Recommendation: Add explicit rollback and retry logic:

try:
    db.commit()
except Exception as commit_error:
    logger.error(f"Failed to commit subscription failure: {commit_error}")
    db.rollback()
    # Consider: Queue for manual intervention or retry

Risk: MEDIUM - Orphaned subscriptions requiring manual cleanup


4. Progress Callback Error Handling

Location: robosystems/middleware/sse/direct_monitor.py:68-88

Issue: The ProgressEmitter.__call__ method silently swallows all exceptions when emitting progress events.

def __call__(self, message: str, percent: float):
    try:
        asyncio.get_running_loop()
        asyncio.create_task(self._emit_async(message, percent))
    except RuntimeError:
        logger.debug(f"Skipping progress emit (no event loop): {message}")

Problems:

  1. create_task fires and forgets - errors in _emit_async are never logged
  2. No backpressure - could create thousands of pending tasks if service is slow
  3. No visibility into SSE system health

Recommendation:

def __call__(self, message: str, percent: float):
    try:
        loop = asyncio.get_running_loop()
        task = loop.create_task(self._emit_async(message, percent))
        # Store task reference to detect errors
        task.add_done_callback(self._handle_emit_error)
    except RuntimeError:
        logger.debug(f"No event loop, skipping progress: {message}")

def _handle_emit_error(self, task):
    if task.exception():
        logger.warning(f"Progress emit failed: {task.exception()}")
        # Optionally: Circuit breaker to disable progress if SSE is down

Risk: MEDIUM - Silent progress tracking failures


5. Dagster Materialization Timeout Too Aggressive

Location: robosystems/middleware/sse/direct_monitor.py:47-49

Issue: 5-second timeout for Dagster materialization reporting may be too short for cold Dagster instances.

DAGSTER_REPORT_TIMEOUT = 5.0

Impact: In production, if Dagster is under load, materialization reports will timeout and operations won't appear in the Dagster UI. This defeats the observability benefit.

Recommendation: Increase to 15-30 seconds, or make it configurable:

DAGSTER_REPORT_TIMEOUT = env.DAGSTER_MATERIALIZATION_TIMEOUT or 15.0

Risk: LOW - Reduced observability, but not a functional issue


Medium Priority Issues

6. Schema Persistence Logic Duplication

Location: robosystems/operations/graph/generic_graph_service.py:206-253

Issue: Schema persistence logic is duplicated between custom schema and extensions paths.

Lines 140-176 (custom schema):

schema_persistence = {
    "schema_type": "custom",
    "schema_ddl": custom_ddl,
    # ...
}

Lines 241-247 (extensions):

schema_persistence = {
    "schema_type": "extensions",
    "schema_ddl": extensions_ddl,
    # ...
}

Recommendation: Extract to a helper method:

def _prepare_schema_persistence(
    schema_type: str,
    ddl: str,
    json_config: dict,
    custom_name: str | None = None,
    custom_version: str | None = None
) -> dict:
    return {
        "schema_type": schema_type,
        "schema_ddl": ddl,
        "schema_json": json_config,
        "custom_schema_name": custom_name,
        "custom_schema_version": custom_version,
    }

Risk: LOW - Maintainability issue


7. Missing Environment Variable Documentation

Location: .env.example:94

Issue: New feature flag DIRECT_GRAPH_PROVISIONING_ENABLED is added but lacks comprehensive documentation about when to enable/disable it.

Current:

DIRECT_GRAPH_PROVISIONING_ENABLED=true

Recommendation: Add explanatory comment:

## Direct Graph Provisioning
## Bypasses Dagster orchestration for graph creation (60s -> 3s latency)
## Disable for debugging or if you need Dagster UI visibility for all operations
## Production: true (recommended), Debugging: false
DIRECT_GRAPH_PROVISIONING_ENABLED=true

Risk: LOW - Developer confusion


8. Billing Job Integration Incomplete Documentation

Location: robosystems/dagster/jobs/billing.py:168-217

Issue: The provisioning integration with direct execution is added but the transition path isn't documented. When does the code use direct execution vs Dagster jobs?

Lines 168-171: Checkout completion handler calls provisioning, but it's unclear if this uses direct execution or job-based execution.

Recommendation: Add docstring clarifying the execution path:

async def _handle_checkout_completed(...):
    """
    Handle checkout.session.completed event.
    
    Execution Path:
    - If DIRECT_GRAPH_PROVISIONING_ENABLED: Uses direct_monitor.run_graph_provisioning
    - Otherwise: Submits provision_graph_job to Dagster
    
    Note: Direct execution provides faster provisioning but less visibility in Dagster UI.
    """

Risk: LOW - Developer confusion during debugging


Low Priority / Nitpicks

9. Inconsistent Error Message Formats

Location: Multiple files

Examples:

  • direct_monitor.py:176: "Graph creation failed: {e}"
  • direct_monitor.py:296: "Entity graph creation failed: {e}"
  • direct_monitor.py:683: "Graph provisioning failed for subscription {subscription_id}: {e}"

Some include context (subscription_id), others don't (user_id, graph_id).

Recommendation: Standardize error logging format:

logger.error(
    f"Graph creation failed: {e}",
    extra={
        "user_id": user_id,
        "operation_id": operation_id,
        "error_type": type(e).__name__,
    }
)

This enables better log aggregation and alerting.

Risk: TRIVIAL - Log quality


10. Test Coverage - Missing Edge Cases

Location: tests/middleware/sse/test_direct_monitor.py

Observation: Tests cover happy path and basic failures, but missing:

  1. Concurrent operation handling - What happens if two operations try to provision the same subscription simultaneously?
  2. Partial failure recovery - If graph is created but credit pool creation fails (line 387-392), is the graph properly cleaned up?
  3. SSE connection loss - What happens if Redis connection is lost mid-operation?

Recommendation: Add integration tests for these scenarios:

@pytest.mark.integration
async def test_concurrent_provisioning_same_subscription():
    """Verify idempotency when webhook fires twice."""
    # ...

@pytest.mark.integration  
async def test_graph_cleanup_on_credit_pool_failure():
    """Verify graph is deallocated if credit pool creation fails."""
    # ...

Risk: TRIVIAL - But important for production confidence


Positive Observations

Excellent Design Decisions

  1. Feature Flag Pattern (line 507): Using DIRECT_GRAPH_PROVISIONING_ENABLED allows safe rollout and easy rollback
  2. Backward Compatibility (lines 548-597): Dagster path preserved as fallback
  3. Progress Callback Abstraction (lines 52-88): Clean adapter pattern for SSE integration
  4. Dagster Materialization Reporting (lines 156-169): Maintains observability even with direct execution
  5. Comprehensive Error Context (lines 179-183): Error details include operation type for debugging

Code Quality Highlights

  1. Docstrings: All functions have clear, comprehensive docstrings with Args/Returns
  2. Type Hints: Consistent use throughout (PEP 484 compliant)
  3. Logging: Structured logging with contextual information
  4. Testing: 646 lines of new tests demonstrate commitment to quality
  5. Separation of Concerns: Direct execution logic cleanly separated from Dagster logic

Security Best Practices

  1. Input Validation: Subscription status checks before provisioning (line 555)
  2. Idempotency: Webhook deduplication in billing.py (line 95-102)
  3. Error Information Disclosure: Error messages don't leak sensitive data

Performance Considerations

Positive Impacts

  1. Latency Reduction: 60s → 3s for graph creation (95% improvement)
  2. Resource Efficiency: Eliminates Dagster ECS task cold starts for user operations
  3. Scalability: BackgroundTasks scale better than Dagster for high-frequency operations

Potential Concerns

  1. Memory Usage: Background tasks run in API process, increasing memory footprint

    • Mitigation: Monitor API container memory usage post-deployment
    • Recommendation: Consider rate limiting graph creation operations
  2. Connection Pool Pressure: Direct execution uses main app's database pool

    • Current pool size: Likely 20-40 connections (from DEFAULT_POOL_SIZE)
    • Recommendation: Monitor connection pool saturation metrics

Adherence to Project Standards (CLAUDE.md)

Compliant:

  • No os.getenv() - uses centralized config (line 14, 458, 506)
  • Proper error handling patterns
  • Structured logging throughout
  • Type hints and docstrings

Violations:

  • Database Session Management: Not using recommended get_db_session() context manager pattern

⚠️ Recommendations:

  • Consider adding this pattern to the project's testing checklist
  • Update CLAUDE.md with guidance on background task database session management

Deployment Recommendations

Pre-Deployment

  1. Feature Flag Strategy:

    # Phase 1: Deploy with flag disabled (1 week)
    DIRECT_GRAPH_PROVISIONING_ENABLED=false
    
    # Phase 2: Enable for 10% of operations (canary)
    # Use modulo on operation_id hash to route
    
    # Phase 3: Full rollout
    DIRECT_GRAPH_PROVISIONING_ENABLED=true
  2. Monitoring Setup:

    • Add CloudWatch alarms for:
      • API container memory usage (> 80%)
      • Database connection pool saturation (> 90%)
      • Graph creation latency (p50, p95, p99)
      • SSE event publish failures (> 5% failure rate)
  3. Runbook Update:

    • Document rollback procedure (flip feature flag)
    • Add troubleshooting for orphaned subscriptions
    • Include Redis connectivity checks

Post-Deployment

  1. Metrics to Track:

    • Graph creation latency (before/after comparison)
    • Dagster materialization success rate
    • API container resource usage trends
    • Database connection pool utilization
  2. Alert on:

    • Subscription status = "provisioning" for > 5 minutes
    • Graph creation failures > 5% rate
    • SSE event storage errors

Testing Recommendations

Additional Test Scenarios

  1. Load Testing:

    # Simulate 100 concurrent graph creations
    # Verify no connection pool exhaustion
    # Verify no memory leaks
  2. Chaos Engineering:

    # Kill Redis mid-operation → verify graceful degradation
    # Simulate Dagster unavailability → verify materialization timeout handling
    # Simulate database connection loss → verify session cleanup
  3. Integration Tests:

    # End-to-end: Create graph → verify in database → verify SSE events → verify Dagster materialization

Documentation Updates Needed

  1. API Documentation:

    • Update graph creation endpoint docs to mention latency improvement
    • Document SSE event schema for direct operations
    • Add troubleshooting guide for failed provisioning
  2. Architecture Documentation:

    • Add sequence diagram showing direct execution flow
    • Document decision matrix: when to use Dagster vs direct execution
    • Update deployment architecture diagram
  3. Developer Guide:

    • Add section on implementing new direct execution operations
    • Document ProgressEmitter usage pattern
    • Explain Dagster materialization reporting best practices

Summary and Recommendation

Strengths

  • ✅ Significant performance improvement (95% latency reduction)
  • ✅ Well-tested with comprehensive test suite
  • ✅ Backward compatible with feature flag
  • ✅ Clean separation of concerns
  • ✅ Excellent documentation and code comments

Concerns

  • ⚠️ Database session management pattern needs improvement
  • ⚠️ Race condition risk in event storage (mitigated but not eliminated)
  • ⚠️ Missing monitoring and alerting for new execution path

Final Verdict

APPROVE

This is a well-designed and well-implemented feature that delivers significant value. The identified issues are manageable and can be addressed through:

The feature flag provides a safety net for rollback if issues arise in production. I recommend:

  1. Merge after addressing Critical Issue Inital open source release of robosystems #1 (database sessions)
  2. Deploy with feature flag disabled initially
  3. Enable for canary testing (10% of traffic)
  4. Monitor for 48 hours before full rollout
  5. Address remaining issues in follow-up PRs

Estimated Effort for Fixes:

Great work on this refactor! The architecture is sound and the implementation quality is high.


Files Reviewed: 21 files (3005 additions, 975 deletions)
Review Time: Comprehensive code analysis
Reviewer: Claude Sonnet 4.5


Comment thread robosystems/operations/graph/generic_graph_service.py Fixed
…r handling in database commits

- Introduced environment variable `DAGSTER_MATERIALIZATION_TIMEOUT` to configure the Dagster report timeout, defaulting to 15 seconds for improved handling of cold instances.
- Enhanced error handling during database commit operations in `run_graph_provisioning` and `run_repository_provisioning` functions, logging errors and rolling back transactions as necessary.
- Updated test case to reflect the new default timeout value.
@jfrench9 jfrench9 merged commit 6c03c1a into main Jan 5, 2026
6 checks passed
@jfrench9 jfrench9 linked an issue Jan 5, 2026 that may be closed by this pull request
37 tasks
@jfrench9 jfrench9 deleted the refactor/direct-graph-operation branch January 7, 2026 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Direct Operation Execution for Latency-Sensitive Tasks

1 participant