
Refactor Grafana dashboards and improve production infrastructure #343

Merged
jfrench9 merged 2 commits into main from refactor/improve-prod-implementation on Feb 12, 2026
Conversation

@jfrench9
Member

Summary

This refactoring effort streamlines our monitoring and infrastructure setup by consolidating Grafana dashboards, updating CloudFormation configurations, and enhancing the SEC data pipeline implementation.

Key Changes

Monitoring Infrastructure

  • Consolidated Grafana dashboards: Merged environment-specific dashboards into unified configurations
    • Renamed and restructured dashboard files for better organization
    • Removed redundant staging and production-specific log dashboards
    • Updated CUR (Cost and Usage Report) dashboard configuration
  • Updated documentation: Revised setup and usage instructions to reflect new dashboard structure

Infrastructure Updates

  • Enhanced CloudFormation templates: Updated API, Dagster, and graph-ladybug-replicas configurations
  • Improved resource definitions: Added new parameters and refined existing infrastructure components

SEC Pipeline Improvements

  • New stage pipeline module: Added comprehensive staging functionality for SEC data processing
  • Enhanced materialization logic: Improved data materialization processes with better error handling
  • Refactored staging operations: Streamlined staging workflows for better performance and maintainability

Key Accomplishments

  • Reduced dashboard maintenance overhead by eliminating duplicate configurations
  • Simplified monitoring setup with unified dashboard approach
  • Enhanced SEC data pipeline reliability and performance
  • Improved infrastructure configuration management

Breaking Changes

  • Grafana dashboard file names and structure have changed; existing dashboard imports will need to be updated
  • Some CloudFormation parameters may have been modified; review infrastructure deployments

Testing Notes

  • Verify Grafana dashboard functionality after import with new file structure
  • Test SEC pipeline staging and materialization processes
  • Validate CloudFormation stack updates in non-production environments first

Infrastructure Considerations

  • Dashboard configurations should be re-imported using the new consolidated files
  • CloudFormation stacks may require updates - review parameter changes before deployment
  • Monitor SEC pipeline performance after deployment to ensure improvements are working as expected

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/improve-prod-implementation
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Changed datasource UID from `${DS_ATHENA}` to `${athena}` in `cur.json` for consistency.
- Added new dashboards: `logs.json` for CloudWatch logs and `ops.json` for platform observability metrics.
- Removed outdated `prod_logs.json` and `staging.json` dashboards.
- Updated `README.md` to reflect new dashboard structure and template variable usage.
- Enhanced CloudFormation templates with additional permissions for CloudWatch metrics and Auto Scaling actions.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Feb 12, 2026

Claude finished @jfrench9's task in 1m 47s


PR Review Complete ✓

Todo List

  • Read and analyze Grafana dashboard changes
  • Review CloudFormation template updates
  • Review SEC pipeline code changes
  • Provide comprehensive review feedback

Summary

This is a well-executed refactoring that consolidates monitoring infrastructure and improves the SEC data pipeline. The changes significantly reduce maintenance overhead (+1693/-7331 lines) while enhancing functionality. I've identified a few areas for attention before merging.


🟢 Strengths

1. Grafana Dashboard Consolidation

  • Excellent consolidation from environment-specific dashboards to unified configs
  • Clean removal of staging.json and prod_logs.json eliminates duplication
  • Template variable approach (${env}, ${datasource}) is the right pattern
  • Updated README clearly documents the new structure
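
The unified-dashboard pattern praised here boils down to a Grafana `templating` block; a minimal sketch (variable name `env` taken from the review, values illustrative) looks like:

```json
{
  "templating": {
    "list": [
      {
        "name": "env",
        "type": "custom",
        "query": "prod,staging",
        "current": { "text": "prod", "value": "prod" }
      }
    ]
  }
}
```

Panels then reference paths such as `/robosystems/${env}/api`, so one dashboard serves both environments instead of two near-duplicate files.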

2. CloudFormation Improvements

  • dagster.yaml:177: The added TunnelCommand output is a great UX improvement
  • graph-ladybug-replicas.yaml: The consolidated disk monitoring configuration (single root volume with httpfs caching) aligns well with S3 ATTACH architecture

3. SEC Pipeline Enhancements

  • stage.py: New memory management with boost_graph_memory/release_graph_memory is smart - prevents OOM during large staging operations
  • materialization.py: copy_incremental_to_ladybug quarter overlap logic (lines 252-256) elegantly handles late-indexed filings
  • staging.py: Hash-based batched materialization (lines 776-825) is a solid approach for handling 500M+ row tables
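
To illustrate the quarter-overlap idea (this is a hedged sketch, not the actual code in materialization.py; the function name and the 14-day window are assumptions), an incremental run near a quarter boundary would also pull the previous quarter so late-indexed filings are not missed:

```python
from datetime import date


def quarters_to_copy(target: date, overlap_days: int = 14) -> list[tuple[int, int]]:
    """Return (year, quarter) pairs an incremental copy should cover.

    Besides the quarter containing `target`, include the previous quarter
    when we are within `overlap_days` of the quarter boundary, so filings
    indexed late for the prior quarter still get picked up.
    """
    q = (target.month - 1) // 3 + 1
    quarters = [(target.year, q)]
    quarter_start = date(target.year, 3 * (q - 1) + 1, 1)
    if (target - quarter_start).days < overlap_days:
        prev_q = q - 1 or 4  # wrap Q1 back to the prior year's Q4
        prev_year = target.year if q > 1 else target.year - 1
        quarters.append((prev_year, prev_q))
    return quarters
```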

🟡 Issues to Address

High Priority

1. CloudWatch Dashboard AWS Account Hardcoding

  • File: bin/tools/grafana/logs.json:653,681,709
  • Issue: Uses placeholder ${aws_account} in ARNs but this won't resolve automatically
"value": "arn:aws:logs:us-east-1:${aws_account}:log-group:/robosystems/prod/api:*"
  • Impact: Dashboard will fail to load log groups
  • Fix: Either document that users must replace this manually, or use a Grafana variable query to resolve the account ID dynamically
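
One way to resolve the placeholder without manual edits is a hidden Grafana constant variable; a sketch (the account ID shown is a placeholder, and `"hide": 2` marks the variable as hidden from the dashboard toolbar):

```json
{
  "templating": {
    "list": [
      {
        "name": "aws_account",
        "type": "constant",
        "query": "123456789012",
        "hide": 2
      }
    ]
  }
}
```

Operators would still set the value once at import time, but every ARN in the dashboard then resolves from a single place.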

2. SEC Pipeline Error Handling - Silent Failures

  • File: robosystems/adapters/sec/pipeline/stage.py:68-69,102-107
  • Issue: Memory boost/release failures are logged as warnings and swallowed
```python
except Exception as boost_err:
    context.log.warning(f"Could not boost memory (non-fatal): {boost_err}")
```
  • Impact: Pipeline may proceed with insufficient memory, leading to OOM failures downstream
  • Recommendation: At minimum, log these at ERROR level or add telemetry to track how often this fails. Consider failing the asset if the boost fails on known memory-intensive operations.
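
A hedged sketch of the recommended escalation, with a stub standing in for the pipeline's real helper (the stub and the `memory_intensive` flag are hypothetical, introduced only for illustration):

```python
import logging


def boost_graph_memory() -> None:
    """Stand-in for the pipeline's real memory-boost helper (name assumed)."""
    raise MemoryError("no headroom")


def run_stage(log: logging.Logger, memory_intensive: bool = True) -> None:
    """Escalate boost failures instead of swallowing them as warnings."""
    try:
        boost_graph_memory()
    except Exception as boost_err:
        # Log at ERROR (not WARNING) so the failure is visible in telemetry,
        # and fail fast for stages known to need the extra memory.
        log.error(f"Could not boost memory: {boost_err}")
        if memory_intensive:
            raise RuntimeError("memory boost failed for memory-intensive stage") from boost_err
```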

3. API CloudFormation Missing Context

  • File: cloudformation/api.yaml:8
  • Issue: The diff shows 8 added lines but no removed lines from api.yaml. Without seeing what was added, I can't verify correctness
  • Action Required: Please confirm what the 8 new lines are in api.yaml

Medium Priority

4. Incremental Staging - Partial Success Handling

  • File: robosystems/adapters/sec/processors/ingestion/staging.py:493
  • Issue: Returns status="partial" when some tables fail, but MaterializeResult still shows status="partial" which may not trigger alerts
status = "success" if len(successful_tables) == total_tables else "partial"
  • Question: Should partial failures fail the asset to prevent incomplete data propagation, or is the current approach intentional for resilience?
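
To make that policy choice explicit, the decision could be factored into a small helper; a sketch with assumed names (the function and `fail_on_partial` flag are hypothetical):

```python
def resolve_status(successful: list[str], total: int, *, fail_on_partial: bool) -> str:
    """Decide how a staging run with failed tables should surface."""
    if len(successful) == total:
        return "success"
    if fail_on_partial:
        # Failing the asset stops incomplete data from propagating downstream.
        raise RuntimeError(f"only {len(successful)}/{total} tables staged")
    # Resilient mode: return 'partial', but alerting must key off this value.
    return "partial"
```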

5. Hardcoded Year in CUR Dashboard

  • File: bin/tools/grafana/cur.json:1264
"where year = '2026' and month = '2'"
  • Issue: Hardcoded year/month filter will require monthly updates
  • Recommendation: Use Grafana's time range variables ($__timeFrom, $__timeTo) which are already used elsewhere in the same dashboard
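
With the Athena datasource's time macros, the filter could follow the dashboard's time picker instead; a sketch (column names assumed from a typical CUR table):

```sql
-- $__timeFilter expands to a range predicate on the given timestamp column,
-- driven by the dashboard time picker rather than a hardcoded year/month.
select line_item_product_code, sum(line_item_unblended_cost) as cost
from cur
where $__timeFilter(line_item_usage_start_date)
group by 1
order by cost desc
```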

6. Grafana README - Missing Migration Instructions

  • File: bin/tools/grafana/README.md
  • Issue: README doesn't include migration steps for users with existing dashboards
  • Recommendation: Add a "Migration" section explaining:
    • How to delete old staging.json, prod_logs.json dashboards
    • How to import the new consolidated dashboards
    • How to update saved dashboard URLs/bookmarks

🔵 Minor Observations

7. Typo in API Logs Query

  • File: bin/tools/grafana/logs.json:399
```
filter not strcontains(@message, "opentelemetry.exporter.otlp.proto.http.trace_exporter"
```
  • Issue: Missing closing parenthesis - should be ...trace_exporter")
  • Impact: CloudWatch Insights query will fail

8. Inconsistent Timeout Patterns

  • File: robosystems/adapters/sec/processors/ingestion/materialization.py:335
  • Observation: Uses INCREMENTAL_COPY_TIMEOUT (10 min) vs CHUNKED_MATERIALIZATION_TIMEOUT for batched operations
  • Question: Are these timeout values empirically validated for production workloads?

9. DuckDB Connection Pool Management

  • File: robosystems/adapters/sec/pipeline/stage.py:103
  • Observation: release_graph_memory closes connections but doesn't verify cleanup succeeded
  • Recommendation: Consider adding a health check or connection count log after release to verify cleanup
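
A minimal sketch of that verification, with a stub pool standing in for the real DuckDB connection pool (the class and its methods are assumptions for illustration):

```python
import logging


class ConnectionPool:
    """Minimal stand-in for the pipeline's pool; the real API will differ."""

    def __init__(self, n: int) -> None:
        self._open = n

    def close_all(self) -> None:
        self._open = 0

    def active_count(self) -> int:
        return self._open


def release_and_verify(pool: ConnectionPool, log: logging.Logger) -> int:
    """Close pooled connections, then check and log the remaining count."""
    pool.close_all()
    remaining = pool.active_count()
    if remaining:
        log.warning("%d connections still open after release", remaining)
    return remaining
```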

✅ Testing Recommendations

Before merging, please verify:

  1. Grafana Dashboards:

    • Import all three dashboards (ops.json, logs.json, cur.json) into a test Grafana instance
    • Verify all template variables resolve correctly
    • Test the environment selector toggles between prod/staging views
    • Fix the AWS account ID placeholder issue
  2. CloudFormation:

    • Deploy to staging environment with the updated templates
    • Verify the new parameters don't break existing stacks
    • Test the new TunnelCommand output in dagster.yaml
  3. SEC Pipeline:

    • Run sec_duckdb_incremental_staged asset in staging
    • Monitor memory usage during boost/release cycles
    • Verify quarter overlap logic correctly handles boundary cases (test with data at the end of Q1, beginning of Q2)

📋 Action Items

Must Fix Before Merge

  1. Fix hardcoded ${aws_account} in logs.json (see issue #1)
  2. Fix syntax error in API logs query (see issue #7)
  3. Clarify what was added to api.yaml (see issue #3)

Should Fix Before Merge

  1. Update CUR dashboard to use dynamic time variables (see issue #5)
  2. Add migration instructions to Grafana README (see issue #6)
  3. Reconsider error handling for memory boost failures (see issue #2)

Consider for Follow-up

  1. Review incremental staging partial failure strategy (see issue #4)
  2. Document/validate timeout values for production (see issue #8)
  3. Add connection cleanup verification (see issue #9)

Verdict

Approval Status: 🟡 Approve with Changes

The consolidation and improvements are valuable, but the Grafana dashboard configuration issues (especially the AWS account placeholder and query syntax error) will break functionality. Please address the "Must Fix" items before merging.

Great work on the dashboard consolidation and SEC pipeline improvements! The memory management enhancements and quarter overlap logic are particularly well-designed.


Branch: refactor/improve-prod-implementation

…arity

- Modified `cur.json` to remove specific year and month filters from the SQL query for broader data retrieval.
- Cleaned up `logs.json` by removing unnecessary `current` fields in multiple log group configurations, enhancing readability.
@jfrench9 jfrench9 merged commit 4814e89 into main Feb 12, 2026
7 checks passed
@jfrench9 jfrench9 deleted the refactor/improve-prod-implementation branch February 13, 2026 03:21