
Refactor Grafana dashboards and improve production infrastructure #343

Merged
jfrench9 merged 2 commits into main from refactor/improve-prod-implementation on Feb 12, 2026
Conversation

@jfrench9
Member

Summary

This refactoring effort streamlines our monitoring and infrastructure setup by consolidating Grafana dashboards, updating CloudFormation configurations, and enhancing the SEC data pipeline implementation.

Key Changes

Monitoring Infrastructure

  • Consolidated Grafana dashboards: Merged environment-specific dashboards into unified configurations
    • Renamed and restructured dashboard files for better organization
    • Removed redundant staging and production-specific log dashboards
    • Updated CUR (Cost and Usage Report) dashboard configuration
  • Updated documentation: Revised setup and usage instructions to reflect new dashboard structure

Infrastructure Updates

  • Enhanced CloudFormation templates: Updated API, Dagster, and graph-ladybug-replicas configurations
  • Improved resource definitions: Added new parameters and refined existing infrastructure components

SEC Pipeline Improvements

  • New stage pipeline module: Added comprehensive staging functionality for SEC data processing
  • Enhanced materialization logic: Improved data materialization processes with better error handling
  • Refactored staging operations: Streamlined staging workflows for better performance and maintainability

Key Accomplishments

  • Reduced dashboard maintenance overhead by eliminating duplicate configurations
  • Simplified monitoring setup with unified dashboard approach
  • Enhanced SEC data pipeline reliability and performance
  • Improved infrastructure configuration management

Breaking Changes

  • Grafana dashboard file names and structure have changed; existing dashboard imports will need to be updated
  • Some CloudFormation parameters may have been modified; review infrastructure deployments

Testing Notes

  • Verify Grafana dashboard functionality after import with new file structure
  • Test SEC pipeline staging and materialization processes
  • Validate CloudFormation stack updates in non-production environments first

Infrastructure Considerations

  • Dashboard configurations should be re-imported using the new consolidated files
  • CloudFormation stacks may require updates - review parameter changes before deployment
  • Monitor SEC pipeline performance after deployment to ensure improvements are working as expected

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/improve-prod-implementation
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

- Changed datasource UID from `${DS_ATHENA}` to `${athena}` in `cur.json` for consistency.
- Added new dashboards: `logs.json` for CloudWatch logs and `ops.json` for platform observability metrics.
- Removed outdated `prod_logs.json` and `staging.json` dashboards.
- Updated `README.md` to reflect new dashboard structure and template variable usage.
- Enhanced CloudFormation templates with additional permissions for CloudWatch metrics and Auto Scaling actions.
@jfrench9
Member Author

@claude please review this PR

@claude

claude Bot commented Feb 12, 2026

Claude finished @jfrench9's task in 1m 47s


PR Review Complete ✓

Todo List

  • Read and analyze Grafana dashboard changes
  • Review CloudFormation template updates
  • Review SEC pipeline code changes
  • Provide comprehensive review feedback

Summary

This is a well-executed refactoring that consolidates monitoring infrastructure and improves the SEC data pipeline. The changes significantly reduce maintenance overhead (+1693/-7331 lines) while enhancing functionality. I've identified a few areas for attention before merging.


🟢 Strengths

1. Grafana Dashboard Consolidation

  • Excellent consolidation from environment-specific dashboards to unified configs
  • Clean removal of staging.json and prod_logs.json eliminates duplication
  • Template variable approach (${env}, ${datasource}) is the right pattern
  • Updated README clearly documents the new structure
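
The unified-dashboard pattern praised here boils down to a Grafana `templating` block; a minimal sketch (variable name `env` taken from the review, values illustrative) looks like:

```json
{
  "templating": {
    "list": [
      {
        "name": "env",
        "type": "custom",
        "query": "prod,staging",
        "current": { "text": "prod", "value": "prod" }
      }
    ]
  }
}
```

Panels then reference paths such as `/robosystems/${env}/api`, so one dashboard serves both environments instead of two near-duplicate files.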

2. CloudFormation Improvements

  • dagster.yaml:177: The added TunnelCommand output is a great UX improvement
  • graph-ladybug-replicas.yaml: The consolidated disk monitoring configuration (single root volume with httpfs caching) aligns well with S3 ATTACH architecture

3. SEC Pipeline Enhancements

  • stage.py: New memory management with boost_graph_memory/release_graph_memory is smart - prevents OOM during large staging operations
  • materialization.py: copy_incremental_to_ladybug quarter overlap logic (lines 252-256) elegantly handles late-indexed filings
  • staging.py: Hash-based batched materialization (lines 776-825) is a solid approach for handling 500M+ row tables
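
To illustrate the quarter-overlap idea (this is a hedged sketch, not the actual code in materialization.py; the function name and the 14-day window are assumptions), an incremental run near a quarter boundary would also pull the previous quarter so late-indexed filings are not missed:

```python
from datetime import date


def quarters_to_copy(target: date, overlap_days: int = 14) -> list[tuple[int, int]]:
    """Return (year, quarter) pairs an incremental copy should cover.

    Besides the quarter containing `target`, include the previous quarter
    when we are within `overlap_days` of the quarter boundary, so filings
    indexed late for the prior quarter still get picked up.
    """
    q = (target.month - 1) // 3 + 1
    quarters = [(target.year, q)]
    quarter_start = date(target.year, 3 * (q - 1) + 1, 1)
    if (target - quarter_start).days < overlap_days:
        prev_q = q - 1 or 4  # wrap Q1 back to the prior year's Q4
        prev_year = target.year if q > 1 else target.year - 1
        quarters.append((prev_year, prev_q))
    return quarters
```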

🟡 Issues to Address

High Priority

1. CloudWatch Dashboard AWS Account Hardcoding

  • File: bin/tools/grafana/logs.json:653,681,709
  • Issue: Uses placeholder ${aws_account} in ARNs but this won't resolve automatically
"value": "arn:aws:logs:us-east-1:${aws_account}:log-group:/robosystems/prod/api:*"
  • Impact: Dashboard will fail to load log groups
  • Fix: Either document that users must replace this manually, or use a Grafana variable query to resolve the account ID dynamically
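
One way to resolve the placeholder without manual edits is a hidden Grafana constant variable; a sketch (the account ID shown is a placeholder, and `"hide": 2` marks the variable as hidden from the dashboard toolbar):

```json
{
  "templating": {
    "list": [
      {
        "name": "aws_account",
        "type": "constant",
        "query": "123456789012",
        "hide": 2
      }
    ]
  }
}
```

Operators would still set the value once at import time, but every ARN in the dashboard then resolves from a single place.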

2. SEC Pipeline Error Handling - Silent Failures

  • File: robosystems/adapters/sec/pipeline/stage.py:68-69,102-107
  • Issue: Memory boost/release failures are logged as warnings and swallowed
```python
except Exception as boost_err:
    context.log.warning(f"Could not boost memory (non-fatal): {boost_err}")
```
  • Impact: Pipeline may proceed with insufficient memory, leading to OOM failures downstream
  • Recommendation: At minimum, log these at ERROR level or add telemetry to track how often this fails. Consider failing the asset if the boost fails on known memory-intensive operations.
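
A hedged sketch of the recommended escalation, with a stub standing in for the pipeline's real helper (the stub and the `memory_intensive` flag are hypothetical, introduced only for illustration):

```python
import logging


def boost_graph_memory() -> None:
    """Stand-in for the pipeline's real memory-boost helper (name assumed)."""
    raise MemoryError("no headroom")


def run_stage(log: logging.Logger, memory_intensive: bool = True) -> None:
    """Escalate boost failures instead of swallowing them as warnings."""
    try:
        boost_graph_memory()
    except Exception as boost_err:
        # Log at ERROR (not WARNING) so the failure is visible in telemetry,
        # and fail fast for stages known to need the extra memory.
        log.error(f"Could not boost memory: {boost_err}")
        if memory_intensive:
            raise RuntimeError("memory boost failed for memory-intensive stage") from boost_err
```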

3. API CloudFormation Missing Context

  • File: cloudformation/api.yaml:8
  • Issue: The diff shows 8 added lines but no removed lines from api.yaml. Without seeing what was added, I can't verify correctness
  • Action Required: Please confirm what the 8 new lines are in api.yaml

Medium Priority

4. Incremental Staging - Partial Success Handling

  • File: robosystems/adapters/sec/processors/ingestion/staging.py:493
  • Issue: Returns status="partial" when some tables fail, but MaterializeResult still shows status="partial" which may not trigger alerts
status = "success" if len(successful_tables) == total_tables else "partial"
  • Question: Should partial failures fail the asset to prevent incomplete data propagation, or is the current approach intentional for resilience?
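
To make that policy choice explicit, the decision could be factored into a small helper; a sketch with assumed names (the function and `fail_on_partial` flag are hypothetical):

```python
def resolve_status(successful: list[str], total: int, *, fail_on_partial: bool) -> str:
    """Decide how a staging run with failed tables should surface."""
    if len(successful) == total:
        return "success"
    if fail_on_partial:
        # Failing the asset stops incomplete data from propagating downstream.
        raise RuntimeError(f"only {len(successful)}/{total} tables staged")
    # Resilient mode: return 'partial', but alerting must key off this value.
    return "partial"
```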

5. Hardcoded Year in CUR Dashboard

  • File: bin/tools/grafana/cur.json:1264
"where year = '2026' and month = '2'"
  • Issue: Hardcoded year/month filter will require monthly updates
  • Recommendation: Use Grafana's time range variables ($__timeFrom, $__timeTo) which are already used elsewhere in the same dashboard
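
With the Athena datasource's time macros, the filter could follow the dashboard's time picker instead; a sketch (column names assumed from a typical CUR table):

```sql
-- $__timeFilter expands to a range predicate on the given timestamp column,
-- driven by the dashboard time picker rather than a hardcoded year/month.
select line_item_product_code, sum(line_item_unblended_cost) as cost
from cur
where $__timeFilter(line_item_usage_start_date)
group by 1
order by cost desc
```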

6. Grafana README - Missing Migration Instructions

  • File: bin/tools/grafana/README.md
  • Issue: README doesn't include migration steps for users with existing dashboards
  • Recommendation: Add a "Migration" section explaining:
    • How to delete old staging.json, prod_logs.json dashboards
    • How to import the new consolidated dashboards
    • How to update saved dashboard URLs/bookmarks

🔵 Minor Observations

7. Typo in API Logs Query

  • File: bin/tools/grafana/logs.json:399
```
filter not strcontains(@message, "opentelemetry.exporter.otlp.proto.http.trace_exporter"
```
  • Issue: Missing closing parenthesis - should be ...trace_exporter")
  • Impact: CloudWatch Insights query will fail

8. Inconsistent Timeout Patterns

  • File: robosystems/adapters/sec/processors/ingestion/materialization.py:335
  • Observation: Uses INCREMENTAL_COPY_TIMEOUT (10 min) vs CHUNKED_MATERIALIZATION_TIMEOUT for batched operations
  • Question: Are these timeout values empirically validated for production workloads?

9. DuckDB Connection Pool Management

  • File: robosystems/adapters/sec/pipeline/stage.py:103
  • Observation: release_graph_memory closes connections but doesn't verify cleanup succeeded
  • Recommendation: Consider adding a health check or connection count log after release to verify cleanup
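
A minimal sketch of that verification, with a stub pool standing in for the real DuckDB connection pool (the class and its methods are assumptions for illustration):

```python
import logging


class ConnectionPool:
    """Minimal stand-in for the pipeline's pool; the real API will differ."""

    def __init__(self, n: int) -> None:
        self._open = n

    def close_all(self) -> None:
        self._open = 0

    def active_count(self) -> int:
        return self._open


def release_and_verify(pool: ConnectionPool, log: logging.Logger) -> int:
    """Close pooled connections, then check and log the remaining count."""
    pool.close_all()
    remaining = pool.active_count()
    if remaining:
        log.warning("%d connections still open after release", remaining)
    return remaining
```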

✅ Testing Recommendations

Before merging, please verify:

  1. Grafana Dashboards:

    • Import all three dashboards (ops.json, logs.json, cur.json) into a test Grafana instance
    • Verify all template variables resolve correctly
    • Test the environment selector toggles between prod/staging views
    • Fix the AWS account ID placeholder issue
  2. CloudFormation:

    • Deploy to staging environment with the updated templates
    • Verify the new parameters don't break existing stacks
    • Test the new TunnelCommand output in dagster.yaml
  3. SEC Pipeline:

    • Run sec_duckdb_incremental_staged asset in staging
    • Monitor memory usage during boost/release cycles
    • Verify quarter overlap logic correctly handles boundary cases (test with data at the end of Q1, beginning of Q2)

📋 Action Items

Must Fix Before Merge

  1. Fix hardcoded ${aws_account} in logs.json (see issue #1)
  2. Fix syntax error in API logs query (see issue #7)
  3. Clarify what was added to api.yaml (see issue #3)

Should Fix Before Merge

  1. Update CUR dashboard to use dynamic time variables (see issue #5)
  2. Add migration instructions to Grafana README (see issue #6)
  3. Reconsider error handling for memory boost failures (see issue #2)

Consider for Follow-up

  1. Review incremental staging partial failure strategy (see issue #4)
  2. Document/validate timeout values for production (see issue #8)
  3. Add connection cleanup verification (see issue #9)

Verdict

Approval Status: 🟡 Approve with Changes

The consolidation and improvements are valuable, but the Grafana dashboard configuration issues (especially the AWS account placeholder and query syntax error) will break functionality. Please address the "Must Fix" items before merging.

Great work on the dashboard consolidation and SEC pipeline improvements! The memory management enhancements and quarter overlap logic are particularly well-designed.


Branch: refactor/improve-prod-implementation

…arity

- Modified `cur.json` to remove specific year and month filters from the SQL query for broader data retrieval.
- Cleaned up `logs.json` by removing unnecessary `current` fields in multiple log group configurations, enhancing readability.
@jfrench9 jfrench9 merged commit 4814e89 into main Feb 12, 2026
7 checks passed
@jfrench9 jfrench9 deleted the refactor/improve-prod-implementation branch February 13, 2026 03:21