Conversation
📝 Walkthrough

Adds Prometheus metrics, a Metrics type and middleware, /metrics and health endpoints; threads metrics into the API, indexer, DA worker, and metadata fetcher; updates tests and AppState; and adds a log-format CLI option with structured logging changes.
Sequence Diagrams

sequenceDiagram
participant Client
participant ReadinessHandler as /health/ready
participant Database
participant HeadTracker
Client->>ReadinessHandler: GET /health/ready
ReadinessHandler->>Database: SELECT 1
alt DB unreachable
Database-->>ReadinessHandler: Error
ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"database unreachable"}
else DB reachable
Database-->>ReadinessHandler: OK
ReadinessHandler->>HeadTracker: latest_indexed_block()
HeadTracker-->>ReadinessHandler: Some(block,timestamp) or None
alt None or too old (>5m)
ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"indexer state unavailable|stale"}
else recent
ReadinessHandler-->>Client: 200 {"status":"ready"}
end
end
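The indexer-staleness branch of the readiness decision above can be distilled into plain Rust. This is a sketch under assumed names (`staleness_reason`, `MAX_HEAD_AGE` are hypothetical); the real handler in `health.rs` also probes the database first and returns its own `HealthResponse` type:

```rust
use std::time::{Duration, SystemTime};

// Hypothetical constant mirroring the ">5m" rule in the diagram.
const MAX_HEAD_AGE: Duration = Duration::from_secs(5 * 60);

/// Returns None when the indexer looks ready, or a "not_ready" reason otherwise.
fn staleness_reason(latest: Option<SystemTime>, now: SystemTime) -> Option<&'static str> {
    match latest {
        None => Some("indexer state unavailable"),
        Some(indexed_at) => match now.duration_since(indexed_at) {
            Ok(age) if age > MAX_HEAD_AGE => Some("indexer stale"),
            // Small age, or indexed_at slightly in the future (clock skew): treat as fresh.
            _ => None,
        },
    }
}

fn main() {
    let now = SystemTime::now();
    assert_eq!(staleness_reason(None, now), Some("indexer state unavailable"));
    assert_eq!(staleness_reason(Some(now - Duration::from_secs(60)), now), None);
    assert_eq!(
        staleness_reason(Some(now - Duration::from_secs(600)), now),
        Some("indexer stale")
    );
    println!("ok");
}
```

Keeping the decision in a pure function like this is also what makes the branch coverage requested in the review comments below easy to unit-test without a live database.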
sequenceDiagram
participant Client
participant Middleware as HTTP Metrics Middleware
participant Router
participant Prometheus as PrometheusHandle
Client->>Middleware: HTTP Request
Middleware->>Middleware: start timer, capture method
Middleware->>Router: forward request
Router-->>Middleware: Response (status)
Middleware->>Middleware: compute latency, get MatchedPath
Middleware->>Prometheus: increment http_requests_total{method,path,status}
Middleware->>Prometheus: observe http_request_duration_seconds{method,path}
Middleware-->>Client: Return Response
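What the middleware records per request can be modeled with a plain in-memory counter. This sketch uses hypothetical names and a HashMap in place of the real Prometheus recorder; the actual code emits via the `metrics` crate instead:

```rust
use std::collections::HashMap;

// Counter keyed by (method, matched path template, status), mirroring the
// http_requests_total{method,path,status} labels in the diagram.
#[derive(Default)]
struct HttpMetrics {
    requests_total: HashMap<(String, String, u16), u64>,
}

impl HttpMetrics {
    fn record(&mut self, method: &str, matched_path: &str, status: u16) {
        // Use the matched route template (e.g. "/blocks/:height"), not the raw
        // request URI, so label cardinality stays bounded.
        *self
            .requests_total
            .entry((method.to_string(), matched_path.to_string(), status))
            .or_insert(0) += 1;
    }
}

fn main() {
    let mut m = HttpMetrics::default();
    m.record("GET", "/health/ready", 200);
    m.record("GET", "/health/ready", 503);
    m.record("GET", "/health/ready", 200);
    let key = ("GET".to_string(), "/health/ready".to_string(), 200);
    assert_eq!(m.requests_total[&key], 2);
    println!("ok");
}
```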
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/indexer/fetcher.rs (1)
132-150: ⚠️ Potential issue | 🟡 Minor: Missing `record_rpc_request("error")` on parse failure.

When JSON parsing fails (line 138), only `metrics.error("rpc", "rpc_parse")` is called, but `metrics.record_rpc_request("error")` is not. This is inconsistent with the HTTP failure path (lines 101-102), which records both. The `atlas_indexer_rpc_requests_total` metric will undercount failures when the HTTP request succeeds but the response cannot be parsed.

🛠️ Proposed fix

    Err(e) => {
        let delay = RPC_RETRY_DELAYS
            .get(attempt)
            .copied()
            .unwrap_or(*RPC_RETRY_DELAYS.last().unwrap_or(&30));
    +   metrics.record_rpc_request("error");
        metrics.error("rpc", "rpc_parse");
        tracing::warn!(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/src/indexer/fetcher.rs` around lines 132 - 150, The parse-failure branch in fetcher.rs calls metrics.error("rpc", "rpc_parse") but omits metrics.record_rpc_request("error"), causing undercounted RPC failures; update the Err(e) arm (around RPC_RETRY_DELAYS / last_error handling) to also call metrics.record_rpc_request("error") (same place where the HTTP failure path records both) before logging and sleeping so parse errors increment atlas_indexer_rpc_requests_total consistently with other failures.
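For illustration, the capped retry-delay lookup that appears in the diff above can be exercised in isolation. The delay values here are hypothetical stand-ins, since the full `RPC_RETRY_DELAYS` table is not shown in the review:

```rust
// Hypothetical delay table; the real values live in fetcher.rs.
const RPC_RETRY_DELAYS: [u64; 5] = [1, 2, 5, 10, 30];

/// Delay in seconds for a given retry attempt, capped at the last entry
/// (falling back to 30 if the table were ever empty), matching the
/// `.get(attempt).copied().unwrap_or(*RPC_RETRY_DELAYS.last().unwrap_or(&30))`
/// pattern in the diff.
fn retry_delay_secs(attempt: usize) -> u64 {
    RPC_RETRY_DELAYS
        .get(attempt)
        .copied()
        .unwrap_or(*RPC_RETRY_DELAYS.last().unwrap_or(&30))
}

fn main() {
    assert_eq!(retry_delay_secs(0), 1);
    assert_eq!(retry_delay_secs(4), 30);
    // Attempts past the end of the table keep the maximum backoff.
    assert_eq!(retry_delay_secs(100), 30);
    println!("ok");
}
```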
🧹 Nitpick comments (2)
backend/crates/atlas-server/src/cli.rs (1)
373-380: Constrain `--atlas.log.format` to accepted values at parse time.

The `format` field accepts any string value, causing invalid inputs to silently fall back to text format via the wildcard pattern in `init_tracing()`. Add a `value_parser` to reject invalid values at CLI parse time instead.

Proposed change

    #[arg(
        long = "atlas.log.format",
        env = "LOG_FORMAT",
        default_value = "text",
    +   value_parser = ["text", "json"],
        value_name = "FORMAT",
        help = "Log output format: text or json"
    )]
    pub format: String,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/src/cli.rs` around lines 373 - 380, The CLI currently defines pub format: String in the CLI struct and accepts any string, letting init_tracing() silently handle invalid values; update the #[arg(...)] for the format field in backend/crates/atlas-server/src/cli.rs to add a value_parser that restricts accepted values to "text" and "json" (e.g., use clap's PossibleValuesParser or value_parser = ["text","json"]) so invalid inputs are rejected at parse time and users get an immediate error instead of falling back in init_tracing().

backend/crates/atlas-server/src/indexer/indexer.rs (1)
415-420: Batch duration includes wait time, not just processing time.

`elapsed` is measured from `last_log_time`, which is reset at the end of each batch. This means `record_batch_duration` captures the full cycle time, including any sleep at chain head, not just active processing. If the intent is to measure processing time only, the timer should start at the beginning of each batch iteration.

However, measuring full cycle time may be intentional for understanding overall throughput. If this is the desired behavior, consider clarifying the metric description in `metrics.rs` (lines 57-59) to indicate it includes idle time at chain head.
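The difference between the two measurements can be sketched with std timers; all names here are illustrative stand-ins for the indexer loop, not the project's actual code:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Measure only the duration of the closure: what a processing-only
// histogram would observe, as opposed to elapsed-since-last_log_time.
fn timed<F: FnOnce()>(work: F) -> Duration {
    let start = Instant::now();
    work();
    start.elapsed()
}

fn main() {
    let cycle_start = Instant::now();

    // Active work for this batch (stand-in for fetch + decode + DB write).
    let processing = timed(|| sleep(Duration::from_millis(20)));

    // Idle wait at chain head, which the current elapsed-based metric includes.
    sleep(Duration::from_millis(20));
    let cycle = cycle_start.elapsed();

    // Processing time excludes the idle wait, so it is strictly smaller here.
    assert!(processing < cycle);
    println!("processing={processing:?} cycle={cycle:?}");
}
```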
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/crates/atlas-server/src/api/error.rs`:
- Around line 54-70: Add same-file unit tests in a #[cfg(test)] mod tests block
that exercise the error-type branching for AtlasError by constructing
AtlasError::Database(...), AtlasError::Internal(...), AtlasError::Config(...),
AtlasError::Rpc(...), and AtlasError::MetadataFetch(...) and asserting the
mapped error_type matches "database", "internal", "config", "rpc_request", and
"metadata_fetch" respectively; if the match logic lives inside a private method
(the block with let error_type = match &self.0 { ... }), either call the public
API that triggers that method or add a small pub(crate) helper to expose the
classification so tests can call it directly, then run cargo test --workspace to
verify.
In `@backend/crates/atlas-server/src/api/handlers/health.rs`:
- Around line 14-60: Add a #[cfg(test)] mod tests block in this same file and
implement unit tests that call liveness() and readiness(...) to cover all
branches: a liveness test asserting Json(HealthResponse{status:"ok",..}), a
readiness test simulating DB failure by making sqlx query execution return Err
(mock or use a test pool) to assert SERVICE_UNAVAILABLE and reason contains
"database unreachable", a readiness test where state.head_tracker.latest()
returns a stale block (indexed_at > 5 minutes ago) asserting SERVICE_UNAVAILABLE
and "indexer stale" in the reason, and a readiness test with healthy DB and
fresh head asserting OK/"ready"; reference the functions liveness, readiness,
HealthResponse, AppState, state.pool and head_tracker.latest() to locate code
and use a minimal test AppState/Arc<AppState> setup or mocks to control DB and
head_tracker behavior.
- Around line 25-31: The readiness handler currently returns the raw DB error
string in the JSON body (from the sqlx::query(...).execute(&state.pool).await
error branch), which leaks internal details; instead, log the full error on the
server (e.g., with tracing::error! or the existing logger on state) and change
the HealthResponse reason to a generic message like "database unreachable" (or
None) while keeping StatusCode::SERVICE_UNAVAILABLE and status "not_ready";
locate the error branch around the sqlx::query execute call and replace the
Json(HealthResponse { reason: Some(format!("database unreachable: {e}")) }) with
a generic reason and a server-side error log that includes e.
- Around line 35-51: The current readiness check treats None from
state.head_tracker.latest().await as healthy; instead, when
head_tracker.latest() returns None you must fall back to the persisted indexer
state the same way status.rs does: call or obtain the equivalent indexer_state
(e.g., state.indexer_state or its accessor) to get the last indexed
block/timestamp, then compute age = now - indexed_at and return
SERVICE_UNAVAILABLE with the same "indexer stale" HealthResponse if age > 5
minutes; keep the existing path when head_tracker.latest() returns Some but
reuse the indexer_state fallback when head is empty so the health handler
mirrors status.rs behavior (refer to head_tracker.latest(), state.indexer_state,
and HealthResponse).
In `@backend/crates/atlas-server/src/api/handlers/metrics.rs`:
- Around line 6-8: Add a same-file unit test in a #[cfg(test)] mod tests block
that constructs a minimal Arc<AppState> with a Prometheus handle (matching how
AppState is normally built), calls the async metrics(State(state):
State<Arc<AppState>>) handler (awaiting it) and asserts the returned String
contains expected Prometheus output (e.g., non-empty or contains a known metric
name). Put the test in the same file (metrics.rs), reference AppState,
prometheus_handle.render() and the metrics handler to ensure the state wiring is
exercised; use tokio::test or an async test runner to await the handler.
In `@backend/crates/atlas-server/src/api/mod.rs`:
- Around line 49-52: The /metrics route is currently created as metrics_routes
and merged after the main TimeoutLayer so it bypasses the 10s timeout; update
the routing so that metrics_routes is wrapped with the same TimeoutLayer (10s,
StatusCode::REQUEST_TIMEOUT) or move its route back into the main Router that
the TimeoutLayer is applied to, ensuring handlers::metrics::metrics is not the
only non-SSE route running unbounded; make the same change for the other
instance referenced around lines 196-205 so all non-SSE routes are uniformly
wrapped by TimeoutLayer.
---
Outside diff comments:
In `@backend/crates/atlas-server/src/indexer/fetcher.rs`:
- Around line 132-150: The parse-failure branch in fetcher.rs calls
metrics.error("rpc", "rpc_parse") but omits metrics.record_rpc_request("error"),
causing undercounted RPC failures; update the Err(e) arm (around
RPC_RETRY_DELAYS / last_error handling) to also call
metrics.record_rpc_request("error") (same place where the HTTP failure path
records both) before logging and sleeping so parse errors increment
atlas_indexer_rpc_requests_total consistently with other failures.
---
Nitpick comments:
In `@backend/crates/atlas-server/src/cli.rs`:
- Around line 373-380: The CLI currently defines pub format: String in the CLI
struct and accepts any string, letting init_tracing() silently handle invalid
values; update the #[arg(...)] for the format field in
backend/crates/atlas-server/src/cli.rs to add a value_parser that restricts
accepted values to "text" and "json" (e.g., use clap's PossibleValuesParser or
value_parser = ["text","json"]) so invalid inputs are rejected at parse time and
users get an immediate error instead of falling back in init_tracing().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7e9015e8-37ad-403a-9af0-b066ffd16ce6
📒 Files selected for processing (21)
- backend/Cargo.toml
- backend/crates/atlas-server/Cargo.toml
- backend/crates/atlas-server/src/api/error.rs
- backend/crates/atlas-server/src/api/handlers/faucet.rs
- backend/crates/atlas-server/src/api/handlers/health.rs
- backend/crates/atlas-server/src/api/handlers/metrics.rs
- backend/crates/atlas-server/src/api/handlers/mod.rs
- backend/crates/atlas-server/src/api/handlers/sse.rs
- backend/crates/atlas-server/src/api/handlers/status.rs
- backend/crates/atlas-server/src/api/mod.rs
- backend/crates/atlas-server/src/cli.rs
- backend/crates/atlas-server/src/config.rs
- backend/crates/atlas-server/src/indexer/da_worker.rs
- backend/crates/atlas-server/src/indexer/evnode.rs
- backend/crates/atlas-server/src/indexer/fetcher.rs
- backend/crates/atlas-server/src/indexer/indexer.rs
- backend/crates/atlas-server/src/indexer/metadata.rs
- backend/crates/atlas-server/src/lib.rs
- backend/crates/atlas-server/src/main.rs
- backend/crates/atlas-server/src/metrics.rs
- backend/crates/atlas-server/tests/integration/common.rs
🧹 Nitpick comments (1)
backend/crates/atlas-server/tests/integration/common.rs (1)
117-127: Consider logging test skips more visibly in CI.

Tests are silently skipped (only `eprintln!`) when the database environment cannot be initialized. In CI pipelines, this could mask configuration issues. Consider using the `#[ignore]` attribute with a custom test harness or ensuring CI always has `ATLAS_TEST_DATABASE_URL` set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 - 127, The run function currently skips tests silently using eprintln! when ENV is Err, which can hide CI configuration issues; update run (referencing run and ENV) to (1) emit a visible log (use the tracing or log crate instead of eprintln!) so CI logs show the skip, and (2) if running in CI (detect via std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early (panic! with a clear message referencing ATLAS_TEST_DATABASE_URL) instead of skipping so CI surfaces the missing configuration; otherwise keep the local skip behavior for developer runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The run function currently skips tests silently using
eprintln! when ENV is Err, which can hide CI configuration issues; update run
(referencing run and ENV) to (1) emit a visible log (use the tracing or log
crate instead of eprintln!) so CI logs show the skip, and (2) if running in CI
(detect via std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early
(panic! with a clear message referencing ATLAS_TEST_DATABASE_URL) instead of
skipping so CI surfaces the missing configuration; otherwise keep the local skip
behavior for developer runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0fa92730-4a14-4864-b25f-20bbc3a0eaf9
📒 Files selected for processing (15)
- backend/Cargo.toml
- backend/crates/atlas-server/src/api/error.rs
- backend/crates/atlas-server/src/api/handlers/health.rs
- backend/crates/atlas-server/src/api/handlers/metrics.rs
- backend/crates/atlas-server/src/api/handlers/status.rs
- backend/crates/atlas-server/src/api/mod.rs
- backend/crates/atlas-server/tests/integration/addresses.rs
- backend/crates/atlas-server/tests/integration/blocks.rs
- backend/crates/atlas-server/tests/integration/common.rs
- backend/crates/atlas-server/tests/integration/nfts.rs
- backend/crates/atlas-server/tests/integration/schema.rs
- backend/crates/atlas-server/tests/integration/search.rs
- backend/crates/atlas-server/tests/integration/status.rs
- backend/crates/atlas-server/tests/integration/tokens.rs
- backend/crates/atlas-server/tests/integration/transactions.rs
✅ Files skipped from review due to trivial changes (1)
- backend/crates/atlas-server/tests/integration/addresses.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- backend/Cargo.toml
- backend/crates/atlas-server/src/api/handlers/metrics.rs
- backend/crates/atlas-server/src/api/error.rs
- backend/crates/atlas-server/src/api/handlers/status.rs
- backend/crates/atlas-server/src/api/handlers/health.rs
🧹 Nitpick comments (2)
backend/crates/atlas-server/tests/integration/common.rs (2)
117-127: Silent test skipping may hide CI failures.

When `ENV` initialization fails, the test prints to stderr and returns successfully. In test frameworks, this counts as a passing test. If Docker is unavailable and `ATLAS_TEST_DATABASE_URL` isn't set in CI, all integration tests will appear to pass despite not running.

Consider making skips explicit:

Option 1: Panic to fail explicitly in CI

    pub fn run<F: std::future::Future<Output = ()>>(f: F) {
        if let Err(error) = ENV.as_ref() {
    -       eprintln!("skipping integration test: {error}");
    -       return;
    +       if env::var("CI").is_ok() {
    +           panic!("integration test environment unavailable in CI: {error}");
    +       } else {
    +           eprintln!("skipping integration test: {error}");
    +           return;
    +       }
        }

Option 2: Use #[ignore] on integration tests and run explicitly in CI

Mark integration tests with `#[ignore]` and run them in CI with `cargo test -- --ignored`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 - 127, The current run function quietly returns when ENV.as_ref() is Err, causing tests to be counted as passing; change this to fail loudly by replacing the silent eprintln!/return with a panic! (or assert!) that includes the ENV initialization error and a hint about setting ATLAS_TEST_DATABASE_URL or enabling Docker; update the logic inside run (the ENV.as_ref() branch) to call panic!("integration test skipped due to ENV init error: {error}. Set ATLAS_TEST_DATABASE_URL or enable Docker to run integration tests.") so failing CI will surface missing integration setup.
91-93: Prometheus recorder is built but not installed globally.The recorder is created with
build_recorder()and its handle is extracted, but the recorder itself is dropped without being installed. Per theMetrics::new()docstring insrc/metrics.rs, it "Must be called afterinstall_prometheus_recorder()".For tests that don't verify metrics output, this works fine (types compile, code runs). However, if any test tries to verify metrics via
prometheus_handle.render(), it won't see any registered metrics sinceMetrics::new()on line 110 registers with the default no-op recorder.If metrics verification is not needed in tests, consider adding a brief comment explaining this is intentional. Otherwise, install the recorder:
Suggested fix if metrics verification is needed
- let prometheus_handle = metrics_exporter_prometheus::PrometheusBuilder::new() - .build_recorder() - .handle(); + let recorder = metrics_exporter_prometheus::PrometheusBuilder::new() + .build_recorder(); + let prometheus_handle = recorder.handle(); + // Note: install() can only be called once per process; tests may conflict + let _ = recorder.install();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 91 - 93, The Prometheus recorder returned by PrometheusBuilder::build_recorder() is being dropped (only the handle is kept) so Metrics::new() registers against the default no-op recorder; capture both the recorder and handle (let (recorder, prometheus_handle) = metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then call install_prometheus_recorder(recorder) (the function referenced in src/metrics.rs) before calling Metrics::new() so tests that call prometheus_handle.render() will see registered metrics; alternatively, if you intentionally don't verify metrics in these tests, add a brief comment next to the current code explaining that dropping the recorder is intentional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The current run function quietly returns when
ENV.as_ref() is Err, causing tests to be counted as passing; change this to fail
loudly by replacing the silent eprintln!/return with a panic! (or assert!) that
includes the ENV initialization error and a hint about setting
ATLAS_TEST_DATABASE_URL or enabling Docker; update the logic inside run (the
ENV.as_ref() branch) to call panic!("integration test skipped due to ENV init
error: {error}. Set ATLAS_TEST_DATABASE_URL or enable Docker to run integration
tests.") so failing CI will surface missing integration setup.
- Around line 91-93: The Prometheus recorder returned by
PrometheusBuilder::build_recorder() is being dropped (only the handle is kept)
so Metrics::new() registers against the default no-op recorder; capture both the
recorder and handle (let (recorder, prometheus_handle) =
metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then
call install_prometheus_recorder(recorder) (the function referenced in
src/metrics.rs) before calling Metrics::new() so tests that call
prometheus_handle.render() will see registered metrics; alternatively, if you
intentionally don't verify metrics in these tests, add a brief comment next to
the current code explaining that dropping the recorder is intentional.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 597a4861-10cf-4f61-9db2-907b37575796
📒 Files selected for processing (1)
backend/crates/atlas-server/tests/integration/common.rs
|
This is a nice set of metrics! I would suggest adding a few more, so we can get a little more insight into the performance/swiftness of indexation.

atlas_indexer_missing_blocks (Gauge)
Tracks the current count of blocks known to be missing from the index and pending backfill. Must be set to 0 when fully caught up.

atlas_indexer_block_processing_duration_seconds (Histogram)
CodeRabbit flagged that atlas_indexer_batch_duration_seconds includes idle/sleep time between batches, not just processing time, so its p95 would be inflated and not really usable for a "processing time" alert. This new histogram should be timed from batch start to write completion, before any sleep/wait. Suggested buckets: .001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5.

atlas_db_pool_checkout_duration_seconds (Histogram, label: pool)
Time from requesting a connection from the pool to receiving one. Must be instrumented at every pool.acquire() / pool.get() call site.

atlas_indexer_head_block_timestamp_seconds (Gauge)
Unix timestamp (seconds) of the most recently indexed block (specifically the chain-level block timestamp, not the time it was processed). Updated every time atlas_indexer_head_block is updated.
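The proposed missing-blocks gauge could be computed along these lines; this is a sketch with hypothetical names (`missing_blocks`, an in-memory `HashSet` of indexed heights), not the project's actual storage model:

```rust
use std::collections::HashSet;

/// Count of heights in [start, head] not yet present in the index; the value
/// an `atlas_indexer_missing_blocks` gauge would be set to on each update.
fn missing_blocks(indexed: &HashSet<u64>, start: u64, head: u64) -> u64 {
    (start..=head).filter(|h| !indexed.contains(h)).count() as u64
}

fn main() {
    // Blocks 1,2,4,5 are indexed; 3 is pending backfill.
    let indexed: HashSet<u64> = [1, 2, 4, 5].into_iter().collect();
    assert_eq!(missing_blocks(&indexed, 1, 5), 1);

    // Fully caught up: the gauge must read 0.
    let full: HashSet<u64> = (1..=5).collect();
    assert_eq!(missing_blocks(&full, 1, 5), 0);
    println!("ok");
}
```

In practice the set of indexed heights would come from the database rather than memory, but the invariant is the same: the gauge reads 0 exactly when the range up to head has no gaps.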
Summary

atlas-server, including a shared metrics registry and a `/metrics` endpoint; `/health/live` and `/health/ready`

Summary by CodeRabbit
New Features
Refactor
Tests