feat: add atlas-server observability metrics#47

Open
pthmas wants to merge 3 commits into main from pthmas/observability

Conversation


@pthmas pthmas commented Mar 31, 2026

Summary

  • add Prometheus-backed observability to atlas-server, including a shared metrics registry and a /metrics endpoint
  • expose richer health probes with /health/live and /health/ready
  • instrument HTTP requests, API errors, indexer/DA/metadata work, SSE connections, and DB pool gauges
  • add configurable text/json log formatting and thread observability state through server startup and test helpers

Summary by CodeRabbit

  • New Features

    • Added health endpoints (/health/live, /health/ready), a Prometheus metrics endpoint (/metrics), and a log format option (text/json).
    • Exposed runtime metrics (HTTP, indexer, DA worker, metadata, SSE connections, DB pool) and an error counter.
  • Refactor

    • Instrumented services and background workers with structured logging and metrics; added middleware to record HTTP metrics.
  • Tests

    • Updated unit/integration tests and test harness to initialize and verify metrics and adjust pooling semantics.


coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

Adds Prometheus metrics (a Metrics type, middleware, and /metrics and health endpoints), threads metrics through the API, indexer, DA worker, and metadata fetcher, updates tests and AppState, and adds a log-format CLI option with structured-logging changes.

Changes

Cohort / File(s) Summary
Workspace deps
backend/Cargo.toml, backend/crates/atlas-server/Cargo.toml
Added metrics = "0.24" and metrics-exporter-prometheus = "0.16"; converted testcontainers to table with features = ["blocking"].
Metrics core
backend/crates/atlas-server/src/metrics.rs, backend/crates/atlas-server/src/lib.rs
New metrics module: installs Prometheus recorder, Metrics type + helpers, SseConnectionGuard, and http_metrics_middleware; exported via pub mod metrics.
API surface & router
backend/crates/atlas-server/src/api/mod.rs, .../handlers/mod.rs
Added metrics and prometheus_handle to AppState, registered routes /metrics, /health/live, /health/ready, and inserted HTTP metrics middleware into router layering.
Health & metrics handlers
backend/crates/atlas-server/src/api/handlers/health.rs, .../handlers/metrics.rs
New liveness/readiness handlers (DB check + indexer staleness) and /metrics handler returning Prometheus text; unit tests added.
SSE changes
backend/crates/atlas-server/src/api/handlers/sse.rs
make_event_stream now accepts metrics: Option<Metrics> and uses SseConnectionGuard; block_events passes Some(state.metrics); tests updated to pass None where appropriate.
Api error instrumentation
backend/crates/atlas-server/src/api/error.rs
Added error_type(&AtlasError) helper and increments atlas_errors_total{component="api",error_type=...} in IntoResponse. Unit test covers mapping.
Indexer/DA/metadata instrumentation
backend/crates/atlas-server/src/indexer/...
Threaded Metrics into Indexer, DaWorker, MetadataFetcher, and fetcher functions; added many metric recordings (blocks indexed, RPC/db durations, failures, metadata outcomes) and switched logs to structured fields.
Main/init & tracing
backend/crates/atlas-server/src/main.rs
Installs Prometheus recorder at startup, creates Metrics, wires metrics into AppState and workers, spawns DB-pool sampler, and extends init_tracing to accept a format (text or json).
CLI/config
backend/crates/atlas-server/src/cli.rs, .../config.rs
Added --atlas.log.format / LOG_FORMAT to LogArgs with default "text"; tests updated to include format.
Tests & integration harness
backend/crates/atlas-server/tests/integration/common.rs, many integration tests, .../handlers/faucet.rs, .../handlers/status.rs
Integration harness refactored to lazy init ENV/container, pool() returns PgPool, test AppState now includes Metrics and prometheus_handle; many tests changed to pass &PgPool.
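
Among the changes above, the API error instrumentation maps AtlasError variants to metric label values. A minimal, self-contained sketch of that classification (variant and label names are taken from the review comments below; the real enum in api/error.rs may carry different payload types):

```rust
// Sketch of the error_type classifier this PR adds. Variant names and label
// strings follow the review discussion; they are illustrative, not verbatim.
enum AtlasError {
    Database(String),
    Internal(String),
    Config(String),
    Rpc(String),
    MetadataFetch(String),
}

// Maps an error variant to the error_type label used in
// atlas_errors_total{component="api",error_type=...}.
fn error_type(e: &AtlasError) -> &'static str {
    match e {
        AtlasError::Database(_) => "database",
        AtlasError::Internal(_) => "internal",
        AtlasError::Config(_) => "config",
        AtlasError::Rpc(_) => "rpc_request",
        AtlasError::MetadataFetch(_) => "metadata_fetch",
    }
}

fn main() {
    assert_eq!(error_type(&AtlasError::Database("boom".into())), "database");
    assert_eq!(error_type(&AtlasError::Rpc("timeout".into())), "rpc_request");
}
```

Keeping the label set small and static like this avoids unbounded metric cardinality from raw error strings.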

Sequence Diagrams

sequenceDiagram
  participant Client
  participant ReadinessHandler as /health/ready
  participant Database
  participant HeadTracker

  Client->>ReadinessHandler: GET /health/ready
  ReadinessHandler->>Database: SELECT 1
  alt DB unreachable
    Database-->>ReadinessHandler: Error
    ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"database unreachable"}
  else DB reachable
    Database-->>ReadinessHandler: OK
    ReadinessHandler->>HeadTracker: latest_indexed_block()
    HeadTracker-->>ReadinessHandler: Some(block,timestamp) or None
    alt None or too old (>5m)
      ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"indexer state unavailable|stale"}
    else recent
      ReadinessHandler-->>Client: 200 {"status":"ready"}
    end
  end
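
The staleness branch of the readiness flow above can be sketched with plain std types (the real handler also performs the SELECT 1 database check first; function and constant names here are illustrative):

```rust
use std::time::Duration;

// The 5-minute freshness window from the readiness diagram.
const MAX_INDEXER_LAG: Duration = Duration::from_secs(5 * 60);

fn indexer_is_fresh(indexed_at_unix: u64, now_unix: u64) -> bool {
    now_unix.saturating_sub(indexed_at_unix) <= MAX_INDEXER_LAG.as_secs()
}

// Returns (status code, reason) mirroring the diagram's branches.
fn readiness_status(indexed_at: Option<u64>, now: u64) -> (u16, &'static str) {
    match indexed_at {
        None => (503, "indexer state unavailable"),
        Some(ts) if !indexer_is_fresh(ts, now) => (503, "indexer stale"),
        Some(_) => (200, "ready"),
    }
}

fn main() {
    let now = 1_000_000;
    assert_eq!(readiness_status(Some(now - 60), now), (200, "ready"));
    assert_eq!(readiness_status(Some(now - 900), now), (503, "indexer stale"));
    assert_eq!(readiness_status(None, now), (503, "indexer state unavailable"));
}
```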
sequenceDiagram
  participant Client
  participant Middleware as HTTP Metrics Middleware
  participant Router
  participant Prometheus as PrometheusHandle

  Client->>Middleware: HTTP Request
  Middleware->>Middleware: start timer, capture method
  Middleware->>Router: forward request
  Router-->>Middleware: Response (status)
  Middleware->>Middleware: compute latency, get MatchedPath
  Middleware->>Prometheus: increment http_requests_total{method,path,status}
  Middleware->>Prometheus: observe http_request_duration_seconds{method,path}
  Middleware-->>Client: Return Response
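
The counter/histogram pair the middleware emits can be mimicked with a std-only toy recorder (the real middleware uses axum's MatchedPath and the metrics macros; everything below is illustrative):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Toy stand-in for the Prometheus handle: one counter per labeled series,
// plus the raw latency observations for the duration histogram.
#[derive(Default)]
struct Recorder {
    counters: HashMap<String, u64>,
    durations: Vec<f64>,
}

impl Recorder {
    fn record(&mut self, method: &str, path: &str, status: u16, latency: f64) {
        let key = format!(
            "http_requests_total{{method=\"{method}\",path=\"{path}\",status=\"{status}\"}}"
        );
        *self.counters.entry(key).or_insert(0) += 1;
        self.durations.push(latency);
    }
}

fn main() {
    let mut r = Recorder::default();
    let start = Instant::now();
    // ... request would be forwarded to the router here ...
    r.record("GET", "/health/ready", 200, start.elapsed().as_secs_f64());
    assert_eq!(r.counters.len(), 1);
    assert_eq!(r.durations.len(), 1);
}
```

Note the duration series is labeled by method and path only, not status, which keeps histogram cardinality lower than the counter's.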

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • tac0turtle

Poem

🐇 I hopped through code with metrics bright,

Prometheus lanterns in the night,
Health beats steady, counters climb,
SSE guards count every time,
Dashboards bloom — a carrot shrine! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 72.41%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'feat: add atlas-server observability metrics' accurately and concisely summarizes the main change: adding observability and Prometheus metrics to atlas-server.


@pthmas pthmas marked this pull request as ready for review March 31, 2026 16:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/indexer/fetcher.rs (1)

132-150: ⚠️ Potential issue | 🟡 Minor

Missing record_rpc_request("error") on parse failure.

When JSON parsing fails (line 138), only metrics.error("rpc", "rpc_parse") is called, but metrics.record_rpc_request("error") is not. This is inconsistent with the HTTP failure path (lines 101-102) which records both. The atlas_indexer_rpc_requests_total metric will undercount failures when the HTTP request succeeds but the response cannot be parsed.

🛠️ Proposed fix
             Err(e) => {
                 let delay = RPC_RETRY_DELAYS
                     .get(attempt)
                     .copied()
                     .unwrap_or(*RPC_RETRY_DELAYS.last().unwrap_or(&30));

+                metrics.record_rpc_request("error");
                 metrics.error("rpc", "rpc_parse");
                 tracing::warn!(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/fetcher.rs` around lines 132 - 150,
The parse-failure branch in fetcher.rs calls metrics.error("rpc", "rpc_parse")
but omits metrics.record_rpc_request("error"), causing undercounted RPC
failures; update the Err(e) arm (around RPC_RETRY_DELAYS / last_error handling)
to also call metrics.record_rpc_request("error") (same place where the HTTP
failure path records both) before logging and sleeping so parse errors increment
atlas_indexer_rpc_requests_total consistently with other failures.
🧹 Nitpick comments (2)
backend/crates/atlas-server/src/cli.rs (1)

373-380: Constrain --atlas.log.format to accepted values at parse time.

The format field accepts any string value, causing invalid inputs to silently fall back to text format via the wildcard pattern in init_tracing(). Add value_parser to reject invalid values at CLI parse time instead.

Proposed change
     #[arg(
         long = "atlas.log.format",
         env = "LOG_FORMAT",
         default_value = "text",
+        value_parser = ["text", "json"],
         value_name = "FORMAT",
         help = "Log output format: text or json"
     )]
     pub format: String,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/cli.rs` around lines 373 - 380, The CLI
currently defines pub format: String in the CLI struct and accepts any string,
letting init_tracing() silently handle invalid values; update the #[arg(...)]
for the format field in backend/crates/atlas-server/src/cli.rs to add a
value_parser that restricts accepted values to "text" and "json" (e.g., use
clap's PossibleValuesParser or value_parser = ["text","json"]) so invalid inputs
are rejected at parse time and users get an immediate error instead of falling
back in init_tracing().
backend/crates/atlas-server/src/indexer/indexer.rs (1)

415-420: Batch duration includes wait time, not just processing time.

elapsed is measured from last_log_time, which is reset at the end of each batch. This means record_batch_duration captures the full cycle time including any sleep at chain head, not just active processing. If the intent is to measure processing time only, the timer should start at the beginning of each batch iteration.

However, measuring full cycle time may be intentional for understanding overall throughput. If this is the desired behavior, consider clarifying the metric description in metrics.rs (line 57-59) to indicate it includes idle time at chain head.
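
The distinction can be made concrete with a std-only sketch: a timer carried over from the end of the previous batch measures the full cycle (work plus idle sleep), while a timer started at the top of the batch measures processing only. Names and sleep durations below are illustrative:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Times an arbitrary closure, standing in for one phase of the batch loop.
fn measure(f: impl FnOnce()) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

fn main() {
    let work = || sleep(Duration::from_millis(10)); // simulate indexing a batch
    let idle = || sleep(Duration::from_millis(25)); // simulate waiting at chain head

    let processing = measure(work);            // processing-only timer
    let cycle = processing + measure(idle);    // cycle timer spans work + idle

    // The cycle duration always includes the idle wait, inflating percentiles.
    assert!(cycle >= processing + Duration::from_millis(25));
}
```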

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/crates/atlas-server/src/api/error.rs`:
- Around line 54-70: Add same-file unit tests in a #[cfg(test)] mod tests block
that exercise the error-type branching for AtlasError by constructing
AtlasError::Database(...), AtlasError::Internal(...), AtlasError::Config(...),
AtlasError::Rpc(...), and AtlasError::MetadataFetch(...) and asserting the
mapped error_type matches "database", "internal", "config", "rpc_request", and
"metadata_fetch" respectively; if the match logic lives inside a private method
(the block with let error_type = match &self.0 { ... }), either call the public
API that triggers that method or add a small pub(crate) helper to expose the
classification so tests can call it directly, then run cargo test --workspace to
verify.

In `@backend/crates/atlas-server/src/api/handlers/health.rs`:
- Around line 14-60: Add a #[cfg(test)] mod tests block in this same file and
implement unit tests that call liveness() and readiness(...) to cover all
branches: a liveness test asserting Json(HealthResponse{status:"ok",..}), a
readiness test simulating DB failure by making sqlx query execution return Err
(mock or use a test pool) to assert SERVICE_UNAVAILABLE and reason contains
"database unreachable", a readiness test where state.head_tracker.latest()
returns a stale block (indexed_at > 5 minutes ago) asserting SERVICE_UNAVAILABLE
and "indexer stale" in the reason, and a readiness test with healthy DB and
fresh head asserting OK/"ready"; reference the functions liveness, readiness,
HealthResponse, AppState, state.pool and head_tracker.latest() to locate code
and use a minimal test AppState/Arc<AppState> setup or mocks to control DB and
head_tracker behavior.
- Around line 25-31: The readiness handler currently returns the raw DB error
string in the JSON body (from the sqlx::query(...).execute(&state.pool).await
error branch), which leaks internal details; instead, log the full error on the
server (e.g., with tracing::error! or the existing logger on state) and change
the HealthResponse reason to a generic message like "database unreachable" (or
None) while keeping StatusCode::SERVICE_UNAVAILABLE and status "not_ready";
locate the error branch around the sqlx::query execute call and replace the
Json(HealthResponse { reason: Some(format!("database unreachable: {e}")) }) with
a generic reason and a server-side error log that includes e.
- Around line 35-51: The current readiness check treats None from
state.head_tracker.latest().await as healthy; instead, when
head_tracker.latest() returns None you must fall back to the persisted indexer
state the same way status.rs does: call or obtain the equivalent indexer_state
(e.g., state.indexer_state or its accessor) to get the last indexed
block/timestamp, then compute age = now - indexed_at and return
SERVICE_UNAVAILABLE with the same "indexer stale" HealthResponse if age > 5
minutes; keep the existing path when head_tracker.latest() returns Some but
reuse the indexer_state fallback when head is empty so the health handler
mirrors status.rs behavior (refer to head_tracker.latest(), state.indexer_state,
and HealthResponse).

In `@backend/crates/atlas-server/src/api/handlers/metrics.rs`:
- Around line 6-8: Add a same-file unit test in a #[cfg(test)] mod tests block
that constructs a minimal Arc<AppState> with a Prometheus handle (matching how
AppState is normally built), calls the async metrics(State(state):
State<Arc<AppState>>) handler (awaiting it) and asserts the returned String
contains expected Prometheus output (e.g., non-empty or contains a known metric
name). Put the test in the same file (metrics.rs), reference AppState,
prometheus_handle.render() and the metrics handler to ensure the state wiring is
exercised; use tokio::test or an async test runner to await the handler.

In `@backend/crates/atlas-server/src/api/mod.rs`:
- Around line 49-52: The /metrics route is currently created as metrics_routes
and merged after the main TimeoutLayer so it bypasses the 10s timeout; update
the routing so that metrics_routes is wrapped with the same TimeoutLayer (10s,
StatusCode::REQUEST_TIMEOUT) or move its route back into the main Router that
the TimeoutLayer is applied to, ensuring handlers::metrics::metrics is not the
only non-SSE route running unbounded; make the same change for the other
instance referenced around lines 196-205 so all non-SSE routes are uniformly
wrapped by TimeoutLayer.

---

Outside diff comments:
In `@backend/crates/atlas-server/src/indexer/fetcher.rs`:
- Around line 132-150: The parse-failure branch in fetcher.rs calls
metrics.error("rpc", "rpc_parse") but omits metrics.record_rpc_request("error"),
causing undercounted RPC failures; update the Err(e) arm (around
RPC_RETRY_DELAYS / last_error handling) to also call
metrics.record_rpc_request("error") (same place where the HTTP failure path
records both) before logging and sleeping so parse errors increment
atlas_indexer_rpc_requests_total consistently with other failures.

---

Nitpick comments:
In `@backend/crates/atlas-server/src/cli.rs`:
- Around line 373-380: The CLI currently defines pub format: String in the CLI
struct and accepts any string, letting init_tracing() silently handle invalid
values; update the #[arg(...)] for the format field in
backend/crates/atlas-server/src/cli.rs to add a value_parser that restricts
accepted values to "text" and "json" (e.g., use clap's PossibleValuesParser or
value_parser = ["text","json"]) so invalid inputs are rejected at parse time and
users get an immediate error instead of falling back in init_tracing().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e9015e8-37ad-403a-9af0-b066ffd16ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 1f86eb2 and df1ebb2.

📒 Files selected for processing (21)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/faucet.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/handlers/mod.rs
  • backend/crates/atlas-server/src/api/handlers/sse.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/mod.rs
  • backend/crates/atlas-server/src/cli.rs
  • backend/crates/atlas-server/src/config.rs
  • backend/crates/atlas-server/src/indexer/da_worker.rs
  • backend/crates/atlas-server/src/indexer/evnode.rs
  • backend/crates/atlas-server/src/indexer/fetcher.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/indexer/metadata.rs
  • backend/crates/atlas-server/src/lib.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/src/metrics.rs
  • backend/crates/atlas-server/tests/integration/common.rs


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
backend/crates/atlas-server/tests/integration/common.rs (1)

117-127: Consider logging test skips more visibly in CI.

Tests are silently skipped (only eprintln!) when the database environment cannot be initialized. In CI pipelines, this could mask configuration issues. Consider using #[ignore] attribute with a custom test harness or ensuring CI always has ATLAS_TEST_DATABASE_URL set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 -
127, The run function currently skips tests silently using eprintln! when ENV is
Err, which can hide CI configuration issues; update run (referencing run and
ENV) to (1) emit a visible log (use the tracing or log crate instead of
eprintln!) so CI logs show the skip, and (2) if running in CI (detect via
std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early (panic! with
a clear message referencing ATLAS_TEST_DATABASE_URL) instead of skipping so CI
surfaces the missing configuration; otherwise keep the local skip behavior for
developer runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The run function currently skips tests silently using
eprintln! when ENV is Err, which can hide CI configuration issues; update run
(referencing run and ENV) to (1) emit a visible log (use the tracing or log
crate instead of eprintln!) so CI logs show the skip, and (2) if running in CI
(detect via std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early
(panic! with a clear message referencing ATLAS_TEST_DATABASE_URL) instead of
skipping so CI surfaces the missing configuration; otherwise keep the local skip
behavior for developer runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0fa92730-4a14-4864-b25f-20bbc3a0eaf9

📥 Commits

Reviewing files that changed from the base of the PR and between df1ebb2 and 9675bff.

📒 Files selected for processing (15)
  • backend/Cargo.toml
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/mod.rs
  • backend/crates/atlas-server/tests/integration/addresses.rs
  • backend/crates/atlas-server/tests/integration/blocks.rs
  • backend/crates/atlas-server/tests/integration/common.rs
  • backend/crates/atlas-server/tests/integration/nfts.rs
  • backend/crates/atlas-server/tests/integration/schema.rs
  • backend/crates/atlas-server/tests/integration/search.rs
  • backend/crates/atlas-server/tests/integration/status.rs
  • backend/crates/atlas-server/tests/integration/tokens.rs
  • backend/crates/atlas-server/tests/integration/transactions.rs
✅ Files skipped from review due to trivial changes (1)
  • backend/crates/atlas-server/tests/integration/addresses.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • backend/Cargo.toml
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/crates/atlas-server/tests/integration/common.rs (2)

117-127: Silent test skipping may hide CI failures.

When ENV initialization fails, the test prints to stderr and returns successfully. In test frameworks, this counts as a passing test. If Docker is unavailable and ATLAS_TEST_DATABASE_URL isn't set in CI, all integration tests will appear to pass despite not running.

Consider making skips explicit:

Option 1: Panic to fail explicitly in CI
 pub fn run<F: std::future::Future<Output = ()>>(f: F) {
     if let Err(error) = ENV.as_ref() {
-        eprintln!("skipping integration test: {error}");
-        return;
+        if env::var("CI").is_ok() {
+            panic!("integration test environment unavailable in CI: {error}");
+        } else {
+            eprintln!("skipping integration test: {error}");
+            return;
+        }
     }
Option 2: Use #[ignore] on integration tests and run explicitly in CI

Mark integration tests with #[ignore] and run them in CI with cargo test -- --ignored.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 -
127, The current run function quietly returns when ENV.as_ref() is Err, causing
tests to be counted as passing; change this to fail loudly by replacing the
silent eprintln!/return with a panic! (or assert!) that includes the ENV
initialization error and a hint about setting ATLAS_TEST_DATABASE_URL or
enabling Docker; update the logic inside run (the ENV.as_ref() branch) to call
panic!("integration test skipped due to ENV init error: {error}. Set
ATLAS_TEST_DATABASE_URL or enable Docker to run integration tests.") so failing
CI will surface missing integration setup.

91-93: Prometheus recorder is built but not installed globally.

The recorder is created with build_recorder() and its handle is extracted, but the recorder itself is dropped without being installed. Per the Metrics::new() docstring in src/metrics.rs, it "Must be called after install_prometheus_recorder()".

For tests that don't verify metrics output, this works fine (types compile, code runs). However, if any test tries to verify metrics via prometheus_handle.render(), it won't see any registered metrics since Metrics::new() on line 110 registers with the default no-op recorder.

If metrics verification is not needed in tests, consider adding a brief comment explaining this is intentional. Otherwise, install the recorder:

Suggested fix if metrics verification is needed
-    let prometheus_handle = metrics_exporter_prometheus::PrometheusBuilder::new()
-        .build_recorder()
-        .handle();
+    let recorder = metrics_exporter_prometheus::PrometheusBuilder::new()
+        .build_recorder();
+    let prometheus_handle = recorder.handle();
+    // Note: install() can only be called once per process; tests may conflict
+    let _ = recorder.install();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 91 -
93, The Prometheus recorder returned by PrometheusBuilder::build_recorder() is
being dropped (only the handle is kept) so Metrics::new() registers against the
default no-op recorder; capture both the recorder and handle (let (recorder,
prometheus_handle) =
metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then
call install_prometheus_recorder(recorder) (the function referenced in
src/metrics.rs) before calling Metrics::new() so tests that call
prometheus_handle.render() will see registered metrics; alternatively, if you
intentionally don't verify metrics in these tests, add a brief comment next to
the current code explaining that dropping the recorder is intentional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The current run function quietly returns when
ENV.as_ref() is Err, causing tests to be counted as passing; change this to fail
loudly by replacing the silent eprintln!/return with a panic! (or assert!) that
includes the ENV initialization error and a hint about setting
ATLAS_TEST_DATABASE_URL or enabling Docker; update the logic inside run (the
ENV.as_ref() branch) to call panic!("integration test skipped due to ENV init
error: {error}. Set ATLAS_TEST_DATABASE_URL or enable Docker to run integration
tests.") so failing CI will surface missing integration setup.
- Around line 91-93: The Prometheus recorder returned by
PrometheusBuilder::build_recorder() is being dropped (only the handle is kept)
so Metrics::new() registers against the default no-op recorder; capture both the
recorder and handle (let (recorder, prometheus_handle) =
metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then
call install_prometheus_recorder(recorder) (the function referenced in
src/metrics.rs) before calling Metrics::new() so tests that call
prometheus_handle.render() will see registered metrics; alternatively, if you
intentionally don't verify metrics in these tests, add a brief comment next to
the current code explaining that dropping the recorder is intentional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 597a4861-10cf-4f61-9db2-907b37575796

📥 Commits

Reviewing files that changed from the base of the PR and between 9675bff and 5f3ada6.

📒 Files selected for processing (1)
  • backend/crates/atlas-server/tests/integration/common.rs

@pthmas pthmas requested a review from tac0turtle April 1, 2026 15:19

auricom commented Apr 1, 2026

This is a nice set of metrics! I'd suggest adding a few more so we can get a bit more insight into the performance and swiftness of indexing.

atlas_indexer_missing_blocks (Gauge)

Tracks the current count of blocks known to be missing from the index and pending backfill. Must be set to 0 when fully caught up.

atlas_indexer_block_processing_duration_seconds (Histogram)

CodeRabbit flagged that atlas_indexer_batch_duration_seconds includes idle/sleep time between batches, not just processing time. Its p95 would be inflated and not really usable for a "processing time" alert.

This new histogram should be timed from batch start to write completion, before any sleep/wait. Suggested buckets: .001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5.

atlas_db_pool_checkout_duration_seconds (Histogram, label: pool)

Time from requesting a connection from the pool to receiving one. Must be instrumented at every pool.acquire() / pool.get() call site.
Suggested buckets (fine-grained sub-millisecond): .0001, .0005, .001, .005, .01, .025, .05, .1.
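
A possible shape for this instrumentation is a small timing wrapper around each checkout. In the sketch below, `record` stands in for a histogram observation (e.g. via the metrics crate's histogram macro) and `acquire` for the actual pool.acquire() call; both names are illustrative:

```rust
use std::time::Instant;

// Times an acquire-like call and hands the elapsed seconds to a recorder
// before returning the acquired value.
fn timed_acquire<T>(record: impl FnOnce(f64), acquire: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let conn = acquire();
    record(start.elapsed().as_secs_f64());
    conn
}

fn main() {
    let mut observed = None;
    let conn = timed_acquire(|secs| observed = Some(secs), || "connection");
    assert_eq!(conn, "connection");
    assert!(observed.unwrap() >= 0.0);
}
```

Wrapping every call site through one helper like this keeps the metric consistent and makes it hard to forget an acquire path.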

atlas_indexer_head_block_timestamp_seconds (Gauge)

Unix timestamp (seconds) of the most recently indexed block — specifically the chain-level block timestamp, not when it was processed. Updated every time atlas_indexer_head_block is updated.

