Skip to content

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727

Open
prk-Jr wants to merge 16 commits into
feature/edgezero-pr19-spin-adapterfrom
feature/edgezero-pr19-cutover-canary
Open

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
prk-Jr wants to merge 16 commits into
feature/edgezero-pr19-spin-adapterfrom
feature/edgezero-pr19-cutover-canary

Conversation

@prk-Jr

@prk-Jr prk-Jr commented May 21, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds edgezero_rollout_pct (0–100) to the trusted_server_config store so EdgeZero traffic can be shifted incrementally without a deploy
  • Uses FNV-1a 32-bit hash of client_ip for deterministic, sticky per-user bucket assignment (0–99); routing predicate is bucket < rollout_pct
  • Fail-safe defaults: read error or invalid value → 0 (all legacy); absent key (backward compat) → 100 (all EdgeZero); rollout_pct = 0 is the immediate no-deploy rollback

Changes

File Change
crates/trusted-server-adapter-fastly/src/main.rs Added parse_rollout_pct, fnv1a_bucket, canary_routes_to_edgezero, read_rollout_pct; updated main() with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)
fastly.toml Added edgezero_rollout_pct = "0" to local Viceroy config store with ops comment
docs/internal/EDGEZERO_MIGRATION.md New ops runbook: config store keys, failure modes, pre-flight activation, 4 canary stages, pass/fail thresholds, rollback procedure, monitoring

Closes

Closes #500

Test plan

  • cargo test-fastly — 65 + 956 + 21 + 3 tests green
  • cargo test-axum — 31 tests green
  • cargo clippy-fastly && cargo clippy-axum — no warnings
  • cargo fmt --all -- --check — clean
  • JS tests: cd crates/js/lib && npx vitest run — 291 tests green
  • JS format: cd crates/js/lib && npm run format — clean
  • Docs format: cd docs && npm run format — clean
  • WASM build: cargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1
  • Manual testing via fastly compute serve with edgezero_rollout_pct = "1" and "0" (rollback)

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses log macros (not println!)
  • New code has tests
  • No secrets or credentials committed

prk-Jr added 8 commits May 21, 2026 14:26
When edgezero_enabled=true, reads edgezero_rollout_pct (0-100) from the
trusted_server_config store. Routes each request to EdgeZero if
fnv1a_bucket(client_ip) < rollout_pct, giving the ops team sticky
per-user canary control without a deploy.

Absent key defaults to 100 (full rollout, backward compatible with
edgezero_enabled=true deployments that predate this PR). Invalid values
and read errors default to 0 (all legacy, fail-safe).
Set to "0" so local dev and CI stay on the legacy path by default.
Ops changes this in the production config store — no re-deploy required.
Documents the two config store keys, canary progression stages (1% →
10% → 50% → 100%), hold-point durations, pass/fail thresholds, and
instant rollback procedure.
@prk-Jr prk-Jr self-assigned this May 21, 2026
@prk-Jr prk-Jr changed the title Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19) May 21, 2026
@aram356 aram356 requested review from ChristianPavilonis and aram356 and removed request for aram356 May 21, 2026 15:14
@prk-Jr prk-Jr changed the base branch from feature/edgezero-pr18-phase5-verification to feature/edgezero-pr19-spin-adapter May 27, 2026 13:05
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/main.rs
Comment thread crates/trusted-server-adapter-fastly/src/main.rs
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

One low-severity docs/ops note: the canary runbook references routing log lines that are emitted at debug level, so production verification may require a debug/log-level deployment or equivalent observability setup.

Comment thread docs/internal/EDGEZERO_MIGRATION.md Outdated
prk-Jr added 4 commits June 13, 2026 11:55
The absent-key branch logged at warn on every request when edgezero_enabled=true
but edgezero_rollout_pct was unset, flooding logs at production QPS while traffic
flowed fine. Demote to debug; the setup procedure and migration runbook remain
the operational safety net. The absent-key default stays 100 (backward compat)
so an existing edgezero_enabled=true deployment is not silently rolled back.
For rollout_pct 0 (full rollback) and 100 (full cutover) — which together cover
most of the canary's lifetime — the routing decision is fixed, yet every request
still allocated the client-IP routing key string and ran the FNV hash. Match on
the rollout value and only compute the bucket for a partial rollout.

Also rename canary_routes_to_edgezero to routes_to_edgezero: the predicate is
just `bucket < rollout_pct` and applies at any rollout value, so "canary" read
misleadingly at the call site.
The four documented branches (absent -> 100, valid -> parsed, invalid -> 0,
read error -> 0) are safety-critical but only the parse step was indirectly
covered. ConfigStoreHandle::new(Arc::new(...)) over the public ConfigStore trait
is constructible in unit tests, so add an in-memory stub store and pin every
branch, including the fail-safe defaults that matter during an incident.
The routing verification log lines are emitted at log::debug!, but the Fastly
logger is capped at Info in production, so verifying the canary via log tailing
needs a debug-level deployment or another observability signal (the runbook now
says so and points at the real-time stats split as the alternative).

Also reconcile the documented log format with the adapter: at the degenerate
rollout values (0 and 100) the bucket is no longer computed, so those lines read
`routing request through {legacy,EdgeZero} path (rollout_pct=N)` without a bucket.

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly entry-point canary routing, config-store failure behavior, rollout docs, and CI/review context. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about strengthening coverage for the safety-critical entry-point dispatch matrix.

Findings

P0 / Blockers

None.

P1 / High

None.

P2 / Medium

1 inline comment submitted.

P3 / Low

None.

CI / Existing Reviews

All reported PR checks are passing. Existing inline feedback about absent-key logging, config-store branch tests, degenerate rollout hashing, naming, and debug-level log documentation appears to have been addressed in later commits. Local cargo test-fastly rollout -- --nocapture could not run in this environment because Viceroy failed to satisfy fastly_http_downstream::downstream_bot_analyzed; CI's Fastly job is green.

Comment thread crates/trusted-server-adapter-fastly/src/main.rs Outdated
prk-Jr added 3 commits June 13, 2026 21:35
Pull the rollout-percentage routing matrix out of main() into a pure
should_route_to_edgezero(rollout_pct, client_ip) helper so the
safety-critical wiring from rollout state to the EdgeZero/legacy choice
can be unit-tested directly, not just the parsing/hashing helpers it
composes. main() keeps the edgezero_enabled gate and config reads as thin
I/O glue.

Add tests covering the full dispatch matrix: rollout_pct=0 (always
legacy), rollout_pct=100 (always EdgeZero), and partial-rollout bucket
hit/miss for both present and absent client IPs.
…er' into feature/edgezero-pr19-cutover-canary
The base branch refactored IpAddr out of main.rs and dropped the import.
Merging base in lost the import that should_route_to_edgezero and its
tests rely on, breaking the wasm32-wasip1 build. Re-add it.

@ChristianPavilonis ChristianPavilonis left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review:

Review Summary

Reviewed PR #727 against feature/edgezero-pr19-spin-adapter, focusing on the Fastly canary dispatch, rollout percentage parsing/defaults, config-store failure behavior, tests, local Fastly config, and the migration runbook. I did not find any blocking correctness, security, data-loss, authorization, or severe compatibility issues. I left one P2 inline comment about making the production canary observability path explicit before operators rely on the runbook.

Findings

P0 / Blockers

None.

P1 / High

None.

P2 / Medium

1 inline comment submitted.

P3 / Low

None.

CI / Existing Reviews

All reported PR checks are passing. Existing review feedback about absent-key log volume, degenerate rollout hashing, config-store branch coverage, naming, and debug-level log documentation appears addressed by later commits. Local cargo test-fastly rollout -- --nocapture compiled the changed Fastly test binary but could not execute under the installed Viceroy because it lacks fastly_http_downstream::downstream_bot_analyzed; CI's Fastly test job is green.

> **The routing log lines are emitted at `log::debug!`, while the Fastly logger
> is capped at `Info` in production.** Verifying the canary via log tailing
> therefore requires a debug-level logging deployment (or another equivalent
> observability signal, such as the Fastly real-time stats traffic split). With

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review: P2: Make the production canary route signal explicit before relying on it operationally.

This paragraph now correctly calls out that the route-decision logs are debug! while the Fastly logger is capped at Info, but the fallback it names (Fastly real-time stats traffic split) is not something this PR appears to create. The code only emits the EdgeZero/legacy decision at debug level, and the runbook also says the x-edgezero-path marker does not exist yet. Without a production-visible route marker (or an exact documented dashboard/query that already distinguishes the two branches), operators cannot reliably confirm that rollout_pct=0 is actually all legacy, that 1%/10% traffic is flowing on the intended path, or attribute latency/error changes to the canary branch.

Suggested fix: either add/document a concrete production-safe signal before running the canary (for example a sampled/rate-limited info log field, a metrics/log field, or an internally gated diagnostic response header), or change the runbook to make a debug-level logging deployment an explicit precondition and remove the unsupported real-time-stats fallback unless there is a specific existing split/dashboard to reference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants