Skip to content

feat(sidecar): production keyring backend support (closes #162)#168

Merged
bdchatham merged 4 commits into
mainfrom
feat/sidecar-keyring-backend
May 12, 2026
Merged

feat(sidecar): production keyring backend support (closes #162)#168
bdchatham merged 4 commits into
mainfrom
feat/sidecar-keyring-backend

Conversation

@bdchatham
Copy link
Copy Markdown
Contributor

@bdchatham bdchatham commented May 12, 2026

Summary

  • Implements Phase 1 step 2 of the in-pod governance signing workstream (design: docs/design/in-pod-governance-signing.md, Component B).
  • Adds three envs read at startup: `SEI_KEYRING_BACKEND` (`test`/`file`/`os`, empty = governance disabled), `SEI_KEYRING_DIR` (defaults to `$SEI_HOME/keyring-file`), `SEI_KEYRING_PASSPHRASE` (required when backend=file).
  • New `sidecar/server/keyring.go`: `OpenKeyring` factory + `SmokeTestKeyring` with bounded retry + defensive `redactPassphrase`.
  • Passphrase wiped from `os.Environ()` unconditionally before any error check via `os.Unsetenv`.
  • Engine gains `ExecutionConfig.Keyring` accessible via Set/Get on `*Engine`. No handler reads it in this PR — that wiring lands in feat(sidecar): sign-tx task family (gov vote / submit-proposal / deposit) #163 (sign-tx task family).
  • Genesis isolation preserved: `generate_gentx.go` continues to construct its own `BackendTest` keyring locally; does NOT consume the shared `Engine.Keyring`.
  • New `sidecar/docs/keyring.md` covers env contract + trust model + genesis-path isolation.

Discretionary decisions flagged for review

These weren't fully prescribed; cross-reviewers should sanity-check:

  1. *`ExecutionConfig` on `Engine` with Set/Get accessors, not as a `NewEngine` constructor argument. Rationale: ~16 NewEngine call sites in tests; signature change would force churn for a field no handler reads yet. Mutex-guarded accessor pair added. Alternative considered: `NewEngineWithConfig` second constructor — rejected (duplicates constructor surface). Wiring to handlers belongs to feat(sidecar): sign-tx task family (gov vote / submit-proposal / deposit) #163.

  2. `OpenKeyring` strips a trailing `keyring-file` segment from the supplied dir. The SDK's file backend internally appends `keyring-file` to the rootDir, so a caller passing `/sei/keyring-file` would land at `/sei/keyring-file/keyring-file`. The strip is a guardrail matching both possible operator mental models. Design's sketch used `filepath.Dir(dir)` unconditionally; I made it conditional on the suffix to keep `test`/`os` backends (which receive an arbitrary dir) unaffected.

  3. `SmokeTestKeyring` recovers from `panic` in the underlying 99designs/keyring lib. The lib panics on a non-directory FileDir (discovered during testing). Recovery wraps the panic into a smoke-test error so a bad mount path produces fail-fast logging rather than a SIGSEGV.

  4. Redaction = verbatim `strings.ReplaceAll` of the passphrase at the error-return boundary inside `OpenKeyring`. Defensive against future upstream regressions; the SDK doesn't echo the passphrase in any current code path. Wraps with `%s` (not `%w`) at that boundary to sever any chain that might carry the passphrase via a `%v` of an internal struct.

  5. Passphrase `os.Unsetenv` happens before the error check, on every code path through `OpenKeyring`. Deferred-unset alternative would risk leaking the value across an early-return.

  6. No stdout-capturing test for the `keyring opened` log line. The single log statement contains backend + dir only — no passphrase. The contract is asserted indirectly via code review of all call sites plus the `redactPassphrase` unit test. A captured-output test would be brittle against `seilog`'s writer wiring.

Spec gaps surfaced

  • Issue body AC item 2 contradicts the merged design. Issue says "Shared keyring factory consumed by `generate_gentx.go` (refactored, not duplicated)" but design Component B says "Refactor `generate_gentx.go` to take its keyring backend from a shared sidecar-level factory rather than hardcoding `BackendTest`. The factory reads from envs above. Genesis path continues to use `BackendTest` when configured." On re-reading both, the design is consistent with EITHER reading — but the orchestrator's brief explicitly directed "Do NOT consume the shared `cfg.Keyring` for the gentx flow" to preserve genesis isolation as a hard property. Followed the brief; documented the deliberate isolation in `generate_gentx.go`. Worth confirming in review.
  • Design code sketch `openKeyring` shows `codec.NewProtoCodec(...)` as a parameter. Real `keyring.New` signature is `(appName, backend, rootDir, userInput, opts...)` — no codec. Followed real SDK; design sketch is stale.

Test plan

  • `go build ./...` clean
  • `go test ./...` all packages pass
  • `golangci-lint run --new-from-rev=origin/main ./...` reports 0 issues
  • Review: confirm passphrase never appears in logs, error strings, or task results
  • Review: confirm `generate_gentx` genesis path is unchanged (BackendTest hardcoded locally; isolation comment present)
  • Review: confirm `os.Unsetenv` happens before any path that could surface the passphrase

Cross-review

This PR is being reviewed in parallel by platform-engineer, kubernetes-specialist, and security-specialist before being marked ready for merge. Findings will be synthesized into a single resolution comment.

🤖 Generated with Claude Code


Note

Medium Risk
Introduces new startup-time secret/env handling and Cosmos SDK keyring initialization, which can affect sidecar boot behavior and touches sensitive passphrase redaction/wiping logic. Also changes engine startup sequencing (explicit stale-task rehydration) which could impact crash-recovery behavior if misordered.

Overview
Adds opt-in production keyring backend support for the sidecar: serve now reads SEI_KEYRING_BACKEND/SEI_KEYRING_DIR/SEI_KEYRING_PASSPHRASE, opens and smoke-tests a Cosmos SDK keyring, and wipes the passphrase env to reduce exposure in /proc/<pid>/environ.

Introduces engine.ExecutionConfig (currently holding Keyring) and stores it on Engine, and adjusts engine startup so stale-task recovery is triggered explicitly via RehydrateStaleTasks() after config is installed.

Adds server.OpenKeyring + SmokeTestKeyring utilities with backend validation, directory normalization for the file backend, bounded retry, panic recovery, and passphrase redaction; includes unit tests plus new operator-facing docs in sidecar/docs/keyring.md. Also documents that generate-gentx continues using an isolated BackendTest keyring (no production keyring reuse).

Reviewed by Cursor Bugbot for commit 7e8670d. Bugbot is set up for automated code reviews on this repo. Configure here.

bdchatham and others added 2 commits May 12, 2026 08:14
Implements Phase 1 step 2 of the in-pod governance signing workstream
(design: docs/design/in-pod-governance-signing.md, Component B).

Adds three envs read at serve.go startup:
  SEI_KEYRING_BACKEND       — test | file | os (empty = governance disabled)
  SEI_KEYRING_DIR           — defaults to $SEI_HOME/keyring-file
  SEI_KEYRING_PASSPHRASE    — required when backend=file

New sidecar/server/keyring.go:
  - OpenKeyring(backend, dir, passphrase) constructs a keyring.Keyring
    via the Cosmos SDK keyring package. File-backend opens supply the
    passphrase via in-memory reader; test/os backends accept arbitrary
    dirs.
  - SmokeTestKeyring exercises kr.List() with bounded retry (3 attempts,
    2s backoff) to absorb kubelet Secret-mount races. Empty keyrings
    pass — missing-key errors surface at first sign-tx (deferred to
    #163), not here.
  - redactPassphrase severs any error chain that might echo the
    passphrase (defensive; SDK does not currently leak).

After OpenKeyring succeeds, the caller in serve.go MUST
  os.Unsetenv("SEI_KEYRING_PASSPHRASE")
to remove the secret from /proc/<pid>/environ. Done unconditionally
before any error check, on every code path.

Engine gains ExecutionConfig (with Keyring field) accessible via
Set/Get on *Engine. No handler reads it in this PR — wiring belongs
to #163 (sign-tx task family). The Engine accessor pair was chosen
over a NewEngine signature change to avoid churning ~16 test call sites
for a field no consumer reads yet.

Genesis isolation preserved: generate_gentx.go continues to construct
its own keyring.BackendTest locally and does NOT consume the shared
Engine.Keyring. Documented inline with a WHY comment.

New docs/keyring.md covers the env contract, fail-fast semantics,
trust model (passphrase lifetime, never-logged, wiped post-init), and
genesis-path isolation rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-review by platform-engineer / kubernetes-specialist / security-specialist
surfaced 5 must-fix items and several minor cleanups. All applied here.

Drop mutex on Engine.SetExecutionConfig / ExecutionConfig
  (platform-engineer N1): the field is set-once during single-threaded
  startup before any goroutines run. The mutex was theater; dropping it
  matches the actual usage contract. Doc comment now says "not safe to
  call concurrently with task execution" explicitly.

Wipe SEI_KEYRING_PASSPHRASE immediately after os.Getenv
  (security-specialist #2): previously the wipe happened after the
  passphrase-required check, which is benign today (value is "" at that
  point) but future-fragile. Moving the Unsetenv to immediately after
  Getenv closes the /proc/<pid>/environ window unconditionally on every
  return path.

Sever wrapped error chain in OpenKeyring
  (platform-engineer N4 + security-specialist #3): use errors.New with
  redacted message instead of %w-wrapping the underlying SDK error. This
  prevents a typed field that might embed the passphrase from
  resurfacing via a future caller's %w or %v of a wrapped struct.
  Removed the %w wrap at buildExecutionConfig's call site too.

Surface suffix-strip behavior in operator-facing docs
  (platform-engineer N3 + kubernetes-specialist n2): document that for
  the file backend, a trailing /keyring-file segment is stripped before
  handoff to the SDK. Also clarify the home-dir resolution (--home flag
  defaulting to $SEI_HOME) rather than implying a direct $SEI_HOME read.

Tighten SmokeTestKeyring panic-recovery comment
  (platform-engineer N5): the previous comment claimed fail-fast was
  bypassed without recovery, but a panic IS fail-fast. The real reason
  for recovery is so the bounded retry loop can run. Updated.

Add comment near kr.List() in smokeTestAttempt
  (security-specialist nit): explicit "discard the slice deliberately;
  List() decrypts the index, not individual keys" — documents the
  intentional non-destructiveness of the check.

Deferred to follow-ups (out of #162 scope, documented in PR comment):
  - Rehydration-vs-config ordering trap (k8s N4) — fixes with #163 when
    sign-tx handlers exist
  - ExecutionConfig blast radius / per-handler capabilities (security #7) —
    design-trajectory work
  - Stdout-capturing log-line redaction test (platform #6, security #6) —
    seilog caches its writer at init time, making runtime redirect
    trivially pass-through; deferred to code-inspection at the single log
    site (serve.go:185)
  - shareProcessNamespace doc-vs-code consistency (security #1) — meta
    issue, not a vuln in this PR; noted in #221 thread

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Contributor Author

Cross-review complete — findings integrated

3-specialist parallel review against commit `750db33` (platform-engineer / kubernetes-specialist / security-specialist).

Verdicts: platform-engineer LGTM-with-fixes • kubernetes-specialist Approve-with-non-blocking-follow-ups • security-specialist LGTM-with-fixes (no code blockers).

Addressed in commit `afa9cac`

# Source Item Fix
1 platform N1 Mutex on `SetExecutionConfig` was theater (field is set-once during single-threaded startup) Dropped the mutex; doc comment explicitly states "not safe to call concurrently with task execution"
2 security #2 `os.Unsetenv` happened after the passphrase-required check, leaving a future-fragile window Moved to immediately after `os.Getenv`; unconditional on every return path
3 platform N4 + security #3 `%w`-wrapping the SDK error preserved the chain — a typed field embedding the passphrase could resurface via a future caller's `%w` or `%v` of a wrapped struct Use `errors.New(redactPassphrase(...))` to sever the chain; also removed `%w` wrap at the `buildExecutionConfig` call site
4 platform N3 + k8s n2 Suffix-strip on `SEI_KEYRING_DIR` was hidden behavior; home-dir resolution was misdescribed in docs Documented suffix-strip explicitly; clarified `--home` flag (defaulting to `$SEI_HOME`) rather than implying direct `$SEI_HOME` read
5 platform N5 `SmokeTestKeyring` panic-recovery comment overstated the fail-fast property Tightened: real reason is so the bounded retry loop can run
6 security nit No explicit comment on why `kr.List()` discards the slice Added: "List() decrypts the index, not individual keys — strongest non-destructive check"

Deferred to follow-ups (clearly out of #162 scope, surfaced for tracking)

  • Rehydration-vs-config ordering trap (k8s N4): `NewEngine` rehydrates stale tasks BEFORE `SetExecutionConfig` runs. If a future sign-tx task is in 'running' state across a pod restart, rehydration would execute it with a zero-value config (nil Keyring). No bug today (no sign-tx handlers exist yet). Will fix in feat(sidecar): sign-tx task family (gov vote / submit-proposal / deposit) #163 by either rehydrating after the config is installed OR by having sign-tx handlers read the engine's config at execution time (not goroutine-launch time).
  • `ExecutionConfig` blast radius / per-handler capabilities (security Bump version to 0.0.2 in prep for release #7): `*Engine.ExecutionConfig()` is package-exported and any future task handler can pull the unlocked keyring. Right scoping for MVP; consider an opaque `SignerProvider` interface in a future hardening pass.
  • Log-line redaction test (platform Fix multi-line read and typos in config #6, security Fix multi-line read and typos in config #6): seilog caches its writer at init time, so an `os.Stdout` redirect during a test doesn't capture the actual log output. A runtime test would trivially pass-through rather than catch regressions. Deferred to code-inspection at the single log site (`serve.go:185` — `serveLog.Info("keyring opened", "backend", backend, "dir", dir)` — passphrase NOT in the arglist).
  • `shareProcessNamespace` doc-vs-code consistency (security Integrate UCI lint, versioning and go check #1): the security reviewer flagged that the prompt claims `true` while the design doc text at one point says `false`. The CONTROLLER's actual pod-spec keeps `shareProcessNamespace=true` (deferred per Harden sidecar/seid trust boundary against /proc-based leakage sei-k8s-controller#221). This is a meta-issue, not a vulnerability in this PR; will reconcile when #221 lands.

Discretionary decisions reviewers ratified

  • `OpenKeyring` suffix-strip of trailing `/keyring-file`: all three reviewers accepted as correct ergonomics; documented in docs now.
  • Smoke test scope (`kr.List()` only, no specific key-name validation): all three accepted; permitted-empty-keyring posture documented in design Component B.
  • 3 × 2s smoke-test retry window: k8s reviewer confirmed it's appropriate for kubelet Secret-mount races (typical race is ~100-500ms; 6s comfortable; chronic infra problems still CrashLoopBackOff in 6s as desired).
  • Genesis isolation (generate_gentx continues to use `BackendTest` locally): all three reviewers approved the deliberate separation; the comment at `generate_gentx.go` explaining "why" is load-bearing.

Gates

  • `go build ./...` clean
  • `go test ./...` all packages pass
  • `golangci-lint run --new-from-rev=origin/main ./...` — 0 issues

PR is ready for human review.

Comment thread serve.go
Comment thread sidecar/engine/engine.go Outdated
Comment thread sidecar/engine/types.go
Comment thread sidecar/server/keyring.go Outdated
1. Cursor — passphrase wasn't wiped on the unsupported-backend early
   return. Moved os.Getenv/os.Unsetenv for SEI_KEYRING_PASSPHRASE to
   the very top of buildExecutionConfig so EVERY return path leaves
   /proc/<pid>/environ clean — including unset backend and unsupported
   backend paths that previously returned before the wipe.

2. User (engine.go:234) — dropped SetExecutionConfig/ExecutionConfig
   getter/setter pair in favor of a directly-exported Engine.Config
   field. With the mutex already dropped per cross-review, the methods
   were a no-op wrapper. Field doc comment captures the set-once
   contract.

3. User (engine/types.go:92) — trimmed the ExecutionConfig doc comment
   from 10 lines to 2: "carries process-wide dependencies; fields are
   nil when the corresponding subsystem isn't configured."

4. User (server/keyring.go:25) — tightened comments throughout the
   file. Backend-aliases comment dropped 4 lines → 2. AllowedBackends
   comment dropped 3 lines → 1. Smoke-test retry comment dropped 5 → 2.
   OpenKeyring godoc dropped 12 lines → 4. SmokeTestKeyring godoc
   dropped 7 → 4. redactPassphrase godoc dropped 4 → 2. kr.List()
   comment dropped 3 lines → 1. All WHYs preserved; padding removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Contributor Author

Addressed Cursor + line comments (commit `bb26ba6`)

Source Pin Resolution
Cursor bugbot `serve.go:158` — passphrase not wiped on unsupported-backend path Moved `os.Getenv` + `os.Unsetenv` for `SEI_KEYRING_PASSPHRASE` to the very top of `buildExecutionConfig`. Every return path (unset backend, unsupported backend, missing passphrase, OpenKeyring error, smoke-test failure, success) now leaves `/proc//environ` clean.
User `engine.go:234` — "Any reason to not just export the field?" Dropped `SetExecutionConfig`/`ExecutionConfig()` getter+setter pair. Now an exported `Engine.Config` field with a doc comment capturing the set-once-before-Submit contract. Call site at `serve.go:127` simplified to `eng.Config = execCfg`.
User `engine/types.go:92` — "make more concise" `ExecutionConfig` godoc trimmed from 10 lines to 2 — just the carrier intent + the nil-when-unconfigured field contract.
User `server/keyring.go:25` — "make all comments more concise" Tightened every comment in the file: backend aliases (4→2), AllowedBackends (3→1), smoke-test retry (5→2), OpenKeyring godoc (12→4), SmokeTestKeyring godoc (7→4), redactPassphrase (4→2), kr.List() inline (3→1). All WHYs preserved; padding removed. Net −46 lines in the file.

Gates

  • `go build ./...` clean
  • `go test ./...` all packages pass
  • `golangci-lint run --new-from-rev=origin/main ./...` — 0 issues

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit bb26ba6. Configure here.

Comment thread serve.go
…ing)

Cursor bugbot flagged that eng.Config = execCfg at serve.go:127 races
with goroutines spawned by NewEngine's auto-rehydration of stale tasks.
Today no handler reads Config so no race manifests, but #163 will wire
sign-tx handlers that consume Config.Keyring — the race becomes real
the moment that lands.

Fix the ordering before it can bite:

- NewEngine becomes a pure constructor; no goroutine spawn.
- RehydrateStaleTasks is now an exported method the caller invokes
  explicitly after installing Config. Doc comment says "must be called
  only after Config is installed."
- serve.go calls eng.Config = execCfg then eng.RehydrateStaleTasks();
  the goroutine-spawn happens-after the field write, so the Go memory
  model guarantees rehydrated handlers see the installed Config without
  any synchronization.
- Two tests that depended on auto-rehydration (engine_test.go and
  engine_e2e_test.go stale-task cases) updated to call
  RehydrateStaleTasks() explicitly.
- All other 14 NewEngine call sites are unaffected — they use empty
  stores so rehydration was already a no-op.

go test -race ./sidecar/engine/... clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Contributor Author

Addressed Cursor finding (commit `7e8670d`)

"Config write races with rehydration goroutines from NewEngine" — real, well-spotted. Same trap the k8s-specialist reviewer flagged earlier as a deferred-to-#163 follow-up. Locking it in here is cheaper than carrying the future-race forward.

Fix

`NewEngine` was spawning rehydration goroutines as a side effect of construction; the caller then wrote `eng.Config = execCfg` after the goroutines were already live. Today the handlers don't touch `Config` so no race manifested, but #163 will wire sign-tx handlers that read `Config.Keyring` — the race becomes real then.

Made rehydration explicit:

  • `NewEngine` is now a pure constructor (no goroutine spawn)
  • `RehydrateStaleTasks` is now an exported method the caller invokes after installing `Config`. Doc comment says "must be called only after Config is installed."
  • `serve.go:127` now reads:
    ```go
    eng := engine.NewEngine(ctx, handlers, store)
    eng.Config = execCfg
    eng.RehydrateStaleTasks()
    ```
    The `go` statement establishes a happens-before edge with the prior field write, so rehydrated handlers see the installed Config without any synchronization.

Call-site impact

  • 1 production call site: `serve.go` (updated)
  • 2 tests that depended on auto-rehydration (`engine_test.go` + `engine_e2e_test.go` stale-task cases): updated to call `RehydrateStaleTasks()` explicitly
  • 14 other `NewEngine` call sites: unaffected — they use empty stores so rehydration was already a no-op

Gates

  • `go build ./...` clean
  • `go test ./...` all packages pass
  • `go test -race ./sidecar/engine/...` clean
  • `golangci-lint run --new-from-rev=origin/main ./...` — 0 issues

Net diff: +15 / -10.

@bdchatham bdchatham merged commit 9111698 into main May 12, 2026
3 checks passed
@bdchatham bdchatham deleted the feat/sidecar-keyring-backend branch May 12, 2026 16:11
bdchatham added a commit that referenced this pull request May 18, 2026
Release the genesis-overrides feature (PR #181) plus the other six
commits that have accumulated since v0.0.49: trusted-header authn
(#179), gov-software-upgrade handler (#177), gov-vote handler (#175),
`/tx?hash=` discriminator hardening (#173), sign-tx foundation (#170),
production keyring backend (#168).

The Releaser workflow (.github/workflows/uci-release-publish.yml)
triggers on push to version.json and tags this commit as v0.0.50.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant