feat(sidecar): production keyring backend support (closes #162) #168
Implements Phase 1 step 2 of the in-pod governance signing workstream
(design: docs/design/in-pod-governance-signing.md, Component B).
Adds three envs read at serve.go startup:
SEI_KEYRING_BACKEND — test | file | os (empty = governance disabled)
SEI_KEYRING_DIR — defaults to $SEI_HOME/keyring-file
SEI_KEYRING_PASSPHRASE — required when backend=file
New sidecar/server/keyring.go:
- OpenKeyring(backend, dir, passphrase) constructs a keyring.Keyring
via the Cosmos SDK keyring package. File-backend opens supply the
passphrase via in-memory reader; test/os backends accept arbitrary
dirs.
- SmokeTestKeyring exercises kr.List() with bounded retry (3 attempts,
2s backoff) to absorb kubelet Secret-mount races. Empty keyrings
pass — missing-key errors surface at first sign-tx (deferred to
#163), not here.
- redactPassphrase severs any error chain that might echo the
passphrase (defensive; SDK does not currently leak).
After OpenKeyring returns, the caller in serve.go MUST call
os.Unsetenv("SEI_KEYRING_PASSPHRASE")
to remove the secret from /proc/<pid>/environ. This is done unconditionally,
before any error check, on every code path.
Engine gains ExecutionConfig (with Keyring field) accessible via
Set/Get on *Engine. No handler reads it in this PR — wiring belongs
to #163 (sign-tx task family). The Engine accessor pair was chosen
over a NewEngine signature change to avoid churning ~16 test call sites
for a field no consumer reads yet.
Genesis isolation preserved: generate_gentx.go continues to construct
its own keyring.BackendTest locally and does NOT consume the shared
Engine.Keyring. Documented inline with a WHY comment.
New docs/keyring.md covers the env contract, fail-fast semantics,
trust model (passphrase lifetime, never-logged, wiped post-init), and
genesis-path isolation rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-review by platform-engineer / kubernetes-specialist / security-specialist surfaced 5 must-fix items and several minor cleanups. All applied here.

- Drop mutex on Engine.SetExecutionConfig / ExecutionConfig (platform-engineer N1): the field is set-once during single-threaded startup before any goroutines run. The mutex was theater; dropping it matches the actual usage contract. Doc comment now says "not safe to call concurrently with task execution" explicitly.
- Wipe SEI_KEYRING_PASSPHRASE immediately after os.Getenv (security-specialist #2): previously the wipe happened after the passphrase-required check, which is benign today (value is "" at that point) but future-fragile. Moving the Unsetenv to immediately after Getenv closes the /proc/<pid>/environ window unconditionally on every return path.
- Sever wrapped error chain in OpenKeyring (platform-engineer N4 + security-specialist #3): use errors.New with redacted message instead of %w-wrapping the underlying SDK error. This prevents a typed field that might embed the passphrase from resurfacing via a future caller's %w or %v of a wrapped struct. Removed the %w wrap at buildExecutionConfig's call site too.
- Surface suffix-strip behavior in operator-facing docs (platform-engineer N3 + kubernetes-specialist n2): document that for the file backend, a trailing /keyring-file segment is stripped before handoff to the SDK. Also clarify the home-dir resolution (--home flag defaulting to $SEI_HOME) rather than implying a direct $SEI_HOME read.
- Tighten SmokeTestKeyring panic-recovery comment (platform-engineer N5): the previous comment claimed fail-fast was bypassed without recovery, but a panic IS fail-fast. The real reason for recovery is so the bounded retry loop can run. Updated.
- Add comment near kr.List() in smokeTestAttempt (security-specialist nit): explicit "discard the slice deliberately; List() decrypts the index, not individual keys", which documents the intentional non-destructiveness of the check.
Deferred to follow-ups (out of #162 scope, documented in PR comment):
- Rehydration-vs-config ordering trap (k8s N4): fixes with #163 when sign-tx handlers exist
- ExecutionConfig blast radius / per-handler capabilities (security #7): design-trajectory work
- Stdout-capturing log-line redaction test (platform #6, security #6): seilog caches its writer at init time, making runtime redirect trivially pass-through; deferred to code-inspection at the single log site (serve.go:185)
- shareProcessNamespace doc-vs-code consistency (security #1): meta issue, not a vuln in this PR; noted in #221 thread

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cross-review complete: findings integrated
3-specialist parallel review against commit `750db33` (platform-engineer / kubernetes-specialist / security-specialist).
Verdicts: platform-engineer LGTM-with-fixes • kubernetes-specialist Approve-with-non-blocking-follow-ups • security-specialist LGTM-with-fixes (no code blockers).
Addressed in commit `afa9cac`
Deferred to follow-ups (clearly out of #162 scope, surfaced for tracking)
Discretionary decisions reviewers ratified
Gates
PR is ready for human review.
1. Cursor: passphrase wasn't wiped on the unsupported-backend early return. Moved os.Getenv/os.Unsetenv for SEI_KEYRING_PASSPHRASE to the very top of buildExecutionConfig so EVERY return path leaves /proc/<pid>/environ clean, including unset backend and unsupported backend paths that previously returned before the wipe.
2. User (engine.go:234): dropped SetExecutionConfig/ExecutionConfig getter/setter pair in favor of a directly-exported Engine.Config field. With the mutex already dropped per cross-review, the methods were a no-op wrapper. Field doc comment captures the set-once contract.
3. User (engine/types.go:92): trimmed the ExecutionConfig doc comment from 10 lines to 2: "carries process-wide dependencies; fields are nil when the corresponding subsystem isn't configured."
4. User (server/keyring.go:25): tightened comments throughout the file. Backend-aliases comment dropped 4 lines → 2. AllowedBackends comment dropped 3 lines → 1. Smoke-test retry comment dropped 5 → 2. OpenKeyring godoc dropped 12 lines → 4. SmokeTestKeyring godoc dropped 7 → 4. redactPassphrase godoc dropped 4 → 2. kr.List() comment dropped 3 lines → 1. All WHYs preserved; padding removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addressed Cursor + line comments (commit `bb26ba6`)
Gates
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit bb26ba6.
…ing)

Cursor bugbot flagged that eng.Config = execCfg at serve.go:127 races with goroutines spawned by NewEngine's auto-rehydration of stale tasks. Today no handler reads Config so no race manifests, but #163 will wire sign-tx handlers that consume Config.Keyring, and the race becomes real the moment that lands.

Fix the ordering before it can bite:
- NewEngine becomes a pure constructor; no goroutine spawn.
- RehydrateStaleTasks is now an exported method the caller invokes explicitly after installing Config. Doc comment says "must be called only after Config is installed."
- serve.go calls eng.Config = execCfg then eng.RehydrateStaleTasks(); the goroutine-spawn happens-after the field write, so the Go memory model guarantees rehydrated handlers see the installed Config without any synchronization.
- Two tests that depended on auto-rehydration (engine_test.go and engine_e2e_test.go stale-task cases) updated to call RehydrateStaleTasks() explicitly.
- All other 14 NewEngine call sites are unaffected: they use empty stores so rehydration was already a no-op.

go test -race ./sidecar/engine/... clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addressed Cursor finding (commit `7e8670d`)
"Config write races with rehydration goroutines from NewEngine": real, well-spotted. Same trap the k8s-specialist reviewer flagged earlier as a deferred-to-#163 follow-up. Locking it in here is cheaper than carrying the future-race forward.
Fix
`NewEngine` was spawning rehydration goroutines as a side effect of construction; the caller then wrote `eng.Config = execCfg` after the goroutines were already live. Today the handlers don't touch `Config` so no race manifested, but #163 will wire sign-tx handlers that read `Config.Keyring`, and the race becomes real then. Made rehydration explicit:
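The ordering can be sketched with simplified types. This is an assumption-laden illustration, not the sidecar's engine: `Config`, `Engine`, the `stale` slice, and the `Wait` helper are hypothetical stand-ins, and only the names `NewEngine`, `Config`, and `RehydrateStaleTasks` come from the PR. The point it shows is that a field written before a `go` statement is visible to the goroutine that statement starts.

```go
package main

import (
	"fmt"
	"sync"
)

// Config and Engine are simplified, hypothetical stand-ins for the
// sidecar's engine types.
type Config struct{ Keyring string }

type Engine struct {
	Config Config // set once, before RehydrateStaleTasks is called

	stale []string
	wg    sync.WaitGroup
	mu    sync.Mutex
	seen  []string
}

// NewEngine only constructs the value; it starts no goroutines.
func NewEngine(stale []string) *Engine { return &Engine{stale: stale} }

// RehydrateStaleTasks spawns one goroutine per stale task. It must be called
// only after Config is installed: the write to Config happens before each
// `go` statement, so the goroutines may read Config without locking.
func (e *Engine) RehydrateStaleTasks() {
	for _, id := range e.stale {
		e.wg.Add(1)
		go func(id string) {
			defer e.wg.Done()
			kr := e.Config.Keyring // unsynchronized read, safe by ordering
			e.mu.Lock()
			e.seen = append(e.seen, id+":"+kr)
			e.mu.Unlock()
		}(id)
	}
}

// Wait blocks until all rehydrated tasks finish and returns what they saw.
func (e *Engine) Wait() []string {
	e.wg.Wait()
	return e.seen
}

func main() {
	eng := NewEngine([]string{"task-1", "task-2"})
	eng.Config = Config{Keyring: "file"} // install Config first
	eng.RehydrateStaleTasks()            // then spawn goroutines
	fmt.Println(len(eng.Wait()))
}
```

Reversing the two calls in `main` would reintroduce the race that `go test -race` is meant to catch once a handler actually reads `Config`.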
Call-site impact
Gates
Net diff: +15 / -10.
Release the genesis-overrides feature (PR #181) plus the other six commits that have accumulated since v0.0.49: trusted-header authn (#179), gov-software-upgrade handler (#177), gov-vote handler (#175), `/tx?hash=` discriminator hardening (#173), sign-tx foundation (#170), production keyring backend (#168). The Releaser workflow (.github/workflows/uci-release-publish.yml) triggers on push to version.json and tags this commit as v0.0.50.

Summary
docs/design/in-pod-governance-signing.md, Component B).
Discretionary decisions flagged for review
These weren't fully prescribed; cross-reviewers should sanity-check:
- `ExecutionConfig` on `Engine` with Set/Get accessors, not as a `NewEngine` constructor argument. Rationale: ~16 NewEngine call sites in tests; signature change would force churn for a field no handler reads yet. Mutex-guarded accessor pair added. Alternative considered: `NewEngineWithConfig` second constructor, rejected (duplicates constructor surface). Wiring to handlers belongs to feat(sidecar): sign-tx task family (gov vote / submit-proposal / deposit) #163.
- `OpenKeyring` strips a trailing `keyring-file` segment from the supplied dir. The SDK's file backend internally appends `keyring-file` to the rootDir, so a caller passing `/sei/keyring-file` would land at `/sei/keyring-file/keyring-file`. The strip is a guardrail matching both possible operator mental models. Design's sketch used `filepath.Dir(dir)` unconditionally; I made it conditional on the suffix to keep `test`/`os` backends (which receive an arbitrary dir) unaffected.
- `SmokeTestKeyring` recovers from `panic` in the underlying 99designs/keyring lib. The lib panics on a non-directory FileDir (discovered during testing). Recovery wraps the panic into a smoke-test error so a bad mount path produces fail-fast logging rather than a SIGSEGV.
- Redaction = verbatim `strings.ReplaceAll` of the passphrase at the error-return boundary inside `OpenKeyring`. Defensive against future upstream regressions; the SDK doesn't echo the passphrase in any current code path. Wraps with `%s` (not `%w`) at that boundary to sever any chain that might carry the passphrase via a `%v` of an internal struct.
- Passphrase `os.Unsetenv` happens before the error check, on every code path through `OpenKeyring`. Deferred-unset alternative would risk leaking the value across an early-return.
- No stdout-capturing test for the `keyring opened` log line. The single log statement contains backend + dir only, with no passphrase. The contract is asserted indirectly via code review of all call sites plus the `redactPassphrase` unit test. A captured-output test would be brittle against `seilog`'s writer wiring.
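The conditional suffix strip described in the list above could look roughly like this. It is a sketch under assumptions: `normalizeKeyringDir` is a hypothetical name, and cleaning the path with `filepath.Clean` before comparing is a choice made for this example rather than something the PR states.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// fileBackendSubdir is the segment the SDK's file backend appends to rootDir,
// per the PR description.
const fileBackendSubdir = "keyring-file"

// normalizeKeyringDir returns the root directory to hand to the keyring
// constructor. For the file backend a trailing "keyring-file" segment is
// stripped so the appended segment does not end up doubled. Other backends
// receive the directory unchanged.
func normalizeKeyringDir(backend, dir string) string {
	if backend != "file" {
		return dir
	}
	clean := filepath.Clean(dir)
	if filepath.Base(clean) == fileBackendSubdir {
		return filepath.Dir(clean)
	}
	return clean
}

func main() {
	for _, tc := range []struct{ backend, dir string }{
		{"file", "/sei/keyring-file"},
		{"file", "/sei"},
		{"test", "/tmp/keyring-file"},
	} {
		fmt.Printf("%s %s -> %s\n", tc.backend, tc.dir, normalizeKeyringDir(tc.backend, tc.dir))
	}
}
```

With this shape, an operator who passes either `/sei` or `/sei/keyring-file` for the file backend ends up with the same root directory.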
Spec gaps surfaced
Test plan
Cross-review
This PR is being reviewed in parallel by platform-engineer, kubernetes-specialist, and security-specialist before being marked ready for merge. Findings will be synthesized into a single resolution comment.
🤖 Generated with Claude Code
Note
Medium Risk
Introduces new startup-time secret/env handling and Cosmos SDK keyring initialization, which can affect sidecar boot behavior and touches sensitive passphrase redaction/wiping logic. Also changes engine startup sequencing (explicit stale-task rehydration) which could impact crash-recovery behavior if misordered.
Overview
Adds opt-in production keyring backend support for the sidecar:
`serve` now reads `SEI_KEYRING_BACKEND`/`SEI_KEYRING_DIR`/`SEI_KEYRING_PASSPHRASE`, opens and smoke-tests a Cosmos SDK keyring, and wipes the passphrase env to reduce exposure in `/proc/<pid>/environ`.
Introduces `engine.ExecutionConfig` (currently holding `Keyring`) and stores it on `Engine`, and adjusts engine startup so stale-task recovery is triggered explicitly via `RehydrateStaleTasks()` after config is installed.
Adds `server.OpenKeyring` + `SmokeTestKeyring` utilities with backend validation, directory normalization for the file backend, bounded retry, panic recovery, and passphrase redaction; includes unit tests plus new operator-facing docs in `sidecar/docs/keyring.md`. Also documents that `generate-gentx` continues using an isolated `BackendTest` keyring (no production keyring reuse).
Reviewed by Cursor Bugbot for commit 7e8670d.