Fix NyxID relay callback authentication#368
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b5f51393b4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Review 对照 issue #366 — 不能 close #366整体方向对,validator 重写得很干净(错误码与 issue §1 表格 12 项逐项对齐、JWKS kid miss cooldown + single-flight via 🔴 阻塞项(必须修复)1. NyxID 侧的 callback JWT 还没有发布,PR 是硬切 + 无 fallback。
2. Replay guard 在生产 DI 里没注册,replay 防护实际上被关掉了。
🟠 与 issue 验收标准的明显偏差3.
4.
5.
🟡 次级问题6. Token 清理只覆盖了 success / non-retryable failure。
7. 架构守卫的 ripgrep 模式过宽。
8. Projection OCC 端到端测试未覆盖。
✅ 做得好的部分
建议下一步
这些处理完之后 #366 才适合 close。当前状态可以作为主干推进,但不建议直接 close issue。 |
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## dev #368 +/- ##
=======================================
Coverage 70.04% 70.04%
=======================================
Files 1155 1155
Lines 82584 82584
Branches 10868 10868
=======================================
Hits 57844 57844
Misses 20562 20562
Partials 4178 4178
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Document NyxRelayInboundActivity as an actor-inbox-only envelope that must never enter persisted state, then add a regression test that runs a relay inbound + LLM reply turn with a sentinel reply_token and asserts the byte sequence never appears in any persisted event payload.
Re-review after
|
| Review 项 | 状态 | 备注 |
|---|---|---|
| 🔴 Replay guard 生产 DI 缺失 | ✅ 已修 | INyxIdRelayReplayGuard 在 Aevatar.GAgents.ChannelRuntime/ServiceCollectionExtensions.cs 与 Aevatar.GAgents.NyxidChat/ServiceCollectionExtensions.cs 都重新 TryAddSingleton,CallbackReplayWindowSeconds 默认 6 分钟,并加了 NyxIdChatServiceCollectionExtensionsTests + ServiceCollectionExtensionsTests 两处 DI 注册断言。validator Fail() 也走同一通道发指标。 |
🟠 credential_ref proto 留空壳 |
✅ 已修 | ChannelTransportBinding / ChannelBotRegistrationEntry / ChannelBotRegisterCommand / ChannelBotRegistrationDocument 都 reserved 2/9/12/13; + reserved "credential_ref",全调用链(projector / query port / GAgent / tool / provisioning / mirror repair request / prompt 文案)一并删除。新增 proto compatibility 与 surface 测试断言 descriptor 找不到该字段。 |
🟠 callback_jwt_validation_failures_total{reason} 指标 |
✅ 已修 | 新增 NyxIdRelayMetrics (Meter 名 Aevatar.Channel.NyxIdRelay,counter callback_jwt_validation_failures_total);validator 用 Fail() 集中点统一 emit reason,并加了 MeterListener 断言测试。 |
🟠 NyxRelayInboundActivity.reply_token 与 issue §4 字面冲突 |
✅ 收口 | 我在这次 push (4d9ebea5) 加了两层保险:proto 上明确注释为 actor-inbox-only transient envelope,禁止持久化;并在 ConversationGAgentDedupTests 增加端到端 invariant 测试 — 用 sentinel reply_token 跑完 relay inbound → LLM reply ready → completed 全流程,扫描所有 IEventStore payload bytes,确认 token 字节序列从未落盘。 |
| 🟡 Token cleanup 仅覆盖 success / non-retryable | ✅ 已修 | RemoveNyxRelayReplyToken 现已扩展到 duplicate detect、inbound transient/permanent failure、LLM reply success、permanent failure;新增 ScheduleNyxRelayReplyTokenCleanupAsync + NyxRelayReplyTokenCleanupRequestedEvent durable timeout 兜底 transient retry 链路;capture 与 read 路径都做了 SweepExpiredNyxRelayReplyTokens() lazy GC。 |
| 🟡 架构守卫 ripgrep 模式过宽 | ✅ 已修 | 收窄到 (AddSingleton|TryAddSingleton)<IAevatarSecretsStore,只禁 DI 注册不再误伤普通引用。 |
| Projection OCC 端到端测试 | ⏸️ 仍延后 | 与 #365 配对处理,本 PR 不涉及。 |
| 🔴 NyxID 切换顺序(NyxID#500) | ⏸️ 暂搁置 | 按要求等 NyxID 那边 callback JWT feature 上线 + 我这边二次复核之后再 merge。 |
本地验证:
dotnet test test/Aevatar.AI.Tests462 通过dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests281 通过dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests102 通过(包含新加的HandleNyxRelayInboundActivityAsync_NeverPersistsReplyTokenIntoEventStore)bash tools/ci/architecture_guards.sh通过到 buf lint(本地无 buf,与 PR 描述一致)bash tools/ci/test_stability_guards.sh通过
架构上剩余唯一的开放项就是 NyxID#500 上线 + 灰度顺序确认,按指示等回头来 verify。其余 issue #366 验收标准都收敛了。
NyxID's relay callback JWT does not emit a sub claim, and Aevatar never consumes the resulting ScopeId downstream. Remove the dead ScopeId field from the auth result records along with the unused sub read. Also fill in token_type="relay_callback" on the endpoint coverage test's JWT helper so the validator's new token_type check accepts the synthetic callbacks.
Re-review after
|
Drives streaming edit-in-place of an LLM reply so Lark users see a placeholder within ~1s and incremental updates instead of a 5-30s silent wait. Built on top of the NyxID relay edit endpoint shipped in ChronoAIProject/NyxID#480 / #483. This redesign keeps the reply token inside the conversation actor (per PR #368's security boundary): the inbox runtime only accumulates deltas and signals them to the actor; the actor is the sole caller of the outbound port, holding both the reply token and the placeholder platform_message_id in in-memory runtime state. ## Flow Inbox runtime (async LLM, no outbound port access): LLM stream delta -> TurnStreamingReplySink.OnDeltaAsync (throttle 750ms) -> actor.HandleEventAsync(LlmReplyStreamChunkEvent) LLM stream ends -> TurnStreamingReplySink.FinalizeAsync (bypass throttle) -> actor.HandleEventAsync(LlmReplyStreamChunkEvent) final flush -> actor.HandleEventAsync(LlmReplyReadyEvent) ConversationGAgent (single-threaded turn, owns reply token + streaming state): HandleLlmReplyStreamChunkAsync: - resolve runtimeContext with reply token (same path as HandleLlmReplyReadyAsync) - read per-correlation streaming state (PlatformMessageId, Disabled, EditCount) - if first chunk: runner.RunStreamChunkAsync(chunk, null PMID) -> placeholder send - if subsequent: runner.RunStreamChunkAsync(chunk, PMID) -> edit - on failure: mark state.Disabled and drop further chunks for this turn HandleLlmReplyReadyAsync: - if streaming state is healthy and LLM completed: force final update if final text differs from last flushed, then persist ConversationTurnCompletedEvent directly (no re-send) - otherwise: existing RunLlmReplyAsync fallback path - cleanup streaming state + reply token in both paths ## Architectural rules respected - Reply token never leaves the actor. Inbox runtime can't call the outbound port; it can only signal via EventEnvelope. Actor's _nyxRelayReplyTokens dict remains the single authoritative store. - No middle-layer ID map. Sink per-turn state (throttle timestamp, last emitted text) lives in instance fields on a per-invocation sink; the actor's _nyxRelayStreamingStates dict is actor-owned runtime state, same lifecycle as _nyxRelayReplyTokens. - No generic actor query/reply. Chunk dispatch is fire-and-(awaited)- order-preserving signaling; no back-channel read. - New proto event LlmReplyStreamChunkEvent is a runtime-only signal documented as never-persist; HandleEventAsync dispatches to handler without PersistDomainEventAsync, matching existing event flow. - Strong-typed event evolution: new event instead of overloading LlmReplyReadyEvent with a boolean, keeping semantics single-purpose. ## Degradation policy (v1) Any placeholder or update failure permanently disables streaming for the turn and falls back to the legacy single-shot reply path. Partial stale placeholder may remain visible to the user; the final send arrives as a second message. Transient-failure retry is tracked separately in #371 as backlog. ## Feature flag NyxIdRelayOptions.StreamingRepliesEnabled defaults to true; StreamingFlushIntervalMs defaults to 750ms per the issue spec. ## Verification - dotnet build aevatar.slnx - dotnet test test/Aevatar.GAgents.ChannelRuntime.Tests (298 passed, +17 new) - dotnet test test/Aevatar.AI.Tests (469 passed, +5 new) - dotnet test test/Aevatar.GAgents.Channel.Protocol.Tests (108 passed, +6 new) - bash tools/ci/architecture_guards.sh - bash tools/ci/test_stability_guards.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes #366.
This replaces the NyxID relay callback HMAC/local-secret authentication path with a dedicated callback JWT verified through JWKS. The callback token validates
aud, expiry with clock skew,jti,api_key_id,message_id,platform, and byte-exactbody_sha256binding before the relay payload is accepted.The relay reply token is not persisted in
ChatActivity, actor state, read models, or outbound protobuf contracts. It is carried only on the transient actor-inbox relay envelope, captured intoConversationGAgentruntime memory for the current turn, and cleaned on success, failure, duplicate delivery, token expiry, or actor deactivation. Outbound delivery persists onlyreply_message_idandcorrelation_id; the oldreply_access_tokenprotobuf field is reserved.Rollout Note
This PR intentionally does not restore HMAC fallback or
IAevatarSecretsStoreusage in production paths. Aevatar should be deployed after NyxID callback JWT forwarding is available; callbacks withoutX-NyxID-Callback-Tokenare rejected ascallback_jwt_missing.Root Cause
The previous relay HMAC flow depended on
IAevatarSecretsStore, which stores secrets locally on each node. In multi-pod deployment, provisioning can write the relay secret on one pod while callbacks land on another pod, causing signature verification to use a missing key and fail asinvalid_signature.Changes
kidmiss JWKS refresh cooldown and single-flight behavior.body_sha256validation.correlation_idto the relay callback payload and require it to match JWTjti.jti/message_idvalues.callback_jwt_validation_failures_total{reason}metrics.credential_ref,reply_access_token) and remove dead credential-ref plumbing.IAevatarSecretsStoreregistration.IAevatarSecretsStore.Validation
dotnet build test/Aevatar.AI.Tests/Aevatar.AI.Tests.csproj --nologo --no-restoredotnet build test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --no-restoredotnet build test/Aevatar.GAgents.Channel.Protocol.Tests/Aevatar.GAgents.Channel.Protocol.Tests.csproj --nologo --no-restoredotnet test test/Aevatar.AI.Tests/Aevatar.AI.Tests.csproj --nologodotnet test test/Aevatar.GAgents.ChannelRuntime.Tests/Aevatar.GAgents.ChannelRuntime.Tests.csproj --nologo --no-builddotnet test test/Aevatar.GAgents.Channel.Protocol.Tests/Aevatar.GAgents.Channel.Protocol.Tests.csproj --nologo --no-buildbash tools/ci/test_stability_guards.shgit diff --checkbash tools/ci/architecture_guards.shwas run locally and passed the earlier sub-guards, but stopped at proto lint becausebufis not installed in the local environment. The added secrets-store DI scan was also run directly and returned no hits.