Skip to content

MAF-Inspired Framework Improvements#3

Merged
eanzhao merged 1 commit into
devfrom
feature/maf-inspire
Feb 19, 2026
Merged

MAF-Inspired Framework Improvements#3
eanzhao merged 1 commit into
devfrom
feature/maf-inspire

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Feb 16, 2026

MAF-Inspired Framework Improvements

基于 Microsoft Agent Framework (MAF) 调研,对 Aevatar 框架进行 6 项改进,涵盖中间件管线、可观测性、人机交互、工具审批、编排模式和表达式语言。

变更概览

改进项 优先级 新增文件 修改文件 涉及层
Agent Middleware Pipeline HIGH 4 3 AI.Abstractions / AI.Core
OpenTelemetry GenAI Conventions HIGH 4 1 AI.Core / Foundation.Runtime
Human-in-the-Loop MEDIUM 3 5 Workflow.Core / Projection / AGUI
Tool Approval Mechanism MEDIUM 1 1 AI.Abstractions
Orchestration Patterns LOW 5 0 Foundation.Core
Expression Language LOW 1 1 Workflow.Core

1. Agent Middleware Pipeline

问题

原有 Hook 系统 (IAIGAgentExecutionHook) 是 best-effort 执行,无法修改或拦截 Agent 输入输出。MAF 提供三层中间件(Agent Run / Function Calling / IChatClient),支持安全过滤、限流、内容审查等横切关注点。

方案

新增三层 Middleware 接口,与 Hook 系统共存(Hook 做观测,Middleware 做拦截):

  • IAgentRunMiddleware — 包裹整个 Agent Chat 执行,可修改用户输入、短路返回
  • IToolCallMiddleware — 包裹每次 Tool 执行,可验证参数、覆盖结果、阻止调用
  • ILLMCallMiddleware — 包裹每次 LLM 调用,可修改 messages、注入 system prompt、变换响应

中间件通过 DI 注册,遵循 ASP.NET Core 中间件模式(next() 递归链)。

补充约定:

  • ChatAsyncChatStreamAsync 均会进入 IAgentRunMiddleware 链。
  • 流式路径中,真实的 provider.ChatStreamAsync(...) 会在 ILLMCallMiddleware 链的 next() 内执行,确保观测/限流/缓存等横切逻辑对流式调用同样生效。

新增文件

  • src/Aevatar.AI.Abstractions/Middleware/IAgentRunMiddleware.cs
  • src/Aevatar.AI.Abstractions/Middleware/IToolCallMiddleware.cs
  • src/Aevatar.AI.Abstractions/Middleware/ILLMCallMiddleware.cs
  • src/Aevatar.AI.Core/Middleware/MiddlewarePipeline.cs

修改文件

  • src/Aevatar.AI.Core/Tools/ToolCallLoop.cs — 集成 Tool + LLM 中间件
  • src/Aevatar.AI.Core/Chat/ChatRuntime.cs — 集成 Agent Run + LLM 中间件
  • src/Aevatar.AI.Core/AIGAgentBase.cs — 从 DI 收集中间件并注入 Runtime

使用示例

// 注册安全过滤中间件
services.AddSingleton<IAgentRunMiddleware, ContentSafetyMiddleware>();
services.AddSingleton<IToolCallMiddleware, ToolAuditMiddleware>();
services.AddSingleton<ILLMCallMiddleware, PromptInjectionGuardMiddleware>();

2. OpenTelemetry GenAI Semantic Conventions

问题

原有 ActivitySource 仅在 LocalActor 层记录事件处理 span,缺少 LLM 调用和 Tool 执行的标准化 span 和 metrics。MAF 遵循 OpenTelemetry GenAI Semantic Conventions 发射 invoke_agent / chat / execute_tool 标准 span。

方案

以内置 Middleware 形式实现可观测性(GenAIObservabilityMiddleware),同时实现三个 Middleware 接口:

Span 类型(GenAI 语义规范):

  • invoke_agent <name> — 每次 Agent Run,包含 gen_ai.agent.id / gen_ai.agent.name / gen_ai.provider.name
  • chat <model> — 每次 LLM 调用,包含 gen_ai.request.model / gen_ai.provider.name / gen_ai.usage.*_tokens
  • execute_tool <name> — 每次 Tool 调用,包含 gen_ai.tool.name / gen_ai.tool.call_id(span kind 为 INTERNAL

Metrics(直方图):

  • gen_ai.client.token.usage — Token 消耗
  • gen_ai.client.operation.duration — LLM 调用耗时
  • aevatar.tool.invocation.duration — Tool 调用耗时

敏感数据控制:通过 GenAIActivitySource.EnableSensitiveData 控制是否在 span 中包含 prompt/response 内容。

新增文件

  • src/Aevatar.AI.Core/Observability/GenAIActivitySource.cs
  • src/Aevatar.AI.Core/Observability/GenAIObservabilityMiddleware.cs
  • src/Aevatar.Foundation.Runtime/Observability/AevatarObservabilityOptions.cs
  • src/Aevatar.Foundation.Runtime/Observability/GenAIMetrics.cs

修改文件

  • src/Aevatar.Foundation.Runtime/Observability/AevatarActivitySource.cs — 增加 GenAI span 工厂方法

使用示例

// 启用全链路 GenAI 可观测性
var mw = new GenAIObservabilityMiddleware();
services.AddSingleton<IAgentRunMiddleware>(mw);
services.AddSingleton<IToolCallMiddleware>(mw);
services.AddSingleton<ILLMCallMiddleware>(mw);

3. Human-in-the-Loop Workflow Steps

问题

MAF 提供 Question / Confirmation / WaitForInput / RequestExternalInput 四种 Human-in-the-Loop 操作。Aevatar 工作流引擎没有暂停执行等待人工输入的机制。

方案

新增两个工作流步骤模块和暂停/恢复事件:

  • HumanApprovalModule — 处理 type: human_approval,暂停工作流等待 yes/no 审批
  • HumanInputModule — 处理 type: human_input,暂停工作流等待自由文本输入

通过 WorkflowSuspendedEvent / WorkflowResumedEvent 实现暂停恢复,AGUI 层映射为 HUMAN_INPUT_REQUEST 事件供前端渲染。

YAML DSL 扩展

steps:
  - id: approve_report
    type: human_approval
    parameters:
      prompt: "审批生成的报告?"
      timeout: "3600"
      on_reject: fail

  - id: get_context
    type: human_input
    parameters:
      prompt: "请提供补充信息"
      variable: user_context
      timeout: "1800"
      on_timeout: fail

参数约定(最小集):

  • human_approvalprompt / timeout(秒)/ on_reject(默认 fail
  • human_inputprompt / variable / timeout(秒)/ on_timeout(默认 fail

新增文件

  • src/workflow/Aevatar.Workflow.Core/Modules/HumanApprovalModule.cs
  • src/workflow/Aevatar.Workflow.Core/Modules/HumanInputModule.cs
  • src/workflow/Aevatar.Workflow.Projection/Reducers/WorkflowSuspendedEventReducer.cs

修改文件

  • src/workflow/Aevatar.Workflow.Core/cognitive_messages.proto — 新增 WorkflowSuspendedEvent / WorkflowResumedEvent
  • src/Aevatar.Presentation.AGUI/AGUIEvents.cs — 新增 HumanInputRequestEvent / HumanInputResponseEvent
  • src/workflow/Aevatar.Workflow.Presentation.AGUIAdapter/EventEnvelopeToAGUIEventMapper.cs — 新增映射 Handler
  • src/workflow/Aevatar.Workflow.Presentation.AGUIAdapter/DependencyInjection/ServiceCollectionExtensions.cs — 注册
  • src/workflow/Aevatar.Workflow.Core/ServiceCollectionExtensions.cs — 默认注册 human_input / human_approval 步骤模块

4. Tool Approval Mechanism

问题

MAF 的 @tool(approval_mode="always_require") 可阻止 Agent 自主执行敏感工具。Aevatar 的工具系统无审批机制。

方案

IAgentTool 接口增加 ApprovalMode 属性(默认 NeverRequire,向后兼容):

public enum ToolApprovalMode
{
    NeverRequire = 0,   // 立即执行(默认)
    AlwaysRequire = 1,  // 始终需要审批
    Auto = 2,           // 由 Middleware 决定
}

与 Middleware Pipeline 自然组合:IToolCallMiddleware 可检查 context.Tool.ApprovalMode 实现审批逻辑。

新增文件

  • src/Aevatar.AI.Abstractions/ToolProviders/ToolApprovalMode.cs

修改文件

  • src/Aevatar.AI.Abstractions/ToolProviders/IAgentTool.cs — 增加 ApprovalMode 默认接口属性

5. Reusable Multi-Agent Orchestration Patterns

问题

MAF 将 5 种编排模式打包为可复用组件。Aevatar 通过 Workflow 模块实现类似能力,但与 YAML 引擎紧耦合,无法在无 YAML 场景下程序化使用。

方案

抽取 4 种编排模式为 Aevatar.Foundation.Core.Orchestration 命名空间下的一等公民:

模式 类名 行为
Sequential SequentialOrchestration 链式:A → B → C
Concurrent ConcurrentOrchestration 并行扇出 + 合并
Vote VoteOrchestration 并行执行 + 投票选最佳
Handoff HandoffOrchestration 动态 Agent 控制转移

所有模式基于 IOrchestration<TInput, TOutput> 接口,接受 Func<string, string, CancellationToken, Task<string>> 作为 Agent 执行委托,与具体 Agent 实现解耦。

新增文件

  • src/Aevatar.Foundation.Core/Orchestration/IOrchestration.cs
  • src/Aevatar.Foundation.Core/Orchestration/SequentialOrchestration.cs
  • src/Aevatar.Foundation.Core/Orchestration/ConcurrentOrchestration.cs
  • src/Aevatar.Foundation.Core/Orchestration/VoteOrchestration.cs
  • src/Aevatar.Foundation.Core/Orchestration/HandoffOrchestration.cs

6. Declarative Workflow Expression Language

问题

MAF 的声明式工作流使用 PowerFx 表达式语言。Aevatar YAML DSL 有变量替换但缺少运行时表达式求值。

方案

新增轻量级 ${...} 表达式求值器,支持:

语法 功能 示例
${name} 变量引用 ${variables.user}
${if(...)} 条件 ${if(variables.age, 'adult', 'minor')}
${concat(...)} 拼接 ${concat('Hello, ', variables.name)}
${isBlank(...)} 空值检测 ${isBlank(variables.input)}
${length(...)} 长度 ${length(variables.text)}
${and/or/not(...)} 布尔逻辑 ${and(variables.a, variables.b)}
${upper/lower/trim(...)} 字符串函数 ${upper(variables.name)}

支持嵌套函数调用和引号字面量,参数分隔正确处理括号嵌套。

新增文件

  • src/workflow/Aevatar.Workflow.Core/Expressions/WorkflowExpressionEvaluator.cs

修改文件

  • src/workflow/Aevatar.Workflow.Core/Modules/WorkflowLoopModule.cs — 在派发 StepRequestEvent 前对步骤参数执行 ${...} 求值

运行时变量与求值时机

  • 求值时机WorkflowLoopModule.DispatchStep(...) 发布 StepRequestEvent 前,对 step.parameters 的 value 做一次模板替换(${...})。
  • 变量来源:每个 run_id 维护独立变量字典,默认包含:
    • input:当前步骤的输入(上一步输出)
    • <stepId>:已完成步骤的输出(同名 key),可用 ${first_step}${variables.first_step} 引用

验证

dotnet build aevatar.slnx --nologo    # 0 错误
dotnet test aevatar.slnx --nologo     # 所有新增测试通过

@eanzhao eanzhao merged commit a2dc54a into dev Feb 19, 2026
1 check passed
eanzhao added a commit that referenced this pull request Apr 27, 2026
…ServiceId; share contract math

Addresses PR #457 review.

## Functional fix (the inline review): InvokePath / invoke handler mismatch

The contract returned by the new `GET /members/.../endpoints/.../contract`
was telling the frontend to call `/members/{memberId}/invoke/...`, but the
existing platform handler for that path resolves the member through
`IMemberPublishedServiceResolver` which today returns
`publishedServiceId == memberId`. Studio's bind path persists
`publishedServiceId == "member-{memberId}"`. So the contract was built for
`member-{memberId}` while invoke would target `{memberId}` → 404.

Fix: register `StudioAwareMemberPublishedServiceResolver` from Studio's
DI. It first asks `IStudioMemberQueryPort` for the member's stored
`publishedServiceId`; if no Studio member exists, falls back to the
legacy deterministic mapping (`memberId == publishedServiceId`) so
direct platform binds keep working unchanged. Now contract / activate /
retire / invoke / runs all resolve to the same identity.

## Refactors per the PR review

- **#1 Duplicated contract-building logic**: extracted the pure
  helpers (`ResolveCurrentContractRevision`,
  `EnumeratePreferredContractRevisionIds`, `RevisionContainsEndpoint`,
  `IsChatEndpoint`, `ResolveStreamFrameFormat`,
  `BuildBase64PayloadPlaceholder`, `BuildTypedInvokeRequestExampleBody`)
  into `Aevatar.GAgentService.Abstractions.Services.ServiceEndpointContractMath`.
  Both `ScopeServiceEndpoints.cs` (legacy) and `StudioMemberService.cs`
  (member-first) funnel through it. A bug fix in one helper now
  propagates to both paths automatically.

- **#3 / #4 Repeated resolve+verify pattern**: introduced
  `ResolveBoundServiceContextAsync` returning
  `(ScopeId, MemberId, PublishedServiceId, Identity, Service, Revisions)`.
  The three new methods now all share one query path; activate /
  retire dropped from 4 platform queries to 2.

- **#2 Non-atomic activate**: documented with a `NOTE:` comment that
  `SetDefaultServingRevision` then `ActivateServiceRevision` is
  intentionally non-transactional, mirroring the legacy scope-default
  behavior, and that both commands are platform-side idempotent.

- **#7 Hardcoded "retired" string**: introduced
  `MemberRevisionLifecycleStatusNames.Retired` next to the existing
  `MemberLifecycleStageNames` so future lifecycle verbs declare
  themselves alongside it instead of as scattered magic strings.

- **#6 / #8 Input trimming**: collapsed the four ad-hoc trimming sites
  into a single `NormalizeRequired(value, fieldName)` helper applied at
  the service entry of every public method. Trimming now happens at
  exactly one boundary per call.

## Tests

- 13 new tests pin the resolver's contract (Studio member → stored
  publishedServiceId; non-Studio member → legacy fallback; trim;
  reject malformed input; empty publishedServiceId degrades safely).
- Existing tests unchanged: 327 Studio + 281 platform integration
  passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 27, 2026
P1 — workflow termination on publish failure (reviewer's critical):
`PublishFailureAsync` emits `StepCompletedEvent { Success = false }`,
which the kernel routes through `TryRetryAsync` → `TryOnErrorAsync` →
fail. With no retry/on_error policy on `publish_to_twitter`, a Twitter
401/403/429/5xx terminated the entire workflow run as failed. Add
`on_error: { strategy: skip, default_output: "twitter_publish_failed" }`
to the YAML so the run advances to `done` cleanly; the module already
surfaces categorized errors to Lark independently.

#2 — Twitter v2 native error shape: `ClassifyTwitterResponse` now
recognizes the third response shape NyxID can forward verbatim:
`{ "title": "...", "detail": "...", "errors": [...] }` (Twitter's
native problem-details for content-policy / duplicate-tweet rejections).
Falls through to `twitter_publish_rejected` with the Twitter `message`
text in the Lark surfacing so users read the actual rejection reason.

#1 — Duplicate tweet risk: documented in code comment that
`POST /2/tweets` has no server-side dedup; the social_media template
intentionally has no `retry` policy on this step, and `on_error: skip`
advances rather than retrying. Authors customizing the YAML must keep
this invariant.

#3 — Removed redundant `nyxClient!` null-forgiving (no-op cleanup).

#4 — Renamed `ChannelMetadataKeys.LarkProxySlug` →
`LarkOutboundProxySlug` (`channel.lark.outbound_proxy_slug`) to
disambiguate "Lark API surface" from "NyxID provider routing".

#5 — Added xml-doc on `TrySendLarkAsync` documenting the dual-scope
api-key dependency (key must carry both api-twitter AND api-lark-bot
entitlements) so future callers don't silently break Lark surfacing
when narrowing the key's scope.

#6 — Added `RequiredServiceSlugs` field to `SocialMediaTemplateSpec`
for parity with `DailyReportTemplateSpec`; `CreateSocialMediaAgentAsync`
now reads slugs from the spec instead of inlining the list.

Tests:
- 3 new `ClassifyTwitterResponse` tests for the Twitter native error
  shapes (errors-array, RFC-7807 title/detail-only, empty-object
  unexpected-shape).
- Existing social_media test now also asserts `strategy: skip` lands in
  the upserted YAML.
- 482 channel-runtime + 236 workflow.core tests pass; full solution
  builds with 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 27, 2026
…elper

Resolves three followup architecture-review points on PR #451:

### Channel.Runtime drops AI / Workflow direct deps (review #2)

`Channel.Runtime`'s csproj used to pull in `Aevatar.AI.Abstractions` and
`Aevatar.Workflow.Application.Abstractions` because of two files that
straddled the channel/AI and channel/workflow boundary:

- `ChannelContextMiddleware` (an `ILLMCallMiddleware` impl) — moved to
  `Aevatar.GAgents.NyxidChat`, which is the only package that needs it
  and already references `AI.Abstractions`. NyxidChat SCE registers it
  for the LLM call pipeline; Channel.Runtime SCE no longer touches
  `ILLMCallMiddleware`.
- `ChannelCardActionRouting` (builds `WorkflowResumeCommand`) — moved
  to `Aevatar.GAgents.NyxidChat` for the same reason. Its sole consumer
  (`ChannelConversationTurnRunner`) lives there too.

`Channel.Runtime.csproj` now references only `Channel.Abstractions`,
`Foundation.Abstractions`/`Core`, and the `CQRS.Projection.*` slice —
matching the "channel-agnostic flow + projection infrastructure"
charter from the RFC. Tests (`ChannelCardActionRoutingTests`) get the
extra `using Aevatar.GAgents.NyxidChat;`.

### Extract Elasticsearch projection-store toggle helper (review #4)

The `ResolveElasticsearchEnabled` + `BuildElasticsearchOptions` helper
pair was duplicated three times (Channel.Runtime / Device / Scheduled
SCEs) with slightly different log strings and `Console.Error.WriteLine`
output. Centralized into
`Aevatar.CQRS.Projection.Providers.Elasticsearch.DependencyInjection.ElasticsearchProjectionConfiguration`
with two static helpers:

- `IsEnabled(IConfiguration?, ILogger?, string? storeName)` — explicit
  flag → endpoints presence → false; logs a structured warning via
  `ILogger` (when supplied) instead of `Console.Error.WriteLine`.
- `BindOptions(IConfiguration)` — typed binder for
  `ElasticsearchProjectionDocumentStoreOptions`.

All three SCEs now call into this helper; per-package warning text is
parameterized via `storeName`. Section path
(`Projection:Document:Providers:Elasticsearch`) is exposed as a const
so future call sites stay in sync.

### Followup points acknowledged but deferred

- **Cross-package dep chain `NyxidChat → Authoring.Lark → Scheduled →
  Platform.Lark`** (review #1) — pre-existing arch debt that the split
  surfaced rather than introduced. Cleaner would be to invert via
  `IInboundFlowResolver` plug-ins so `ChannelConversationTurnRunner`
  doesn't reach into `AgentBuilderCardFlow` directly. Out of scope for
  the package split; tracking as a separate follow-up.
- **Tombstone compactor "central coordinator"** (review #3) —
  `Channel.Runtime` defines `ITombstoneCompactionTarget` but does not
  reference `Device` / `Scheduled` at the csproj level; per-package
  targets register themselves through DI. The plug-in pattern is
  intentional and keeps the DAG one-way.
- **`Scheduled` package name vs UserAgentCatalog content** (review #5)
  — `UserAgentCatalog` is the delivery-target registry that Scheduled
  agents read at execution time to route output, so co-locating it
  with `SkillRunnerGAgent` / `WorkflowAgentGAgent` is intentional.
  Renaming to `AgentCatalog` would split actors from their primary
  consumer; deferring.

473/473 ChannelRuntime.Tests pass; full slnx still only fails the same
two pre-existing Mainnet hosting `BindAsync on IStudioMemberService`
tests that reproduce on origin/dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 29, 2026
…ash-command

Wires up all six pieces the ADR called out as parallel-capable now that
NyxID#549 has shipped, so per-user binding stops requiring a follow-up PR
per layer:

#1 — AuthContext.external_subject (Channel.Abstractions)
    Move ExternalSubjectRef out of Identity.Abstractions into
    Channel.Abstractions where AuthContext lives. Identity is a consumer
    of channel concepts, not their owner; the new typed
    AuthContext.external_subject = 4 field is the broker-mode outbound
    identity carrier (ADR §Outbound Send) and the legacy string
    user_credential_ref keeps working for non-broker callers. Adds a
    new AuthContext.OnBehalfOfExternalSubject helper.

#2 — Projection chain
    ExternalIdentityBindingDocument (proto) + Partial (IProjectionReadModel)
    + MetadataProvider + MaterializationContext + Projector
    (ICurrentStateProjectionMaterializer) + ExternalIdentityBindingProjectionQueryPort
    + ExternalIdentityBindingProjectionReadinessPort (polling impl, write-side
    completion path only — ADR §Projection Readiness explicitly allows this).
    AddChannelIdentityProjection registers the lot.

#3 — NyxIdRemoteCapabilityBroker
    HTTP-backed INyxIdCapabilityBroker + INyxIdBrokerCallbackClient against
    the NyxID#549 wire shape: /oauth/authorize URL building, RFC 8693
    token-exchange with subject_token_type=urn:nyxid:params:oauth:token-type:binding-id,
    /oauth/bindings/{id} delete, authorization-code -> binding_id exchange.
    Maps invalid_grant -> BindingRevokedException so the upper layer
    event-source-revokes the local binding. PkceHelper covers RFC 7636 S256;
    StateTokenCodec seals correlation+verifier+exp into an HMAC token (kid
    header, deterministic Protobuf payload — ADR §Implementation Notes #1).
    AddNyxIdRemoteCapabilityBroker wires HttpClient + options + codec +
    both interfaces.

#4 — /api/oauth/nyxid-callback endpoint (IdentityOAuthEndpoints)
    Decodes state, exchanges code -> binding_id, dispatches CommitBindingCommand
    to ExternalIdentityBindingGAgent via IActorRuntime, waits on the projection
    readiness port, returns a friendly bind-confirmation page (id_token decoded
    locally for the display name — no /oauth/userinfo round-trip per ADR L61).
    Classifies error UX by HTTP status (ADR §Implementation Notes #3):
    400 for state-token issues, 502 for broker exchange failure, 200 with a
    "binding pending propagation" message on projection timeout.

#5 — Slash-command routing in ChannelConversationTurnRunner
    /init and /unbind are handled before the LLM by a new
    TryHandleSlashCommandAsync that resolves identity ports lazily through
    the existing IServiceProvider — deployments without per-user binding
    fall through unchanged. /init replies with the authorize URL (private DM
    only — ADR §Decision); /unbind resolves binding_id, calls
    broker.RevokeBindingAsync, replies with the unbind confirmation. The
    runner short-circuits via the existing SendReplyAsync, so reply
    delivery rides the relay outbound port like any other reply.

#6 — /api/webhooks/nyxid-broker-revocation receiver (Continuous Access Evaluation)
    BrokerRevocationWebhookValidator verifies HMAC-SHA256 over the raw body
    (X-NyxID-Signature: sha256=<hex>), parses the JSON envelope into a
    typed BrokerRevocationNotification, and the endpoint event-source-revokes
    the local binding actor. Aligns with NyxID#549 V2-7's CAE channel — when
    a user revokes from NyxID's console the binding goes inactive in seconds
    rather than waiting for the 5-min access TTL.

Tests:
- 591 ChannelRuntime.Tests pass (39 of which are the new Identity-tagged
  tests covering actor, projector, broker fake, state-token codec, and
  ExternalSubjectRef extension).
- Full solution `dotnet build aevatar.slnx` is clean.
- `tools/docs/lint.sh` 33 file(s) checked, 0 errors.

The ChannelConversationTurnRunner integration is feature-flag-shaped: if
neither IExternalIdentityBindingQueryPort nor INyxIdCapabilityBroker is
registered, the slash-command path is a no-op and existing
bot-owner-shared behaviour is preserved. Production rollout is a DI
toggle (AddChannelIdentityProjection + AddNyxIdRemoteCapabilityBroker
+ MapIdentityOAuthEndpoints) plus the Bot-Owner-Shared termination
strategy choice from the ADR §Bot-Owner-Shared 模式终止策略.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 29, 2026
Five round-2 follow-ups from the Codex consensus:

#5 Orphan binding leak on actor-activation failure (resolved):
  The OAuth callback's already-bound branch revoked the orphan
  binding_id; the actor-activation-failed branch (when the local actor
  cannot be created post-exchange) silently returned 503 and left the
  freshly issued binding_id active at NyxID. Now the same
  TryRevokeOrphanBindingAsync helper runs before the 503 response so
  both leak paths cleanup symmetrically.

#3 NyxIdAuthorityResolver footgun on staging clusters (resolved):
  Resolve() now takes an optional ILogger and emits a warning when
  AEVATAR_NYXID_AUTHORITY is unset AND ASPNETCORE_ENVIRONMENT /
  DOTNET_ENVIRONMENT indicate a non-Development environment. Local dev
  remains silent (env is empty / starts with "dev" / "local"); staging
  / qa / prod operators see the warning so a forgotten env-var doesn't
  silently register clients against production NyxID.

#1 csproj coupling — store wiring belongs to the host (resolved):
  Split AddChannelIdentity into two extension methods. AddChannelIdentity
  now registers actors / projector / broker / slash commands but NOT
  document stores. AddChannelIdentityProjectionStores wires the ES vs
  InMemory choice and is called by the composition root (Mainnet host
  now invokes both methods). Tests / demos can mix and match — agent
  module never decides on the host's behalf which physical store to use.

#2 kid rotation grace window (resolved):
  HMAC key rotation previously invalidated all in-flight state tokens
  (kid hardcoded "v1", no multi-key verification). The actor now carries
  current_kid + previous_kid + previous_demoted_at_unix on its state +
  projection. Encode signs with the snapshot's current kid; decode tries
  current first, then previous when demoted_at + state_token_lifetime
  is still in the future. A v1→v2 rotation produces deterministic kid
  succession (parses + increments the integer suffix) so verifiers can
  route signed tokens to the right key.

#4 HMAC key still in projection (partial — code unchanged, ADR
  hardened):
  The projection-store-as-actor-state-mirror pattern is the established
  shape in this codebase (Channel.Runtime same coupling). Removing the
  hmac_key from the document would require a generic actor query/reply
  pattern that CLAUDE.md explicitly forbids ("禁止 generic actor
  query/reply"). The ADR-0018 §Implementation Notes #1 already
  documents the explicit tradeoff (state_token TTL ≤ 5 min, rotation
  command available, ES index access boundary equals actor event-store
  boundary, KMS migration path noted as a follow-up). The proto
  comment on AevatarOAuthClientDocument now also calls out the
  production access-scoping requirement so a least-privileged ES
  reader can't extract the key. Encryption-at-rest with an envelope
  key from env-var is the natural follow-up if a deployment widens ES
  access beyond the actor event-store boundary.

Tests:
  - StateTokenCodec gains a kid-rotation grace test (decode succeeds
    with previous kid before lifetime expires; fails after).
  - AevatarOAuthClientGAgent gains a rotation-demotion test (verifies
    v1→v2 kid succession + previous-key carry-over) and a first-seed
    test (no previous-key fields populated on initial provision).
  - 800 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 30, 2026
Three actionable items from kimi (the rest of the round were either
fixed in 2f54361 / de82e0a or push-back acknowledgements):

#1 (kimi MAJOR, di) — service-locator anti-pattern:
  ChannelSlashCommandContext exposed IServiceProvider so handlers could
  pull anything they wanted at request time, which CLAUDE.md forbids
  outside framework integration code. Removed Services from the context;
  ModelChannelSlashCommandHandler now takes IUserConfigQueryPort +
  IUserConfigCommandService directly via constructor (both nullable
  optional so deployments without the Studio projection wired up still
  register the handler and surface "not enabled" cleanly). Existing
  /init / /unbind / /whoami handlers were already constructor-injected
  so this is a /model-only refactor + abstraction-shape clean-up.

#2 (kimi MAJOR, security) — webhook validator missed grace window:
  BrokerRevocationWebhookValidator only verified inbound NyxID-signed
  webhooks against snapshot.HmacKey. During a key rotation, NyxID can
  still send webhooks signed with the previous key (their cache /
  in-flight state lags the rotation event), and silently rejecting
  them drops real CAE revocations. Validator now mirrors
  StateTokenCodec.ResolveVerificationKey: try current key first, then
  the previous key when its demoted_at + state_token_lifetime is still
  in the future. Took NyxIdBrokerOptions + TimeProvider via constructor
  to share the same lifetime config as the codec.

#3 (kimi MINOR) — three ADR-0017 → ADR-0018 reference fixes in
  StateTokenCodec, INyxIdBrokerCallbackClient, NyxIdBrokerOptions; the
  earlier renumbering missed these xmldoc comments.

6 new BrokerRevocationWebhookValidatorTests cover the current-key happy
path, forged-signature rejection, previous-key acceptance within grace
window, previous-key rejection after grace window, missing signature
header, and unsupported scheme prefix. 807 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request May 8, 2026
- #15 (major, di): AddOrnnSkills now self-registers NyxIdApiClient and
  NyxIdToolOptions via TryAdd safety nets so workflow hosts that enable
  Ornn without AddNyxIdTools get a clean 404 path on missing BaseUrl
  rather than a confusing DI resolution failure.
- #3 (major, di): Introduce IExternalIdentityBindingProjectionPort and
  inject the interface in IdentityOAuthEndpoints. The concrete class
  remains for runtime composition; the registration also publishes the
  interface as a forwarder so existing tests that hold the concrete type
  keep working.
- #14 + #17 (major, concurrency): TurnStreamingReplySink TOCTOU concern
  was based on a misread — drainSignal is captured inside the same lock
  acquisition as _dispatchInProgress=false in the cap branch, and the
  throttle gate's nextIsFinal=false invariant makes _drainTcs guaranteed
  null on that path. Document the invariants so future readers don't
  re-flag this.
- #20 (major, arch): Replace the singleton Dictionary cache in
  NyxIdLlmServiceCatalogClient with IMemoryCache. Per CLAUDE.md
  "中间层状态约束", per-caller state lives in a host-owned cache, not
  a service field. AbsoluteExpiration policy preserved (30s).
- #9 (minor): /api/v1/proxy/services?per_page=100 was already extracted
  into NyxIdLlmCatalogRoutes.ProxyServicesPath — both call sites already
  use the constant.
- #10 (minor): LooksLikeLlmRouteCandidate already has negative-signal
  filtering plus a two-distinct-weak-signals threshold. No change.
- #11 (minor): ExternalIdentityBindingProjector already logs a Warning
  with DocumentId/EventId/Version on the empty-binding delete path.

Build clean (Ornn / Identity / NyxidChat / Channel.Runtime), updated
catalog client test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request May 12, 2026
…tions

Production changes:
- Migrate response_sessions and tool-state protos from `string *_json` to
  `bytes *_payload`; actors stay Protobuf-only, JSON parsing moves to the
  host boundary (addresses mimo-v2.5-pro blocker findings #2/#3).
- Extract `ResponsesTodoItemParser` in Abstractions so the host adapter
  preview and the actor's persisted view share one parser (addresses
  glm-5.1 finding on duplicated todo parsing).
- Introduce `IWebApiClient` abstraction so `ResponsesAevatarToolProvider`
  depends on an interface (addresses kimi finding on DI inversion).
- Add `WebFetchUrlGuard` to block loopback, private (10/8, 172.16/12,
  192.168/16, 169.254/16, 127/8, 0/8), and link-local addresses before
  WebFetch emits an HTTP request (addresses codex blocker #1 on SSRF /
  bearer token exfiltration via attacker-controlled URLs).

Test changes:
- Update existing Response* tests to track the new proto shape.
- Add `WebFetchUrlGuardTests` covering scheme rejection, IPv4/IPv6
  private ranges, IPv4-mapped IPv6, hostname loopback variants, and
  the public-host accept path.
- Add `ResponsesTodoItemParserTests` covering todos-array, single
  string/object, content-key fallback chain, numeric id, malformed
  JSON, and whitespace normalization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant