Refactor workflow execution lifecycle and align MessageId semantics#5
Closed
eanzhao wants to merge 1 commit into
Closed
Refactor workflow execution lifecycle and align MessageId semantics#5eanzhao wants to merge 1 commit into
eanzhao wants to merge 1 commit into
Conversation
Enforce per-run execution invariants (owner/YAML fail-fast, RunId == execution actor ID, and metadata run_id consistency) and propagate MessageId on workflow completion to keep streaming/projection correlation stable. Also isolate per-run stats from long-lived workflow actors, add compatibility warnings for direct WorkflowGAgent entry, sync identifier/lifecycle docs, and add a destroy-race memo for follow-up soft-destroy/grace-period design.
This was referenced Apr 13, 2026
This was referenced Apr 21, 2026
eanzhao
added a commit
that referenced
this pull request
Apr 27, 2026
P1 — workflow termination on publish failure (reviewer's critical):
`PublishFailureAsync` emits `StepCompletedEvent { Success = false }`,
which the kernel routes through `TryRetryAsync` → `TryOnErrorAsync` →
fail. With no retry/on_error policy on `publish_to_twitter`, a Twitter
401/403/429/5xx terminated the entire workflow run as failed. Add
`on_error: { strategy: skip, default_output: "twitter_publish_failed" }`
to the YAML so the run advances to `done` cleanly; the module already
surfaces categorized errors to Lark independently.
#2 — Twitter v2 native error shape: `ClassifyTwitterResponse` now
recognizes the third response shape NyxID can forward verbatim:
`{ "title": "...", "detail": "...", "errors": [...] }` (Twitter's
native problem-details for content-policy / duplicate-tweet rejections).
Falls through to `twitter_publish_rejected` with the Twitter `message`
text in the Lark surfacing so users read the actual rejection reason.
#1 — Duplicate tweet risk: documented in code comment that
`POST /2/tweets` has no server-side dedup; the social_media template
intentionally has no `retry` policy on this step, and `on_error: skip`
advances rather than retrying. Authors customizing the YAML must keep
this invariant.
#3 — Removed redundant `nyxClient!` null-forgiving (no-op cleanup).
#4 — Renamed `ChannelMetadataKeys.LarkProxySlug` →
`LarkOutboundProxySlug` (`channel.lark.outbound_proxy_slug`) to
disambiguate "Lark API surface" from "NyxID provider routing".
#5 — Added xml-doc on `TrySendLarkAsync` documenting the dual-scope
api-key dependency (key must carry both api-twitter AND api-lark-bot
entitlements) so future callers don't silently break Lark surfacing
when narrowing the key's scope.
#6 — Added `RequiredServiceSlugs` field to `SocialMediaTemplateSpec`
for parity with `DailyReportTemplateSpec`; `CreateSocialMediaAgentAsync`
now reads slugs from the spec instead of inlining the list.
Tests:
- 3 new `ClassifyTwitterResponse` tests for the Twitter native error
shapes (errors-array, RFC-7807 title/detail-only, empty-object
unexpected-shape).
- Existing social_media test now also asserts `strategy: skip` lands in
the upserted YAML.
- 482 channel-runtime + 236 workflow.core tests pass; full solution
builds with 0 errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao
added a commit
that referenced
this pull request
Apr 27, 2026
…elper Resolves three followup architecture-review points on PR #451: ### Channel.Runtime drops AI / Workflow direct deps (review #2) `Channel.Runtime`'s csproj used to pull in `Aevatar.AI.Abstractions` and `Aevatar.Workflow.Application.Abstractions` because of two files that straddled the channel/AI and channel/workflow boundary: - `ChannelContextMiddleware` (an `ILLMCallMiddleware` impl) — moved to `Aevatar.GAgents.NyxidChat`, which is the only package that needs it and already references `AI.Abstractions`. NyxidChat SCE registers it for the LLM call pipeline; Channel.Runtime SCE no longer touches `ILLMCallMiddleware`. - `ChannelCardActionRouting` (builds `WorkflowResumeCommand`) — moved to `Aevatar.GAgents.NyxidChat` for the same reason. Its sole consumer (`ChannelConversationTurnRunner`) lives there too. `Channel.Runtime.csproj` now references only `Channel.Abstractions`, `Foundation.Abstractions`/`Core`, and the `CQRS.Projection.*` slice — matching the "channel-agnostic flow + projection infrastructure" charter from the RFC. Tests (`ChannelCardActionRoutingTests`) get the extra `using Aevatar.GAgents.NyxidChat;`. ### Extract Elasticsearch projection-store toggle helper (review #4) The `ResolveElasticsearchEnabled` + `BuildElasticsearchOptions` helper pair was duplicated three times (Channel.Runtime / Device / Scheduled SCEs) with slightly different log strings and `Console.Error.WriteLine` output. Centralized into `Aevatar.CQRS.Projection.Providers.Elasticsearch.DependencyInjection.ElasticsearchProjectionConfiguration` with two static helpers: - `IsEnabled(IConfiguration?, ILogger?, string? storeName)` — explicit flag → endpoints presence → false; logs a structured warning via `ILogger` (when supplied) instead of `Console.Error.WriteLine`. - `BindOptions(IConfiguration)` — typed binder for `ElasticsearchProjectionDocumentStoreOptions`. All three SCEs now call into this helper; per-package warning text is parameterized via `storeName`. Section path (`Projection:Document:Providers:Elasticsearch`) is exposed as a const so future call sites stay in sync. ### Followup points acknowledged but deferred - **Cross-package dep chain `NyxidChat → Authoring.Lark → Scheduled → Platform.Lark`** (review #1) — pre-existing arch debt that the split surfaced rather than introduced. Cleaner would be to invert via `IInboundFlowResolver` plug-ins so `ChannelConversationTurnRunner` doesn't reach into `AgentBuilderCardFlow` directly. Out of scope for the package split; tracking as a separate follow-up. - **Tombstone compactor "central coordinator"** (review #3) — `Channel.Runtime` defines `ITombstoneCompactionTarget` but does not reference `Device` / `Scheduled` at the csproj level; per-package targets register themselves through DI. The plug-in pattern is intentional and keeps the DAG one-way. - **`Scheduled` package name vs UserAgentCatalog content** (review #5) — `UserAgentCatalog` is the delivery-target registry that Scheduled agents read at execution time to route output, so co-locating it with `SkillRunnerGAgent` / `WorkflowAgentGAgent` is intentional. Renaming to `AgentCatalog` would split actors from their primary consumer; deferring. 473/473 ChannelRuntime.Tests pass; full slnx still only fails the same two pre-existing Mainnet hosting `BindAsync on IStudioMemberService` tests that reproduce on origin/dev. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 27, 2026
eanzhao
added a commit
that referenced
this pull request
Apr 29, 2026
…ash-command Wires up all six pieces the ADR called out as parallel-capable now that NyxID#549 has shipped, so per-user binding stops requiring a follow-up PR per layer: #1 — AuthContext.external_subject (Channel.Abstractions) Move ExternalSubjectRef out of Identity.Abstractions into Channel.Abstractions where AuthContext lives. Identity is a consumer of channel concepts, not their owner; the new typed AuthContext.external_subject = 4 field is the broker-mode outbound identity carrier (ADR §Outbound Send) and the legacy string user_credential_ref keeps working for non-broker callers. Adds a new AuthContext.OnBehalfOfExternalSubject helper. #2 — Projection chain ExternalIdentityBindingDocument (proto) + Partial (IProjectionReadModel) + MetadataProvider + MaterializationContext + Projector (ICurrentStateProjectionMaterializer) + ExternalIdentityBindingProjectionQueryPort + ExternalIdentityBindingProjectionReadinessPort (polling impl, write-side completion path only — ADR §Projection Readiness explicitly allows this). AddChannelIdentityProjection registers the lot. #3 — NyxIdRemoteCapabilityBroker HTTP-backed INyxIdCapabilityBroker + INyxIdBrokerCallbackClient against the NyxID#549 wire shape: /oauth/authorize URL building, RFC 8693 token-exchange with subject_token_type=urn:nyxid:params:oauth:token-type:binding-id, /oauth/bindings/{id} delete, authorization-code -> binding_id exchange. Maps invalid_grant -> BindingRevokedException so the upper layer event-source-revokes the local binding. PkceHelper covers RFC 7636 S256; StateTokenCodec seals correlation+verifier+exp into an HMAC token (kid header, deterministic Protobuf payload — ADR §Implementation Notes #1). AddNyxIdRemoteCapabilityBroker wires HttpClient + options + codec + both interfaces. #4 — /api/oauth/nyxid-callback endpoint (IdentityOAuthEndpoints) Decodes state, exchanges code -> binding_id, dispatches CommitBindingCommand to ExternalIdentityBindingGAgent via IActorRuntime, waits on the projection readiness port, returns a friendly bind-confirmation page (id_token decoded locally for the display name — no /oauth/userinfo round-trip per ADR L61). Classifies error UX by HTTP status (ADR §Implementation Notes #3): 400 for state-token issues, 502 for broker exchange failure, 200 with a "binding pending propagation" message on projection timeout. #5 — Slash-command routing in ChannelConversationTurnRunner /init and /unbind are handled before the LLM by a new TryHandleSlashCommandAsync that resolves identity ports lazily through the existing IServiceProvider — deployments without per-user binding fall through unchanged. /init replies with the authorize URL (private DM only — ADR §Decision); /unbind resolves binding_id, calls broker.RevokeBindingAsync, replies with the unbind confirmation. The runner short-circuits via the existing SendReplyAsync, so reply delivery rides the relay outbound port like any other reply. #6 — /api/webhooks/nyxid-broker-revocation receiver (Continuous Access Evaluation) BrokerRevocationWebhookValidator verifies HMAC-SHA256 over the raw body (X-NyxID-Signature: sha256=<hex>), parses the JSON envelope into a typed BrokerRevocationNotification, and the endpoint event-source-revokes the local binding actor. Aligns with NyxID#549 V2-7's CAE channel — when a user revokes from NyxID's console the binding goes inactive in seconds rather than waiting for the 5-min access TTL. Tests: - 591 ChannelRuntime.Tests pass (39 of which are the new Identity-tagged tests covering actor, projector, broker fake, state-token codec, and ExternalSubjectRef extension). - Full solution `dotnet build aevatar.slnx` is clean. - `tools/docs/lint.sh` 33 file(s) checked, 0 errors. The ChannelConversationTurnRunner integration is feature-flag-shaped: if neither IExternalIdentityBindingQueryPort nor INyxIdCapabilityBroker is registered, the slash-command path is a no-op and existing bot-owner-shared behaviour is preserved. Production rollout is a DI toggle (AddChannelIdentityProjection + AddNyxIdRemoteCapabilityBroker + MapIdentityOAuthEndpoints) plus the Bot-Owner-Shared termination strategy choice from the ADR §Bot-Owner-Shared 模式终止策略. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
eanzhao
added a commit
that referenced
this pull request
Apr 29, 2026
Five round-2 follow-ups from the Codex consensus: #5 Orphan binding leak on actor-activation failure (resolved): The OAuth callback's already-bound branch revoked the orphan binding_id; the actor-activation-failed branch (when the local actor cannot be created post-exchange) silently returned 503 and left the freshly issued binding_id active at NyxID. Now the same TryRevokeOrphanBindingAsync helper runs before the 503 response so both leak paths cleanup symmetrically. #3 NyxIdAuthorityResolver footgun on staging clusters (resolved): Resolve() now takes an optional ILogger and emits a warning when AEVATAR_NYXID_AUTHORITY is unset AND ASPNETCORE_ENVIRONMENT / DOTNET_ENVIRONMENT indicate a non-Development environment. Local dev remains silent (env is empty / starts with "dev" / "local"); staging / qa / prod operators see the warning so a forgotten env-var doesn't silently register clients against production NyxID. #1 csproj coupling — store wiring belongs to the host (resolved): Split AddChannelIdentity into two extension methods. AddChannelIdentity now registers actors / projector / broker / slash commands but NOT document stores. AddChannelIdentityProjectionStores wires the ES vs InMemory choice and is called by the composition root (Mainnet host now invokes both methods). Tests / demos can mix and match — agent module never decides on the host's behalf which physical store to use. #2 kid rotation grace window (resolved): HMAC key rotation previously invalidated all in-flight state tokens (kid hardcoded "v1", no multi-key verification). The actor now carries current_kid + previous_kid + previous_demoted_at_unix on its state + projection. Encode signs with the snapshot's current kid; decode tries current first, then previous when demoted_at + state_token_lifetime is still in the future. A v1→v2 rotation produces deterministic kid succession (parses + increments the integer suffix) so verifiers can route signed tokens to the right key. #4 HMAC key still in projection (partial — code unchanged, ADR hardened): The projection-store-as-actor-state-mirror pattern is the established shape in this codebase (Channel.Runtime same coupling). Removing the hmac_key from the document would require a generic actor query/reply pattern that CLAUDE.md explicitly forbids ("禁止 generic actor query/reply"). The ADR-0018 §Implementation Notes #1 already documents the explicit tradeoff (state_token TTL ≤ 5 min, rotation command available, ES index access boundary equals actor event-store boundary, KMS migration path noted as a follow-up). The proto comment on AevatarOAuthClientDocument now also calls out the production access-scoping requirement so a least-privileged ES reader can't extract the key. Encryption-at-rest with an envelope key from env-var is the natural follow-up if a deployment widens ES access beyond the actor event-store boundary. Tests: - StateTokenCodec gains a kid-rotation grace test (decode succeeds with previous kid before lifetime expires; fails after). - AevatarOAuthClientGAgent gains a rotation-demotion test (verifies v1→v2 kid succession + previous-key carry-over) and a first-seed test (no previous-key fields populated on initial provision). - 800 ChannelRuntime tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Apr 30, 2026
loning
added a commit
that referenced
this pull request
May 20, 2026
…sign-philosophy label) Bug:iter18 #719 r4 三 solver 全 propose(consensus 应达成),但 issue 残留 stale design-philosophy label,meta-judge 强制 escalate:philosophy。导致 controller 看到 escalate 走 label-人路径,绕回死循环。 Fix:meta-judge Step 2.5 加 stale-label override 检查: - design-philosophy label + 有 maintainer directive 评论(non-AI,narrowing scope) → label 视为 stale,trigger #5 不 fire - 没 maintainer directive 或评论 vague → label 仍 active ⟦AI:AUTO-LOOP⟧ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Workflow Run Lifecycle(重构汇报稿)
执行摘要
这次重构的目标,是把“长期编排职责”和“单次执行职责”拆开,解决当前一个
WorkflowGAgent同时承担模板管理、运行执行、并发隔离带来的复杂度问题。目标方案是:
WorkflowGAgent作为长期编排实例,负责 workflow 定义、策略和入口。WorkflowExecutionGAgent作为一次 run 的执行实例,AgentId与RunId同值。MessageId保留为 AI 消息链路标识,不与RunId合并。这套模型的核心收益是:执行隔离更清晰、查询主键统一、并发场景更稳,且与现有 CQRS 投影链路兼容。
1. 背景与重构目标
1.1 当前痛点
RunId、ActorId、MessageId在语义上容易被混用,讨论成本高。1.2 重构目标
RunId成为唯一执行主键,查询和追踪统一。MessageId的消息链路能力,避免流式与投影能力回退。2. 目标架构
2.1 角色分工
WorkflowGAgent(long-lived)WorkflowExecutionGAgent(per-run)flowchart TB workflowAgent["WorkflowGAgent (long-lived)"] executionA["WorkflowExecutionGAgent (runA)"] executionB["WorkflowExecutionGAgent (runB)"] runId["RunId = ExecutionActorId"] sessionKey["MessageId (message chain key)"] workflowAgent --> executionA workflowAgent --> executionB executionA --> runId executionA --> sessionKey2.2 标识符职责边界
WorkflowGAgent.AgentIdRunIdRunIdExecutionActorIdWorkflowExecutionGAgent.AgentIdRunId同值MessageId3. 关键设计:Execution 如何定位长期 Workflow
核心原则:执行实例不做运行时搜索,编排层在创建时显式绑定 owner。
标准流程:
WorkflowGAgent(agentId命中则复用,否则创建并配置)。RunId。WorkflowExecutionGAgent,并强制executionActorId = runId。BindWorkflowAgentId(workflowAgentId)显式绑定 owner。ConfigureWorkflow(workflowYaml, workflowName)完成 execution 初始化。LinkAsync(workflowAgentId, executionActorId)。SendToAsync(workflowAgentId, evt)。设计约束:
workflowName扫描 owner。workflowAgentId在 run 生命周期内不可变。4. 一次请求的调用链与参数赋值
sequenceDiagram participant client as Client participant api as HostApi participant app as WorkflowApplication participant resolver as RunActorResolver participant execution as WorkflowExecutionGAgent participant role as RoleGAgent participant projection as ProjectionPipeline client->>api: "/api/chat(prompt, workflow, agentId?)" api->>app: "ExecuteAsync(request)" app->>resolver: "ResolveOrCreate(workflowAgent)" app->>app: "Generate runId" app->>execution: "Create(actorId = runId)" app->>execution: "BindWorkflowAgentId(workflowAgentId)" app->>execution: "ConfigureWorkflow(workflowYaml, workflowName)" app->>resolver: "LinkAsync(workflowAgentId, executionActorId)" app->>execution: "Dispatch run events" execution->>role: "ChatRequestEvent(messageId = runId:stepId)" role-->>execution: "TextMessageStart/Content/End(messageId)" execution-->>projection: "Events(runId, messageId)" projection-->>api: "Run report + stream frames"参数赋值表:
promptChatInput.Prompt原样传递ChatRequestEvent.PromptworkflowNameagentIdWorkflowGAgentrunIdexecutionActorIdexecutionActorId = runIdworkflowAgentIdthreadIdexecutionActorIdmessageIdchat-{guid};步骤runId:stepId(:attempt预留)5. 为什么
MessageId不能删RunId和MessageId不是同一层标识。RunId管执行实例,MessageId管消息会话链路。删除
MessageId后会直接影响:LLMCallModule依赖messageId关联_pending请求。AIChatMessageRunIdResolver通过messageId反解runId。msg:{messageId}无法稳定归并 start/content/end。message_id维度丢失。结论:
MessageId必须保留,但它不是执行主键。6. 落地步骤(建议汇报用)
阶段 1:模型落地
WorkflowExecutionGAgent,定义 owner 绑定字段(workflowAgentId)。executionActorId = runId。阶段 2:链路切换
WorkflowExecutionGAgent产生。threadId统一切到 execution actor id。阶段 3:验证与收口
WorkflowGAgent多 run)。MessageId维度)。/api/runs/{runId})。7. 风险与控制
workflowAgentIdthreadId = executionActorIdchat-{guid}与runId:stepId(:attempt预留)8. 验收标准
WorkflowExecutionGAgent,且AgentId == RunId。WorkflowGAgent并发多 run 时无串线。RUN_STARTED/RUN_FINISHED标识稳定。runId可稳定获取完整报告。MessageId的回归测试明确失败(作为保护性用例)。9. 代码锚点(便于评审)
src/Aevatar.Host.Api/Endpoints/ChatEndpoints.cssrc/Aevatar.Host.Api/Endpoints/ChatWebSocketRunCoordinator.cssrc/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.cssrc/workflow/Aevatar.Workflow.Application/Runs/WorkflowRunActorResolver.cssrc/workflow/Aevatar.Workflow.Application/Orchestration/WorkflowExecutionRunOrchestrator.cssrc/workflow/Aevatar.Workflow.Core/WorkflowExecutionGAgent.cssrc/workflow/Aevatar.Workflow.Core/WorkflowDefinitionSnapshot.csRunId生成:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.csRunId注入:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRequestEnvelopeFactory.csRunId提取:src/workflow/Aevatar.Workflow.Core/WorkflowGAgent.csMessageId规范:src/Aevatar.AI.Abstractions/ChatMessageKeys.cssrc/workflow/Aevatar.Workflow.Core/Modules/LLMCallModule.cs相关文档:
docs/IDENTIFIER_RELATIONSHIPS.md