Skip to content

Refactor workflow execution lifecycle and align MessageId semantics#5

Closed
eanzhao wants to merge 1 commit into
devfrom
feature/execution-actor
Closed

Refactor workflow execution lifecycle and align MessageId semantics#5
eanzhao wants to merge 1 commit into
devfrom
feature/execution-actor

Conversation

@eanzhao
Copy link
Copy Markdown
Contributor

@eanzhao eanzhao commented Feb 19, 2026

Workflow Run Lifecycle(重构汇报稿)

执行摘要

这次重构的目标,是把“长期编排职责”和“单次执行职责”拆开,解决当前一个 WorkflowGAgent 同时承担模板管理、运行执行、并发隔离带来的复杂度问题。

目标方案是:

  • WorkflowGAgent 作为长期编排实例,负责 workflow 定义、策略和入口。
  • WorkflowExecutionGAgent 作为一次 run 的执行实例,AgentIdRunId 同值。
  • MessageId 保留为 AI 消息链路标识,不与 RunId 合并。

这套模型的核心收益是:执行隔离更清晰、查询主键统一、并发场景更稳,且与现有 CQRS 投影链路兼容。

1. 背景与重构目标

1.1 当前痛点

  • 同一个编排实例承载多次执行,运行态与编排态边界不清。
  • RunIdActorIdMessageId 在语义上容易被混用,讨论成本高。
  • 并发执行时,步骤回包关联和排障依赖更多约定,维护成本上升。

1.2 重构目标

  1. 每次执行具备独立 actor 身份,执行边界与生命周期清晰。
  2. RunId 成为唯一执行主键,查询和追踪统一。
  3. 保留 MessageId 的消息链路能力,避免流式与投影能力回退。

2. 目标架构

2.1 角色分工

  • WorkflowGAgent(long-lived)
    • 保存 workflow 定义与配置。
    • 接收入口请求并选择/创建执行实例。
    • 不承载单次 run 的可变执行状态。
  • WorkflowExecutionGAgent(per-run)
    • 处理本次 run 的步骤推进、事件发布、回包关联。
    • run 结束后销毁或回收。
flowchart TB
    workflowAgent["WorkflowGAgent (long-lived)"]
    executionA["WorkflowExecutionGAgent (runA)"]
    executionB["WorkflowExecutionGAgent (runB)"]
    runId["RunId = ExecutionActorId"]
    sessionKey["MessageId (message chain key)"]

    workflowAgent --> executionA
    workflowAgent --> executionB
    executionA --> runId
    executionA --> sessionKey
Loading

2.2 标识符职责边界

标识符 作用域 主用途 结论
WorkflowGAgent.AgentId 长期编排主体 入口与编排身份 不等于 RunId
RunId 单次执行 执行主键、查询主键 等于 ExecutionActorId
WorkflowExecutionGAgent.AgentId 单次执行 actor 运行隔离与线程标识 RunId 同值
MessageId AI 消息链路 流式消息聚合、回包匹配 不能删除

3. 关键设计:Execution 如何定位长期 Workflow

核心原则:执行实例不做运行时搜索,编排层在创建时显式绑定 owner。

标准流程:

  1. 入口先定位长期 WorkflowGAgentagentId 命中则复用,否则创建并配置)。
  2. 生成 RunId
  3. 创建 WorkflowExecutionGAgent,并强制 executionActorId = runId
  4. 调用 BindWorkflowAgentId(workflowAgentId) 显式绑定 owner。
  5. 调用 ConfigureWorkflow(workflowYaml, workflowName) 完成 execution 初始化。
  6. 建立父子关系:LinkAsync(workflowAgentId, executionActorId)
  7. 执行阶段若需回调编排实例,直接 SendToAsync(workflowAgentId, evt)

设计约束:

  • 禁止按 workflowName 扫描 owner。
  • workflowAgentId 在 run 生命周期内不可变。
  • owner 丢失时快速失败,输出明确错误事件。

4. 一次请求的调用链与参数赋值

sequenceDiagram
    participant client as Client
    participant api as HostApi
    participant app as WorkflowApplication
    participant resolver as RunActorResolver
    participant execution as WorkflowExecutionGAgent
    participant role as RoleGAgent
    participant projection as ProjectionPipeline

    client->>api: "/api/chat(prompt, workflow, agentId?)"
    api->>app: "ExecuteAsync(request)"
    app->>resolver: "ResolveOrCreate(workflowAgent)"
    app->>app: "Generate runId"
    app->>execution: "Create(actorId = runId)"
    app->>execution: "BindWorkflowAgentId(workflowAgentId)"
    app->>execution: "ConfigureWorkflow(workflowYaml, workflowName)"
    app->>resolver: "LinkAsync(workflowAgentId, executionActorId)"
    app->>execution: "Dispatch run events"
    execution->>role: "ChatRequestEvent(messageId = runId:stepId)"
    role-->>execution: "TextMessageStart/Content/End(messageId)"
    execution-->>projection: "Events(runId, messageId)"
    projection-->>api: "Run report + stream frames"
Loading

参数赋值表:

参数 来源 赋值规则 消费方
prompt 请求体 ChatInput.Prompt 原样传递 ChatRequestEvent.Prompt
workflowName 请求体 为空走默认 workflow workflow 加载/解析
agentId 请求体 定位长期 WorkflowGAgent actor resolver
runId 服务端生成 每次 run 生成一次 执行事件、投影、查询
executionActorId 服务端派生 executionActorId = runId runtime、routing、thread
workflowAgentId 服务端解析 owner id,创建 execution 时写入 execution 回调编排主体
threadId 服务端派生 对外输出使用 executionActorId SSE/WS 输出
messageId 服务端生成 入口 chat-{guid};步骤 runId:stepId:attempt 预留) LLM 回包关联、消息聚合

5. 为什么 MessageId 不能删

RunIdMessageId 不是同一层标识。RunId 管执行实例,MessageId 管消息会话链路。

删除 MessageId 后会直接影响:

  1. 步骤回包匹配失效
    • LLMCallModule 依赖 messageId 关联 _pending 请求。
  2. AI 事件 run 归属能力下降
    • AIChatMessageRunIdResolver 通过 messageId 反解 runId
  3. 流式消息聚合受损
    • AGUI 侧 msg:{messageId} 无法稳定归并 start/content/end。
  4. 读模型可观测性下降
    • role reply 与 timeline 的 message_id 维度丢失。

结论:MessageId 必须保留,但它不是执行主键。

6. 落地步骤(建议汇报用)

阶段 1:模型落地

  • 新增 WorkflowExecutionGAgent,定义 owner 绑定字段(workflowAgentId)。
  • 调整 run 启动流程,改为“先定位 workflow agent,再创建 execution agent”。
  • 强制 executionActorId = runId

阶段 2:链路切换

  • 执行事件改为由 WorkflowExecutionGAgent 产生。
  • 输出 threadId 统一切到 execution actor id。
  • pending/关联键检查并统一为 run 作用域。

阶段 3:验证与收口

  • 并发 run 压测(同一 WorkflowGAgent 多 run)。
  • 回包匹配与流式聚合验证(MessageId 维度)。
  • 读模型查询一致性验证(/api/runs/{runId})。

7. 风险与控制

风险 表现 控制策略
owner 绑定遗漏 execution 无法回调 workflow 创建时强制校验 workflowAgentId
run/thread 标识不一致 前端线程展示混乱 统一 threadId = executionActorId
session 格式不统一 回包匹配或聚合异常 固化规范:chat-{guid}runId:stepId:attempt 预留)
迁移阶段行为漂移 新旧链路结果不一致 灰度开关 + 双链路对比

8. 验收标准

  1. 一次 run 只对应一个 WorkflowExecutionGAgent,且 AgentId == RunId
  2. 同一 WorkflowGAgent 并发多 run 时无串线。
  3. SSE/WS 全链路可用,RUN_STARTED/RUN_FINISHED 标识稳定。
  4. 查询侧按 runId 可稳定获取完整报告。
  5. 删除 MessageId 的回归测试明确失败(作为保护性用例)。

9. 代码锚点(便于评审)

  • 入口解析:src/Aevatar.Host.Api/Endpoints/ChatEndpoints.cs
  • WS 入口:src/Aevatar.Host.Api/Endpoints/ChatWebSocketRunCoordinator.cs
  • 编排入口:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.cs
  • 编排主体定位:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowRunActorResolver.cs
  • 运行编排:src/workflow/Aevatar.Workflow.Application/Orchestration/WorkflowExecutionRunOrchestrator.cs
  • 执行实例:src/workflow/Aevatar.Workflow.Core/WorkflowExecutionGAgent.cs
  • 工作流快照:src/workflow/Aevatar.Workflow.Core/WorkflowDefinitionSnapshot.cs
  • RunId 生成:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRunApplicationService.cs
  • RunId 注入:src/workflow/Aevatar.Workflow.Application/Runs/WorkflowChatRequestEnvelopeFactory.cs
  • RunId 提取:src/workflow/Aevatar.Workflow.Core/WorkflowGAgent.cs
  • MessageId 规范:src/Aevatar.AI.Abstractions/ChatMessageKeys.cs
  • 会话关联:src/workflow/Aevatar.Workflow.Core/Modules/LLMCallModule.cs

相关文档:docs/IDENTIFIER_RELATIONSHIPS.md

Enforce per-run execution invariants (owner/YAML fail-fast, RunId == execution actor ID, and metadata run_id consistency) and propagate MessageId on workflow completion to keep streaming/projection correlation stable. Also isolate per-run stats from long-lived workflow actors, add compatibility warnings for direct WorkflowGAgent entry, sync identifier/lifecycle docs, and add a destroy-race memo for follow-up soft-destroy/grace-period design.
@eanzhao eanzhao closed this Feb 25, 2026
eanzhao added a commit that referenced this pull request Apr 27, 2026
P1 — workflow termination on publish failure (reviewer's critical):
`PublishFailureAsync` emits `StepCompletedEvent { Success = false }`,
which the kernel routes through `TryRetryAsync` → `TryOnErrorAsync` →
fail. With no retry/on_error policy on `publish_to_twitter`, a Twitter
401/403/429/5xx terminated the entire workflow run as failed. Add
`on_error: { strategy: skip, default_output: "twitter_publish_failed" }`
to the YAML so the run advances to `done` cleanly; the module already
surfaces categorized errors to Lark independently.

#2 — Twitter v2 native error shape: `ClassifyTwitterResponse` now
recognizes the third response shape NyxID can forward verbatim:
`{ "title": "...", "detail": "...", "errors": [...] }` (Twitter's
native problem-details for content-policy / duplicate-tweet rejections).
Falls through to `twitter_publish_rejected` with the Twitter `message`
text in the Lark surfacing so users read the actual rejection reason.

#1 — Duplicate tweet risk: documented in code comment that
`POST /2/tweets` has no server-side dedup; the social_media template
intentionally has no `retry` policy on this step, and `on_error: skip`
advances rather than retrying. Authors customizing the YAML must keep
this invariant.

#3 — Removed redundant `nyxClient!` null-forgiving (no-op cleanup).

#4 — Renamed `ChannelMetadataKeys.LarkProxySlug` →
`LarkOutboundProxySlug` (`channel.lark.outbound_proxy_slug`) to
disambiguate "Lark API surface" from "NyxID provider routing".

#5 — Added xml-doc on `TrySendLarkAsync` documenting the dual-scope
api-key dependency (key must carry both api-twitter AND api-lark-bot
entitlements) so future callers don't silently break Lark surfacing
when narrowing the key's scope.

#6 — Added `RequiredServiceSlugs` field to `SocialMediaTemplateSpec`
for parity with `DailyReportTemplateSpec`; `CreateSocialMediaAgentAsync`
now reads slugs from the spec instead of inlining the list.

Tests:
- 3 new `ClassifyTwitterResponse` tests for the Twitter native error
  shapes (errors-array, RFC-7807 title/detail-only, empty-object
  unexpected-shape).
- Existing social_media test now also asserts `strategy: skip` lands in
  the upserted YAML.
- 482 channel-runtime + 236 workflow.core tests pass; full solution
  builds with 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 27, 2026
…elper

Resolves three followup architecture-review points on PR #451:

### Channel.Runtime drops AI / Workflow direct deps (review #2)

`Channel.Runtime`'s csproj used to pull in `Aevatar.AI.Abstractions` and
`Aevatar.Workflow.Application.Abstractions` because of two files that
straddled the channel/AI and channel/workflow boundary:

- `ChannelContextMiddleware` (an `ILLMCallMiddleware` impl) — moved to
  `Aevatar.GAgents.NyxidChat`, which is the only package that needs it
  and already references `AI.Abstractions`. NyxidChat SCE registers it
  for the LLM call pipeline; Channel.Runtime SCE no longer touches
  `ILLMCallMiddleware`.
- `ChannelCardActionRouting` (builds `WorkflowResumeCommand`) — moved
  to `Aevatar.GAgents.NyxidChat` for the same reason. Its sole consumer
  (`ChannelConversationTurnRunner`) lives there too.

`Channel.Runtime.csproj` now references only `Channel.Abstractions`,
`Foundation.Abstractions`/`Core`, and the `CQRS.Projection.*` slice —
matching the "channel-agnostic flow + projection infrastructure"
charter from the RFC. Tests (`ChannelCardActionRoutingTests`) get the
extra `using Aevatar.GAgents.NyxidChat;`.

### Extract Elasticsearch projection-store toggle helper (review #4)

The `ResolveElasticsearchEnabled` + `BuildElasticsearchOptions` helper
pair was duplicated three times (Channel.Runtime / Device / Scheduled
SCEs) with slightly different log strings and `Console.Error.WriteLine`
output. Centralized into
`Aevatar.CQRS.Projection.Providers.Elasticsearch.DependencyInjection.ElasticsearchProjectionConfiguration`
with two static helpers:

- `IsEnabled(IConfiguration?, ILogger?, string? storeName)` — explicit
  flag → endpoints presence → false; logs a structured warning via
  `ILogger` (when supplied) instead of `Console.Error.WriteLine`.
- `BindOptions(IConfiguration)` — typed binder for
  `ElasticsearchProjectionDocumentStoreOptions`.

All three SCEs now call into this helper; per-package warning text is
parameterized via `storeName`. Section path
(`Projection:Document:Providers:Elasticsearch`) is exposed as a const
so future call sites stay in sync.

### Followup points acknowledged but deferred

- **Cross-package dep chain `NyxidChat → Authoring.Lark → Scheduled →
  Platform.Lark`** (review #1) — pre-existing arch debt that the split
  surfaced rather than introduced. Cleaner would be to invert via
  `IInboundFlowResolver` plug-ins so `ChannelConversationTurnRunner`
  doesn't reach into `AgentBuilderCardFlow` directly. Out of scope for
  the package split; tracking as a separate follow-up.
- **Tombstone compactor "central coordinator"** (review #3) —
  `Channel.Runtime` defines `ITombstoneCompactionTarget` but does not
  reference `Device` / `Scheduled` at the csproj level; per-package
  targets register themselves through DI. The plug-in pattern is
  intentional and keeps the DAG one-way.
- **`Scheduled` package name vs UserAgentCatalog content** (review #5)
  — `UserAgentCatalog` is the delivery-target registry that Scheduled
  agents read at execution time to route output, so co-locating it
  with `SkillRunnerGAgent` / `WorkflowAgentGAgent` is intentional.
  Renaming to `AgentCatalog` would split actors from their primary
  consumer; deferring.

473/473 ChannelRuntime.Tests pass; full slnx still only fails the same
two pre-existing Mainnet hosting `BindAsync on IStudioMemberService`
tests that reproduce on origin/dev.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 29, 2026
…ash-command

Wires up all six pieces the ADR called out as parallel-capable now that
NyxID#549 has shipped, so per-user binding stops requiring a follow-up PR
per layer:

#1 — AuthContext.external_subject (Channel.Abstractions)
    Move ExternalSubjectRef out of Identity.Abstractions into
    Channel.Abstractions where AuthContext lives. Identity is a consumer
    of channel concepts, not their owner; the new typed
    AuthContext.external_subject = 4 field is the broker-mode outbound
    identity carrier (ADR §Outbound Send) and the legacy string
    user_credential_ref keeps working for non-broker callers. Adds a
    new AuthContext.OnBehalfOfExternalSubject helper.

#2 — Projection chain
    ExternalIdentityBindingDocument (proto) + Partial (IProjectionReadModel)
    + MetadataProvider + MaterializationContext + Projector
    (ICurrentStateProjectionMaterializer) + ExternalIdentityBindingProjectionQueryPort
    + ExternalIdentityBindingProjectionReadinessPort (polling impl, write-side
    completion path only — ADR §Projection Readiness explicitly allows this).
    AddChannelIdentityProjection registers the lot.

#3 — NyxIdRemoteCapabilityBroker
    HTTP-backed INyxIdCapabilityBroker + INyxIdBrokerCallbackClient against
    the NyxID#549 wire shape: /oauth/authorize URL building, RFC 8693
    token-exchange with subject_token_type=urn:nyxid:params:oauth:token-type:binding-id,
    /oauth/bindings/{id} delete, authorization-code -> binding_id exchange.
    Maps invalid_grant -> BindingRevokedException so the upper layer
    event-source-revokes the local binding. PkceHelper covers RFC 7636 S256;
    StateTokenCodec seals correlation+verifier+exp into an HMAC token (kid
    header, deterministic Protobuf payload — ADR §Implementation Notes #1).
    AddNyxIdRemoteCapabilityBroker wires HttpClient + options + codec +
    both interfaces.

#4 — /api/oauth/nyxid-callback endpoint (IdentityOAuthEndpoints)
    Decodes state, exchanges code -> binding_id, dispatches CommitBindingCommand
    to ExternalIdentityBindingGAgent via IActorRuntime, waits on the projection
    readiness port, returns a friendly bind-confirmation page (id_token decoded
    locally for the display name — no /oauth/userinfo round-trip per ADR L61).
    Classifies error UX by HTTP status (ADR §Implementation Notes #3):
    400 for state-token issues, 502 for broker exchange failure, 200 with a
    "binding pending propagation" message on projection timeout.

#5 — Slash-command routing in ChannelConversationTurnRunner
    /init and /unbind are handled before the LLM by a new
    TryHandleSlashCommandAsync that resolves identity ports lazily through
    the existing IServiceProvider — deployments without per-user binding
    fall through unchanged. /init replies with the authorize URL (private DM
    only — ADR §Decision); /unbind resolves binding_id, calls
    broker.RevokeBindingAsync, replies with the unbind confirmation. The
    runner short-circuits via the existing SendReplyAsync, so reply
    delivery rides the relay outbound port like any other reply.

#6 — /api/webhooks/nyxid-broker-revocation receiver (Continuous Access Evaluation)
    BrokerRevocationWebhookValidator verifies HMAC-SHA256 over the raw body
    (X-NyxID-Signature: sha256=<hex>), parses the JSON envelope into a
    typed BrokerRevocationNotification, and the endpoint event-source-revokes
    the local binding actor. Aligns with NyxID#549 V2-7's CAE channel — when
    a user revokes from NyxID's console the binding goes inactive in seconds
    rather than waiting for the 5-min access TTL.

Tests:
- 591 ChannelRuntime.Tests pass (39 of which are the new Identity-tagged
  tests covering actor, projector, broker fake, state-token codec, and
  ExternalSubjectRef extension).
- Full solution `dotnet build aevatar.slnx` is clean.
- `tools/docs/lint.sh` 33 file(s) checked, 0 errors.

The ChannelConversationTurnRunner integration is feature-flag-shaped: if
neither IExternalIdentityBindingQueryPort nor INyxIdCapabilityBroker is
registered, the slash-command path is a no-op and existing
bot-owner-shared behaviour is preserved. Production rollout is a DI
toggle (AddChannelIdentityProjection + AddNyxIdRemoteCapabilityBroker
+ MapIdentityOAuthEndpoints) plus the Bot-Owner-Shared termination
strategy choice from the ADR §Bot-Owner-Shared 模式终止策略.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eanzhao added a commit that referenced this pull request Apr 29, 2026
Five round-2 follow-ups from the Codex consensus:

#5 Orphan binding leak on actor-activation failure (resolved):
  The OAuth callback's already-bound branch revoked the orphan
  binding_id; the actor-activation-failed branch (when the local actor
  cannot be created post-exchange) silently returned 503 and left the
  freshly issued binding_id active at NyxID. Now the same
  TryRevokeOrphanBindingAsync helper runs before the 503 response so
  both leak paths cleanup symmetrically.

#3 NyxIdAuthorityResolver footgun on staging clusters (resolved):
  Resolve() now takes an optional ILogger and emits a warning when
  AEVATAR_NYXID_AUTHORITY is unset AND ASPNETCORE_ENVIRONMENT /
  DOTNET_ENVIRONMENT indicate a non-Development environment. Local dev
  remains silent (env is empty / starts with "dev" / "local"); staging
  / qa / prod operators see the warning so a forgotten env-var doesn't
  silently register clients against production NyxID.

#1 csproj coupling — store wiring belongs to the host (resolved):
  Split AddChannelIdentity into two extension methods. AddChannelIdentity
  now registers actors / projector / broker / slash commands but NOT
  document stores. AddChannelIdentityProjectionStores wires the ES vs
  InMemory choice and is called by the composition root (Mainnet host
  now invokes both methods). Tests / demos can mix and match — agent
  module never decides on the host's behalf which physical store to use.

#2 kid rotation grace window (resolved):
  HMAC key rotation previously invalidated all in-flight state tokens
  (kid hardcoded "v1", no multi-key verification). The actor now carries
  current_kid + previous_kid + previous_demoted_at_unix on its state +
  projection. Encode signs with the snapshot's current kid; decode tries
  current first, then previous when demoted_at + state_token_lifetime
  is still in the future. A v1→v2 rotation produces deterministic kid
  succession (parses + increments the integer suffix) so verifiers can
  route signed tokens to the right key.

#4 HMAC key still in projection (partial — code unchanged, ADR
  hardened):
  The projection-store-as-actor-state-mirror pattern is the established
  shape in this codebase (Channel.Runtime same coupling). Removing the
  hmac_key from the document would require a generic actor query/reply
  pattern that CLAUDE.md explicitly forbids ("禁止 generic actor
  query/reply"). The ADR-0018 §Implementation Notes #1 already
  documents the explicit tradeoff (state_token TTL ≤ 5 min, rotation
  command available, ES index access boundary equals actor event-store
  boundary, KMS migration path noted as a follow-up). The proto
  comment on AevatarOAuthClientDocument now also calls out the
  production access-scoping requirement so a least-privileged ES
  reader can't extract the key. Encryption-at-rest with an envelope
  key from env-var is the natural follow-up if a deployment widens ES
  access beyond the actor event-store boundary.

Tests:
  - StateTokenCodec gains a kid-rotation grace test (decode succeeds
    with previous kid before lifetime expires; fails after).
  - AevatarOAuthClientGAgent gains a rotation-demotion test (verifies
    v1→v2 kid succession + previous-key carry-over) and a first-seed
    test (no previous-key fields populated on initial provision).
  - 800 ChannelRuntime tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
loning added a commit that referenced this pull request May 20, 2026
…sign-philosophy label)

Bug:iter18 #719 r4 三 solver 全 propose(consensus 应达成),但 issue 残留 stale
design-philosophy label,meta-judge 强制 escalate:philosophy。导致 controller 看到
escalate 走 label-人路径,绕回死循环。

Fix:meta-judge Step 2.5 加 stale-label override 检查:
- design-philosophy label + 有 maintainer directive 评论(non-AI,narrowing scope)
  → label 视为 stale,trigger #5 不 fire
- 没 maintainer directive 或评论 vague → label 仍 active

⟦AI:AUTO-LOOP⟧

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant