Skip to content

Repair deterministic workflow gate failures with agents#826

Merged
khaliqgant merged 9 commits into
mainfrom
codex/repairable-workflow-failures
May 8, 2026
Merged

Repair deterministic workflow gate failures with agents#826
khaliqgant merged 9 commits into
mainfrom
codex/repairable-workflow-failures

Conversation

@khaliqgant

@khaliqgant khaliqgant commented May 8, 2026

Copy link
Copy Markdown
Member

Summary

  • make deterministic gate repair opt-in via explicit repairRetries so existing workflows do not silently spawn repair agents
  • run deterministic repair agents from the same resolved working directory as the failed gate
  • route GitHub SDK helper exports through the local ./github surface and clean up review-noted markdown/trajectory text
  • add a dedicated SDK workflow reliability contract suite for repairable deterministic failures
  • expand that matrix to cover final hard validation repair, fan-out sibling isolation, repair-agent fallback, invalid repairAgent fallback, and start-from resume context

Test Plan

  • Tests added/updated
  • Manual testing completed
  • npm --prefix packages/sdk run check
  • npx vitest run --root packages/sdk --config vitest.config.ts src/__tests__/workflow-runner.test.ts -t "deterministic"
  • npx vitest run --root packages/sdk --config vitest.config.ts src/workflows/__tests__/workflow-reliability-contract.test.ts
  • git diff --check

Note: the full local workflow-runner.test.ts file still hits an existing trajectory harness failure (readCompletedTrajectoryFile returns null in should record review completion in trajectory with decision and reason). The deterministic repair tests touched by this PR pass, and CI for the prior commit was green.

khaliqgant and others added 5 commits May 5, 2026 16:10
…l + cloud adapters

Mirrors the github-primitive's adapter shape so the same workflow file
runs locally (gh CLI / Slack token) and in cloud (Nango). Two
flagship verbs:

  - postMessage — fire-and-forget human notification (PR opened,
    workflow done, etc.). Resolves @-mentions and #channel names at
    step time.
  - askQuestion — block the workflow on a human reply. Configurable
    timeout, replyMatch (regex/choice/any), allowedReplyFrom, and
    Block Kit interactive forms. Resumable across sandbox restarts
    via run-record metadata so retries don't re-ask.

Captures the cultural change the primitive enables: agents should ask
for clarification when blocked rather than guess. Includes two
recipes ("Announce + Done", "Ask Before You Guess") that the
writing-agent-relay-workflows skill will pick up when v1 ships.

Cloud-runtime auth reuses the workspace's existing Nango Slack
connection — no new SST resource bindings. Slack tokens don't rotate
per-call, so the proxy form (nango.proxy({ endpoint: '/chat.postMessage' }))
is the right shape — avoids the get-token-vs-proxy confusion the
github-primitive had to work through.

Phasing:
  A — postMessage + resolvers
  B — askQuestion (blocking, resumable)
  C — Block Kit interactive forms + utility verbs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exposes the bundled `@agent-relay/github-primitive` from the root
`@agent-relay/sdk` so workflow authors no longer need the subpath
import. Two new shapes added alongside the existing `/github` subpath:

- `import { github } from '@agent-relay/sdk'` — full namespaced surface,
  no collision risk with other root exports.
- `import { createGitHubStep, GitHubClient } from '@agent-relay/sdk'` —
  curated helpers for the common workflow-author path.

Avoided a flat `export *` because the primitive ships ~40 generic-named
action helpers (createFile, readFile, getUser, errorMessage, ...) that
would pollute the root namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant requested a review from willwashburn as a code owner May 8, 2026 15:03
@coderabbitai

coderabbitai Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements deterministic repair-agent support for deterministic workflow steps (types/schema/builder, runner resolution/prompt/execution), adds SDK GitHub re-exports, compacts and finalizes many trajectory artifacts and updates the index, and introduces a Slack primitive design spec with tests and README updates.

Changes

Deterministic Gate Repair Feature

Layer / File(s) Summary
Types & Schema
packages/sdk/src/workflows/types.ts, packages/sdk/src/workflows/schema.json, packages/sdk/src/workflows/builder.ts
ErrorHandlingConfig extended with optional repairAgent?: string and repairRetries?: number; builder copies these options into the internal error-handling config; JSON schema updated.
Runner Implementation
packages/sdk/src/workflows/runner.ts
Refactors deterministic-step execution to persist last command/CWD/output, resolve a repair agent (config override, step agent, reversed deps, scored fallback), build repair prompts with command/error/output/CWD context, and route repair execution via executor/API/local subprocess before retrying the deterministic command.
Docs & Tests
packages/sdk/src/workflows/README.md, packages/sdk/src/__tests__/workflow-runner.test.ts, packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts
README documents repair-before-retry and global errorHandling fields; integration and contract tests validate repair invocation, prompt/context passing, retry budgets, thrown-repair handling, soft-fail behavior, fan-out semantics, resume-from-cache behavior, and fail-fast opt-out.

GitHub Primitive SDK Exports

Layer / File(s) Summary
GitHub Module Export
packages/sdk/src/github.ts, packages/sdk/src/index.ts
Module header docs updated; SDK entrypoint re-exports a github namespace and curated createGitHubStep and GitHubClient helpers for direct import.

Workflow Trajectory Artifacts

Layer / File(s) Summary
Compacted Trajectory Batch
.trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json, .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md
Adds compacted batch artifacts aggregating decisions (categorized), conventions, lessons learned, open questions, findings, and stats for Apr 10–May 8.
Completed Workflow Records
.trajectories/completed/2026-05/*.json, .trajectories/completed/2026-05/*.md
Updates many per-run artifacts: change status to completed, add completedAt and chapter endedAt timestamps, append retrospective objects, and condense large raw evidence payloads.
Trajectory Index
.trajectories/index.json
Bumps lastUpdated, flips several active entries to completed (adds completedAt and compactedInto), finalizes path entries under completed/, and appends new completed entries for the batch.

Slack Primitive Specification

Layer / File(s) Summary
Slack Primitive Design
specs/slack-primitive.md
New design spec defining Slack runtimes (local/cloud/auto), auth sourcing, postMessage and askQuestion semantics (mention resolution, reply filtering, resumability), utility actions, typed error codes, implementation layout, skill recipes, open questions, and phased v1 acceptance criteria.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 "I nibbled at a broken gate,
I patched the flow and made it great,
GitHub shines and Slack can speak,
Trajectories logged, tests not weak,
Hoppy fixes—now workflows streak!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main change: implementing deterministic workflow gate failure repair using agents.
Description check ✅ Passed The PR description follows the template with both Summary and Test Plan sections clearly filled out, detailing the specific changes, verification steps, and test execution results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/repairable-workflow-failures

Comment @coderabbitai help to get the list of available commands and usage tips.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread packages/sdk/src/workflows/runner.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
.trajectories/completed/2026-05/traj_o9cx33xn5u39.json (1)

367-367: The review chapter took 18 days to complete.

The review chapter's duration from startedAt (2026-04-20T15:19:38.968Z) to endedAt (2026-05-08T13:33:35.341Z) spans approximately 18 days, accounting for nearly all of the workflow's total duration. While this could be expected for PR review processes involving human approval, it may indicate the workflow was blocked or paused for an extended period.

Consider tracking whether long-running review steps indicate:

  • Human review bottlenecks that could benefit from notifications/escalation
  • Workflow pause/resume functionality that should be explicitly recorded
  • Separate "blocked" vs "active" states for better observability
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.trajectories/completed/2026-05/traj_o9cx33xn5u39.json at line 367, The
trajectory JSON currently only records startedAt and endedAt for the review
chapter, which obscures long pauses; update the review chapter object (the entry
with startedAt/endedAt) to include explicit lifecycle fields such as state
(active|blocked|paused|completed), blockedAt and resumedAt timestamps, and an
optional pausedReason or reviewerEscalation flag, and emit/record a derived
durationSpentActive vs durationBlocked so observers can distinguish human-review
wait time from active processing and enable notifications/escalation when
blocked durations exceed thresholds.
packages/sdk/src/index.ts (1)

26-26: ⚡ Quick win

Re-export curated GitHub helpers via ./github.js for a single API source.

Line 26 currently bypasses the local GitHub surface, which can create drift later if curation changes.

Proposed change
 export * as github from './github.js';
-export { createGitHubStep, GitHubClient } from '@agent-relay/github-primitive';
+export { createGitHubStep, GitHubClient } from './github.js';
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/sdk/src/index.ts` at line 26, The current export line re-exports
createGitHubStep and GitHubClient directly from '@agent-relay/github-primitive',
bypassing the local curated surface; change the export to re-export these
symbols from the local module (./github) instead so the SDK has a single source
of truth—update the export statement for createGitHubStep and GitHubClient to
import/export from './github' (ensuring ./github exports those same symbols).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md:
- Line 29: The workflow text on line containing "**Apply autofix-swarm fixes in
the parent worktree at /Users/khaliqgant/Projects/AgentWorkforce/relay, not the
.relay/workspace cwd copy**" embeds a machine-specific absolute path and user
name; update that sentence to use a repo-root placeholder (e.g. "<repo-root>" or
"{REPO_ROOT}") and keep the rest of the message unchanged so it reads something
like "Apply autofix-swarm fixes in the parent worktree at <repo-root>/relay, not
the .relay/workspace cwd copy" to avoid leaking user-specific paths.
- Line 12: Replace the malformed token-format text "at*live*\*" in the decision
table with an inline code-formatted token pattern (e.g. `at*live*`) so the
prefix/wildcard renders unambiguously; search for the exact string "at*live*\*"
in the compacted decision table entry and update it to use backticks around the
intended token pattern.

In @.trajectories/completed/2026-05/traj_o9cx33xn5u39.json:
- Around line 405-409: Update the terse retrospective.summary value ("merged")
to a more descriptive sentence that captures what was accomplished; locate the
"retrospective" JSON object and change the "summary" field to something like
"Successfully added --register flag to agent-relay-broker mcp-args subcommand,
enabling token self-registration and unblocking cloud cleanup PR" or an
equivalent that names the change, the feature (--register flag /
agent-relay-broker mcp-args), and the outcome. Ensure you only modify the
"summary" string in the retrospective object so other fields (approach,
confidence) remain unchanged.

In `@packages/sdk/src/workflows/runner.ts`:
- Around line 3972-3979: The repair step currently copies context.step.cwd and
context.step.workdir which can be unset and cause repairs to run from the
agent's own directory; update the construction of the repairStep object (cwd and
workdir fields) to use the resolved context.cwd and context.workdir (already
derived from the failing deterministic execution) and fall back to
context.step.cwd/context.step.workdir only if those resolved values are
undefined, so repairStep runs in the same resolved working directory as the
failed deterministic gate.

In `@specs/slack-primitive.md`:
- Around line 237-258: The markdown fenced code block that lists the package
tree (the block starting with the line "packages/slack-primitive/") is missing a
language identifier; update its opening triple-backtick to include a language
(e.g., ```text) so the linter (MD040) passes and CI stops flagging the unlabeled
fence. Locate the block that begins with "packages/slack-primitive/" in
specs/slack-primitive.md and change the opening ``` to ```text, leaving the rest
of the block content and closing backticks unchanged.

---

Nitpick comments:
In @.trajectories/completed/2026-05/traj_o9cx33xn5u39.json:
- Line 367: The trajectory JSON currently only records startedAt and endedAt for
the review chapter, which obscures long pauses; update the review chapter object
(the entry with startedAt/endedAt) to include explicit lifecycle fields such as
state (active|blocked|paused|completed), blockedAt and resumedAt timestamps, and
an optional pausedReason or reviewerEscalation flag, and emit/record a derived
durationSpentActive vs durationBlocked so observers can distinguish human-review
wait time from active processing and enable notifications/escalation when
blocked durations exceed thresholds.

In `@packages/sdk/src/index.ts`:
- Line 26: The current export line re-exports createGitHubStep and GitHubClient
directly from '@agent-relay/github-primitive', bypassing the local curated
surface; change the export to re-export these symbols from the local module
(./github) instead so the SDK has a single source of truth—update the export
statement for createGitHubStep and GitHubClient to import/export from './github'
(ensuring ./github exports those same symbols).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4f0114db-dde3-4e11-a1e8-01c84ae307a0

📥 Commits

Reviewing files that changed from the base of the PR and between dc8247d and 31aa3d2.

📒 Files selected for processing (24)
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md
  • .trajectories/completed/2026-05/traj_1775914133873_35667beb.json
  • .trajectories/completed/2026-05/traj_1775914133873_35667beb.md
  • .trajectories/completed/2026-05/traj_1776073106646_1839be2d.json
  • .trajectories/completed/2026-05/traj_1776073106646_1839be2d.md
  • .trajectories/completed/2026-05/traj_1776113772922_bc92f121.json
  • .trajectories/completed/2026-05/traj_1776113772922_bc92f121.md
  • .trajectories/completed/2026-05/traj_3b3p1z4y7qlo.json
  • .trajectories/completed/2026-05/traj_3b3p1z4y7qlo.md
  • .trajectories/completed/2026-05/traj_itgr2w8qs3xn.json
  • .trajectories/completed/2026-05/traj_itgr2w8qs3xn.md
  • .trajectories/completed/2026-05/traj_o9cx33xn5u39.json
  • .trajectories/completed/2026-05/traj_o9cx33xn5u39.md
  • .trajectories/index.json
  • packages/sdk/src/__tests__/workflow-runner.test.ts
  • packages/sdk/src/github.ts
  • packages/sdk/src/index.ts
  • packages/sdk/src/workflows/README.md
  • packages/sdk/src/workflows/builder.ts
  • packages/sdk/src/workflows/runner.ts
  • packages/sdk/src/workflows/schema.json
  • packages/sdk/src/workflows/types.ts
  • specs/slack-primitive.md

Comment thread .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md Outdated
Comment thread .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md Outdated
Comment thread .trajectories/completed/2026-05/traj_o9cx33xn5u39.json
Comment thread packages/sdk/src/workflows/runner.ts
Comment thread specs/slack-primitive.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md:
- Line 5: Correct the repeated hostname typo by replacing every occurrence of
the misspelled string "orgin.agentrelay.net" with the correct
"origin.agentrelay.net" in the document; search for the exact token
"orgin.agentrelay.net" (appears four times per the review) and update each
instance so references like the production SST alias read
"origin.agentrelay.net".

In `@specs/slack-primitive.md`:
- Line 52: Update the Slack token-prefix examples in specs/slack-primitive.md to
use clear, non-ambiguous token prefixes (e.g., "xoxb-" for bot tokens and
"xoxp-" for user tokens) instead of the confusing "xoxb-_" / "xoxp-_"; locate
the sentence that currently reads with "xoxb-_" and "xoxp-_" and replace those
token examples with "xoxb-" and "xoxp-" respectively and ensure the paragraph
still explains that the primitive expects a bot user OAuth token (xoxb-) and
validates scopes (chat:write, channels:history, groups:history) on first call,
throwing a typed error if missing.
- Around line 109-110: The spec text for the timeoutSeconds property is
contradictory: it currently says both "30 min default" and "required to set
explicitly"; choose one policy and make the spec and examples consistent: either
(A) make timeoutSeconds optional with a default of 1800 — remove the "required"
language, document "default: 1800 (30 min)" and ensure any schema/validation
(e.g., the property in the primitive definition) marks it optional, or (B) make
timeoutSeconds required — remove "default" wording, state "required" and update
examples to always supply a value and update validation to reject missing
timeoutSeconds; refer to the timeoutSeconds symbol in the spec and update the
comment and any schema/validation rules accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e85259c5-bd83-4f73-b508-a6573c41a27f

📥 Commits

Reviewing files that changed from the base of the PR and between 31aa3d2 and 10f1bfb.

📒 Files selected for processing (16)
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md
  • .trajectories/completed/2026-05/traj_2tqxnib25omk.json
  • .trajectories/completed/2026-05/traj_2tqxnib25omk.md
  • .trajectories/completed/2026-05/traj_m7mpv7j8n78h.json
  • .trajectories/completed/2026-05/traj_m7mpv7j8n78h.md
  • .trajectories/completed/2026-05/traj_o9cx33xn5u39.json
  • .trajectories/index.json
  • packages/sdk/src/__tests__/workflow-runner.test.ts
  • packages/sdk/src/index.ts
  • packages/sdk/src/workflows/README.md
  • packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts
  • packages/sdk/src/workflows/runner.ts
  • packages/sdk/src/workflows/schema.json
  • packages/sdk/src/workflows/types.ts
  • specs/slack-primitive.md
✅ Files skipped from review due to trivial changes (6)
  • .trajectories/completed/2026-05/traj_2tqxnib25omk.json
  • .trajectories/completed/2026-05/traj_m7mpv7j8n78h.json
  • .trajectories/completed/2026-05/traj_o9cx33xn5u39.json
  • packages/sdk/src/workflows/README.md
  • .trajectories/index.json
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/sdk/src/index.ts
  • packages/sdk/src/workflows/types.ts
  • packages/sdk/src/workflows/schema.json
  • packages/sdk/src/workflows/runner.ts

Comment thread .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md Outdated
Comment thread specs/slack-primitive.md Outdated
Comment thread specs/slack-primitive.md Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts (1)

32-32: ⚡ Quick win

Return cloned step rows from the DB stub to avoid false-positive tests.

getStepsByRunId currently returns live objects from steps. That can let code mutate persisted state without calling updateStep, which weakens this contract suite.

Suggested fix
-    getStepsByRunId: vi.fn(async (runId: string) => [...steps.values()].filter((step) => step.runId === runId)),
+    getStepsByRunId: vi.fn(async (runId: string) =>
+      [...steps.values()]
+        .filter((step) => step.runId === runId)
+        .map((step) => ({ ...step }))
+    ),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts`
at line 32, getStepsByRunId currently returns live objects from the in-memory
steps store, allowing tests to mutate persisted state without calling
updateStep; change the stubbed getStepsByRunId implementation (the vi.fn for
getStepsByRunId) to return cloned step rows (e.g., map matched rows to a deep
copy via structuredClone or JSON.parse(JSON.stringify(...))) instead of the
original objects from steps.values(), so any mutations in test code won't alter
the stored step entries and will force callers to use updateStep to persist
changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts`:
- Line 32: getStepsByRunId currently returns live objects from the in-memory
steps store, allowing tests to mutate persisted state without calling
updateStep; change the stubbed getStepsByRunId implementation (the vi.fn for
getStepsByRunId) to return cloned step rows (e.g., map matched rows to a deep
copy via structuredClone or JSON.parse(JSON.stringify(...))) instead of the
original objects from steps.values(), so any mutations in test code won't alter
the stored step entries and will force callers to use updateStep to persist
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9ca275b2-c414-40ed-9db1-d8babbf67f6b

📥 Commits

Reviewing files that changed from the base of the PR and between 10f1bfb and 3679aec.

📒 Files selected for processing (4)
  • .trajectories/completed/2026-05/traj_60qc24ufr96g.json
  • .trajectories/completed/2026-05/traj_60qc24ufr96g.md
  • .trajectories/index.json
  • packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts
✅ Files skipped from review due to trivial changes (3)
  • .trajectories/completed/2026-05/traj_60qc24ufr96g.json
  • .trajectories/completed/2026-05/traj_60qc24ufr96g.md
  • .trajectories/index.json

@khaliqgant

Copy link
Copy Markdown
Member Author

Addressed the remaining review feedback in 26a945b:

  • cloned step rows from the workflow reliability contract DB stub so tests cannot mutate persisted state by reference
  • corrected origin.agentrelay.net spelling in compacted trajectory markdown and JSON
  • clarified Slack token examples as xoxb-* / xoxp-*
  • made timeoutSeconds wording consistently explicit-required

Validation rerun locally:

  • npm --prefix packages/sdk run check
  • npx vitest run --root packages/sdk --config vitest.config.ts src/workflows/__tests__/workflow-reliability-contract.test.ts
  • git diff --check
  • compacted trajectory JSON parse check

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
.trajectories/index.json (1)

10-10: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize trajectory path values to repo-relative paths.

Lines like Line 10 and Line 256 embed machine-specific absolute paths, which leak user/environment details and make the index non-portable across dev/CI environments.

Suggested fix pattern
-      "path": "/Users/khaliqgant/Projects/AgentWorkforce/relay/.trajectories/completed/2026-05/traj_1775914133873_35667beb.json",
+      "path": ".trajectories/completed/2026-05/traj_1775914133873_35667beb.json",

-      "path": "/home/runner/work/relay/relay/.trajectories/completed/2026-04/traj_0t92gxaz6igh.json",
+      "path": ".trajectories/completed/2026-04/traj_0t92gxaz6igh.json",

Apply this normalization consistently to all changed path entries in this file.

Also applies to: 18-18, 26-26, 34-34, 42-42, 50-50, 58-58, 66-66, 74-74, 82-82, 90-90, 98-98, 106-106, 114-114, 129-129, 137-137, 145-145, 153-153, 161-161, 169-169, 177-177, 185-185, 193-193, 208-208, 216-216, 224-224, 232-232, 240-240, 248-248, 256-256, 263-263, 270-270, 277-277, 284-284

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.trajectories/index.json at line 10, The index.json contains absolute,
machine-specific "path" values (e.g., entries like
"/Users/khaliqgant/Projects/AgentWorkforce/relay/.trajectories/...") which must
be normalized to repo-relative paths; update every "path" value in
.trajectories/index.json (all entries referenced in the review) to be relative
to the repository root (for example
".trajectories/completed/2026-05/traj_1775914133873_35667beb.json") ensuring the
normalization is applied consistently to each changed "path" entry.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specs/slack-primitive.md`:
- Line 150: Clarify the DM validation behavior for askQuestion: update the
statement about askQuestion against a DM so it consistently says that currently
askQuestion rejects/throws when invoked on a DM unless the workspace has granted
the im:history scope, and that full DM support (including explicit validation/UX
changes) will be implemented in v2; specifically amend the sentence that
currently asserts a hard throw ("askQuestion against a DM throws at validation
time") to state the conditional behavior and/or note that v2 will remove the
restriction, and also adjust the later line that asks whether to support DMs to
reflect this current conditional behavior rather than implying a contradiction
with existing validation. Ensure references to askQuestion, im:history, and
cloud-runtime remain in the text so readers can locate the described behavior.
- Line 280: The "Question idempotency on retry" entry is marked as an open
question but the spec already prescribes stashing (questionTs, runId, stepName)
in the workflow run record before the polling loop; update the "Open questions"
section to either remove that bullet or clarify the exact unresolved item (e.g.,
"Implementation approach defined: stash (questionTs, runId, stepName) in run
record; pending: confirm run-record schema changes and migration strategy"), and
reference the existing resumability guidance and the (questionTs, runId,
stepName) tuple so readers can reconcile the duplicate guidance.
- Around line 93-94: The spec is ambiguous about whether an unknown mention
causes the step to fail; update the wording to state that mentions resolved via
users.lookupByEmail or the user-cache that are not found do not fail the step:
the step completes with a success status but records a typed soft error
SlackPostBackError(unknown_mention) in the step output metadata so downstream
steps can detect it; explicitly state where to find it (e.g.,
stepResult.output.errors or stepResult.metadata.postBackErrors) and that
consumers should check for SlackPostBackError(unknown_mention) to treat it as a
warning rather than a hard failure (same soft-error behavior as
github-primitive).
- Line 52: The spec's scope list is incomplete: add reactions:write and
im:history to the documented scopes and clearly separate "required" (e.g.,
chat:write, channels:history, groups:history) vs "optional/for-DMs/reactions"
(e.g., im:history, reactions:write) in specs/slack-primitive.md, and update the
primitive's scope-validation logic (the validateScopes / first-call scope check)
to enforce only required scopes while warning or listing optional scopes in the
typed error message so missing optional scopes are not treated as fatal.

---

Duplicate comments:
In @.trajectories/index.json:
- Line 10: The index.json contains absolute, machine-specific "path" values
(e.g., entries like
"/Users/khaliqgant/Projects/AgentWorkforce/relay/.trajectories/...") which must
be normalized to repo-relative paths; update every "path" value in
.trajectories/index.json (all entries referenced in the review) to be relative
to the repository root (for example
".trajectories/completed/2026-05/traj_1775914133873_35667beb.json") ensuring the
normalization is applied consistently to each changed "path" entry.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7fb186db-0fe1-4316-98eb-a321be75261c

📥 Commits

Reviewing files that changed from the base of the PR and between 3679aec and 26a945b.

📒 Files selected for processing (7)
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.md
  • .trajectories/completed/2026-05/traj_vkozdglobkyg.json
  • .trajectories/completed/2026-05/traj_vkozdglobkyg.md
  • .trajectories/index.json
  • packages/sdk/src/workflows/__tests__/workflow-reliability-contract.test.ts
  • specs/slack-primitive.md
✅ Files skipped from review due to trivial changes (3)
  • .trajectories/completed/2026-05/traj_vkozdglobkyg.json
  • .trajectories/completed/2026-05/traj_vkozdglobkyg.md
  • .trajectories/compacted/compact_j5u7qhaw4q6a_2026-05-08.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/sdk/src/workflows/tests/workflow-reliability-contract.test.ts

Comment thread specs/slack-primitive.md

Cloud's lambda already wires `Resource.NangoSecretKey.value` and resolves `(workspaceId, provider) → connectionId`. The Slack primitive's `cloud-runtime` reuses that resolver — no new resource binding.

For Slack, we expect the connection to be a **bot user OAuth token** (`xoxb-*`), not user-token (`xoxp-*`). Posting and reading replies both work with `chat:write`, `channels:history`, and `groups:history` scopes. The primitive validates scopes on first call and throws a typed error early if they're missing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify that the scope list is complete.

The spec lists chat:write, channels:history, and groups:history as required scopes. However, line 139 describes adding a :eyes: reaction, which typically requires the reactions:write scope. Additionally, line 276 mentions im:history for potential DM support. Consider explicitly documenting all scopes that may be needed for the full feature set, or clarifying which scopes are required vs. optional for different operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/slack-primitive.md` at line 52, The spec's scope list is incomplete:
add reactions:write and im:history to the documented scopes and clearly separate
"required" (e.g., chat:write, channels:history, groups:history) vs
"optional/for-DMs/reactions" (e.g., im:history, reactions:write) in
specs/slack-primitive.md, and update the primitive's scope-validation logic (the
validateScopes / first-call scope check) to enforce only required scopes while
warning or listing optional scopes in the typed error message so missing
optional scopes are not treated as fatal.

Comment thread specs/slack-primitive.md
Comment on lines +93 to +94

- **Mentions are resolved before send.** `@khaliq` is looked up via `users.lookupByEmail` or the user-cache; if not found, the message still posts but a typed `SlackPostBackError(unknown_mention)` is logged on the step output. This is the same "fail soft on cosmetic errors, fail hard on real errors" pattern as github-primitive.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify error-handling semantics for unknown_mention.

Line 94 states that when a mention isn't found, "a typed SlackPostBackError(unknown_mention) is logged on the step output" and "the message still posts." It's unclear whether the step succeeds or fails from the workflow's perspective. Does the step complete successfully with error metadata attached, or does it fail? Please specify the step's final status and how downstream steps can access or check for this soft error.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/slack-primitive.md` around lines 93 - 94, The spec is ambiguous about
whether an unknown mention causes the step to fail; update the wording to state
that mentions resolved via users.lookupByEmail or the user-cache that are not
found do not fail the step: the step completes with a success status but records
a typed soft error SlackPostBackError(unknown_mention) in the step output
metadata so downstream steps can detect it; explicitly state where to find it
(e.g., stepResult.output.errors or stepResult.metadata.postBackErrors) and that
consumers should check for SlackPostBackError(unknown_mention) to treat it as a
warning rather than a hard failure (same soft-error behavior as
github-primitive).

Comment thread specs/slack-primitive.md

- **Workflows must be allowed to block on external input.** The runner already supports long-running steps (verification gates, sandbox bootstraps), so this is reusing existing plumbing — not inventing new lifecycle.
- **The step must be resumable.** If the workflow crashes between posting the question and receiving the answer, the resumed run must find the existing question (by run-id-tagged metadata in the message) and continue waiting from there, not re-ask. Implementation: stash `(questionTs, runId, stepName)` in the workflow run record before the polling loop starts; on resume, look up the row and rejoin the poll.
- **The channel's history must include the question.** This means cloud-runtime cannot use private DMs (the bot can't read DM history without `im:history` scope and that scope is rarely granted). `askQuestion` against a DM throws at validation time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve the DM validation inconsistency.

Line 150 states "askQuestion against a DM throws at validation time," but line 276 asks "Should askQuestion to a DM be supported when im:history is granted? Probably yes...Defer to v2." These statements contradict each other.

Clarify whether:

  • (A) DM usage currently throws an error (as line 150 states), and v2 will add support when the scope is present, or
  • (B) DM support is already conditional based on scope availability

Update one of these sections to remove the ambiguity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/slack-primitive.md` at line 150, Clarify the DM validation behavior for
askQuestion: update the statement about askQuestion against a DM so it
consistently says that currently askQuestion rejects/throws when invoked on a DM
unless the workspace has granted the im:history scope, and that full DM support
(including explicit validation/UX changes) will be implemented in v2;
specifically amend the sentence that currently asserts a hard throw
("askQuestion against a DM throws at validation time") to state the conditional
behavior and/or note that v2 will remove the restriction, and also adjust the
later line that asks whether to support DMs to reflect this current conditional
behavior rather than implying a contradiction with existing validation. Ensure
references to askQuestion, im:history, and cloud-runtime remain in the text so
readers can locate the described behavior.

Comment thread specs/slack-primitive.md
- **Slack Connect / shared channels.** The primitive should treat shared channels exactly like internal ones — the bot just needs to be invited. Need to verify Nango's Slack provider exposes them correctly.
- **Audit trail.** Cloud should write every `askQuestion` exchange to the workflow run record so post-mortems can see what the agent asked and how the human answered. This is straightforward but needs schema work; out of scope for the primitive itself.
- **Default channel resolution.** If the workflow doesn't specify a channel, should the primitive default to the workspace's "wf-default" channel? I think no — the workflow author should be explicit. But cloud could surface the default as `Resource.SlackDefaultChannel.value` for convenience.
- **Question idempotency on retry.** When a step retries (e.g. `retries: 2`), the second attempt should _not_ re-ask. The primitive should check the channel for an existing question with the same `(runId, stepName)` tag and resume waiting. Mentioned above under resumability — calling out here as the same mechanism.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Clarify whether question idempotency is settled or open.

Line 280 lists "Question idempotency on retry" as an open question, noting it's "Mentioned above under resumability — calling out here as the same mechanism." However, line 149 already provides specific implementation guidance: "stash (questionTs, runId, stepName) in the workflow run record before the polling loop starts."

If the implementation approach is already defined, remove this from "Open questions" or clearly state what aspect remains unresolved (e.g., "Implementation approach is defined; pending: confirmation of run-record schema changes").

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@specs/slack-primitive.md` at line 280, The "Question idempotency on retry"
entry is marked as an open question but the spec already prescribes stashing
(questionTs, runId, stepName) in the workflow run record before the polling
loop; update the "Open questions" section to either remove that bullet or
clarify the exact unresolved item (e.g., "Implementation approach defined: stash
(questionTs, runId, stepName) in run record; pending: confirm run-record schema
changes and migration strategy"), and reference the existing resumability
guidance and the (questionTs, runId, stepName) tuple so readers can reconcile
the duplicate guidance.

@khaliqgant khaliqgant merged commit 0e536f4 into main May 8, 2026
41 checks passed
@khaliqgant khaliqgant deleted the codex/repairable-workflow-failures branch May 8, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant