Closed
5 changes: 5 additions & 0 deletions .changeset/replay-specversion-probe.md
@@ -0,0 +1,5 @@
---
'@workflow/web': patch
---

Replay/Re-run probes the target deployment's specVersion via health check before recreating the run, so the correct queue transport (JSON for old deployments, CBOR for new) is used. Falls back to the original run's specVersion if the probe fails.
24 changes: 24 additions & 0 deletions packages/web/app/server/workflow-server-actions.server.ts
@@ -815,11 +815,35 @@ export async function recreateRun(
): Promise<ServerActionResult<string>> {
try {
const world = await getWorldFromEnv({ ...worldEnv });

// Probe the target deployment's specVersion via health check so we use
// the correct queue transport (JSON for old deployments, CBOR for new).
// Falls back to the run's specVersion inside recreateRunFromExisting
// if the probe fails (e.g. old deployment without health check support).
let specVersion: number | undefined;
try {
let targetDeploymentId = deploymentId;
Member

Nit: when deploymentId is not provided, this calls world.runs.get(runId) to get the run's deploymentId, and then recreateRunFromExisting internally calls world.runs.get(runId, { resolveData: 'all' }) again a moment later. Not a correctness issue, just a duplicated round-trip. Could be avoided by fetching the run once here and passing both run and the resolved ID through, but that would require a larger signature change on recreateRunFromExisting. Probably not worth it for a non-hot path.
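For illustration only, the single-fetch idea could look roughly like the sketch below. `RunLike`, `RunsLike`, and `resolveReplayTarget` are hypothetical names invented for this example, not part of the codebase, and the sketch assumes `recreateRunFromExisting` would be changed to accept the already-fetched run.

```typescript
// Hypothetical minimal shapes; the real world/run types are richer.
interface RunLike {
  deploymentId: string;
  specVersion?: number;
}

interface RunsLike {
  get(runId: string): Promise<RunLike>;
}

// Fetch the run once and return both the run and the resolved deployment ID,
// so a caller can pass them on instead of fetching the run a second time.
async function resolveReplayTarget(
  runs: RunsLike,
  runId: string,
  deploymentId?: string
): Promise<{ run: RunLike; targetDeploymentId: string }> {
  const run = await runs.get(runId);
  // An explicit deploymentId wins; otherwise use the original run's.
  return { run, targetDeploymentId: deploymentId || run.deploymentId };
}
```

The trade-off is that this version always fetches the run, even when `deploymentId` is supplied, because the run itself is handed on to the recreate step.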

if (!targetDeploymentId) {
const run = await world.runs.get(runId, { resolveData: 'none' });
targetDeploymentId = run.deploymentId;
}
const hc = await healthCheck(world, 'workflow', {
deploymentId: targetDeploymentId,
timeout: 10_000,
});
if (hc.healthy && hc.specVersion != null) {
specVersion = hc.specVersion;
}
} catch {
Member

Latency concern: this adds up to 10s of UI wait for the exact case the PR is trying to help.

The healthCheck() implementation polls for a response until timeout (helpers.ts:254-294). An old deployment that doesn't recognize the __wkf_workflow_health_check queue topic will silently drop the message — there's no fast-fail signal, so the probe will wait the full 10s before returning healthy: false.

This means: a user clicking "Replay" on a run from a pre-health-check deployment pays 10 seconds of UI latency before the replay even starts. That's the exact scenario the PR needs to handle gracefully (old deployment → JSON transport), but it's now the slowest case.

Options to consider:

  1. Shorter timeout (e.g. 3–4s): if the deployment is alive and supports health check, it typically responds in < 500ms, so 10s is overkill. 3s should be safe.

  2. Version-gate the probe: only probe if run.specVersion suggests the deployment might have been upgraded. If run.specVersion >= SPEC_VERSION_SUPPORTS_HEALTH_CHECK, probe; otherwise skip. (Would need to add that constant.)

  3. Non-blocking probe with a short budget: race the probe against a short timeout (say 2s); if it doesn't resolve in time, fall back. The successful path stays fast, the failure path isn't punished.

Option 1 is the simplest and probably sufficient.
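
If option 3 were preferred instead, a minimal sketch of the race might look like this. `withBudget` is a hypothetical helper made up for this example and does not exist in the codebase; it is only one possible way to bound the wait.

```typescript
// Hypothetical helper: resolve with the probe's value, or with `fallback`
// if the probe rejects or does not settle within `budgetMs`.
function withBudget<T>(
  probe: Promise<T>,
  budgetMs: number,
  fallback: T
): Promise<T> {
  return new Promise<T>((resolve) => {
    // Budget expired: give up waiting and use the fallback.
    const timer = setTimeout(() => resolve(fallback), budgetMs);
    probe.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        // A rejected probe is treated the same as a timed-out one.
        clearTimeout(timer);
        resolve(fallback);
      }
    );
  });
}
```

In `recreateRun` this could wrap the existing `healthCheck(...)` call with a 2s budget and `undefined` as the fallback. Note that the sketch only stops waiting: the underlying probe keeps polling in the background until its own timeout, so the `timeout` option passed to `healthCheck` would still be worth lowering.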

// Health check failed — fall back to run's specVersion.
}

const newRunId = await workflowRunHelpers.recreateRunFromExisting(
world,
runId,
{
deploymentId,
specVersion,
}
);
return createResponse(newRunId);