fix(runtime): refresh cloud API token in-process for long-running agents (cloud#2307 Half 2) by khaliqgant · Pull Request #246 · AgentWorkforce/workforce

khaliqgant · 2026-06-19T07:30:13Z

What

Half 2 of the cloud#2307 fix — the consumer that actually rotates the proactive sandbox's cloud API token before it expires. Stacks on cloud#2313 (Half 1, which mints the token as a refreshable pair + injects the refresh env vars).

Why

/api/v1/slack/post-message is authenticated with CLOUD_API_ACCESS_TOKEN — a relayfile-audience JWT minted once at sandbox creation with a ~2h TTL and no in-sandbox refresh. Posts within 2h work; a long-running proactive agent that posts after the TTL (e.g. this incident, where the Slack post fired only after a multi-hour DB recovery) sends an expired-but-validly-signed bearer → verifier returns null → wholesale 401. Latent expiry bug, not a regression.

How

New packages/runtime/src/cloud-api-token-refresh.ts:
- ensureFreshCloudApiToken(env) — refreshes via POST ${CLOUD_API_REFRESH_URL} (relayauth /v1/tokens/refresh) only when the access token is within the 5-min skew of expiry; rotates-and-persists both CLOUD_API_ACCESS_TOKEN and CLOUD_API_TOKEN (the relayflows adapter env fallback) in place; serialized single-in-flight (relayauth refresh tokens are single-use — reuse cascade-revokes the session); throws CloudApiTokenHorizonError on a 401 (delegation horizon elapsed → out-of-band re-mint).
- startCloudApiTokenRefresher(opts) — background loop (setTimeout, unref'd), tick < skew.
Wired into startRunner (runner.ts): the refresher runs inside the node runner.mjs subprocess where the relayflows Slack client is constructed and reads process.env.CLOUD_API_TOKEN. (A parent-process refresher would NOT reach it — env is copied at spawn.) Guarded on CLOUD_API_REFRESH_URL, stopped in a finally.

Placement note

The persona's Slack client (relayflows SlackStepExecutor → new SlackClient) is built per step in the runner subprocess and resolves the token config.cloudApiToken ?? env.CLOUD_API_TOKEN. Keeping that subprocess's process.env fresh means each per-step client picks up a live token — no relayflows change, no token-provider needed.

Testing

10 new unit tests (fresh-passthrough, within-skew, expired, no-material, concurrent-serialization=single rotation, 401-horizon, non-401 failure, loop rotate/stop/horizon).
Full runtime package suite 123/123 pass; tsc --noEmit clean.

Sequencing

No-op without cloud#2313's injected refresh vars, so it's safe to merge in either order, but the 401 only fully clears once both land + deploy.

🤖 Generated with Claude Code

…nts (cloud#2307) The proactive sandbox's CLOUD_API_ACCESS_TOKEN is a relayfile-audience JWT minted once at launch with a ~2h TTL and no in-sandbox refresh. A long-running proactive agent that posts to Slack after the TTL lapses sends an expired-but-validly-signed bearer; the cloud verifier returns null and /api/v1/slack/post-message 401s wholesale. Add `ensureFreshCloudApiToken` + `startCloudApiTokenRefresher` and start the refresher in `startRunner` (the `node runner.mjs` subprocess where the relayflows Slack client is built and reads process.env.CLOUD_API_TOKEN). The refresher: - ticks under the 5-min skew, refreshing only within the window; - rotates-and-persists both CLOUD_API_ACCESS_TOKEN and CLOUD_API_TOKEN (the relayflows adapter env fallback) in place; - serializes a single in-flight refresh (relayauth refresh tokens are single-use; reuse cascade-revokes the session); - stops + signals on delegation-horizon (refresh 401) for out-of-band re-mint. Stacks on cloud#2313, which mints the token as a refreshable pair and injects CLOUD_API_REFRESH_TOKEN / CLOUD_API_ACCESS_TOKEN_EXPIRES_AT / CLOUD_API_REFRESH_URL. No-op when that refresh material is absent (local/dev). 10 new unit tests; full runtime suite 123/123, tsc --noEmit clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-19T07:30:22Z

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 8 minutes. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 735fc9d2-f9d3-44c8-9337-f067dc6bd628

📥 Commits

Reviewing files that changed from the base of the PR and between eea7446 and 967b8d6.

📒 Files selected for processing (3)

packages/runtime/src/cloud-api-token-refresh.test.ts
packages/runtime/src/cloud-api-token-refresh.ts
packages/runtime/src/runner.ts

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/2307-workforce-cloud-api-token-refresh

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a mechanism to keep the cloud API token fresh for the proactive sandbox runner by implementing an in-place token refresher and a background loop. The reviewer's feedback highlights several important improvements: performing an initial token refresh check immediately on startup to prevent using an expired token during the first 60 seconds, using a WeakMap keyed by the environment object to isolate in-flight refresh promises and prevent cross-contamination in concurrent or multi-tenant environments, and adding a timeout to the token refresh fetch request to avoid blocking indefinitely if the authentication server hangs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-19T07:31:43Z

 import { buildCtx, type CtxBuildOptions } from './ctx.js';
 import { getTrajectoryRecorder, type TrajectoryRecorder } from './trajectory.js';
 import { isWorkforceHandler } from './handler.js';
+import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js';


Import ensureFreshCloudApiToken and CloudApiTokenHorizonError to perform an initial token refresh check on startup.

Suggested change

import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js';

import { ensureFreshCloudApiToken, startCloudApiTokenRefresher, CloudApiTokenHorizonError } from './cloud-api-token-refresh.js';

gemini-code-assist · 2026-06-19T07:31:43Z

+  const tokenRefresher = process.env.CLOUD_API_REFRESH_URL
+    ? startCloudApiTokenRefresher({
+        onHorizonElapsed: (err) =>
+          ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }),
+        onError: (err) =>
+          ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
+            error: err instanceof Error ? err.message : String(err)
+          })
+      })
+    : null;


If the runner is cold-started and the token in the environment is already expired or near expiry, the background refresher (which ticks only after 60 seconds by default) will leave a 60-second window where the runner uses an expired token and fails with 401. Performing an initial refresh check immediately on startup ensures the token is fresh before any envelopes are processed. Additionally, guard both the initial refresh and the background refresher on both CLOUD_API_REFRESH_URL and CLOUD_API_REFRESH_TOKEN to avoid unnecessary work if the refresh token is missing.

if (process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN) { try { await ensureFreshCloudApiToken(); } catch (err) { if (err instanceof CloudApiTokenHorizonError) { ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }); } else { ctx.log('warn', 'runner.cloud-api-token.refresh-failed', { error: err instanceof Error ? err.message : String(err) }); } } } const tokenRefresher = process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN ? startCloudApiTokenRefresher({ onHorizonElapsed: (err) => ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }), onError: (err) => ctx.log('warn', 'runner.cloud-api-token.refresh-failed', { error: err instanceof Error ? err.message : String(err) }) }) : null;

gemini-code-assist · 2026-06-19T07:31:43Z

+// The sandbox holds one cloud API identity, so a process-wide guard is correct:
+// concurrent posters await the SAME rotation rather than each spending the
+// single-use refresh token (which would cascade-revoke the session).
+let inFlightRefresh: Promise<string> | null = null;


Using a single global inFlightRefresh promise can cause cross-contamination if multiple runners or environments are executed in the same process (e.g., in concurrent test suites or multi-tenant environments). Using a WeakMap keyed by the env object isolates the in-flight refresh promises per environment.

Suggested change

let inFlightRefresh: Promise<string> | null = null;

let inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>();

gemini-code-assist · 2026-06-19T07:31:44Z

+  if (!inFlightRefresh) {
+    const fetchImpl = opts.fetchImpl ?? fetch;
+    inFlightRefresh = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => {
+      inFlightRefresh = null;
+    });
+  }
+  return inFlightRefresh;


Update the in-flight refresh resolution to use the new inFlightRefreshes WeakMap to ensure proper isolation per environment.

let inFlight = inFlightRefreshes.get(env); if (!inFlight) { const fetchImpl = opts.fetchImpl ?? fetch; inFlight = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => { inFlightRefreshes.delete(env); }); inFlightRefreshes.set(env, inFlight); } return inFlight;

gemini-code-assist · 2026-06-19T07:31:44Z

+  const response = await fetchImpl(refreshUrl, {
+    method: 'POST',
+    headers: { 'content-type': 'application/json' },
+    body: JSON.stringify({ refreshToken })
+  });


The fetch request to refresh the token does not have a timeout. If the auth server hangs or is extremely slow, this request will block indefinitely, preventing any concurrent or subsequent callers from proceeding. Adding a timeout using AbortSignal.timeout prevents this issue.

Suggested change

const response = await fetchImpl(refreshUrl, {

method: 'POST',

headers: { 'content-type': 'application/json' },

body: JSON.stringify({ refreshToken })

});

const response = await fetchImpl(refreshUrl, {

method: 'POST',

headers: { 'content-type': 'application/json' },

body: JSON.stringify({ refreshToken }),

signal: AbortSignal.timeout(15000)

});

gemini-code-assist · 2026-06-19T07:31:44Z

+export function resetInFlightRefreshForTests(): void {
+  inFlightRefresh = null;
+}


Update the test helper to reset the inFlightRefreshes WeakMap instead of the single global variable.

Suggested change

export function resetInFlightRefreshForTests(): void {

inFlightRefresh = null;

}

export function resetInFlightRefreshForTests(): void {

inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>();

}

agent-relay-code · 2026-06-19T07:33:49Z

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 967b8d6f74

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-19T07:33:55Z

+  if (accessToken && expiresAt) {
+    const expiresMs = Date.parse(expiresAt);
+    if (Number.isFinite(expiresMs) && expiresMs - now > skewMs) {
+      return accessToken;


Keep CLOUD_API_TOKEN in sync on fresh tokens

When the access token is still outside the skew window, this early return leaves CLOUD_API_TOKEN untouched. In runtimes where only CLOUD_API_ACCESS_TOKEN is injected, or where CLOUD_API_TOKEN still contains an older value, the runner's relayflows path that reads process.env.CLOUD_API_TOKEN continues to use a missing/stale bearer until the first refresh near expiry. Copy the fresh access token into the alias before returning, or include the alias in the freshness check.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-19T07:33:55Z

+    schedule();
+  }
+
+  schedule();


Refresh once before waiting for the first interval

Starting the refresher only schedules the first check after intervalMs, so if the runner process starts with a token already expired or inside the skew window (for example after restarting in a long-lived sandbox), startRunner can dispatch an envelope and post to Slack during this initial 60s gap with the old bearer. Kick off tick() immediately, or refresh before the dispatch loop begins, so startup state is covered too.

Useful? React with 👍 / 👎.

agent-relay-code · 2026-06-19T07:37:33Z

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

khaliqgant · 2026-06-19T07:44:20Z

Closing: placement disproven. The cloud#2307 401 is on /api/v1/slack/post-message, whose only caller is relayflows SlackCloudRelayClient (cloud-relay-runtime.ts:119), running sandbox-side in the relayflows WorkflowRunner = cloud bootstrap-inner.mjs — NOT the workforce startRunner process. Workforce personas emit Slack via VFS writeback (ctx.ts:176-181; persona slack-comms 'never by calling an API'); startRunner has no SlackCloudRelayClient/CLOUD_API_TOKEN caller, so refreshing env there reaches nothing. Half-2 moves to cloud bootstrap-inner.mjs (extend refreshExpiringCredentials():811 to refresh the cloud-API token within-skew, not pre-run-only). The refresh logic from this PR (cloud-api-token-refresh.ts) is portable to that home. Tracked with cloud#2313 (Half-1).

agent-relay-code · 2026-06-19T07:44:27Z

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

agent-relay-code · 2026-06-19T07:44:49Z

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

agent-relay-code · 2026-06-19T07:55:07Z

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

I made no file edits — the PR is mechanically clean (build, lint, typecheck all pass) and the logic is sound, so there was nothing to auto-fix. Here is my review.

Review: PR #246 — `fix(runtime): refresh cloud API token in-process for long-running agents`

Summary

Adds an in-process cloud API token refresher to packages/runtime so long-running proactive agents don't 401 after their ~2h bearer expires (cloud#2307). New module cloud-api-token-refresh.ts + tests, wired into startRunner with a background loop that rotates-and-persists the CLOUD_API_* env vars in place. The scope is tight: only packages/runtime is touched, and the new module is consumed only by runner.ts.

Verification (the way CI runs it)

Ran the canonical CI pipeline from .github/workflows/ci.yml (pnpm install --frozen-lockfile → pnpm -r run build → pnpm run lint → pnpm run typecheck → pnpm run test):

Build: pass
Lint: pass
Typecheck: pass (including examples)
Test: packages/runtime 105/105 pass, including all 11 new token-refresh tests. All other packages pass except packages/cli (2 failures).

The 2 CLI test failures are environmental, not a regression

Both cli.test.js and launch-metadata.test.js fail at import with:

@relayburn/sdk: native binding not found for linux-x64.
Cannot find module '@relayburn/sdk-linux-x64-gnu'

This is a missing native optional-dependency in this sandbox, unrelated to the PR (which never touches packages/cli or @relayburn/sdk). On real CI with the prebuilt binding installed these pass. I did not edit anything to "fix" this — doing so would be out of scope and would risk weakening a test. Treat CI's CLI job as authoritative.

Correctness assessment (traced across consumers)

ctx.cloudApi.token is a live getter reading process.env.CLOUD_API_ACCESS_TOKEN on each access (packages/runtime/src/ctx.ts:235-236, ctx.ts:264), so in-place mutation is picked up — the design is sound.
The dual-write of both CLOUD_API_ACCESS_TOKEN (the ctx.cloudApi reader) and CLOUD_API_TOKEN (the relayflows adapter env fallback) at cloud-api-token-refresh.ts:146-147 correctly keeps both consumer paths in sync.
Fail-closed behavior is preserved: refresh failure throws and leaves env untouched (cloud-api-token-refresh.ts:126-149); 401 raises CloudApiTokenHorizonError and stops the loop rather than silently succeeding. No fail-open conversion.
The single-flight guard (inFlightRefresh) correctly prevents concurrent callers from double-spending the single-use rotating refresh token.
The runner.ts change wraps the existing dispatch loop in try/finally purely to stop() the timer this PR introduces; the dispatch loop body (envelope shim, continue, dispatch, end-log) is byte-for-byte unchanged aside from indentation. This is the PR's own resource teardown, not a change to pre-existing lifecycle/reaper/dispatch behavior.

Addressed comments

No human reviews, approvals, or bot comments were available to validate: gh CLI is not installed in this environment and .workforce/context.json contains no review/comment payload. I therefore could not enumerate or reproduce any prior threads. If review threads exist on GitHub, they were not provided to this run.

Advisory Notes

These are observations only — no code changed, and none are required for this PR:

cloud-api-token-refresh.ts:307 (inFlightRefresh) is module-global while ensureFreshCloudApiToken accepts a per-call env. If two different env bags were ever refreshed concurrently in one process, the later caller would await the earlier env's rotation. This is safe today (one identity per process / one process.env) and is documented in the file header; flagging only for future multi-tenant reuse.
The header comment (cloud-api-token-refresh.ts:16-19) describes the upstream /v1/tokens/agent provisioning of CLOUD_API_REFRESH_URL; the code itself just POSTs to whatever URL is configured. No action needed — noting that the comment's contract depends on how the env var is minted outside this repo.

Conclusion

The change is well-scoped, type-safe, lint-clean, and the runtime test suite (including new tests) passes in full. The only red CI signal I could reproduce locally is the pre-existing @relayburn/sdk native-binding gap in the CLI tests, which is environmental and unrelated to this PR.

I'm not printing READY: I could not confirm the live state of GitHub's required checks or mergeability from this environment (no gh, no check/review metadata), and the CLI test job's real-CI status is outside what I can verify here. A human should confirm the GitHub CI run is green (the CLI binding present) and that there are no outstanding review threads before merge.

agent-relay-code · 2026-06-19T07:59:40Z

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

agent-relay-code · 2026-06-19T08:07:34Z

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

agent-relay-code · 2026-06-19T08:07:54Z

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

gemini-code-assist Bot reviewed Jun 19, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 19, 2026

View reviewed changes

khaliqgant closed this Jun 19, 2026

	import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js';
	import { ensureFreshCloudApiToken, startCloudApiTokenRefresher, CloudApiTokenHorizonError } from './cloud-api-token-refresh.js';

	let inFlightRefresh: Promise<string> \| null = null;
	let inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>();

Conversation

khaliqgant commented Jun 19, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How

Placement note

Testing

Sequencing

Uh oh!

coderabbitai Bot commented Jun 19, 2026

Review limit reached

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

khaliqgant commented Jun 19, 2026

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Review: PR #246 — fix(runtime): refresh cloud API token in-process for long-running agents

Summary

Verification (the way CI runs it)

The 2 CLI test failures are environmental, not a regression

Correctness assessment (traced across consumers)

Addressed comments

Advisory Notes

Conclusion

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

agent-relay-code Bot commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

khaliqgant commented Jun 19, 2026 •

edited by cubic-dev-ai Bot

Loading

Review: PR #246 — `fix(runtime): refresh cloud API token in-process for long-running agents`