Skip to content

fix(runtime): refresh cloud API token in-process for long-running agents (cloud#2307 Half 2)#246

Closed
khaliqgant wants to merge 1 commit into
mainfrom
fix/2307-workforce-cloud-api-token-refresh
Closed

fix(runtime): refresh cloud API token in-process for long-running agents (cloud#2307 Half 2)#246
khaliqgant wants to merge 1 commit into
mainfrom
fix/2307-workforce-cloud-api-token-refresh

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 19, 2026

Copy link
Copy Markdown
Member

What

Half 2 of the cloud#2307 fix — the consumer that actually rotates the proactive sandbox's cloud API token before it expires. Stacks on cloud#2313 (Half 1, which mints the token as a refreshable pair + injects the refresh env vars).

Why

/api/v1/slack/post-message is authenticated with CLOUD_API_ACCESS_TOKEN — a relayfile-audience JWT minted once at sandbox creation with a ~2h TTL and no in-sandbox refresh. Posts within 2h work; a long-running proactive agent that posts after the TTL (e.g. this incident, where the Slack post fired only after a multi-hour DB recovery) sends an expired-but-validly-signed bearer → verifier returns null → wholesale 401. Latent expiry bug, not a regression.

How

  • New packages/runtime/src/cloud-api-token-refresh.ts:
    • ensureFreshCloudApiToken(env) — refreshes via POST ${CLOUD_API_REFRESH_URL} (relayauth /v1/tokens/refresh) only when the access token is within the 5-min skew of expiry; rotates-and-persists both CLOUD_API_ACCESS_TOKEN and CLOUD_API_TOKEN (the relayflows adapter env fallback) in place; serialized single-in-flight (relayauth refresh tokens are single-use — reuse cascade-revokes the session); throws CloudApiTokenHorizonError on a 401 (delegation horizon elapsed → out-of-band re-mint).
    • startCloudApiTokenRefresher(opts) — background loop (setTimeout, unref'd), tick < skew.
  • Wired into startRunner (runner.ts): the refresher runs inside the node runner.mjs subprocess where the relayflows Slack client is constructed and reads process.env.CLOUD_API_TOKEN. (A parent-process refresher would NOT reach it — env is copied at spawn.) Guarded on CLOUD_API_REFRESH_URL, stopped in a finally.

Placement note

The persona's Slack client (relayflows SlackStepExecutornew SlackClient) is built per step in the runner subprocess and resolves the token config.cloudApiToken ?? env.CLOUD_API_TOKEN. Keeping that subprocess's process.env fresh means each per-step client picks up a live token — no relayflows change, no token-provider needed.

Testing

  • 10 new unit tests (fresh-passthrough, within-skew, expired, no-material, concurrent-serialization=single rotation, 401-horizon, non-401 failure, loop rotate/stop/horizon).
  • Full runtime package suite 123/123 pass; tsc --noEmit clean.

Sequencing

No-op without cloud#2313's injected refresh vars, so it's safe to merge in either order, but the 401 only fully clears once both land + deploy.

🤖 Generated with Claude Code

Review in cubic

…nts (cloud#2307)

The proactive sandbox's CLOUD_API_ACCESS_TOKEN is a relayfile-audience JWT
minted once at launch with a ~2h TTL and no in-sandbox refresh. A long-running
proactive agent that posts to Slack after the TTL lapses sends an
expired-but-validly-signed bearer; the cloud verifier returns null and
/api/v1/slack/post-message 401s wholesale.

Add `ensureFreshCloudApiToken` + `startCloudApiTokenRefresher` and start the
refresher in `startRunner` (the `node runner.mjs` subprocess where the relayflows
Slack client is built and reads process.env.CLOUD_API_TOKEN). The refresher:
- ticks under the 5-min skew, refreshing only within the window;
- rotates-and-persists both CLOUD_API_ACCESS_TOKEN and CLOUD_API_TOKEN
  (the relayflows adapter env fallback) in place;
- serializes a single in-flight refresh (relayauth refresh tokens are
  single-use; reuse cascade-revokes the session);
- stops + signals on delegation-horizon (refresh 401) for out-of-band re-mint.

Stacks on cloud#2313, which mints the token as a refreshable pair and injects
CLOUD_API_REFRESH_TOKEN / CLOUD_API_ACCESS_TOKEN_EXPIRES_AT / CLOUD_API_REFRESH_URL.
No-op when that refresh material is absent (local/dev).

10 new unit tests; full runtime suite 123/123, tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 8 minutes. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 735fc9d2-f9d3-44c8-9337-f067dc6bd628

📥 Commits

Reviewing files that changed from the base of the PR and between eea7446 and 967b8d6.

📒 Files selected for processing (3)
  • packages/runtime/src/cloud-api-token-refresh.test.ts
  • packages/runtime/src/cloud-api-token-refresh.ts
  • packages/runtime/src/runner.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2307-workforce-cloud-api-token-refresh

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to keep the cloud API token fresh for the proactive sandbox runner by implementing an in-place token refresher and a background loop. The reviewer's feedback highlights several important improvements: performing an initial token refresh check immediately on startup to prevent using an expired token during the first 60 seconds, using a WeakMap keyed by the environment object to isolate in-flight refresh promises and prevent cross-contamination in concurrent or multi-tenant environments, and adding a timeout to the token refresh fetch request to avoid blocking indefinitely if the authentication server hangs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

import { buildCtx, type CtxBuildOptions } from './ctx.js';
import { getTrajectoryRecorder, type TrajectoryRecorder } from './trajectory.js';
import { isWorkforceHandler } from './handler.js';
import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Import ensureFreshCloudApiToken and CloudApiTokenHorizonError to perform an initial token refresh check on startup.

Suggested change
import { startCloudApiTokenRefresher } from './cloud-api-token-refresh.js';
import { ensureFreshCloudApiToken, startCloudApiTokenRefresher, CloudApiTokenHorizonError } from './cloud-api-token-refresh.js';

Comment on lines +150 to +159
const tokenRefresher = process.env.CLOUD_API_REFRESH_URL
? startCloudApiTokenRefresher({
onHorizonElapsed: (err) =>
ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }),
onError: (err) =>
ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
error: err instanceof Error ? err.message : String(err)
})
})
: null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If the runner is cold-started and the token in the environment is already expired or near expiry, the background refresher (which ticks only after 60 seconds by default) will leave a 60-second window where the runner uses an expired token and fails with 401. Performing an initial refresh check immediately on startup ensures the token is fresh before any envelopes are processed. Additionally, guard both the initial refresh and the background refresher on both CLOUD_API_REFRESH_URL and CLOUD_API_REFRESH_TOKEN to avoid unnecessary work if the refresh token is missing.

  if (process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN) {
    try {
      await ensureFreshCloudApiToken();
    } catch (err) {
      if (err instanceof CloudApiTokenHorizonError) {
        ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message });
      } else {
        ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
          error: err instanceof Error ? err.message : String(err)
        });
      }
    }
  }

  const tokenRefresher = process.env.CLOUD_API_REFRESH_URL && process.env.CLOUD_API_REFRESH_TOKEN
    ? startCloudApiTokenRefresher({
        onHorizonElapsed: (err) =>
          ctx.log('warn', 'runner.cloud-api-token.horizon-elapsed', { error: err.message }),
        onError: (err) =>
          ctx.log('warn', 'runner.cloud-api-token.refresh-failed', {
            error: err instanceof Error ? err.message : String(err)
          })
      })
    : null;

// The sandbox holds one cloud API identity, so a process-wide guard is correct:
// concurrent posters await the SAME rotation rather than each spending the
// single-use refresh token (which would cascade-revoke the session).
let inFlightRefresh: Promise<string> | null = null;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a single global inFlightRefresh promise can cause cross-contamination if multiple runners or environments are executed in the same process (e.g., in concurrent test suites or multi-tenant environments). Using a WeakMap keyed by the env object isolates the in-flight refresh promises per environment.

Suggested change
let inFlightRefresh: Promise<string> | null = null;
let inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>();

Comment on lines +105 to +111
if (!inFlightRefresh) {
const fetchImpl = opts.fetchImpl ?? fetch;
inFlightRefresh = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => {
inFlightRefresh = null;
});
}
return inFlightRefresh;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the in-flight refresh resolution to use the new inFlightRefreshes WeakMap to ensure proper isolation per environment.

  let inFlight = inFlightRefreshes.get(env);
  if (!inFlight) {
    const fetchImpl = opts.fetchImpl ?? fetch;
    inFlight = refreshAndPersist(env, refreshUrl, refreshToken, fetchImpl).finally(() => {
      inFlightRefreshes.delete(env);
    });
    inFlightRefreshes.set(env, inFlight);
  }
  return inFlight;

Comment on lines +120 to +124
const response = await fetchImpl(refreshUrl, {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ refreshToken })
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fetch request to refresh the token does not have a timeout. If the auth server hangs or is extremely slow, this request will block indefinitely, preventing any concurrent or subsequent callers from proceeding. Adding a timeout using AbortSignal.timeout prevents this issue.

Suggested change
const response = await fetchImpl(refreshUrl, {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ refreshToken })
});
const response = await fetchImpl(refreshUrl, {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({ refreshToken }),
signal: AbortSignal.timeout(15000)
});

Comment on lines +233 to +235
export function resetInFlightRefreshForTests(): void {
inFlightRefresh = null;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the test helper to reset the inFlightRefreshes WeakMap instead of the single global variable.

Suggested change
export function resetInFlightRefreshForTests(): void {
inFlightRefresh = null;
}
export function resetInFlightRefreshForTests(): void {
inFlightRefreshes = new WeakMap<CloudApiTokenEnv, Promise<string>>();
}

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 967b8d6f74

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +91 to +94
if (accessToken && expiresAt) {
const expiresMs = Date.parse(expiresAt);
if (Number.isFinite(expiresMs) && expiresMs - now > skewMs) {
return accessToken;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep CLOUD_API_TOKEN in sync on fresh tokens

When the access token is still outside the skew window, this early return leaves CLOUD_API_TOKEN untouched. In runtimes where only CLOUD_API_ACCESS_TOKEN is injected, or where CLOUD_API_TOKEN still contains an older value, the runner's relayflows path that reads process.env.CLOUD_API_TOKEN continues to use a missing/stale bearer until the first refresh near expiry. Copy the fresh access token into the alias before returning, or include the alias in the freshness check.

Useful? React with 👍 / 👎.

schedule();
}

schedule();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Refresh once before waiting for the first interval

Starting the refresher only schedules the first check after intervalMs, so if the runner process starts with a token already expired or inside the skew window (for example after restarting in a long-lived sandbox), startRunner can dispatch an envelope and post to Slack during this initial 60s gap with the old bearer. Kick off tick() immediately, or refresh before the dispatch loop begins, so startup state is covered too.

Useful? React with 👍 / 👎.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@khaliqgant

Copy link
Copy Markdown
Member Author

Closing: placement disproven. The cloud#2307 401 is on /api/v1/slack/post-message, whose only caller is relayflows SlackCloudRelayClient (cloud-relay-runtime.ts:119), running sandbox-side in the relayflows WorkflowRunner = cloud bootstrap-inner.mjs — NOT the workforce startRunner process. Workforce personas emit Slack via VFS writeback (ctx.ts:176-181; persona slack-comms 'never by calling an API'); startRunner has no SlackCloudRelayClient/CLOUD_API_TOKEN caller, so refreshing env there reaches nothing. Half-2 moves to cloud bootstrap-inner.mjs (extend refreshExpiringCredentials():811 to refresh the cloud-API token within-skew, not pre-run-only). The refresh logic from this PR (cloud-api-token-refresh.ts) is portable to that home. Tracked with cloud#2313 (Half-1).

@khaliqgant khaliqgant closed this Jun 19, 2026
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

I made no file edits — the PR is mechanically clean (build, lint, typecheck all pass) and the logic is sound, so there was nothing to auto-fix. Here is my review.


Review: PR #246fix(runtime): refresh cloud API token in-process for long-running agents

Summary

Adds an in-process cloud API token refresher to packages/runtime so long-running proactive agents don't 401 after their ~2h bearer expires (cloud#2307). New module cloud-api-token-refresh.ts + tests, wired into startRunner with a background loop that rotates-and-persists the CLOUD_API_* env vars in place. The scope is tight: only packages/runtime is touched, and the new module is consumed only by runner.ts.

Verification (the way CI runs it)

Ran the canonical CI pipeline from .github/workflows/ci.yml (pnpm install --frozen-lockfilepnpm -r run buildpnpm run lintpnpm run typecheckpnpm run test):

  • Build: pass
  • Lint: pass
  • Typecheck: pass (including examples)
  • Test: packages/runtime 105/105 pass, including all 11 new token-refresh tests. All other packages pass except packages/cli (2 failures).

The 2 CLI test failures are environmental, not a regression

Both cli.test.js and launch-metadata.test.js fail at import with:

@relayburn/sdk: native binding not found for linux-x64.
Cannot find module '@relayburn/sdk-linux-x64-gnu'

This is a missing native optional-dependency in this sandbox, unrelated to the PR (which never touches packages/cli or @relayburn/sdk). On real CI with the prebuilt binding installed these pass. I did not edit anything to "fix" this — doing so would be out of scope and would risk weakening a test. Treat CI's CLI job as authoritative.

Correctness assessment (traced across consumers)

  • ctx.cloudApi.token is a live getter reading process.env.CLOUD_API_ACCESS_TOKEN on each access (packages/runtime/src/ctx.ts:235-236, ctx.ts:264), so in-place mutation is picked up — the design is sound.
  • The dual-write of both CLOUD_API_ACCESS_TOKEN (the ctx.cloudApi reader) and CLOUD_API_TOKEN (the relayflows adapter env fallback) at cloud-api-token-refresh.ts:146-147 correctly keeps both consumer paths in sync.
  • Fail-closed behavior is preserved: refresh failure throws and leaves env untouched (cloud-api-token-refresh.ts:126-149); 401 raises CloudApiTokenHorizonError and stops the loop rather than silently succeeding. No fail-open conversion.
  • The single-flight guard (inFlightRefresh) correctly prevents concurrent callers from double-spending the single-use rotating refresh token.
  • The runner.ts change wraps the existing dispatch loop in try/finally purely to stop() the timer this PR introduces; the dispatch loop body (envelope shim, continue, dispatch, end-log) is byte-for-byte unchanged aside from indentation. This is the PR's own resource teardown, not a change to pre-existing lifecycle/reaper/dispatch behavior.

Addressed comments

  • No human reviews, approvals, or bot comments were available to validate: gh CLI is not installed in this environment and .workforce/context.json contains no review/comment payload. I therefore could not enumerate or reproduce any prior threads. If review threads exist on GitHub, they were not provided to this run.

Advisory Notes

These are observations only — no code changed, and none are required for this PR:

  • cloud-api-token-refresh.ts:307 (inFlightRefresh) is module-global while ensureFreshCloudApiToken accepts a per-call env. If two different env bags were ever refreshed concurrently in one process, the later caller would await the earlier env's rotation. This is safe today (one identity per process / one process.env) and is documented in the file header; flagging only for future multi-tenant reuse.
  • The header comment (cloud-api-token-refresh.ts:16-19) describes the upstream /v1/tokens/agent provisioning of CLOUD_API_REFRESH_URL; the code itself just POSTs to whatever URL is configured. No action needed — noting that the comment's contract depends on how the env var is minted outside this repo.

Conclusion

The change is well-scoped, type-safe, lint-clean, and the runtime test suite (including new tests) passes in full. The only red CI signal I could reproduce locally is the pre-existing @relayburn/sdk native-binding gap in the CLI tests, which is environmental and unrelated to this PR.

I'm not printing READY: I could not confirm the live state of GitHub's required checks or mergeability from this environment (no gh, no check/review metadata), and the CLI test job's real-CI status is outside what I can verify here. A human should confirm the GitHub CI run is green (the CLI binding present) and that there are no outstanding review threads before merge.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

1 similar comment
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

pr-reviewer could not complete review for #246 in AgentWorkforce/workforce.
The review harness exited with code 1.
No review was posted; this needs operator attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant