feat(scripts): scheduled factory sync-fidelity canary (cron/launchd) #382

Closed
khaliqgant wants to merge 1 commit into main from fix/factory-canary-cron

@khaliqgant khaliqgant commented Jun 18, 2026

Member

Adds a scheduled canary that catches Linear sync-fidelity regressions before they silently block factory dispatch — the regression-prevention control for the class of bug fixed in cloud#2284 (Nango sync dropped state.id/team/labels) and factory#10 (factory read tolerance + the factory canary command).

What it does

scripts/factory-canary.sh runs factory canary <issue> against the live relayfile mount and asserts a known "Ready for Agent" issue is still classified dispatch-ready by the real dry-run triage path. On failure it exits non-zero and (optionally) posts a Slack alert.

  • scripts/factory-canary.sh — runs from the pear root (reuses the running Pear broker), bounded by FACTORY_CANARY_TIMEOUT, alerts via FACTORY_CANARY_SLACK_WEBHOOK.
  • scripts/com.agentrelay.factory-canary.plist — every-6h launchd template with install steps.
  • scripts/README.md — usage + the CI-vs-cron rationale.

Why cron, not package CI

A live canary needs the operator's workspace creds + the relayfile mount, so it belongs in a scheduled job in the operator environment — not in the factory package's CI. The canary logic itself is already covered by the factory unit suite (the stub-primary golden test runs on every PR in factory#10).

Dependency

Requires a factory build with the canary command (factory#10+). Until that publishes, point FACTORY_BIN at a local build (the plist has a slot for it).

🤖 Generated with Claude Code

Adds factory-canary.sh + a launchd template that run `factory canary <issue>`
against the live relayfile mount on a schedule and assert a known
"Ready for Agent" issue is still dispatch-ready by the real triage path. This
is the regression detector for Linear sync-fidelity drift (sparse records /
stub primaries — cloud#2284 / factory#10): on failure it exits non-zero and
optionally posts a Slack alert, catching the regression before it silently
blocks every factory dispatch.

- factory-canary.sh: runs the canary from the pear root (reuses the Pear
  broker), bounded by FACTORY_CANARY_TIMEOUT, alerts via FACTORY_CANARY_SLACK_WEBHOOK.
- com.agentrelay.factory-canary.plist: every-6h launchd template.
- scripts/README.md: usage + the CI-vs-cron rationale (live canary needs the
  operator workspace; the canary logic itself is covered by factory unit CI).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai Bot commented Jun 18, 2026

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 16 minutes and 59 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 702b37a3-1633-4dc6-b56d-4d4d6cfc7764

📥 Commits

Reviewing files that changed from the base of the PR and between 6d52cdd and ca707ff.

📒 Files selected for processing (3)
  • scripts/README.md
  • scripts/com.agentrelay.factory-canary.plist
  • scripts/factory-canary.sh

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a sync-fidelity canary regression detector (factory-canary.sh), its documentation, and a launchd plist template for macOS scheduling. The review feedback focuses on improving the robustness of the bash script, including ensuring the script exits if directory navigation fails, adding macOS gtimeout support while preserving stderr, handling timeouts and empty responses explicitly in failure alerts, and allowing curl errors to be logged to stderr.

Comment thread scripts/factory-canary.sh
Comment on lines +35 to +36
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "$ROOT"

medium

Ensure that the script exits immediately if changing directory to the repository root fails, preventing subsequent commands from running in an incorrect or unexpected directory.

Suggested change
- ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
- cd "$ROOT"
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+ cd "$ROOT" || exit 1

Comment thread scripts/factory-canary.sh
Comment on lines +53 to +57
if command -v timeout >/dev/null 2>&1; then
  OUT="$(timeout "$TIMEOUT" "${RUN[@]}" 2>/dev/null)"
else
  OUT="$("${RUN[@]}" 2>/dev/null)"
fi

medium

On macOS, timeout is not installed by default, but gtimeout (from coreutils via Homebrew) is commonly available. Adding support for gtimeout ensures timeout protection works on macOS. Additionally, removing 2>/dev/null allows stderr to flow to the log file, which is critical for diagnosing why the canary failed.

Suggested change
- if command -v timeout >/dev/null 2>&1; then
-   OUT="$(timeout "$TIMEOUT" "${RUN[@]}" 2>/dev/null)"
- else
-   OUT="$("${RUN[@]}" 2>/dev/null)"
- fi
+ if command -v timeout >/dev/null 2>&1; then
+   OUT="$(timeout "$TIMEOUT" "${RUN[@]}")"
+ elif command -v gtimeout >/dev/null 2>&1; then
+   OUT="$(gtimeout "$TIMEOUT" "${RUN[@]}")"
+ else
+   OUT="$("${RUN[@]}")"
+ fi
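
Note that the final else branch in this suggestion still runs the command with no bound when neither timeout nor gtimeout is installed. A minimal sketch of one way to keep a bound in that case, assuming a hypothetical helper named run_bounded (it is not part of the PR) and a plain background watchdog as the last resort:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): run a command with a time bound.
# Usage: run_bounded <seconds> <command> [args...]
run_bounded() {
  local secs="$1"; shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
    return $?
  elif command -v gtimeout >/dev/null 2>&1; then
    gtimeout "$secs" "$@"
    return $?
  fi
  # Fallback: run the command in the background and kill it from a watchdog.
  "$@" &
  local pid=$!
  # Detach the watchdog's output so a caller's $(...) does not wait on its sleep.
  ( sleep "$secs"; kill -TERM "$pid" 2>/dev/null ) >/dev/null 2>&1 &
  local watchdog=$!
  local code=0
  wait "$pid" || code=$?
  kill "$watchdog" 2>/dev/null || true
  wait "$watchdog" 2>/dev/null || true
  # A TERM-killed child reports 143 (128+15); report 124 as GNU timeout does.
  if [ "$code" -eq 143 ]; then
    code=124
  fi
  return "$code"
}
```

With a helper like this, the script could call OUT="$(run_bounded "$TIMEOUT" "${RUN[@]}")" in a single place, and the exit-code-124 handling discussed elsewhere in this thread would apply on hosts without coreutils as well.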

Comment thread scripts/factory-canary.sh
Comment on lines +71 to +74
# Failure: alert.
REASON="$(printf '%s' "$VERDICT" | node -e 'let s="";process.stdin.on("data",d=>s+=d).on("end",()=>{try{const v=JSON.parse(s);console.log(`${v.status||"error"}: ${v.reason||"unknown"}`)}catch{console.log("unparseable verdict")}})' 2>/dev/null)"
MSG=":rotating_light: factory canary FAILED for ${ISSUE} — ${REASON}. Sync fidelity may have regressed (issue no longer dispatch-ready). See AgentWorkforce/cloud#2284 / factory#10."
echo "[$TS] $MSG" >&2

medium

If the command times out (exit code 124) or fails without outputting JSON, the parsed REASON will be "unparseable verdict". Handling these cases explicitly provides much more actionable Slack alerts.

Suggested change
- # Failure: alert.
- REASON="$(printf '%s' "$VERDICT" | node -e 'let s="";process.stdin.on("data",d=>s+=d).on("end",()=>{try{const v=JSON.parse(s);console.log(`${v.status||"error"}: ${v.reason||"unknown"}`)}catch{console.log("unparseable verdict")}})' 2>/dev/null)"
- MSG=":rotating_light: factory canary FAILED for ${ISSUE} — ${REASON}. Sync fidelity may have regressed (issue no longer dispatch-ready). See AgentWorkforce/cloud#2284 / factory#10."
- echo "[$TS] $MSG" >&2
+ # Failure: alert.
+ if [[ $CODE -eq 124 ]]; then
+   REASON="timeout after ${TIMEOUT}s (broker/mount may be wedged)"
+ elif [[ -z "$VERDICT" ]]; then
+   REASON="empty response (factory command failed)"
+ else
+   REASON="$(printf '%s' "$VERDICT" | node -e 'let s="";process.stdin.on("data",d=>s+=d).on("end",()=>{try{const v=JSON.parse(s);console.log(`${v.status||"error"}: ${v.reason||"unknown"}`)}catch{console.log("unparseable verdict")}})' 2>/dev/null)"
+ fi
+ MSG=":rotating_light: factory canary FAILED for ${ISSUE} — ${REASON}. Sync fidelity may have regressed (issue no longer dispatch-ready). See AgentWorkforce/cloud#2284 / factory#10."
+ echo "[$TS] $MSG" >&2
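
The three-way classification in this suggestion can be pulled into a small function so it is easy to exercise on its own. The sketch below assumes a hypothetical function name, canary_reason, and takes the exit code, timeout, and verdict as arguments instead of reading the script's globals. The JSON branch reuses the verdict fields (status, reason) named in the script comments and falls back to "unparseable verdict" when node is unavailable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): turn exit code + verdict into a reason.
# Usage: canary_reason <exit code> <timeout seconds> <verdict JSON or empty>
canary_reason() {
  local code="$1" timeout_s="$2" verdict="$3"
  if [ "$code" -eq 124 ]; then
    # 124 is the status GNU timeout returns when the time limit is hit.
    printf 'timeout after %ss (broker/mount may be wedged)\n' "$timeout_s"
  elif [ -z "$verdict" ]; then
    printf 'empty response (factory command failed)\n'
  elif command -v node >/dev/null 2>&1; then
    # Same parsing approach as the script: read stdin, parse, print status: reason.
    printf '%s' "$verdict" | node -e '
      let s = "";
      process.stdin.on("data", d => s += d).on("end", () => {
        try {
          const v = JSON.parse(s);
          console.log(`${v.status || "error"}: ${v.reason || "unknown"}`);
        } catch {
          console.log("unparseable verdict");
        }
      });'
  else
    printf 'unparseable verdict\n'
  fi
}
```

For example, canary_reason 124 180 '' yields the timeout message, and canary_reason 1 180 '' yields the empty-response message, so the Slack alert distinguishes a wedged broker from a crashed command.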

Comment thread scripts/factory-canary.sh
Comment on lines +76 to +82
if [[ -n "${FACTORY_CANARY_SLACK_WEBHOOK:-}" ]]; then
  curl -sS -m 15 -X POST -H 'Content-type: application/json' \
    --data "$(node -e 'process.stdout.write(JSON.stringify({text:process.argv[1]}))' "$MSG")" \
    "$FACTORY_CANARY_SLACK_WEBHOOK" >/dev/null 2>&1 \
    && echo "[$TS] factory-canary: posted Slack alert" >&2 \
    || echo "[$TS] factory-canary: Slack alert post failed" >&2
fi

medium

By redirecting both stdout and stderr of curl to /dev/null, any network or webhook failures are completely silenced. Removing 2>&1 allows curl -sS to print error messages to stderr on failure, which will be captured in the log file.

Suggested change
- if [[ -n "${FACTORY_CANARY_SLACK_WEBHOOK:-}" ]]; then
-   curl -sS -m 15 -X POST -H 'Content-type: application/json' \
-     --data "$(node -e 'process.stdout.write(JSON.stringify({text:process.argv[1]}))' "$MSG")" \
-     "$FACTORY_CANARY_SLACK_WEBHOOK" >/dev/null 2>&1 \
-     && echo "[$TS] factory-canary: posted Slack alert" >&2 \
-     || echo "[$TS] factory-canary: Slack alert post failed" >&2
- fi
+ if [[ -n "${FACTORY_CANARY_SLACK_WEBHOOK:-}" ]]; then
+   curl -sS -m 15 -X POST -H 'Content-type: application/json' \
+     --data "$(node -e 'process.stdout.write(JSON.stringify({text:process.argv[1]}))' "$MSG")" \
+     "$FACTORY_CANARY_SLACK_WEBHOOK" >/dev/null \
+     && echo "[$TS] factory-canary: posted Slack alert" >&2 \
+     || echo "[$TS] factory-canary: Slack alert post failed" >&2
+ fi
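
The same idea can be written as an if/else so that curl's own error text and the script's status line both reach stderr. This is a rough sketch only: post_alert is a hypothetical function name, and it takes the webhook URL and an already-encoded JSON payload as arguments instead of reading FACTORY_CANARY_SLACK_WEBHOOK and calling node.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): post a JSON payload to a webhook.
# Usage: post_alert <webhook url> <json payload>
post_alert() {
  local url="$1" payload="$2"
  # -sS silences the progress meter but still prints errors to stderr.
  # Only stdout (the response body) is discarded, so failures stay visible.
  if curl -sS -m 15 -X POST -H 'Content-type: application/json' \
      --data "$payload" "$url" >/dev/null; then
    echo "factory-canary: posted Slack alert" >&2
  else
    echo "factory-canary: Slack alert post failed" >&2
    return 1
  fi
}
```

One thing this does not change: without curl's -f/--fail flag, an HTTP error response from the webhook still counts as success. Whether to add that flag is a separate choice that the PR does not make.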

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca707ff5c8

Comment thread scripts/factory-canary.sh
# prints a JSON verdict {ok,issue,status,reason}; exit code mirrors ok. A hung
# run (e.g. broker/mount wedge) is bounded by FACTORY_CANARY_TIMEOUT.
TIMEOUT="${FACTORY_CANARY_TIMEOUT:-180}"
RUN=(node "$BIN" factory canary "$ISSUE" --config "$CONFIG" --backend internal)

P2: Remove the extra factory argument

Because package-lock.json registers bin/factory.mjs itself as the factory executable (bin.factory = "bin/factory.mjs"), invoking node "$BIN" factory canary ... passes an extra positional and runs the equivalent of factory factory canary ... instead of the documented factory canary .... With a normal subcommand CLI this makes every scheduled canary fail as an unknown command and can generate false failure alerts rather than checking the issue.
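
Until the real CLI's argument shape is confirmed, one option is to build the argv in one place and make the leading token opt-in. A sketch under stated assumptions: build_canary_argv and the FACTORY_CANARY_PREFIX variable are hypothetical (neither exists in the PR), and the flags are copied from the RUN line quoted above without verifying them against the factory CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): print the canary argv, one item per line.
# Usage: build_canary_argv <factory bin> <issue> <config path>
build_canary_argv() {
  local bin="$1" issue="$2" config="$3"
  local -a run=(node "$bin")
  # FACTORY_CANARY_PREFIX is an assumed, illustrative variable: set it to
  # "factory" only if the CLI turns out to expect the leading token.
  if [ -n "${FACTORY_CANARY_PREFIX:-}" ]; then
    run+=("$FACTORY_CANARY_PREFIX")
  fi
  run+=(canary "$issue" --config "$config" --backend internal)
  printf '%s\n' "${run[@]}"
}
```

Printing the argv first (for example build_canary_argv "$BIN" AR-305 "$CONFIG") makes a one-off manual smoke run against a local factory build straightforward before the scheduled job depends on it.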

Comment thread scripts/factory-canary.sh
Comment on lines +53 to +56
if command -v timeout >/dev/null 2>&1; then
  OUT="$(timeout "$TIMEOUT" "${RUN[@]}" 2>/dev/null)"
else
  OUT="$("${RUN[@]}" 2>/dev/null)"

P2: Preserve the timeout on launchd hosts

In the macOS launchd setup added with this script, stock systems do not provide GNU timeout, so this fallback runs the factory process with no bound at all even though FACTORY_CANARY_TIMEOUT is meant to catch broker/mount wedges. If the live mount or broker hangs, the launchd job can remain stuck indefinitely and stop producing the failure alert the canary is supposed to provide.

@agent-relay-code

Contributor

All references are consistent. The plist Label, filename, README reference, and launchctl commands all use com.agentrelay.factory-canary correctly.

I made no code edits. Everything in the diff is correct or unverifiable-but-not-fixable mechanically. No edits were left in the working tree.

Summary

PR #382 adds a scheduled "factory sync-fidelity canary": a standalone bash script (scripts/factory-canary.sh), a launchd template (scripts/com.agentrelay.factory-canary.plist), and README docs. It's a benign operational monitoring tool — runs factory canary <issue> periodically and posts a Slack alert if a known issue stops being dispatch-ready.

Impact on CI: None. The PR touches only *.sh, *.plist, and *.md. CI (ci.yml) runs eslint (scoped to src/**/*.{ts,tsx} and scripts/**/*.mjs), typecheck, vitest, and build — none of which inspect these files. No TypeScript, callers, types, tests, or config are affected. Build/test/typecheck remain unaffected.

What I verified in the current checkout:

  • bash -n syntax: clean. Executable bit set (matches 100755).
  • CODE=$? correctly captures the factory/timeout exit code through the if/else command-substitution block (reproduced: exit 7 → CODE=7).
  • timeout → exit 124 path handled; falls through to the failure/alert branch correctly.
  • Slack payload node -e 'JSON.stringify({text:process.argv[1]})' "$MSG" safely JSON-encodes the message (no injection); argv[1] indexing verified.
  • Verdict-reason parser reads stdin, parses JSON, and falls back to "unparseable verdict" on error — verified working (initial failures were sandbox stdin races, not a script defect).
  • plist Label / filename / README references are all consistent.
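
The exit-code check in the list above can be reproduced in isolation. The sketch below is a stand-in, not the script itself: capture_code is a hypothetical name, and a sh -c command that prints a line and exits 7 takes the place of the factory invocation.

```shell
#!/usr/bin/env bash
# Reproduction sketch: $? after an if/else block is the status of the last
# command run inside it, here the assignment's command substitution.
capture_code() {
  local out code   # declared up front so `local` does not overwrite $?
  if command -v timeout >/dev/null 2>&1; then
    out="$(timeout 5 sh -c 'echo verdict; exit 7')"
  else
    out="$(sh -c 'echo verdict; exit 7')"
  fi
  code=$?
  printf '%s %s\n' "$out" "$code"
}

capture_code   # prints: verdict 7
```

The pattern relies on the script not running under set -e, since a failing assignment would otherwise abort before CODE=$? is reached.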

No auto-edits made. Nothing was mechanically wrong (no lint/format/typo/import issues in the linted file set; the .sh/.plist/.md files aren't linted). Nothing left in the working tree.

Addressed comments

  • No bot or human reviewer comments were present in the harness metadata (.workforce/context.json contains only PR metadata; no review threads). Nothing to address.

Advisory Notes

  • scripts/factory-canary.sh:52: RUN=(node "$BIN" factory canary "$ISSUE" ...). $BIN is factory.mjs, so the CLI receives factory canary AR-305 as positional args. The README and script comments describe the command as "factory canary <issue>", but since the binary already is factory, the CLI may expect canary AR-305 (without the leading factory token). This depends entirely on the argument parser in the not-yet-published factory canary command (factory#10+), which isn't installed here, so I could not reproduce or confirm. Worth a quick check against the factory#10 CLI before relying on the cron job: if the parser doesn't accept a leading factory subcommand, every canary run will error out (exit 1) and fire false Slack alerts. I left the code unchanged because this is a behavior question requiring the real CLI to verify, not a mechanical fix.
  • --backend internal flag is passed unconditionally; confirm factory#10's canary accepts it. Same unverifiable-dependency caveat as above.

The PR is documentation + an operator-only scheduled script that does not run in CI. Whether it's mergeable / all required checks are green is a post-harness determination I can't make from here, and the one open item (the factory-token invocation) needs a human with the factory#10 build to confirm. I'm not printing READY.

@agent-relay-code

Contributor

ℹ️ pr-reviewer: review only — no file changes were applied to the PR (nothing to commit after review). The notes below are advisory and were not pushed.

Review: PR #382: feat(scripts): scheduled factory sync-fidelity canary (cron/launchd)

Summary

This PR is purely additive operational tooling: a new bash script (scripts/factory-canary.sh), a launchd template (scripts/com.agentrelay.factory-canary.plist), and a README section. No existing source, types, tests, or config are modified, and nothing in the repo imports or executes these files (verified via grep — only self-references). The canary runs factory canary <issue> against the live relayfile mount on an operator machine and alerts on sync-fidelity regressions.

Verification performed

  • bash -n scripts/factory-canary.sh → syntax OK.
  • CI workflow (.github/workflows/ci.yml) runs verify:mcp-resources-drift, lint (eslint .), typecheck:web, typecheck:node, test, vitest run, build, plus Playwright and mac packaging. None of these touch .sh / .plist / .md, and the new files are imported by nothing, so this PR cannot affect any CI step. No shellcheck stage exists in CI.
  • Reproduced the exit-code propagation: CODE=$? after the if … OUT="$(…)" … fi block correctly captures the inner command-substitution exit code (tested: exit 7 → CODE=7). Fail-closed behavior is preserved — non-zero/timeout (124) falls through to the alert path and exit 1; only CODE -eq 0 exits 0.
  • node_modules/@agent-relay/factory is not installed in this checkout, so the exact factory canary argv shape could not be verified against the real binary. The script itself documents the dependency may be unpublished (Until that publishes, point FACTORY_BIN at a local build), so this is a known forward-looking design, not a checkout-verifiable defect.

Findings

No blocking issues. The script's safety-relevant logic (exit codes, timeout handling, JSON verdict parsing) is correct and fail-closed. I made no edits — there was no mechanical lint/format/typo fix to apply, and I will not modify the operational logic.

Advisory Notes

These are out of scope for a no-edit mechanical pass and/or unverifiable in this checkout; recorded for human consideration, code left unchanged:

  • Command argv shape unverifiable here. RUN=(node "$BIN" factory canary "$ISSUE" …) passes a leading factory token before canary. Whether the CLI expects factory canary <issue> or just canary <issue> could not be confirmed because @agent-relay/factory isn't installed. Worth a one-line manual smoke (node $FACTORY_BIN factory canary AR-305 …) before relying on the scheduled job.
  • OUT swallows stderr (2>/dev/null). On real failures the verdict line may be empty, yielding unparseable verdict in the Slack message. Functional (still fail-closed), but a human may prefer capturing stderr to a log for diagnosis.

Addressed comments

  • No bot or human review comments are present. .workforce/context.json contains no comments, and there is no comments file in .workforce/. Nothing to address.

The PR is low-risk additive tooling with correct fail-closed logic and no CI impact, but the live factory canary argv shape remains unverified in this sandbox (dependency not installed) — a human should confirm that one smoke before relying on the scheduled canary. I am not printing READY because the canary's core invocation could not be reproduced against the real binary here, so a meaningful piece of verification still requires human confirmation rather than being genuinely complete-and-passing.

@khaliqgant

Member Author

Superseded by AgentWorkforce/factory#11 — the canary scheduling tooling (factory-canary.sh + launchd template) now lives in the factory package, next to the factory canary command.

@khaliqgant khaliqgant closed this Jun 18, 2026