
fix(pr-agent): always ignore caretaker/pr-readiness in CI eval (self-deadlock)#433

Merged
ianlintner merged 1 commit into main from fix/ignore-self-readiness-check on Apr 21, 2026
Conversation

@ianlintner
Owner

Live e2e on #431 surfaced a self-gating deadlock: caretaker's own caretaker/pr-readiness check is posted on every PR and is always pending/action_required while the state machine is still evaluating. evaluate_ci then kept returning PENDING, so the PR agent decided to wait — forever — and never handed the lint-failure task to the custom executor.

Adds _ALWAYS_IGNORED_CHECK_NAMES = {'caretaker/pr-readiness'} to evaluate_ci so every consumer inherits the fix without needing to touch ci.ignore_jobs. Also adds the explicit entry to the caretaker dogfood config for documentation.

Run that proved the bug: 24706775007 — logs show PR #431: ci_pending → ci_pending (action: wait) despite the lint check being in FAILURE.

Test plan

🤖 Generated with Claude Code

Surfaced during live custom-agent e2e on #431: caretaker sees the
lint failure, ranks the PR as ``ci_pending``, and waits — forever —
because its OWN ``caretaker/pr-readiness`` check is also on the
check-runs list and it's always either ``pending`` or
``action_required`` while the state machine is still deciding. The
state machine then refuses to act until all checks settle, which
requires it to act, which… deadlock.
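The self-gating loop above can be sketched in a few lines. This is a hypothetical reconstruction of the pre-fix behavior, not the actual ``evaluate_ci`` body: pending/action_required checks are evaluated before failures, so the agent's own always-pending check masks the lint failure.

```python
from enum import Enum

class CIStatus(Enum):
    PASSING = "passing"
    FAILING = "failing"
    PENDING = "pending"

def evaluate_ci_buggy(check_runs: dict[str, str]) -> CIStatus:
    """Hypothetical pre-fix logic: any unsettled check wins."""
    states = list(check_runs.values())
    # caretaker/pr-readiness is always in one of these two states
    # while the state machine is still deciding, so this branch
    # always fires on caretaker's own PRs.
    if any(s in ("pending", "action_required") for s in states):
        return CIStatus.PENDING
    if any(s == "failure" for s in states):
        return CIStatus.FAILING
    return CIStatus.PASSING

checks = {"lint": "failure", "caretaker/pr-readiness": "pending"}
# The lint failure is real, but the self-check keeps the result PENDING,
# so the agent waits for checks to settle — which requires it to act.
```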

Fix in two places:

1. ``src/caretaker/pr_agent/states.py`` — ``evaluate_ci`` now merges
   the caller's ``ignore_jobs`` with a hard-coded
   ``_ALWAYS_IGNORED_CHECK_NAMES`` set that includes
   ``caretaker/pr-readiness``. Every downstream consumer inherits
   the fix without needing to touch their config.
2. ``.github/maintainer/config.yml`` — explicit ignore entry with a
   comment so the dogfood config demonstrates the pattern for
   anyone copying it.

Diagnostic trail:

- Run 24706775007 on PR #431 (deliberate E501 for the e2e):
  ``PR #431: ci_pending → ci_pending (action: wait)`` — FoundryExecutor
  was ready but never dispatched because evaluate_ci returned
  PENDING on the pr-readiness check.
- After fix: evaluate_ci sees only the upstream checks (``lint`` in
  FAILURE); CIStatus becomes FAILING; PR agent builds a
  ``CopilotTask(LINT_FAILURE)`` and hands it to the dispatcher;
  dispatcher finds LINT_FAILURE in the allowlist and routes to
  Foundry.
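The post-fix dispatch step can be sketched as follows. Only ``CopilotTask``, ``LINT_FAILURE``, and the allowlist idea appear in this PR; the shapes and the ``route`` signature here are illustrative guesses:

```python
from dataclasses import dataclass

LINT_FAILURE = "lint_failure"
# Hypothetical allowlist of task types the Foundry executor may handle.
FOUNDRY_ALLOWLIST = {LINT_FAILURE}

@dataclass
class CopilotTask:
    task_type: str

def route(task: CopilotTask, foundry_eligible: bool) -> str:
    """Route to Foundry when eligible and allowlisted; else Copilot."""
    if foundry_eligible and task.task_type in FOUNDRY_ALLOWLIST:
        return "foundry"
    return "copilot"
```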

Full pytest suite still green (907 passed).
@ianlintner ianlintner merged commit dccba36 into main Apr 21, 2026
9 of 11 checks passed
@ianlintner ianlintner deleted the fix/ignore-self-readiness-check branch April 21, 2026 06:10
ianlintner added a commit that referenced this pull request Apr 21, 2026
…f custom executor) (#431)

* docs(readme): add Fleet registry + custom coding agent sections (e2e test)

Also introduces a deliberate ruff E501 violation in
``src/caretaker/fleet/api.py`` (well outside the code-path and marked
with a big ``DELIBERATE E2E TEST`` comment) so we can observe the new
custom coding agent end-to-end on this PR:

1. CI fails on ``ruff check`` with the expected E501.
2. caretaker's PR agent sees the lint failure and constructs a
   ``CopilotTask(task_type=LINT_FAILURE)``.
3. ``ExecutorDispatcher.route()`` picks ``provider: auto`` + Foundry
   eligible + same-repo → dispatches to ``FoundryExecutor.run()``.
4. Foundry fixes the E501 (reformat / wrap / remove the comment),
   commits, pushes. CI re-runs green.
5. PR reaches merge-ready.

If anything in that loop breaks, the fix lands as a follow-up PR —
the deliberate violation can always be cleaned up by a human commit
if the agent doesn't finish.

README additions summarise the real shipped features (fleet
registry, custom coding agent, routing labels) so the front page
reflects post-sprint state.

* fix(workflow): install llm-multi extra so Foundry executor is reachable

Discovered during the live e2e test on #431: caretaker's own
workflow logs showed

    executor.foundry.enabled=True but LiteLLM provider is
    unavailable (missing credentials or package). Routing stays
    on Copilot.

The credentials are present — ``ANTHROPIC_API_KEY`` and
``AZURE_AI_API_KEY`` are both set as repo secrets — but the
``litellm`` package itself is only pulled in by the optional
``llm-multi`` extras group, and the install step ran
``pip install .``.

Fixes:

- ``.github/workflows/maintainer.yml`` (caretaker dogfood) now
  installs ``".[llm-multi]"`` so LiteLLM is present in the runner's
  Python env. Without this, every dispatch cascades through
  ``LiteLLMProvider.available == False`` and falls back to Copilot,
  defeating the whole Foundry routing path.
- ``setup-templates/templates/workflows/maintainer.yml`` (consumer
  template) installs ``litellm`` as a second pip step. We can't use
  the ``[llm-multi]`` extras form in the git-URL install because
  that spec is name-sensitive and caretaker was renamed from
  ``caretaker`` to ``caretaker-github`` at v0.8.1; a bare URL + a
  separate ``pip install litellm`` works across the rename boundary.
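The availability gate described above amounts to roughly this check. The function name is illustrative; only the enabled-but-package-missing fallback behavior comes from the log excerpt:

```python
import importlib.util

def foundry_reachable(enabled: bool) -> bool:
    """Foundry routing is usable only if the executor is enabled in
    config AND the optional litellm package is actually importable.
    Without the llm-multi extra installed, this is always False and
    every dispatch falls back to Copilot."""
    litellm_installed = importlib.util.find_spec("litellm") is not None
    return enabled and litellm_installed
```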

Once these changes land and the next caretaker run fires, the
Foundry executor should actually attempt the LINT_FAILURE fix on
#431 instead of routing straight to Copilot.

* revert(fleet): remove deliberate E501 — e2e experiment concluded, custom-agent wiring verified via #432/#433/#434
