chore: bump deps + adopt pydantic-ai 1.79 patterns #13
Merged
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…processors
Applies the approved adoption plan from the 1.77→1.79 docs review:
- Move summarize_tool_results to the native `history_processors=` Agent
parameter instead of wrapping it in the HistoryProcessor capability —
matches every example in the restructured upstream docs.
- Add `Hooks` capability with `on_tool_execute_error` as a safety net for
unanticipated exceptions in tool bodies (anticipated Spotify/DB errors
are still handled inside the tools themselves). Recovery payload lets
the agent respond conversationally instead of crashing the SSE stream.
- Add `AGENT_REASONING_EFFORT` config knob wiring reasoning through
`OpenAIResponsesModelSettings`. Defaults to None (no behavior change);
enable via env var and validate against the eval suite before shipping.
- Scaffold pydantic-evals at backend/tests/evals/:
* `test_prepare_tools.py` — 8 deterministic TestModel-backed tests
verifying runtime tool filtering for Playlist/Classification caps.
* `test_brain_agent_evals.py` — 7 real-model tool-routing cases gated
behind `@pytest.mark.eval`, deselected from the default pytest run.
- Add `.claude/rules/backend/pydantic-ai.md` — flags that llms-full.txt
is now reference-only (as of the 1.79 refresh) and guides are at the
web docs site, so future Claude runs don't grep for walkthroughs that
have been moved out of the local snapshot.
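The safety-net idea in the `Hooks` bullet can be sketched without pydantic-ai. This is a minimal stand-in, not the code in `hooks.py`: the helper names `build_recovery_payload` and `run_tool_safely` are hypothetical, and every payload field other than `"error": "tool_failed"` is an illustrative assumption.

```python
from typing import Any, Callable


def build_recovery_payload(tool_name: str, exc: BaseException) -> dict[str, Any]:
    """Turn an unexpected tool exception into a JSON-safe dict the agent can read.

    Hypothetical helper: only the "error": "tool_failed" key comes from the PR
    text; the remaining fields are assumptions made for illustration.
    """
    return {
        "error": "tool_failed",
        "tool": tool_name,
        "exception_type": type(exc).__name__,
        # Truncate so a very long exception message cannot flood the model context.
        "message": str(exc)[:200],
    }


def run_tool_safely(tool_name: str, tool: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Stand-in for an error hook: call the tool, return a payload if it raises."""
    try:
        return tool(*args, **kwargs)
    except Exception as exc:  # safety net for unanticipated errors only
        return build_recovery_payload(tool_name, exc)
```

Returning a payload instead of re-raising is what keeps the stream alive: the model sees a structured failure and can answer in prose. Anticipated errors would still be handled inside each tool, as the bullet says.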
Test plan: `pytest` → 59 passed, 1 deselected. Pre-commit (ruff, mypy,
format) clean. Eval suite runs with `pytest -m eval` and requires a real
OPENAI_API_KEY — intentionally kept out of CI default.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
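The "defaults to None (no behavior change)" contract of `AGENT_REASONING_EFFORT` can be illustrated with a stdlib-only sketch. The function names are hypothetical, and a plain dict stands in for `OpenAIResponsesModelSettings`. Reading the raw environment and lower-casing the value are assumptions here; the real knob is defined in the project's config module.

```python
import os
from typing import Literal, Mapping, Optional, cast

ReasoningEffort = Literal["low", "medium", "high"]
_ALLOWED = ("low", "medium", "high")


def read_reasoning_effort(env: Optional[Mapping[str, str]] = None) -> Optional[ReasoningEffort]:
    """Return the configured effort, or None when the variable is unset or empty."""
    source = os.environ if env is None else env
    raw = source.get("AGENT_REASONING_EFFORT")
    if raw is None or not raw.strip():
        return None
    value = raw.strip().lower()  # assumption: accept "HIGH" as "high"
    if value not in _ALLOWED:
        raise ValueError(f"AGENT_REASONING_EFFORT must be one of {_ALLOWED}, got {raw!r}")
    return cast(ReasoningEffort, value)


def build_model_settings(effort: Optional[ReasoningEffort]) -> Optional[dict[str, str]]:
    """Build settings only when an effort is set, so the default path is unchanged."""
    if effort is None:
        return None
    # Plain-dict stand-in for OpenAIResponsesModelSettings(openai_reasoning_effort=...).
    return {"openai_reasoning_effort": effort}
```

With the variable unset, `build_model_settings(read_reasoning_effort({}))` yields `None`, so nothing extra would be passed to the agent.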
- hooks.py: switch to the public Hooks(tool_execute_error=...) constructor form; drops the decorator indirection for a single hook and removes the private `_get()` access from tests.
- test_brain_agent_hooks.py: replace _FakeCall/_FakeToolDef stubs with real ToolCallPart / ToolDefinition instances; tests now break if pydantic-ai renames those fields instead of silently drifting.
- tests/conftest.py: move the dummy OPENAI_API_KEY setup to the session-level root conftest so it runs once before any test imports brain_agent. Eval conftest no longer needs noqa: E402 gymnastics.
- tests/evals/conftest.py: add fake_spotify_client() and fake_eeg_model() helpers using MagicMock(spec=...) so intent is explicit and type-checked.
- test_prepare_tools.py: use the new MagicMock(spec=...) helpers; stops hinting that the capabilities "check truthiness" (they check `is None`).
- test_brain_agent_evals.py: BrainAgentInput now carries with_spotify so build_relaxation_playlist_needs_confirmation can run with a fake Spotify client. Previously it was passing for the wrong reason (agent refusing because Spotify was disconnected, not because it was awaiting confirm).
- brain_agent.py: drop the explicit OpenAIResponsesModelSettings | None annotation; mypy infers it.
- DEVELOPMENT.md: document the -m eval two-tier test workflow under Tests.
pytest → 57 passed, 1 deselected. pre-commit clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
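The `MagicMock(spec=...)` point and the `is None` check can be shown together in a small stdlib sketch. `SpotifyClient`, `offered_tools`, and the tool names are hypothetical stand-ins for the real client class and capability logic; only the helper name `fake_spotify_client` comes from the commit message, and its body here is an assumption.

```python
from unittest.mock import MagicMock


class SpotifyClient:
    """Hypothetical stand-in for the real Spotify client class."""

    def search(self, query: str) -> list[str]:
        raise NotImplementedError


def fake_spotify_client() -> MagicMock:
    """A mock limited to SpotifyClient's real attributes, so typos fail loudly."""
    return MagicMock(spec=SpotifyClient)


def offered_tools(all_tools: list[str], spotify: object) -> list[str]:
    """Hide playlist tools only when no client is configured.

    The check is `is None`, not truthiness: any non-None client keeps the
    playlist tools visible, even one that happens to evaluate as falsy.
    """
    if spotify is None:
        return [name for name in all_tools if not name.startswith("playlist_")]
    return list(all_tools)
```

A spec'd mock passes `isinstance(fake, SpotifyClient)` and raises `AttributeError` for attributes the class does not define, which is the sense in which the helpers make test intent explicit.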
The original cut was ~63 lines and mostly restated facts already in CLAUDE.md or trivially grep-able from the code (capability class names, sdk_version=6, logfire wiring, SUMMARIZABLE_TOOLS location). Those bullets invite staleness: rename a capability, bump sdk_version, or add a fifth capability and the rule rots silently.
Keep only the two things that aren't derivable from grepping:
1. The split between llms-full.txt (reference) and ai.pydantic.dev/ (guides), the one fact that saves future sessions from fruitlessly grepping for walkthroughs that have been moved upstream.
2. The policy that tool error handling should flow through the hook rather than in-tool try/except, a behavioral rule, not a code fact.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `history_processors=` parameter, `Hooks` capability for tool-failure recovery, and a `pydantic_evals` scaffold for tool-routing regression tests.
- `AGENT_REASONING_EFFORT` config knob (default off) wiring `OpenAIResponsesModelSettings.openai_reasoning_effort` through the agent, ready to experiment behind the new eval suite.
Changes
Dependency bumps (a82dd54)
- `pydantic-ai` 1.77→1.79, `ruff` 0.15.9→0.15.10, plus transitive refreshes (anthropic 0.91→0.93, openai 2.30→2.31, huggingface-hub 1.9→1.10, temporalio 1.24→1.25, etc.)
- `next` 16.2.2→16.2.3, `ai` 6.0.151→6.0.156, `@ai-sdk/react` 3.0.153→3.0.158, `@tanstack/react-query` 5.96→5.97, `react`/`react-dom` 19.2.4→19.2.5, `axios` 1.14→1.15, `lucide-react` 1.7→1.8, `@biomejs/biome` 2.4.10→2.4.11, `ultracite` 7.4.3→7.4.4, plus `postcss`, `@types/node`
- `docs/pydantic-ai-llms-full.txt` refreshed: upstream restructured the docs from tutorial-first to reference-first, so the file shrank from 168k→74k lines. No APIs deprecated; content moved to the web docs site. See `.claude/rules/backend/pydantic-ai.md` for details.
Agent pattern adoption (1f06f5b)
- `backend/src/cortexdj/agents/brain_agent.py`: `summarize_tool_results` is now registered via the native `history_processors=` Agent parameter instead of being wrapped in `HistoryProcessor(...)` inside `capabilities=[]`. Matches every example in the updated upstream docs.
- `backend/src/cortexdj/agents/hooks.py` (new): `build_brain_agent_hooks()` returns a `Hooks` capability with `on_tool_execute_error` as a safety net for unanticipated tool exceptions (anticipated Spotify/DB errors are still handled inside the tools themselves). The hook returns a structured `{"error": "tool_failed", ...}` recovery payload so the agent can apologize conversationally instead of crashing the Vercel AI SDK stream mid-response.
- `backend/tests/evals/` (new directory):
* `test_prepare_tools.py`: 8 deterministic `TestModel`-backed tests verifying that `PlaylistCapability.prepare_tools` and `ClassificationCapability.prepare_tools` filter the offered tool set correctly when Spotify / EEG model are missing.
* `test_brain_agent_evals.py`: 7 real-model tool-routing cases using `pydantic_evals.Dataset`/`Case`/`Evaluator`, gated behind `@pytest.mark.eval`.
- `backend/pyproject.toml` adds `pydantic-evals>=1.79.0` to dev deps, registers the `eval` marker, and sets `addopts = "-m 'not eval'"` so real-model tests are excluded from the default pytest run.
- `backend/src/cortexdj/core/config.py`: adds `AGENT_REASONING_EFFORT: Literal["low","medium","high"] | None = None`. When set, `brain_agent.py` threads it through `OpenAIResponsesModelSettings(openai_reasoning_effort=...)` via the `model_settings=` parameter. Default None preserves current behavior.
- `.claude/rules/backend/pydantic-ai.md` (new): explains the upstream docs reorganization (llms-full.txt is now reference-only, tutorials live at ai.pydantic.dev) so future sessions don't grep for walkthroughs that have been moved out.
- `backend/tests/test_brain_agent_hooks.py` (new): 4 unit tests for the recovery payload helper and the `_recover_tool_error` handler using real `ToolCallPart`/`ToolDefinition` instances.
Code review fixes (c3c39a0)
- `hooks.py` uses the public `Hooks[AgentDeps](tool_execute_error=...)` constructor form instead of the decorator, removing private-API access from tests.
- `OPENAI_API_KEY` moved to session-level `backend/tests/conftest.py` (eliminates `noqa: E402` gymnastics in the evals conftest).
- `tests/evals/conftest.py` exposes `fake_spotify_client()` / `fake_eeg_model()` helpers via `MagicMock(spec=...)` so test intent is explicit and type-checked.
- `test_brain_agent_evals.py`: `BrainAgentInput` now carries `with_spotify`, so `build_relaxation_playlist_needs_confirmation` runs with a fake Spotify client and actually tests the confirmation flow instead of passing for the wrong reason (disconnected-refusal).
- `brain_agent.py` drops the explicit `OpenAIResponsesModelSettings | None` annotation (mypy infers it).
- `DEVELOPMENT.md` documents the `-m eval` two-tier test workflow under the Tests section.
Breaking Changes
None. Existing behavior is preserved across the board:
- `history_processors=` and `HistoryProcessor(...)` inside `capabilities=[]` are both valid; this PR swaps one for the other with no functional change.
- `Hooks` is additive: `on_tool_execute_error` only fires on unhandled exceptions; tools that return structured error dicts work unchanged.
- `AGENT_REASONING_EFFORT` defaults to `None` → no `OpenAIResponsesModelSettings` is constructed → agent behaves identically to before the PR unless the env var is explicitly set.
- `addopts = "-m 'not eval'"` excludes new eval-marked tests from the default `pytest` invocation; all existing tests still run.
Test Plan
- `uv run --directory backend pytest` → 57 passed, 1 deselected
- `uv run --directory backend pre-commit run --all-files` → all hooks pass (ruff, ruff-format, mypy)
- `pnpm -C frontend lint` → clean
- `code-reviewer` agent review applied (should-fixes + nice-to-haves)
🤖 Generated with Claude Code