fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1692
Chibionos wants to merge 5 commits into
5e4bdb2 to ae78cbe
Re: the Full suite is green: all eval test cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, and build all pass.
logger = logging.getLogger(__name__)


def _inline_defs(
Can we have separate classes for how we handle Anthropic, OpenAI, Gemini, etc.?
mjnovice left a comment
Minor comment about making `generate_structured_output` more modular.
…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
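The wrapping and hoisting described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in `_structured_output.py`: the function names, the tool name `submit_response`, and the exact tool-dict shape are assumptions.

```python
from copy import deepcopy
from typing import Any


def build_response_tool(
    schema: dict[str, Any], name: str = "submit_response"
) -> dict[str, Any]:
    """Wrap `schema` under a `response` property (hypothetical helper).

    The tool name and the dict layout are assumptions for illustration.
    """
    inner = deepcopy(schema)  # never mutate the caller's schema
    # Refs such as "#/$defs/Operator" resolve against the document root, so
    # the definitions must sit at the root of the tool parameters rather
    # than under the `response` property.
    defs = inner.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": inner},
        "required": ["response"],
    }
    if defs:
        parameters["$defs"] = defs
    return {
        "name": name,
        "description": "Return the structured result.",
        "parameters": parameters,
    }


def extract_response(tool_calls: list[dict[str, Any]]) -> Any:
    """Read the already-parsed payload from the first tool call."""
    if not tool_calls:
        raise ValueError("model did not call the forced tool")
    return tool_calls[0]["arguments"]["response"]


example_schema = {
    "type": "object",
    "properties": {"operator": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
example_tool = build_response_tool(example_schema)
```

Deep-copying the input keeps the caller's schema intact, which matters if the same schema object is reused for other requests.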
The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation

The normalized gateway accepts $ref/$defs in response_format but not inside a tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g. calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema with $ref/$defs that the gateway rejected, so simulation failed. Inline the definitions into a self-contained schema (cyclic refs keep their $defs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
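A minimal sketch of what inlining could look like, assuming refs always take the `#/$defs/<name>` form that Pydantic emits. The name `inline_defs` and the choice to retain the whole `$defs` block whenever a cycle is found are assumptions, not necessarily what `_inline_defs` in this PR does.

```python
from copy import deepcopy
from typing import Any

_REF_PREFIX = "#/$defs/"


def inline_defs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with `#/$defs/...` refs replaced inline."""
    defs: dict[str, Any] = schema.get("$defs", {})
    found_cycle = False

    def walk(node: Any, stack: tuple[str, ...]) -> Any:
        nonlocal found_cycle
        if isinstance(node, list):
            return [walk(item, stack) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_REF_PREFIX):
            name = ref[len(_REF_PREFIX):]
            if name not in defs:
                return dict(node)  # unknown target: leave untouched
            if name in stack:
                found_cycle = True  # recursive model: keep the $ref
                return dict(node)
            # Keywords next to $ref (e.g. description) override the target.
            siblings = {k: walk(v, stack) for k, v in node.items() if k != "$ref"}
            return {**walk(defs[name], stack + (name,)), **siblings}
        return {k: walk(v, stack) for k, v in node.items()}

    root = {k: v for k, v in schema.items() if k != "$defs"}
    result = walk(root, ())
    if found_cycle:
        # Remaining refs still need their targets, so keep the definitions.
        result["$defs"] = deepcopy(defs)
    return result


flat = inline_defs(
    {
        "type": "object",
        "properties": {"value": {"$ref": "#/$defs/Operator"}},
        "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
    }
)
recursive = inline_defs(
    {
        "$ref": "#/$defs/Node",
        "$defs": {
            "Node": {
                "type": "object",
                "properties": {"next": {"$ref": "#/$defs/Node"}},
            }
        },
    }
)
```

Here `flat` ends up with no `$ref` or `$defs` at all, while `recursive` keeps one `$ref` for the self-reference together with the `$defs` block it points into.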
All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a nested-enum output schema via function calling, where response_format was reliable).

Make structured-output generation adaptive: prefer response_format (honored reliably by OpenAI, native $defs support) and fall back to a forced tool call only when content comes back empty (the non-OpenAI failure mode, e.g. Claude/Bedrock). Shared in generate_structured_output(), used by both mockers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
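The adaptive flow might look roughly like the sketch below. It is not the real `generate_structured_output()`: the `complete` callable, the plain-dict response shape, and the two fake providers are stand-ins invented here so the example runs without a gateway.

```python
import json
from typing import Any, Callable

Completion = dict[str, Any]  # simplified stand-in for a chat completion


def generate_structured(
    complete: Callable[..., Completion], schema: dict[str, Any]
) -> Any:
    """Prefer response_format; fall back to a forced tool call."""
    first = complete(
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": schema},
        }
    )
    content = first.get("content")
    if content and content.strip():
        return json.loads(content)

    # Empty content is the non-OpenAI failure mode: retry with a forced tool.
    tool = {
        "name": "submit_response",  # hypothetical tool name
        "parameters": {
            "type": "object",
            "properties": {"response": schema},
            "required": ["response"],
        },
    }
    second = complete(tools=[tool], tool_choice="required")
    calls = second.get("tool_calls") or []
    if not calls:
        raise ValueError("model returned neither content nor a tool call")
    return calls[0]["arguments"]["response"]


SCHEMA = {"type": "object", "properties": {"result": {"type": "integer"}}}


def openai_like(**kwargs: Any) -> Completion:
    """Fake provider that honors response_format."""
    return {"content": '{"result": 3}'}


def claude_like(**kwargs: Any) -> Completion:
    """Fake provider that ignores response_format but honors tools."""
    if "response_format" in kwargs:
        return {"content": None}
    return {"tool_calls": [{"arguments": {"response": {"result": 3}}}]}
```

With these fakes, both `generate_structured(openai_like, SCHEMA)` and `generate_structured(claude_like, SCHEMA)` produce the same dict; the second simply costs one extra request.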
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b4954be to 5bddc0f
Pull request overview
This PR addresses eval tool/input simulation failures for non-OpenAI models by introducing a provider-agnostic structured-output helper and updating the eval mockers and LLM gateway integration to support function-calling style structured responses (including nested schemas via raw-dict tool passthrough).
Changes:
- Add `generate_structured_output()` helper to prefer `response_format` when available and fall back to a forced tool call when content is empty/unsupported.
- Update eval LLM mocker and input mocker to use the shared structured-output helper, and adjust/extend unit tests for the new behavior (including non-OpenAI fallback).
- Update `UiPathLlmChatService.chat_completions()` to accept raw-dict tools (pass-through) so nested JSON-schema tool parameters are preserved; bump package versions accordingly.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/uipath/uv.lock | Bumps uipath and uipath-platform locked versions. |
| packages/uipath/pyproject.toml | Bumps uipath version and raises minimum uipath-platform dependency. |
| packages/uipath-platform/uv.lock | Bumps uipath-platform locked version. |
| packages/uipath-platform/pyproject.toml | Bumps uipath-platform version. |
| packages/uipath-platform/src/uipath/platform/chat/_llm_gateway_service.py | Allows passing raw-dict tools through to the normalized gateway request body. |
| packages/uipath-platform/tests/services/test_uipath_llm_integration.py | Adds coverage ensuring raw-dict tool schemas are forwarded unchanged and tool_choice serialization works. |
| packages/uipath/src/uipath/eval/mocks/_structured_output.py | New shared helper for structured output with response_format-first + tool-call fallback; includes schema $defs inlining logic. |
| packages/uipath/src/uipath/eval/mocks/_llm_mocker.py | Switches mock response generation to the shared structured-output helper; improves error propagation. |
| packages/uipath/src/uipath/eval/mocks/_input_mocker.py | Switches input generation to the shared structured-output helper. |
| packages/uipath/tests/cli/eval/mocks/test_structured_output.py | New unit tests validating schema wrapping/inlining and response extraction/fallback behavior. |
| packages/uipath/tests/cli/eval/mocks/test_mocks.py | Updates existing mocks and adds a non-OpenAI fallback regression test (AE-1646). |
| packages/uipath/tests/cli/eval/mocks/test_input_mocker.py | Adds assertion that OpenAI path uses response_format without tool fallback. |
Comments suppressed due to low confidence (1)
packages/uipath/src/uipath/eval/mocks/_input_mocker.py:158
`generate_llm_input()` no longer surfaces a clear JSON-parsing error when the LLM returns invalid JSON (the previous code raised `UiPathInputMockingError("Failed to parse LLM response as JSON: ...")`). Now a `json.JSONDecodeError` from `generate_structured_output()` is wrapped as a generic "Failed to generate input" error, which makes debugging structured-output failures harder.
except UiPathInputMockingError:
    raise
except Exception as e:
    raise UiPathInputMockingError(f"Failed to generate input: {str(e)}") from e
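One way to keep the parse failure distinguishable is to catch `json.JSONDecodeError` before the generic handler. The sketch below uses a hypothetical `InputMockingError` as a stand-in for `UiPathInputMockingError` and a hypothetical `parse_generated_input` helper; neither is taken from the PR.

```python
import json
from typing import Any


class InputMockingError(Exception):
    """Hypothetical stand-in for UiPathInputMockingError."""


def parse_generated_input(raw: str) -> Any:
    """Parse LLM output, keeping JSON errors distinct from other failures."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # Specific handler first, so the message names the real problem.
        raise InputMockingError(
            f"Failed to parse LLM response as JSON: {e}"
        ) from e
    except Exception as e:
        raise InputMockingError(f"Failed to generate input: {e}") from e
```

Because `raise ... from e` preserves the original exception as `__cause__`, the decode position and message remain available in the traceback either way; the dedicated branch only makes the top-level message more useful.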
"""Provider-agnostic structured output for the eval mockers.

The normalized LLM Gateway honors OpenAI-style ``response_format`` (json_schema)
only for OpenAI models, and does so reliably, including native ``$defs``
support. Non-OpenAI providers (Anthropic/Claude via Bedrock, Gemini) return such
requests with ``choices[0].message.content`` empty/None, which breaks JSON
parsing. Function calling is honored across providers but is less reliable for
OpenAI on some schemas, so it is used only as a fallback: prefer
``response_format`` and fall back to a forced tool call when the content comes
back empty.
  presence_penalty: float = 0,
  top_p: float | None = 1,
  top_k: int | None = None,
- tools: list[ToolDefinition] | None = None,
+ tools: list[ToolDefinition | dict[str, Any]] | None = None,
  tool_choice: ToolChoice | None = None,
  response_format: dict[str, Any] | type[BaseModel] | None = None,
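The widened `tools` type suggests a serialization step along these lines. This is a sketch under assumptions: `FlatTool` is a made-up stand-in for `ToolDefinition`, and `serialize_tools` is a hypothetical helper, not a function from `_llm_gateway_service.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FlatTool:
    """Hypothetical stand-in for ToolDefinition: flat properties only."""

    name: str
    description: str
    properties: dict[str, str] = field(default_factory=dict)  # name -> JSON type


def serialize_tools(
    tools: list[FlatTool | dict[str, Any]] | None,
) -> list[dict[str, Any]]:
    """Convert typed tools; forward raw dicts unchanged."""
    serialized: list[dict[str, Any]] = []
    for tool in tools or []:
        if isinstance(tool, dict):
            # Pass-through: nested schemas survive because nothing is rebuilt.
            serialized.append(tool)
            continue
        serialized.append(
            {
                "name": tool.name,
                "description": tool.description,
                "parameters": {
                    "type": "object",
                    "properties": {
                        key: {"type": json_type}
                        for key, json_type in tool.properties.items()
                    },
                },
            }
        )
    return serialized


nested_tool = {
    "name": "submit_response",
    "parameters": {
        "type": "object",
        "properties": {
            "items": {"type": "array", "items": {"type": "object"}},
        },
    },
}
payload = serialize_tools([FlatTool("add", "Add numbers", {"a": "number"}), nested_tool])
```

The raw dict comes out as the very same object, so whatever nesting the caller built is what reaches the request body.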



Summary
Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with `AGENT_RUNTIME.UNEXPECTED_ERROR` for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `response.choices[0].message.content`. On the normalized LLM Gateway, `response_format` structured output is only honored for OpenAI models; for Claude the content comes back empty/None, so `json.loads(None)` raised. That error was wrapped as `UiPathMockResponseGenerationError` and surfaced as `AGENT_RUNTIME.UNEXPECTED_ERROR`.

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression
Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (`gpt_4_1_mini`), so non-OpenAI providers were never exercised on this path, which is why Claude "worked before."

Fix
Switch both mockers to provider-agnostic function calling, mirroring `llm_as_judge_evaluator` (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):
- Build a forced tool that wraps the output/input schema under a `response` property, force it via `tool_choice=required`, and read `tool_calls[0].arguments["response"]` (already a parsed dict).
- Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve once the schema is wrapped.
- `chat_completions` now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive; the `ToolDefinition` converter only emits flat properties.

New shared helper `eval/mocks/_structured_output.py` keeps both mockers DRY.

Tests
- `test_llm_mockable_structured_output_via_tool_call`: parametrized over `gpt-4.1-mini`, `anthropic.claude-sonnet-4-5`, `gemini-2.5-pro`; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
- `test_build_response_tool_hoists_defs_to_root` + helper error-branch unit tests.
- `test_raw_dict_tool_passthrough_mocked` (platform): asserts a nested array schema is forwarded byte-for-byte.
- `tests/cli/eval` suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers
The OpenAI path also moves to function-calling (no longer `response_format`), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested `$defs` in tool `parameters` has no prior precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code