fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) by Chibionos · Pull Request #1692 · UiPath/uipath-python

Chibionos · 2026-05-29T07:20:57Z

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only response_format json_schema and parsed response.choices[0].message.content. On the normalized LLM Gateway, response_format structured output is only honored for OpenAI models; for Claude the content comes back empty/None, so json.loads(None) raised → wrapped as UiPathMockResponseGenerationError → AGENT_RUNTIME.UNEXPECTED_ERROR.
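
The failure mode can be reproduced without the gateway. A minimal sketch, assuming a response object shaped like the one the mockers parsed; the `SimpleNamespace` response is a stand-in, not the real gateway type. Note that `json.loads(None)` raises `TypeError`, not `json.JSONDecodeError`, so a handler that only catches decode errors would not see it.

```python
import json
from types import SimpleNamespace

# Stand-in for a gateway response where the provider ignored response_format:
# the message content is None instead of a JSON string.
response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=None))]
)

failure = None
try:
    json.loads(response.choices[0].message.content)
except TypeError as exc:
    # json.loads accepts str, bytes or bytearray; None is rejected up front.
    failure = exc
```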

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression

Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (gpt_4_1_mini), so non-OpenAI providers were never exercised on this path — which is why Claude "worked before."

Fix

Switch both mockers to provider-agnostic function calling, mirroring llm_as_judge_evaluator (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

Build a forced tool that wraps the output/input schema under a response property, force it via tool_choice=required, and read tool_calls[0].arguments["response"] (already a parsed dict).
Hoist nested $defs to the tool-parameters root so $refs from nested Pydantic models still resolve once the schema is wrapped.
The normalized gateway's chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive — the ToolDefinition converter only emits flat properties.
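
The wrapping and hoisting steps listed above might look roughly like the following. This is a sketch only: `build_response_tool_sketch`, the tool name `submit_response`, and the outer tool-dict shape are illustrative assumptions, not the helper added in this PR.

```python
from typing import Any


def build_response_tool_sketch(
    schema: dict[str, Any], name: str = "submit_response"
) -> dict[str, Any]:
    """Wrap `schema` under a `response` property and hoist its $defs to the root."""
    inner = dict(schema)  # shallow copy so the caller's schema is not mutated
    defs = inner.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": inner},
        "required": ["response"],
    }
    if defs:
        # "#/$defs/X" pointers resolve from the document root, so once the
        # schema is nested under "response" the definitions must move up.
        parameters["$defs"] = defs
    # Hypothetical tool shape; the real gateway payload may differ.
    return {"name": name, "parameters": parameters}


schema = {
    "type": "object",
    "properties": {"op": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
tool = build_response_tool_sketch(schema)
```

The model's arguments would then be read from the `response` key of the first tool call, as described in the list above.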

New shared helper eval/mocks/_structured_output.py keeps both mockers DRY.

Tests

test_llm_mockable_structured_output_via_tool_call — parametrized over gpt-4.1-mini, anthropic.claude-sonnet-4-5, gemini-2.5-pro; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
test_build_response_tool_hoists_defs_to_root + helper error-branch unit tests.
test_raw_dict_tool_passthrough_mocked (platform) — asserts a nested array schema is forwarded byte-for-byte.
Existing mocker/input/span tests updated to the function-calling contract (behavior assertions preserved).
Full tests/cli/eval suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers

The OpenAI path also moves to function calling (no longer response_format), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested $defs in tool parameters have no precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code

github-actions · 2026-05-29T07:53:00Z

🚨 Heads up: `uipath-langchain` cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

Chibionos · 2026-05-29T08:28:48Z

Re: the uipath-langchain cross-test heads-up above — that was from an earlier commit. After the adaptive fix (prefer response_format, fall back to function calling only when content is empty), the cross-tests pass on the latest commit: langchain-cross / {alpha,cloud,staging} and test-uipath-langchain (3.11/3.12/3.13 × ubuntu/windows) are all green. The heads-up comment is stale and can be disregarded.

Full suite is green: all eval test-cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, build all pass.

mjnovice · 2026-06-03T01:04:06Z

+logger = logging.getLogger(__name__)
+
+
+def _inline_defs(


Can we have separate classes for how we handle Anthropic, OpenAI, Gemini, etc.?

mjnovice

Minor comment about making generate_structured_output more modular.

…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path. Fixes AE-1646. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation The normalized gateway accepts $ref/$defs in response_format but not inside a tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g. calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema with $ref/$defs that the gateway rejected, so simulation failed. Inline the definitions into a self-contained schema (cyclic refs keep their $defs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
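
The inlining described in this commit message could be approached along these lines. A minimal sketch under stated assumptions: `inline_defs_sketch` is a hypothetical name, only `#/$defs/...` pointers are handled, and a definition counts as cyclic when it can reach itself through `$ref`s. It is not the PR's `_inline_defs`.

```python
from typing import Any

_PREFIX = "#/$defs/"


def _ref_names(node: Any) -> set[str]:
    """Collect every $defs name referenced anywhere inside `node`."""
    names: set[str] = set()
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_PREFIX):
            names.add(ref[len(_PREFIX):])
        for value in node.values():
            names |= _ref_names(value)
    elif isinstance(node, list):
        for item in node:
            names |= _ref_names(item)
    return names


def inline_defs_sketch(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with non-cyclic refs expanded; cyclic defs stay in $defs."""
    defs: dict[str, Any] = schema.get("$defs", {})
    graph = {
        name: {n for n in _ref_names(body) if n in defs}
        for name, body in defs.items()
    }

    def reaches(start: str, target: str, seen: set[str]) -> bool:
        for nxt in graph[start]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, target, seen):
                    return True
        return False

    # A definition is cyclic when following its refs leads back to itself.
    cyclic = {name for name in defs if reaches(name, name, set())}

    def expand(node: Any) -> Any:
        if isinstance(node, list):
            return [expand(item) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_PREFIX):
            name = ref[len(_PREFIX):]
            if name in defs and name not in cyclic:
                # Replace the pointer with the definition, keeping sibling keys.
                siblings = {k: expand(v) for k, v in node.items() if k != "$ref"}
                return {**expand(defs[name]), **siblings}
        # Cyclic or unknown refs fall through and are copied unchanged.
        return {k: expand(v) for k, v in node.items() if k != "$defs"}

    result = expand(schema)
    if cyclic:
        result["$defs"] = {name: expand(defs[name]) for name in sorted(cyclic)}
    return result


wrapper = {
    "type": "object",
    "properties": {"value": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
flat = inline_defs_sketch(wrapper)

linked = {
    "$ref": "#/$defs/Node",
    "$defs": {
        "Node": {"type": "object", "properties": {"next": {"$ref": "#/$defs/Node"}}}
    },
}
kept = inline_defs_sketch(linked)
```

In this sketch `flat` carries no `$ref` or `$defs` at all, while `kept` still holds the recursive `Node` definition because expanding it would never terminate.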

All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a nested-enum output schema via function calling, where response_format was reliable). Make structured-output generation adaptive: prefer response_format (honored reliably by OpenAI, native $defs support) and fall back to a forced tool call only when content comes back empty (the non-OpenAI failure mode, e.g. Claude/Bedrock). Shared in generate_structured_output(), used by both mockers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
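
The adaptive strategy can be illustrated with fake providers. This is a sketch, not the PR's `generate_structured_output()`: the `call_llm` callable, the `FakeMessage`/`FakeToolCall` types, and the request shapes are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class FakeToolCall:
    arguments: dict[str, Any]


@dataclass
class FakeMessage:
    content: Optional[str] = None
    tool_calls: list[FakeToolCall] = field(default_factory=list)


def structured_output_sketch(
    call_llm: Callable[..., FakeMessage], schema: dict[str, Any]
) -> Any:
    # First attempt: response_format, which the commit message describes as
    # reliable for OpenAI models.
    message = call_llm(response_format={"type": "json_schema", "schema": schema})
    if message.content:
        return json.loads(message.content)
    # Fallback: content came back empty, so force a tool call instead.
    tool = {
        "name": "submit_response",  # hypothetical tool name
        "parameters": {
            "type": "object",
            "properties": {"response": schema},
            "required": ["response"],
        },
    }
    message = call_llm(tools=[tool], tool_choice="required")
    if not message.tool_calls:
        raise ValueError("model returned neither content nor a tool call")
    return message.tool_calls[0].arguments["response"]


def openai_like(**kwargs: Any) -> FakeMessage:
    # Honors response_format and answers with a JSON string.
    return FakeMessage(content='{"value": "+"}')


def claude_like(**kwargs: Any) -> FakeMessage:
    # Ignores response_format (empty content) but honors forced tools.
    if "tools" in kwargs:
        return FakeMessage(tool_calls=[FakeToolCall({"response": {"value": "+"}})])
    return FakeMessage(content=None)


schema = {"type": "object", "properties": {"value": {"type": "string"}}}
from_openai = structured_output_sketch(openai_like, schema)
from_claude = structured_output_sketch(claude_like, schema)
```

Both fake providers yield the same parsed dict, but only the second one triggers the fallback request.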

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-06-08T17:50:59Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Copilot

Pull request overview

This PR addresses eval tool/input simulation failures for non-OpenAI models by introducing a provider-agnostic structured-output helper and updating the eval mockers and LLM gateway integration to support function-calling style structured responses (including nested schemas via raw-dict tool passthrough).

Changes:

Add generate_structured_output() helper to prefer response_format when available and fall back to a forced tool call when content is empty/unsupported.
Update eval LLM mocker and input mocker to use the shared structured-output helper, and adjust/extend unit tests for the new behavior (including non-OpenAI fallback).
Update UiPathLlmChatService.chat_completions() to accept raw-dict tools (pass-through) so nested JSON-schema tool parameters are preserved; bump package versions accordingly.
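
The pass-through behavior in the last item might be pictured as follows. A sketch under assumptions: `ToolDefinitionStandIn` and `serialize_tools_sketch` are hypothetical and do not reproduce the real `ToolDefinition` model or the gateway request body.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class ToolDefinitionStandIn:
    """Hypothetical stand-in for a typed tool definition with flat properties."""

    name: str
    description: str
    properties: dict[str, Any] = field(default_factory=dict)


def serialize_tools_sketch(
    tools: list[Union[ToolDefinitionStandIn, dict[str, Any]]],
) -> list[dict[str, Any]]:
    serialized: list[dict[str, Any]] = []
    for tool in tools:
        if isinstance(tool, dict):
            # Raw dicts are forwarded untouched so nested schemas survive.
            serialized.append(tool)
        else:
            # Typed definitions are converted; this converter is flat-only.
            serialized.append(
                {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": {"type": "object", "properties": tool.properties},
                }
            )
    return serialized


nested_tool = {
    "name": "submit_response",
    "parameters": {
        "type": "object",
        "properties": {
            "response": {"type": "array", "items": {"type": "object"}}
        },
    },
}
typed_tool = ToolDefinitionStandIn("add", "Add two numbers", {"a": {"type": "number"}})
payload = serialize_tools_sketch([nested_tool, typed_tool])
```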

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/uipath/uv.lock | Bumps `uipath` and `uipath-platform` locked versions. |
| packages/uipath/pyproject.toml | Bumps `uipath` version and raises minimum `uipath-platform` dependency. |
| packages/uipath-platform/uv.lock | Bumps `uipath-platform` locked version. |
| packages/uipath-platform/pyproject.toml | Bumps `uipath-platform` version. |
| packages/uipath-platform/src/uipath/platform/chat/_llm_gateway_service.py | Allows passing raw-dict tools through to the normalized gateway request body. |
| packages/uipath-platform/tests/services/test_uipath_llm_integration.py | Adds coverage ensuring raw-dict tool schemas are forwarded unchanged and tool_choice serialization works. |
| packages/uipath/src/uipath/eval/mocks/_structured_output.py | New shared helper for structured output with response_format-first + tool-call fallback; includes schema `$defs` inlining logic. |
| packages/uipath/src/uipath/eval/mocks/_llm_mocker.py | Switches mock response generation to the shared structured-output helper; improves error propagation. |
| packages/uipath/src/uipath/eval/mocks/_input_mocker.py | Switches input generation to the shared structured-output helper. |
| packages/uipath/tests/cli/eval/mocks/test_structured_output.py | New unit tests validating schema wrapping/inlining and response extraction/fallback behavior. |
| packages/uipath/tests/cli/eval/mocks/test_mocks.py | Updates existing mocks and adds a non-OpenAI fallback regression test (AE-1646). |
| packages/uipath/tests/cli/eval/mocks/test_input_mocker.py | Adds assertion that OpenAI path uses `response_format` without tool fallback. |

Comments suppressed due to low confidence (1)

packages/uipath/src/uipath/eval/mocks/_input_mocker.py:158

generate_llm_input() no longer surfaces a clear JSON-parsing error when the LLM returns invalid JSON (the previous code raised UiPathInputMockingError("Failed to parse LLM response as JSON: ...")). Now a json.JSONDecodeError from generate_structured_output() is wrapped as a generic "Failed to generate input" error, which makes debugging structured-output failures harder.

    except UiPathInputMockingError:
        raise
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {str(e)}") from e
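
One way to keep the clearer parse error this comment asks about is to catch the decode failure before the generic handler. A hedged sketch: `InputMockingErrorStandIn` is a local stand-in for `UiPathInputMockingError`, and `parse_generated_input` is a hypothetical helper, not code from the PR.

```python
import json
from typing import Any


class InputMockingErrorStandIn(Exception):
    """Local stand-in for UiPathInputMockingError."""


def parse_generated_input(raw: str) -> Any:
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        # Specific message first, so invalid JSON is easy to spot in logs.
        raise InputMockingErrorStandIn(
            f"Failed to parse LLM response as JSON: {e}"
        ) from e
    except Exception as e:
        # Anything else still gets the generic wrapper.
        raise InputMockingErrorStandIn(f"Failed to generate input: {e}") from e
```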

+"""Provider-agnostic structured output for the eval mockers.
+
+The normalized LLM Gateway honors OpenAI-style ``response_format`` (json_schema)
+only for OpenAI models — and does so reliably, including native ``$defs``
+support. Non-OpenAI providers (Anthropic/Claude via Bedrock, Gemini) return such
+requests with ``choices[0].message.content`` empty/None, which breaks JSON
+parsing. Function calling is honored across providers but is less reliable for
+OpenAI on some schemas, so it is used only as a fallback: prefer
+``response_format`` and fall back to a forced tool call when the content comes
+back empty.


        presence_penalty: float = 0,
        top_p: float | None = 1,
        top_k: int | None = None,
-        tools: list[ToolDefinition] | None = None,
+        tools: list[ToolDefinition | dict[str, Any]] | None = None,
        tool_choice: ToolChoice | None = None,
        response_format: dict[str, Any] | type[BaseModel] | None = None,


Chibionos mentioned this pull request May 29, 2026

fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1691

Closed

github-actions Bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 29, 2026

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from 5e4bdb2 to ae78cbe May 29, 2026 08:06

mjnovice reviewed Jun 3, 2026

mjnovice approved these changes Jun 3, 2026

Chibionos and others added 5 commits June 8, 2026 10:43

chore: bump uipath to 2.10.79 and uipath-platform to 0.1.62

54a6557

The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(eval): add explicit type params to _FakeLLM for mypy

5bddc0f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 8, 2026 17:47

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from b4954be to 5bddc0f June 8, 2026 17:47

Copilot started reviewing on behalf of Chibionos June 8, 2026 17:47

Copilot AI reviewed Jun 8, 2026

Chibionos wants to merge 5 commits into main from fix/ae-1646-mocker-non-openai-models

3 participants
