fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1692

Open
Chibionos wants to merge 5 commits into main from fix/ae-1646-mocker-non-openai-models

Conversation

@Chibionos

Contributor

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Both eval mockers requested structured output via OpenAI-only response_format json_schema and parsed response.choices[0].message.content. On the normalized LLM Gateway, response_format structured output is honored only for OpenAI models; for Claude the content comes back empty/None, so json.loads(None) raised → wrapped as UiPathMockResponseGenerationError → AGENT_RUNTIME.UNEXPECTED_ERROR.
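
The failure mode is easy to reproduce in isolation. This is a minimal sketch, assuming a response shaped like an OpenAI chat-completions payload; `fake_response` and `parse_content` are hypothetical stand-ins, not the gateway's response type or the mockers' actual code:

```python
import json

# Hypothetical stand-in for a gateway response where the provider ignored
# response_format and returned a message with no content.
fake_response = {"choices": [{"message": {"content": None}}]}


def parse_content(response: dict) -> dict:
    """Parse the message content as JSON, the way the description says the old mockers did."""
    content = response["choices"][0]["message"]["content"]
    return json.loads(content)


try:
    parse_content(fake_response)
    error_name = None
except TypeError as exc:
    # json.loads rejects None with TypeError, not json.JSONDecodeError,
    # so a handler that only catches JSONDecodeError would miss it.
    error_name = type(exc).__name__
```

Because the exception is a TypeError rather than a JSON decoding error, it is plausible that it fell through to a generic handler, which matches the generic UNEXPECTED_ERROR reported above.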

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause / regression

Regression from #1555, which started routing the agent's model into simulations. Before that, simulation always used a fixed OpenAI model (gpt_4_1_mini), so non-OpenAI providers were never exercised on this path — which is why Claude "worked before."

Fix

Switch both mockers to provider-agnostic function calling, mirroring llm_as_judge_evaluator (whose docstring already states function calling is the way to get structured output across OpenAI/Claude/Gemini):

  • Build a forced tool that wraps the output/input schema under a response property, force it via tool_choice=required, and read tool_calls[0].arguments["response"] (already a parsed dict).
  • Hoist nested $defs to the tool-parameters root so $refs from nested Pydantic models still resolve once the schema is wrapped.
  • The normalized gateway's chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas survive — the ToolDefinition converter only emits flat properties.
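
The wrapping and hoisting steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the helper from this PR: `wrap_schema_as_tool`, the tool name `submit_response`, and the `name`/`description`/`parameters` dict shape are all hypothetical.

```python
import copy
from typing import Any


def wrap_schema_as_tool(schema: dict[str, Any], name: str = "submit_response") -> dict[str, Any]:
    """Wrap `schema` under a `response` property and hoist its $defs to the root."""
    inner = copy.deepcopy(schema)  # never mutate the caller's schema
    defs = inner.pop("$defs", None)
    parameters: dict[str, Any] = {
        "type": "object",
        "properties": {"response": inner},
        "required": ["response"],
    }
    if defs:
        # "#/$defs/X" refs resolve against the document root, so the
        # definitions must sit on the wrapper, not under properties.response.
        parameters["$defs"] = defs
    # Assumed tool shape; the real gateway contract may differ.
    return {
        "name": name,
        "description": "Return the structured response.",
        "parameters": parameters,
    }


schema = {
    "type": "object",
    "properties": {"op": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-"]}},
}
tool = wrap_schema_as_tool(schema)
```

Note that a later commit in this PR reports that the gateway rejects $ref/$defs inside tool parameters and inlines the definitions instead, so hoisting alone was not sufficient in practice.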

New shared helper eval/mocks/_structured_output.py keeps both mockers DRY.

Tests

  • test_llm_mockable_structured_output_via_tool_call — parametrized over gpt-4.1-mini, anthropic.claude-sonnet-4-5, gemini-2.5-pro; reproduces AE-1646 (content=None + tool_calls) and asserts the new contract.
  • test_build_response_tool_hoists_defs_to_root + helper error-branch unit tests.
  • test_raw_dict_tool_passthrough_mocked (platform) — asserts a nested array schema is forwarded byte-for-byte.
  • Existing mocker/input/span tests updated to the function-calling contract (behavior assertions preserved).
  • Full tests/cli/eval suite + platform mocked LLM tests green; ruff + mypy clean.

Note for reviewers

The OpenAI path also moves to function calling (no longer response_format), matching the judge. Worth a live check that a Claude/Bedrock agent simulating a nested-model tool output round-trips through the gateway, since nested $defs in tool parameters have no precedent in this repo (the judge only used flat schemas).

🤖 Generated with Claude Code

@github-actions


🚨 Heads up: uipath-langchain cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

@Chibionos Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from 5e4bdb2 to ae78cbe Compare May 29, 2026 08:06
@Chibionos

Contributor Author

Re: the uipath-langchain cross-test heads-up above — that was from an earlier commit. After the adaptive fix (prefer response_format, fall back to function calling only when content is empty), the cross-tests pass on the latest commit: langchain-cross / {alpha,cloud,staging} and test-uipath-langchain (3.11/3.12/3.13 × ubuntu/windows) are all green. The heads-up comment is stale and can be disregarded.

Full suite is green: all eval test-cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, build all pass.

logger = logging.getLogger(__name__)


def _inline_defs(

Contributor


Can we have separate classes for how we handle Anthropic, OpenAI, Gemini, etc.?

@mjnovice mjnovice left a comment

Contributor


Minor comment about making the generate_structured_output more modular.

Chibionos and others added 5 commits June 8, 2026 10:43
…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs
failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic
Claude via Bedrock, Gemini). The mockers requested structured output via
OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`;
for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors
llm_as_judge_evaluator): build a forced tool that wraps the output/input schema
under a `response` property, force it via tool_choice, and read
`tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested
`$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still
resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary
nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by #1555, which started routing the agent's model into
simulations; before that, simulation always used a fixed OpenAI model, so
non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The mocker fix in uipath depends on the dict-tool passthrough in
uipath-platform, so uipath's lower-bound pin is raised to 0.1.62.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation

The normalized gateway accepts $ref/$defs in response_format but not inside a
tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g.
calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema
with $ref/$defs that the gateway rejected, so simulation failed. Inline the
definitions into a self-contained schema (cyclic refs keep their $defs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
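
The inlining described in the commit above can be sketched as follows. This is a hedged illustration, not the PR's `_inline_defs`: the name `inline_defs_sketch` is hypothetical, and it assumes all references have the Pydantic-style form `#/$defs/Name` pointing at root-level definitions.

```python
from typing import Any

REF_PREFIX = "#/$defs/"


def inline_defs_sketch(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with root-level $defs references inlined.

    Refs that would recurse forever are left in place, and only the
    definitions they point at are kept under $defs.
    """
    defs: dict[str, Any] = schema.get("$defs", {})
    cyclic: set[str] = set()

    def resolve(node: Any, stack: tuple[str, ...]) -> Any:
        if isinstance(node, list):
            return [resolve(item, stack) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(REF_PREFIX):
            name = ref[len(REF_PREFIX):]
            if name not in defs:
                return dict(node)  # unknown target: leave untouched
            if name in stack:
                cyclic.add(name)  # already expanding this definition
                return {"$ref": ref}
            siblings = {k: v for k, v in node.items() if k not in ("$ref", "$defs")}
            # Keys written next to the $ref (e.g. description) win over the target's.
            return {**resolve(defs[name], stack + (name,)), **resolve(siblings, stack)}
        return {k: resolve(v, stack) for k, v in node.items() if k != "$defs"}

    result = resolve(schema, ())
    kept: dict[str, Any] = {}
    while cyclic - set(kept):
        name = min(cyclic - set(kept))
        kept[name] = resolve(defs[name], (name,))
    if kept:
        result["$defs"] = kept
    return result


# Enum behind a $ref, similar in spirit to the calculator example above.
enum_schema = {
    "type": "object",
    "properties": {"result": {"$ref": "#/$defs/Operator"}},
    "$defs": {"Operator": {"type": "string", "enum": ["+", "-", "*", "/"]}},
}
flat = inline_defs_sketch(enum_schema)

# Self-referential definition: the inner ref and its $defs entry survive.
linked_schema = {
    "type": "object",
    "properties": {"head": {"$ref": "#/$defs/Node"}},
    "$defs": {"Node": {"type": "object", "properties": {"next": {"$ref": "#/$defs/Node"}}}},
}
linked = inline_defs_sketch(linked_schema)
```

The recursion guard is the design point: a definition is expanded once per path, and any ref back to a definition already on the path is kept, so the output is finite even for recursive models.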
All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random
Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a
nested-enum output schema via function calling, where response_format was
reliable). Make structured-output generation adaptive: prefer response_format
(honored reliably by OpenAI, native $defs support) and fall back to a forced
tool call only when content comes back empty (the non-OpenAI failure mode, e.g.
Claude/Bedrock). Shared in generate_structured_output(), used by both mockers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
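
The adaptive strategy from the commit above (response_format first, forced tool call only when content is empty) can be sketched like this. It is an illustration under assumptions, not the PR's generate_structured_output(): the `chat` callable, the message dict shape, and the tool name are hypothetical stand-ins for the gateway client.

```python
import json
from typing import Any, Callable

# Hypothetical transport: accepts request kwargs, returns an OpenAI-style message dict.
ChatFn = Callable[..., dict[str, Any]]


def generate_structured(chat: ChatFn, schema: dict[str, Any]) -> dict[str, Any]:
    """Prefer response_format; fall back to a forced tool call on empty content."""
    message = chat(
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "response", "schema": schema},
        }
    )
    content = message.get("content")
    if content:
        return json.loads(content)

    # Empty content: the provider did not honor response_format.
    tool = {
        "name": "submit_response",
        "parameters": {
            "type": "object",
            "properties": {"response": schema},
            "required": ["response"],
        },
    }
    message = chat(tools=[tool], tool_choice="required")
    calls = message.get("tool_calls") or []
    if not calls:
        raise ValueError("model returned neither content nor a tool call")
    # Arguments are assumed to arrive already parsed, as the PR description states.
    return calls[0]["arguments"]["response"]


def openai_like(**kwargs: Any) -> dict[str, Any]:
    """Fake provider that honors response_format."""
    return {"content": '{"op": "+"}'}


def claude_like(**kwargs: Any) -> dict[str, Any]:
    """Fake provider that only answers through tool calls."""
    if "tools" in kwargs:
        return {"content": None, "tool_calls": [{"arguments": {"response": {"op": "-"}}}]}
    return {"content": None}


schema = {"type": "object", "properties": {"op": {"type": "string"}}}
from_openai = generate_structured(openai_like, schema)
from_claude = generate_structured(claude_like, schema)
```

The fallback costs one extra round trip for providers that ignore response_format, in exchange for leaving the OpenAI path on the mechanism the commit message reports as more reliable.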
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 8, 2026 17:47
@Chibionos Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from b4954be to 5bddc0f Compare June 8, 2026 17:47
@sonarqubecloud

sonarqubecloud Bot commented Jun 8, 2026


Copilot AI left a comment


Pull request overview

This PR addresses eval tool/input simulation failures for non-OpenAI models by introducing a provider-agnostic structured-output helper and updating the eval mockers and LLM gateway integration to support function-calling style structured responses (including nested schemas via raw-dict tool passthrough).

Changes:

  • Add generate_structured_output() helper to prefer response_format when available and fall back to a forced tool call when content is empty/unsupported.
  • Update eval LLM mocker and input mocker to use the shared structured-output helper, and adjust/extend unit tests for the new behavior (including non-OpenAI fallback).
  • Update UiPathLlmChatService.chat_completions() to accept raw-dict tools (pass-through) so nested JSON-schema tool parameters are preserved; bump package versions accordingly.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| packages/uipath/uv.lock | Bumps uipath and uipath-platform locked versions. |
| packages/uipath/pyproject.toml | Bumps uipath version and raises minimum uipath-platform dependency. |
| packages/uipath-platform/uv.lock | Bumps uipath-platform locked version. |
| packages/uipath-platform/pyproject.toml | Bumps uipath-platform version. |
| packages/uipath-platform/src/uipath/platform/chat/_llm_gateway_service.py | Allows passing raw-dict tools through to the normalized gateway request body. |
| packages/uipath-platform/tests/services/test_uipath_llm_integration.py | Adds coverage ensuring raw-dict tool schemas are forwarded unchanged and tool_choice serialization works. |
| packages/uipath/src/uipath/eval/mocks/_structured_output.py | New shared helper for structured output with response_format-first + tool-call fallback; includes schema $defs inlining logic. |
| packages/uipath/src/uipath/eval/mocks/_llm_mocker.py | Switches mock response generation to the shared structured-output helper; improves error propagation. |
| packages/uipath/src/uipath/eval/mocks/_input_mocker.py | Switches input generation to the shared structured-output helper. |
| packages/uipath/tests/cli/eval/mocks/test_structured_output.py | New unit tests validating schema wrapping/inlining and response extraction/fallback behavior. |
| packages/uipath/tests/cli/eval/mocks/test_mocks.py | Updates existing mocks and adds a non-OpenAI fallback regression test (AE-1646). |
| packages/uipath/tests/cli/eval/mocks/test_input_mocker.py | Adds assertion that OpenAI path uses response_format without tool fallback. |
Comments suppressed due to low confidence (1)

packages/uipath/src/uipath/eval/mocks/_input_mocker.py:158

  • generate_llm_input() no longer surfaces a clear JSON-parsing error when the LLM returns invalid JSON (the previous code raised UiPathInputMockingError("Failed to parse LLM response as JSON: ...")). Now a json.JSONDecodeError from generate_structured_output() is wrapped as a generic "Failed to generate input" error, which makes debugging structured-output failures harder.
    except UiPathInputMockingError:
        raise
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {str(e)}") from e
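
One way to keep the specific parse error that this suppressed comment asks for is to convert the decode failure at the point where it happens, so the outer handler re-raises it unchanged. This is a sketch only: `InputMockingError` is a hypothetical stand-in for UiPathInputMockingError, and `parse_llm_json` and `generate_input` are illustrative names, not functions from this PR.

```python
import json
from typing import Any


class InputMockingError(Exception):
    """Hypothetical stand-in for UiPathInputMockingError."""


def parse_llm_json(raw: str) -> dict[str, Any]:
    """Parse model output, converting decode failures into a specific error."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as e:
        raise InputMockingError(f"Failed to parse LLM response as JSON: {e}") from e


def generate_input(raw: str) -> dict[str, Any]:
    """Mirror the try/except shape quoted above, with the parse error preserved."""
    try:
        return parse_llm_json(raw)
    except InputMockingError:
        raise  # already specific; do not re-wrap
    except Exception as e:
        raise InputMockingError(f"Failed to generate input: {e}") from e


try:
    generate_input("{not json")
    message = ""
except InputMockingError as exc:
    message = str(exc)
```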


Comment on lines +1 to +10
"""Provider-agnostic structured output for the eval mockers.

The normalized LLM Gateway honors OpenAI-style ``response_format`` (json_schema)
only for OpenAI models — and does so reliably, including native ``$defs``
support. Non-OpenAI providers (Anthropic/Claude via Bedrock, Gemini) return such
requests with ``choices[0].message.content`` empty/None, which breaks JSON
parsing. Function calling is honored across providers but is less reliable for
OpenAI on some schemas, so it is used only as a fallback: prefer
``response_format`` and fall back to a forced tool call when the content comes
back empty.
Comment on lines 401 to 406
  presence_penalty: float = 0,
  top_p: float | None = 1,
  top_k: int | None = None,
- tools: list[ToolDefinition] | None = None,
+ tools: list[ToolDefinition | dict[str, Any]] | None = None,
  tool_choice: ToolChoice | None = None,
  response_format: dict[str, Any] | type[BaseModel] | None = None,
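
The widened `tools` type implies a branch somewhere in request building: typed definitions are converted, raw dicts are forwarded untouched. A minimal sketch of that idea follows, with loud assumptions: `FlatTool` is a hypothetical stand-in for ToolDefinition, and `serialize_tools` and the output dict shape are illustrative, not the service's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class FlatTool:
    """Hypothetical stand-in for ToolDefinition: flat properties only."""

    name: str
    description: str
    properties: dict[str, str] = field(default_factory=dict)  # property name -> JSON type


def serialize_tools(tools: list[FlatTool | dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert typed tools to dicts and pass raw dicts through unchanged."""
    out: list[dict[str, Any]] = []
    for tool in tools:
        if isinstance(tool, dict):
            out.append(tool)  # pass-through: nested schemas survive as written
        else:
            out.append(
                {
                    "name": tool.name,
                    "description": tool.description,
                    "parameters": {
                        "type": "object",
                        "properties": {k: {"type": t} for k, t in tool.properties.items()},
                    },
                }
            )
    return out


raw_tool = {
    "name": "submit_response",
    "parameters": {
        "type": "object",
        "properties": {"items": {"type": "array", "items": {"type": "object"}}},
    },
}
serialized = serialize_tools([FlatTool("add", "Add numbers", {"a": "number"}), raw_tool])
```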

Labels

test:uipath-integrations test:uipath-langchain Triggers tests in the uipath-langchain-python repository


3 participants