feat: extract evaluation framework into uipath-eval package #1710
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -224,6 +224,76 @@ jobs: | |
| UIPATH_FOLDER_KEY: ${{ secrets.UIPATH_MEMORY_FOLDER }} | ||
| run: uv run pytest tests/services/test_memory_service_e2e.py -m e2e -v --no-cov | ||
|
|
||
| test-uipath-eval: | ||
|
||
| name: Test (uipath-eval, ${{ matrix.python-version }}, ${{ matrix.os }}) | ||
| needs: detect-changed-packages | ||
| runs-on: ${{ matrix.os }} | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| python-version: ["3.11", "3.12", "3.13"] | ||
| os: [ubuntu-latest, windows-latest] | ||
| steps: | ||
| - name: Check if package changed | ||
| id: check | ||
| shell: bash | ||
| run: | | ||
| if echo '${{ needs.detect-changed-packages.outputs.packages }}' | jq -e 'index("uipath-eval")' > /dev/null; then | ||
| echo "skip=false" >> $GITHUB_OUTPUT | ||
| else | ||
| echo "skip=true" >> $GITHUB_OUTPUT | ||
| fi | ||
|
|
||
| - name: Skip | ||
| if: steps.check.outputs.skip == 'true' | ||
| shell: bash | ||
| run: echo "Skipping - no changes to uipath-eval" | ||
|
|
||
| - name: Checkout | ||
| if: steps.check.outputs.skip != 'true' | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Setup uv | ||
| if: steps.check.outputs.skip != 'true' | ||
| uses: astral-sh/setup-uv@v5 | ||
|
|
||
| - name: Setup Python | ||
| if: steps.check.outputs.skip != 'true' | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: ${{ matrix.python-version }} | ||
|
|
||
| - name: Install dependencies | ||
| if: steps.check.outputs.skip != 'true' | ||
| working-directory: packages/uipath-eval | ||
| run: uv sync --all-extras --python ${{ matrix.python-version }} | ||
|
|
||
| - name: Run tests | ||
| if: steps.check.outputs.skip != 'true' && !(matrix.os == 'ubuntu-latest' && matrix.python-version == '3.13') | ||
| working-directory: packages/uipath-eval | ||
| run: uv run pytest | ||
|
|
||
| - name: Run tests with coverage | ||
| if: steps.check.outputs.skip != 'true' && matrix.os == 'ubuntu-latest' && matrix.python-version == '3.13' | ||
| working-directory: packages/uipath-eval | ||
| run: uv run pytest --cov-report=xml --cov-report=html --tb=short | ||
|
|
||
| - name: Upload coverage HTML report | ||
| if: steps.check.outputs.skip != 'true' && matrix.os == 'ubuntu-latest' && matrix.python-version == '3.13' && always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: coverage-html-uipath-eval | ||
| path: packages/uipath-eval/htmlcov/ | ||
| retention-days: 30 | ||
|
|
||
| - name: Upload coverage XML report | ||
| if: steps.check.outputs.skip != 'true' && matrix.os == 'ubuntu-latest' && matrix.python-version == '3.13' && always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: coverage-xml-uipath-eval | ||
| path: packages/uipath-eval/coverage.xml | ||
| retention-days: 30 | ||
|
|
||
| test-uipath: | ||
| name: Test (uipath, ${{ matrix.python-version }}, ${{ matrix.os }}) | ||
| needs: detect-changed-packages | ||
|
|
||
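The gating logic of the `test-uipath-eval` job above can be sketched in Python. This is only an illustrative model, not part of the PR: it assumes the `packages` output of `detect-changed-packages` is a JSON array of package names (which the `jq -e 'index("uipath-eval")'` check suggests), and the function names `should_skip`, `pytest_command`, and `plan` are hypothetical.

```python
import json

# Matrix values copied from the workflow diff above.
OSES = ["ubuntu-latest", "windows-latest"]
PYTHON_VERSIONS = ["3.11", "3.12", "3.13"]

# The single matrix cell that runs pytest with coverage reports.
COVERAGE_CELL = ("ubuntu-latest", "3.13")


def should_skip(changed_packages_json: str, package: str = "uipath-eval") -> bool:
    """Skip unless the package appears in the changed-packages JSON array.

    Assumption: the workflow output is a JSON array of package names.
    """
    return package not in json.loads(changed_packages_json)


def pytest_command(os_name: str, python_version: str) -> str:
    """Return the test command a given matrix cell would run."""
    if (os_name, python_version) == COVERAGE_CELL:
        return "uv run pytest --cov-report=xml --cov-report=html --tb=short"
    return "uv run pytest"


def plan(changed_packages_json: str) -> dict:
    """Map each matrix cell to its test command, or {} when the job is skipped."""
    if should_skip(changed_packages_json):
        return {}
    return {
        (os_name, version): pytest_command(os_name, version)
        for os_name in OSES
        for version in PYTHON_VERSIONS
    }
```

Under these assumptions, a change set containing `uipath-eval` yields six matrix cells, of which only the `ubuntu-latest` / `3.13` cell produces coverage output; any other change set yields no test runs.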
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1 @@ | ||
| 3.11 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code (claude.ai/code) when working with the `uipath-eval` package. | ||
|
|
||
| ## Package Purpose | ||
|
|
||
| The evaluation framework for UiPath agents, providing the `uipath.eval` namespace package. Extracted from the main `uipath` SDK so consumers (e.g. the python eval worker in the agents backend) can depend on evaluators without the CLI and the rest of the SDK. Depends on `uipath-core`, `uipath-platform`, and `uipath-runtime`. | ||
|
|
||
| ## Development Commands | ||
|
|
||
| ```bash | ||
| cd packages/uipath-eval | ||
|
|
||
| uv sync --all-extras # Install dependencies | ||
| pytest # Run all tests | ||
| pytest tests/eval/mocks/ # Run a test subdirectory | ||
| ruff check . # Lint | ||
| ruff format --check . # Format check | ||
| mypy src # Type check | ||
| ``` | ||
|
|
||
| No justfile exists for this package — run commands directly. | ||
|
|
||
| ## Module Layout (`src/uipath/eval/`) | ||
|
|
||
| | Module | Purpose | | ||
| |--------|---------| | ||
| | `evaluators/` | Deterministic evaluators (ExactMatch, Contains, JsonSimilarity, classification, tool-call order/args/count/output), LLM-judge evaluators, legacy evaluators, evaluator factory + registration | | ||
| | `models/` | `EvaluationSet`, `EvaluationItem`, `EvaluationResult`, `AgentExecution`, score types | | ||
| | `mocks/` | `@mockable` decorator, LLM tool/input mocking, mockito integration, response caching, `UiPathMockRuntime` | | ||
| | `runtime/` | `UiPathEvalRuntime`, `UiPathEvalContext`, `evaluate()` entry point, eval events, parallelization, exporters | | ||
| | `helpers.py` | `EvalHelpers` (eval set loading/migration), `get_agent_model()` | | ||
|
|
||
| ## Constraints | ||
|
|
||
| - This is a **namespace package**: `src/uipath/` has no `__init__.py`; only `src/uipath/eval/` does. Import paths are unchanged from when the code lived in the main SDK (`from uipath.eval...`). | ||
| - Do not import from the main `uipath` package internals (`uipath._cli`, `uipath._utils`, `uipath.agent`, ...) — only `uipath.core`, `uipath.platform`, and `uipath.runtime` are available here. The main `uipath` package depends on this one, not vice versa. | ||
| - Structured output across model providers must use function calling, not `response_format` (Claude returns prose, Gemini returns empty content for `response_format` on the normalized gateway — see `mocks/_structured_output.py`). | ||
| - The CLI-facing progress reporters (`_progress_reporter.py`, `_console_progress_reporter.py`) intentionally stay in `packages/uipath/src/uipath/_cli/_evals/` — they are CLI infrastructure, not part of this package. |
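The import constraint listed above (only `uipath.core`, `uipath.platform`, and `uipath.runtime` are available) could be checked mechanically. The following is a minimal sketch using the standard-library `ast` module; it is not part of the PR. The helper name `forbidden_imports` is hypothetical, and allowing `uipath.eval` (the package's own namespace) is an assumption.

```python
import ast

# Namespaces the package may import from. The first three come from the
# constraint above; "uipath.eval" (the package itself) is an assumption.
ALLOWED_PREFIXES = ("uipath.core", "uipath.platform", "uipath.runtime", "uipath.eval")


def _is_allowed(name: str) -> bool:
    return any(name == p or name.startswith(p + ".") for p in ALLOWED_PREFIXES)


def forbidden_imports(source: str) -> list[str]:
    """Return sorted dotted names under `uipath.` that fall outside ALLOWED_PREFIXES."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Expand `from pkg import x` to `pkg.x` so `from uipath import eval` passes.
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # not an import, or a relative import
        for name in names:
            if name.startswith("uipath.") and not _is_allowed(name):
                found.append(name)
    return sorted(found)
```

A check like this could run over every file in `src/uipath/eval/` in a test or lint step; relative imports and non-`uipath` imports are ignored by design.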
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| # uipath-eval | ||
|
|
||
| Evaluation framework for UiPath agents, extracted from the main `uipath` SDK so | ||
| it can be consumed standalone (for example by the python eval worker in the | ||
| agents backend) without pulling in the CLI and the rest of the SDK. | ||
|
|
||
| Provides the `uipath.eval` namespace package: | ||
|
|
||
| - **`uipath.eval.evaluators`** — deterministic evaluators (ExactMatch, Contains, | ||
| JsonSimilarity, classification, tool-call order/args/count/output) and | ||
| LLM-based evaluators (LLM-judge output/trajectory), plus their legacy | ||
| counterparts and the evaluator factory/registration system. | ||
| - **`uipath.eval.models`** — evaluation sets, evaluation results, score types, | ||
| agent execution models. | ||
| - **`uipath.eval.mocks`** — the `@mockable` decorator, LLM tool/input mocking, | ||
| mockito integration, and response caching used by simulation runs. | ||
| - **`uipath.eval.runtime`** — `UiPathEvalRuntime`, `UiPathEvalContext`, the | ||
| `evaluate()` entry point, eval events, parallelization, and exporters. | ||
|
|
||
| ## Installation | ||
|
|
||
| ```bash | ||
| uv pip install uipath-eval | ||
| ``` | ||
|
|
||
| Import paths are unchanged from when this code lived in the `uipath` package: | ||
|
|
||
| ```python | ||
| from uipath.eval.evaluators import ExactMatchEvaluator | ||
| from uipath.eval.models.evaluation_set import EvaluationSet | ||
| from uipath.eval.runtime import UiPathEvalContext, evaluate | ||
| ``` | ||
|
|
||
| ## Development | ||
|
|
||
| ```bash | ||
| cd packages/uipath-eval | ||
|
|
||
| uv sync --all-extras # Install dependencies | ||
| pytest # Run all tests | ||
| ruff check . # Lint | ||
| ruff format --check . # Format check | ||
| mypy src # Type check | ||
| ``` |
When a release publishes a new `uipath-core` and `uipath-eval` version without also publishing `uipath-platform` or `uipath`, `wait-for-uipath-core` is skipped because its condition only mentions platform/uipath, and this new eval build only waits on `wait-for-uipath-platform` (which just skips if platform is not being published). In that scenario `needs-relock: true` runs `uv lock --no-sources` for eval before the new core version is visible on PyPI, causing intermittent release failures or locking against the previous core if the lower bound was not updated.
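The scenario in this comment can be modeled as a small truth table. The sketch below is a simplified model of the job conditions as the comment describes them, not the actual release workflow; all function names are hypothetical, and the "adjusted" condition is only one possible change, shown for comparison.

```python
def wait_for_core_runs(publishing: set[str]) -> bool:
    """Condition as described in the comment: only platform/uipath trigger the wait."""
    return bool(publishing & {"uipath-platform", "uipath"})


def eval_relock_is_racy(publishing: set[str]) -> bool:
    """True when eval is relocked against a core release nobody waited for."""
    publishes_core_and_eval = {"uipath-core", "uipath-eval"} <= publishing
    return publishes_core_and_eval and not wait_for_core_runs(publishing)


def wait_for_core_runs_adjusted(publishing: set[str]) -> bool:
    """Hypothetical adjustment: also wait for core when eval is being published."""
    return bool(publishing & {"uipath-platform", "uipath", "uipath-eval"})
```

In this model, publishing only `uipath-core` and `uipath-eval` is the one combination where the relock races PyPI. The adjusted condition would only remove the race if the eval build also depended on the core wait job.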