fix(eval): emit string justification for coded evaluators #1709
Draft
ameyjain wants to merge 1 commit into
Coded evaluators (tool-call count/args/output/order, json-similarity)
stored their explanation under per-evaluator keys (explained_tool_calls_*,
lcs, matched_leaves) with no 'justification' key, so the eval worker's
d.get('justification') always resolved to null while the structured detail
was still present.
Add a computed 'justification' string field to each coded justification
model, derived from its existing structured detail via a shared
format_explained_tool_calls helper. model_dump() now emits a string
justification for every evaluator, matching LLMJudgeJustification, without
changing the structured fields or the worker.
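The lookup failure described in this commit message can be reproduced with plain dictionaries. This is only a sketch: the key names come from the message above, while the payload values and the `read_justification` function are hypothetical stand-ins for the worker's `d.get('justification')` call.

```python
from typing import Any, Dict, Optional

# Hypothetical payloads: key names are taken from the commit message,
# the values are invented for illustration.
coded_details: Dict[str, Any] = {
    "explained_tool_calls_count": {"search": "expected 2 calls, got 1"},
}
llm_details: Dict[str, Any] = {
    "justification": "The answer cites the correct source.",
}


def read_justification(details: Dict[str, Any]) -> Optional[str]:
    # Stand-in for the worker's d.get('justification') lookup.
    return details.get("justification")


coded_result = read_justification(coded_details)  # key is absent
llm_result = read_justification(llm_details)      # key is present
```

The structured detail is still in `coded_details`, but because nothing is stored under `justification`, the lookup falls back to `None` (serialized as null).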


Problem
Coded evaluators (tool-call count/args/order/output, json-similarity) returned a null `justification` in their score output, while the LLM-judge evaluators returned a populated one.

Root cause: each coded evaluator stored its explanation under a per-evaluator key (`explained_tool_calls_count`, `explained_tool_calls_args`, `explained_tool_calls_outputs`, `lcs`, `matched_leaves`/`total_leaves`) and never under a `justification` key. The downstream eval worker reads `details["justification"]`, which therefore always resolved to `null`, even though the structured detail was present.

Fix
Add a computed `justification: str` field to each of the five coded justification models, derived from the model's existing structured detail via a shared `format_explained_tool_calls` helper. `model_dump()` now emits a string `justification` for every coded evaluator, matching `LLMJudgeJustification`, without changing any of the structured fields.

`justification` now derived from:
- `explained_tool_calls_count`
- `explained_tool_calls_args`
- `explained_tool_calls_outputs`
- `lcs`
- `matched_leaves`/`total_leaves`

The base `BaseEvaluatorJustification` is intentionally left untouched: adding a computed `justification` there collides with `LLMJudgeJustification`'s real `justification` field (pydantic raises `TypeError`). So the computed field lives on each coded subclass instead.

Verification
- `pytest tests/evaluators tests/cli/eval`: all pass
- `mypy`: clean
- `ruff check` + `ruff format`: clean

Related
Pairs with a `python-eval-worker` change that stops flattening these structured justifications, so the full object (including this `justification` string) reaches the client as `justificationObject` for per-evaluator-type rendering.
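
For reviewers unfamiliar with pydantic computed fields, the shape of the fix described under Fix can be sketched roughly as follows. This is not the code in the diff: `ToolCallCountJustification`, the `dict[str, str]` field type, and the body of `format_explained_tool_calls` are assumptions made for illustration; only the names `BaseEvaluatorJustification`, `LLMJudgeJustification`, `explained_tool_calls_count`, and `format_explained_tool_calls` come from this PR.

```python
from pydantic import BaseModel, computed_field


def format_explained_tool_calls(explained: dict[str, str]) -> str:
    """Hypothetical formatter: one 'tool: explanation' line per entry."""
    return "\n".join(f"{tool}: {reason}" for tool, reason in explained.items())


class BaseEvaluatorJustification(BaseModel):
    """Base model, deliberately left without a justification field."""


class LLMJudgeJustification(BaseEvaluatorJustification):
    # A real (stored) field; the str type is an assumption.
    justification: str


class ToolCallCountJustification(BaseEvaluatorJustification):
    # Assumed shape of the structured detail; the real type may differ.
    explained_tool_calls_count: dict[str, str]

    @computed_field  # type: ignore[prop-decorator]
    @property
    def justification(self) -> str:
        # Derived on access from the structured detail, not stored.
        return format_explained_tool_calls(self.explained_tool_calls_count)


dumped = ToolCallCountJustification(
    explained_tool_calls_count={"search": "expected 2 calls, got 1"}
).model_dump()
```

In this sketch `dumped` keeps `explained_tool_calls_count` unchanged and gains a string `justification` key, so a consumer that only reads `justification` sees the same key for coded and LLM-judge models.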