Agentic Inference updates by hvagadia · Pull Request #314 · mlcommons/endpoints

hvagadia · 2026-05-18T19:41:06Z

What does this PR do?

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-05-18T19:41:17Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces comprehensive support for multi-turn conversation benchmarking, including a new MultiTurnDataset class, a MultiTurnStrategy for turn sequencing, and a ConversationManager for state tracking. It also provides extensive documentation, validation schemas, and utility scripts for dataset conversion and analysis. Feedback focuses on optimizing performance and memory efficiency within the MultiTurnDataset class, specifically recommending the use of vectorized pandas operations to avoid memory-intensive dictionary conversions and suggesting an optimization to reduce the algorithmic complexity of building message histories from quadratic to linear.

nv-alicheng

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 11 issues across 7 files (3 couldn't be posted inline — see summary comment).

nv-alicheng · 2026-05-19T05:09:39Z

Review Council — Multi-AI Code Review

Reviewed by: Codex + Claude | Depth: thorough

Found 11 issues across 7 files.

⚠️ 3 issues below couldn't be posted as inline comments (lines not in the diff hunk) — they are included here with file:line references.

🔴 Must Fix (critical/high)

Issues that will cause hangs, data loss, or incorrect behavior in production.

#	File	Line	Category	Reviewer(s)	Summary
1	`src/inference_endpoint/load_generator/multi_turn_strategy.py`	326	bug	Codex	⚠️ inline not possible — Timeout path decrements `inflight` but never calls `_drain_event.set()`. If the last in-flight turn times out, `_drain_inflight` in `session.py` waits forever — the benchmark hangs indefinitely. Fix: mirror session.py:540 (`if self._phase_issuer.inflight <= 0: self._drain_event.set()`) in the timeout handler.
2	`src/inference_endpoint/load_generator/multi_turn_strategy.py`	232	bug	Codex+Claude	Cursor advanced to `cursor + 1` before the delayed turn fires. If a timeout aborts the conversation, `_abort_remaining_turns` reads the already-advanced cursor and skips the cancelled delayed turn — `mark_turn_failed` and `_publish_synthetic_failure` are never called for it, silently under-counting failures.
3	`src/inference_endpoint/load_generator/session.py`	537	bug	Codex	⚠️ inline not possible — `completed_uuids` set accumulates every completed + timed-out query ID for the full phase and is never cleared. At 50 k QPS / 600 s this can hold millions of UUID strings and consume several hundred MB before the phase ends. The set's purpose (guard against late double-responses) only requires remembering IDs that were synthetically completed after a timeout — not every successful query.
4	`examples/09_MultiTurn/accuracy/score_inline_accuracy.py`	328	data-integrity	Claude	`float("nan")` serialized as bare `NaN` token by `json.dumps` without `allow_nan=False`. `NaN` is not valid JSON per RFC 8259; `jq`, Go's `encoding/json`, and most strict parsers reject the output file. Replace with `None` → `null`.

🟡 Should Fix (medium)

Real issues that produce incorrect results under reachable conditions.

#	File	Line	Category	Reviewer(s)	Summary
5	`src/inference_endpoint/core/types.py`	224	bug	Codex	⚠️ inline not possible — `as_message_parts_after_first_chunk` returns the full `self.tool_calls` for TPOT tokenization, but for streaming tool-call responses some of those tokens arrived in the first chunk (before TTFT). TPOT denominator is systematically inflated for all tool-calling runs.
6	`examples/09_MultiTurn/accuracy/score_inline_accuracy.py`	132	bug	Claude	`_PATH_LEAF.search(tokens[i]).group(0)` raises `AttributeError` when `tokens[i]` is `"/"` (bare slash). `re.compile(r"[^/]+$")` returns `None` for that input; unguarded `.group(0)` aborts the entire scoring run.
7	`examples/09_MultiTurn/accuracy/score_inline_accuracy.py`	340	data-integrity	Claude	`_build_index_to_key` has no guard against a mismatch between the gt JSONL and the benchmark run dataset. A difference in turn count/order silently maps model outputs to wrong turns, producing incorrect scores with no warning.

🔵 Consider (low)

Valid improvements, suitable as follow-ups.

#	File	Line	Category	Reviewer(s)	Summary
8	`tests/integration/test_multi_turn.py`	939	error-handling	Claude	`except Exception` swallows parse failures with `logger = None` — no log, no comment. AGENTS.md requires every `except` to explain or log.
9	`src/inference_endpoint/openai/types.py`	132	design	Claude	`gc=False` AT-RISK comment (line 112) lists `messages`, `tools`, `logit_bias` but omits the newly added `chat_template_kwargs: dict[str, Any] \| None`.
10	`src/inference_endpoint/dataset_manager/multi_turn_dataset.py`	413	design	Claude	Salt silently not applied to conversations without a system prompt; no warning logged when `enable_salt=True` is partially defeated.
11	`examples/09_MultiTurn/accuracy/score_inline_accuracy.py`	343	performance	Claude	`gt_jsonl.read_text().splitlines()` loads the entire file into memory. Use line-by-line streaming (same pattern as `_iter_assistant_turns` line 268).

🤖 Generated by /review-council (Codex + Claude, thorough depth).

viraatc · 2026-05-19T23:23:48Z

+  # Mandatory: with the default warmup behaviour, every request fails with
+  # ConnectionResetError because uvicorn closes pre-warmed idle sockets after 5s.
+  client:
+    warmup_connections: 0


whats the http error code ur seeing on this?

dont remember seeing this in the past with sglang.
(i expect tcp-keepalive is disabled (default))

also max-idle-time should is likely not making a difference here, lets use default if we can.

Let me get back on the error. I got errors while back, need to test on latest code.

arekay-nv

Review Council — Multi-AI Code Review

Claude-only review (Codex unavailable — bwrap/pivot_root not permitted in this environment). Depth: thorough.

arekay-nv

Review Council — Multi-AI Code Review

Claude-only review (Codex unavailable — bwrap/pivot_root not permitted in this environment). Depth: thorough.

arekay-nv · 2026-05-19T23:40:29Z

                        "name",
                        "tool_calls",
                        "tool_results",
+                        "reasoning_content",


[Claude] medium (api-contract): reasoning_content is unconditionally forwarded in prior-assistant-turn history messages. This is a Kimi/SGLang-specific extension, not in the OpenAI Chat Completions spec for request message objects. Servers that validate input strictly (standard vLLM, TRT-LLM serve) will reject with 400/422 when prior assistant turns carry this field. Since ChatMessage.reasoning_content defaults to None and uses omit_defaults=True, this only triggers for datasets recorded from Kimi runs — but there is no api_type gate, config option to strip it, or documentation warning.

This is intentional for Kimi replay. The dataset has reasoning_content because SGLang/TRT-LLM with the kimi_k2 reasoning parser returns thinking separately from final content.

For replay, Kimi needs prior thinking in history; with chat_template_kwargs.preserve_thinking: true, the chat template renders reasoning_content back into the prompt. We’ve verified this path with SGLang and TRT-LLM.

arekay-nv · 2026-05-19T23:40:56Z

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — bwrap/pivot_root not permitted in this environment) | Depth: thorough | PR: #314 "Agentic Inference updates"

Found 5 new issues posted as inline comments (3 previously covered by earlier review pass were deduplicated). Commit hygiene: 14 commits, 6 chore/fix — consider squashing before merge.

🔴 Must Fix (critical/high)

#	File	Line	Category	Summary
1	`factory.py`	121	bug	`assert isinstance(...)` → must be `InputValidationError`; silently skipped under `python -O`
2	`multi_turn_strategy.py`	247	concurrency	`execute()` early-exit + `_fill_slot` completion check both missing `not self._delay_handles`; first-turn-delay scenario can hang or terminate early

🟡 Should Fix (medium)

#	File	Line	Category	Summary
3	`multi_turn_dataset.py`	438	api-contract	`reasoning_content` unconditionally forwarded in history messages; breaks non-Kimi endpoints (vLLM, TRT-LLM) with 400/422
4	`score_inline_accuracy.py`	461	data-integrity	`client_turn + 1` implicitly couples to `_validate_turn_numbering` invariant; silent misscoring if invariant not enforced upstream

🔵 Consider (low)

#	File	Line	Category	Summary
5	`score_inline_accuracy.py`	507	error-handling	`--model` path has no existence check; raises raw `FileNotFoundError` traceback vs `--report-dir` path's clean error

⚠️ Commit hygiene: This PR has 14 commits including 6 apparent chore/fix commits. Consider squashing before merge.

arekay-nv

Can you refactor to reuse the existing accuracy infrastructure. This will enable us to reuse the same setup across models/datasets.

viraatc

Review Council — Multi-AI Code Review

Reviewed by: Claude | Depth: thorough

Codex review failed (CLI/cloud-requirements compatibility issue after recent CLI upgrade, then internal bwrap hangs on Ubuntu 24 AppArmor restriction). Falling back to Claude-only.

7 issues posted inline after deduping against the 44 existing comments and re-verifying each finding against current HEAD. Critical Pandas dict-assignment bug verified by repro. Schema gap on delay_seconds verified empirically.

viraatc · 2026-05-20T21:38:36Z

Review Council — Summary

Reviewed by: Claude (Codex unavailable: post-upgrade cloud-requirements compatibility + Ubuntu 24 AppArmor bwrap lockup) | Depth: thorough

Found 7 issues across 4 files. All findings re-verified against HEAD (56db1b21) and deduped against the 48 existing inline comments.

🔴 Must Fix (critical/high)

File	Line	Category	Summary
`openai/openai_msgspec_adapter.py`	83	bug	`chat_template_kwargs` dict silently becomes NaN through `AddStaticColumns` (Pandas dict-assignment); YAML setting never reaches the server. Repro'd.

🟡 Should Fix (medium)

File	Line	Category	Summary
`load_generator/session.py`	232	bug	`conversation_id` flows raw into `X-Session-ID` header; no CR/LF validation.
`scripts/multi_turn_dataset_schema.json`	427	data-integrity	`delay_seconds` only documented, not in any schema definition; validator accepts `"not-a-number"` and `-5`. Verified.
`accuracy/score_inline_accuracy.py`	195	bug	`_BARE_INTENT_RE` missing `re.IGNORECASE` (sibling `_INTENT_RE` has it).
`accuracy/score_inline_accuracy.py`	298	data-integrity	`gt_by_key`/`model_by_key` silently overwrite duplicate `(conv,turn)` keys.
`accuracy/score_inline_accuracy.py`	328	design	`pass_rate` excludes `missing_in_model` from denominator → high-drop backends score artificially well.

🔵 Consider (low)

File	Line	Category	Summary
`accuracy/score_inline_accuracy.py`	128	bug	`_WRAPPERS` case-sensitive but leaf is `.lower()`-ed → `"SUDO ls"` silently misses `ls`.

Inline thread: #314 (review)

hvagadia · 2026-05-25T05:01:32Z

I have removed the accuracy script for now. It is still under discussion in TF, so doesnt make sense to integrate it yet. I will put up a separate draft PR for folks to experiment.

Several OpenAI-compatible servers (notably SGLang and vLLM with their reasoning/tool parsers) drift from the strict OpenAI spec in ways the existing decoders silently swallowed: * SSEDelta.reasoning was the wrong field name. SGLang/vLLM emit `reasoning_content` on streaming deltas (matching the non-streaming ChatCompletionResponseMessage). The previous code never matched, so every reasoning chunk on a thinking-mode response was dropped on the floor with no error. * SSEDelta fields default to None (servers send `null`, not `""`). Previous defaults of `""` looked truthy when the server actually had no payload for that channel in a given chunk. * ChatCompletionResponseMessage / ChatCompletionChoice / ChatCompletionResponse: many fields (`content`, `refusal`, `finish_reason`, `usage`, `system_fingerprint`) are not always emitted. Without defaults, msgspec rejected non-streaming responses from these servers entirely. * decode_sse_message now catches msgspec.ValidationError, logs the raw chunk preview, and returns None so the stream keeps draining. The prior behaviour bubbled into the worker's outer Exception handler and lost the diagnostic. TextModelOutput now carries `tool_calls` and `finish_reason` as first-class fields (in addition to the existing metadata copy) so multi-turn replay can build history without reaching into metadata. A diagnostic `chunk_stats` dict counts content / reasoning / tool-call chunks per response for replay-determinism debugging. `array_like=False` on TextModelOutput is required to safely add fields without breaking the positional wire layout. PromptData and ErrorData flip too for consistency.

Two correctness changes for multi-turn replay against thinking-mode and tool-using models: 1. reasoning_content propagation. Prior assistant turns must replay their thinking trace as part of the message history; without it, the chat-template-rendered prompt diverges from what the original capture sent and outputs differ from the captured trajectory even at temperature=0. 2. tools propagation across all turns. Every request must carry the same tools array as turn 1; SGLang's tool-call parser is gated on a non-empty `tools` field in the request, and turns that omit it silently bypass the parser, leaking literal tool-call markup into the assistant `content` channel. New optional cache-bursting salt (multi_turn.enable_salt: bool, default False) appends a per-conversation blake2b digest to the end of each trajectory's system prompt. This keeps within-trajectory prefix caching intact while preventing cross-trajectory KV-cache leak during replay of datasets that share a long system prompt across many trajectories. The salt is computed once per conversation_id and reused on every turn of that conversation. apply_salt is idempotent on already-salted prompts so re-runs are stable. See examples/09_MultiTurn/docs for the full methodology and validation; not included in this commit. Mechanical change in dataset.py: load_from_file gains a **dataset_kwargs passthrough so the factory can forward enable_salt without expanding the signature for every future option.

The customer_support example and its two near-duplicate config files predated the agentic datasets and no longer earn their keep: - multi_turn_benchmark.yaml and multi_turn_with_concurrency.yaml differ only in `name`, two inline comments, and `report_dir`. Both already set target_concurrency: 32 — the README's Concurrency Control section documents the same knob inline. Keeping a second YAML to teach a setting that's already in the first is noise. - customer_support_conversations.jsonl was the toy dataset backing those YAMLs. With them gone it has no consumer. The remaining agentic YAMLs are tuned to actually work as written: - model_params.name: "/model" matches SGLang's --model-path mount - temperature: 0 (greedy) per MLPerf-inference reproducibility convention - max_new_tokens sized to the longest observed turn in each capture - target_concurrency reduced from theoretical maxima to values that match real B200-class server capacity on a 1T MoE - client.warmup_connections: 0 / max_idle_time: 0.5 work around uvicorn closing pre-warmed idle sockets after 5s, which otherwise causes every first request to fail with ConnectionResetError README updated: Basic Configuration example uses agentic_coding / agentic_coding_flat.jsonl (the dataset that's actually in scope post- deletion); Using Configuration File invokes agentic_coding_benchmark.yaml; Example Datasets section dropped (the only entry was customer_support).

…alysis utility - score_inline_accuracy.py: single-script scorer for multi-turn benchmark runs. Coding turns score by multiset IoU on a curated whitelist of ~40 canonical bash exes; workflow turns score by exact-match on `intent: IXXX`. Folds in events.jsonl -> model_assistants.jsonl conversion so a benchmark report dir can be scored in one call. - analyze_flat_jsonl.py: produces a single composite summary plot (turns/conv, ISL/OSL distributions, per-turn growth, token-class violins) for any flat multi-turn JSONL.

…ance - MultiTurnConfig: add `enable_salt: bool = False` knob. - MultiTurnDataset: when enabled, append `\n\n[cache_salt: <hex>]` to the system message once per trajectory, where hex is `blake2b(conversation_id, digest_size=8).hexdigest()`. Same salt is reused across all turns of one trajectory so within-trajectory prefix caching is preserved; differs across trajectories so the cross-trajectory cache match terminates at the salt boundary. - MultiTurnDataset: drop rows with no `conversation_id` after load (e.g. the `_type: dataset_metadata` license/source sentinel some upstream snapshots prepend). Methodology + full-dataset salt-vs-no-salt accuracy comparison in examples/09_MultiTurn/docs/EVALUATION.md.

- README: new "Accuracy Evaluation" subsection under "Running Multi-Turn Benchmarks" documenting the score_inline_accuracy.py command and what it writes into report_dir. Completes the convert -> replay -> score pipeline. - score_inline_accuracy.py: ruff-format/lint pass (formatting only, no behavior change).

- Add ModelParams.chat_template_kwargs (forwarded per request to vLLM/SGLang). Enables Kimi-K2.6 Thinking-mode by sending {'thinking': True, 'preserve_thinking': True} on every request so prior assistant reasoning is rendered back into the prompt as <think>...</think> blocks during multi-turn replays. - Wire field through ChatCompletionRequest msgspec struct and openai_msgspec_adapter so it lands on the wire as a top-level request key. - Score workflow accuracy from the structured intent_codes set stamped on each scorable assistant row (no regex fallback): a turn is a hit iff the model's extracted I### code is a member of the per-turn ground-truth set. Tool-only assistant turns remain unscorable by design. - inject_tool_delay knob on multi_turn dataset config: when set, the strategy defers the next turn issue by the dataset's embedded delay_seconds via loop.call_later, modelling the producer-side tool/human pause that the original capture saw. - Update agentic_{coding,workflow}_benchmark.yaml to point at the t1 datasets, T=1.0/top_p=0.95 sampling, salt enabled, and chat_template_kwargs preset for Kimi-K2.6. - Update README + score_inline_accuracy.py + JSONL schema to match the new _t1.jsonl naming and intent_codes/delay_seconds fields. - Regenerate full templates to expose the new schema field. Tests cover end-to-end chat_template_kwargs propagation, the intent_codes scoring rule, and the inject_tool_delay scheduling path.

hvagadia requested a review from a team May 18, 2026 19:41

hvagadia marked this pull request as draft May 18, 2026 19:41

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py

Comment thread src/inference_endpoint/dataset_manager/multi_turn_dataset.py