Skip to content

[Feature] Add tool call accuracy tests for large models to nightly CI #17933

@harvenstar

Description

@harvenstar

Motivation

Tool call tests today only cover a small model (Llama-3.2-1B in test/registered/openai_server/function_call/test_openai_function_calling.py). Large models like DeepSeek V3.2 have no tool call coverage in CI, so bugs like #17593 and #17551 only get caught when users hit them. This issue tracks adding tool call tests to the nightly 8-GPU suite.

Scenarios

These should be common across all models that support tool calling:

Basic

  1. Format check — tool_calls is a non-empty list, function.name / function.arguments present, arguments is valid JSON, finish_reason is "tool_calls"
  2. Field placement — tool call goes in tool_calls, not content ([Bug] DeepSeek-V3.2 tool calls incorrectly output to content field instead of tool_calls field #17593)
  3. Streaming — chunks concatenate to valid JSON, finish_reason correct

tool_choice

  1. "required" — always returns tool call
  2. "none" — never returns tool call
  3. Specific function — returns the specified one

Multi-turn

  1. Tool result follow-up — pass tool result back, model replies based on it
  2. Thinking + tool call — after tool result, output in content not reasoning_content ([Bug] DeepSeek V3.2: All output marked as reasoning_content after tool call result #17551, DeepSeek specific for now — might be model-internal, will write the test first and see)

Other

  1. Parallel tool calls — multiple tool calls in one request
  2. Strict mode — strict: true enforces schema

CI integration

Add to test/registered/8-gpu-models/test_deepseek_v32.py, two variants:

Non-MTP:

--tp=8 --dp=8 --enable-dp-attention
--tool-call-parser deepseekv32
--reasoning-parser deepseek-v3

MTP (speculative decoding):

same as above +
--speculative-algorithm=EAGLE
--speculative-num-steps=3
--speculative-eagle-topk=1
--speculative-num-draft-tokens=4
env: SGLANG_ENABLE_SPEC_V2=1

Both in nightly-8-gpu-common.

Plan

Start with DeepSeek V3.2, then extend to GLM / Qwen / others reusing the same scenarios.

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions