feat: enforce stop strings on the OpenXLA serve path (#449 M3 Stage 2d) #478

Merged: inureyes merged 1 commit into main from feature/449-stage2d-xla-stop-strings on Jun 29, 2026
Summary

Stage 2d follow-up of the OpenXLA/IREE milestone (#449 M3): the serve path (Stage 2c plus 2d-sampling) terminated only on EOS or max_tokens and warned that stop strings were ignored. This enforces them streaming-safely, so the OpenXLA backend honors the OpenAI stop parameter the same way it already honors sampling.

What's here

  • stop_matcher.rs (new): a pure StopMatcher that turns the post-hoc truncation of apply_stop_sequences into an incremental, streaming-safe form. It buffers decoded text, withholds any suffix that could be the start of a stop string (so a stop string split across tokens, e.g. "STOP" arriving as "ST" then "OP", is still caught), and ends the request at the earliest full match, excluding the stop string and everything after it. It has no device state, so it is always compiled and unit-tested; the xla-iree worker is its only consumer today, so a scoped allow(dead_code) covers the feature-off build.
  • StreamingDecodeState::finish_truncated (xla-iree only): truncates the non-streaming result.text to the bytes already streamed (the emitted text is always a prefix of the decode buffer), keeping streaming and non-streaming consistent. The MLX path's finish is untouched.
  • XlaServeWorker: builds a per-request matcher from options.stop_sequences, routes each decoded piece through it, and on a match frees the engine slot (cancel) and sends a terminal Done with finish reason stop. On a natural end (EOS or length) it flushes the held tail first, since that tail never completed a stop string and is therefore real output. Requests with no stop strings take an is_active() fast path and stream byte-for-byte as before.
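The buffering rule described above can be sketched as follows. This is a minimal illustration, not the code in stop_matcher.rs: the `Step` enum and the `push`, `finish`, and `longest_partial` names are assumptions made for the example, and only `StopMatcher` and `is_active()` are names taken from this PR.

```rust
/// Outcome of feeding one decoded piece to the matcher (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Step {
    /// Text that is safe to stream now; generation continues.
    Emit(String),
    /// A stop string completed; this is the final text before the match.
    Stop(String),
}

/// Illustrative incremental matcher; the real StopMatcher may differ.
pub struct StopMatcher {
    stops: Vec<String>,
    held: String, // received text that has not been released yet
    done: bool,
}

impl StopMatcher {
    pub fn new(stops: &[&str]) -> Self {
        let stops = stops
            .iter()
            .filter(|s| !s.is_empty())
            .map(|s| s.to_string())
            .collect();
        Self { stops, held: String::new(), done: false }
    }

    /// False when there are no stop strings, so callers can skip buffering.
    pub fn is_active(&self) -> bool {
        !self.stops.is_empty()
    }

    pub fn push(&mut self, piece: &str) -> Step {
        if self.done {
            return Step::Stop(String::new());
        }
        self.held.push_str(piece);
        // Earliest full match across all stop strings wins.
        let hit = self
            .stops
            .iter()
            .filter_map(|s| self.held.find(s.as_str()))
            .min();
        if let Some(pos) = hit {
            self.done = true;
            let out = self.held[..pos].to_string();
            self.held.clear();
            return Step::Stop(out);
        }
        // Withhold the longest tail that could still grow into a stop string.
        let keep = self.longest_partial();
        let cut = self.held.len() - keep;
        Step::Emit(self.held.drain(..cut).collect())
    }

    /// Byte length of the longest suffix of `held` that is a proper,
    /// non-empty prefix of some stop string.
    fn longest_partial(&self) -> usize {
        let mut best = 0;
        for s in &self.stops {
            // skip(1) yields the char-boundary offsets strictly inside `s`.
            for (i, _) in s.char_indices().skip(1) {
                if i > best && self.held.ends_with(&s[..i]) {
                    best = i;
                }
            }
        }
        best
    }

    /// On a natural end (EOS or length) the held tail is real output.
    pub fn finish(&mut self) -> String {
        std::mem::take(&mut self.held)
    }
}
```

In this sketch, a worker would forward every `Emit` payload as a delta, end the request with finish reason stop on `Stop`, and call `finish()` to flush the tail when generation ends on its own.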

Semantics

Matches apply_stop_sequences (the rule the Anthropic route already uses): the earliest match across all stop strings wins, and the stop string is excluded from the output. A unit test feeds the same text through the matcher in every chunking (whole, char-by-char, each two-piece split) and asserts the result equals apply_stop_sequences on the whole string.
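For reference, the post-hoc rule being matched can be written in a few lines. This is a sketch under assumptions: the function name `apply_stop_sequences_sketch` and its return shape (truncated text plus a "stopped" flag) are hypothetical, and the real apply_stop_sequences signature may differ.

```rust
/// Hypothetical whole-string truncation: cut at the earliest occurrence of
/// any stop string and exclude the stop string itself.
pub fn apply_stop_sequences_sketch<'a>(text: &'a str, stops: &[&str]) -> (&'a str, bool) {
    let hit = stops
        .iter()
        .filter(|s| !s.is_empty()) // an empty stop would match at offset 0
        .filter_map(|s| text.find(*s))
        .min(); // earliest match across the whole set
    match hit {
        Some(pos) => (&text[..pos], true),
        None => (text, false),
    }
}
```

An equivalence sweep like the one described would compare this result against the concatenated output of the incremental matcher for each way of splitting the same input.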

Validation

  • Unit tests (run in ordinary cargo test, no feature needed): 12 stop_matcher tests covering single-piece, split-across-pieces, partial false-alarm, earliest-of-multiple, duplicate, at-start, unicode, non-boundary safety, and the apply_stop_sequences equivalence sweep.
  • E2E on GB10 CUDA (/v1/completions, greedy; stop strings derived from the deterministic no-stop baseline so the expected truncation is exact):
    • single-word, multi-word (token-spanning), and duplicate (earliest-wins) stops truncate exactly, report finish_reason=stop, and exclude the stop string;
    • multiple stop strings resolve to the earliest across the set;
    • a non-occurring stop leaves output unchanged (full text, finish_reason=length);
    • streaming: concatenated deltas equal the truncation and the stop string never leaks (proves the cross-token buffering);
    • 3 concurrent requests with distinct stops each truncate independently (per-slot matcher isolation under continuous batching).
  • cargo fmt and cargo clippy -D warnings are clean (default no-feature lib plus tests, and --features xla-iree lib); mlxcel-server builds and links with xla-iree; the server:: lib suite (1222 tests) passes.

The MLX serving path is unchanged (no trait or scheduler changes); default and CI builds compile with the XLA worker absent.

Scope / non-goals

Enforces stop strings on the OpenXLA serve path. Still not applied on this backend: repetition / frequency / presence penalties and DRY (warned once); logprobs, structured output, and multimodal (rejected). On-device sampling, paged-KV, and chunked prefill remain later Stage 2d/2e follow-ups.

Refs #449 (epic). Follows #471 (Stage 2d sampling), #470 (Stage 2c serving), and #469/#468/#466/#462.

The OpenXLA serve worker (Stage 2c) terminated only on EOS or max_tokens
and warned that stop strings were ignored. Enforce them streaming-safely:

- Add a pure StopMatcher (src/server/batch/stop_matcher.rs) that buffers
  decoded text, withholds any tail that could begin a stop string, and ends
  the request at the earliest full match, excluding the stop string and
  everything after it. Semantics match apply_stop_sequences, proven by an
  equivalence test over every chunking. Always compiled and unit-tested;
  only the xla-iree worker consumes it.
- Add StreamingDecodeState::finish_truncated (xla-iree only) so a
  non-streaming result is truncated to the bytes already streamed.
- Wire it into XlaServeWorker: build a per-request matcher from
  options.stop_sequences, route decoded pieces through it, free the engine
  slot and finish with reason "stop" on a match, flush the held tail on a
  natural end. Requests without stop strings stream exactly as before.

The MLX serving path is unchanged. Validated on GB10 CUDA: truncation,
earliest-match, multi-stop, non-occurring, streaming cross-token, and
concurrent per-slot isolation.

Refs #449.
@inureyes added the area:architecture, priority:medium, status:done, and type:enhancement labels on Jun 29, 2026
@inureyes merged commit 1d590dd into main on Jun 29, 2026
5 checks passed
@inureyes deleted the feature/449-stage2d-xla-stop-strings branch on June 29, 2026 01:38