feat: enforce stop strings on the OpenXLA serve path (#449 M3 Stage 2d) #478
Merged
The OpenXLA serve worker (Stage 2c) terminated only on EOS or max_tokens and warned that stop strings were ignored. Enforce them streaming-safely:

- Add a pure StopMatcher (src/server/batch/stop_matcher.rs) that buffers decoded text, withholds any tail that could begin a stop string, and ends the request at the earliest full match, excluding the stop string and everything after it. Semantics match apply_stop_sequences, proven by an equivalence test over every chunking. Always compiled and unit-tested; only the xla-iree worker consumes it.
- Add StreamingDecodeState::finish_truncated (xla-iree only) so a non-streaming result is truncated to the bytes already streamed.
- Wire it into XlaServeWorker: build a per-request matcher from options.stop_sequences, route decoded pieces through it, free the engine slot and finish with reason "stop" on a match, flush the held tail on a natural end.

Requests without stop strings stream exactly as before. The MLX serving path is unchanged.

Validated on GB10 CUDA: truncation, earliest-match, multi-stop, non-occurring, streaming cross-token, and concurrent per-slot isolation.

Refs #449.
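The truncation that `finish_truncated` is described as performing can be illustrated with a minimal sketch. The helper name `truncate_to_emitted`, its signature, and the fall-back behaviour are assumptions made for illustration; the actual method on `StreamingDecodeState` is not shown in this text.

```rust
/// Hypothetical helper: cut the full decode buffer back to what was streamed.
/// Assumes `emitted_len` counts bytes and that the emitted text is a prefix
/// of `decoded`, as the commit message states.
fn truncate_to_emitted(decoded: &str, emitted_len: usize) -> &str {
    // `get` returns None if the length is out of range or would split a
    // UTF-8 character; falling back to the untruncated text (rather than
    // panicking) is a choice made for this sketch only.
    decoded.get(..emitted_len).unwrap_or(decoded)
}

fn main() {
    let decoded = "The answer is 42.\nSTOP and more";
    let streamed = "The answer is 42.\n"; // what the client already received
    let result = truncate_to_emitted(decoded, streamed.len());
    assert_eq!(result, streamed);
    println!("{result:?}");
}
```

Because the streamed text is a prefix of the decode buffer, a byte length is enough to identify the cut point, and the non-streaming response ends up identical to the concatenated stream.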
Summary
Stage 2d follow-up of the OpenXLA/IREE milestone (#449 M3): the serve path (Stage 2c plus 2d-sampling) terminated only on EOS or `max_tokens` and warned that stop strings were ignored. This enforces them streaming-safely, so the OpenXLA backend honors the OpenAI `stop` parameter the same way it already honors sampling.
What's here
- `stop_matcher.rs` (new): a pure `StopMatcher` that turns the post-hoc truncation of `apply_stop_sequences` into an incremental, streaming-safe form. It buffers decoded text, withholds any suffix that could be the start of a stop string (so a stop string split across tokens, e.g. `"STOP"` arriving as `"ST"` then `"OP"`, is still caught), and ends the request at the earliest full match, excluding the stop string and everything after it. It has no device state, so it is always compiled and unit-tested; the `xla-iree` worker is its only consumer today, so a scoped `allow(dead_code)` covers the feature-off build.
- `StreamingDecodeState::finish_truncated` (`xla-iree` only): truncates the non-streaming `result.text` to the bytes already streamed (the emitted text is always a prefix of the decode buffer), keeping streaming and non-streaming consistent. The MLX path's `finish` is untouched.
- `XlaServeWorker`: builds a per-request matcher from `options.stop_sequences`, routes each decoded piece through it, and on a match frees the engine slot (`cancel`) and sends a terminal `Done` with finish reason `stop`. On a natural end (EOS or length) it flushes the held tail first, since that tail never completed a stop string and is therefore real output. Requests with no stop strings take an `is_active()` fast path and stream byte-for-byte as before.
Semantics
Matches `apply_stop_sequences` (the rule the Anthropic route already uses): the earliest match across all stop strings wins, and the stop string is excluded from the output. A unit test feeds the same text through the matcher in every chunking (whole, char-by-char, each two-piece split) and asserts the result equals `apply_stop_sequences` on the whole string.
Validation
- … (`cargo test`, no feature needed): 12 `stop_matcher` tests covering single-piece, split-across-pieces, partial false-alarm, earliest-of-multiple, duplicate, at-start, unicode, non-boundary safety, and the `apply_stop_sequences` equivalence sweep.
- … (`/v1/completions`, greedy; stop strings derived from the deterministic no-stop baseline so the expected truncation is exact): … `finish_reason=stop`, and exclude the stop string; … `length`);
- `cargo fmt` and `cargo clippy -D warnings` are clean (default no-feature lib plus tests, and `--features xla-iree` lib); `mlxcel-server` builds and links with `xla-iree`; the `server::` lib suite (1222 tests) passes.
The MLX serving path is unchanged (no trait or scheduler changes); default and CI builds compile with the XLA worker absent.
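The withholding rule and the chunking-equivalence check described above can be sketched as follows. This is an illustrative reconstruction, not the code in `stop_matcher.rs`: `SketchStopMatcher`, `truncate_at_earliest_stop` (a stand-in for `apply_stop_sequences`, whose real signature is not shown here), `run`, and the `(String, bool)` return shape are all assumptions.

```rust
/// Stand-in for the post-hoc rule: cut at the earliest match of any stop string.
fn truncate_at_earliest_stop(text: &str, stops: &[&str]) -> String {
    let cut = stops.iter().filter(|s| !s.is_empty()).filter_map(|s| text.find(*s)).min();
    text[..cut.unwrap_or(text.len())].to_string()
}

/// Hypothetical incremental matcher; field and method names are illustrative.
struct SketchStopMatcher {
    stops: Vec<String>,
    held: String, // decoded text not yet released to the client
}

impl SketchStopMatcher {
    fn new(stops: &[&str]) -> Self {
        let stops = stops.iter().filter(|s| !s.is_empty()).map(|s| s.to_string()).collect();
        Self { stops, held: String::new() }
    }

    /// True when at least one stop string is configured.
    fn is_active(&self) -> bool {
        !self.stops.is_empty()
    }

    /// Feed one decoded piece; returns (text safe to emit now, stop matched?).
    fn push(&mut self, piece: &str) -> (String, bool) {
        self.held.push_str(piece);
        let hit = self.stops.iter().filter_map(|s| self.held.find(s.as_str())).min();
        if let Some(i) = hit {
            self.held.truncate(i); // drop the stop string and everything after it
            return (std::mem::take(&mut self.held), true);
        }
        // Withhold the longest tail that is a proper prefix of some stop string.
        let mut keep = 0;
        for s in &self.stops {
            for (k, _) in s.char_indices().skip(1) {
                if k > keep && self.held.ends_with(&s[..k]) {
                    keep = k;
                }
            }
        }
        let release = self.held.len() - keep;
        (self.held.drain(..release).collect(), false)
    }

    /// Natural end (EOS or length): the held tail never completed a stop string.
    fn flush(&mut self) -> String {
        std::mem::take(&mut self.held)
    }
}

/// Drive the matcher over a chunking and collect what a client would receive.
fn run(pieces: &[&str], stops: &[&str]) -> String {
    let mut m = SketchStopMatcher::new(stops);
    if !m.is_active() {
        return pieces.concat(); // fast path: nothing to match
    }
    let mut out = String::new();
    for p in pieces {
        let (text, stopped) = m.push(p);
        out.push_str(&text);
        if stopped {
            return out; // the caller must not push again after a match
        }
    }
    out.push_str(&m.flush());
    out
}

fn main() {
    let stops = ["END", "STOP"];
    let text = "héllo ST and STOP then END";
    let expected = truncate_at_earliest_stop(text, &stops);
    assert_eq!(run(&[text], &stops), expected); // whole
    let chars: Vec<String> = text.chars().map(String::from).collect();
    let refs: Vec<&str> = chars.iter().map(String::as_str).collect();
    assert_eq!(run(&refs, &stops), expected); // char-by-char
    for (i, _) in text.char_indices() {
        assert_eq!(run(&[&text[..i], &text[i..]], &stops), expected); // two-piece splits
    }
    println!("{expected:?}");
}
```

The sketch rescans the held buffer on every piece, which keeps it short; since the buffer never grows beyond one piece plus a partial stop string, that cost stays small. The early `"ST"` in the sample text exercises the partial false-alarm case: it is held until the next character shows it cannot become `"STOP"`, then released.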
Scope / non-goals
Enforces stop strings on the OpenXLA serve path. Still not applied on this backend: repetition / frequency / presence penalties and DRY (warned once); logprobs, structured output, and multimodal (rejected). On-device sampling, paged-KV, and chunked prefill remain later Stage 2d/2e follow-ups.
Refs #449 (epic). Follows #471 (Stage 2d sampling), #470 (Stage 2c serving), and #469/#468/#466/#462.