feat: add preStop hook for llamacpp and tgi in the BackendRuntime #381
Merged
googs1025
reviewed
Apr 27, 2025
Member
Thanks @cr7258, I didn't know this before. I was wondering why the inference engines don't support this themselves, which they should.
cr7258
commented
Apr 27, 2025
Member
/lgtm Thanks!
Member
/lgtm Thanks!
Member
/triage accepted
Member
/lgtm
What this PR does / why we need it
Add a preStop hook for llamacpp and tgi in the BackendRuntime to ensure graceful termination. I didn't add a preStop hook for Ollama and SGLang, because:
In this PR, I also increase terminationGracePeriodSeconds from the default 30s to 130. Generally, the termination grace period needs to last longer than the slowest request we expect to serve plus any extra time spent waiting for load balancers to take the model server out of rotation. For the detailed explanation, please see here.
Which issue(s) this PR fixes
Fixes #320
Special notes for your reviewer
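The tgi output logs in these notes suggest that the hook repeatedly checks how many requests are running and waiting, and only lets the container stop once both reach zero. The following is a minimal Python sketch of that polling loop, not the hook added in this PR. The `get_counts` callable, the poll interval, and the 120-second timeout are hypothetical; the only constraint taken from the description is that the wait should finish within the 130-second terminationGracePeriodSeconds.

```python
import time
from typing import Callable, Tuple


def wait_until_drained(
    get_counts: Callable[[], Tuple[int, int]],
    poll_interval: float = 1.0,
    timeout: float = 120.0,  # assumption: kept below the 130s grace period
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> bool:
    """Poll until no requests are running or waiting.

    get_counts is a hypothetical callable returning (running, waiting),
    for example by reading the server's metrics endpoint.
    Returns True once drained, False if the timeout is reached first.
    """
    waited = 0.0
    while True:
        running, waiting = get_counts()
        if running == 0 and waiting == 0:
            log("Terminating: No active or waiting requests, safe to terminate")
            return True
        log(f"Terminating: Running: {running}, Waiting: {waiting}")
        if waited >= timeout:
            # Give up; the kubelet kills the container once the grace period ends.
            return False
        sleep(poll_interval)
        waited += poll_interval
```

Passing `sleep` and `log` as parameters keeps the loop easy to exercise without a live model server; in a real hook the counts would come from whatever metrics the backend exposes.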
llamacpp:
related doc
output logs:
tgi:
related doc
output logs:
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:12.853024Z INFO completions{total_time="4.198707127s" validation_time="168.004µs" queue_time="64.762µs" inference_time="4.198474621s" time_per_token="4.198474ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
2025-04-27T05:28:14.820936Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1008
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:18.999206Z INFO completions{total_time="4.178477298s" validation_time="177.684µs" queue_time="74.152µs" inference_time="4.178225622s" time_per_token="4.178225ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
Terminating: No active or waiting requests, safe to terminate
Does this PR introduce a user-facing change?