feat: add preStop hook for llamacpp and tgi in the BackendRuntime by cr7258 · Pull Request #381 · InftyAI/llmaz

cr7258 · 2025-04-27T06:07:29Z

What this PR does / why we need it

Add a preStop hook for llamacpp and tgi in the BackendRuntime to ensure graceful termination. I didn't add a preStop hook for Ollama and SGLang, because:

Ollama does not provide metrics for the number of requests being processed, and it currently even lacks a metrics endpoint (see "Add metrics endpoint and basic request metrics otel based", ollama/ollama#6537). I have created "Provide an API to retrieve the number of requests being processed" (ollama/ollama#10419) on the Ollama repo.
SGLang refreshes its Prometheus metrics every 30 seconds, which is too coarse for deciding when it is safe to terminate. Additionally, SGLang natively supports graceful termination, see logs below.

[2025-04-26 10:13:01] SIGTERM received. signum=None frame=None. Draining requests and shutting down...
[2025-04-26 10:13:04] Gracefully exiting... remaining number of requests 3
[2025-04-26 10:13:09] Gracefully exiting... remaining number of requests 2
[2025-04-26 10:13:14] Gracefully exiting... remaining number of requests 2
2025-04-26 10:13:18,881 - INFO - flashinfer.jit: Finished loading JIT ops: cascade
[2025-04-26 10:13:18 TP0] Decode batch. #running-req: 1, #token: 11, token usage: 0.00, gen throughput (token/s): 0.85, #queue-req: 0, 
[2025-04-26 10:13:19] Gracefully exiting... remaining number of requests 2
[2025-04-26 10:13:19 TP0] Decode batch. #running-req: 1, #token: 51, token usage: 0.00, gen throughput (token/s): 342.30, #queue-req: 0, 
[2025-04-26 10:13:19 TP0] Decode batch. #running-req: 1, #token: 91, token usage: 0.00, gen throughput (token/s): 381.66, #queue-req: 0, 
[2025-04-26 10:13:19] INFO:     127.0.0.1:50206 - "POST /v1/completions HTTP/1.1" 200 OK
[2025-04-26 10:13:21 TP0] Prefill batch. #new-seq: 1, #new-token: 1, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-26 10:13:24] Gracefully exiting... remaining number of requests 0
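For readers unfamiliar with the pattern, the behaviour visible in the SGLang logs above (stop on SIGTERM, then keep reporting the remaining request count until it reaches zero) can be sketched roughly as follows. This is not SGLang's implementation; `install_sigterm_handler`, `drain`, and the `remaining` callback are hypothetical names used only for illustration, and the 5-second interval is merely read off the log timestamps.

```python
import signal
import threading
import time

# Set by the signal handler; a server loop would check it and stop accepting new work.
shutdown_requested = threading.Event()


def install_sigterm_handler():
    """Register a SIGTERM handler that only sets a flag (must run in the main thread)."""
    signal.signal(signal.SIGTERM, lambda signum, frame: shutdown_requested.set())


def drain(remaining, poll_interval=5.0, sleep=time.sleep, log=print):
    """Block until remaining() reports zero in-flight requests.

    `remaining` is a hypothetical callback returning the current in-flight count.
    Returns the number of polls that still saw pending requests.
    """
    pending_polls = 0
    while True:
        count = remaining()
        log(f"Gracefully exiting... remaining number of requests {count}")
        if count == 0:
            return pending_polls
        pending_polls += 1
        sleep(poll_interval)
```

An engine that drains like this on its own does not need an external preStop hook; it only needs a termination grace period long enough for the loop to finish.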

In this PR, I also increase terminationGracePeriodSeconds from the default 30s to 130s. Generally, the termination grace period needs to last longer than the slowest request we expect to serve plus any extra time spent waiting for load balancers to take the model server out of rotation. For a detailed explanation, please see here.
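The sizing rule above is simple arithmetic, sketched here as a small helper. The function names and the example durations (a 90s slowest request, 15s for load balancer deregistration) are illustrative assumptions, not figures taken from this PR.

```python
def min_grace_period(slowest_request_s, lb_drain_s, margin_s=0):
    """Lower bound for terminationGracePeriodSeconds under the rule of thumb above."""
    return slowest_request_s + lb_drain_s + margin_s


def covers(grace_period_s, slowest_request_s, lb_drain_s):
    """True if the configured grace period is at least the computed lower bound."""
    return grace_period_s >= min_grace_period(slowest_request_s, lb_drain_s)


# Illustrative numbers only: a 90s request plus 15s of load balancer drain needs 105s,
# which the Kubernetes default of 30s would not cover but 130s would.
needed = min_grace_period(90, 15)
```

If the preStop hook itself can wait for a long time, its worst-case duration counts against the same budget, since the kubelet kills the container once the grace period expires.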

Which issue(s) this PR fixes

Fixes #320

Special notes for your reviewer

llamacpp
related doc
output logs:

Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: Running: 1, Waiting: 0
srv  log_server_r: request: GET /health 240.243.170.78 200
srv  cancel_tasks: cancel task, id_task = 0
slot      release: id  0 | task 0 | stop processing: n_past = 2810, truncated = 0
srv  update_slots: all slots are idle
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /metrics 127.0.0.1 200
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /metrics 127.0.0.1 200
Terminating: No active or waiting requests, safe to terminate
srv    operator(): operator(): cleaning up before exit...

tgi:

related doc
output logs:

Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:12.853024Z  INFO completions{total_time="4.198707127s" validation_time="168.004µs" queue_time="64.762µs" inference_time="4.198474621s" time_per_token="4.198474ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
2025-04-27T05:28:14.820936Z  INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 1008
Terminating: Running: 1, Waiting: 0
2025-04-27T05:28:18.999206Z  INFO completions{total_time="4.178477298s" validation_time="177.684µs" queue_time="74.152µs" inference_time="4.178225622s" time_per_token="4.178225ms" seed="None"}: text_generation_router::server: router/src/server.rs:402: Success
Terminating: No active or waiting requests, safe to terminate
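The "Terminating: ..." lines in the logs above suggest a hook that polls the engine's metrics endpoint and exits once nothing is running or queued. A minimal sketch of that idea follows; it is not the script shipped in this PR. The helper names are hypothetical, and the metric names are passed in as parameters because they differ per engine (the ones shown in the usage note are assumptions to verify against the engine's /metrics output).

```python
import re
import time
import urllib.request


def read_gauge(metrics_text, name):
    """Return the first sample of `name` from Prometheus text output, or 0.0 if absent.

    Treating a missing metric as 0 is a simplification; a stricter hook might
    keep waiting instead.
    """
    pattern = re.compile(
        r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)", re.MULTILINE
    )
    match = pattern.search(metrics_text)
    return float(match.group(1)) if match else 0.0


def fetch_metrics(url, timeout=2.0):
    """Fetch the raw Prometheus text from the engine's metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def wait_until_idle(fetch, running_metric, waiting_metric,
                    interval=5.0, sleep=time.sleep, log=print):
    """Poll until both gauges read zero, logging in the style of the output above."""
    while True:
        text = fetch()
        running = read_gauge(text, running_metric)
        waiting = read_gauge(text, waiting_metric)
        if running == 0 and waiting == 0:
            log("Terminating: No active or waiting requests, safe to terminate")
            return
        log(f"Terminating: Running: {running:g}, Waiting: {waiting:g}")
        sleep(interval)
```

For llamacpp this might be called as `wait_until_idle(lambda: fetch_metrics("http://localhost:8080/metrics"), "llamacpp:requests_processing", "llamacpp:requests_deferred")`, where the port and metric names are assumptions. The loop has no timeout of its own because terminationGracePeriodSeconds already bounds how long the hook can run.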

Does this PR introduce a user-facing change?

add preStop hook for llamacpp and tgi in the BackendRuntime

kerthcet · 2025-04-27T11:16:51Z

Additionally, SGLang natively supports graceful termination, see logs below.

Thanks @cr7258, I didn't know this before. I was wondering why inference engines don't support this, which they should.

kerthcet

Only one comment.

kerthcet · 2025-04-27T14:01:31Z

/lgtm
/approve
/kind feature

Thanks!

kerthcet · 2025-04-27T14:01:45Z

/lgtm
/approve
/kind feature

Thanks!

kerthcet · 2025-04-27T14:03:34Z

/triage accepted

kerthcet · 2025-04-27T14:09:32Z

/lgtm

feat: add preStop hook for llamacpp and tgi in the BackendRuntime

28d8f3d

InftyAI-Agent added the needs-triage, needs-priority, and do-not-merge/needs-kind labels Apr 27, 2025

InftyAI-Agent requested a review from kerthcet April 27, 2025 06:07

googs1025 reviewed Apr 27, 2025

Comment thread pkg/controller/inference/playground_controller.go

Merge branch 'main' into prestop-lifecycle

cadc592

kerthcet reviewed Apr 27, 2025

Comment thread pkg/controller/inference/playground_controller.go

cr7258 commented Apr 27, 2025

Comment thread pkg/controller/inference/playground_controller.go

Add TODO

d6e7c21

cr7258 commented Apr 27, 2025

Comment thread pkg/controller/inference/playground_controller.go Outdated

cr7258 added 2 commits April 27, 2025 21:15

Remove extra blank

ae1ae3e

fix golang lint

5935aef

cr7258 requested a review from kerthcet April 27, 2025 13:32

InftyAI-Agent assigned kerthcet Apr 27, 2025

InftyAI-Agent added the triage/accepted label and removed the needs-triage label Apr 27, 2025

Merge branch 'main' into prestop-lifecycle

781d4f1

InftyAI-Agent removed the lgtm label Apr 27, 2025

InftyAI-Agent added the lgtm label Apr 27, 2025

InftyAI-Agent merged commit fb95a7d into InftyAI:main Apr 27, 2025
19 checks passed

InftyAI-Agent merged 6 commits into InftyAI:main from cr7258:prestop-lifecycle

cr7258 commented Apr 27, 2025

Uh oh!

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

kerthcet left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

kerthcet commented Apr 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants
