Description
The litellm backend can return streaming responses from inference calls rather than only synchronous ones.

Let us implement an optional capability in chatPanel which (for the litellm backends only, for now) sends `stream=True` and streams responses back to users as they arrive.
It's important to note that for Qwen (not sure about all models), "thinking" is wrapped in a `<think> ... </think>` tag. Not sure if that is always true, but if a stream starts with `<think>`, let's show a circle spinner/accordion which users can open to watch the thinking stream as it comes in, but which stays closed by default until the end of thinking. A spinning or breathing animation would be nice so users are aware that work is happening in the background. Then stream the rest of the response to the output as usual.
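As a rough sketch of the spinner/accordion logic, the stream could be fed through a small state machine that splits incoming text into "thinking" and "answer" portions. The `ThinkSplitter` name and its behavior are hypothetical (not existing chatPanel code), and it assumes at most one `<think>` block at the very start of the stream, as described above for Qwen:

```python
OPEN, CLOSE = "<think>", "</think>"

class ThinkSplitter:
    """Hypothetical helper: incrementally split a streamed response into
    text inside <think>...</think> (drive the spinner/accordion) and
    visible answer text (stream to the output)."""

    def __init__(self):
        self.buf = ""
        self.state = "start"  # start -> thinking -> answer

    def feed(self, chunk):
        """Feed one streamed chunk; return (thinking_text, answer_text)."""
        thinking, answer = "", ""
        self.buf += chunk
        while self.buf:
            if self.state == "start":
                if self.buf.startswith(OPEN):
                    self.buf = self.buf[len(OPEN):]
                    self.state = "thinking"
                elif OPEN.startswith(self.buf):
                    break  # could still become "<think>"; wait for more
                else:
                    self.state = "answer"  # no think block at all
            elif self.state == "thinking":
                i = self.buf.find(CLOSE)
                if i != -1:
                    thinking += self.buf[:i]
                    self.buf = self.buf[i + len(CLOSE):]
                    self.state = "answer"
                else:
                    # flush all but a possible partial "</think>" suffix
                    keep = 0
                    for k in range(1, len(CLOSE)):
                        if self.buf.endswith(CLOSE[:k]):
                            keep = k
                    if keep:
                        thinking += self.buf[:-keep]
                        self.buf = self.buf[-keep:]
                    else:
                        thinking += self.buf
                        self.buf = ""
                    break
            else:  # answer
                answer += self.buf
                self.buf = ""
        return thinking, answer
```

The UI would append the first element of each tuple to the collapsed accordion and the second to the visible output; tag boundaries can fall anywhere across chunks, which is why the splitter buffers partial tags.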
This is working Python code that uses `requests` to call litellm and print the streamed response in real time:

```python
import json

import requests as r

chat2 = r.post(
    "<litellmurl>/v1/chat/completions",
    headers={...},
    json={
        "model": "qwen3-32b",
        "messages": [{"role": "user", "content": "prompt?"}],
        "stream": True,  # ask litellm for a streaming response
    },
    stream=True,  # tell requests not to buffer the whole body
)
for line in chat2.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: ") and decoded != "data: [DONE]":
            chunk = json.loads(decoded[6:])  # strip the "data: " prefix
            content = chunk["choices"][0]["delta"].get("content", "")
            print(content, end="", flush=True)
```
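For the chatPanel implementation it may be cleaner to factor the SSE parsing out of the print loop into a reusable generator. A minimal sketch (the `iter_deltas` name is ours, not an existing API; it accepts any iterable of byte lines such as `response.iter_lines()`):

```python
import json

def iter_deltas(lines):
    """Yield content deltas from an OpenAI-style SSE stream of byte lines.

    Sketch of factoring the streaming loop into a reusable generator;
    `lines` can be `response.iter_lines()` from a stream=True request.
    """
    for line in lines:
        if not line:
            continue  # skip SSE keep-alive blank lines
        decoded = line.decode("utf-8")
        if not decoded.startswith("data: "):
            continue  # ignore non-data SSE fields
        payload = decoded[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel sent by the server
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta
```

The original loop then becomes `for content in iter_deltas(chat2.iter_lines()): print(content, end="", flush=True)`, and the same generator can feed the think-tag handling described above.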