Streaming response for chat panel #516

@NotChristianGarcia

Description

The litellm backend is capable of returning streaming responses from inference calls rather than only synchronous responses.

Let's implement an optional capability in chatPanel which (for the litellm backends only, for now) sends stream=True and streams responses back to users.

It's important to note that for Qwen (not sure about all models), "thinking" is wrapped in a `<think> ... </think>` tag. Not sure if that is always true, but if a stream starts with `<think>`, let's show a circle spinner/accordion which users can open to view the whole stream as it comes in, but which by default stays closed until the end of thinking. A spinning or breathing animation would be nice so users are aware that work is happening in the background. Then stream the rest of the response to the output.
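The `<think>` handling above could be sketched as a small stateful splitter that routes streamed text either to the accordion or to the main output. This is a hypothetical helper (the name `split_stream` and the assumption of at most one `<think>...</think>` block at the start of the response are mine), and it has to buffer a short tail because a tag can be split across chunks:

```python
OPEN, CLOSE = "<think>", "</think>"

def split_stream(chunks):
    """Yield ("think" | "answer", text) pieces from streamed token chunks.

    Hypothetical sketch: assumes at most one <think>...</think> block,
    appearing (if at all) at the very start of the response.
    """
    buf, state = "", "start"
    for chunk in chunks:
        buf += chunk
        if state == "start":
            if len(buf) < len(OPEN) and OPEN.startswith(buf):
                continue                        # might still become "<think>"
            if buf.startswith(OPEN):
                buf, state = buf[len(OPEN):], "think"
            else:
                state = "answer"                # no thinking block at all
        if state == "think":
            end = buf.find(CLOSE)
            if end != -1:
                if end:
                    yield "think", buf[:end]
                buf, state = buf[end + len(CLOSE):], "answer"
            else:
                # Hold back a tail in case "</think>" is split across chunks.
                safe = max(0, len(buf) - len(CLOSE) + 1)
                if safe:
                    yield "think", buf[:safe]
                    buf = buf[safe:]
        if state == "answer" and buf:
            yield "answer", buf
            buf = ""
    if buf:                                     # flush any held-back tail
        yield ("think" if state == "think" else "answer"), buf
```

The panel would render `"think"` pieces inside the closed accordion and `"answer"` pieces directly to the chat output.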

Here is working Python code, using requests, that calls litellm and prints the streamed response in real time:

import json

import requests as r

chat2 = r.post(
    "<litellmurl>/v1/chat/completions",
    headers={...},  # auth headers elided
    json={
        "model": "qwen3-32b",
        "messages": [{"role": "user", "content": "prompt?"}],
        "stream": True,   # ask litellm for a streaming (SSE) response
    },
    stream=True,          # tell requests not to buffer the whole body
)

for line in chat2.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data: ") and decoded != "data: [DONE]":
            chunk = json.loads(decoded[6:])  # strip "data: " prefix
            # Some chunks carry no content (e.g. role-only or null deltas).
            content = chunk["choices"][0]["delta"].get("content") or ""
            print(content, end="", flush=True)
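For the chatPanel implementation, the same SSE parsing could be factored into a reusable generator so the UI layer only sees content deltas. A sketch under the same assumptions as the snippet above (the name `iter_deltas` is mine, not existing panel code):

```python
import json

def iter_deltas(lines):
    """Yield content deltas from raw SSE lines, skipping keep-alives and [DONE].

    Hypothetical refactor of the loop above; accepts bytes or str lines.
    """
    for line in lines:
        decoded = line.decode("utf-8") if isinstance(line, bytes) else line
        if decoded.startswith("data: ") and decoded != "data: [DONE]":
            chunk = json.loads(decoded[6:])  # strip the "data: " prefix
            # Some chunks carry no content (e.g. role-only or null deltas).
            content = chunk["choices"][0]["delta"].get("content") or ""
            if content:
                yield content
```

The panel would then call something like `for delta in iter_deltas(chat2.iter_lines()): append_to_output(delta)`.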
