Streaming

Server-sent events on /v1/chat/completions and /v1/responses, the chunk shape, stream_options.include_usage, and why Zumik buffers the stream to deliver real content with correct billing.

Set stream: true and Zumik returns a server-sent event stream of OpenAI-shaped chunks. The stream works end to end - real content, a clean terminator, and (when you ask for it) a trailing usage chunk with the exact token counts you were billed for.

Stream a chat completion

Python

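# Assumes `client` is an OpenAI SDK client already configured for Zumik.
stream = client.chat.completions.create(
    model="code.fast",
    messages=[{"role": "user", "content": "Explain the diff."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        # Role, content, and finish frames: print any text in the delta.
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:
        # Trailing usage frame: choices is empty, usage holds the billed counts.
        print("\ncached:", chunk.usage.prompt_tokens_details.cached_tokens)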

A streamed response sets Content-Type: text/event-stream and Cache-Control: no-cache, and the body is a sequence of data: lines.
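To make the wire format concrete, here is a minimal sketch of reading the data: lines by hand, without an SDK. The name iter_sse_json is illustrative, not part of any Zumik client, and the sketch assumes each event arrives on a single data: line.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each data: line until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # literal sentinel: the stream is over
        yield json.loads(payload)


# Hand-written sample lines standing in for a response body.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"The patch "},"finish_reason":null}]}',
    "",
    "data: [DONE]",
]
frames = list(iter_sse_json(sample))
```

In practice the lines would come from whatever HTTP client you use to read the response body line by line; an SDK does this parsing for you.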

The chunk shape

Each event is data: followed by a JSON object whose object is chat.completion.chunk. The frames arrive in this order:

Role frame

Opens the message. The delta carries only the role.

{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
 "choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

Content frames

One or more frames carrying text in delta.content.

{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
 "choices":[{"index":0,"delta":{"content":"The patch "},"finish_reason":null}]}

Finish frame

An empty delta with a finish_reason (stop, length, tool_calls, or content_filter).

{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
 "choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

Usage frame (opt-in)

Emitted only when stream_options.include_usage is true. choices is empty and usage carries the final counts, including cached input.

{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
 "choices":[],
 "usage":{"prompt_tokens":1240,"completion_tokens":180,"total_tokens":1420,
          "prompt_tokens_details":{"cached_tokens":1024}}}

Terminator

The stream ends with the literal sentinel.

data: [DONE]
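The frame order above can be folded back into a single message. The following is a sketch only: collect_stream is a hypothetical helper, it works on already-decoded chunk dicts rather than SDK objects, and it assumes a single choice at index 0.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> dict:
    """Fold role, content, finish, and usage frames into one result."""
    role = None
    parts = []
    finish_reason = None
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]  # usage frame: choices is empty
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            role = delta.get("role", role)  # set by the role frame
            if delta.get("content"):
                parts.append(delta["content"])  # content frames
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]  # finish frame
    return {
        "role": role,
        "content": "".join(parts),
        "finish_reason": finish_reason,
        "usage": usage,
    }


# The four frame kinds from this page, with ids and timestamps omitted.
example_chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "The patch "}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 1240, "completion_tokens": 180,
                              "total_tokens": 1420,
                              "prompt_tokens_details": {"cached_tokens": 1024}}},
]
message = collect_stream(example_chunks)
```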

stream_options.include_usage

By default a streamed response carries no usage object - matching OpenAI. Set stream_options.include_usage: true to receive one trailing chunk after the finish frame whose usage block reports prompt_tokens, completion_tokens, total_tokens, and prompt_tokens_details.cached_tokens. The cached count is how you confirm prompt-cache capture on a streamed call.
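As a small worked example, the cached share of the prompt can be computed from the usage block. The helper name cached_fraction is an assumption for illustration; the numbers are the ones from the usage frame shown on this page.

```python
def cached_fraction(usage: dict) -> float:
    """Return cached_tokens / prompt_tokens, or 0.0 if there was no prompt."""
    prompt = usage.get("prompt_tokens", 0)
    # Treat a missing or null details block as zero cached tokens.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt if prompt else 0.0


usage = {
    "prompt_tokens": 1240,
    "completion_tokens": 180,
    "total_tokens": 1420,
    "prompt_tokens_details": {"cached_tokens": 1024},
}
fraction = cached_fraction(usage)  # 1024 / 1240, roughly 0.83
```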

Buffered streaming

Zumik buffers the stream. It obtains the full generation from the provider, charges the project for exact usage, and then re-emits the real completion as chat.completion.chunk frames.

This is a deliberate tradeoff. Charging happens before the first byte reaches you, so stream: true delivers real content with correct billing - the usage chunk reports the same tokens you were charged, not an estimate. The cost is time to first token: you see output once the provider call completes rather than token by token. Token-by-token passthrough (which would move the charge to after the stream closes) is a planned follow-up.
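One way to observe the tradeoff is to time the first content delta against the end of the stream. This is a sketch under assumptions: time_to_first_content and its open_stream argument are hypothetical names, and open_stream is expected to send the request and return an iterable of decoded chunk dicts, so that the timer also covers the wait before the stream opens.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_content(
    open_stream: Callable[[], Iterable[dict]],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float]:
    """Return (seconds to first content delta, seconds to end of stream)."""
    start = clock()
    first = None
    for chunk in open_stream():
        for choice in chunk.get("choices", []):
            if first is None and choice.get("delta", {}).get("content"):
                first = clock() - start  # first visible text
    total = clock() - start
    # If no content ever arrived, report the total for both values.
    return (first if first is not None else total), total
```

With a buffered stream the two values should sit close together, since the frames are re-emitted only after the provider call has completed.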

Because billing is settled before the stream opens, a budget or rate-limit rejection surfaces as a normal error response, not as a half-streamed body that gets cut off. The retry rules still apply: do not auto-replay a generation after observable streamed output unless the path supports resumability.
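The retry rule can be sketched as a wrapper that replays a request only while nothing has been handed to the caller. Everything here is illustrative: stream_with_safe_retry, open_stream, and the default set of retryable exceptions are assumptions, and a real client would also add backoff and would decide separately which error responses are worth retrying.

```python
from typing import Callable, Iterable, Iterator, Tuple, Type


def stream_with_safe_retry(
    open_stream: Callable[[], Iterable[str]],
    max_attempts: int = 3,
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> Iterator[str]:
    """Yield streamed text, retrying only if no output was emitted yet."""
    for attempt in range(1, max_attempts + 1):
        emitted = False
        try:
            for text in open_stream():
                emitted = True
                yield text
            return  # stream finished cleanly
        except retryable:
            # After observable output a replay could duplicate text,
            # so give up instead of retrying. Also stop on the last attempt.
            if emitted or attempt == max_attempts:
                raise
```

A failure before the first piece of text is retried transparently; a failure after output has been yielded is raised to the caller, who can decide what to do with the partial text.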

On /v1/responses and /v2/responses

The Responses surfaces use the same buffered model: the request is fully resolved and charged, then the result is returned. The OpenAI-compatible /v1/responses keeps its body shape exact; the native /v2/responses adds execution_profile and a formal qos_outcome to the response object. For chunked chat streaming today, use /v1/chat/completions with stream: true; as described under Buffered streaming, those chunks are emitted once the provider call completes rather than token by token.

Confirm cache capture

Read cached_tokens from the usage chunk to see how much of the prefix was reused.

Idempotency and retries

Safe retries around a stream that disconnects.