Streaming
Server-sent events on /v1/chat/completions and /v1/responses, the chunk shape, stream_options.include_usage, and why Zumik buffers the stream to deliver real content with correct billing.
Set stream: true and Zumik returns a server-sent event stream of OpenAI-shaped chunks. The stream
works end to end - real content, a clean terminator, and (when you ask for it) a trailing usage chunk
with the exact token counts you were billed for.
Stream a chat completion
stream = client.chat.completions.create(
    model="code.fast",
    messages=[{"role": "user", "content": "Explain the diff."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        # Role, content, and finish frames carry a delta.
        print(chunk.choices[0].delta.content or "", end="")
    elif chunk.usage:
        # The trailing usage frame has an empty choices list.
        print("\ncached:", chunk.usage.prompt_tokens_details.cached_tokens)
A streamed response sets Content-Type: text/event-stream and Cache-Control: no-cache, and the body
is a sequence of data: lines.
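If you read the body without an SDK, the data: lines can be decoded by hand. The following is a minimal sketch, assuming each event arrives as a single data: line and that the stream ends with the data: [DONE] sentinel described later on this page. The helper name iter_chunks is hypothetical and not part of any Zumik or OpenAI library.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from an iterable of SSE text lines.

    Hypothetical helper: assumes one JSON payload per `data:` line.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # terminator: stop reading, ignore anything after it
        yield json.loads(payload)


sample = [
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"The patch "},"finish_reason":null}]}',
    "",
    "data: [DONE]",
    'data: {"ignored": true}',
]
chunks = list(iter_chunks(sample))
print(len(chunks), "chunks decoded")
```

In practice the lines would come from your HTTP client's line iterator; the sample list here only stands in for that.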
The chunk shape
Each event is data: followed by a JSON object whose object is chat.completion.chunk. The frames
arrive in this order:
Role frame
Opens the message. The delta carries only the role.
{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
Content frames
One or more frames carrying text in delta.content.
{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
"choices":[{"index":0,"delta":{"content":"The patch "},"finish_reason":null}]}
Finish frame
An empty delta with a finish_reason (stop, length, tool_calls, or content_filter).
{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
Usage frame (opt-in)
Emitted only when stream_options.include_usage is true. choices is empty and usage carries
the final counts, including cached input.
{"id":"chatcmpl-…","object":"chat.completion.chunk","created":1750000000,"model":"code.fast",
"choices":[],
"usage":{"prompt_tokens":1240,"completion_tokens":180,"total_tokens":1420,
"prompt_tokens_details":{"cached_tokens":1024}}}
Terminator
The stream ends with the literal sentinel.
data: [DONE]
stream_options.include_usage
By default a streamed response carries no usage object - matching OpenAI. Set
stream_options.include_usage: true to receive one trailing chunk after the finish frame whose
usage block reports prompt_tokens, completion_tokens, total_tokens, and
prompt_tokens_details.cached_tokens. The cached count is how you confirm
prompt-cache capture on a streamed call.
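As an illustration of how the frames fit together, here is a minimal sketch that folds decoded chunks into the final text, the finish reason, and the trailing usage block. It assumes the chunks are already parsed into plain dicts in the frame order shown above; the function name collect is hypothetical, and the sample values are the ones from the example frames.

```python
from typing import Iterable, Optional, Tuple


def collect(chunks: Iterable[dict]) -> Tuple[str, Optional[str], Optional[dict]]:
    """Return (text, finish_reason, usage) from decoded chunk dicts."""
    parts = []
    finish_reason = None
    usage = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if choices:
            choice = choices[0]
            # Role and finish frames have no content, so fall back to "".
            parts.append(choice.get("delta", {}).get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
        elif chunk.get("usage"):
            usage = chunk["usage"]  # trailing usage frame: choices is empty
    return "".join(parts), finish_reason, usage


frames = [
    {"choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "The patch "}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
    {"choices": [], "usage": {"prompt_tokens": 1240, "completion_tokens": 180,
                              "total_tokens": 1420,
                              "prompt_tokens_details": {"cached_tokens": 1024}}},
]
text, finish_reason, usage = collect(frames)
cached = usage["prompt_tokens_details"]["cached_tokens"]
cached_share = cached / usage["prompt_tokens"]  # fraction of input served from cache
print(f"{text!r} {finish_reason} cached {cached}/{usage['prompt_tokens']}")
```

When include_usage is not set, usage stays None, so callers should check for that before reading the cached count.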
Buffered streaming
Zumik buffers the stream. The generation is obtained from the provider in full, the project is charged
on exact usage, and then the real completion is re-emitted as chat.completion.chunk frames.
This is a deliberate tradeoff. Charging happens before the first byte reaches you, so stream: true
delivers real content with correct billing - the usage chunk reports the same tokens you were charged,
not an estimate. The cost is time to first token: you see output once the provider call completes
rather than token by token. Token-by-token passthrough (which would move the charge to after the
stream closes) is a planned follow-up.
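To make the buffered model concrete, the sketch below re-emits an already-complete generation as the frame sequence described earlier. This is an illustration under stated assumptions, not Zumik's implementation: the function name reemit, the piece_size split, and the placeholder id are all hypothetical.

```python
import json
from typing import Iterator, Optional


def reemit(text: str, finish_reason: str, usage: Optional[dict],
           piece_size: int = 16) -> Iterator[str]:
    """Yield SSE lines for a generation that is already complete.

    Illustrative only: piece_size and the base fields are assumptions.
    """
    base = {"id": "chatcmpl-example", "object": "chat.completion.chunk",
            "created": 1750000000, "model": "code.fast"}

    def frame(**extra) -> str:
        return "data: " + json.dumps({**base, **extra})

    def choice(delta: dict, reason: Optional[str]) -> list:
        return [{"index": 0, "delta": delta, "finish_reason": reason}]

    yield frame(choices=choice({"role": "assistant"}, None))        # role frame
    for start in range(0, len(text), piece_size):                    # content frames
        yield frame(choices=choice({"content": text[start:start + piece_size]}, None))
    yield frame(choices=choice({}, finish_reason))                   # finish frame
    if usage is not None:                                            # opt-in usage frame
        yield frame(choices=[], usage=usage)
    yield "data: [DONE]"                                             # terminator


lines = list(reemit("The patch renames the helper.", "stop",
                    {"prompt_tokens": 1240, "completion_tokens": 180,
                     "total_tokens": 1420}))
print(len(lines), "lines emitted")
```

Because the whole text exists before the first frame is produced, the usage frame can carry final counts rather than a running estimate.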
Because billing is settled before the stream opens, a budget or rate-limit rejection surfaces as a normal error response, not as a half-streamed body that gets cut off. The retry rules still apply: do not auto-replay a generation after observable streamed output unless the path supports resumability.
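One way to honor that retry rule on the client is to track whether any streamed output has been handed to the caller and to refuse automatic replay once it has. The sketch below assumes a hypothetical open_stream callable that yields text pieces and uses the built-in ConnectionError as a stand-in for a transport failure; resumable paths are out of scope here.

```python
from typing import Callable, Iterable, List


def stream_with_guarded_retry(open_stream: Callable[[], Iterable[str]],
                              on_text: Callable[[str], None],
                              max_attempts: int = 2) -> None:
    """Replay a failed stream only if no output was observed yet."""
    for attempt in range(1, max_attempts + 1):
        emitted = False
        try:
            for piece in open_stream():
                on_text(piece)
                emitted = True
            return  # stream completed normally
        except ConnectionError:
            if emitted:
                raise  # output already observed: do not auto-replay
            if attempt == max_attempts:
                raise  # out of attempts


attempts = {"n": 0}


def flaky_before_output() -> Iterable[str]:
    # Hypothetical stream source: fails once before any output, then succeeds.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("dropped before first chunk")
    return iter(["The ", "patch"])


received: List[str] = []
stream_with_guarded_retry(flaky_before_output, received.append)
print("".join(received))
```

A failure after the first piece is re-raised to the caller, who can decide whether a manual replay is acceptable.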
On /v1/responses and /v2/responses
The Responses surfaces use the same buffered model: the request is fully resolved and charged, then
the result is returned. The OpenAI-compatible /v1/responses keeps its
body shape exact; the native /v2/responses adds execution_profile
and a formal qos_outcome to the response object. To receive chat output as
chat.completion.chunk frames today, use /v1/chat/completions with stream: true; delivery is
still buffered rather than token by token.