Chat completions
POST /v1/chat/completions. The exact OpenAI chat-completions shape, including streaming over Server-Sent Events and stream_options.include_usage.
Create a chat completion. Request and response bodies are byte-for-byte OpenAI-compatible, so an existing OpenAI client works after a base-URL swap. Zumik-proprietary signal rides on response headers, never the body.
Create a chat completion
POST https://api.zumik.ai/v1/chat/completions
Parameters
model (string, required): The model to use. Either a Zumik alias such as code.fast or auto.balanced, resolved to a concrete provider model at request time, or a concrete provider model name passed through unchanged.
messages (array, required): The conversation so far, as a non-empty array of message objects. Each has a role (system, user, or assistant) and string content. An empty array is rejected with 400.
temperature (number): Sampling temperature. Optional; passed through to the resolved provider.
max_tokens (integer): Maximum tokens to generate. Optional; passed through to the resolved provider.
stream (boolean, default: false): When true, the response streams as Server-Sent Events (text/event-stream) instead of a single JSON body. See streaming.
stream_options (object): Streaming options. Set stream_options.include_usage to true to receive a trailing usage chunk after the content. Only meaningful when stream is true.
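As a rough illustration of how these parameters fit together, the sketch below builds a request body in plain Python. The helper name build_chat_request is hypothetical, and the early ValueError is a client-side convenience assumed here; the documented behavior is only that the API rejects an empty messages array with 400.

```python
from typing import Any, Dict, List, Optional


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    stream: bool = False,
    include_usage: bool = False,
) -> Dict[str, Any]:
    """Build a chat-completions body, omitting optional fields that were not set."""
    if not messages:
        # The API answers an empty messages array with 400; fail before sending.
        raise ValueError("messages must be a non-empty list")
    body: Dict[str, Any] = {"model": model, "messages": messages}
    if temperature is not None:
        body["temperature"] = temperature
    if max_tokens is not None:
        body["max_tokens"] = max_tokens
    if stream:
        body["stream"] = True
        if include_usage:
            # stream_options is only meaningful when stream is true.
            body["stream_options"] = {"include_usage": True}
    return body
```

Leaving unset optional fields out of the body, rather than sending null, keeps the payload close to what an unmodified OpenAI client would send.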
Request
curl https://api.zumik.ai/v1/chat/completions \
  -H "Authorization: Bearer zk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "code.fast",
    "messages": [
      {"role": "system", "content": "You are a terse code reviewer."},
      {"role": "user", "content": "Is this loop off-by-one?"}
    ]
  }'
Response
{
  "id": "chatcmpl-rsp_01jy7n3q8v6m4k2x...",
  "object": "chat.completion",
  "created": 1750000123,
  "model": "code.fast",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Yes, the bound should be < len, not <= len." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 12,
    "total_tokens": 39,
    "prompt_tokens_details": { "cached_tokens": 8 }
  }
}
id (string): The completion id, chatcmpl-....
object (string): Always "chat.completion".
created (integer): Unix timestamp (seconds) when the completion was created.
model (string): The model value from the request, echoed back.
choices (array): The generated choices. Each has an index, a message (role, content), and a finish_reason (stop, length, tool_calls, or content_filter).
usage (object): Token counts: prompt_tokens, completion_tokens, total_tokens, and prompt_tokens_details.cached_tokens (provider-reported cached prefix tokens, which keep reuse measurable).
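One way to make cache reuse measurable from the usage object is to take cached_tokens as a fraction of prompt_tokens. This is a minimal sketch; the function name cached_prompt_fraction is hypothetical, and treating a missing prompt_tokens_details block as zero cached tokens is an assumption.

```python
from typing import Any, Dict


def cached_prompt_fraction(usage: Dict[str, Any]) -> float:
    """Return cached_tokens / prompt_tokens, or 0.0 when it cannot be computed."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # Assumption: an absent or null details block means nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens
```

For the sample response above, this is 8 / 27, so a little under 30% of the prompt was served from a cached prefix.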
The resolved provider, alias release, trace id, and QoS signal arrive on Agent-* response headers.
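Since this section does not list the individual header names, a client can collect everything under the Agent- prefix without hard-coding them. The sketch below assumes the headers are available as a plain name-to-value mapping; the helper name agent_headers is hypothetical.

```python
from typing import Dict, Mapping


def agent_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return only the Agent-* headers, keyed by lowercased header name."""
    return {
        name.lower(): value
        for name, value in headers.items()
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower().startswith("agent-")
    }
```

Lowercasing the keys keeps lookups stable across HTTP libraries that normalize header case differently.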
Streaming
Set stream: true to receive the completion as Server-Sent Events. Each event is a data: line carrying a chat.completion.chunk object, and the stream ends with data: [DONE].
The chunk sequence is: one frame with the role delta, then one or more frames with content deltas (concatenating them reproduces the full text), then a frame with the finish_reason, then [DONE].
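from openai import OpenAI

# Point the OpenAI client at Zumik; only the base URL and key change.
client = OpenAI(base_url="https://api.zumik.ai/v1", api_key="zk_live_...")

stream = client.chat.completions.create(
    model="code.fast",
    messages=[{"role": "user", "content": "Stream a short answer."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The trailing usage chunk has an empty choices list, so guard the index.
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="")
Chunk shape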
data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{"role":"assistant"}}]}
data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{"content":"The bound should be "}}]}
data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
choices[].delta (object): The incremental update. The first frame carries role; subsequent frames carry content; the final frame carries an empty delta with finish_reason set.
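For clients that read the raw event stream instead of using an SDK, the chunk sequence above can be reassembled roughly as follows. This is a sketch under assumptions: the function name assemble_stream is hypothetical, the input is an iterable of already-decoded text lines, and only the choice at index 0 is followed.

```python
import json
from typing import Iterable, List, Optional, Tuple


def assemble_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate content deltas from SSE lines; return (text, finish_reason)."""
    parts: List[str] = []
    finish_reason: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            if choice.get("index", 0) != 0:
                continue
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

Checking for [DONE] before calling json.loads matters, because the sentinel is not valid JSON and would otherwise raise.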
Usage on streamed responses
By default a streamed response does not include a usage block. Set stream_options.include_usage to true to receive one final chunk, sent with an empty choices array and the usage object, just before [DONE]:
data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27,"prompt_tokens_details":{"cached_tokens":0}}}
See the streaming guide for client patterns.
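Because the usage chunk arrives with an empty choices array, code that indexes choices[0] on every chunk will fail on it. A minimal sketch of picking the usage block out of already-parsed chunks, assuming each chunk is a dict and using the hypothetical helper name usage_from_chunks:

```python
from typing import Any, Dict, Iterable, Optional


def usage_from_chunks(chunks: Iterable[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the usage object from the trailing usage chunk, or None if absent."""
    usage: Optional[Dict[str, Any]] = None
    for chunk in chunks:
        # The usage chunk is identified by an empty choices array plus a usage object.
        if not chunk.get("choices") and chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

A None result means the stream was requested without stream_options.include_usage, or ended before the usage chunk arrived.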
Errors
| HTTP | code | When |
|---|---|---|
| 400 | (none) | messages is empty, or the body is malformed. |
| 401 | invalid_api_key | Missing or invalid bearer key. |
| 402 | credits_required | The prepaid credit balance is empty. |
| 403 | region_not_allowed | The resolved region is blocked by regional policy. |
| 429 | quota_exceeded | Project or per-key budget reached. |
| 429 | rate_limit_exceeded | Per-key request-rate limit hit. |
| 504 | deadline_exceeded | A QoS deadline_ms elapsed before the provider responded; not charged. |
See the full error reference.
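A client usually needs to decide which of these errors are worth retrying. The sketch below encodes one possible policy and is an assumption, not documented Zumik behavior: it retries only rate_limit_exceeded and deadline_exceeded, and treats the remaining rows as needing a change to the request, key, credits, or budget. The names RETRYABLE, should_retry, and backoff_seconds are hypothetical.

```python
from typing import Optional

# Assumed policy: only transient conditions are retried.
RETRYABLE = {
    (429, "rate_limit_exceeded"),
    (504, "deadline_exceeded"),
}


def should_retry(status: int, code: Optional[str]) -> bool:
    """Return True when the (HTTP status, error code) pair is in the retry set."""
    return (status, code) in RETRYABLE


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff delay for a zero-based attempt number, capped at cap."""
    return min(cap, base * (2 ** attempt))
```

Keying on the status and code together separates the two 429 rows: a rate limit may clear within seconds, while an exhausted budget will not clear on its own.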
Responses
The OpenAI-compatible Responses API on Zumik. Create, retrieve, delete, and cancel responses, list input items, compact context, and count input tokens.
Embeddings
POST /v1/embeddings. The OpenAI-compatible embeddings shape, taking a single string or an array of strings and returning a data list of vectors.