Zumik
v1 · OpenAI-compatible

Chat completions

POST /v1/chat/completions. The exact OpenAI chat-completions shape, including streaming over Server-Sent Events and stream_options.include_usage.

Creates a chat completion. Request and response bodies are byte-for-byte OpenAI-compatible, so an existing OpenAI client works after a base-URL swap. Zumik-specific signals travel in response headers, never in the body.

Create a chat completion

POST https://api.zumik.ai/v1/chat/completions

Parameters

model · string · required

The model to use. Either a Zumik alias such as code.fast or auto.balanced, resolved to a concrete provider model at request time, or a concrete provider model name passed through unchanged.

messages · array · required

The conversation so far, as a non-empty array of message objects. Each has a role (system, user, or assistant) and string content. An empty array is rejected with 400.

temperature · number

Sampling temperature. Optional; passed through to the resolved provider.

max_tokens · integer

Maximum tokens to generate. Optional; passed through to the resolved provider.

stream · boolean · default: false

When true, the response streams as Server-Sent Events (text/event-stream) instead of a single JSON body. See streaming.

stream_options · object

Streaming options. Set stream_options.include_usage to true to receive a trailing usage chunk after the content. Only meaningful when stream is true.
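
The parameter rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any Zumik SDK: the function name validate_chat_request and the wording of the messages it returns are illustrative assumptions, and the role check only mirrors the roles listed on this page. The server remains the authority on what is accepted.

```python
# Hypothetical client-side pre-check; names here are illustrative, not a Zumik API.
ALLOWED_ROLES = {"system", "user", "assistant"}  # roles listed in the parameter docs


def validate_chat_request(body: dict) -> list:
    """Return a list of problems found in a request body; empty means none found."""
    problems = []

    model = body.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model must be a non-empty string")

    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        # The docs state that an empty array is rejected with 400.
        problems.append("messages must be a non-empty array")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] must be an object")
                continue
            role = msg.get("role")
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append(f"messages[{i}].role is not a listed role")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}].content must be a string")

    # stream_options is documented as only meaningful when stream is true.
    if "stream_options" in body and not body.get("stream", False):
        problems.append("stream_options is only meaningful when stream is true")

    return problems
```

A body that passes returns an empty list; anything else returns one entry per problem, which can be logged before the request is attempted.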

Request

curl
curl https://api.zumik.ai/v1/chat/completions \
  -H "Authorization: Bearer zk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "code.fast",
    "messages": [
      {"role": "system", "content": "You are a terse code reviewer."},
      {"role": "user", "content": "Is this loop off-by-one?"}
    ]
  }'
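
For readers not using curl, the same request can be sketched with Python's standard library. The helper name build_chat_request is an assumption made for illustration; the URL, headers, and body fields are taken from the curl example above. The sketch only builds the request object, so it can be inspected without making a network call.

```python
import json
import urllib.request

API_URL = "https://api.zumik.ai/v1/chat/completions"


def build_chat_request(api_key: str, model: str, messages: list) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending it would look like this (requires a real key and network access):
#   with urllib.request.urlopen(build_chat_request(key, "code.fast", msgs)) as resp:
#       completion = json.load(resp)
```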

Response

{
  "id": "chatcmpl-rsp_01jy7n3q8v6m4k2x...",
  "object": "chat.completion",
  "created": 1750000123,
  "model": "code.fast",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Yes, the bound should be < len, not <= len." },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 27,
    "completion_tokens": 12,
    "total_tokens": 39,
    "prompt_tokens_details": { "cached_tokens": 8 }
  }
}

id · string

The completion id, chatcmpl-....

object · string

Always "chat.completion".

created · integer

Unix timestamp (seconds) when the completion was created.

model · string

The model value from the request, echoed back.

choices · array

The generated choices. Each has an index, a message (role, content), and a finish_reason (stop, length, tool_calls, or content_filter).

usage · object

Token counts: prompt_tokens, completion_tokens, total_tokens, and prompt_tokens_details.cached_tokens, the provider-reported number of prompt tokens served from a cached prefix, which makes cache reuse measurable.
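
As one way to make that reuse measurable, the cached share of the prompt can be computed from the usage object. This is a small sketch; the function name cached_prompt_fraction is an assumption, and it treats a missing prompt_tokens_details block as zero cached tokens.

```python
def cached_prompt_fraction(usage: dict) -> float:
    """Fraction of prompt tokens reported as cached, between 0.0 and 1.0."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    # Assumption: a missing or null details block means nothing was cached.
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


# Using the usage block from the sample response above: 8 of 27 prompt tokens.
sample_usage = {
    "prompt_tokens": 27,
    "completion_tokens": 12,
    "total_tokens": 39,
    "prompt_tokens_details": {"cached_tokens": 8},
}
fraction = cached_prompt_fraction(sample_usage)
```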

The resolved provider, alias release, trace id, and QoS signal arrive on Agent-* response headers.
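
Because this page names only the Agent- prefix and not the individual header names, a client can collect them by prefix. The sketch below assumes any mapping of header names to values; the header names in the usage example are placeholders, not real Zumik headers.

```python
def agent_headers(headers) -> dict:
    """Return only the headers whose names start with 'Agent-' (case-insensitive)."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower().startswith("agent-")
    }


# Placeholder header names for illustration only.
example = agent_headers({
    "Content-Type": "application/json",
    "Agent-Example": "value-1",
    "agent-other": "value-2",
})
```

HTTP header names are case-insensitive, which is why the comparison lowercases the name first.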

Streaming

Set stream: true to receive the completion as Server-Sent Events. Each event is a data: line carrying a chat.completion.chunk object, and the stream ends with data: [DONE].

The chunk sequence is: one frame with the role delta, then one or more frames with content deltas (concatenating them reproduces the full text), then a frame with the finish_reason, then [DONE].

OpenAI SDK
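from openai import OpenAI

# Point the stock OpenAI client at Zumik; only the base URL and key change.
client = OpenAI(base_url="https://api.zumik.ai/v1", api_key="zk_live_...")

stream = client.chat.completions.create(
    model="code.fast",
    messages=[{"role": "user", "content": "Stream a short answer."}],
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    # The trailing usage chunk has an empty choices array, so guard before indexing.
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="")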

Chunk shape

data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{"role":"assistant"}}]}

data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{"content":"The bound should be "}}]}

data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

choices[].delta · object

The incremental update. The first frame carries role; subsequent frames carry content; the final frame carries an empty delta with finish_reason set.
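
The frame sequence described above can also be reassembled without an SDK. The following sketch assumes the stream has already been split into text lines and that only one choice is requested; the function name collect_stream_text is illustrative.

```python
import json


def collect_stream_text(lines):
    """Concatenate content deltas from SSE lines; return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # Assumption: a single choice; multiple choices would need grouping by index.
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

The role-only first frame and the empty final delta contribute no text, so the joined parts are exactly the content deltas in arrival order.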

Usage on streamed responses

By default a streamed response does not include a usage block. Set stream_options.include_usage to true to receive one final chunk, sent with an empty choices array and the usage object, just before [DONE]:

data: {"id":"chatcmpl-rsp_01jy...","object":"chat.completion.chunk","created":1750000123,"model":"code.fast","choices":[],"usage":{"prompt_tokens":18,"completion_tokens":9,"total_tokens":27,"prompt_tokens_details":{"cached_tokens":0}}}
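
A client that parses the stream itself can pick out this trailing chunk by its empty choices array. The sketch below makes the same assumptions as any hand-rolled SSE reader (lines already split, one JSON object per data line); the name usage_from_stream is illustrative.

```python
import json


def usage_from_stream(lines):
    """Return the usage object from the trailing chunk, or None if none was sent."""
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # The usage chunk is the one with an empty choices array and a usage object.
        if not chunk.get("choices") and chunk.get("usage"):
            usage = chunk["usage"]
    return usage
```

When include_usage is not set, no such chunk arrives and the function returns None.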

See the streaming guide for client patterns.

Errors

HTTP  code                  When
400   (none)                messages is empty, or the body is malformed.
401   invalid_api_key       Missing or invalid bearer key.
402   credits_required      The prepaid credit balance is empty.
403   region_not_allowed    The resolved region is blocked by regional policy.
429   quota_exceeded        Project or per-key budget reached.
429   rate_limit_exceeded   Per-key request-rate limit hit.
504   deadline_exceeded     A QoS deadline_ms elapsed before the provider responded; not charged.

See the full error reference.
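
Two of the errors above share HTTP 429, so a client needs the code, not just the status, to decide what to do. The sketch below shows one possible retry policy. Which errors are worth retrying is an assumption made here, not something this page specifies, and the names should_retry and backoff_seconds are illustrative.

```python
from typing import Optional

# Assumption: only transient conditions are retried. An exhausted budget
# (quota_exceeded) will not clear on its own, so it is excluded even though
# it shares status 429 with rate_limit_exceeded.
RETRYABLE = {
    (429, "rate_limit_exceeded"),
    (504, "deadline_exceeded"),
}


def should_retry(status: int, code: Optional[str]) -> bool:
    """Decide whether a failed request is worth retrying under this policy."""
    return (status, code) in RETRYABLE


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff delay for a zero-based attempt number, capped at `cap`."""
    return min(cap, base * (2 ** attempt))
```

How the code value is read from an error response depends on the error body shape, which is covered by the full error reference rather than this page.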
