
Quality of service

QoS classes, the request you submit, and the formal outcome object that makes the platform accountable for whether latency and reliability targets were actually met.

Most systems let you ask for a service level. Zumik also tells you, after the fact, whether you got it. The QoS model has two halves: a request that states your target, and a formal outcome that reports admission, completion, whether the target was met, and a stable reason code when it was not. The outcome is what makes the platform accountable rather than aspirational.

Classes

Every request belongs to one of four classes, which set its scheduling intent.

Class        Intent
interactive  User is waiting; prefill latency matters most
standard     Normal background-of-app work
background   Non-urgent, can tolerate queueing
batch        Bulk, latency-insensitive, eligible for Batch API lanes

The request

{
  "qos": {
    "class": "interactive",
    "target_ttft_ms": 500,
    "deadline_ms": 5000,
    "priority": 80,
    "degrade_policy": "allow_compatible_fallback"
  }
}

degrade_policy is the one to think about: forbid means "fail rather than route me to a fallback", allow_compatible_fallback means "I would rather get a compatible answer from another path than be rejected".
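
As a rough illustration, the request block can be assembled and sanity-checked client-side before it is sent. This is a minimal sketch, not part of Zumik: QosRequest and to_payload are hypothetical names, and the set of accepted degrade_policy values is assumed to be just the two described above (the real set may be larger).

```python
from dataclasses import dataclass
from typing import Optional

# The four classes listed on this page.
QOS_CLASSES = {"interactive", "standard", "background", "batch"}
# Assumption: only the two policies described above are known here.
DEGRADE_POLICIES = {"forbid", "allow_compatible_fallback"}


@dataclass
class QosRequest:
    """Hypothetical client-side helper mirroring the "qos" request block."""
    qos_class: str
    degrade_policy: str
    target_ttft_ms: Optional[int] = None
    deadline_ms: Optional[int] = None
    priority: Optional[int] = None

    def to_payload(self) -> dict:
        """Validate the enum-like fields and build the JSON-ready dict."""
        if self.qos_class not in QOS_CLASSES:
            raise ValueError(f"unknown QoS class: {self.qos_class!r}")
        if self.degrade_policy not in DEGRADE_POLICIES:
            raise ValueError(f"unknown degrade_policy: {self.degrade_policy!r}")
        body = {"class": self.qos_class, "degrade_policy": self.degrade_policy}
        # Omit unset targets so the outcome can report them as unknown.
        for key in ("target_ttft_ms", "deadline_ms", "priority"):
            value = getattr(self, key)
            if value is not None:
                body[key] = value
        return {"qos": body}
```

Leaving an optional field as None keeps it out of the payload entirely, rather than sending a placeholder value that the platform might treat as a real target.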

Tip

Set deadline_ms honestly. An interactive request with a 200ms deadline against a model that cannot start that fast will report customer_deadline_too_short rather than silently miss. That is a useful signal, but only if the deadline reflects a real budget.

The outcome

After the request runs, a formal outcome object reports what happened:

{
  "qos_outcome": {
    "admission": "admitted",
    "completion": "completed",
    "target_met": true,
    "ttft_ms": 382,
    "latency_ms": 2710,
    "deadline_met": true,
    "degraded": false,
    "fallback_used": false,
    "reason_code": null
  }
}

target_met is derived from the request: it is true when ttft_ms <= target_ttft_ms, and deadline_met compares latency_ms against deadline_ms. When a target was not set, the corresponding flag is unknown (null) rather than a guess.
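
The derivation can be sketched in a few lines of Python, which is handy for cross-checking outcomes in your own telemetry. The function names are hypothetical, and two details are assumptions: that deadline_met uses the same <= comparison as target_met, and that a missing measurement also yields unknown.

```python
from typing import Optional


def met(observed_ms: Optional[int], limit_ms: Optional[int]) -> Optional[bool]:
    """Return True/False when both values are known, else None (unknown)."""
    if observed_ms is None or limit_ms is None:
        return None
    return observed_ms <= limit_ms


def derive_flags(qos: dict, outcome: dict) -> dict:
    """Recompute the two derived flags from a request and its measurements."""
    return {
        # Stated on this page: ttft_ms <= target_ttft_ms.
        "target_met": met(outcome.get("ttft_ms"), qos.get("target_ttft_ms")),
        # Assumption: latency_ms <= deadline_ms (the page only says "compares").
        "deadline_met": met(outcome.get("latency_ms"), qos.get("deadline_ms")),
    }
```

With the request and outcome shown on this page (target 500 against 382, deadline 5000 against 2710), both flags come out true. Dropping target_ttft_ms from the request makes target_met come out as None.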

Outcome states

admission: admitted, queued, rejected, or expired_before_start. Whether the request ever started running.

completion: completed, failed, cancelled, or expired_during_execution. How it ended once it started.

Quality signals: target_met (true/false/unknown), degraded, and fallback_used.

Reason codes

When a target is missed or a request is degraded, the outcome carries a stable, machine-readable reason_code. It is a closed enum, so adapters cannot invent free-form text that breaks your dashboards.

Reason code                           Meaning
queue_saturation                      The admission queue was full
provider_rate_limit                   The provider rate-limited the call
provider_timeout                      The provider did not respond in time
region_unavailable                    No target available in the allowed region
alias_no_compatible_target            The alias release had no compatible target under policy
cache_miss                            Expected reuse did not materialize
cache_transfer_slower_than_recompute  Fetching cached KV would have been slower than recomputing
fallback_profile_used                 A fallback execution profile served the request
customer_deadline_too_short           The deadline could not be met under any path
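
Because the enum is closed, a client can mirror it and fail loudly on anything unexpected. The sketch below is one way to do that in Python; ReasonCode and parse_reason_code are hypothetical names, and the member list is copied from the table above, so it would need updating if the platform adds codes.

```python
from enum import Enum
from typing import Optional


class ReasonCode(str, Enum):
    """Client-side mirror of the reason codes listed on this page."""
    QUEUE_SATURATION = "queue_saturation"
    PROVIDER_RATE_LIMIT = "provider_rate_limit"
    PROVIDER_TIMEOUT = "provider_timeout"
    REGION_UNAVAILABLE = "region_unavailable"
    ALIAS_NO_COMPATIBLE_TARGET = "alias_no_compatible_target"
    CACHE_MISS = "cache_miss"
    CACHE_TRANSFER_SLOWER_THAN_RECOMPUTE = "cache_transfer_slower_than_recompute"
    FALLBACK_PROFILE_USED = "fallback_profile_used"
    CUSTOMER_DEADLINE_TOO_SHORT = "customer_deadline_too_short"


def parse_reason_code(raw: Optional[str]) -> Optional[ReasonCode]:
    """Map the outcome's reason_code to the enum; null stays None."""
    if raw is None:
        return None
    # Enum lookup by value raises ValueError for a string outside the set.
    return ReasonCode(raw)
```

Raising on an unknown string is a deliberate choice here: it surfaces a drift between client and platform immediately instead of letting an unrecognized code flow into dashboards.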

On /v1

The full outcome object is never inserted into OpenAI-compatible response JSON. A compact subset rides on response headers, with the rest available through the telemetry API:

Agent-QoS-Admission: admitted
Agent-QoS-Target-Met: true
Agent-QoS-Fallback-Used: false
Agent-Trace-Id: trc_...
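
A client might read these headers along the following lines. This is a sketch under assumptions: read_qos_headers is a hypothetical helper, the boolean headers are assumed to carry the literal strings true and false, and a header that is absent or carries any other value is mapped to None.

```python
from typing import Mapping, Optional


def parse_bool(value: Optional[str]) -> Optional[bool]:
    """Map "true"/"false" to booleans; anything else is unknown (None)."""
    if value is None:
        return None
    text = value.strip().lower()
    if text == "true":
        return True
    if text == "false":
        return False
    return None


def read_qos_headers(headers: Mapping[str, str]) -> dict:
    """Extract the compact QoS subset from response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "admission": lowered.get("agent-qos-admission"),
        "target_met": parse_bool(lowered.get("agent-qos-target-met")),
        "fallback_used": parse_bool(lowered.get("agent-qos-fallback-used")),
        "trace_id": lowered.get("agent-trace-id"),
    }
```

The trace ID is passed through untouched so it can be used to look up the rest of the outcome through the telemetry API.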
