Quality of service
QoS classes, the request you submit, and the formal outcome object that makes the platform accountable for whether latency and reliability targets were actually met.
Most systems let you ask for a service level. Zumik also tells you, after the fact, whether you got it. The QoS model has two halves: a request that states your target, and a formal outcome that reports admission, completion, whether the target was met, and a stable reason code when it was not. The outcome is what makes the platform accountable rather than aspirational.
Classes
Every request belongs to one of four classes, which set its scheduling intent.
| Class | Intent |
|---|---|
| interactive | User is waiting; prefill latency matters most |
| standard | Normal background-of-app work |
| background | Non-urgent, can tolerate queueing |
| batch | Bulk, latency-insensitive, eligible for Batch API lanes |
The request
{
"qos": {
"class": "interactive",
"target_ttft_ms": 500,
"deadline_ms": 5000,
"priority": 80,
"degrade_policy": "allow_compatible_fallback"
}
}
degrade_policy is the field that deserves thought: forbid means "fail rather than route me to a fallback", while allow_compatible_fallback means "I would rather get a compatible answer from another path than be rejected".
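As a minimal sketch of assembling the request block above, the helper below validates the class and degrade_policy values named in this section before serializing. `QosRequest` and `to_payload` are hypothetical names, not part of any Zumik SDK. Two things in the sketch are assumptions rather than documented behavior: that optional fields may be omitted, and that a first-token target longer than the deadline is a mistake.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Values taken from the Classes table and the degrade_policy description.
QOS_CLASSES = {"interactive", "standard", "background", "batch"}
DEGRADE_POLICIES = {"forbid", "allow_compatible_fallback"}


@dataclass(frozen=True)
class QosRequest:  # hypothetical helper, for illustration only
    qos_class: str
    degrade_policy: str
    target_ttft_ms: Optional[int] = None
    deadline_ms: Optional[int] = None
    priority: Optional[int] = None

    def __post_init__(self) -> None:
        if self.qos_class not in QOS_CLASSES:
            raise ValueError(f"unknown QoS class: {self.qos_class!r}")
        if self.degrade_policy not in DEGRADE_POLICIES:
            raise ValueError(f"unknown degrade_policy: {self.degrade_policy!r}")
        # Assumption: a first-token target later than the deadline is a mistake.
        if (self.target_ttft_ms is not None and self.deadline_ms is not None
                and self.target_ttft_ms > self.deadline_ms):
            raise ValueError("target_ttft_ms exceeds deadline_ms")

    def to_payload(self) -> dict:
        """Build the {"qos": {...}} block, omitting fields that were not set."""
        body = {"class": self.qos_class, "degrade_policy": self.degrade_policy}
        for key in ("target_ttft_ms", "deadline_ms", "priority"):
            value = getattr(self, key)
            if value is not None:
                body[key] = value
        return {"qos": body}


request = QosRequest("interactive", "allow_compatible_fallback",
                     target_ttft_ms=500, deadline_ms=5000, priority=80)
print(json.dumps(request.to_payload(), indent=2))
```

Rejecting an unknown class or policy on the client side keeps a typo from reaching the platform, where it might otherwise be reported as a rejected request.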
Tip
Set deadline_ms honestly. An interactive request with a 200 ms deadline against a model that cannot start that fast will report customer_deadline_too_short rather than silently miss. That is a useful signal, but only if the deadline reflects a real budget.
The outcome
After the request runs, a formal outcome object reports what happened:
{
"qos_outcome": {
"admission": "admitted",
"completion": "completed",
"target_met": true,
"ttft_ms": 382,
"latency_ms": 2710,
"deadline_met": true,
"degraded": false,
"fallback_used": false,
"reason_code": null
}
}
target_met is derived from the request: it is true when ttft_ms <= target_ttft_ms, and deadline_met compares latency_ms against deadline_ms. When a target was not set, the corresponding flag is unknown (null) rather than a guess.
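The derivation of the two flags can be sketched in a few lines. This is an illustration, not the platform's implementation. The function names are hypothetical. The sketch assumes deadline_met uses the same `<=` comparison as target_met, and that a missing measurement also yields unknown.

```python
from typing import Optional


def met(measured_ms: Optional[int], target_ms: Optional[int]) -> Optional[bool]:
    """Return True/False when both values exist, None (unknown) otherwise."""
    if measured_ms is None or target_ms is None:
        return None
    return measured_ms <= target_ms


def derive_flags(qos: dict, outcome: dict) -> dict:
    """Compare measured timings in the outcome with targets from the request."""
    return {
        "target_met": met(outcome.get("ttft_ms"), qos.get("target_ttft_ms")),
        # Assumption: the deadline comparison is inclusive, like the TTFT one.
        "deadline_met": met(outcome.get("latency_ms"), qos.get("deadline_ms")),
    }


qos = {"class": "interactive", "target_ttft_ms": 500, "deadline_ms": 5000}
outcome = {"ttft_ms": 382, "latency_ms": 2710}

print(derive_flags(qos, outcome))
# {'target_met': True, 'deadline_met': True}

print(derive_flags({"class": "batch"}, outcome))
# {'target_met': None, 'deadline_met': None}
```

Returning None instead of False for an unset target keeps "missed" and "never asked" separate in downstream reporting.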
Outcome states
admission: admitted, queued, rejected, expired_before_start. Whether the request ever started running.
completion: completed, failed, cancelled, expired_during_execution. How it ended once it started.
target_met (true/false/unknown), degraded, fallback_used. The quality signals.
Reason codes
When a target is missed or a request is degraded, the outcome carries a stable, machine-readable reason_code. It is a closed enum, so adapters cannot invent free-form text that breaks your dashboards.
| Reason code | Meaning |
|---|---|
| queue_saturation | The admission queue was full |
| provider_rate_limit | The provider rate-limited the call |
| provider_timeout | The provider did not respond in time |
| region_unavailable | No target available in the allowed region |
| alias_no_compatible_target | The alias release had no compatible target under policy |
| cache_miss | Expected reuse did not materialize |
| cache_transfer_slower_than_recompute | Fetching cached KV would have been slower than recomputing |
| fallback_profile_used | A fallback execution profile served the request |
| customer_deadline_too_short | The deadline could not be met under any path |
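Because the set of reason codes is closed, a client can mirror it as an enum and fail loudly on anything outside the table. The sketch below is one way to do that on the consuming side. `ReasonCode`, `parse_reason_code`, and `count_reasons` are hypothetical names, and the member list simply copies the table above.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class ReasonCode(str, Enum):
    QUEUE_SATURATION = "queue_saturation"
    PROVIDER_RATE_LIMIT = "provider_rate_limit"
    PROVIDER_TIMEOUT = "provider_timeout"
    REGION_UNAVAILABLE = "region_unavailable"
    ALIAS_NO_COMPATIBLE_TARGET = "alias_no_compatible_target"
    CACHE_MISS = "cache_miss"
    CACHE_TRANSFER_SLOWER_THAN_RECOMPUTE = "cache_transfer_slower_than_recompute"
    FALLBACK_PROFILE_USED = "fallback_profile_used"
    CUSTOMER_DEADLINE_TOO_SHORT = "customer_deadline_too_short"


def parse_reason_code(raw: Optional[str]) -> Optional[ReasonCode]:
    """Map the wire value to the enum; null stays None, unknown text raises."""
    if raw is None:
        return None
    return ReasonCode(raw)  # raises ValueError for values outside the enum


def count_reasons(outcomes: Iterable[dict]) -> Counter:
    """Tally reason codes across outcome objects, skipping null ones."""
    codes = (parse_reason_code(o.get("reason_code")) for o in outcomes)
    return Counter(code for code in codes if code is not None)


sample = [
    {"reason_code": None},
    {"reason_code": "cache_miss"},
    {"reason_code": "provider_timeout"},
    {"reason_code": "cache_miss"},
]
print(count_reasons(sample)[ReasonCode.CACHE_MISS])
# 2
```

Raising on an unknown value is a deliberate choice in this sketch: a dashboard that silently buckets surprises under "other" hides exactly the drift a closed enum is meant to prevent.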
On /v1
The full outcome object is never inserted into OpenAI-compatible response JSON. A compact subset rides on response headers, with the rest available through the telemetry API:
Agent-QoS-Admission: admitted
Agent-QoS-Target-Met: true
Agent-QoS-Fallback-Used: false
Agent-Trace-Id: trc_...
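A client on /v1 can read the compact QoS headers without touching the response body. The sketch below assumes the headers arrive as a plain name-to-value mapping, as most HTTP clients expose them. `read_qos_headers` is a hypothetical helper, and the trace id in the example is a made-up placeholder. Treating any value other than "true" or "false" as unknown is also an assumption.

```python
from typing import Mapping, Optional


def _tri_state(value: Optional[str]) -> Optional[bool]:
    """Parse "true"/"false"; anything else (absent, "unknown") becomes None."""
    if value is None:
        return None
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


def read_qos_headers(headers: Mapping[str, str]) -> dict:
    """Extract the compact QoS subset from response headers."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "admission": lowered.get("agent-qos-admission"),
        "target_met": _tri_state(lowered.get("agent-qos-target-met")),
        "fallback_used": _tri_state(lowered.get("agent-qos-fallback-used")),
        "trace_id": lowered.get("agent-trace-id"),
    }


example = {
    "Agent-QoS-Admission": "admitted",
    "Agent-QoS-Target-Met": "true",
    "Agent-QoS-Fallback-Used": "false",
    "Agent-Trace-Id": "trc_example",  # placeholder, not a real trace id
}
print(read_qos_headers(example))
```

The returned trace_id is the natural key for fetching the remaining outcome fields from the telemetry API.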