Rate limits
The three layers of rate limiting - Cloudflare edge, Bifrost per-key, and API Core per-project - the 429 with Retry-After, and the stricter limits on auth endpoints.
Zumik rate-limits at three layers rather than trusting a single choke point. The outermost edge drops abuse before it reaches origin, the gateway enforces per-key throughput, and the API core enforces per-project budget and policy. A request has to pass all three.
The three layers
Cloudflare edge (outermost)
Per-IP thresholds drop or challenge abusive traffic before it reaches origin. Limits are stricter on auth and inference endpoints than on read endpoints, and admin endpoints carry strict per-project limits regardless of IP.
Bifrost gateway (per-key)
The Bifrost gateway enforces per-API-key quotas: requests per minute,
tokens per minute, and concurrent streams. Exceeding a quota returns 429 with Retry-After.
API Core (per-project)
Per-project budget limits and per-tenant rate-limit policy are enforced before dispatch to the Execution Broker. A per-key request-rate limit is checked first as the cheapest possible reject - it bounds abuse that a money cap does not, since a money cap is not a throughput cap.
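The ordering described above can be sketched as follows. This is only an illustration under assumptions: the function `admit`, the `Rejection` type, and all parameter names are hypothetical and do not describe API Core's actual implementation. Only the two error codes come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rejection:
    status: int
    code: str


def admit(requests_this_minute: int, rpm_limit: int,
          spent_usd: float, budget_usd: float) -> Optional[Rejection]:
    """Return None to dispatch the request, or a Rejection (hypothetical helper)."""
    # Cheapest check first: a plain counter comparison.
    if requests_this_minute >= rpm_limit:
        return Rejection(429, "rate_limit_exceeded")
    # Budget check second: a money cap alone would not bound throughput.
    if spent_usd >= budget_usd:
        return Rejection(429, "quota_exceeded")
    return None
```

Checking the request rate first means a key that is both over its rate and over budget is rejected as a throughput problem, without touching the budget check at all.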
Limits by endpoint class
| Endpoint class | Examples | Posture |
|---|---|---|
| Auth | login, sign-up, password reset | Strictest. 5 failed logins per IP per 10 minutes; 3 sign-up or reset requests per IP per hour. |
| Inference | POST /v1/chat/completions, POST /v1/responses, POST /v2/responses | Per-IP burst caps at the edge; per-key requests-per-minute and tokens-per-minute at Bifrost. |
| Read | GET /v1/models, GET /v2/artifacts | Higher limits, lower rate-limiting priority. |
| Admin | POST /v2/purge-jobs, POST /v2/replay-runs | Strict per-project limits regardless of IP. |
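To make a rule such as "5 failed logins per IP per 10 minutes" concrete, here is a minimal sliding-window counter. It is a sketch only: this page does not say which algorithm the edge uses, and the class name `SlidingWindowLimiter` and its methods are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_events per key within the trailing window (illustrative)."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps of counted events

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Forget events that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True


# Assumed usage: count failed logins per IP, 5 per 600 seconds.
failed_logins = SlidingWindowLimiter(max_events=5, window_seconds=600)
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would typically supply `time.monotonic()`.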
The 429 and Retry-After
When a throughput limit is exceeded, the gateway returns 429 Too Many Requests with a Retry-After
header telling you how long to wait. Honor it - back off for at least that many seconds before
retrying.
HTTP/1.1 429 Too Many Requests
Retry-After: 12
A per-key request-rate rejection at API Core returns the OpenAI-shaped error envelope so existing backoff logic handles it:
{
  "error": {
    "message": "Per-key request rate limit exceeded; reduce request rate and retry.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}
A 429 rate_limit_exceeded is about throughput, not money. A budget cap returns a different 429 (quota_exceeded); raise the cap or enable overage to fix that one. See Billing and budgets. Both are 429, so a stock OpenAI client's retry/backoff treats them the same way.
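Because both cases arrive as 429, a client that wants to tell them apart has to look at the body. The sketch below assumes that quota_exceeded appears in the same `error.code` field as rate_limit_exceeded, which this page does not confirm. The function name `classify_429` is hypothetical.

```python
import json


def classify_429(body: str) -> str:
    """Return 'throughput', 'budget', or 'unknown' for a 429 response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    error = payload.get("error") if isinstance(payload, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    if code == "rate_limit_exceeded":
        return "throughput"  # slow down and retry
    if code == "quota_exceeded":
        return "budget"  # assumed field location; retrying alone will not help
    return "unknown"
```

A caller might retry on `throughput` and surface `budget` to an operator, since waiting does not raise a cap.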
Handling 429 well
Respect Retry-After
Wait at least the header's value before the next attempt. Do not hammer a limited endpoint.
Use exponential backoff with jitter
Past the first retry, back off exponentially and add jitter so a fleet of clients does not retry in lockstep.
Keep retries idempotent
Attach an Agent-Idempotency-Key so a retry after a 429 cannot
double-execute if the original was in flight.
Smooth your token-per-minute load
A 429 on tokens-per-minute means trimming request rate will not help - reduce tokens per request
(tighter prompts, better layout) or spread load over time.
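The first three practices above can be combined in one retry loop. This is a minimal sketch under assumptions: `send` is a hypothetical callable that performs the request with the given extra headers and returns `(status, headers)`, header names are assumed to be in the exact case shown, and only the seconds form of Retry-After is handled.

```python
import random
import time
import uuid
from typing import Callable, Dict, Optional, Tuple

Response = Tuple[int, Dict[str, str]]


def retry_delay(attempt: int, retry_after: Optional[str], base: float = 1.0,
                cap: float = 60.0, rng: Callable[[], float] = random.random) -> float:
    """Retry-After as a floor, plus jittered exponential backoff on top."""
    try:
        floor = max(0.0, float(retry_after)) if retry_after is not None else 0.0
    except ValueError:
        floor = 0.0  # non-numeric value (e.g. an HTTP date) is not handled here
    # Jitter in [0, min(cap, base * 2**attempt)) keeps clients out of lockstep.
    return floor + rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send: Callable[[Dict[str, str]], Response],
                      max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> Response:
    # One idempotency key for every attempt, so a retry cannot double-execute.
    headers = {"Agent-Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        status, response_headers = send(headers)
        if status != 429 or attempt == max_attempts - 1:
            break
        sleep(retry_delay(attempt, response_headers.get("Retry-After"), rng=rng))
    return status, response_headers
```

Adding the jitter on top of the Retry-After value, rather than taking the larger of the two, guarantees the wait is never shorter than the header asks while still spreading a fleet's retries.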