Rate limits

The three layers of rate limiting (Cloudflare edge, Bifrost per-key, and API Core per-project), the 429 response with Retry-After, and the stricter limits on auth endpoints.

Zumik rate-limits at three layers rather than trusting a single choke point. The outermost edge drops abuse before it reaches origin, the gateway enforces per-key throughput, and the API core enforces per-project budget and policy. A request has to pass all three.

The three layers

Cloudflare edge (outermost)

Per-IP thresholds drop or challenge abusive traffic before it reaches origin. Limits are stricter on auth and inference endpoints than on read endpoints, and admin endpoints carry strict per-project limits regardless of IP.

Bifrost gateway (per-key)

The Bifrost gateway enforces per-API-key quotas: requests per minute, tokens per minute, and concurrent streams. Exceeding a quota returns 429 with Retry-After.
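
The page does not describe how Bifrost counts usage internally, so the following is only a minimal sketch of one way to track the three per-key quotas with a 60-second sliding window. `KeyQuota`, `KeyState`, `check_quota`, and every numeric limit are hypothetical names and placeholder values, not Zumik's.

```python
import math
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KeyQuota:
    """Hypothetical per-key limits; the numbers are placeholders."""
    requests_per_minute: int = 60
    tokens_per_minute: int = 10_000
    max_concurrent_streams: int = 4


@dataclass
class KeyState:
    events: deque = field(default_factory=deque)  # (timestamp, tokens) pairs
    open_streams: int = 0  # caller decrements this when a stream closes


def check_quota(state, quota, tokens, now=None, streaming=False):
    """Return (allowed, retry_after_seconds) for one request on one key."""
    now = time.monotonic() if now is None else now
    # Forget events that have aged out of the 60-second window.
    while state.events and now - state.events[0][0] >= 60:
        state.events.popleft()
    used_tokens = sum(t for _, t in state.events)
    over_rate = len(state.events) >= quota.requests_per_minute
    over_tokens = used_tokens + tokens > quota.tokens_per_minute
    if over_rate or over_tokens:
        # Capacity frees up when the oldest event leaves the window.
        oldest = state.events[0][0] if state.events else now
        return False, max(1, math.ceil(60 - (now - oldest)))
    if streaming and state.open_streams >= quota.max_concurrent_streams:
        return False, 1
    state.events.append((now, tokens))
    if streaming:
        state.open_streams += 1
    return True, 0
```

The second element of the result plays the role of the Retry-After value: it is the number of seconds until the oldest counted request ages out of the window.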

API Core (per-project)

Per-project budget limits and per-tenant rate-limit policy are enforced before dispatch to the Execution Broker. A per-key request-rate limit is checked first because it is the cheapest reject. It also bounds abuse that a budget cap cannot: a money cap limits spend, not throughput.
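
The check order can be sketched as below. This is an illustration under assumptions, not API Core's code: `admit`, `Rejection`, the cents-based accounting, and the "spent plus estimated cost" budget rule are all hypothetical. Only the two error codes come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rejection:
    status: int
    code: str


def admit(key_rate_ok: bool, spent_cents: int, est_cost_cents: int,
          budget_cents: int) -> Optional[Rejection]:
    """Return None to dispatch the request, or a Rejection describing the 429."""
    # 1. Cheapest check first: per-key request rate (throughput).
    if not key_rate_ok:
        return Rejection(429, "rate_limit_exceeded")
    # 2. Per-project budget (money). Assumed rule: reject if this request
    #    would push spend past the cap.
    if spent_cents + est_cost_cents > budget_cents:
        return Rejection(429, "quota_exceeded")
    return None
```

A flood of near-free requests shows why both checks exist: it passes the budget check indefinitely but fails the rate check.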

Limits by endpoint class

Endpoint class	Examples	Posture
Auth	login, sign-up, password reset	Strictest. 5 failed logins per IP per 10 minutes; 3 sign-up or reset requests per IP per hour.
Inference	`POST /v1/chat/completions`, `POST /v1/responses`, `POST /v2/responses`	Per-IP burst caps at the edge; per-key requests-per-minute and tokens-per-minute at Bifrost.
Read	`GET /v1/models`, `GET /v2/artifacts`	Higher limits, lower rate-limiting priority.
Admin	`POST /v2/purge-jobs`, `POST /v2/replay-runs`	Strict per-project limits regardless of IP.
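
To make the auth row concrete, here is a minimal sliding-window sketch of "5 failed logins per IP per 10 minutes". How the edge actually counts failures is not documented here; `FailedLoginLimiter` is a hypothetical name, and blocking once the fifth failure is recorded is an assumption about how the threshold is applied.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10 * 60  # 10 minutes, from the table
MAX_FAILURES = 5          # failed logins per IP, from the table


class FailedLoginLimiter:
    def __init__(self):
        self._failures = defaultdict(deque)  # ip -> timestamps of failures

    def _prune(self, ip, now):
        q = self._failures[ip]
        # Drop failures that have aged out of the window.
        while q and now - q[0] >= WINDOW_SECONDS:
            q.popleft()
        return q

    def record_failure(self, ip, now):
        self._prune(ip, now).append(now)

    def is_blocked(self, ip, now):
        # Assumption: the IP is blocked once MAX_FAILURES are in the window.
        return len(self._prune(ip, now)) >= MAX_FAILURES
```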

The 429 and Retry-After

When a throughput limit is exceeded, the gateway returns 429 Too Many Requests with a Retry-After header telling you how long to wait. Honor it - back off for at least that many seconds before retrying.

HTTP/1.1 429 Too Many Requests
Retry-After: 12
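
A client can turn the header into a wait time along these lines. The example above uses the delay-seconds form; HTTP also allows Retry-After to carry an HTTP-date, so this sketch handles both. `retry_after_seconds` and its 1-second fallback are hypothetical choices, not part of the Zumik API.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=1.0):
    """Convert a Retry-After header value into seconds to wait."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "12".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```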

A per-key request-rate rejection at API Core returns the OpenAI-shaped error envelope so existing backoff logic handles it:

{
  "error": {
    "message": "Per-key request rate limit exceeded; reduce request rate and retry.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

A 429 rate_limit_exceeded is about throughput, not money. A budget cap returns a different 429 (quota_exceeded); raise the cap or enable overage to fix that one. See Billing and budgets. Both use status 429, so a stock OpenAI client's retry/backoff treats them the same way, even though backing off alone does not clear a budget cap.
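
Because the status code is shared, a client that wants to react differently has to look inside the body. The sketch below assumes the budget rejection uses the same envelope with `quota_exceeded` in the `code` field, which this page implies but does not show. `classify_429` and its return labels are hypothetical.

```python
import json


def classify_429(body: str) -> str:
    """Map a 429 response body to 'backoff', 'raise_cap', or 'unknown'."""
    try:
        code = json.loads(body)["error"]["code"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not the expected envelope shape.
        return "unknown"
    if code == "rate_limit_exceeded":
        return "backoff"    # throughput: wait and retry
    if code == "quota_exceeded":
        return "raise_cap"  # budget: retrying will not help by itself
    return "unknown"
```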

Handling 429 well

Respect Retry-After

Wait at least the header's value before the next attempt. Do not hammer a limited endpoint.

Use exponential backoff with jitter

Past the first retry, back off exponentially and add jitter so a fleet of clients does not retry in lockstep.
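
One common way to combine the two rules is to treat Retry-After as a floor and add a randomized exponential term on top ("full jitter"). The function name, base delay, and cap below are illustrative assumptions, not recommended Zumik settings.

```python
import random


def backoff_delay(attempt, retry_after=0.0, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential term doubles per attempt up to `cap`; multiplying by a
    random factor in [0, 1) spreads clients out. Adding it to `retry_after`
    keeps the wait at or above the server's value.
    """
    exp = min(cap, base * (2 ** attempt))
    return retry_after + rng() * exp
```

Adding the jitter to the header value, rather than taking the larger of the two, keeps clients that received the same Retry-After from all retrying at the same instant.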

Keep retries idempotent

Attach an Agent-Idempotency-Key so a retry after a 429 cannot double-execute if the original was in flight.
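
The important detail is that the key is generated once per logical request and reused on every attempt. A rough sketch, assuming a hypothetical `send(payload, headers)` transport that returns `(status, response_headers, body)`; the UUID key format is also an assumption, since this page does not specify one.

```python
import time
import uuid


def send_with_retries(send, payload, max_attempts=4, sleep=time.sleep):
    """Retry on 429, reusing one idempotency key for every attempt."""
    # One key per logical request, created before the first attempt.
    headers = {"Agent-Idempotency-Key": str(uuid.uuid4())}
    status, body = None, None
    for attempt in range(max_attempts):
        status, resp_headers, body = send(payload, headers)
        if status != 429 or attempt == max_attempts - 1:
            break
        # Assumes the delay-seconds form of Retry-After; defaults to 1 second.
        sleep(float(resp_headers.get("Retry-After", 1)))
    return status, body
```

Generating a fresh key inside the loop would defeat the purpose, because the server would see each retry as a new request.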

Smooth your token-per-minute load

A 429 on tokens-per-minute means token volume, not request count, is the binding limit, so sending fewer requests of the same size in short bursts will not help. Reduce tokens per request (tighter prompts, better layout) or spread load over time.
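
As a back-of-the-envelope illustration, the average spacing needed between requests follows directly from the token budget. The helper name and the figures in the comments are hypothetical; your key's actual tokens-per-minute quota may differ.

```python
def min_spacing_seconds(tokens_per_request: int, tokens_per_minute: int) -> float:
    """Average gap between requests that keeps usage within the token budget."""
    if tokens_per_request > tokens_per_minute:
        raise ValueError("a single request exceeds the per-minute token budget")
    return 60.0 * tokens_per_request / tokens_per_minute


# With an assumed 10,000 tokens-per-minute quota:
# 2,500-token requests need about 15 s between them,
# while 1,250-token requests need about 7.5 s.
```

Halving tokens per request halves the required spacing, which is why trimming prompts helps directly with this limit.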

Errors reference

Every error code, status, and envelope field.

Authentication

Keys, bearer auth, and the auth-endpoint limits.