Bifrost gateway
The optional Tier 1 gateway in front of the Product API Core - auth, quotas, layered rate limits, automatic failover across 23+ providers, the OpenRouter emergency path, and where each limit is enforced.
Bifrost is the optional Tier 1 AI gateway that can sit in front of the Product API Core. It is the centralized entry point for provider execution: it holds the platform's provider keys, normalizes many providers behind one OpenAI-style interface, and handles failover and load balancing so the Core does not carry provider plumbing. It is a deploy and architecture component, not a customer-facing API.
Bifrost is an upstream, not a requirement. The Core runs standalone. When BIFROST_BASE_URL is
set, managed-provider requests are sent OpenAI-style to the gateway; otherwise the Core talks to
provider adapters directly, or in dev returns a deterministic placeholder. Do not substitute LiteLLM
or other middleware - Bifrost is the designated high-performance gateway for all managed and BYOK
provider traffic.
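The routing rule above can be sketched as a small decision function. This is an illustrative sketch, not the Core's actual code: the function name `choose_upstream`, the `is_dev` flag, and the three return labels are assumptions made for the example. Only `BIFROST_BASE_URL` comes from the text.

```python
from typing import Mapping


def choose_upstream(env: Mapping[str, str], is_dev: bool) -> str:
    """Return the path a managed-provider request would take.

    Hypothetical helper: the labels "gateway", "direct_adapter" and
    "placeholder" are illustrative names, not values the Core emits.
    """
    base_url = env.get("BIFROST_BASE_URL", "").strip()
    if base_url:
        # Gateway configured: send the request OpenAI-style to Bifrost.
        return "gateway"
    if is_dev:
        # Assumption: a dev deployment without a gateway answers with
        # the deterministic placeholder.
        return "placeholder"
    # No gateway, not dev: the Core talks to provider adapters directly.
    return "direct_adapter"


if __name__ == "__main__":
    import os

    print(choose_upstream(os.environ, is_dev=False))
```

The environment is passed in as a mapping rather than read inside the function so the decision stays easy to test.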
Who owns what
The split is deliberate: the gateway owns transport and provider access; the Core owns everything that is Zumik-specific.
| Bifrost (Tier 1) | Product API Core (behind it) |
|---|---|
| TLS termination, API-key auth, JWT, request ids | /v1 and /v2 surfaces, project policy |
| Global rate limits and quotas | Model alias resolution and immutable releases |
| Load balancing and automatic provider failover | Session state, branches, snapshots |
| Semantic caching | Diagnostics, replay, purge evidence |
| Access to 23+ providers behind one interface | QoS admission, billing, the usage meter |
In production the Core also sits behind nginx + Cloudflare, which terminate TLS and apply edge and per-IP rate limits. Bifrost is the provider-facing layer, not the public edge.
Configuration
The gateway holds the five first-class provider keys from the environment and load-balances by weight with automatic failover. A minimal config:
{
  "providers": {
    "openai": { "keys": [{ "value": "env.OPENAI_API_KEY", "weight": 1 }] },
    "anthropic": { "keys": [{ "value": "env.ANTHROPIC_API_KEY", "weight": 1 }] },
    "xai": { "keys": [{ "value": "env.XAI_API_KEY", "weight": 1 }] },
    "gemini": { "keys": [{ "value": "env.GEMINI_API_KEY", "weight": 1 }] },
    "fireworks": { "keys": [{ "value": "env.FIREWORKS_API_KEY", "weight": 1 }] }
  },
  "routing": { "automatic_failover": true, "load_balancing": "weighted" },
  "semantic_cache": { "enabled": true, "ttl_seconds": 300 },
  "rate_limits": { "global_requests_per_minute": 6000 }
}

Keys are referenced as env.*, never inlined - the values come from the environment or a secret store (the Terraform provider-secrets module renders them out of band). The five first-class providers configured here are the cost- and speed-optimized adapters; Bifrost reaches the rest of its 23+ backends through the same interface.
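As a rough illustration of the env.* convention, the sketch below resolves key references against an environment mapping and rejects anything inlined. It is a hypothetical pre-flight check, not Bifrost's own loader: `resolve_key` and `resolve_provider_keys` are names invented for this example, and the config shape is taken from the minimal config shown above.

```python
from typing import Dict, List, Mapping

ENV_PREFIX = "env."


def resolve_key(value: str, env: Mapping[str, str]) -> str:
    """Turn an "env.NAME" reference into the secret stored under NAME."""
    if not value.startswith(ENV_PREFIX):
        # Anything that is not an env.* reference is treated as inlined.
        raise ValueError("provider keys must be env.* references, not inline values")
    name = value[len(ENV_PREFIX):]
    secret = env.get(name, "")
    if not secret:
        raise KeyError(f"environment variable {name} is not set")
    return secret


def resolve_provider_keys(config: dict, env: Mapping[str, str]) -> Dict[str, List[dict]]:
    """Resolve every provider key in the config, keeping its weight."""
    resolved: Dict[str, List[dict]] = {}
    for provider, settings in config["providers"].items():
        resolved[provider] = [
            {"value": resolve_key(key["value"], env), "weight": key["weight"]}
            for key in settings["keys"]
        ]
    return resolved
```

Running a check like this before deploy would surface a missing or inlined key early, under the assumption that every key in the config follows the env.* form.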
Providers and execution profiles
The managed-provider profile routes through Bifrost to the platform's contracted accounts across the five primary providers - OpenAI, Anthropic, xAI, Google Gemini, and Fireworks AI - plus the rest of Bifrost's 23+ backends. This is the default profile: fastest onboarding, lowest operational burden, and full access to provider-native prompt caching and Batch APIs where the provider offers them.
The BYOK profile bypasses the platform keys: the Execution Broker calls the resolved provider with the customer's own sealed credential. The BYOC profile routes concentrated, reusable workloads to self-hosted clusters. The profile that served a request is reported in the Agent-Execution-Profile response header.
Failover and the emergency fallback
Bifrost performs automatic multi-provider failover within the managed path. Beyond that, Zumik has one last-resort continuity layer: the OpenRouter emergency fallback. It is intentionally narrow: it fires only after a verified primary failure for a required model path and never for price arbitration, it is gated behind explicit policy, and every use is audited.
The execution mode is reported in the Agent-Execution-Mode response header: live (primary gateway), openrouter_fallback, or placeholder (no gateway configured).
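The gating conditions for the emergency fallback can be written down as a small decision function. This is a sketch under stated assumptions, not the platform's implementation: the function, its boolean inputs, the `PrimaryUnavailable` exception, and the audit-log format are all hypothetical. Only the three mode strings come from the text.

```python
from typing import List


class PrimaryUnavailable(Exception):
    """Primary path failed and the emergency fallback is not permitted."""


def select_execution_mode(
    *,
    gateway_configured: bool,
    primary_failure_verified: bool,
    model_path_required: bool,
    fallback_policy_enabled: bool,
    audit_log: List[str],
) -> str:
    """Pick the value that would be reported as the execution mode."""
    if not gateway_configured:
        return "placeholder"
    if not primary_failure_verified:
        # Normal case: the primary gateway serves the request.
        return "live"
    if model_path_required and fallback_policy_enabled:
        # Every use of the fallback is recorded (log format is illustrative).
        audit_log.append("openrouter_fallback used")
        return "openrouter_fallback"
    # Verified failure, but the path is not required or policy forbids fallback.
    raise PrimaryUnavailable("fallback not permitted for this request")
```

Price never appears as an input, which mirrors the rule that the fallback is a continuity path and not a cost optimization.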
Rate limiting
Limits are layered; no single choke point is trusted:
Cloudflare edge
Per-IP thresholds drop or challenge abusive traffic before it reaches the origin, with stricter limits on auth and inference endpoints. See Terraform for the WAF rules.
Bifrost
Per-API-key request-per-minute and token-per-minute limits, plus the global RPM cap, on inference
endpoints. Returns 429 with Retry-After when exceeded.
API Core
Per-key request-rate limiting and per-project budgets. An exceeded rate returns
429 rate_limit_exceeded, distinct from a budget 429 quota_exceeded.
Read endpoints (GET /v1/models, GET /v2/artifacts/{id}) carry higher limits than the inference endpoints. See troubleshooting for the rate-limit and quota error codes.
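On the client side, the two 429 variants call for different handling. The sketch below is one way a caller might decide whether and when to retry. It assumes the error code has already been extracted from the response body and that a budget 429 should not be retried automatically. Neither assumption is stated in this page, and `retry_delay` is a hypothetical helper.

```python
from typing import Mapping, Optional


def retry_delay(
    status: int,
    error_code: str,
    headers: Mapping[str, str],
    default_delay: float = 1.0,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if a retry is pointless."""
    if status != 429:
        return None
    if error_code == "quota_exceeded":
        # Assumption: a budget limit does not clear by waiting a few seconds.
        return None
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    if raw is None:
        return default_delay
    try:
        return max(0.0, float(raw))
    except ValueError:
        # Retry-After may also be an HTTP-date; this sketch does not parse it.
        return default_delay
```

A caller would sleep for the returned number of seconds before re-sending, and surface the error to the user when the result is None.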
Fireworks AI
Fireworks through Zumik - the cost- and speed-optimized lane for open-source models, speculative decoding for sub-100ms TTFT, serverless and dedicated tiers, async batch, no prompt caching, and when the broker routes here.
BYOC stack
The NVIDIA Dynamo + SGLang + FlashInfer + LMCache + Mooncake + AIBrix data plane - what each component does, what runs where, the infra/byoc Helm chart, replay-gated activation, and the cluster registry. Running it needs your own GPU cluster.