Fireworks AI
Fireworks through Zumik - the cost- and speed-optimized lane for open-source models, speculative decoding for sub-100ms TTFT, serverless and dedicated tiers, async batch, no prompt caching, and when the broker routes here.
Fireworks AI is the cost- and speed-optimized lane for open-source models (Llama, Qwen, DeepSeek, Mixtral, and fine-tuned variants). It typically costs 3x to 10x less per token than frontier providers for equivalent-quality tasks, and it delivers sub-100ms time-to-first-token (TTFT) on many models through speculative decoding. It is the only first-class provider that offers a dedicated deployment tier and speculative decoding.
It is available on both the managed-provider and BYOK profiles. Requests that resolve here report Agent-Resolved-Provider: fireworks_ai.
No prompt caching - this is a different lever
Fireworks does not do provider-native prompt caching (prompt_cache_supported is false). Its
economics come from cheap open-source inference and speed, not a cache-read discount, so the
front-load-the-stable-prefix rule that drives the other four providers earns no discount here. The
levers that matter are the deployment tier and quantization:
- Serverless for variable and burst traffic - auto-scaling, pay per token.
- Dedicated for sustained high-throughput workloads, cheaper per token once utilization clears the break-even threshold.
- Quantization (fp16 / int8 / int4): int8 is a good quality-cost balance; int4 maximizes throughput at acceptable quality loss. Match the choice to what replay confirms for the workload.
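The serverless-versus-dedicated break-even can be sketched with simple arithmetic. This is a minimal illustration that assumes the dedicated tier is billed per hour at a fixed peak throughput; the function names and every number in the usage note are hypothetical placeholders, not Fireworks or Zumik rates.

```python
def dedicated_cost_per_million(hourly_rate: float,
                               tokens_per_second: float,
                               utilization: float) -> float:
    """Effective USD per 1M tokens on an hourly-billed dedicated deployment.

    Assumption: the deployment costs `hourly_rate` whether or not it is busy,
    and serves `tokens_per_second` at full load.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


def break_even_utilization(hourly_rate: float,
                           tokens_per_second: float,
                           serverless_per_million: float) -> float:
    """Utilization at which dedicated and serverless cost the same per token.

    A result above 1.0 means dedicated never becomes cheaper at these rates.
    """
    full_load_cost = hourly_rate / (tokens_per_second * 3600) * 1_000_000
    return full_load_cost / serverless_per_million
```

With placeholder inputs of 7.2 USD per hour, 2,000 tokens per second, and 2.0 USD per 1M serverless tokens, the break-even utilization is 0.5: below half load serverless is cheaper, above it dedicated is.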
Speed
Speculative decoding is enabled by default for compatible model sizes and reduces TTFT, which is why Fireworks is the primary speed-optimized path for open-source models. For latency-critical interactive requests on open-source models, this is the route.
Batch
Fireworks supports an async batch path (batch_api_supported is true) for high-throughput offline
inference, though without the fixed 50% discount that OpenAI, Anthropic, and Gemini Batch carry. In the
batch routing policy, Fireworks serverless real-time is also the lowest-cost
real-time fallback when no other batch path applies.
When the broker routes here
Open-source models
The primary lane for Llama, Qwen, DeepSeek, Mixtral, and fine-tuned variants where frontier capability is not required.
Lowest per-token cost
3x to 10x cheaper than frontier providers for equivalent-quality open-source tasks.
Latency-critical interactive
Speculative decoding and the dedicated tier together deliver sub-100ms TTFT on many models.
Sustained throughput
Switch serverless to dedicated once utilization makes the dedicated tier the cheaper per-token option.
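The four cases above can be read as a simple eligibility check. The sketch below is illustrative only: the broker's actual routing logic is not shown on this page, and `RouteRequest`, its fields, and `fireworks_route` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Model families this page lists for the lane; matching on a lowercase
# family name is an assumption made for the example.
OPEN_SOURCE_FAMILIES = {"llama", "qwen", "deepseek", "mixtral"}


@dataclass(frozen=True)
class RouteRequest:
    """Hypothetical request summary; these are not Zumik API fields."""
    model_family: str
    needs_frontier_capability: bool = False
    latency_critical: bool = False
    above_dedicated_break_even: bool = False


def fireworks_route(req: RouteRequest) -> Optional[dict]:
    """Return a routing suggestion for Fireworks, or None if the lane does not apply."""
    if req.model_family.lower() not in OPEN_SOURCE_FAMILIES:
        return None
    if req.needs_frontier_capability:
        return None

    reasons = ["open-source model", "lowest per-token cost"]
    if req.latency_critical:
        reasons.append("latency-critical interactive")
    if req.above_dedicated_break_even:
        reasons.append("sustained throughput")

    # Serverless is the default; dedicated only once utilization justifies it.
    tier = "dedicated" if req.above_dedicated_break_even else "serverless"
    return {"provider": "fireworks_ai", "tier": tier, "reasons": reasons}
```

For example, `fireworks_route(RouteRequest("qwen", above_dedicated_break_even=True))` suggests the dedicated tier, while a request for a model family outside the set returns `None`.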
At a glance
| Capability | Value |
|---|---|
| Prompt caching | None |
| Batch API | Async |
| Context window | 131,072 tokens |
| Multimodal input | No |
| Live search | No |
| Dedicated deployment | Yes |
| Speculative decoding | Yes |
| Service tiers | serverless, dedicated |
| Data retention | standard |
| Regions | us |
Manifest revision cap_2026_06_09. The capability manifest records
that Fireworks is the only first-class provider with a dedicated tier and speculative decoding, which
is what makes it the broker's open-source cost and speed lane. Note the dedicated tier here is a
provider-managed product, not the same thing as self-hosting under BYOC.
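For client-side checks, the table above could be mirrored as a typed record. This is a sketch, not the manifest's real schema: only `prompt_cache_supported` and `batch_api_supported` are field names that appear on this page, and the remaining names, along with the `fits_context` helper, are assumptions. The helper also assumes that prompt and output tokens share the same context window.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProviderCapabilities:
    prompt_cache_supported: bool      # name taken from this page
    batch_api_supported: bool         # name taken from this page
    context_window_tokens: int        # hypothetical field name
    multimodal_input: bool            # hypothetical field name
    live_search: bool                 # hypothetical field name
    dedicated_deployment: bool        # hypothetical field name
    speculative_decoding: bool        # hypothetical field name
    service_tiers: Tuple[str, ...]    # hypothetical field name
    data_retention: str               # hypothetical field name
    regions: Tuple[str, ...]          # hypothetical field name


# Values copied from the "At a glance" table.
FIREWORKS = ProviderCapabilities(
    prompt_cache_supported=False,
    batch_api_supported=True,
    context_window_tokens=131_072,
    multimodal_input=False,
    live_search=False,
    dedicated_deployment=True,
    speculative_decoding=True,
    service_tiers=("serverless", "dedicated"),
    data_retention="standard",
    regions=("us",),
)


def fits_context(caps: ProviderCapabilities,
                 prompt_tokens: int,
                 max_output_tokens: int) -> bool:
    """True if prompt plus requested output fits the provider's context window."""
    return prompt_tokens + max_output_tokens <= caps.context_window_tokens
```

A request with 120,000 prompt tokens and 8,000 output tokens fits under this check; one with 130,000 prompt tokens and 2,000 output tokens does not.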