Fireworks AI

Fireworks through Zumik - the cost- and speed-optimized lane for open-source models, speculative decoding for sub-100ms TTFT, serverless and dedicated tiers, async batch, no prompt caching, and when the broker routes here.

Fireworks AI is the cost- and speed-optimized lane for open-source models (Llama, Qwen, DeepSeek, Mixtral, and fine-tuned variants). It typically costs 3x to 10x less per token than frontier providers for equivalent-quality tasks, and it delivers sub-100ms time-to-first-token (TTFT) on many models through speculative decoding. It is the only first-class provider that offers a dedicated deployment tier and speculative decoding.

It is available on both the managed-provider and BYOK profiles. Requests that resolve here report Agent-Resolved-Provider: fireworks_ai.
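
A client can confirm where a request landed by inspecting that header. The sketch below is a minimal illustration: the header name and value come from this page, while the function name and the assumption that response headers arrive as a plain mapping are hypothetical.

```python
def resolved_on_fireworks(headers: dict[str, str]) -> bool:
    """Return True when the response headers report Fireworks as the resolved provider."""
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("agent-resolved-provider") == "fireworks_ai"


# Example with a hand-built header mapping (not a real response).
print(resolved_on_fireworks({"Agent-Resolved-Provider": "fireworks_ai"}))
```

Most HTTP clients expose response headers as a mapping, so the same check should carry over with little change.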

No prompt caching - this is a different lever

Fireworks does not offer provider-native prompt caching (prompt_cache_supported is false). Its economics come from cheap open-source inference and speed, not a cache-read discount, so the front-load-the-stable-prefix rule that drives the other four providers does not reduce cost here. The levers that matter are the deployment tier and quantization:

Serverless for variable and burst traffic - auto-scaling, pay per token.
Dedicated for sustained high-throughput workloads, cheaper per token once utilization clears the break-even threshold.
Quantization (fp16 / int8 / int4): int8 is a good quality-cost balance; int4 maximizes throughput at acceptable quality loss. Match the choice to what replay confirms for the workload.
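
The break-even threshold for the dedicated tier can be estimated with simple arithmetic. The sketch below assumes dedicated capacity is billed at a flat hourly rate and serverless at a per-million-token price. Both billing shapes, the function names, and the example prices are illustrative assumptions, not published Fireworks or Zumik figures.

```python
def break_even_tokens_per_hour(
    dedicated_usd_per_hour: float,
    serverless_usd_per_million_tokens: float,
) -> float:
    """Tokens per hour at which dedicated and serverless cost the same."""
    # Serverless cost per hour = tokens / 1e6 * price; set equal to the hourly rate.
    return dedicated_usd_per_hour / serverless_usd_per_million_tokens * 1_000_000


def cheaper_tier(
    tokens_per_hour: float,
    dedicated_usd_per_hour: float,
    serverless_usd_per_million_tokens: float,
) -> str:
    """Pick the tier with the lower hourly cost at the given sustained volume."""
    threshold = break_even_tokens_per_hour(
        dedicated_usd_per_hour, serverless_usd_per_million_tokens
    )
    return "dedicated" if tokens_per_hour > threshold else "serverless"


# Hypothetical prices: $4.00/hour dedicated, $0.50 per million tokens serverless.
print(break_even_tokens_per_hour(4.0, 0.5))  # 8000000.0
print(cheaper_tier(10_000_000, 4.0, 0.5))    # dedicated
```

The comparison only holds for sustained load. Bursty traffic that averages below the threshold generally stays cheaper on serverless.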

Speed

Speculative decoding is enabled by default for compatible model sizes and reduces TTFT, which is why Fireworks is the primary speed-optimized path for open-source models. For latency-critical interactive requests on open-source models, this is the route.

Batch

Fireworks supports an async batch path (batch_api_supported is true) for high-throughput offline inference, though without the fixed 50% discount that OpenAI, Anthropic, and Gemini Batch carry. In the batch routing policy, Fireworks serverless real-time is also the lowest-cost real-time fallback when no other batch path applies.

When the broker routes here

Open-source models

The primary lane for Llama, Qwen, DeepSeek, Mixtral, and fine-tuned variants where frontier capability is not required.

Lowest per-token cost

3x to 10x cheaper than frontier providers for equivalent-quality open-source tasks.

Latency-critical interactive

Speculative decoding and the dedicated tier together deliver sub-100ms TTFT on many models.

Sustained throughput

Switch from serverless to dedicated once utilization makes the dedicated tier the cheaper per-token option.
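
The routing conditions above can be summarized as a small decision function. This is only a sketch of the rules as this page describes them, not the broker's actual implementation. The function name and its boolean inputs are hypothetical.

```python
from typing import Optional


def pick_fireworks_tier(
    open_source_model: bool,
    frontier_required: bool,
    dedicated_is_cheaper: bool,
) -> Optional[str]:
    """Return the Fireworks tier to use, or None when this lane does not apply."""
    # The lane covers open-source models where frontier capability is not required.
    if not open_source_model or frontier_required:
        return None
    # Sustained utilization past break-even favors dedicated; otherwise serverless.
    return "dedicated" if dedicated_is_cheaper else "serverless"


print(pick_fireworks_tier(True, False, False))  # serverless
print(pick_fireworks_tier(True, False, True))   # dedicated
```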

At a glance

Capability	Value
Prompt caching	None
Batch API	Async
Context window	131,072 tokens
Multimodal input	No
Live search	No
Dedicated deployment	Yes
Speculative decoding	Yes
Service tiers	serverless, dedicated
Data retention	standard
Regions	us

Manifest revision cap_2026_06_09. The capability manifest records that Fireworks is the only first-class provider with a dedicated tier and speculative decoding, which is what makes it the broker's open-source cost and speed lane. Note that the dedicated tier here is a provider-managed product, not the same thing as self-hosting under BYOC.
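
For illustration, the At a glance values could be held in a typed record like the one below. Only prompt_cache_supported and batch_api_supported are field names taken from this page. Every other field name, the class name, and the fits_context helper are assumptions, as is the idea that the context window bounds prompt and output tokens together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderCapabilities:
    prompt_cache_supported: bool   # field name from this page
    batch_api_supported: bool      # field name from this page
    context_window_tokens: int     # hypothetical field name
    multimodal_input: bool         # hypothetical field name
    live_search: bool              # hypothetical field name
    dedicated_deployment: bool     # hypothetical field name
    speculative_decoding: bool     # hypothetical field name
    service_tiers: tuple[str, ...]
    data_retention: str
    regions: tuple[str, ...]


# Values copied from the At a glance table.
FIREWORKS_AI = ProviderCapabilities(
    prompt_cache_supported=False,
    batch_api_supported=True,
    context_window_tokens=131_072,
    multimodal_input=False,
    live_search=False,
    dedicated_deployment=True,
    speculative_decoding=True,
    service_tiers=("serverless", "dedicated"),
    data_retention="standard",
    regions=("us",),
)


def fits_context(caps: ProviderCapabilities, prompt_tokens: int, max_output_tokens: int) -> bool:
    """Assumes the window covers prompt and output tokens combined."""
    return prompt_tokens + max_output_tokens <= caps.context_window_tokens


print(fits_context(FIREWORKS_AI, 120_000, 8_000))  # True
```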