Google Gemini
Gemini through Zumik - both implicit and explicit caching at up to a 75% discount, the largest context window in the set at 1M+ tokens, first-class multimodal input, the Batch API at 50% off, manual cache clearing, and when the broker routes here.
Google Gemini is the long-context, multimodal, low-friction option. It is the only first-class provider that does both implicit and explicit caching, it has the largest context window in the set at over 1M tokens, and it is the only one that supports manual cache clearing. For any workload with long repeated prompts and no hard requirement for a specific frontier model, Gemini implicit caching plus long context often delivers the best cost-per-useful-output of the managed providers.
It is available on both the managed-provider and BYOK profiles. Requests that resolve here report Agent-Resolved-Provider: gemini.
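A client can confirm where a request landed by reading that response header. The sketch below assumes the headers are available as a plain mapping from whatever HTTP client you use. The function names are hypothetical and are not part of any Zumik SDK.

```python
from collections.abc import Mapping
from typing import Optional

HEADER = "agent-resolved-provider"  # compared case-insensitively


def resolved_provider(headers: Mapping[str, str]) -> Optional[str]:
    """Return the provider named in Agent-Resolved-Provider, or None if absent."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    for name, value in headers.items():
        if name.lower() == HEADER:
            return value.strip().lower()
    return None


def routed_to_gemini(headers: Mapping[str, str]) -> bool:
    """True if the broker reported that this request resolved to Gemini."""
    return resolved_provider(headers) == "gemini"


# A plain dict stands in for the response headers from your HTTP client.
print(routed_to_gemini({"Agent-Resolved-Provider": "gemini"}))  # True
print(routed_to_gemini({"Content-Type": "application/json"}))   # False
```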
Caching economics
Gemini gives you two mechanisms, and you can use either or both:
- Implicit caching is automatic for requests whose prefix (1,024+ tokens) matches a recent request, up to a 75% discount with zero client-side instrumentation. This is the lowest-friction cost optimization across all providers - no breakpoints, no cache IDs.
- Explicit caching (the Context Caching API) lets you cache a content block, such as a document, an instruction set, or a tool list, once and reference it by cache ID in later requests. It is billed by how long the cache is stored. Use it when the same large block is referenced across hundreds of requests per day.
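To see what implicit caching does to a single request's bill, the arithmetic can be written out. This is a back-of-envelope sketch. It assumes cached prefix tokens are billed at the full 75% discount, although the page only promises "up to" 75%, so the discount is a parameter. It also assumes a prefix under 1,024 tokens never hits. The $1.00 per million input price in the usage lines is a hypothetical placeholder, not a published rate.

```python
IMPLICIT_MIN_PREFIX = 1_024   # minimum cacheable prefix, in tokens
MAX_READ_DISCOUNT = 0.75      # "up to 75%" off cached input tokens


def implicit_input_cost(prompt_tokens: int, cached_prefix_tokens: int,
                        price_per_million: float,
                        discount: float = MAX_READ_DISCOUNT) -> float:
    """Estimate input cost when part of the prompt hits Gemini's implicit cache.

    Cached prefix tokens are billed at (1 - discount) of the normal input rate and
    the remainder at the full rate. A prefix below the minimum is treated as a miss.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must lie between 0 and the prompt length")
    if cached_prefix_tokens < IMPLICIT_MIN_PREFIX:
        cached_prefix_tokens = 0
    rate = price_per_million / 1_000_000
    uncached = prompt_tokens - cached_prefix_tokens
    return uncached * rate + cached_prefix_tokens * rate * (1 - discount)


# Hypothetical $1.00 per million input tokens; 8,000 of 10,000 tokens hit the cache.
print(round(implicit_input_cost(10_000, 8_000, 1.00), 6))  # 0.004
print(round(implicit_input_cost(10_000, 0, 1.00), 6))      # 0.01
```

Nothing in the request changes between the two calls except how much of the prefix matched a recent request. That is why implicit caching needs no client-side instrumentation.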
| Fact | Value |
|---|---|
| Cache type | Both (implicit + explicit) |
| Minimum cacheable prefix | 1,024 tokens |
| Cache-read discount | up to 75% |
| Default TTL | 3,600 seconds (1h) |
| Extended TTL | 86,400 seconds (24h) |
| Cached-token reporting | Yes |
| Manual cache clear | Supported |
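Explicit caches are billed by storage duration, so the "hundreds of requests per day" rule of thumb reduces to a question: do the read savings within one TTL outweigh what it costs to keep the block stored for that TTL? The sketch below makes several assumptions. It uses a flat storage rate per million tokens per hour, applies the full 75% read discount, and ignores any one-off cost of creating the cache. Both prices in the usage lines are hypothetical placeholders, not Gemini's or Zumik's actual rates.

```python
import math

DEFAULT_TTL_S = 3_600     # 1h
EXTENDED_TTL_S = 86_400   # 24h


def explicit_cache_net_saving(block_tokens: int, reads: int,
                              input_price_per_m: float,
                              storage_price_per_m_hour: float,
                              ttl_s: int = DEFAULT_TTL_S,
                              discount: float = 0.75) -> float:
    """Read savings minus storage cost for one cached block kept for one full TTL.

    Positive means the cache pays for itself. Ignores any one-off creation cost.
    """
    millions = block_tokens / 1_000_000
    read_saving = reads * millions * input_price_per_m * discount
    storage_cost = millions * storage_price_per_m_hour * (ttl_s / 3_600)
    return read_saving - storage_cost


def break_even_reads(block_tokens: int, input_price_per_m: float,
                     storage_price_per_m_hour: float,
                     ttl_s: int = DEFAULT_TTL_S,
                     discount: float = 0.75) -> int:
    """Smallest number of reads within one TTL whose savings cover the storage cost."""
    millions = block_tokens / 1_000_000
    per_read = millions * input_price_per_m * discount
    storage = millions * storage_price_per_m_hour * (ttl_s / 3_600)
    return math.ceil(storage / per_read)


# Hypothetical prices: $1.25/M input tokens, $1.00/M tokens per hour of storage.
print(break_even_reads(100_000, 1.25, 1.00))                  # 2
print(break_even_reads(100_000, 1.25, 1.00, EXTENDED_TTL_S))  # 26
```

A longer TTL raises the number of reads needed to break even, because storage accrues for the whole window. The 24h TTL pays off only when the block is genuinely hot all day.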
Gemini is the one first-class provider here that reports manual_cache_clear_supported. That lets a managed-provider purge give a stronger guarantee for Gemini-cached state than the expiry-bound best effort that caps other managed providers. See retention and purge.
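The difference between the two guarantees can be modeled as a worst-case wait. This is an illustrative model, not Zumik's purge implementation. The PurgeOutcome type and its field names are hypothetical; only the manual_cache_clear_supported flag comes from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurgeOutcome:
    guarantee: str          # "cleared" or "expiry_bound"
    worst_case_wait_s: int  # how long cached state may outlive the purge request


def purge_outcome(manual_cache_clear_supported: bool, ttl_s: int) -> PurgeOutcome:
    """Illustrative model of what a purge can promise for provider-cached state."""
    if manual_cache_clear_supported:
        # The cache can be deleted on request, so nothing waits on a TTL timer.
        return PurgeOutcome("cleared", 0)
    # Otherwise the best a purge can promise is that state is gone once its TTL lapses.
    return PurgeOutcome("expiry_bound", ttl_s)


print(purge_outcome(True, 86_400))  # PurgeOutcome(guarantee='cleared', worst_case_wait_s=0)
print(purge_outcome(False, 3_600))  # PurgeOutcome(guarantee='expiry_bound', worst_case_wait_s=3600)
```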
Batch, long context, and multimodal
The Batch API delivers a 50% cost reduction at up to 24h turnaround for bulk inference with async result delivery. The 1M+ token context window lets a task load a full document and skip a RAG retrieval step entirely. And Gemini has first-class vision, audio, video, and code input - the broker routes multimodal workloads here by default.
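Two quick checks follow from this paragraph. The first is what a job costs through the Batch API. The second is whether a whole document fits in one request, so you can skip retrieval. The sketch below takes the 1,048,576-token window from the capability table. The prompt_overhead_tokens parameter, meaning instructions and the question sent alongside the document, is an illustrative assumption, and the dollar figure is hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 1_048_576
BATCH_DISCOUNT = 0.50


def batch_cost(standard_cost: float) -> float:
    """Cost of the same job submitted through the Batch API at 50% off."""
    return standard_cost * (1 - BATCH_DISCOUNT)


def fits_in_context(document_tokens: int, prompt_overhead_tokens: int = 0) -> bool:
    """True if the whole document plus instructions fits in one request's window."""
    return document_tokens + prompt_overhead_tokens <= CONTEXT_WINDOW_TOKENS


print(batch_cost(120.0))                  # 60.0
print(fits_in_context(900_000, 8_000))    # True
print(fits_in_context(1_045_000, 8_000))  # False
```

If fits_in_context returns False, the document has to be split or retrieved in pieces, and the long-context shortcut no longer applies.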
When the broker routes here
Long repeated prompts
Workloads with prompts of 1,024+ tokens that do not need explicit cache management - implicit caching delivers up to 75% off automatically.
Very long context
Loading a whole document into the 1M+ window to avoid building a retrieval pipeline.
Multimodal input
Vision, audio, video, and code input route here by default.
Large batch jobs
Bulk async inference at the 50% Batch discount.
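The four cases can be written as predicates over a request. This is a simplified sketch, not the broker's actual routing logic. The Request fields are hypothetical, and so are the modality labels, including treating "code" as a modality tag to mirror the list on this page. The sketch only reports which of the cases on this page a request matches.

```python
from dataclasses import dataclass

CONTEXT_WINDOW_TOKENS = 1_048_576
IMPLICIT_MIN_PREFIX = 1_024
GEMINI_DEFAULT_MODALITIES = frozenset({"image", "audio", "video", "code"})


@dataclass
class Request:
    prompt_tokens: int
    repeated_prefix_tokens: int = 0          # prefix shared with recent requests
    needs_explicit_cache: bool = False
    whole_document_in_prompt: bool = False   # loading a full document instead of RAG
    modalities: frozenset = frozenset({"text"})
    batch: bool = False


def gemini_route_reasons(req: Request) -> list[str]:
    """Which of this page's routing cases a request matches (empty if none)."""
    reasons = []
    if req.repeated_prefix_tokens >= IMPLICIT_MIN_PREFIX and not req.needs_explicit_cache:
        reasons.append("long repeated prompt")
    if req.whole_document_in_prompt and req.prompt_tokens <= CONTEXT_WINDOW_TOKENS:
        reasons.append("very long context")
    if req.modalities & GEMINI_DEFAULT_MODALITIES:
        reasons.append("multimodal input")
    if req.batch:
        reasons.append("large batch job")
    return reasons


print(gemini_route_reasons(Request(prompt_tokens=5_000, repeated_prefix_tokens=4_000)))
print(gemini_route_reasons(Request(prompt_tokens=300, modalities=frozenset({"text", "image"}))))
```

A request can match more than one case, for example a large multimodal batch. An empty list means none of these cases apply, not that Gemini is ineligible.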
At a glance
| Capability | Value |
|---|---|
| Context window | 1,048,576 tokens |
| Multimodal input | Yes |
| Live search | No |
| Dedicated deployment | No |
| Service tiers | standard |
| Data retention | standard |
| Regions | us, global |
Manifest revision cap_2026_06_09. The capability manifest records that Gemini has the largest context window in the set and is the only provider with manual cache clearing, both of which shape where the broker sends long-context and multimodal work.
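It can help to picture the manifest entry as a typed record. The sketch below uses field names that are hypothetical, apart from manual_cache_clear_supported, which appears on this page. The values are taken from the At a glance and caching tables. can_serve is an illustrative hard-constraint check, not the broker's eligibility logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityManifest:
    provider: str
    revision: str
    context_window: int
    multimodal_input: bool
    live_search: bool
    dedicated_deployment: bool
    service_tiers: tuple
    data_retention: str
    regions: tuple
    manual_cache_clear_supported: bool


# Values copied from the tables on this page; field names are illustrative.
GEMINI = CapabilityManifest(
    provider="gemini",
    revision="cap_2026_06_09",
    context_window=1_048_576,
    multimodal_input=True,
    live_search=False,
    dedicated_deployment=False,
    service_tiers=("standard",),
    data_retention="standard",
    regions=("us", "global"),
    manual_cache_clear_supported=True,
)


def can_serve(m: CapabilityManifest, prompt_tokens: int,
              multimodal: bool = False, needs_live_search: bool = False) -> bool:
    """Hard-constraint check only; preference among eligible providers is separate."""
    return (prompt_tokens <= m.context_window
            and (m.multimodal_input or not multimodal)
            and (m.live_search or not needs_live_search))


print(can_serve(GEMINI, 600_000, multimodal=True))       # True
print(can_serve(GEMINI, 2_000, needs_live_search=True))  # False
```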
xAI (Grok)
xAI Grok through Zumik - context caching at a 75% read discount, live web-search grounding, Grok-3 and the cost-optimized Grok-3 Mini, no Batch tier, and when the broker routes here.
Fireworks AI
Fireworks through Zumik - the cost- and speed-optimized lane for open-source models, speculative decoding for sub-100ms TTFT, serverless and dedicated tiers, async batch, no prompt caching, and when the broker routes here.