Prompt caching
Capture provider-native prompt caching through Zumik across OpenAI, Anthropic, Gemini, xAI, and Fireworks - and measure how much you actually reused.
Prompt caching reuses the KV state of a repeated prefix so it bills at a reduced read rate instead of being recomputed. Every provider does it differently. Zumik captures the discount on whichever provider answers and reports how much you actually reused, with an evidence level attached so a prediction is never mistaken for a measured fact.
The universal rule
Keep stable content at the front of the request and push volatile content to the end. A timestamp, a request id, or a freshly shuffled tool list near the top resets the prefix and drops your hit rate to near zero - the single most common cause of a low capture rate.
1. system policy 6. compacted checkpoints
2. developer policy 7. ordered branch history
3. stable tool bundle 8. dynamic retrieval
4. response schema 9. latest user message
5. workspace context 10. latest tool resultThis is exactly what a bundle plus a session encode: the bundle is the stable prefix, the branch history is the ordered middle, and the latest turn is the volatile tail. Get the ordering right once and every provider's cache works in your favor. For the full ordering rules and a linter that catches cache-defeating layout, see Prompt layout.
By provider
How much you can capture, and what you have to do for it, depends on the resolved provider. The
resolved provider comes back on the Agent-Resolved-Provider header. Each provider has a different
mechanism, minimum, and discount - so the same prompt captures different amounts depending on where it
lands.
| Provider | Mechanism | Minimum | Read discount | What you do |
|---|---|---|---|---|
| OpenAI | Automatic exact-prefix | ~1,024 tokens | ~50% on cached input | Nothing. Eligible prefixes cache automatically; extended retention reaches up to 24h on supported models. Keep volatile content off the front. |
| Anthropic | Explicit cache_control | ~1,000 tokens | up to 90% | Place breakpoints on stable blocks only. A cache write costs more than a normal token, so don't mark volatile content. 5-minute TTL by default, 1-hour available on Claude 3.5+. |
| Google Gemini | Implicit | ~1,024 tokens | up to ~75% | Recent prefix overlap is discounted with no breakpoints and no config. Convenient, but capture varies run to run - watch the realized ratio. |
| xAI (Grok) | Cached context | provider-defined | provider-defined | Reuses a cached context across consecutive requests. No batch tier - route background work elsewhere. |
| Fireworks | Tiered | tier-defined | tier-defined | Discount depends on the cache tier the request lands in; a warm prefix hits the cheaper tier. |
Providers do not support active manual cache clearing. A managed-provider cache expires under the
provider's own policy, which is why a purge of managed-provider state
reports a best_effort_expiry guarantee with a concrete expires_at rather than an immediate
physical purge.
OpenAI - automatic
Forgiving: long stable prefixes cache without any markup. The trap is a volatile token near the top that resets the whole prefix.
Anthropic - explicit
The deepest discount (90%), but you must place cache_control breakpoints, and a cache write
costs more than a normal token. Mark only blocks you will reuse.
Gemini - implicit
No breakpoints to manage; discounts apply on recent prefix overlap. Capture is convenient but less predictable, so watch the realized ratio.
xAI / Fireworks
xAI reuses a cached context across consecutive calls; Fireworks discounts by cache tier. Both reward a stable, warm prefix.
Tip
Attaching a subscription credential routes eligible traffic through a Claude
Code (90%) or ChatGPT Codex (50%) allowance at the provider's cache-discounted price - the discount
shows up in cached_tokens just like metered traffic.
How to measure capture
Capture is measurable in three places, from a stock client up to the full waterfall.
On /v1, the usage object reports cached tokens where the provider exposes them, so reuse stays
measurable with a stock OpenAI SDK. On a streamed call the same counts arrive in the trailing usage
chunk when you set stream_options.include_usage (see Streaming):
r = client.chat.completions.create(model="code.balanced", messages=[…])
print(r.usage.prompt_tokens, r.usage.prompt_tokens_details.cached_tokens)On /v2/usage, the full reuse waterfall is available per event with an
evidence level and a cache_tier, so a predicted hit is never reported as a measured one:
curl "https://api.zumik.ai/v2/usage?group_by=provider" \
-H "Authorization: Bearer zk_live_..."{
"input_tokens": 1000,
"cached_tokens": 400,
"realized_reused_tokens": 350,
"reuse_evidence_level": "provider_reported",
"cache_tier": "provider"
}The evidence level grades how much to trust the number, strongest first:
| Evidence level | Meaning |
|---|---|
provider_reported | The provider returned a cached-token count (e.g. OpenAI cached_tokens). |
runtime_confirmed | A self-hosted runtime confirmed the KV reuse directly. |
router_inferred | The router predicted reuse the runtime did not confirm. |
trace_estimated | A diagnostic estimated opportunity from workload similarity. |
unknown | The adapter has no trustworthy signal. |
The cache_tier (provider, gpu, host_ram, nvme, remote_kv, or unknown) says where the
reuse was served from.
Opportunity vs. capture
A repeated prefix is an opportunity, not a guaranteed hit. The gap between what could be reused and what was reused is the missed-opportunity gap, and it is almost always closed by prompt ordering, not new infrastructure. Run a workload diagnostic to see the gap for your traffic before you change anything, and read Reuse metrics for the mental model behind opportunity versus capture.
OpenAI migration guide
Point an existing OpenAI client at Zumik with no body changes - the full /v1 compatibility surface, the base-url swap, what carries over unchanged, the optional Agent-* headers, and a testing checklist.
Prompt layout
The recommended block ordering that preserves a reusable prefix, the reuse-killers that quietly defeat provider caching, and the zumik lint CLI and web prompt-linter that catch them.