Prompt caching

Capture provider-native prompt caching through Zumik across OpenAI, Anthropic, Gemini, xAI, and Fireworks - and measure how much you actually reused.

Prompt caching reuses the KV state of a repeated prefix so it bills at a reduced read rate instead of being recomputed. Every provider does it differently. Zumik captures the discount on whichever provider answers and reports how much you actually reused, with an evidence level attached so a prediction is never mistaken for a measured fact.

The universal rule

Keep stable content at the front of the request and push volatile content to the end. A timestamp, a request id, or a freshly shuffled tool list near the top resets the prefix and drops your hit rate to near zero - the single most common cause of a low capture rate.

Recommended order

1. system policy        6. compacted checkpoints
2. developer policy     7. ordered branch history
3. stable tool bundle   8. dynamic retrieval
4. response schema      9. latest user message
5. workspace context   10. latest tool result

This is exactly what a bundle plus a session encode: the bundle is the stable prefix, the branch history is the ordered middle, and the latest turn is the volatile tail. Get the ordering right once and every provider's cache works in your favor. For the full ordering rules and a linter that catches cache-defeating layout, see Prompt layout.

By provider

How much you can capture, and what you have to do for it, depends on the resolved provider. The resolved provider comes back on the Agent-Resolved-Provider header. Each provider has a different mechanism, minimum, and discount - so the same prompt captures different amounts depending on where it lands.

Provider	Mechanism	Minimum	Read discount	What you do
OpenAI	Automatic exact-prefix	~1,024 tokens	~50% on cached input	Nothing. Eligible prefixes cache automatically; extended retention reaches up to 24h on supported models. Keep volatile content off the front.
Anthropic	Explicit `cache_control`	~1,000 tokens	up to 90%	Place breakpoints on stable blocks only. A cache write costs more than a normal token, so don't mark volatile content. 5-minute TTL by default, 1-hour available on Claude 3.5+.
Google Gemini	Implicit	~1,024 tokens	up to ~75%	Recent prefix overlap is discounted with no breakpoints and no config. Convenient, but capture varies run to run - watch the realized ratio.
xAI (Grok)	Cached context	provider-defined	provider-defined	Reuses a cached context across consecutive requests. No batch tier - route background work elsewhere.
Fireworks	Tiered	tier-defined	tier-defined	Discount depends on the cache tier the request lands in; a warm prefix hits the cheaper tier.

Providers do not support active manual cache clearing. A managed-provider cache expires under the provider's own policy, which is why a purge of managed-provider state reports a best_effort_expiry guarantee with a concrete expires_at rather than an immediate physical purge.

OpenAI - automatic

Forgiving: long stable prefixes cache without any markup. The trap is a volatile token near the top that resets the whole prefix.

Anthropic - explicit

The deepest discount (90%), but you must place cache_control breakpoints, and a cache write costs more than a normal token. Mark only blocks you will reuse.

Gemini - implicit

No breakpoints to manage; discounts apply on recent prefix overlap. Capture is convenient but less predictable, so watch the realized ratio.

xAI / Fireworks

xAI reuses a cached context across consecutive calls; Fireworks discounts by cache tier. Both reward a stable, warm prefix.

Tip

Attaching a subscription credential routes eligible traffic through a Claude Code (90%) or ChatGPT Codex (50%) allowance at the provider's cache-discounted price - the discount shows up in cached_tokens just like metered traffic.

How to measure capture

Capture is measurable in three places, from a stock client up to the full waterfall.

On /v1, the usage object reports cached tokens where the provider exposes them, so reuse stays measurable with a stock OpenAI SDK. On a streamed call the same counts arrive in the trailing usage chunk when you set stream_options.include_usage (see Streaming):

r = client.chat.completions.create(model="code.balanced", messages=[…])
print(r.usage.prompt_tokens, r.usage.prompt_tokens_details.cached_tokens)

On /v2/usage, the full reuse waterfall is available per event with an evidence level and a cache_tier, so a predicted hit is never reported as a measured one:

curl "https://api.zumik.ai/v2/usage?group_by=provider" \
  -H "Authorization: Bearer zk_live_..."

{
  "input_tokens": 1000,
  "cached_tokens": 400,
  "realized_reused_tokens": 350,
  "reuse_evidence_level": "provider_reported",
  "cache_tier": "provider"
}

The evidence level grades how much to trust the number, strongest first:

Evidence level	Meaning
`provider_reported`	The provider returned a cached-token count (e.g. OpenAI `cached_tokens`).
`runtime_confirmed`	A self-hosted runtime confirmed the KV reuse directly.
`router_inferred`	The router predicted reuse the runtime did not confirm.
`trace_estimated`	A diagnostic estimated opportunity from workload similarity.
`unknown`	The adapter has no trustworthy signal.

The cache_tier (provider, gpu, host_ram, nvme, remote_kv, or unknown) says where the reuse was served from.

Opportunity vs. capture

A repeated prefix is an opportunity, not a guaranteed hit. The gap between what could be reused and what was reused is the missed-opportunity gap, and it is almost always closed by prompt ordering, not new infrastructure. Run a workload diagnostic to see the gap for your traffic before you change anything, and read Reuse metrics for the mental model behind opportunity versus capture.