Reuse metrics
Opportunity versus realized reuse, the metrics that define each, and the five evidence levels that keep a prediction from being mistaken for a measured fact.
This is the distinction the whole product exists to make. A reusable handle is not proof of a cache hit. A repeated prefix is not proof of saved compute. Zumik reports what could have been reused and what actually was reused as two separate numbers, and it tags every realized number with how strong the evidence behind it is.
Opportunity: what was possible
Opportunity metrics come from the request and from prior compatible requests. They describe the ceiling, not the result.
| Metric | Meaning |
|---|---|
| input_tokens | Total model-visible input tokens |
| eligible_reuse_tokens | Tokens belonging to reuse-eligible blocks |
| candidate_reuse_tokens | Maximum tokens that could have been reused from prior compatible requests |
| opportunity_reuse_ratio | candidate_reuse_tokens / input_tokens |
| prefix_family_id | Opaque internal identifier for an equivalent reusable prefix family |
| reuse_window_ms | Time since the most relevant compatible prior request |
Realized: what actually happened
Realized metrics come from the provider or the runtime after the request ran. They describe the result.
| Metric | Meaning |
|---|---|
| realized_reused_tokens | Tokens confirmed reused by the provider or runtime |
| realized_reuse_ratio | realized_reused_tokens / input_tokens |
| reuse_capture_rate | realized_reused_tokens / candidate_reuse_tokens |
| missed_opportunity_tokens | candidate_reuse_tokens - realized_reused_tokens |
| cache_tier | Where reuse came from: provider, gpu, host_ram, nvme, remote_kv, or unknown |
| prefill_compute_tokens | Tokens actually recomputed, where measurable |
reuse_capture_rate is the number to watch. It answers "of the reuse that was available, how much did we actually get?" A high opportunity with a low capture rate means the reuse is there but something - usually prompt ordering or a cache-key choice - is leaving it on the table.
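The ratios in the two tables follow directly from three raw counts. The sketch below is illustrative only: the `ReuseCounts` class and `derive_metrics` function are hypothetical names, not part of any Zumik SDK, and returning `None` for a zero denominator is an assumption about how an undefined ratio might be represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReuseCounts:
    # Hypothetical container; field names mirror the metric tables.
    input_tokens: int
    candidate_reuse_tokens: int
    realized_reused_tokens: int


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Assumption: an undefined ratio is reported as None, not as 0.
    return numerator / denominator if denominator > 0 else None


def derive_metrics(c: ReuseCounts) -> dict:
    """Compute the derived opportunity and realized metrics from raw counts."""
    return {
        "opportunity_reuse_ratio": _ratio(c.candidate_reuse_tokens, c.input_tokens),
        "realized_reuse_ratio": _ratio(c.realized_reused_tokens, c.input_tokens),
        "reuse_capture_rate": _ratio(c.realized_reused_tokens, c.candidate_reuse_tokens),
        "missed_opportunity_tokens": c.candidate_reuse_tokens - c.realized_reused_tokens,
    }


metrics = derive_metrics(
    ReuseCounts(input_tokens=1000, candidate_reuse_tokens=660, realized_reused_tokens=410)
)
```

With these example counts, the capture rate is 410 / 660, or roughly 0.62: about 62% of the available reuse was actually captured, and 250 tokens of opportunity were missed.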
The capture gap
Opportunity and realized capture rarely match, and the gap between them is where optimization lives.
Total input tokens 100%
└─ Eligible reuse tokens 78%
   └─ Candidate reuse tokens 66%
      ├─ Realized reused tokens 41%
      └─ Missed opportunity 25%

Each tier down to realized reuse is a subset of the one above it, and missed opportunity is the remainder of the candidate tokens (66% - 41% = 25%). The missed-opportunity gap at the bottom is usually closed by fixing how prompts are constructed, before anyone considers new infrastructure. This waterfall is the centerpiece of the workload diagnostic.
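The waterfall percentages can be reproduced from raw token counts. The following is a minimal sketch, assuming a hypothetical `waterfall` helper and example counts chosen to match the figures above. It is not the diagnostic's actual implementation.

```python
def waterfall(input_tokens: int, eligible: int, candidate: int, realized: int):
    """Return (label, percent of input tokens) rows for the capture-gap waterfall."""
    # The tiers must nest: realized within candidate within eligible within input.
    if input_tokens <= 0 or not (0 <= realized <= candidate <= eligible <= input_tokens):
        raise ValueError(
            "expected 0 <= realized <= candidate <= eligible <= input_tokens, input_tokens > 0"
        )
    missed = candidate - realized  # missed_opportunity_tokens
    rows = [
        ("Total input tokens", input_tokens),
        ("Eligible reuse tokens", eligible),
        ("Candidate reuse tokens", candidate),
        ("Realized reused tokens", realized),
        ("Missed opportunity", missed),
    ]
    return [(label, round(100 * count / input_tokens)) for label, count in rows]


rows = waterfall(input_tokens=1000, eligible=780, candidate=660, realized=410)
for label, percent in rows:
    print(f"{label}: {percent}%")
```

The nesting check rejects inputs where a lower tier exceeds the one above it, which would indicate inconsistent counts.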
Evidence levels
A realized-reuse number is only as trustworthy as its source. Every measurement carries an evidence level, ordered from strongest to weakest:
provider_reported
The provider returned cached-token usage directly, for example OpenAI's cached_tokens. This is a measured fact.
runtime_confirmed
A BYOC runtime confirmed KV reuse itself. Also a measured fact, from your own infrastructure.
router_inferred
The router predicted reuse but the runtime did not confirm it. A reasonable estimate, not a confirmation.
trace_estimated
A diagnostic estimated opportunity from workload similarity. Useful for sizing, but the request has not actually run.
unknown
The adapter has no trustworthy signal. The number is reported as unknown rather than guessed.
Never read a router_inferred or trace_estimated number as if it were provider_reported. The evidence level exists precisely so an estimate is never quietly billed or reported as a confirmed result. Billing's reuse credit is computed from realized, evidence-backed reuse, not from opportunity.
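One way to keep estimates from being read as confirmations in client code is to model the levels as an ordered enum and gate on it. This is a sketch under stated assumptions: the `EvidenceLevel` enum, `parse_level`, and `confirmed_reused_tokens` are hypothetical names, and treating only provider_reported and runtime_confirmed as measured follows the descriptions above. It does not describe how billing is actually computed.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    # Higher value = stronger evidence, matching the documented ordering.
    UNKNOWN = 0
    TRACE_ESTIMATED = 1
    ROUTER_INFERRED = 2
    RUNTIME_CONFIRMED = 3
    PROVIDER_REPORTED = 4


# Assumption: only the two levels described as "a measured fact" count as confirmed.
MEASURED = {EvidenceLevel.PROVIDER_REPORTED, EvidenceLevel.RUNTIME_CONFIRMED}


def parse_level(name: str) -> EvidenceLevel:
    """Map a level string to the enum; unrecognized strings fall back to UNKNOWN."""
    try:
        return EvidenceLevel[name.upper()]
    except KeyError:
        return EvidenceLevel.UNKNOWN


def confirmed_reused_tokens(realized_reused_tokens: int, evidence_level: str) -> int:
    """Return the token count only when the evidence is a measured fact, else 0."""
    return realized_reused_tokens if parse_level(evidence_level) in MEASURED else 0
```

For example, `confirmed_reused_tokens(410, "router_inferred")` yields 0, so an inferred number never flows into a total that is meant to contain only confirmed reuse.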
Reading it on a request
On /v1, realized capture surfaces through the standard usage.prompt_tokens_details.cached_tokens field, so a vanilla OpenAI SDK can read it. On /v2, the full opportunity-plus-realized report is an explicit response field with the evidence level attached.
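As a rough illustration of the /v1 path, the sketch below pulls the cached-token count out of a response that has already been decoded into a dictionary. The `cached_tokens` helper and the sample payload are hypothetical. Only the `usage.prompt_tokens_details.cached_tokens` path comes from the text above. The /v2 report field is not shown because its name is not given here.

```python
from typing import Optional


def cached_tokens(response: dict) -> Optional[int]:
    """Read usage.prompt_tokens_details.cached_tokens, or None if it is absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    value = details.get("cached_tokens")
    # Report a missing or malformed field as None rather than guessing 0.
    return value if isinstance(value, int) else None


# Illustrative payload only; real responses carry many more fields.
sample = {
    "usage": {
        "prompt_tokens": 1000,
        "prompt_tokens_details": {"cached_tokens": 410},
    }
}

print(cached_tokens(sample))
```

Returning `None` when the field is missing keeps "no signal" distinct from "zero tokens reused", in the same spirit as the unknown evidence level.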