Reuse metrics

Opportunity versus realized reuse, the metrics that define each, and the five evidence levels that keep a prediction from being mistaken for a measured fact.

This is the distinction the whole product exists to make. A reusable handle is not proof of a cache hit. A repeated prefix is not proof of saved compute. Zumik reports what could have been reused and what was actually reused as two separate numbers, and it tags every realized number with the strength of the evidence behind it.

Opportunity: what was possible

Opportunity metrics come from the request and from prior compatible requests. They describe the ceiling, not the result.

Metric	Meaning
`input_tokens`	Total model-visible input tokens
`eligible_reuse_tokens`	Tokens belonging to reuse-eligible blocks
`candidate_reuse_tokens`	Maximum tokens that could have been reused from prior compatible requests
`opportunity_reuse_ratio`	`candidate_reuse_tokens / input_tokens`
`prefix_family_id`	Opaque internal identifier for an equivalent reusable prefix family
`reuse_window_ms`	Time since the most relevant compatible prior request

Realized: what actually happened

Realized metrics come from the provider or the runtime after the request ran. They describe the result.

Metric	Meaning
`realized_reused_tokens`	Tokens confirmed reused by the provider or runtime
`realized_reuse_ratio`	`realized_reused_tokens / input_tokens`
`reuse_capture_rate`	`realized_reused_tokens / candidate_reuse_tokens`
`missed_opportunity_tokens`	`candidate_reuse_tokens - realized_reused_tokens`
`cache_tier`	Where reuse came from: `provider`, `gpu`, `host_ram`, `nvme`, `remote_kv`, or `unknown`
`prefill_compute_tokens`	Tokens actually recomputed, where measurable

reuse_capture_rate is the number to watch. It answers "of the reuse that was available, how much did we actually get?" A high opportunity with a low capture rate means the reuse is there but something, usually prompt ordering or a cache-key choice, is leaving it on the table.
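The ratios in the two tables above reduce to a few divisions. The following is a minimal sketch, not Zumik's implementation: the function names `ratio` and `reuse_metrics` are hypothetical, and returning `None` for a zero denominator is an assumption of this example.

```python
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def reuse_metrics(
    input_tokens: int,
    candidate_reuse_tokens: int,
    realized_reused_tokens: int,
) -> dict:
    """Derive the ratio metrics from raw token counts (hypothetical helper)."""
    return {
        # Ceiling: what could have been reused.
        "opportunity_reuse_ratio": ratio(candidate_reuse_tokens, input_tokens),
        # Result: what was confirmed reused.
        "realized_reuse_ratio": ratio(realized_reused_tokens, input_tokens),
        # Of the available reuse, how much was captured.
        "reuse_capture_rate": ratio(realized_reused_tokens, candidate_reuse_tokens),
        "missed_opportunity_tokens": candidate_reuse_tokens - realized_reused_tokens,
    }


metrics = reuse_metrics(
    input_tokens=1000,
    candidate_reuse_tokens=660,
    realized_reused_tokens=410,
)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

With these illustrative counts, 660 of 1000 tokens were candidates and 410 were realized, so roughly 62% of the available reuse was captured and 250 tokens were missed.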

The capture gap

Opportunity and realized capture rarely match, and the gap between them is where optimization lives.

Total input tokens                 100%
  └─ Eligible reuse tokens          78%
      └─ Candidate reuse tokens     66%
          └─ Realized reused tokens 41%
              └─ Missed opportunity 25%

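Each of the first four tiers is a subset of the one above it. The last row is not a subset of the realized tokens: it is the difference between candidate and realized reuse (66% - 41% = 25%). That missed-opportunity gap is usually closed by fixing how prompts are constructed, before anyone considers new infrastructure. This waterfall is the centerpiece of the workload diagnostic.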
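To make the arithmetic behind the waterfall concrete, here is a small sketch that rebuilds the percentages from token counts and checks the nesting order. The `waterfall` function and the token counts are hypothetical, chosen only so the output matches the percentages shown above.

```python
def waterfall(input_tokens: int, eligible: int, candidate: int, realized: int):
    """Return (label, percent-of-input) rows for the reuse waterfall."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    # input >= eligible >= candidate >= realized must hold for the nesting.
    if not (input_tokens >= eligible >= candidate >= realized >= 0):
        raise ValueError("token counts do not nest")
    missed = candidate - realized  # the capture gap
    rows = [
        ("Total input tokens", input_tokens),
        ("Eligible reuse tokens", eligible),
        ("Candidate reuse tokens", candidate),
        ("Realized reused tokens", realized),
        ("Missed opportunity", missed),
    ]
    return [(label, round(100 * count / input_tokens)) for label, count in rows]


rows = waterfall(input_tokens=1000, eligible=780, candidate=660, realized=410)
for label, pct in rows:
    print(f"{label:<24}{pct:>4}%")
```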

Evidence levels

A realized-reuse number is only as trustworthy as its source. Every measurement carries an evidence level, ordered from strongest to weakest:

provider_reported

The provider returned cached-token usage directly, for example OpenAI's cached_tokens. This is a measured fact.

runtime_confirmed

A BYOC runtime confirmed KV reuse itself. Also a measured fact, from your own infrastructure.

router_inferred

The router predicted reuse but the runtime did not confirm it. A reasonable estimate, not a confirmation.

trace_estimated

A diagnostic estimated the opportunity from workload similarity. Useful for sizing, but the request itself has not run.

unknown

The adapter has no trustworthy signal. The number is reported as unknown rather than guessed.

Never read a router_inferred or trace_estimated number as if it were provider_reported. The evidence level exists precisely so an estimate is never quietly billed or reported as a confirmed result. Billing's reuse credit is computed from realized, evidence-backed reuse, not from opportunity.
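One way to enforce this rule in client code is to model the evidence levels as an ordered enum and gate any credit calculation on it. The sketch below is illustrative only: the numeric ordering values, the `parse_level` and `creditable_tokens` helpers, and the choice to treat only `provider_reported` and `runtime_confirmed` as creditable are assumptions based on the descriptions above, not a statement of how billing is implemented.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """Evidence levels; a higher value means stronger evidence (assumed encoding)."""
    UNKNOWN = 0
    TRACE_ESTIMATED = 1
    ROUTER_INFERRED = 2
    RUNTIME_CONFIRMED = 3
    PROVIDER_REPORTED = 4


# The two levels described above as measured facts.
MEASURED_LEVELS = frozenset(
    {EvidenceLevel.PROVIDER_REPORTED, EvidenceLevel.RUNTIME_CONFIRMED}
)


def parse_level(name: str) -> EvidenceLevel:
    """Map a level string to the enum; unrecognized names become UNKNOWN."""
    try:
        return EvidenceLevel[name.upper()]
    except KeyError:
        return EvidenceLevel.UNKNOWN


def creditable_tokens(realized_reused_tokens: int, level: EvidenceLevel) -> int:
    """Count tokens only when the evidence is a measured fact (assumption)."""
    return realized_reused_tokens if level in MEASURED_LEVELS else 0


print(creditable_tokens(410, parse_level("provider_reported")))
print(creditable_tokens(410, parse_level("router_inferred")))
```

Defaulting unrecognized strings to `UNKNOWN` mirrors the principle that a number with no trustworthy signal is reported as unknown rather than guessed.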

Reading it on a request

On /v1, realized capture surfaces through the standard usage.prompt_tokens_details.cached_tokens field, so a vanilla OpenAI SDK can read it. On /v2, the full opportunity-plus-realized report is an explicit response field with the evidence level attached.
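As a rough illustration of the /v1 path, the sketch below reads cached_tokens from an OpenAI-style chat completion body that has already been decoded into a dictionary. The sample payload is invented for the example, and `read_cached_tokens` is a hypothetical helper. The /v2 report is not shown because its field names are not specified here.

```python
import json
from typing import Optional

# Invented sample body in the OpenAI chat-completions usage shape.
SAMPLE_BODY = """
{
  "usage": {
    "prompt_tokens": 1000,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 410}
  }
}
"""


def read_cached_tokens(response: dict) -> Optional[int]:
    """Return usage.prompt_tokens_details.cached_tokens, or None if absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    # None means "no signal": treat it as unknown, not as zero reuse.
    return details.get("cached_tokens")


body = json.loads(SAMPLE_BODY)
print(read_cached_tokens(body))
print(read_cached_tokens({"usage": {"prompt_tokens": 1000}}))
```

Returning `None` when the field is missing keeps an absent signal distinct from a confirmed zero.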

Workload Reuse Score

How these metrics roll up into a single fit score.

Workload diagnostics

Pull the waterfall and evidence level for real traffic.