Reuse metrics

Opportunity versus realized reuse, the metrics that define each, and the five evidence levels that keep a prediction from being mistaken for a measured fact.

This is the distinction the whole product exists to make. A reusable handle is not proof of a cache hit. A repeated prefix is not proof of saved compute. Zumik reports what could have been reused and what actually was reused as two separate numbers, and it tags every realized number with the strength of the evidence behind it.

Opportunity: what was possible

Opportunity metrics come from the request and from prior compatible requests. They describe the ceiling, not the result.

| Metric | Meaning |
| --- | --- |
| input_tokens | Total model-visible input tokens |
| eligible_reuse_tokens | Tokens belonging to reuse-eligible blocks |
| candidate_reuse_tokens | Maximum tokens that could have been reused from prior compatible requests |
| opportunity_reuse_ratio | candidate_reuse_tokens / input_tokens |
| prefix_family_id | Opaque internal identifier for an equivalent reusable prefix family |
| reuse_window_ms | Time since the most relevant compatible prior request |

Realized: what actually happened

Realized metrics come from the provider or the runtime after the request ran. They describe the result.

| Metric | Meaning |
| --- | --- |
| realized_reused_tokens | Tokens confirmed reused by the provider or runtime |
| realized_reuse_ratio | realized_reused_tokens / input_tokens |
| reuse_capture_rate | realized_reused_tokens / candidate_reuse_tokens |
| missed_opportunity_tokens | candidate_reuse_tokens - realized_reused_tokens |
| cache_tier | Where reuse came from: provider, gpu, host_ram, nvme, remote_kv, or unknown |
| prefill_compute_tokens | Tokens actually recomputed, where measurable |

reuse_capture_rate is the number to watch. It answers "of the reuse that was available, how much did we actually get?" High opportunity with a low capture rate means the reuse is available but something, usually prompt ordering or a cache-key choice, is leaving it on the table.
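The derived metrics in the two tables follow directly from three token counts. The sketch below is illustrative only: the ReuseCounts class and derive_metrics function are hypothetical names, not part of Zumik's API, and returning None for a zero denominator is an assumption about how an undefined ratio might be represented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReuseCounts:
    """Hypothetical container for the three raw counts the formulas need."""
    input_tokens: int
    candidate_reuse_tokens: int
    realized_reused_tokens: int


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Assumption: an undefined ratio is reported as None rather than 0.
    return numerator / denominator if denominator > 0 else None


def derive_metrics(c: ReuseCounts) -> dict:
    """Apply the formulas from the opportunity and realized tables."""
    return {
        "opportunity_reuse_ratio": _ratio(c.candidate_reuse_tokens, c.input_tokens),
        "realized_reuse_ratio": _ratio(c.realized_reused_tokens, c.input_tokens),
        "reuse_capture_rate": _ratio(c.realized_reused_tokens, c.candidate_reuse_tokens),
        "missed_opportunity_tokens": c.candidate_reuse_tokens - c.realized_reused_tokens,
    }


if __name__ == "__main__":
    metrics = derive_metrics(ReuseCounts(1000, 660, 410))
    for name, value in metrics.items():
        print(name, value)
```

With 1,000 input tokens, 660 candidate tokens, and 410 realized tokens, the opportunity ratio is 0.66 while the capture rate is 410 / 660, so roughly 38% of the available reuse was missed.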

The capture gap

Opportunity and realized capture rarely match, and the gap between them is where optimization lives.

Total input tokens                 100%
  └─ Eligible reuse tokens          78%
      └─ Candidate reuse tokens     66%
          └─ Realized reused tokens 41%
              └─ Missed opportunity 25%

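Each of the first four tiers is a subset of the one above it. The final line is a remainder rather than a subset: missed opportunity is candidate minus realized (66% - 41% = 25%), which corresponds to a reuse_capture_rate of 41 / 66, or about 62%. That gap is usually closed by fixing how prompts are constructed, before anyone considers new infrastructure. This waterfall is the centerpiece of the workload diagnostic.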
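As a rough illustration of how the waterfall figures relate, the sketch below checks the nesting of the four token tiers and expresses each as a percentage of total input. The function name waterfall and its argument names are hypothetical, and rounding to whole percentages is an assumption made to match the figures shown above.

```python
from typing import List, Tuple


def waterfall(input_tokens: int, eligible: int, candidate: int,
              realized: int) -> List[Tuple[str, int]]:
    """Return (label, percent of total input) rows for the capture-gap waterfall."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    if realized < 0:
        raise ValueError("realized must not be negative")

    tiers = [
        ("Total input tokens", input_tokens),
        ("Eligible reuse tokens", eligible),
        ("Candidate reuse tokens", candidate),
        ("Realized reused tokens", realized),
    ]
    # Each tier must fit inside the tier above it.
    for (upper_name, upper), (lower_name, lower) in zip(tiers, tiers[1:]):
        if lower > upper:
            raise ValueError(f"{lower_name} exceeds {upper_name}")

    # Missed opportunity is the remainder: candidate minus realized.
    rows = tiers + [("Missed opportunity", candidate - realized)]
    return [(name, round(100 * count / input_tokens)) for name, count in rows]


if __name__ == "__main__":
    for label, pct in waterfall(1000, 780, 660, 410):
        print(f"{label:<24}{pct:>4}%")
```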

Evidence levels

A realized-reuse number is only as trustworthy as its source. Every measurement carries an evidence level, ordered from strongest to weakest:

provider_reported

The provider returned cached-token usage directly, for example OpenAI's cached_tokens. This is a measured fact.

runtime_confirmed

A BYOC runtime confirmed KV reuse itself. Also a measured fact, from your own infrastructure.

router_inferred

The router predicted reuse but the runtime did not confirm it. A reasonable estimate, not a confirmation.

trace_estimated

A diagnostic estimated opportunity from workload similarity. Useful for sizing, but the workload itself has not run.

unknown

The adapter has no trustworthy signal. The number is reported as unknown rather than guessed.

Never read a router_inferred or trace_estimated number as if it were provider_reported. The evidence level exists precisely so an estimate is never quietly billed or reported as a confirmed result. Billing's reuse credit is computed from realized, evidence-backed reuse, not from opportunity.
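One way a consumer of these numbers might enforce that rule in code is sketched below. The enum and helper names are hypothetical. Treating only provider_reported and runtime_confirmed as creditable is an assumption drawn from the descriptions above, which call those two levels measured facts; it is not a statement of Zumik's actual billing logic.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The five levels, with a higher value meaning stronger evidence."""
    UNKNOWN = 0
    TRACE_ESTIMATED = 1
    ROUTER_INFERRED = 2
    RUNTIME_CONFIRMED = 3
    PROVIDER_REPORTED = 4


# Assumption: only the two levels described as "a measured fact" count.
MEASURED_LEVELS = frozenset(
    {EvidenceLevel.PROVIDER_REPORTED, EvidenceLevel.RUNTIME_CONFIRMED}
)


def parse_level(name: str) -> EvidenceLevel:
    """Map a level string to the enum; unrecognized strings become UNKNOWN."""
    try:
        return EvidenceLevel[name.upper()]
    except KeyError:
        return EvidenceLevel.UNKNOWN


def creditable_tokens(realized_reused_tokens: int, level_name: str) -> int:
    """Count reused tokens only when the evidence is a measurement."""
    if parse_level(level_name) in MEASURED_LEVELS:
        return realized_reused_tokens
    return 0


if __name__ == "__main__":
    for level in ("provider_reported", "router_inferred", "unknown"):
        print(level, creditable_tokens(410, level))
```

Mapping unrecognized level strings to UNKNOWN mirrors the documented behavior of reporting a number as unknown rather than guessing.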

Reading it on a request

On /v1, realized capture surfaces through the standard usage.prompt_tokens_details.cached_tokens field, so a vanilla OpenAI SDK can read it. On /v2, the full opportunity-plus-realized report is an explicit response field with the evidence level attached.
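For the /v1 case, the field can be read from a parsed response body without any Zumik-specific client. The sketch below works on a plain dictionary shaped like an OpenAI-style chat completion response; the helper names are hypothetical, and using usage.prompt_tokens as the denominator for realized_reuse_ratio is an assumption about how input_tokens maps onto that response. The /v2 report is not shown because its field names are not given here.

```python
from typing import Any, Mapping, Optional


def cached_tokens(response: Mapping[str, Any]) -> Optional[int]:
    """Read usage.prompt_tokens_details.cached_tokens, or None if absent."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    value = details.get("cached_tokens")
    return value if isinstance(value, int) else None


def realized_reuse_ratio(response: Mapping[str, Any]) -> Optional[float]:
    """cached_tokens / prompt_tokens, assuming prompt_tokens equals input_tokens."""
    cached = cached_tokens(response)
    prompt = (response.get("usage") or {}).get("prompt_tokens")
    if cached is None or not isinstance(prompt, int) or prompt <= 0:
        return None
    return cached / prompt


if __name__ == "__main__":
    # Hand-written sample body for illustration, not a captured response.
    sample = {
        "usage": {
            "prompt_tokens": 1000,
            "completion_tokens": 50,
            "prompt_tokens_details": {"cached_tokens": 410},
        }
    }
    print(cached_tokens(sample), realized_reuse_ratio(sample))
```

Returning None when the field is missing keeps an absent signal distinct from a confirmed zero.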
