The inference control plane

Why persistent agents pay repeatedly for the same context, and how Zumik measures reuse, preserves state, routes cheaply, reproduces routing, and proves deletion.

An agent does not send one prompt. It sends thousands, and most of them are nearly identical. The system instructions, the tool registry, the response schema, the repository policy, the long-lived documents, the conversation so far: that bulk is stable. What changes between calls is a short suffix at the end. Every time the stable part is re-tokenized and re-prefilled, the customer pays again, in latency and in tokens, for work that was already done.

Zumik is the control plane that sits over that pattern. It does not try to be a faster inference engine or a cheaper model marketplace. It measures how much of your input is genuinely reusable, gives you stable references for the reusable parts, routes each request to the cheapest reliable path, records exactly how that routing decision was made so it can be reproduced, and produces signed evidence when you delete data.

The five jobs

Measure reuse honestly

A repeated prefix is not a cache hit. Zumik reports reuse opportunity and realized capture as separate numbers, each tagged with an evidence level, so a prediction is never mistaken for a fact.

Preserve state as a first-class object

Reusable inputs become artifacts, bundles, sessions, branches, and snapshots: a small set of objects with explicit immutability and ordering rules, not opaque cache keys.

Route to the cheapest reliable path

Managed providers by default, with BYOK and BYOC as evidence-backed escalations. One scheduler owns replica selection per profile; provider-native caching does the heavy lifting before you ever self-host.

Reproduce every decision

Aliases resolve through immutable releases. Snapshots pin ordering and compiler versions. A response pins one snapshot and one alias release, so any past routing decision can be replayed.

Prove deletion

Delete revokes access; purge removes retained representations and returns a signed receipt with a guarantee class that never exceeds what the underlying profile can actually deliver.

Two surfaces, one engine

You reach the control plane through either of two public APIs over the same internal execution system.

Surface	Purpose	Shape
`/v1`	Migrate an existing OpenAI integration with one base-URL change	Byte-for-byte OpenAI request and response objects
`/v2`	Use explicit state, branching, replay, purge, QoS, and rich telemetry	Native Zumik objects with opaque handles

Proprietary behavior never leaks into /v1 JSON. It rides on optional headers, or it lives on /v2. A client that sends no Zumik headers still gets correct OpenAI behavior. See API surfaces for the boundary rules.

The idea that holds it together

The single most important distinction in the platform is that logical state is not physical KV state. Two requests can reference the same logical artifact and still need entirely different physical caches, because a different tokenizer, a different quantization, or a different region all break KV compatibility while leaving the logical content untouched.

Keeping these layers apart is what lets handles stay stable and opaque while caches churn underneath. It is worth reading the identity model before anything else.

The inference control plane

The five jobs

Measure reuse honestly

Preserve state as a first-class object

Route to the cheapest reliable path

Reproduce every decision

Prove deletion

Two surfaces, one engine

The idea that holds it together

Where the concepts connect

Reuse metrics

Workload Reuse Score

Identity model

Artifacts

Bundles

Sessions

Branches

Snapshots

Handles and fingerprints

Model aliases

Agent Hints

QoS

Execution profiles

Capability manifests

Retention and purge

API surfaces

On this page