
The three identity layers

Logical identity, materialization identity, and KV-realization compatibility: why the same logical state can need different physical caches, and why cache details never reach the public API.

This is the architectural decision the rest of the platform depends on. Get it wrong and cache implementation details leak into your product semantics: handles stop being stable, deletion gets confusing, and "is this reusable?" turns into a question no one can answer. Zumik separates identity into three layers and never collapses them.

Layer 1: logical identity

Logical identity describes the customer-visible reusable state: artifacts, bundles, sessions, branches, and snapshots. It answers "what content does the model see, and in what order?"

Logical identity is independent of provider, model, tokenizer, engine, and GPU topology. A snapshot is the same snapshot whether it runs on OpenAI today or a self-hosted SGLang cluster tomorrow. This is the layer your handles point at.

Layer 2: materialization identity

Materialization identity describes the exact model-visible representation: the token sequence that the prompt compiler, chat template, and tokenizer produce from the logical state. Change the tokenizer or the chat template and the logical content is unchanged, but the materialization is different.

The materialization key is a digest over:

materialization_key =
    snapshot_id
  + prompt_compiler_revision
  + tokenizer_revision
  + chat_template_revision
  + tool_serialization_revision
  + response_schema_serialization_revision
  + ordered_block_manifest_digest

Fields are length-prefixed before hashing, so ("ab", "c") and ("a", "bc") serialize to different byte strings and do not hash to the same digest. Two materializations are equal only when every field matches.
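The length-prefixing rule can be illustrated with a short sketch. This is not Zumik's implementation: the hash function (SHA-256), the 8-byte big-endian prefix, the helper names, and all field values are assumptions chosen for illustration.

```python
import hashlib


def length_prefixed_digest(fields):
    """Hash an ordered list of strings, writing each field's byte length first."""
    h = hashlib.sha256()  # assumption: the source does not name the hash function
    for field in fields:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # assumed fixed-width length prefix
        h.update(data)
    return h.hexdigest()


def naive_digest(fields):
    """Plain concatenation, shown only to demonstrate the ambiguity it causes."""
    return hashlib.sha256("".join(fields).encode("utf-8")).hexdigest()


# Without length prefixes the field boundary is lost and the inputs collide.
naive_collides = naive_digest(["ab", "c"]) == naive_digest(["a", "bc"])

# With length prefixes the two field lists serialize differently.
prefixed_collides = (
    length_prefixed_digest(["ab", "c"]) == length_prefixed_digest(["a", "bc"])
)

# Hypothetical values for the seven fields, in the order listed above.
materialization_key = length_prefixed_digest([
    "snap_01",          # snapshot_id
    "compiler-r12",     # prompt_compiler_revision
    "tokenizer-r3",     # tokenizer_revision
    "template-r5",      # chat_template_revision
    "tools-r2",         # tool_serialization_revision
    "schema-r1",        # response_schema_serialization_revision
    "manifest-9f2c",    # ordered_block_manifest_digest
])
```

Field order matters here as well: the same values in a different order produce a different digest, which matches the requirement that every field must match for two materializations to be equal.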

Layer 3: KV-realization compatibility

KV-realization compatibility answers a narrower, physical question: can an existing GPU-resident KV cache be reused safely for this request? It extends the materialization digest with the runtime properties that determine whether cached KV blocks are still valid:

kv_compatibility_key =
    materialization_digest
  + resolved_model_revision
  + weight_digest
  + quantization_profile
  + engine_family + engine_version
  + cache_abi_revision
  + attention_backend
  + rope_configuration
  + parallelism_topology
  + block_size
  + cache_format_revision
  + region
  + isolation_namespace_generation

That last field, isolation_namespace_generation, is what a purge increments. Bump it and every prior KV entry in that namespace fails the compatibility check, so stale physical cache cannot be resurrected after a deletion.
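A minimal sketch of how a generation bump invalidates earlier entries, assuming the key is a SHA-256 digest over length-prefixed fields and that the KV cache can be modeled as a dictionary. The `KVProfile` class, the `digest` helper, and every field value are hypothetical.

```python
import hashlib
from dataclasses import astuple, dataclass, replace


def digest(fields):
    """Length-prefixed SHA-256 over the string form of each field (assumed scheme)."""
    h = hashlib.sha256()
    for field in fields:
        data = str(field).encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


@dataclass(frozen=True)
class KVProfile:
    """Hypothetical container for the kv_compatibility_key inputs."""
    materialization_digest: str
    resolved_model_revision: str
    weight_digest: str
    quantization_profile: str
    engine_family: str
    engine_version: str
    cache_abi_revision: str
    attention_backend: str
    rope_configuration: str
    parallelism_topology: str
    block_size: int
    cache_format_revision: str
    region: str
    isolation_namespace_generation: int

    def key(self):
        # astuple preserves field declaration order, so the digest is stable.
        return digest(astuple(self))


before_purge = KVProfile(
    "mat-9f2c", "model-r7", "weights-ab12", "fp8", "sglang", "0.0.0",
    "abi-2", "flash", "rope-default", "tp2", 16, "fmt-1", "us-east", 4,
)

# Stand-in for GPU-resident KV entries, indexed by compatibility key.
kv_cache = {before_purge.key(): "kv-blocks"}

# A purge increments the namespace generation; nothing is deleted by key.
after_purge = replace(
    before_purge,
    isolation_namespace_generation=before_purge.isolation_namespace_generation + 1,
)

hit_before_purge = before_purge.key() in kv_cache
hit_after_purge = after_purge.key() in kv_cache
```

In this sketch the old entry is still physically present after the purge, but no request in the new generation can compute a key that reaches it. Reclaiming the memory is a separate concern from guaranteeing that the entry is never served again.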

The scenarios that matter

Two requests can share the same logical artifact yet still require different physical caches. This table is the whole point of the model.

| Scenario | Same logical? | Same materialization? | Same KV realization? |
|---|---|---|---|
| Same repository instructions, same model and provider | yes | likely | maybe |
| Same instructions, different tokenizer | yes | no | no |
| Same instructions, same model, different quantization | yes | yes | no |
| Same instructions, managed provider vs. BYOC | yes | maybe | no |
| Same session, different branch head | partial | no | no |
| Same artifacts, reordered tool schema | maybe | no | no |

Notice that "same logical" can pair with "no KV realization". That is normal and expected. A reusable handle is a promise about logical identity, never a promise that a physical cache hit will occur. The platform reports the difference through reuse metrics.
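The layered comparison can be sketched as a function that reports the deepest layer at which two requests still match. The `RequestIdentity` type, the `reuse_level` function, and its return labels are hypothetical; the platform's actual reuse metrics are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestIdentity:
    """Hypothetical bundle of the three identity values for one request."""
    logical_id: str
    materialization_key: str
    kv_compatibility_key: str


def reuse_level(a, b):
    """Return the deepest layer at which two requests share identity."""
    if a.logical_id != b.logical_id:
        return "none"
    if a.materialization_key != b.materialization_key:
        return "logical"
    if a.kv_compatibility_key != b.kv_compatibility_key:
        return "materialization"
    return "kv"


base = RequestIdentity("snap_01", "mat-A", "kv-A")

# Same instructions, different tokenizer: only the logical layer matches.
other_tokenizer = RequestIdentity("snap_01", "mat-B", "kv-B")

# Same instructions and model, different quantization: the KV layer differs.
other_quantization = RequestIdentity("snap_01", "mat-A", "kv-C")

levels = {
    "different tokenizer": reuse_level(base, other_tokenizer),
    "different quantization": reuse_level(base, other_quantization),
    "identical runtime": reuse_level(base, base),
}
```

The checks run from Layer 1 down because each key includes the one above it: a materialization mismatch already implies a KV mismatch, so the first difference found determines the result.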

Why this keeps the API clean

Because handles live at Layer 1, they survive everything that happens at Layers 2 and 3. You can switch quantization, move regions, or migrate from a managed provider to your own cluster, and your artifact and session IDs do not change. The cache churns; the contract does not.

It also makes cache invalidation tractable. Any change to a Layer 2 or Layer 3 input produces a different key, so a stale cache can never match a request it should not serve. There is no separate invalidation pass to get wrong: identity is the invalidation rule.
