Prompt layout
The recommended block ordering that preserves a reusable prefix, the reuse-killers that quietly defeat provider caching, and the zumik lint CLI and web prompt-linter that catch them.
Prompt-cache capture lives and dies by layout. Every provider rewards the same shape: keep stable content at the front so the prefix stays byte-identical across requests, and push everything that changes every call to the very end. Get the ordering right once and the discount applies on whichever provider answers. Get a volatile token into the prefix and your hit rate falls to near zero.
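To make the stakes concrete, here is a minimal sketch that compares how much of two consecutive requests matches from the first character. It is only an approximation: providers match on tokens rather than characters, and the prompt strings and the `shared_prefix_ratio` helper are invented for illustration.

```python
from os.path import commonprefix  # works on any list of strings, not just paths

STABLE = "You are a support assistant. Follow the refund policy.\n"


def shared_prefix_ratio(a: str, b: str) -> float:
    """Fraction of `a` that matches `b` starting from the very first character."""
    return len(commonprefix([a, b])) / len(a)


# Volatile value at the front: the two requests diverge almost immediately.
bad_1 = "Now: 2024-05-01T10:00:00Z\n" + STABLE + "User: where is my order?"
bad_2 = "Now: 2024-05-01T10:00:07Z\n" + STABLE + "User: cancel my order"

# Same content, with the volatile value moved to the tail.
good_1 = STABLE + "User: where is my order?\nNow: 2024-05-01T10:00:00Z"
good_2 = STABLE + "User: cancel my order\nNow: 2024-05-01T10:00:07Z"

bad_ratio = shared_prefix_ratio(bad_1, bad_2)
good_ratio = shared_prefix_ratio(good_1, good_2)
```

With the timestamp at the front, the shared prefix ends inside the timestamp itself; with it at the tail, the whole stable block is shared.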
The recommended order
Assemble logical blocks from most stable to most dynamic:
1. system policy
2. developer policy
3. stable tool bundle
4. response schema
5. stable tenant/workspace context
6. compacted checkpoints
7. ordered branch history
8. dynamic retrieval blocks
9. latest user input
10. latest tool result
This is exactly what Zumik state encodes: a bundle is the stable prefix (blocks 1-5), a branch is the ordered middle (blocks 6-7), and the latest turn is the volatile tail (blocks 8-10). OpenAI, Anthropic, and Gemini all recommend stable-first construction for the same reason - it maximizes the cacheable prefix.
Zumik may recommend prefix-friendly construction, but it will not silently reorder semantically meaningful events. Ordering is your decision; the platform measures whether your decision preserves reuse and tells you when it does not.
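Because ordering stays with the caller, the assembly step might look roughly like the sketch below. The kind names, the `Block` type, and both helpers are hypothetical labels for this example, not Zumik identifiers. A stable sort is used so that events of the same kind (history entries, for instance) keep the order you gave them.

```python
from dataclasses import dataclass

# Block kinds in the recommended order, most stable first.
# These names are hypothetical labels for this sketch.
ORDER = [
    "system_policy", "developer_policy", "tool_bundle", "response_schema",
    "tenant_context", "checkpoint", "branch_history", "retrieval",
    "user_input", "tool_result",
]
RANK = {kind: i for i, kind in enumerate(ORDER)}


@dataclass(frozen=True)
class Block:
    kind: str
    text: str


def assemble(blocks: list[Block]) -> list[Block]:
    """Order blocks stable-first. sorted() is a stable sort, so blocks of the
    same kind keep their original relative order."""
    return sorted(blocks, key=lambda b: RANK[b.kind])


def stable_prefix(blocks: list[Block]) -> list[Block]:
    """Blocks 1-5: the part that should be byte-identical across requests."""
    return [b for b in blocks if RANK[b.kind] < 5]


blocks = [
    Block("user_input", "hi"),
    Block("branch_history", "e1"),
    Block("system_policy", "sys"),
    Block("branch_history", "e2"),
    Block("tool_bundle", "tools"),
]
ordered = assemble(blocks)
```

Here `ordered` puts the system policy and tool bundle first, keeps `e1` before `e2`, and leaves the user input last.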
Reuse-killers to avoid
A "reuse-killer" is anything that changes the bytes of the stable prefix from one request to the next. Once the prefix differs, the cache cannot match it, and the discount is gone.
| Reuse-killer | Why it hurts | Fix |
|---|---|---|
| A timestamp or "current date" in the system block | Changes every request, so the prefix never repeats | Move per-request values into the latest user turn |
| A request id, trace id, session id, or UUID in the prefix | Same: a unique token at the top resets the whole prefix | Keep ids out of stable blocks entirely |
| A freshly shuffled or re-serialized tool list | Reordered tool schema is a different prefix even with identical tools | Serialize tools once, deterministically, and reuse the bytes |
| Dynamic retrieval placed before stable blocks | Pushes the stable content past the cacheable boundary | Put retrieval after the stable prefix, before the latest turn |
| The latest user turn is not last | Anything after it cannot be reused | Place the changes-every-request content at the very end |
| A stable prefix below ~1,024 tokens | Below the provider minimum, no cache engages at all | Consolidate instructions, tools, and schema into the prefix |
The single most common cause of a low capture rate is the first row: a volatile value sitting in an otherwise-stable system prompt.
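The tool-list row is the easiest to guard against in code. The sketch below shows one way to serialize tools deterministically and fingerprint the stable prefix so drift can be spotted between requests. The function names and the tool dictionaries are assumptions made for illustration; the bytes you fingerprint must be the bytes you actually send.

```python
import hashlib
import json


def serialize_tools(tools: list[dict]) -> bytes:
    """Serialize deterministically: tools sorted by name, object keys sorted,
    fixed separators so whitespace never varies."""
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")


def prefix_fingerprint(system_text: str, tools: list[dict]) -> str:
    """SHA-256 over the stable prefix; any byte change yields a new digest."""
    h = hashlib.sha256()
    h.update(system_text.encode("utf-8"))
    h.update(b"\x00")  # separator between the two parts
    h.update(serialize_tools(tools))
    return h.hexdigest()


# Hypothetical tool definitions, given in two different orders.
tools_a = [
    {"name": "search", "parameters": {"type": "object", "properties": {}}},
    {"name": "lookup", "parameters": {"properties": {}, "type": "object"}},
]
tools_b = list(reversed(tools_a))

fp_a = prefix_fingerprint("Be helpful.", tools_a)
fp_b = prefix_fingerprint("Be helpful.", tools_b)
fp_volatile = prefix_fingerprint("Be helpful. Now: 2024-05-01T10:00:00Z", tools_a)
```

Reordering the tool list leaves the fingerprint unchanged, while a timestamp in the system text changes it. Logging the fingerprint per request is a cheap way to see when the prefix has drifted.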
Lint your layout
Zumik ships a layout linter that reads an ordered prompt and flags these reuse-killers before you ever spend a token on a cold prefix. It checks layout, not content - whether the structure preserves a reusable prefix, not whether the instructions are any good.
zumik lint (CLI)
zumik lint accepts an OpenAI messages array, a bare message array, or Zumik blocks (the same shape with a kind field):
zumik lint prompt.json
Prompt-layout score: 70/100
Stable-prefix tokens: ~1840
[HIGH] block 0 (system): stable-prefix block contains volatile content (iso timestamp)
  fix: Move per-request values (timestamps, ids, dates) into the latest user turn so the
       prefix stays byte-stable across requests.
[MED ] block 3 (user): UserInput block appears after more dynamic content; this shortens the cacheable prefix
  fix: Reorder so stable content (system, tools, schema, context) precedes history, retrieval,
       and the latest user input.
Add --json for machine-readable output to wire into CI:
{
  "layout_score": 70,
  "stable_prefix_tokens": 1840,
  "findings": [
    {
      "severity": "high",
      "block": 0,
      "role": "system",
      "message": "stable-prefix block contains volatile content (iso timestamp)",
      "fix": "Move per-request values (timestamps, ids, dates) into the latest user turn..."
    }
  ]
}
The layout_score starts at 100 and subtracts per finding: 30 for a high-severity issue (volatile content in the prefix), 15 for a medium (ordering regression), 5 for a low (final block is not the latest turn, or the prefix is below the cache minimum). A clean, cache-friendly layout scores 100.
What the linter detects
Volatile content in the prefix (HIGH)
Timestamps, ISO datetimes, UUIDs, long numeric ids, and time/id keywords found inside a system, tools, schema, or context block.
Ordering regressions (MED)
A stable block appearing after more dynamic content, which shortens the cacheable prefix.
Volatile tail missing (LOW)
The final block is not the latest user input or tool result, so content that could be reused sits after content that changes.
Prefix too short (LOW)
A stable prefix below the ~1,024-token minimum providers require before caching engages.
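To illustrate the first three checks, here is a simplified detector. It is not Zumik's implementation: the regular expressions, role names, and message strings are assumptions, and the prefix-length check is left out because it needs a tokenizer.

```python
import re

# Illustrative patterns only; a real linter would likely be more thorough.
VOLATILE_PATTERNS = {
    "iso timestamp": re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "long numeric id": re.compile(r"\b\d{10,}\b"),
}
STABLE_ROLES = {"system", "tools", "schema", "context"}
TAIL_ROLES = {"user", "tool"}  # latest user input or tool result


def find_issues(blocks: list[dict]) -> list[tuple[str, int, str]]:
    """Return (severity, block index, message) for each layout problem found."""
    findings = []
    seen_dynamic = False
    for i, block in enumerate(blocks):
        if block["role"] in STABLE_ROLES:
            # HIGH: volatile content inside a stable block
            for label, pattern in VOLATILE_PATTERNS.items():
                if pattern.search(block["content"]):
                    findings.append(("high", i, f"volatile content ({label})"))
            # MED: a stable block that comes after more dynamic content
            if seen_dynamic:
                findings.append(("med", i, "stable block after dynamic content"))
        else:
            seen_dynamic = True
    # LOW: the final block is not the latest user input or tool result
    if blocks and blocks[-1]["role"] not in TAIL_ROLES:
        findings.append(("low", len(blocks) - 1, "final block is not the latest turn"))
    return findings


messy = [
    {"role": "system", "content": "Today is 2024-05-01T10:00:00Z. Be helpful."},
    {"role": "user", "content": "hi"},
    {"role": "tools", "content": "[]"},
]
clean = [
    {"role": "system", "content": "Be helpful."},
    {"role": "user", "content": "hi"},
]
messy_findings = find_issues(messy)
clean_findings = find_issues(clean)
```

The `messy` layout yields one finding of each severity, while `clean` yields none.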
Web prompt-linter
The same checks run in the browser at the prompt-linter tool - paste a prompt, read the score and findings, and iterate without the CLI. It is the fastest way to sanity-check a template before shipping it.