Workload diagnostics
Capture metadata traces, run the Agent Workload Efficiency Diagnostic, read the Workload Reuse Score and reuse waterfall, and get a recommended execution profile - before you change any infrastructure.
The Agent Workload Efficiency Diagnostic scores how much reuse your traffic actually contains and tells you what to do about it. It runs on metadata - lengths, timing, fingerprints, lineage - so you never have to hand over raw prompt text to find out whether reuse is worth chasing.
The four steps
Capture traces
Export metadata traces from your existing traffic. Each trace records what was requested and what was observed - token counts, timing, the resolved target, and how much was reusable vs. reused. The fastest way to capture without instrumenting your app is the CLI proxy, which sits in front of an OpenAI-compatible endpoint and writes metadata-only traces to a file.
Score the workload
POST /v2/diagnostics computes a Workload Reuse Score (WRS) from six components, plus a band and a
recommended action.
Read the reuse waterfall
The waterfall separates the reuse you could capture from the reuse you did, surfacing the missed-opportunity gap.
Act on the recommended profile
The report names the next step - usually prompt ordering, sometimes provider tuning, rarely a BYOC pilot - and is available as a signed report you can hand to a stakeholder.
Capture traces locally
The zumik CLI runs a metadata-only proxy in front of any OpenAI-compatible endpoint and appends one
trace per request to a JSONL file - no prompt text leaves your machine:
zumik proxy --upstream https://api.openai.com --out zumik-traces.jsonl
# point your client at http://127.0.0.1:8080, run real traffic, then:
zumik diagnose zumik-traces.jsonl
zumik diagnose builds the full report locally, or runs it against a live deployment's
/v2/diagnostics when you pass --api-key. See the CLI reference.
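If you would rather post captured traces yourself than go through zumik diagnose, the JSONL file can be turned into a request body with a few lines of Python. This is a minimal sketch, not the CLI's implementation: it assumes each line of the file is already a trace envelope in the shape POST /v2/diagnostics accepts, and the helper names `parse_jsonl` and `build_diagnostics_body` are hypothetical.

```python
import json


def parse_jsonl(text):
    """Parse JSONL text into a list of trace dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def build_diagnostics_body(traces):
    """Wrap traces in a {"traces": [...]} body; the array must be non-empty."""
    if not traces:
        raise ValueError("traces must be a non-empty array")
    return {"traces": traces}


# Usage (assumes the file written by `zumik proxy --out zumik-traces.jsonl`):
# with open("zumik-traces.jsonl", encoding="utf-8") as fh:
#     body = json.dumps(build_diagnostics_body(parse_jsonl(fh.read())))
```

The resulting JSON string can be sent as the request body with any HTTP client.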
Run it
traces is a non-empty array of metadata trace envelopes. Use privacy_mode: "metadata" by default;
richer modes (tokenized, encrypted_full_fidelity, synthetic) exist for replay
but are not needed to score a workload.
curl https://api.zumik.ai/v2/diagnostics \
-H "Authorization: Bearer zk_live_..." \
-H "Content-Type: application/json" \
-d '{
"traces": [
{
"trace_id": "trc_a1",
"privacy_mode": "metadata",
"prefix_family_id": "pf_agent_main",
"schedule": { "session_id": "ses_1" },
"observed": {
"resolved_target": "anthropic/claude@2025-02-01",
"ttft_ms": 700, "latency_ms": 1200,
"input_tokens": 10000, "candidate_reuse_tokens": 9000,
"realized_reused_tokens": 8500, "output_tokens": 200,
"attempt_count": 1
}
}
]
}'

{
"id": "dgn_01jy…",
"object": "diagnostic",
"project_id": "prj_01jy…",
"created_at": "2026-06-15T12:00:00Z",
"report": {
"object": "diagnostic_report",
"trace_count": 1,
"workload_reuse_score": 78.5,
"band": "strong_fit",
"recommended_action": "prioritize optimization pilot",
"components": { "opportunity_ratio": 0.9, "recurrence_score": 0.8, "retention_locality": 0.7, "ttft_sensitivity": 0.6, "session_continuity": 0.5, "payload_redundancy": 0.4 },
"waterfall": { "total_input_tokens": 10000, "eligible_reuse_tokens": 9000, "candidate_reuse_tokens": 9000, "realized_reused_tokens": 8500, "missed_opportunity_tokens": 500 },
"recommended_profile": "managed_provider_tuning",
"notes": ["Of 9000 candidate reusable tokens, 8500 were captured (94% capture rate)."]
}
}
Reading the Workload Reuse Score
The WRS is a 0-100 score built from six weighted components. A high score means there is reuse to capture; it does not mean you should self-host.
| Component | Weight | What it measures |
|---|---|---|
| opportunity_ratio | 0.35 | Share of input tokens that could be served from cache. |
| recurrence_score | 0.20 | How often the same prefix family recurs. |
| retention_locality | 0.15 | Whether recurrences land close enough in time to stay warm. |
| ttft_sensitivity | 0.15 | How much first-token latency matters for this traffic. |
| session_continuity | 0.10 | How much work stays within a single session. |
| payload_redundancy | 0.05 | Repeated payloads across requests. |
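To make the weighting concrete, here is a rough sketch of how six component values could combine into a 0-100 score. It assumes a plain weighted sum of components in the 0-1 range, scaled by 100. The table gives the weights but not the exact formula, so a live report may apply normalization or rounding that this sketch does not reproduce. The function name is hypothetical.

```python
# Weights copied from the component table; they sum to 1.0.
WEIGHTS = {
    "opportunity_ratio": 0.35,
    "recurrence_score": 0.20,
    "retention_locality": 0.15,
    "ttft_sensitivity": 0.15,
    "session_continuity": 0.10,
    "payload_redundancy": 0.05,
}


def workload_reuse_score(components):
    """Weighted sum of 0-1 components, scaled to 0-100 (assumed formula)."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Because opportunity_ratio carries more than a third of the weight, a workload with little cacheable input cannot score well under this assumption, however often its prefixes recur.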
The score maps to a band and a recommended action:
| Band | Score | Recommended action |
|---|---|---|
| strong_fit | ≥ 70 | Prioritize an optimization pilot. |
| plausible_fit | ≥ 45 | Run diagnostic and provider tuning. |
| limited_fit | ≥ 20 | Optimize prompt construction first. |
| weak_fit | < 20 | Don't pursue BYOC or custom caching. |
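The thresholds in the table can be read as a simple top-down cascade. A minimal sketch (the function name `band_for` is hypothetical; the cut-offs are the ones listed above):

```python
def band_for(score):
    """Map a 0-100 Workload Reuse Score to its band, highest threshold first."""
    if score >= 70:
        return "strong_fit"
    if score >= 45:
        return "plausible_fit"
    if score >= 20:
        return "limited_fit"
    return "weak_fit"
```

For the sample response's score of 78.5 this returns strong_fit, matching the band field in the report.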
The reuse waterfall
The waterfall is where opportunity meets reality. Each of the first four tiers is a subset of the one above it; the last line is the gap between candidate and realized reuse:
total_input_tokens 10000 everything sent
eligible_reuse_tokens 9000 could be reused given the prefix family
candidate_reuse_tokens 9000 the runtime considered for reuse
realized_reused_tokens 8500 actually served from cache, billed at the read rate
missed_opportunity_tokens 500 candidate − realized: the gap to close
A large missed_opportunity_tokens relative to candidate_reuse_tokens is the signal that prompt
ordering or provider tuning will pay off. A small gap with a high capture rate means the providers are
already doing the work - and no migration will beat them.
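The gap and the capture rate follow directly from two waterfall fields. A small sketch of that arithmetic, using the field names from the sample report (the helper name `capture_summary` is hypothetical):

```python
def capture_summary(waterfall):
    """Return (missed_tokens, capture_rate) from a waterfall dict."""
    candidate = waterfall["candidate_reuse_tokens"]
    realized = waterfall["realized_reused_tokens"]
    missed = candidate - realized
    # Guard against division by zero when nothing was a reuse candidate.
    rate = realized / candidate if candidate else 0.0
    return missed, rate


missed, rate = capture_summary(
    {"candidate_reuse_tokens": 9000, "realized_reused_tokens": 8500}
)
print(missed, f"{rate:.0%}")  # 500 94%
```

These are the same figures the sample report states in its notes field: 8500 of 9000 candidate tokens captured, a 94% capture rate.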
The recommended profile
recommended_profile is deliberately conservative. BYOC is only ever recommended when a large
missed gap justifies it, never on prompt length alone:
| Profile | When |
|---|---|
| optimize_prompt_construction | Weak or limited fit - fix ordering before anything else. |
| managed_provider_tuning | Plausible fit, or strong fit where capture is already high (≥ 70%). |
| byoc_pilot_worth_evaluating | Strong fit and a large fraction of reuse is still being missed. |
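One way to read the table as a decision rule is sketched below. The 70% capture threshold comes from the table. Treating "a large fraction of reuse is still being missed" as the complement of that threshold (capture below 70%) is an assumption made for illustration, and so is the function name; the service's actual rule may differ.

```python
def recommended_profile(band, capture_rate, high_capture=0.70):
    """Pick a profile from the band and the realized/candidate capture rate."""
    if band in ("weak_fit", "limited_fit"):
        return "optimize_prompt_construction"
    if band == "plausible_fit":
        return "managed_provider_tuning"
    if band == "strong_fit":
        # Assumption: "large missed fraction" means capture below high_capture.
        if capture_rate >= high_capture:
            return "managed_provider_tuning"
        return "byoc_pilot_worth_evaluating"
    raise ValueError(f"unknown band: {band!r}")
```

With the sample report's strong_fit band and 94% capture rate, this yields managed_provider_tuning, the same value as the report's recommended_profile field.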
Signed report
GET /v2/diagnostics/{id}/report returns the same report wrapped with a generated_at timestamp and
an evidence_digest (sig_<64 hex>) over the serialized report. The digest lets a recipient confirm
the numbers were not altered after the fact - useful when the diagnostic is the basis for a pilot
decision.
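A recipient can at least sanity-check that a digest has the documented shape before relying on it. The sketch below checks only the sig_<64 hex> format. It does not verify the digest against the report, because the hashing and serialization scheme is not described here. The helper name is hypothetical, and accepting both upper- and lower-case hex is an assumption.

```python
import re

# "sig_" followed by exactly 64 hexadecimal characters.
_DIGEST_PATTERN = re.compile(r"sig_[0-9a-fA-F]{64}")


def looks_like_evidence_digest(value):
    """Return True if value matches the sig_<64 hex> shape (format check only)."""
    return isinstance(value, str) and _DIGEST_PATTERN.fullmatch(value) is not None
```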
Why a high score isn't a buy signal
Deployment readiness is scored separately from reuse opportunity.