Workload diagnostics

Capture metadata traces, run the Agent Workload Efficiency Diagnostic, read the Workload Reuse Score and reuse waterfall, and get a recommended execution profile - before you change any infrastructure.

The Agent Workload Efficiency Diagnostic scores how much reuse your traffic actually contains and tells you what to do about it. It runs on metadata - lengths, timing, fingerprints, lineage - so you never have to hand over raw prompt text to find out whether reuse is worth chasing.

The four steps

Capture traces

Export metadata traces from your existing traffic. Each trace records what was requested and what was observed - token counts, timing, the resolved target, and how much was reusable vs. reused. The fastest way to capture without instrumenting your app is the CLI proxy, which sits in front of an OpenAI-compatible endpoint and writes metadata-only traces to a file.

Score the workload

POST /v2/diagnostics computes a Workload Reuse Score (WRS) from six components, plus a band and a recommended action.

Read the reuse waterfall

The waterfall separates the reuse you could capture from the reuse you did, surfacing the missed-opportunity gap.

Act on the recommended profile

The report names the next step - usually prompt ordering, sometimes provider tuning, rarely a BYOC pilot - and is available as a signed report you can hand to a stakeholder.

Capture traces locally

The zumik CLI runs a metadata-only proxy in front of any OpenAI-compatible endpoint and appends one trace per request to a JSONL file - no prompt text leaves your machine:

zumik proxy --upstream https://api.openai.com --out zumik-traces.jsonl
# point your client at http://127.0.0.1:8080, run real traffic, then:
zumik diagnose zumik-traces.jsonl

zumik diagnose builds the full report locally, or runs it against a live deployment's /v2/diagnostics when you pass --api-key. See the CLI reference.
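If you want to inspect or post the captured traces yourself, a minimal Python sketch follows. It assumes each non-blank line of zumik-traces.jsonl is one JSON trace envelope; the on-disk format is not specified in this guide, and load_traces and build_payload are illustrative names, not part of the zumik CLI.

```python
import json
from pathlib import Path


def load_traces(path):
    """Read one JSON object per non-blank line from a JSONL file.

    ASSUMPTION: each line is a complete trace envelope.
    """
    traces = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            traces.append(json.loads(line))
    return traces


def build_payload(traces):
    """Wrap traces in a {"traces": [...]} body; the array must be non-empty."""
    if not traces:
        raise ValueError("traces must be a non-empty array")
    return {"traces": traces}
```

Calling build_payload(load_traces("zumik-traces.jsonl")) yields a dictionary you can serialize with json.dumps and send as a request body.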

Run it

traces is a non-empty array of metadata trace envelopes. Use privacy_mode: "metadata" by default; richer modes (tokenized, encrypted_full_fidelity, synthetic) exist for replay but are not needed to score a workload.

curl https://api.zumik.ai/v2/diagnostics \
  -H "Authorization: Bearer zk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "traces": [
      {
        "trace_id": "trc_a1",
        "privacy_mode": "metadata",
        "prefix_family_id": "pf_agent_main",
        "schedule": { "session_id": "ses_1" },
        "observed": {
          "resolved_target": "anthropic/claude@2025-02-01",
          "ttft_ms": 700, "latency_ms": 1200,
          "input_tokens": 10000, "candidate_reuse_tokens": 9000,
          "realized_reused_tokens": 8500, "output_tokens": 200,
          "attempt_count": 1
        }
      }
    ]
  }'

Response

{
  "id": "dgn_01jy…",
  "object": "diagnostic",
  "project_id": "prj_01jy…",
  "created_at": "2026-06-15T12:00:00Z",
  "report": {
    "object": "diagnostic_report",
    "trace_count": 1,
    "workload_reuse_score": 78.5,
    "band": "strong_fit",
    "recommended_action": "prioritize optimization pilot",
    "components": { "opportunity_ratio": 0.9, "recurrence_score": 0.8, "retention_locality": 0.7, "ttft_sensitivity": 0.6, "session_continuity": 0.5, "payload_redundancy": 0.4 },
    "waterfall": { "total_input_tokens": 10000, "eligible_reuse_tokens": 9000, "candidate_reuse_tokens": 9000, "realized_reused_tokens": 8500, "missed_opportunity_tokens": 500 },
    "recommended_profile": "managed_provider_tuning",
    "notes": ["Of 9000 candidate reusable tokens, 8500 were captured (94% capture rate)."]
  }
}
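
The same call can be made from Python. The sketch below only builds the request and reads the headline fields from a response like the one above. It uses the standard library, and build_request and summarize are illustrative names rather than part of any Zumik SDK.

```python
import json
import urllib.request

API_URL = "https://api.zumik.ai/v2/diagnostics"  # endpoint from the curl example


def build_request(traces, api_key):
    """Build (but do not send) the POST request for a diagnostic run."""
    body = json.dumps({"traces": traces}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def summarize(response_body):
    """Return (score, band, recommended_profile) from a diagnostic response dict."""
    report = response_body["report"]
    return (
        report["workload_reuse_score"],
        report["band"],
        report["recommended_profile"],
    )
```

To send it, pass the request to urllib.request.urlopen and decode the result with json.load before handing it to summarize.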

Reading the Workload Reuse Score

The WRS is a 0-100 score built from six weighted components. A high score means there is reuse to capture; it does not mean you should self-host.

Component            Weight   What it measures
opportunity_ratio    0.35     Share of input tokens that could be served from cache.
recurrence_score     0.20     How often the same prefix family recurs.
retention_locality   0.15     Whether recurrences land close enough in time to stay warm.
ttft_sensitivity     0.15     How much first-token latency matters for this traffic.
session_continuity   0.10     How much work stays within a single session.
payload_redundancy   0.05     Repeated payloads across requests.
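
To see how the weights interact, here is a rough sketch that assumes the score is simply 100 times the weighted sum of the components. That aggregation is an assumption: this guide does not state the exact formula, and for the sample components shown earlier a plain weighted sum gives 74.0 rather than the 78.5 in the sample response, so the service may normalize or adjust components in ways not covered here.

```python
# Weights as listed in the component table.
WEIGHTS = {
    "opportunity_ratio": 0.35,
    "recurrence_score": 0.20,
    "retention_locality": 0.15,
    "ttft_sensitivity": 0.15,
    "session_continuity": 0.10,
    "payload_redundancy": 0.05,
}


def weighted_reuse_score(components):
    """Approximate a 0-100 score as 100 x the weighted sum of components.

    ASSUMPTION: a plain linear combination; the real scoring may differ.
    Each component is expected to be in the range [0, 1].
    """
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(100 * total, 1)
```

Because opportunity_ratio carries more than a third of the weight, a workload with little cacheable input cannot reach a high score under this approximation, however strong the other components are.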

The score maps to a band and a recommended action:

Band            Score   Recommended action
strong_fit      ≥ 70    Prioritize an optimization pilot.
plausible_fit   ≥ 45    Run diagnostic and provider tuning.
limited_fit     ≥ 20    Optimize prompt construction first.
weak_fit        < 20    Don't pursue BYOC or custom caching.
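
The band thresholds can be read as a top-down lookup. A minimal sketch, assuming thresholds are checked from highest to lowest so that plausible_fit covers 45 up to (but not including) 70; the classify function is illustrative, not part of the API:

```python
# (minimum score, band, recommended action), highest threshold first.
BANDS = [
    (70, "strong_fit", "Prioritize an optimization pilot."),
    (45, "plausible_fit", "Run diagnostic and provider tuning."),
    (20, "limited_fit", "Optimize prompt construction first."),
]


def classify(score):
    """Return (band, recommended_action) for a 0-100 score."""
    for threshold, band, action in BANDS:
        if score >= threshold:
            return band, action
    # Anything below the lowest threshold falls into the weakest band.
    return "weak_fit", "Don't pursue BYOC or custom caching."
```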

The reuse waterfall

The waterfall is where opportunity meets reality. Each of the first four tiers is a subset of the one above it; the last line is the difference between the candidate and realized tiers:

total_input_tokens        10000   everything sent
eligible_reuse_tokens      9000   could be reused given the prefix family
candidate_reuse_tokens     9000   the runtime considered for reuse
realized_reused_tokens     8500   actually served from cache, billed at the read rate
missed_opportunity_tokens   500   candidate − realized: the gap to close

A large missed_opportunity_tokens relative to candidate_reuse_tokens is the signal that prompt ordering or provider tuning will pay off. A small gap with a high capture rate means the providers are already doing the work - and no migration will beat them.
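
The gap and the capture rate follow directly from the waterfall fields. A short sketch using the field names from the report; the waterfall_summary helper itself is illustrative:

```python
def waterfall_summary(waterfall):
    """Derive the missed gap and capture rate from a report's waterfall dict."""
    candidate = waterfall["candidate_reuse_tokens"]
    realized = waterfall["realized_reused_tokens"]
    missed = candidate - realized
    # Guard against workloads with no candidate reuse at all.
    capture_rate = realized / candidate if candidate else 0.0
    return {"missed_opportunity_tokens": missed, "capture_rate": capture_rate}
```

For the sample waterfall (9000 candidate, 8500 realized) this gives a gap of 500 tokens and a capture rate of about 94%, matching the note in the sample response.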

recommended_profile is deliberately conservative. BYOC is only ever recommended when a large missed gap justifies it, never on prompt length alone:

Profile                        When
optimize_prompt_construction   Weak or limited fit - fix ordering before anything else.
managed_provider_tuning        Plausible fit, or strong fit where capture is already high (≥ 70%).
byoc_pilot_worth_evaluating    Strong fit and a large fraction of reuse is still being missed.
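
The profile rules can be sketched as a small decision function. This assumes that "a large fraction of reuse is still being missed" means a capture rate below the same 70% line used for "already high"; the guide does not give an exact cutoff for the BYOC case, so treat high_capture as a hypothetical parameter.

```python
def recommend_profile(band, capture_rate, high_capture=0.70):
    """Map a band and a capture rate (0-1) to a recommended profile.

    ASSUMPTION: strong fit with capture below high_capture counts as
    "a large fraction of reuse still being missed".
    """
    if band in ("weak_fit", "limited_fit"):
        return "optimize_prompt_construction"
    if band == "plausible_fit":
        return "managed_provider_tuning"
    if band == "strong_fit":
        if capture_rate >= high_capture:
            return "managed_provider_tuning"
        return "byoc_pilot_worth_evaluating"
    raise ValueError(f"unknown band: {band!r}")
```

With the sample report (strong_fit, roughly 94% capture) this returns managed_provider_tuning, which agrees with the recommended_profile in the sample response.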

Signed report

GET /v2/diagnostics/{id}/report returns the same report wrapped with a generated_at timestamp and an evidence_digest (sig_<64 hex>) over the serialized report. The digest lets a recipient confirm the numbers were not altered after the fact - useful when the diagnostic is the basis for a pilot decision.
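
A recipient-side check might look like the sketch below. Only the sig_<64 hex> shape comes from this guide. The hashing scheme (SHA-256 over compact, key-sorted JSON) is purely an assumption for illustration, as is the use of lowercase hex; the actual serialization and signing method are not documented here and must be confirmed before relying on a check like this.

```python
import hashlib
import hmac
import json
import re

# Shape described in this guide: "sig_" followed by 64 hex characters.
DIGEST_PATTERN = re.compile(r"sig_[0-9a-f]{64}")


def is_well_formed(digest):
    """Check only the documented shape of an evidence_digest."""
    return DIGEST_PATTERN.fullmatch(digest) is not None


def assumed_digest(report):
    """Recompute a digest under an ASSUMED scheme (not the documented one):
    SHA-256 over compact JSON with sorted keys."""
    serialized = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return "sig_" + hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def matches(report, evidence_digest):
    """Constant-time comparison of the recomputed and supplied digests."""
    if not is_well_formed(evidence_digest):
        return False
    return hmac.compare_digest(assumed_digest(report), evidence_digest)
```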

Why a high score isn't a buy signal

Deployment readiness is scored separately from reuse opportunity.
