Replay
The trace envelope and its privacy modes, the five replay classes, the metrics a replay compares, and the signed report with provenance and an evidence digest - the system that justifies (or rejects) a BYOC migration.
Replay re-runs recorded traffic against a candidate configuration and produces a signed comparison. It is how Zumik turns "self-hosting might be cheaper" into evidence: a baseline, a candidate, a metric delta, and a digest a stakeholder can verify. Replay also drives provider comparison, prompt-ordering experiments, cache-retention experiments, router tuning, QoS validation, regression testing, and purge verification.
The trace envelope
Replay runs over a formal trace schema. Each trace records the request shape and what was observed,
keyed to a privacy_mode that decides how much is stored.
{
"trace_schema_version": "2026-06-01",
"trace_id": "trc_…",
"project_id": "prj_…",
"privacy_mode": "tokenized",
"request": {
"api_surface": "v1_responses",
"snapshot_id": "snp_…",
"ordered_block_manifest": ["blk_1", "blk_2", "blk_3"],
"prompt_compiler_revision": "pc_11",
"alias_release_id": "alr_…"
},
"schedule": { "arrival_offset_ms": 1850, "session_id": "ses_…", "branch_id": "br_…", "concurrency_group": "cg_4" },
"observed": {
"resolved_target": "openai/gpt-4o@2025-01-01",
"ttft_ms": 410, "latency_ms": 2880,
"input_tokens": 18240, "candidate_reuse_tokens": 17100,
"realized_reused_tokens": 14980, "output_tokens": 412, "attempt_count": 1
}
}
Privacy modes
Replay never requires raw prompt text. Choose the least-revealing mode that still answers your question:
| Mode | Stored | Use for |
|---|---|---|
| metadata | Lengths, timing, fingerprints, lineage, usage. | Low-risk diagnostics. |
| tokenized | Token ids plus redacted metadata. | Faithful performance replay without plaintext. |
| encrypted_full_fidelity | Encrypted source payloads under your policy. | Output-quality evaluation. |
| synthetic | A generated workload with matching structure. | Public benchmarks and stress testing. |
Raw prompt text is never retained by default - see Data privacy and retention.
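As a minimal sketch of how a client might sanity-check a trace before submitting it, the function below checks the top-level fields shown in the envelope example and confirms that privacy_mode is one of the four documented modes. Treating every top-level field as required is an assumption made for illustration, and validate_trace is a hypothetical helper, not part of any Zumik SDK.

```python
# Hypothetical client-side check; not an official Zumik validator.
ALLOWED_PRIVACY_MODES = {
    "metadata",
    "tokenized",
    "encrypted_full_fidelity",
    "synthetic",
}

# Assumption: every top-level key in the envelope example is required.
REQUIRED_TOP_LEVEL = (
    "trace_schema_version",
    "trace_id",
    "project_id",
    "privacy_mode",
    "request",
    "schedule",
    "observed",
)


def validate_trace(trace: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in trace:
            problems.append(f"missing field: {key}")
    mode = trace.get("privacy_mode")
    if mode is not None and mode not in ALLOWED_PRIVACY_MODES:
        problems.append(f"unknown privacy_mode: {mode}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a capture pipeline report everything wrong with a trace in a single pass.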
Replay classes
replay_class sets how much work the run does:
| Class | What it does |
|---|---|
| routing_simulation | Compares routing policy without running inference. Cheapest; runs inline and completes immediately. |
| synthetic_performance | Tests throughput and TTFT with structurally similar prompts. |
| tokenized_performance | Tests exact token-shape reuse without plaintext. |
| full_fidelity_evaluation | Compares latency, outputs, tool calls, and quality. |
| purge_verification | Confirms invalidated artifacts cannot be reused (see Purge semantics). |
Run a replay
Create a run with POST /v2/replay-runs. routing_simulation is the default class and runs without
inference, so it returns status: completed right away with routing-agreement metrics:
curl https://api.zumik.ai/v2/replay-runs \
-H "Authorization: Bearer zk_live_..." \
-H "Content-Type: application/json" \
-d '{
"baseline": "openai/gpt-4o@2025-01-01",
"candidate": "byoc_us_east",
"replay_class": "routing_simulation",
"repetitions": 100
}'
{
"id": "rpl_01jy…",
"object": "replay_run",
"status": "completed",
"baseline": "openai/gpt-4o@2025-01-01",
"candidate": "byoc_us_east",
"replay_class": "routing_simulation",
"repetitions": 100,
"metrics": { "routing_agreement": 0.94, "divergence": 0.06, "samples": 100 }
}
Performance and evaluation classes run over a captured traffic manifest. Supply it inline as a
traces array, or set traffic_manifest_ref to "usage:N" to build one from your last N usage
events (token shapes and timing only, never prompt text):
curl https://api.zumik.ai/v2/replay-runs \
-H "Authorization: Bearer zk_live_..." \
-H "Content-Type: application/json" \
-d '{
"baseline": "managed",
"candidate": "byoc_us_east",
"replay_class": "tokenized_performance",
"traffic_manifest_ref": "usage:500"
}'
These classes are scheduled onto the background runner and move queued → running → completed; poll
GET /v2/replay-runs/{id} or list them. A scheduled_for (RFC 3339) holds a run until its time; a
still-queued run can be canceled with POST /v2/replay-runs/{id}/cancel. full_fidelity_evaluation
re-executes recorded messages for real and is budget-gated; the other classes are deterministic
projections over the manifest.
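The polling loop described above can be sketched as follows. This is an illustration under stated assumptions, not an official client: wait_for_run and http_fetch are hypothetical helper names, the canceled and failed status strings are assumed (only queued, running, and completed are named here), and the interval and timeout defaults are arbitrary.

```python
import json
import time
import urllib.request

# "completed" is documented; "canceled" and "failed" are assumed names.
TERMINAL_STATUSES = {"completed", "canceled", "failed"}


def http_fetch(run_id: str, api_key: str) -> dict:
    """GET /v2/replay-runs/{id} and return the decoded JSON body."""
    req = urllib.request.Request(
        f"https://api.zumik.ai/v2/replay-runs/{run_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_run(run_id, fetch, interval_s=2.0, timeout_s=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(run_id) until the run reaches a terminal status."""
    deadline = clock() + timeout_s
    while True:
        run = fetch(run_id)
        if run["status"] in TERMINAL_STATUSES:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"replay run {run_id} still {run['status']}")
        sleep(interval_s)
```

Passing fetch, sleep, and clock as parameters keeps the loop testable without network access; in real use, fetch could be `lambda rid: http_fetch(rid, api_key)`.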
What replay compares
A performance or evaluation run compares TTFT, end-to-end latency, output-token rate, reuse capture rate, tool-call validity, structured-output validity, error rate, deadline misses, GPU-seconds where measurable, provider cost, and a task/semantic-quality score. Outputs are not required to be byte-identical unless the backend explicitly supports deterministic reproduction.
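To make two of these metrics concrete, here is a rough sketch that derives them from the observed block of a trace. The formulas are assumptions chosen for illustration: reuse capture rate is taken as realized_reused_tokens divided by candidate_reuse_tokens, and output-token rate as output tokens per second of generation time after the first token. The exact definitions Zumik uses are not stated here.

```python
def reuse_capture_rate(observed: dict) -> float:
    """Assumed definition: realized reuse as a fraction of candidate reuse."""
    candidate = observed["candidate_reuse_tokens"]
    if candidate == 0:
        return 0.0
    return observed["realized_reused_tokens"] / candidate


def output_token_rate(observed: dict) -> float:
    """Assumed definition: output tokens per second after the first token."""
    generation_ms = observed["latency_ms"] - observed["ttft_ms"]
    if generation_ms <= 0:
        return 0.0
    return observed["output_tokens"] / (generation_ms / 1000.0)


def metric_delta(baseline: dict, candidate: dict) -> dict:
    """Candidate minus baseline for every metric present in both."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}


# Values copied from the trace envelope example.
observed = {
    "ttft_ms": 410, "latency_ms": 2880,
    "candidate_reuse_tokens": 17100,
    "realized_reused_tokens": 14980, "output_tokens": 412,
}
print(round(reuse_capture_rate(observed), 3))
print(round(output_token_rate(observed), 1))
```

With the example trace, this works out to a capture rate of roughly 0.876 and an output rate of roughly 166.8 tokens per second under the assumed formulas.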
The signed report
GET /v2/replay-runs/{id}/report returns a report built for a decision you can defend:
{
"object": "replay_report",
"replay_run_id": "rpl_01jy…",
"generated_at": "2026-06-15T12:05:00Z",
"baseline": "openai/gpt-4o@2025-01-01",
"candidate": "byoc_us_east",
"replay_class": "routing_simulation",
"provenance": {
"trace_schema_version": "2026-06-01",
"replay_runner_version": "rr_1.0.0",
"cache_mode": "provider_default",
"concurrency": 1,
"repetitions": 100
},
"metrics": { "routing_agreement": 0.94, "divergence": 0.06, "samples": 100 },
"assumptions": [
"Outputs are not required to be byte-identical unless the backend supports deterministic reproduction.",
"Cost and latency deltas are computed against the recorded traffic manifest, not live re-billing."
],
"known_limitations": ["routing_simulation does not execute inference; it compares routing policy only."],
"evidence_digest": "sig_9f86d081…"
}
The provenance block records the trace schema, runner version, cache mode, concurrency, and
repetitions; assumptions and known_limitations state what the run does and does not prove. The
evidence_digest is computed once when the run completes and pinned to it, so it never drifts if an
alias is edited later. When Zumik is configured with a signing key, the digest is a keyed HMAC-SHA256
(sig_<64 hex>) that only Zumik can produce, so an altered report won't match and Zumik can verify
integrity on request. Without a signing key it is a plain unkeyed checksum (sha256_<64 hex>),
labeled distinctly so it is never mistaken for a signature.
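A consumer of the report can at least tell the two digest forms apart by prefix. The sketch below classifies an evidence_digest string and, purely for illustration, shows how a keyed HMAC-SHA256 differs from an unkeyed SHA-256 over the same bytes. Which bytes Zumik actually hashes is not specified here, so the payload argument is a placeholder; the function names are hypothetical, and lowercase hex is assumed.

```python
import hashlib
import hmac
import re

# Assumption: digests use lowercase hex, as in the example report.
_DIGEST_RE = re.compile(r"^(sig|sha256)_([0-9a-f]{64})$")


def classify_digest(evidence_digest: str) -> str:
    """Return 'keyed_hmac_sha256', 'unkeyed_sha256', or 'unrecognized'."""
    match = _DIGEST_RE.match(evidence_digest)
    if match is None:
        return "unrecognized"
    if match.group(1) == "sig":
        return "keyed_hmac_sha256"
    return "unkeyed_sha256"


def unkeyed_checksum(payload: bytes) -> str:
    """Anyone holding the payload can recompute this value."""
    return "sha256_" + hashlib.sha256(payload).hexdigest()


def keyed_signature(key: bytes, payload: bytes) -> str:
    """Only a holder of the key can produce or check this value."""
    return "sig_" + hmac.new(key, payload, hashlib.sha256).hexdigest()
```

The distinction matters for trust: a matching sha256_ value only shows the bytes are unchanged since hashing, while a sig_ value also depends on a secret, so it has to be checked by the key holder.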
When replay justifies a profile change
A self-hosted profile is activated only when replay proves it beats the managed path after full cost.
Workload diagnostics
Capture metadata traces, run the Agent Workload Efficiency Diagnostic, read the Workload Reuse Score and reuse waterfall, and get a recommended execution profile - before you change any infrastructure.
BYOK setup
Attach your own provider keys for all five providers, route eligible traffic through them, rotate without downtime, and bill against your own provider relationship.