Replay runs
Pin a baseline and a candidate execution profile, replay a captured traffic manifest, and render a signed report with full provenance and an evidence digest.
Replay is a product, not a script. A run pins a baseline and a candidate execution profile and compares them over a captured traffic manifest. The report carries the provenance that §20.5 requires and a verifiable digest (§20.7). Run IDs are prefixed with rpl_. See the replay guide.
Lifecycle
A run moves through queued → running → completed | failed, or canceled if you cancel it before it starts.
- routing_simulation runs inline at create time. It only compares routing policy (no inference) and is fully determined by the alias resolver over a fixed seed sweep, so it returns completed immediately.
- The other four classes are scheduled onto the background runner, which drains due runs and records the outcome. A run with a future scheduled_for stays queued until its time comes.
- If api-core restarts mid-run, the interrupted run is moved to failed (with a reason) rather than wedging the queue; resubmit to retry.
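The lifecycle above can be read as a small state machine. The following is a minimal sketch of that reading, not the service's implementation: the transition table is inferred from the prose, and the names TRANSITIONS, is_terminal, and can_transition are hypothetical helpers introduced here.

```python
# Status values named in the lifecycle description.
TERMINAL = {"completed", "failed", "canceled"}

# ASSUMPTION: transitions inferred from the prose above; the service is the
# source of truth. Cancellation is only modeled from "queued".
TRANSITIONS = {
    "queued": {"running", "canceled"},
    "running": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
    "canceled": set(),
}


def is_terminal(status: str) -> bool:
    """True once a run can no longer change state."""
    return status in TERMINAL


def can_transition(current: str, new: str) -> bool:
    """True if the lifecycle, as described, allows current -> new."""
    return new in TRANSITIONS.get(current, set())
```

A client can use is_terminal to decide when to stop polling, and can_transition to avoid sending a cancel request for a run it already knows is running.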
Traffic manifest
Every class except routing_simulation replays an ordered list of §20.2 trace envelopes. Supply it one of two ways:
- Inline: pass a traces array. Use this for full_fidelity_evaluation, where each trace carries the messages to re-execute (an encrypted_full_fidelity capture).
- From your usage: set traffic_manifest_ref to "usage:N" to build a manifest from your last N recorded usage events. These carry token shapes and timing but never prompt text, so a token-shape replay needs no new capture pipeline and leaks no PII.
A single run's manifest is capped at 5,000 traces.
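As an illustration of the two supply modes and the 5,000-trace cap, here is a minimal client-side sketch that picks the manifest fields for a request body. The helper names usage_ref and manifest_fields are hypothetical, and rejecting oversize or ambiguous input locally is a choice of this sketch; the service may handle those cases differently.

```python
MAX_TRACES = 5000  # per-run manifest cap stated above


def usage_ref(n: int) -> str:
    """Build a "usage:N" reference for the last n recorded usage events."""
    if not 1 <= n <= MAX_TRACES:
        raise ValueError(f"n must be between 1 and {MAX_TRACES}")
    return f"usage:{n}"


def manifest_fields(replay_class, traces=None, usage_events=None) -> dict:
    """Return the manifest-related fields to merge into a create request."""
    if replay_class == "routing_simulation":
        return {}  # the only class that replays no manifest
    if traces is not None and usage_events is not None:
        # ASSUMPTION: the sketch treats the two modes as mutually exclusive.
        raise ValueError("supply traces or usage_events, not both")
    if traces is not None:
        if len(traces) > MAX_TRACES:
            raise ValueError(f"manifest exceeds {MAX_TRACES} traces")
        if replay_class == "full_fidelity_evaluation" and any(
            "messages" not in t for t in traces
        ):
            raise ValueError("full_fidelity_evaluation traces need messages")
        return {"traces": traces}
    if usage_events is not None:
        return {"traffic_manifest_ref": usage_ref(usage_events)}
    raise ValueError(f"{replay_class} needs a traffic manifest")
```

For example, manifest_fields("tokenized_performance", usage_events=500) yields the traffic_manifest_ref field used in the request example on this page.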
All requests require a bearer API key. See authentication.
Create a replay run
POST /v2/replay-runs
baseline (string, required): The baseline target or profile, e.g. openai/gpt-4o@2025-01-01 or managed. A name matching a live alias resolves through its weighted release; provider/model[@revision] is a fixed target.
candidate (string, required): The candidate to compare against, e.g. byoc_us_east or anthropic/claude-3-7-sonnet.
replay_class (string, default: routing_simulation): One of routing_simulation, synthetic_performance, tokenized_performance, full_fidelity_evaluation, purge_verification. See classes.
traffic_manifest_ref (string): A manifest reference. "usage:N" builds a manifest from your last N usage events.
traces (array): Inline manifest. Each entry is a trace envelope: input_tokens, output_tokens, optional candidate_reuse_tokens, realized_reused_tokens, ttft_ms, latency_ms, namespace_generation, and, for full_fidelity_evaluation, messages ([{ role, content }]).
repetitions (integer, default: 1): Replay repetitions, clamped to 1–1000.
scheduled_for (string): RFC 3339 time to hold the run until. Absent means run as soon as possible.
concurrency (integer, default: 1): The concurrency group the runner models; recorded in provenance.
curl https://api.zumik.ai/v2/replay-runs \
-H "Authorization: Bearer $ZUMIK_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"baseline": "managed",
"candidate": "byoc_us_east",
"replay_class": "tokenized_performance",
"traffic_manifest_ref": "usage:500"
}'

{
"id": "rpl_01jy7nfg01j2k3l4m5n6o7p8qr",
"object": "replay_run",
"project_id": "prj_01jy7n0a4c8m2t6v9q3wrxk7bd",
"created_at": "2026-06-15T16:18:09Z",
"status": "queued",
"baseline": "managed",
"candidate": "byoc_us_east",
"replay_class": "tokenized_performance",
"traffic_manifest_ref": "usage:500",
"repetitions": 1,
"concurrency": 1,
"scheduled_for": null,
"started_at": null,
"completed_at": null,
"failure_reason": null,
"attempt": 0,
"manifest_size": 500,
"metrics": null
}

status (string): queued, running, completed, failed, or canceled.
manifest_size (integer): How many trace envelopes the run will replay.
started_at / completed_at (string): RFC 3339 timestamps set as the run is claimed and finalized.
failure_reason (string): Present only on a failed run.
attempt (integer): Increments each time the runner claims the run.
metrics (object): Computed results once completed; null while queued/running. Shape depends on the class.
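The same create request can be assembled with Python's standard library. This is a minimal sketch rather than an official client: build_create_request is a hypothetical helper, the URL is taken from the curl example, and the local clamp on repetitions only mirrors the documented 1–1000 range.

```python
import json
import urllib.request

API_URL = "https://api.zumik.ai/v2/replay-runs"  # from the curl example


def build_create_request(api_key, baseline, candidate,
                         replay_class="routing_simulation",
                         traffic_manifest_ref=None,
                         repetitions=1) -> urllib.request.Request:
    """Build (but do not send) a POST /v2/replay-runs request."""
    if not baseline or not candidate:
        # Empty baseline/candidate is a documented 400; fail before sending.
        raise ValueError("baseline and candidate are required")
    body = {
        "baseline": baseline,
        "candidate": candidate,
        "replay_class": replay_class,
        # Mirror the documented clamp locally so the body shows the real value.
        "repetitions": max(1, min(1000, repetitions)),
    }
    if traffic_manifest_ref is not None:
        body["traffic_manifest_ref"] = traffic_manifest_ref
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it; separating the build step from the send step keeps the payload easy to inspect and test.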
Replay classes
| Class | What it does | Manifest |
|---|---|---|
| routing_simulation | Compares baseline vs candidate routing over a seed sweep (no inference). Runs inline. | none |
| tokenized_performance | Replays exact recorded token shapes through the reuse-adjusted cost model; reports per-request cost and reuse-capture deltas with confidence intervals, plus routing divergence. | token shapes |
| synthetic_performance | Generates a structurally similar workload from the manifest's distribution and projects cost and TTFT (TTFT grounded in each trace's observed TTFT, reduced by extra reuse). | token shapes |
| full_fidelity_evaluation | Re-executes recorded turns through the broker for both targets and measures real latency, output-token rate, and output divergence. Budget-gated real spend. | messages |
| purge_verification | Confirms artifacts from a purged namespace generation can no longer be reused. | namespace_generation |
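To make the tokenized_performance figures concrete, the sketch below shows one plausible way to compute a reuse-capture percentage and a 95% normal-approximation interval over per-request samples. Both formulas are assumptions: the ratio is inferred from the sample report on this page (3,100,000 realized reuse tokens over 8,400,000 input tokens gives 36.9), and the use of the sample standard deviation is a guess at what "normal-approximation" means here. The function names are hypothetical.

```python
import math
import statistics


def reuse_capture_pct(realized_reuse_tokens: int, input_tokens: int) -> float:
    """ASSUMED definition: realized reuse tokens as a share of input tokens."""
    if input_tokens <= 0:
        return 0.0
    return round(100.0 * realized_reuse_tokens / input_tokens, 1)


def ci95_normal(samples):
    """95% normal-approximation interval: mean +/- 1.96 * s / sqrt(n)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    if n < 2:
        return (mean, mean)  # no spread can be estimated from one sample
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(n)
    return (mean - half_width, mean + half_width)
```

With 500 per-request savings samples that are all zero, ci95_normal returns an interval of zero width, which matches the shape of the sample report where baseline and candidate cost are identical.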
List replay runs
GET /v2/replay-runs
Returns { "object": "list", "data": [...] }, newest first. Each entry is a run object (without the manifest body).
Cancel a replay run
POST /v2/replay-runs/{replay_run_id}/cancel
Cancels a still-queued run. Canceling a run that is already running or finished returns 400.
Retrieve a replay run
GET /v2/replay-runs/{replay_run_id}
Returns the run object. Poll it (or list) to watch a scheduled run progress to completed.
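Polling can be sketched as a loop that fetches the run until it reaches a terminal status. This is an illustrative helper, not part of any SDK: wait_for_run and its parameters are hypothetical, and fetch_run stands in for whatever function performs GET /v2/replay-runs/{replay_run_id} and returns the parsed run object.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed", "canceled"}


def wait_for_run(fetch_run: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 300.0,
                 sleep=time.sleep,
                 clock=time.monotonic) -> dict:
    """Call fetch_run until the run is terminal or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        run = fetch_run()
        if run["status"] in TERMINAL:
            return run
        if clock() >= deadline:
            raise TimeoutError(f"run still {run['status']} after {timeout_s}s")
        sleep(interval_s)
```

The sleep and clock parameters are injectable so the loop can be tested without real delays. For a run with a far-future scheduled_for, a longer interval and timeout are more appropriate than the defaults chosen here.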
Render a signed report
GET /v2/replay-runs/{replay_run_id}/report
The self-describing report (§20.5/§20.7): the full provenance block, a traffic-manifest summary, the computed metrics (or a pending note before completion), stated assumptions and known limitations, the recommended profile, and a verifiable evidence_digest.
{
"object": "replay_report",
"replay_run_id": "rpl_01jy7nfg01j2k3l4m5n6o7p8qr",
"generated_at": "2026-06-15T16:19:42Z",
"status": "completed",
"baseline": "managed",
"candidate": "byoc_us_east",
"replay_class": "tokenized_performance",
"provenance": {
"trace_schema_version": "2026-06-01",
"replay_runner_version": "rr_1.0.0",
"runtime_engine_version": "api-core/0.1.0",
"model_alias_release": { "baseline": null, "candidate": null },
"resolved_model_revision": { "baseline": "unpinned", "candidate": "unpinned" },
"prompt_compiler_revision": "pc_11",
"tokenizer_revision": "tok_7",
"cache_mode": "provider_default",
"warmup_period_s": 0,
"cold_start_period_s": 0,
"request_arrival_schedule": "as_fast_as_possible",
"concurrency": 1,
"retry_policy": "none",
"provider_rate_limits": "provider_default",
"repetitions": 1,
"confidence_intervals": "95% normal-approximation on per-request samples; p50/p95/p99 reported",
"quality_evaluator_version": "qe_none",
"failures_and_dropped": { "failures": 0, "dropped": 0 }
},
"traffic_manifest": {
"ref": "usage:500",
"traces": 500,
"total_input_tokens": 8400000,
"total_output_tokens": 210000,
"total_realized_reuse_tokens": 3100000,
"traces_with_full_fidelity_payload": 0
},
"metrics": {
"metric_deltas": {
"provider_cost_micros": { "baseline": 22100000, "candidate": 22100000, "delta": 0, "pct": 0 },
"reuse_capture_pct": { "baseline": 36.9, "candidate": 36.9 }
},
"confidence_intervals": { "per_request_cost_savings_micros": { "n": 500, "mean": 0, "p50": 0, "p95": 0, "p99": 0, "ci95_low": 0, "ci95_high": 0 } },
"recommended_profile": "baseline",
"failures": 0,
"dropped": 0
},
"assumptions": ["..."],
"quality_guardrails": "token-shape replay does not execute inference; output quality is unchanged from the recorded run.",
"known_limitations": ["..."],
"recommended_profile": "baseline",
"evidence_digest": "sig_b71e...4d"
}

provenance (object): Every §20.5 field: schema/runner/engine versions, alias releases, resolved model revisions, prompt-compiler and tokenizer revisions, cache mode, warmup and cold-start periods, arrival schedule, concurrency, retry policy, provider rate limits, repetitions, confidence-interval policy, quality-evaluator version, and the failure/dropped counts.
traffic_manifest (object): A summary of the replayed traffic: trace count, total input/output/reuse tokens, and how many traces carried a full-fidelity payload.
metrics (object): Class-specific comparison: metric deltas, confidence intervals, and failures / dropped.
quality_guardrails (string): What the class does and does not guarantee about output quality.
recommended_profile (string): The profile the report recommends (baseline, candidate, either, or n/a).
evidence_digest (string): Pinned at completion over provenance, the manifest summary, and metrics. With a signing key it is a keyed HMAC-SHA256 (sig_…) that only Zumik can produce, which makes it tamper-evident; without one it is an unkeyed checksum (sha256_…).
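The difference between the two digest forms can be illustrated with Python's hashlib and hmac modules. This sketch does not reproduce Zumik's actual digest: the canonical serialization (compact JSON with sorted keys) and the hex encoding are assumptions made for illustration, and every function name below is hypothetical. Only the prefixes and the set of covered sections come from the field description above.

```python
import hashlib
import hmac
import json


def digest_kind(evidence_digest: str) -> str:
    """Classify a digest by its documented prefix."""
    if evidence_digest.startswith("sig_"):
        return "keyed_hmac_sha256"
    if evidence_digest.startswith("sha256_"):
        return "unkeyed_checksum"
    return "unknown"


def canonical_bytes(report: dict) -> bytes:
    """ASSUMPTION: compact, sorted-key JSON over the three covered sections."""
    covered = {k: report[k] for k in ("provenance", "traffic_manifest", "metrics")}
    return json.dumps(covered, sort_keys=True, separators=(",", ":")).encode("utf-8")


def unkeyed_checksum(report: dict) -> str:
    """Anyone can recompute this, so it detects corruption but not forgery."""
    return "sha256_" + hashlib.sha256(canonical_bytes(report)).hexdigest()


def keyed_signature(report: dict, key: bytes) -> str:
    """Only a holder of key can produce this, which makes it tamper-evident."""
    return "sig_" + hmac.new(key, canonical_bytes(report), hashlib.sha256).hexdigest()
```

When comparing a recomputed value against a stored one, hmac.compare_digest gives a constant-time comparison. Fields outside the three covered sections, such as generated_at, do not affect either digest in this sketch.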
Errors
| Status | Code | When |
|---|---|---|
| 400 | invalid_request_error | baseline/candidate empty, a class that needs a manifest got none, an unsupported traffic_manifest_ref, a bad scheduled_for, or canceling a non-queued run. |
| 401 | invalid_api_key | Missing or invalid API key. |
| 404 | invalid_request_error | The run does not exist in this project. |
See the full table on errors.
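Because 400 and 404 share the code invalid_request_error, a client that branches on the code alone cannot tell a malformed request from a missing run. The sketch below maps the (status, code) pairs from the table to exception types; the class names and the error_for helper are hypothetical, and the shape of the error response body is not assumed.

```python
class ReplayApiError(Exception):
    """Base class for errors in this sketch."""


class InvalidRequest(ReplayApiError):
    """400: bad input, missing manifest, or canceling a non-queued run."""


class AuthenticationFailed(ReplayApiError):
    """401: missing or invalid API key."""


class RunNotFound(ReplayApiError):
    """404: the run does not exist in this project."""


# Keyed on both status and code, since the code alone is ambiguous.
_ERRORS = {
    (400, "invalid_request_error"): InvalidRequest,
    (401, "invalid_api_key"): AuthenticationFailed,
    (404, "invalid_request_error"): RunNotFound,
}


def error_for(status: int, code: str, message: str = "") -> ReplayApiError:
    """Return an exception instance for a documented (status, code) pair."""
    cls = _ERRORS.get((status, code), ReplayApiError)
    return cls(message or f"{status} {code}")
```

Unknown pairs fall back to the base class, so errors listed only in the full errors table still surface as exceptions.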