Trace-capture proxy
zumik proxy sits in front of an OpenAI-compatible endpoint and records one metadata-only trace per request - token counts, timing, and a prefix fingerprint, never raw prompts.
zumik proxy is the capture step of the workload-analysis funnel. It sits in front of any OpenAI-compatible endpoint, forwards every request untouched, and records one trace per call. It never writes raw prompt text - only the metadata the diagnostic needs - so you can capture a representative workload without handing over your prompts.
It is a subcommand of the zumik CLI.
Put it in front of your endpoint
zumik proxy --upstream https://api.openai.com --listen 127.0.0.1:8080 --out workload.jsonl
Then point your OpenAI client's base URL at the proxy and run a representative slice of normal traffic:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-...")
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
)
The proxy forwards the request to --upstream unchanged - it copies your headers (dropping only hop-by-hop ones it must recompute) and your body verbatim, so your API key still authenticates against the upstream. Stop the proxy when you have enough traffic; workload.jsonl is now a trace bundle ready for zumik diagnose, zumik score, or the @zumik/trace-analyzer npm tool.
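The header handling described above can be sketched as follows. This is a minimal illustration, not the proxy's actual code: the helper name `forwardable_headers` is hypothetical, and the header set is the standard hop-by-hop list from the HTTP specifications, which may differ from the exact set zumik proxy drops.

```python
# Standard hop-by-hop headers: they describe a single connection and are
# not meant to be passed along by an intermediary.
# Assumption: the exact set zumik proxy drops is not documented here.
HOP_BY_HOP = frozenset({
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailer",
    "transfer-encoding",
    "upgrade",
})


def forwardable_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Return the headers to send upstream (hypothetical helper)."""
    # Any header named inside the Connection header is hop-by-hop as well.
    named = {
        token.strip().lower()
        for key, value in incoming.items()
        if key.lower() == "connection"
        for token in value.split(",")
        if token.strip()
    }
    drop = HOP_BY_HOP | named
    return {k: v for k, v in incoming.items() if k.lower() not in drop}


headers = forwardable_headers({
    "Authorization": "Bearer sk-...",
    "Content-Type": "application/json",
    "Connection": "keep-alive",
    "Transfer-Encoding": "chunked",
})
print(headers)
```

End-to-end headers such as Authorization pass through untouched, which is why the API key still authenticates against the upstream.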
| Flag | Default | Purpose |
|---|---|---|
| --upstream | (required) | Base URL to forward to, e.g. https://api.openai.com |
| --listen | 127.0.0.1:8080 | Address to bind |
| --out | zumik-traces.jsonl | JSONL file to append traces to |
What it records
For each request the proxy appends one trace with privacy_mode: "metadata":
- resolved_target - the model field from the request body.
- Token estimates - a whitespace-based input_tokens estimate over the chat messages or the Responses input, and output_tokens from the upstream usage block. The estimate is an honest approximation, not a claim of tokenizer precision.
- candidate_reuse_tokens - the stable-prefix token count, but only when this prefix family has been seen before. That is the maximum a prior compatible request could have supplied.
- realized_reused_tokens - read from the upstream response's usage.prompt_tokens_details.cached_tokens when present, so it reflects what the provider actually cached.
- prefix_family_id - a pf_-prefixed id derived from the prefix fingerprint, used to group recurring prefixes.
- Timing - request-to-response latency. The proxy is non-streaming, so first-token time is not separable and ttft_ms equals latency_ms.
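As a rough sketch of how the token fields above could be derived: every function name here is hypothetical, and reading completion_tokens as the output count assumes a Chat Completions style usage block. It counts whitespace-separated chunks, reads the cached-token figure from the upstream response, and credits the stable prefix only once a family has been seen.

```python
def estimate_tokens(text: str) -> int:
    """Whitespace-based estimate: count whitespace-separated chunks."""
    return len(text.split())


def request_blocks(body: dict) -> list[dict]:
    """Normalize chat `messages` or a Responses `input` into a list of blocks."""
    if isinstance(body.get("messages"), list):
        return body["messages"]
    value = body.get("input")
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return value if isinstance(value, list) else []


def estimate_input_tokens(body: dict) -> int:
    # Only plain-string content is counted in this sketch.
    return sum(
        estimate_tokens(block["content"])
        for block in request_blocks(body)
        if isinstance(block, dict) and isinstance(block.get("content"), str)
    )


def usage_fields(response: dict) -> tuple[int, int]:
    """Return (output_tokens, realized_reused_tokens) from the usage block."""
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    # Assumption: Chat Completions naming for the output count.
    return usage.get("completion_tokens", 0), details.get("cached_tokens", 0)


def candidate_reuse_tokens(family_id: str, prefix_tokens: int, seen: set[str]) -> int:
    """Credit the stable prefix only if this family was seen earlier."""
    candidate = prefix_tokens if family_id in seen else 0
    seen.add(family_id)
    return candidate


request = {"model": "gpt-4o", "messages": [
    {"role": "system", "content": "You are a careful assistant."},
    {"role": "user", "content": "Summarize this report."},
]}
response = {"usage": {"completion_tokens": 412,
                      "prompt_tokens_details": {"cached_tokens": 0}}}

input_estimate = estimate_input_tokens(request)
output_tokens, realized = usage_fields(response)

seen_families: set[str] = set()
first = candidate_reuse_tokens("pf_example", 5, seen_families)
second = candidate_reuse_tokens("pf_example", 5, seen_families)
print(input_estimate, output_tokens, realized, first, second)
```

The first request in a family reports zero candidate reuse because nothing earlier could have supplied its prefix; the second reports the full prefix estimate.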
{
  "trace_id": "trc_0",
  "privacy_mode": "metadata",
  "prefix_family_id": "pf_9f1c2a7b4e08",
  "schedule": { "arrival_offset_ms": 1840 },
  "observed": {
    "resolved_target": "gpt-4o",
    "ttft_ms": 2310,
    "latency_ms": 2310,
    "input_tokens": 18240,
    "candidate_reuse_tokens": 17110,
    "realized_reused_tokens": 15360,
    "output_tokens": 412,
    "attempt_count": 1
  }
}
No raw prompts
The privacy guarantee is structural, not a setting:
- The prefix fingerprint is a SHA-256 over each prefix block's role and content only. Identical prefixes hash identically - which is how recurring families are detected - but the hash is one-way, so the plaintext is never stored.
- Only the fingerprint (truncated into the prefix_family_id), token estimates, timing, and the model name are written. The prompt and completion text are forwarded to the upstream and dropped.
- Captures append to a local JSONL file you control. Nothing is sent to Zumik unless you later run zumik diagnose --api-key.
The fingerprint deliberately covers everything before the final, changes-every-request turn. A bare string input with no separable prefix is hashed whole so exact repeats still group, but reports a prefix of zero tokens. Request bodies above 8 MiB are still forwarded but not token-estimated, to bound memory on a local tool.
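A minimal sketch of the fingerprinting rule, under stated assumptions: the function name is hypothetical, the way each block's role and content are serialized before hashing is a guess, and the 12-hex-character truncation is inferred from the example id pf_9f1c2a7b4e08 rather than documented.

```python
import hashlib
import json


def prefix_family(body: dict) -> tuple[str, int]:
    """Return (prefix_family_id, prefix token estimate) for a request body."""
    digest = hashlib.sha256()
    blocks = body.get("messages", body.get("input", ""))
    prefix_tokens = 0
    if isinstance(blocks, str):
        # Bare string: no separable prefix, so hash it whole and
        # report a prefix of zero tokens.
        digest.update(blocks.encode("utf-8"))
    else:
        # Everything before the final turn, which changes on every request.
        for block in blocks[:-1]:
            role, content = block.get("role"), block.get("content")
            # Only role and content feed the hash (serialization is assumed).
            record = json.dumps([role, content], sort_keys=True, ensure_ascii=False)
            digest.update(record.encode("utf-8"))
            digest.update(b"\x00")  # separator between blocks
            if isinstance(content, str):
                prefix_tokens += len(content.split())
    # Truncation length inferred from the example id, not documented.
    return "pf_" + digest.hexdigest()[:12], prefix_tokens


system = {"role": "system", "content": "You are a careful assistant."}
id_a, tokens_a = prefix_family(
    {"messages": [system, {"role": "user", "content": "First question"}]})
id_b, tokens_b = prefix_family(
    {"messages": [system, {"role": "user", "content": "A different question"}]})
id_bare, tokens_bare = prefix_family({"input": "repeat me"})
print(id_a == id_b, tokens_a, tokens_bare)
```

Two requests that share a system prompt but differ in the final turn land in the same family, and only the digest prefix is kept, so the prompt text cannot be read back from the trace.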
Trace privacy modes go beyond metadata (tokenized, encrypted_full_fidelity, synthetic) for richer captures, but zumik proxy only ever emits metadata. See data privacy and retention for what each mode means.
Feed it to diagnose
# local report, nothing leaves your machine
zumik diagnose workload.jsonl
# or store the run on a live deployment
zumik diagnose workload.jsonl --api-key zk_live_...
The diagnostic builds the reuse waterfall and recommends the lowest-complexity execution profile the evidence supports. See the CLI reference for the full diagnose flags.
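Before running the diagnostic, a quick look at the bundle can be useful. The sketch below is not how zumik diagnose computes its waterfall; it is a hypothetical helper that totals three fields from the trace format shown earlier, treating any missing field as zero.

```python
import json
from typing import Iterable


def summarize(lines: Iterable[str]) -> dict:
    """Total the token fields across a JSONL trace bundle (illustrative only)."""
    totals = {
        "input_tokens": 0,
        "candidate_reuse_tokens": 0,
        "realized_reused_tokens": 0,
    }
    traces = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        observed = json.loads(line).get("observed", {})
        for key in totals:
            totals[key] += observed.get(key) or 0
        traces += 1
    # Reuse that a prior request could have supplied but the provider did not cache.
    unrealized = totals["candidate_reuse_tokens"] - totals["realized_reused_tokens"]
    return {"traces": traces, **totals, "unrealized_reuse_tokens": unrealized}


# In-memory sample using the field names from the trace example.
sample = [
    json.dumps({"trace_id": "trc_0", "observed": {
        "input_tokens": 18240,
        "candidate_reuse_tokens": 17110,
        "realized_reused_tokens": 15360}}),
    json.dumps({"trace_id": "trc_1", "observed": {"input_tokens": 200}}),
]
summary = summarize(sample)
print(summary)
```

With a real capture, pass an open file instead of the sample list, for example `with open("workload.jsonl", encoding="utf-8") as f: summarize(f)`.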
trace-analyzer
The @zumik/trace-analyzer npm tool turns a metadata-only trace bundle into a reuse waterfall and Workload Reuse Score, with no Rust toolchain and no raw prompts.
Prompt linter
zumik lint and the web prompt-linter check a prompt's layout for the structure that defeats provider-native caching - volatile content in the stable prefix, bad ordering, and a sub-1024-token prefix.