
Trace-capture proxy

zumik proxy sits in front of an OpenAI-compatible endpoint and records one metadata-only trace per request - token counts, timing, and a prefix fingerprint, never raw prompts.

zumik proxy is the capture step of the workload-analysis funnel. It sits in front of any OpenAI-compatible endpoint, forwards every request untouched, and records one trace per call. It never writes raw prompt text - only the metadata the diagnostic needs - so you can capture a representative workload without handing over your prompts.

It is a subcommand of the zumik CLI.

Put it in front of your endpoint

zumik proxy --upstream https://api.openai.com --listen 127.0.0.1:8080 --out workload.jsonl

Then point your OpenAI client's base URL at the proxy and run a representative slice of normal traffic:

Python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-...")
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
)

The proxy forwards the request to --upstream unchanged - it copies your headers (dropping only hop-by-hop ones it must recompute) and your body verbatim, so your API key still authenticates against the upstream. Stop the proxy when you have enough traffic; workload.jsonl is now a trace bundle ready for zumik diagnose, zumik score, or the @zumik/trace-analyzer npm tool.
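The header handling described above can be sketched as follows. This is a minimal illustration, not the proxy's actual code: the function name forwardable_headers is hypothetical, and the exact set of headers zumik proxy drops is an assumption based on the standard HTTP/1.1 hop-by-hop list. The point it shows is that Authorization passes through untouched, which is why your API key still authenticates upstream.

```python
# Standard HTTP/1.1 hop-by-hop headers. Assumption: the real proxy may
# drop a slightly different set.
HOP_BY_HOP = frozenset({
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailer",
    "transfer-encoding",
    "upgrade",
})


def forwardable_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Return the headers to copy to the upstream request.

    Everything is kept except hop-by-hop headers and any header that the
    client itself listed in its Connection header.
    """
    connection_value = ""
    for name, value in incoming.items():
        if name.lower() == "connection":
            connection_value = value
    # Headers named in Connection are hop-by-hop for this connection only.
    named = {tok.strip().lower() for tok in connection_value.split(",") if tok.strip()}
    drop = HOP_BY_HOP | named
    return {name: value for name, value in incoming.items() if name.lower() not in drop}
```

End-to-end headers such as Authorization and Content-Type survive the filter, so the upstream sees the same credentials and body type the client sent.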

Flag        Default             Purpose
--upstream  (required)          Base URL to forward to, e.g. https://api.openai.com
--listen    127.0.0.1:8080      Address to bind
--out       zumik-traces.jsonl  JSONL file to append traces to

What it records

For each request the proxy appends one trace with privacy_mode: "metadata":

  • resolved_target - the model field from the request body.
  • Token estimates - a whitespace-based input_tokens estimate over the chat messages or the Responses input, and output_tokens from the upstream usage block. The estimate is an honest approximation, not a claim of tokenizer precision.
  • candidate_reuse_tokens - the stable-prefix token count, reported only when this prefix family has been seen before. It is the upper bound on what a prior compatible request could have supplied.
  • realized_reused_tokens - read from the upstream response's usage.prompt_tokens_details.cached_tokens when present, so it reflects what the provider actually cached.
  • prefix_family_id - a pf_-prefixed id derived from the prefix fingerprint, used to group recurring prefixes.
  • Timing - request-to-response latency. The proxy is non-streaming, so first-token time is not separable and ttft_ms equals latency_ms.

A captured trace

{
  "trace_id": "trc_0",
  "privacy_mode": "metadata",
  "prefix_family_id": "pf_9f1c2a7b4e08",
  "schedule": { "arrival_offset_ms": 1840 },
  "observed": {
    "resolved_target": "gpt-4o",
    "ttft_ms": 2310,
    "latency_ms": 2310,
    "input_tokens": 18240,
    "candidate_reuse_tokens": 17110,
    "realized_reused_tokens": 15360,
    "output_tokens": 412,
    "attempt_count": 1
  }
}
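To get a feel for what these fields say before running the full diagnostic, you can read the capture yourself. The sketch below is a hypothetical helper, not part of the zumik CLI: it assumes each line of the JSONL file is one trace shaped like the example above, and it treats missing reuse fields as zero, which is an assumption.

```python
import json


def reuse_summary(trace: dict) -> dict:
    """Compare possible reuse with the reuse the provider actually served."""
    observed = trace["observed"]
    input_tokens = observed["input_tokens"]
    # Assumption: absent reuse fields mean no reuse was recorded.
    candidate = observed.get("candidate_reuse_tokens", 0)
    realized = observed.get("realized_reused_tokens", 0)
    return {
        "candidate_ratio": candidate / input_tokens if input_tokens else 0.0,
        "realized_ratio": realized / input_tokens if input_tokens else 0.0,
        # Tokens that could have been reused but were not served from cache.
        "missed_tokens": max(candidate - realized, 0),
    }


def summarize_file(path: str) -> list[dict]:
    """Summarize every trace in a JSONL capture, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [reuse_summary(json.loads(line)) for line in handle if line.strip()]
```

Because input_tokens is a whitespace estimate while realized_reused_tokens comes from the provider's own tokenizer, treat the ratios as rough indicators rather than exact figures.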

No raw prompts

The privacy guarantee is structural, not a setting:

  • The prefix fingerprint is a SHA-256 over each prefix block's role and content only. Identical prefixes hash identically - which is how recurring families are detected - but the hash is one-way, so the plaintext is never stored.
  • Only the fingerprint (truncated into the prefix_family_id), token estimates, timing, and the model name are written. The prompt and completion text are forwarded to the upstream and dropped.
  • Captures append to a local JSONL file you control. Nothing is sent to Zumik unless you later run zumik diagnose --api-key.

The fingerprint deliberately covers everything before the final turn, which changes on every request. A bare string input with no separable prefix is hashed whole so that exact repeats still group, but it reports a prefix of zero tokens. Request bodies above 8 MiB are still forwarded but are not token-estimated, which bounds memory use on a local tool.
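A minimal sketch of how fingerprinting and candidate reuse fit together, under several stated assumptions: the names PrefixTracker, fingerprint, and estimate_tokens are hypothetical, message content is assumed to be a plain string, the length-prefixed byte encoding and the "input" placeholder role for bare strings are illustrative choices, and the 12-character truncation is inferred from the sample id shown earlier. Single-turn chats and the 8 MiB cap are not handled here.

```python
import hashlib

FAMILY_ID_HEX_CHARS = 12  # assumption: inferred from the sample prefix_family_id


def estimate_tokens(text: str) -> int:
    """Whitespace-based estimate: count whitespace-separated chunks."""
    return len(text.split())


def fingerprint(blocks: list[tuple[str, str]]) -> str:
    """SHA-256 over each block's role and content, and nothing else."""
    digest = hashlib.sha256()
    for role, content in blocks:
        for field in (role, content):
            data = field.encode("utf-8")
            # Length prefix (an assumption) keeps field boundaries unambiguous.
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


class PrefixTracker:
    """Groups requests by prefix family and reports candidate reuse."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def observe(self, prompt) -> tuple[str, int]:
        if isinstance(prompt, str):
            # Bare string: no separable prefix, so hash the whole input
            # and report a prefix of zero tokens.
            blocks, prefix_tokens = [("input", prompt)], 0
        else:
            # Chat messages: the prefix is everything before the final turn.
            blocks = [(m["role"], m["content"]) for m in prompt[:-1]]
            prefix_tokens = sum(estimate_tokens(content) for _, content in blocks)
        family_id = "pf_" + fingerprint(blocks)[:FAMILY_ID_HEX_CHARS]
        # Candidate reuse counts only once this family has been seen before.
        candidate = prefix_tokens if family_id in self._seen else 0
        self._seen.add(family_id)
        return family_id, candidate
```

Two requests that share a system prompt but differ in the last user turn land in the same family; the first reports no candidate reuse and the second reports the estimated prefix size. Only the truncated hash and the counts leave the function, never the text.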

Trace privacy modes go beyond metadata (tokenized, encrypted_full_fidelity, synthetic) for richer captures, but zumik proxy only ever emits metadata. See data privacy and retention for what each mode means.

Feed it to diagnose

# local report, nothing leaves your machine
zumik diagnose workload.jsonl

# or store the run on a live deployment
zumik diagnose workload.jsonl --api-key zk_live_...

The diagnostic builds the reuse waterfall and recommends the lowest-complexity execution profile the evidence supports. See the CLI reference for the full diagnose flags.
