Observability

The Prometheus /metrics surface that api-core exposes, the two Grafana dashboards in infra/observability (one for the managed API today, one for GPU clusters once they exist), and what a registered BYOC cluster's collector ships.

Operational telemetry for the Zumik stack, built on Prometheus and Grafana: api-core exposes /metrics, Prometheus scrapes it, and two provisioned Grafana dashboards render it. The managed-API side runs on the VPS today and covers api-core; the BYOC side lights up once a GPU cluster is registered and its exporters are scraped.

The `/metrics` surface

api-core exposes Prometheus text-exposition metrics at :8080/metrics from a tiny, dependency-free registry. It is deliberately small - labeled counters plus one latency histogram - which is enough for request rate, error rate, latency percentiles, inference volume by provider, and spend.

Series	Type	What it carries
`http_requests_total{method,status}`	counter	Every HTTP request, labeled by method and status.
`http_request_duration_ms_bucket{le}`	histogram	Request latency in milliseconds, with the accompanying `_sum` and `_count` series.
`zumik_inference_requests_total{provider,profile,region}`	counter	Inference requests from the Execution Broker, labeled by resolved provider, execution profile, and region.
`zumik_inference_charged_micros_total`	counter	Cumulative charged micros, the basis for the spend panel.

The inference counters are emitted by the broker on dispatch, so the profile label is exactly the managed/byok/subscription/byoc value reported on Agent-Execution-Profile.
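To make the shape of the surface concrete, here is a minimal sketch of a dependency-free registry with labeled counters and one histogram, rendered in the Prometheus text exposition format. It is illustrative only: the class names, the bucket bounds, and the choice of Python are assumptions, not api-core's actual implementation.

```python
from collections import defaultdict


class Counter:
    """Labeled, monotonically increasing counter (hypothetical helper)."""

    def __init__(self, name, label_names=()):
        self.name = name
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # One series per distinct combination of label values.
        key = tuple(labels[n] for n in self.label_names)
        self.values[key] += amount

    def render(self):
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self.values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            labels = f"{{{pairs}}}" if pairs else ""
            lines.append(f"{self.name}{labels} {value:g}")
        return lines


class Histogram:
    """Fixed-bucket histogram exposing _bucket, _sum and _count."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0
        self.n = 0

    def observe(self, value):
        self.total += value
        self.n += 1
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.counts[i] += 1
                return
        self.counts[-1] += 1

    def render(self):
        lines = [f"# TYPE {self.name} histogram"]
        cumulative = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            cumulative += count  # exposition buckets are cumulative
            le = "+Inf" if bound == float("inf") else f"{bound:g}"
            lines.append(f'{self.name}_bucket{{le="{le}"}} {cumulative}')
        lines.append(f"{self.name}_sum {self.total:g}")
        lines.append(f"{self.name}_count {self.n}")
        return lines


def render_all(metrics):
    """Join every metric's lines into one /metrics response body."""
    return "\n".join(line for m in metrics for line in m.render()) + "\n"
```

A handler would call `inc` and `observe` per request and return `render_all([...])` as the response body for /metrics.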

Keep /metrics internal: it is scraped over the internal docker network only, and nginx must not proxy it to the public internet. If you set METRICS_TOKEN on api-core, add the matching bearer_token to prometheus.yml so the scrape authenticates.
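The token check on the serving side can be sketched as below. This is a hedged illustration, not api-core's code: the function name is hypothetical, and treating an unset METRICS_TOKEN as "no authentication required" is an assumption drawn from the token being described as optional.

```python
import hmac


def scrape_is_authorized(authorization_header, metrics_token):
    """Return True when a /metrics scrape may proceed.

    authorization_header: raw value of the Authorization header, or None.
    metrics_token: value of METRICS_TOKEN, or None/"" when unset.
    """
    if not metrics_token:
        # Assumption: with no token configured, the endpoint relies on
        # network isolation alone.
        return True
    expected = f"Bearer {metrics_token}".encode()
    presented = (authorization_header or "").encode()
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented, expected)
```

Prometheus sends the configured token as `Authorization: Bearer <token>`, which is why the comparison includes the `Bearer ` prefix.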

The Grafana dashboards

Two dashboards live in infra/observability, auto-loaded by the provisioning config on start.

Zumik API (managed providers)

Request rate, p50/p95/p99 latency, 5xx error rate, inference requests by provider, and spend in USD/hour. Backed entirely by the metrics api-core already exposes - it works the moment Prometheus scrapes the origin.

BYOC runtime (GPU / KV cache)

GPU utilization, TTFT-vs-SLA, LMCache hit ratio, Dynamo queue depth, AIBrix replica count, and Mooncake transfer throughput. Populated once a GPU cluster is registered and its exporters are scraped.

The API dashboard derives latency percentiles from the histogram (for example histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))) and spend from the charged-micros counter (sum(rate(zumik_inference_charged_micros_total[1h])) * 3600 / 1000000).
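The arithmetic behind those two queries can be checked by hand. The sketch below is a simplified Python model of what histogram_quantile does (linear interpolation inside the bucket that contains the target rank) and of the spend conversion. The bucket bounds and rates are made-up sample values, and reading "micros" as millionths of a US dollar is an assumption implied by the division by 1000000.

```python
import math


def histogram_quantile(q, buckets):
    """Approximate PromQL's histogram_quantile.

    buckets: list of (upper_bound, cumulative_rate), sorted by bound and
    ending with the +Inf bucket. Simplified: no validation of the input.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in +Inf: report the highest finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


def usd_per_hour(charged_micros_per_second):
    """Mirror of: rate(...) * 3600 / 1000000."""
    return charged_micros_per_second * 3600 / 1_000_000


# Sample per-second bucket rates, as rate(..._bucket[5m]) summed by le.
sample = [(50.0, 6.0), (100.0, 9.0), (250.0, 10.0), (math.inf, 10.0)]
p95_ms = histogram_quantile(0.95, sample)
```

Because the result is interpolated within a bucket, the percentile is only as precise as the bucket bounds around it.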

Run it on the VPS

Prometheus and Grafana live in an overlay compose file, kept out of the core stack so the ops-agent's docker compose restart never has to interpolate Grafana's password. Neither service is published to the host.

docker compose --env-file infra/deploy/.env \
               -f infra/deploy/docker-compose.yml \
               -f infra/observability/docker-compose.observability.yml \
               up -d prometheus grafana

--env-file is required so Grafana receives GRAFANA_ADMIN_PASSWORD - it ships with no default password and refuses to start without one. Reach Grafana over an SSH tunnel (ssh -L 3000:localhost:3000 vps) or put it behind a separately authenticated Cloudflare hostname. Prometheus is bounded to roughly 8 GB / 15 days so the TSDB can't fill the NVMe.

What a BYOC collector ships

A registered BYOC or portable Kubernetes data plane runs its own collector in the customer cloud and exposes DCGM and runtime exporters. To wire it in:

Scrape the cluster's exporters

Add the cluster endpoint as a scrape target, ideally via service discovery in the customer cloud. It exposes DCGM (GPU) plus SGLang/vLLM, LMCache, Dynamo (or the EPP router), and Mooncake series.

Label the series with the cluster id

Tag every series with cluster=<byc_id> - the registered cluster's id from the registry.

Pick it up on the dashboard

The BYOC dashboard's $cluster variable populates from label_values(zumik_byoc_replicas, cluster), so the right cluster appears once its series arrive.
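The first two steps can be sketched as a scrape job that carries the cluster label. The Python below only builds the job as a dictionary. The `job_name`, `static_configs`, `targets`, and `labels` keys are standard Prometheus scrape-config fields, while the function name, the job naming scheme, and the example target are hypothetical. Static targets are used for brevity; service discovery in the customer cloud remains the preferred route.

```python
import json


def byoc_scrape_job(cluster_id, targets):
    """Build one scrape job whose series all carry cluster=<cluster_id>."""
    return {
        "job_name": f"byoc-{cluster_id}",  # hypothetical naming scheme
        "static_configs": [
            {
                "targets": list(targets),
                # Attached to every series scraped from these targets.
                "labels": {"cluster": cluster_id},
            }
        ],
    }


def render_job(job):
    """Serialize for review; translate into prometheus.yml as needed."""
    return json.dumps(job, indent=2, sort_keys=True)
```

Once a job like this is scraped, `label_values(zumik_byoc_replicas, cluster)` has a value to return and the cluster shows up in the dashboard's variable.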

The dashboard reads DCGM for GPU utilization, sglang_time_to_first_token_seconds_bucket against zumik_byoc_target_ttft_ms for TTFT-vs-SLA (note that the histogram is in seconds and the target in milliseconds), LMCache hit/lookup counters for the cache hit ratio, dynamo_router_queue_depth for queue depth (the EPP router substitutes here on the portable profile), zumik_byoc_replicas for the autoscaler's replica count, and mooncake_transfer_bytes_total for transfer throughput. The control plane never touches the GPUs, so this data exists only once a registered cluster is running and its exporters are being scraped.
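Two of those panels reduce to simple ratios, sketched below with made-up numbers. The function names are hypothetical and the dashboard's actual queries may differ; the point is the unit conversion between the seconds-based TTFT histogram and the millisecond target, and the hits-over-lookups form of the cache ratio.

```python
def ttft_sla_ratio(ttft_p95_seconds, target_ttft_ms):
    """TTFT as a fraction of the SLA target; above 1.0 means a breach."""
    return (ttft_p95_seconds * 1000.0) / target_ttft_ms


def cache_hit_ratio(hits, lookups):
    """Hit ratio from two counters' increases over the same window."""
    if lookups == 0:
        return 0.0  # no lookups yet, so nothing to report
    return hits / lookups


# Sample values only: p95 TTFT of 0.4 s against a 500 ms target.
ratio = ttft_sla_ratio(0.4, 500)
hit_ratio = cache_hit_ratio(hits=750, lookups=1000)
```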