Observability
The /metrics Prometheus surface that api-core exposes, the two Grafana dashboards in infra/observability (one for the managed API today, one for GPU clusters when they exist), and what a registered BYOC cluster's collector ships.
Operational telemetry for the Zumik stack. The managed-API side runs on the VPS today and dashboards
api-core; the BYOC side lights up once a GPU cluster is registered
and its exporters are scraped. The stack is Prometheus + Grafana: api-core exposes /metrics,
Prometheus scrapes it, and two provisioned Grafana dashboards render it.
The /metrics surface
api-core exposes Prometheus text-exposition metrics at :8080/metrics from a tiny, dependency-free
registry. It is deliberately small - labeled counters plus one latency histogram - which is enough for
request rate, error rate, latency percentiles, inference volume by provider, and spend.
| Series | Type | What it carries |
|---|---|---|
| http_requests_total{method,status} | counter | Every HTTP request, labeled by method and status. |
| http_request_duration_ms_bucket{le} | histogram | Request latency, with _sum and _count. |
| zumik_inference_requests_total{provider,profile,region} | counter | Inference requests from the Execution Broker, labeled by resolved provider, execution profile, and region. |
| zumik_inference_charged_micros_total | counter | Cumulative charged micros, the basis for the spend panel. |
The inference counters are emitted by the broker on dispatch, so the profile label is exactly the
managed/byok/subscription/byoc value reported on Agent-Execution-Profile.
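To make the shape of such a registry concrete, here is a minimal sketch of labeled counters plus one latency histogram rendered as Prometheus text exposition. It is not api-core's implementation: the Registry class, its method names, and the bucket bounds are all hypothetical, and label values are not escaped.

```python
from collections import defaultdict


def _fmt(value):
    # Print integral values without a trailing ".0"; keep full precision otherwise.
    return str(int(value)) if value == int(value) else repr(value)


class Registry:
    """Hypothetical tiny registry: labeled counters plus one latency histogram."""

    def __init__(self, buckets_ms=(5, 25, 100, 500, 2500)):
        # Bucket bounds are illustrative, not api-core's real ones.
        self.counters = defaultdict(float)  # (name, sorted labels) -> value
        self.buckets = tuple(buckets_ms)
        self.bucket_counts = [0] * len(self.buckets)
        self.hist_sum = 0.0
        self.hist_count = 0

    def inc(self, name, amount=1.0, **labels):
        key = (name, tuple(sorted((k, str(v)) for k, v in labels.items())))
        self.counters[key] += amount

    def observe_ms(self, value):
        # Buckets are cumulative: a sample counts toward every bucket
        # whose upper bound is >= the observed value.
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.hist_sum += value
        self.hist_count += 1

    def render(self):
        lines = []
        for (name, labels), value in sorted(self.counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            series = f"{name}{{{label_str}}}" if label_str else name
            lines.append(f"{series} {_fmt(value)}")
        for bound, count in zip(self.buckets, self.bucket_counts):
            lines.append(f'http_request_duration_ms_bucket{{le="{bound}"}} {count}')
        lines.append(f'http_request_duration_ms_bucket{{le="+Inf"}} {self.hist_count}')
        lines.append(f"http_request_duration_ms_sum {_fmt(self.hist_sum)}")
        lines.append(f"http_request_duration_ms_count {self.hist_count}")
        return "\n".join(lines) + "\n"
```

A handler would call inc("http_requests_total", method="GET", status="200") and observe_ms(elapsed) per request, then serve render() at /metrics.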
Keep /metrics internal - nginx must not proxy it to the public internet. If you set METRICS_TOKEN
on api-core, add the matching bearer_token to prometheus.yml so the scrape authenticates. The
surface is scraped over the internal docker network only.
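For a manual check from inside the docker network, the authenticated scrape can be approximated as below. This is a sketch under assumptions: the helper name is hypothetical, and it assumes api-core checks METRICS_TOKEN as a standard Authorization: Bearer header, which is what Prometheus sends for a configured bearer_token.

```python
import urllib.request


def build_scrape_request(base_url, token=None):
    """Build (but do not send) a GET for /metrics, adding bearer auth if a token is given."""
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    if token:
        # Same header shape Prometheus uses when bearer_token is configured.
        req.add_header("Authorization", f"Bearer {token}")
    return req
```

Passing the result to urllib.request.urlopen sends the scrape. The base URL depends on how api-core is named on the internal network; something like http://api-core:8080 is an assumption, not a documented hostname.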
The Grafana dashboards
Two dashboards live in infra/observability, auto-loaded by the provisioning config on start.
Zumik API (managed providers)
Request rate, p50/p95/p99 latency, 5xx error rate, inference requests by provider, and spend in USD/hour. Backed entirely by the metrics api-core already exposes - it works the moment Prometheus scrapes the origin.
BYOC runtime (GPU / KV cache)
GPU utilization, TTFT-vs-SLA, LMCache hit ratio, Dynamo queue depth, AIBrix replica count, and Mooncake transfer throughput. Populated once a GPU cluster is registered and its exporters are scraped.
The API dashboard derives latency percentiles from the histogram (for example
histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))) and spend from the
charged-micros counter (sum(rate(zumik_inference_charged_micros_total[1h])) * 3600 / 1000000).
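The arithmetic behind those two expressions can be sketched as follows. The quantile function is a simplified approximation of what histogram_quantile does (linear interpolation inside the bucket that contains the target rank), not PromQL's exact implementation, and the spend conversion assumes one micro is one millionth of a USD, as the division by 1000000 implies. Function names are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the last finite bound instead of interpolating.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


def usd_per_hour(micros_per_second):
    """Convert a per-second rate of charged micros into USD per hour."""
    return micros_per_second * 3600 / 1_000_000
```

With illustrative buckets [(5, 10), (25, 60), (100, 95), (500, 100), (math.inf, 100)], the median lands in the 5 to 25 ms bucket and interpolates to about 21 ms. A rate of 250 charged micros per second works out to 0.9 USD/hour.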
Run it on the VPS
Prometheus and Grafana live in an overlay compose file, kept out of the core stack so the
ops-agent's docker compose restart never has to interpolate
Grafana's password. Neither service is published to the host.
docker compose --env-file infra/deploy/.env \
  -f infra/deploy/docker-compose.yml \
  -f infra/observability/docker-compose.observability.yml \
  up -d prometheus grafana

--env-file is required so Grafana receives GRAFANA_ADMIN_PASSWORD - it ships with no default
password and refuses to start without one. Reach Grafana over an SSH tunnel
(ssh -L 3000:localhost:3000 vps) or put it behind a separately authenticated Cloudflare hostname.
Prometheus is bounded to roughly 8 GB / 15 days so the TSDB can't fill the NVMe.
What a BYOC collector ships
A registered BYOC or portable Kubernetes data plane runs its own collector in the customer cloud and exposes DCGM and runtime exporters. To wire it in:
1. Scrape the cluster's exporters. Add the cluster endpoint as a scrape target, ideally via service discovery in the customer cloud. It exposes DCGM (GPU) plus SGLang/vLLM, LMCache, Dynamo (or the EPP router), and Mooncake series.
2. Label the series with the cluster id. Tag every series with cluster=<byc_id> - the registered cluster's id from the registry.
3. Pick it up on the dashboard. The BYOC dashboard's $cluster variable populates from label_values(zumik_byoc_replicas, cluster), so the right cluster appears once its series arrive.
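The effect of the labeling step can be illustrated on raw exposition text. In practice the label would normally be attached by Prometheus itself through target labels or relabeling in the scrape config, so the function below is only a sketch of the resulting series shape. The function name, the byc_123 id, and the pool label in the usage note are hypothetical, and the sketch does not handle label sets that end in a trailing comma.

```python
def add_cluster_label(exposition, cluster_id):
    """Return exposition text with cluster="<id>" added to every sample line."""
    label = f'cluster="{cluster_id}"'
    out = []
    for line in exposition.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)  # keep blank lines and HELP/TYPE comments untouched
            continue
        if "{" in line:
            # Existing label set: insert before the closing brace.
            close = line.rindex("}")
            inner = line[line.index("{") + 1:close]
            sep = "," if inner.strip() else ""
            out.append(f"{line[:close]}{sep}{label}{line[close:]}")
        else:
            # No labels yet: create a label set after the metric name.
            name, _, rest = line.partition(" ")
            out.append(f"{name}{{{label}}} {rest}")
    return "\n".join(out)
```

Given the line zumik_byoc_replicas 3 and the id byc_123, the result is zumik_byoc_replicas{cluster="byc_123"} 3, which is the form the $cluster variable query matches on.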
The dashboard reads DCGM for GPU utilization, sglang_time_to_first_token_seconds_bucket against
zumik_byoc_target_ttft_ms for TTFT-vs-SLA, LMCache hit/lookup counters for the cache hit ratio,
dynamo_router_queue_depth for queue depth (the EPP router substitutes here on the portable profile),
zumik_byoc_replicas for the autoscaler's replica count, and mooncake_transfer_bytes_total for
transfer throughput. The control plane never touches the GPUs, so this data only exists once a cluster
is running it.