BYOC stack
The NVIDIA Dynamo + SGLang + FlashInfer + LMCache + Mooncake + AIBrix data plane - what each component does, what runs where, the infra/byoc Helm chart, replay-gated activation, and the cluster registry. Running it needs your own GPU cluster.
This is the architecture and deploy reference for the BYOC profile. The data plane
is a Kubernetes-native inference stack that runs in the customer's own cloud; Zumik's global control
plane keeps owning policy, alias resolution, diagnostics, billing, and purge evidence. The stack ships
as the infra/byoc Helm chart.
This page documents a deploy that requires a GPU Kubernetes cluster, which you provide. The
control plane (api-core, the cluster registry) ships today; the
chart is what you helm install into your GPU cluster once a replay run justifies
it. Zumik does not provision GPUs for you. Start cheap and managed - BYOC is the escalation path.
The stack
Each layer has one job, and one scheduler owns replica selection.
| Layer | Component | Role |
|---|---|---|
| Orchestration | NVIDIA Dynamo | KV-aware router, disaggregated prefill/decode, SLA-based scheduling across replicas. |
| Runtime | SGLang + FlashInfer | The serving runtime with RadixAttention; FlashInfer kernels for paged KV-cache, Cascade Attention for shared prefixes, MLA, and FP8/FP4. |
| KV management | LMCache | The designated KV-cache layer with pluggable backends (CPU RAM, Redis, Mooncake, S3, NIXL). |
| Cross-node KV | Mooncake Transfer Engine | RDMA-based zero-copy KV transfer for disaggregated prefill/decode. |
| Autoscaling | AIBrix PodAutoscaler | LLM-tailored autoscaler that scales replicas to hold the TTFT SLA, plus distributed KV cache and LoRA management. |
| Control-plane tie | BYOC operator | Registers the cluster and heartbeats it to the Zumik control plane. |
One scheduler owns replica selection per profile. This stack's owner is the Dynamo router - never run Dynamo and an EPP router over the same path. If you want llm-d / EPP scheduling, KServe, or vLLM instead, that is the portable Kubernetes profile.
What runs where
Client
↓
Global product control plane (Zumik: policy, alias resolution, diagnostics, purge, billing)
↓
Customer-cloud data plane (your cloud, your GPUs)
↓
NVIDIA Dynamo router (replica selection, KV-aware routing)
↓
SGLang + FlashInfer runtime (the serving lane)
↓
LMCache + Mooncake (tiered KV reuse + RDMA cross-node transfer)
The control plane never touches the GPUs. The data plane never sees another tenant's policy. That split is what lets a purge in the control plane invalidate a customer-cloud cache without Zumik holding the keys to that cloud - the basis for the stronger purge evidence BYOC delivers.
The KV hierarchy follows GPU HBM, then host RAM, then local NVMe, then an optional remote KV backend. A remote backend is only enabled when the expected recompute cost exceeds the combined lookup, transfer, decompression, queue-delay, and failure-risk cost - never on the assumption that remote reuse is automatically beneficial.
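The enablement rule for the remote tier can be read as a single inequality. The sketch below is illustrative only: the `RemoteKvCosts` type, its field names, and the choice of milliseconds as the unit are assumptions, not part of the chart or of LMCache. It shows the comparison, not how the inputs are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteKvCosts:
    """Expected per-request costs in milliseconds (hypothetical fields)."""
    recompute_ms: float      # cost of recomputing the prefix on the GPU
    lookup_ms: float         # finding the entry in the remote backend
    transfer_ms: float       # moving the KV blocks to the serving node
    decompress_ms: float     # unpacking the blocks before use
    queue_delay_ms: float    # time spent waiting behind other fetches
    failure_risk_ms: float   # expected penalty, e.g. P(failure) * fallback cost


def remote_backend_worthwhile(costs: RemoteKvCosts) -> bool:
    """Return True only when recompute strictly exceeds the full reuse cost."""
    reuse_ms = (
        costs.lookup_ms
        + costs.transfer_ms
        + costs.decompress_ms
        + costs.queue_delay_ms
        + costs.failure_risk_ms
    )
    return costs.recompute_ms > reuse_ms
```

A tie returns False, so the remote backend stays off unless reuse is strictly cheaper, which matches the rule of never assuming remote reuse is beneficial by default.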
The infra/byoc Helm chart
Install once GPUs exist. The chart deploys the runtime, the Dynamo router, the AIBrix autoscaler, and the operator, and registers the cluster with the control plane.
Create the control-plane secret
The operator uses a Zumik API key to register and heartbeat.
kubectl create namespace zumik-byoc
kubectl -n zumik-byoc create secret generic zumik-control-plane \
  --from-literal=api-key=zk_live_xxx
Install, pointing at your model and GPU product
helm install us-east ./infra/byoc \
  --namespace zumik-byoc \
  --set model.served=meta-llama/Llama-3.1-8B-Instruct \
  --set runtime.gpusPerReplica=1 \
  --set autoscaling.targetTtftMs=400
Wire heartbeats to the returned cluster id
The post-install register job logs a byc_... id. Set it so the heartbeat CronJob targets the
right cluster.
helm upgrade us-east ./infra/byoc --reuse-values --set operator.clusterId=byc_xxx
Cluster prerequisites: the NVIDIA GPU operator (or device plugin); the AIBrix CRDs
(autoscaling.aibrix.ai) if autoscaling.enabled; and RDMA-capable nodes if mooncake.enabled.
Without AIBrix, set autoscaling.enabled=false and use a standard HPA on a TTFT metric.
The defaults in values.yaml are a modest single-GPU request and are placeholders, not a
recommendation: real accelerators must exist on the nodes at deploy time, and tensor-parallel serving
means raising model.tensorParallelSize and the GPU count together. Every knob - model size,
quantization, attention backend, LMCache backend, the Mooncake toggle, the Dynamo router, and the
autoscaling envelope - is documented in the chart's values.yaml.
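One way to keep the tensor-parallel size and the GPU count in step is to generate both overrides from a single number. This is a minimal sketch under a stated assumption: each tensor-parallel rank takes exactly one GPU, so both values are equal. The chart's values.yaml remains the authority on how the two knobs relate. `tensor_parallel_overrides` is a hypothetical helper; the two value keys are the ones named on this page.

```python
from typing import List


def tensor_parallel_overrides(tensor_parallel_size: int) -> List[str]:
    """Build helm --set arguments that raise both knobs together.

    Assumption: one GPU per tensor-parallel rank, so gpusPerReplica
    is set to the same number as tensorParallelSize.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    return [
        "--set", f"model.tensorParallelSize={tensor_parallel_size}",
        "--set", f"runtime.gpusPerReplica={tensor_parallel_size}",
    ]


# Example: argument list for a 4-way tensor-parallel upgrade of the release above.
command = ["helm", "upgrade", "us-east", "./infra/byoc", "--reuse-values",
           *tensor_parallel_overrides(4)]
```

Deriving both flags from one input makes it harder to raise one value and forget the other.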
Replay-gated activation
The chart is deployable-later by design. You do not stand up a cluster on a hunch:
- A workload diagnostic returning byoc_pilot_worth_evaluating is the entry point.
- A replay run over recorded traffic is the gate - it must show the blended total cost (infrastructure, operations, the platform fee, engineering) beats the managed-provider bill at equal reliability.
- The workload must sustain the utilization BYOC needs to break even (typically 60-70% GPU utilization during business hours), with a managed-provider fallback for when capacity is short.
For most workloads, provider-native caching wins before this point - see BYOC for the full decision.
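The three conditions above can be combined into one check. The sketch below is a hedged illustration, not the replay tool's actual output format: `ReplayResult`, its field names, and the 0.60 default threshold (the lower end of the 60-70% range quoted above) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayResult:
    """Hypothetical summary of a replay run; costs share one currency and period."""
    diagnostic: str
    infrastructure: float
    operations: float
    platform_fee: float
    engineering: float
    managed_bill: float
    reliability_matches_managed: bool
    business_hours_gpu_utilization: float  # fraction between 0.0 and 1.0
    has_managed_fallback: bool


def passes_replay_gate(r: ReplayResult, min_utilization: float = 0.60) -> bool:
    """Return True only if every activation condition holds."""
    blended = r.infrastructure + r.operations + r.platform_fee + r.engineering
    return (
        r.diagnostic == "byoc_pilot_worth_evaluating"
        and blended < r.managed_bill
        and r.reliability_matches_managed
        and r.business_hours_gpu_utilization >= min_utilization
        and r.has_managed_fallback
    )
```

The cost comparison uses the blended total rather than infrastructure alone, so a cheap GPU bill cannot pass the gate while operations and engineering costs are left out.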
Cluster registry and heartbeat
The registry is control plane only - no GPUs run here. It tracks which clusters exist, their health, and their autoscaling envelope so the Execution Broker can route to them. Registering a cluster does not start one; it records one your operator is running. The full surface is the BYOC clusters API.
Register
POST /v2/byoc/clusters. name and region are required; the runtime stack defaults to this profile
(sglang+flashinfer / lmcache+mooncake / dynamo).
curl https://api.zumik.ai/v2/byoc/clusters \
-H "Authorization: Bearer zk_live_..." \
-H "Content-Type: application/json" \
-d '{
"name": "us-east hot lane",
"region": "us-east-1",
"autoscaling": { "min_replicas": 1, "max_replicas": 8, "target_ttft_ms": 400 }
}'
{
"id": "byc_01jy...",
"object": "byoc_cluster",
"project_id": "prj_...",
"name": "us-east hot lane",
"region": "us-east-1",
"status": "registering",
"runtime": "sglang+flashinfer",
"kv_cache": "lmcache+mooncake",
"orchestrator": "dynamo",
"endpoint": null,
"autoscaling": { "min_replicas": 1, "max_replicas": 8, "target_ttft_ms": 400 },
"last_heartbeat_at": null
}
Heartbeat and lifecycle
A cluster starts in the registering status. The operator reports health with
POST /v2/byoc/clusters/{id}/heartbeat and a status of active, draining, or down; each
heartbeat updates status, last_heartbeat_at, and updated_at. The broker only routes to a cluster
that is heartbeating active.
curl https://api.zumik.ai/v2/byoc/clusters/byc_01jy.../heartbeat \
-H "Authorization: Bearer zk_live_..." \
-H "Content-Type: application/json" \
-d '{"status":"active"}'
GET /v2/byoc/clusters lists the project's clusters; DELETE /v2/byoc/clusters/{id} deregisters one.
Register and deregister are written to the audit log (byoc.cluster.register,
byoc.cluster.deregister).
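The routing rule (only clusters that are heartbeating active receive traffic) might be sketched as follows. The function name `is_routable` and the 90-second staleness window are assumptions for illustration: this page does not state how long a heartbeat stays fresh or how the Execution Broker implements the check.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_routable(
    status: str,
    last_heartbeat_at: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(seconds=90),  # hypothetical freshness window
) -> bool:
    """Return True only for a cluster whose latest heartbeat is active and recent."""
    if status != "active":
        return False  # registering, draining, and down are never routed to
    if last_heartbeat_at is None:
        return False  # registered but has never sent a heartbeat
    return now - last_heartbeat_at <= max_age


# Example with fixed timestamps so the result does not depend on the wall clock.
now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_routable("active", now - timedelta(seconds=30), now)
```

Checking the timestamp as well as the status keeps a cluster whose operator has stopped reporting from being treated as healthy on the strength of its last recorded status.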
Observability and purge
The runtime, router, and GPUs expose Prometheus metrics. Scrape them labeled cluster=<byc_id> and the
BYOC runtime Grafana dashboard renders GPU utilization, TTFT-vs-SLA, LMCache hit ratio, Dynamo queue
depth, replica count, and Mooncake transfer throughput. See observability.
A live cluster participates in purge: a purge job adds a
byoc_runtime processor and invalidates the cluster's KV namespace, so a receipt can reach
verified_namespace_invalidation or verified_physical_purge for state the runtime confirms.