BYOC stack

The NVIDIA Dynamo + SGLang + FlashInfer + LMCache + Mooncake + AIBrix data plane - what each component does, what runs where, the infra/byoc Helm chart, replay-gated activation, and the cluster registry. Running it needs your own GPU cluster.

This is the architecture and deploy reference for the BYOC profile. The data plane is a Kubernetes-native inference stack that runs in the customer's own cloud; Zumik's global control plane keeps owning policy, alias resolution, diagnostics, billing, and purge evidence. The stack ships as the infra/byoc Helm chart.

This page documents a deployment that requires a GPU Kubernetes cluster, which you provide. The control plane (api-core, the cluster registry) ships today; the chart is what you install with helm install into your GPU cluster once replay justifies it. Zumik does not provision GPUs for you. Start cheap and managed - BYOC is the escalation path.

The stack

Each layer has one job, and one scheduler owns replica selection.

Layer | Component | Role
Orchestration | NVIDIA Dynamo | KV-aware router, disaggregated prefill/decode, SLA-based scheduling across replicas.
Runtime | SGLang + FlashInfer | The serving runtime with RadixAttention; FlashInfer kernels for paged KV-cache, Cascade Attention for shared prefixes, MLA, and FP8/FP4.
KV management | LMCache | The designated KV-cache layer with pluggable backends (CPU RAM, Redis, Mooncake, S3, NIXL).
Cross-node KV | Mooncake Transfer Engine | RDMA-based zero-copy KV transfer for disaggregated prefill/decode.
Autoscaling | AIBrix PodAutoscaler | LLM-tailored autoscaler that scales replicas to hold the TTFT SLA, plus distributed KV cache and LoRA management.
Control-plane tie | BYOC operator | Registers the cluster and heartbeats it to the Zumik control plane.

One scheduler owns replica selection per profile. This stack's owner is the Dynamo router - never run Dynamo and an EPP router over the same path. If you want llm-d / EPP scheduling, KServe, or vLLM instead, that is the portable Kubernetes profile.

What runs where

Client
Global product control plane   (Zumik: policy, alias resolution, diagnostics, purge, billing)
Customer-cloud data plane      (your cloud, your GPUs)
    NVIDIA Dynamo router           (replica selection, KV-aware routing)
    SGLang + FlashInfer runtime    (the serving lane)
    LMCache + Mooncake             (tiered KV reuse + RDMA cross-node transfer)

The control plane never touches the GPUs. The data plane never sees another tenant's policy. That split is what lets a purge in the control plane invalidate a customer-cloud cache without Zumik holding the keys to that cloud - the basis for the stronger purge evidence BYOC delivers.

The KV hierarchy follows GPU HBM, then host RAM, then local NVMe, then an optional remote KV backend. A remote backend is only enabled when the expected recompute cost exceeds the combined lookup, transfer, decompression, queue-delay, and failure-risk cost - never on the assumption that remote reuse is automatically beneficial.
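
The enablement rule above reduces to one inequality. The sketch below shows one way to express it. The function name and the use of milliseconds are illustrative assumptions, not part of the chart, and the per-request estimates in the example are made up; real values would come from your own measurements.

```shell
# Hypothetical helper (not part of the chart): decide whether a remote KV
# backend is worth enabling, per the cost rule above.
# All arguments are expected per-request costs in the same unit (ms here).
# Usage: remote_kv_worthwhile RECOMPUTE LOOKUP TRANSFER DECOMPRESS QUEUE_DELAY FAILURE_RISK
remote_kv_worthwhile() {
  awk -v recompute="$1" -v lookup="$2" -v transfer="$3" \
      -v decompress="$4" -v queue="$5" -v risk="$6" '
    BEGIN {
      remote_cost = lookup + transfer + decompress + queue + risk
      # Exit 0 (success) only when recompute is strictly more expensive.
      exit !(recompute + 0 > remote_cost)
    }'
}

# Example with made-up numbers: 180 ms recompute vs 95 ms total remote cost.
if remote_kv_worthwhile 180 5 60 10 15 5; then
  echo "enable remote backend"
else
  echo "keep KV local"
fi
```

With these inputs the remote cost sums to 95 ms, so the example prints "enable remote backend". A tie counts as "keep KV local", which matches the rule that remote reuse is never assumed to be beneficial.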

The infra/byoc Helm chart

Install once GPUs exist. The chart deploys the runtime, the Dynamo router, the AIBrix autoscaler, and the operator, and registers the cluster with the control plane.

Create the control-plane secret

The operator uses a Zumik API key to register and heartbeat.

kubectl create namespace zumik-byoc
kubectl -n zumik-byoc create secret generic zumik-control-plane \
  --from-literal=api-key=zk_live_xxx

Install, pointing at your model and GPU product

helm install us-east ./infra/byoc \
  --namespace zumik-byoc \
  --set model.served=meta-llama/Llama-3.1-8B-Instruct \
  --set runtime.gpusPerReplica=1 \
  --set autoscaling.targetTtftMs=400

Wire heartbeats to the returned cluster id

The post-install register job logs a byc_... id. Set it so the heartbeat CronJob targets the right cluster.

helm upgrade us-east ./infra/byoc --namespace zumik-byoc --reuse-values --set operator.clusterId=byc_xxx

Cluster prerequisites: the NVIDIA GPU operator (or device plugin); the AIBrix CRDs (autoscaling.aibrix.ai) if autoscaling.enabled; and RDMA-capable nodes if mooncake.enabled. Without AIBrix, set autoscaling.enabled=false and use a standard HPA on a TTFT metric.

The defaults in values.yaml are a modest single-GPU request and are placeholders, not a recommendation: real accelerators must exist on the nodes at deploy time, and tensor-parallel serving means raising model.tensorParallelSize and the GPU count together. Every knob - model size, quantization, attention backend, LMCache backend, the Mooncake toggle, the Dynamo router, and the autoscaling envelope - is documented in the chart's values.yaml.
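
As a sketch of raising both values together, the wrapper below sets the tensor-parallel size and the per-replica GPU count in one upgrade. It assumes one GPU per tensor-parallel rank and that runtime.gpusPerReplica (the knob used in the install step) is the GPU count to raise; confirm both against values.yaml. The function name and the HELM override are illustrative and not part of the chart.

```shell
# Hypothetical wrapper: upgrade the release so the tensor-parallel degree and
# the per-replica GPU count always change together.
# Assumption: one GPU per tensor-parallel rank, and runtime.gpusPerReplica is
# the GPU-count knob (it appears in the install step); check values.yaml.
# Usage: set_tensor_parallel RELEASE TP_SIZE
set_tensor_parallel() {
  release="$1"
  tp="$2"
  case "$tp" in
    ''|*[!0-9]*|0*)
      echo "TP size must be a positive integer" >&2
      return 2
      ;;
  esac
  # Set HELM=echo for a dry run that only prints the arguments.
  "${HELM:-helm}" upgrade "$release" ./infra/byoc \
    --namespace zumik-byoc \
    --reuse-values \
    --set model.tensorParallelSize="$tp" \
    --set runtime.gpusPerReplica="$tp"
}
```

Running `HELM=echo set_tensor_parallel us-east 2` prints the arguments that would be passed to helm, which is a cheap way to review the change before applying it to a cluster whose nodes actually have two GPUs available per replica.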

Replay-gated activation

The chart is deployable-later by design. You do not stand up a cluster on a hunch:

  • A workload diagnostic returning byoc_pilot_worth_evaluating is the entry point.
  • A replay run over recorded traffic is the gate - it must show that the blended total cost (infrastructure, operations, the platform fee, engineering) is lower than the managed-provider bill at equal reliability.
  • The workload must sustain the utilization BYOC needs to break even (typically 60-70% GPU utilization during business hours), with a managed-provider fallback for when capacity is short.
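
The cost and utilization conditions above can be checked mechanically once a replay run has produced numbers. The sketch below is a hypothetical helper, not Zumik tooling: the function name, argument order, and example figures are assumptions, and it does not model the "equal reliability" condition, which still has to be judged from the replay results.

```shell
# Hypothetical replay-gate check. Cost arguments are monthly figures in one
# currency; utilization arguments are percentages.
# Usage: replay_gate_passes INFRA OPS PLATFORM_FEE ENGINEERING MANAGED_BILL UTIL_PCT MIN_UTIL_PCT
replay_gate_passes() {
  awk -v infra="$1" -v ops="$2" -v fee="$3" -v eng="$4" \
      -v managed="$5" -v util="$6" -v min_util="$7" '
    BEGIN {
      blended = infra + ops + fee + eng
      cheaper = (blended < managed + 0)       # blended cost beats the managed bill
      busy    = (util + 0 >= min_util + 0)    # sustained utilization meets the floor
      printf "blended=%.2f managed=%.2f cheaper=%d busy=%d\n", blended, managed, cheaper, busy
      # Exit 0 only when both conditions hold.
      exit !(cheaper && busy)
    }'
}
```

For example, `replay_gate_passes 9000 2500 1500 4000 20000 72 65` prints `blended=17000.00 managed=20000.00 cheaper=1 busy=1` and exits 0. The utilization floor is passed in by the caller rather than hard-coded, since the break-even point depends on the workload.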

For most workloads, provider-native caching wins before this point - see BYOC for the full decision.

Cluster registry and heartbeat

The registry is control plane only - no GPUs run here. It tracks which clusters exist, their health, and their autoscaling envelope so the Execution Broker can route to them. Registering a cluster does not start one; it records one your operator is running. The full surface is the BYOC clusters API.

Register

Register a cluster with POST /v2/byoc/clusters. The name and region fields are required; the runtime stack defaults to this profile (sglang+flashinfer / lmcache+mooncake / dynamo).

curl https://api.zumik.ai/v2/byoc/clusters \
  -H "Authorization: Bearer zk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "us-east hot lane",
    "region": "us-east-1",
    "autoscaling": { "min_replicas": 1, "max_replicas": 8, "target_ttft_ms": 400 }
  }'
Response
{
  "id": "byc_01jy...",
  "object": "byoc_cluster",
  "project_id": "prj_...",
  "name": "us-east hot lane",
  "region": "us-east-1",
  "status": "registering",
  "runtime": "sglang+flashinfer",
  "kv_cache": "lmcache+mooncake",
  "orchestrator": "dynamo",
  "endpoint": null,
  "autoscaling": { "min_replicas": 1, "max_replicas": 8, "target_ttft_ms": 400 },
  "last_heartbeat_at": null
}

Heartbeat and lifecycle

A cluster starts in the registering status. The operator reports health with POST /v2/byoc/clusters/{id}/heartbeat and a status of active, draining, or down; each heartbeat updates status, last_heartbeat_at, and updated_at. The broker routes only to a cluster that is heartbeating with status active.

curl https://api.zumik.ai/v2/byoc/clusters/byc_01jy.../heartbeat \
  -H "Authorization: Bearer zk_live_..." \
  -H "Content-Type: application/json" \
  -d '{"status":"active"}'
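
If you send heartbeats from your own scripts rather than the chart's CronJob, it helps to reject anything other than the three documented statuses before calling the API. The wrapper below is a minimal sketch under that assumption. The function name, the ZUMIK_API_KEY variable, and the CURL override are illustrative and not part of the chart or the API.

```shell
# Hypothetical wrapper around the heartbeat call shown above.
# Rejects any status other than active, draining, or down, then POSTs.
# Requires ZUMIK_API_KEY in the environment.
# Usage: byoc_heartbeat CLUSTER_ID STATUS
byoc_heartbeat() {
  cluster_id="$1"
  hb_status="$2"
  case "$hb_status" in
    active|draining|down) ;;
    *)
      echo "invalid status: $hb_status" >&2
      return 2
      ;;
  esac
  # --fail makes curl exit non-zero on HTTP errors so callers can react.
  # Set CURL=echo for a dry run that only prints the arguments.
  "${CURL:-curl}" --fail --silent --show-error \
    "https://api.zumik.ai/v2/byoc/clusters/${cluster_id}/heartbeat" \
    -H "Authorization: Bearer ${ZUMIK_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "{\"status\":\"${hb_status}\"}"
}
```

Note that registering is not accepted here: per the lifecycle above, it is the state a cluster starts in, not one the operator reports.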

GET /v2/byoc/clusters lists the project's clusters; DELETE /v2/byoc/clusters/{id} deregisters one. Register and deregister are written to the audit log (byoc.cluster.register, byoc.cluster.deregister).
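
The two endpoints above follow the same call shape as register and heartbeat. The helpers below are a sketch, assuming the same bearer-token header applies. The function names, the ZUMIK_API_KEY variable, and the CURL override are illustrative, and the response bodies are not shown because this page does not document them.

```shell
# Hypothetical helpers for the list and deregister endpoints.
# Both require ZUMIK_API_KEY in the environment.
# Set CURL=echo for a dry run that only prints the arguments.

# Usage: byoc_list_clusters
byoc_list_clusters() {
  "${CURL:-curl}" --fail --silent --show-error \
    "https://api.zumik.ai/v2/byoc/clusters" \
    -H "Authorization: Bearer ${ZUMIK_API_KEY}"
}

# Usage: byoc_deregister_cluster CLUSTER_ID
byoc_deregister_cluster() {
  # Guard against an empty id so DELETE is never sent to the collection URL.
  if [ -z "${1:-}" ]; then
    echo "cluster id required" >&2
    return 2
  fi
  "${CURL:-curl}" --fail --silent --show-error -X DELETE \
    "https://api.zumik.ai/v2/byoc/clusters/$1" \
    -H "Authorization: Bearer ${ZUMIK_API_KEY}"
}
```

Deregistering removes the registry record only. As with registration, it does not stop or start anything in your cluster.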

Observability and purge

The runtime, router, and GPUs expose Prometheus metrics. Scrape them with the label cluster=<byc_id>, and the BYOC runtime Grafana dashboard renders GPU utilization, TTFT-vs-SLA, LMCache hit ratio, Dynamo queue depth, replica count, and Mooncake transfer throughput. See observability.

A live cluster participates in purge: a purge job adds a byoc_runtime processor and invalidates the cluster's KV namespace, so a receipt can reach verified_namespace_invalidation or verified_physical_purge for state the runtime confirms.
