Portable Kubernetes
The EPP-owned BYOC alternative - llm-d / Gateway API Inference Extension + KServe + vLLM + AIBrix + LMCache, the infra/k8s-portable Helm chart, and how it differs from the Dynamo stack. Running it needs your own GPU cluster.
The portable Kubernetes profile is the cloud-portable, EPP-owned alternative to the
NVIDIA Dynamo BYOC stack. Use it when customers already run
Kubernetes and vLLM, want portability across cloud providers, or prefer llm-d / Gateway API Inference
Extension scheduling over Dynamo. It ships as the infra/k8s-portable Helm chart and registers with
the same control plane as the Dynamo stack.
When this stack serves a request, the BYOC profile reports
Agent-Execution-Profile: byoc_epp.
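One way to confirm which stack served a request is to read that header off a response. The sketch below is a minimal helper, assuming the value is exposed as an HTTP response header; the function name execution_profile and the ENDPOINT variable in the usage comment are hypothetical and not part of the chart.

```shell
# Read HTTP response headers on stdin and print the Agent-Execution-Profile value.
# Header names are case-insensitive, so the name is compared in lowercase;
# carriage returns from the wire format are stripped first.
execution_profile() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "agent-execution-profile" { print $2 }'
}

# Usage (ENDPOINT is a placeholder for your own serving URL):
#   curl -sS -D - -o /dev/null "$ENDPOINT" | execution_profile
```

On this profile the helper should print byoc_epp; a byoc_dynamo value would indicate the request went through the Dynamo stack instead.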
Like the Dynamo stack, this profile is deployable later rather than hosted: it requires a GPU Kubernetes cluster that you provide, with real accelerators present on the nodes at deploy time. This page documents deployment and architecture; it does not describe a turnkey hosted runtime.
The stack
| Layer | Component | Role |
|---|---|---|
| Endpoint selection | llm-d / Gateway API Inference Extension EPP | The single replica-selection owner: prefix-cache-aware + load-aware routing. |
| Model serving | KServe InferenceService | Wraps the runtime with an OpenAI-compatible protocol and request-based autoscaling. |
| Runtime | vLLM (PagedAttention) | The serving runtime. |
| Control plane | AIBrix | LLM-tailored autoscaler, distributed KV cache, and LLM gateway/routing. |
| KV management | LMCache | The designated KV-cache layer with pluggable backends (CPU, Redis, Mooncake, S3, NIXL). |
| Control-plane tie | Portable operator | Registers the cluster and sends heartbeats, using the same contract as the BYOC operator. |
One scheduler owns replica selection per path. This profile's owner is the llm-d / EPP router - never run it and the Dynamo router over the same path. The EPP router routes by prefix-cache locality and queue depth and demonstrates ~3x higher throughput and ~2x faster TTFT versus round-robin.
How it differs from infra/byoc
Both share the same control-plane registration contract, so a portable cluster shows up in the registry exactly like a Dynamo one, tagged with its runtime stack.
| | infra/byoc (Dynamo) | infra/k8s-portable (this) |
|---|---|---|
| Replica-selection owner | NVIDIA Dynamo router | llm-d / EPP router |
| Runtime | SGLang + FlashInfer | vLLM |
| Model serving | direct Deployment | KServe InferenceService |
| Target | NVIDIA-optimized GPU cluster | portable across clouds |
| Autoscaling | AIBrix PodAutoscaler | AIBrix PodAutoscaler + KServe KPA |
| Profile header | byoc_dynamo | byoc_epp |
Pick this profile for portability and an existing Kubernetes/vLLM footprint; pick the Dynamo stack for an NVIDIA-optimized cluster where Dynamo's KV-aware routing and disaggregated serving move the numbers.
The infra/k8s-portable Helm chart
Create the control-plane secret

kubectl create namespace zumik-portable
kubectl -n zumik-portable create secret generic zumik-control-plane \
  --from-literal=api-key=zk_live_xxx

Install, pointing at your model and GPU product

helm install us-east ./infra/k8s-portable \
  --namespace zumik-portable \
  --set model.served=meta-llama/Llama-3.1-8B-Instruct \
  --set runtime.gpusPerReplica=1 \
  --set aibrix.autoscaler.targetTtftMs=400

Wire heartbeats to the returned cluster id

helm upgrade us-east ./infra/k8s-portable --reuse-values --set operator.clusterId=byc_xxx

Cluster prerequisites: the NVIDIA GPU operator (or device plugin); the KServe CRDs
(serving.kserve.io) if kserve.enabled; the Gateway API Inference Extension CRDs plus the llm-d
scheduler image if router.enabled; and the AIBrix CRDs (autoscaling.aibrix.ai,
orchestration.aibrix.ai) if aibrix.autoscaler.enabled. Without AIBrix, set
aibrix.autoscaler.enabled=false and rely on KServe's request-based scaling.
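The CRD prerequisites can be checked before installing. The following is a minimal preflight sketch, assuming kubectl is already pointed at the target cluster; the helper names has_api_group and preflight are hypothetical. It relies on kubectl api-resources --api-group, which lists nothing for a group whose CRDs are not installed. The API group of the Gateway API Inference Extension is not named on this page, so it is left for you to pass in.

```shell
# Succeed if the API server serves at least one resource in the given API group.
has_api_group() {
  [ -n "$(kubectl api-resources --api-group="$1" -o name 2>/dev/null)" ]
}

# Check every API group passed as an argument; print one status line per group
# and return non-zero if any group is missing.
preflight() {
  missing=0
  for group in "$@"; do
    if has_api_group "$group"; then
      echo "ok       $group"
    else
      echo "MISSING  $group"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, with the groups named in the prerequisites above:
#   preflight serving.kserve.io autoscaling.aibrix.ai orchestration.aibrix.ai
```

A non-zero exit means at least one group is absent; either install the matching CRDs or disable the corresponding chart toggle before running helm install.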
runtime.resources defaults to a modest single-GPU request - a placeholder, not a recommendation. For
tensor-parallel serving, raise model.tensorParallelSize and the GPU count together, and pin a node
with runtime.gpuProduct (e.g. NVIDIA-H100-80GB-HBM3). The full knob set - vLLM image, KServe
scaling envelope, EPP router scorers, LMCache backend, AIBrix toggles, and the operator schedule - is
in the chart's values.yaml.
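To illustrate raising the tensor-parallel size and GPU count together, here is a small sketch that builds the matching --set flags from a single number. It assumes one GPU per tensor-parallel rank and that runtime.gpusPerReplica is the GPU count in question; the helper name tp_flags is hypothetical, and the chart's values.yaml remains the authority on which keys apply.

```shell
# Print the --set flags that keep tensor-parallel size and GPU count equal.
# $1: tensor-parallel degree (positive integer)
# $2: GPU product value used to pin nodes
tp_flags() {
  tp="$1"
  product="$2"
  case "$tp" in
    ''|*[!0-9]*|0)
      echo "tensor-parallel size must be a positive integer" >&2
      return 1
      ;;
  esac
  printf '%s\n' "--set model.tensorParallelSize=$tp --set runtime.gpusPerReplica=$tp --set runtime.gpuProduct=$product"
}

# Usage (left unquoted on purpose so the flags split into separate arguments):
#   helm upgrade us-east ./infra/k8s-portable --reuse-values \
#     $(tp_flags 4 NVIDIA-H100-80GB-HBM3)
```

Deriving both values from one input avoids the case where the model is sharded across more ranks than the replica has GPUs.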
Observability
The vLLM runtime, EPP router, and GPUs expose Prometheus metrics. Scrape them labeled
cluster=<byc_id> and reuse the BYOC runtime Grafana dashboard - GPU utilization, TTFT-vs-SLA, LMCache
hit ratio, and replica count are the same series; the queue-depth panel reads the EPP router instead of
Dynamo. See observability.
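As a rough illustration of the TTFT-vs-SLA series, the sketch below builds a PromQL expression for p95 time to first token on one cluster. It assumes vLLM's vllm:time_to_first_token_seconds histogram is being scraped and that your scrape config attaches the cluster label described above; the helper name ttft_p95_query and the PROM_URL variable are hypothetical, and the shipped dashboard may compute the panel differently.

```shell
# Print a PromQL expression for p95 TTFT over a 5-minute window.
# $1: cluster id used as the value of the `cluster` label
ttft_p95_query() {
  printf '%s\n' "histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{cluster=\"$1\"}[5m])))"
}

# Usage against the Prometheus HTTP API (PROM_URL is a placeholder):
#   curl -sG "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(ttft_p95_query byc_xxx)"
```

The result is in seconds, so compare it against aibrix.autoscaler.targetTtftMs divided by 1000.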