Portable Kubernetes

The EPP-owned BYOC alternative - llm-d / Gateway API Inference Extension + KServe + vLLM + AIBrix + LMCache, the infra/k8s-portable Helm chart, and how it differs from the Dynamo stack. Running it needs your own GPU cluster.

The portable Kubernetes profile is the cloud-portable, EPP-owned alternative to the NVIDIA Dynamo BYOC stack. Use it when customers already run Kubernetes and vLLM, want portability across cloud providers, or prefer llm-d / Gateway API Inference Extension scheduling over Dynamo. It ships as the infra/k8s-portable Helm chart and registers with the same control plane as the Dynamo stack.

When this stack serves a request, the BYOC profile reports Agent-Execution-Profile: byoc_epp.
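To confirm which stack served a request, the profile value can be read off the response. The following is a minimal sketch, assuming Agent-Execution-Profile is returned as an HTTP response header. The helper name execution_profile and the ENDPOINT variable in the usage comment are illustrative placeholders, not part of the chart or a documented URL.

```shell
# Print the Agent-Execution-Profile value found in raw HTTP response
# headers on stdin. Header names are matched case-insensitively.
execution_profile() {
  awk -F': *' '
    tolower($1) == "agent-execution-profile" {
      sub(/\r$/, "", $2)   # strip the trailing CR of a CRLF header line
      print $2
      exit
    }'
}

# Usage (ENDPOINT is a placeholder for your gateway URL):
#   curl -s -o /dev/null -D - "$ENDPOINT" | execution_profile
```

A value of byoc_epp indicates this profile; byoc_dynamo indicates the Dynamo stack.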

Like the Dynamo stack, this profile is deployable-later: it requires a GPU Kubernetes cluster that you provide, with real accelerators present on the nodes at deploy time. This page covers deployment and architecture; it is not a turnkey hosted runtime.

The stack

Layer	Component	Role
Endpoint selection	llm-d / Gateway API Inference Extension EPP	The single replica-selection owner: prefix-cache-aware + load-aware routing.
Model serving	KServe InferenceService	Wraps the runtime with an OpenAI-compatible protocol and request-based autoscaling.
Runtime	vLLM (PagedAttention)	The serving runtime.
Control plane	AIBrix	LLM-tailored autoscaler, distributed KV cache, and LLM gateway/routing.
KV management	LMCache	The designated KV-cache layer with pluggable backends (CPU, Redis, Mooncake, S3, NIXL).
Control-plane tie	Portable operator	Registers the cluster and sends its heartbeats, using the same contract as the BYOC operator.

One scheduler owns replica selection per path. In this profile the owner is the llm-d / EPP router; never run it and the Dynamo router over the same path. The EPP router routes by prefix-cache locality and queue depth, and demonstrates ~3x higher throughput and ~2x faster TTFT (time to first token) than round-robin routing.
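To make "prefix-cache locality and queue depth" concrete, here is a toy scorer. It is an illustrative sketch only, not the EPP router's actual scoring code: the function name pick_replica, the input format, the linear scoring formula, and the PREFIX_WEIGHT / QUEUE_WEIGHT variables are all assumptions made for this example.

```shell
# Read "name prefix_hit_ratio queue_depth" lines on stdin and print the
# replica name with the highest combined score.
# score = PREFIX_WEIGHT * prefix_hit_ratio - QUEUE_WEIGHT * (queue / max queue)
pick_replica() {
  awk -v wp="${PREFIX_WEIGHT:-2}" -v wq="${QUEUE_WEIGHT:-1}" '
    { name[NR] = $1; hit[NR] = $2; q[NR] = $3; if ($3 > maxq) maxq = $3 }
    END {
      best = ""
      for (i = 1; i <= NR; i++) {
        load = (maxq > 0) ? q[i] / maxq : 0   # normalize queue depth to 0..1
        score = wp * hit[i] - wq * load
        if (best == "" || score > bestscore) { best = name[i]; bestscore = score }
      }
      print best
    }'
}

# Usage:
#   printf 'pod-a 0.9 8\npod-b 0.1 1\npod-c 0.0 0\n' | pick_replica
```

With the default weights, a replica that already holds most of the prompt prefix can win despite a longer queue. Setting PREFIX_WEIGHT=0 reduces the choice to pure load-aware routing.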

How it differs from `infra/byoc`

Both share the same control-plane registration contract, so a portable cluster shows up in the registry exactly like a Dynamo one, tagged with its runtime stack.

	`infra/byoc` (Dynamo)	`infra/k8s-portable` (this)
Replica-selection owner	NVIDIA Dynamo router	llm-d / EPP router
Runtime	SGLang + FlashInfer	vLLM
Model serving	direct Deployment	KServe InferenceService
Target	NVIDIA-optimized GPU cluster	portable across clouds
Autoscaling	AIBrix PodAutoscaler	AIBrix PodAutoscaler + KServe KPA
Profile header	`byoc_dynamo`	`byoc_epp`

Pick this profile for portability and an existing Kubernetes/vLLM footprint; pick the Dynamo stack for an NVIDIA-optimized cluster where Dynamo's KV-aware routing and disaggregated serving move the numbers.

The `infra/k8s-portable` Helm chart

Create the control-plane secret

kubectl create namespace zumik-portable
kubectl -n zumik-portable create secret generic zumik-control-plane \
  --from-literal=api-key=zk_live_xxx

Install, pointing at your model and GPU product

helm install us-east ./infra/k8s-portable \
  --namespace zumik-portable \
  --set model.served=meta-llama/Llama-3.1-8B-Instruct \
  --set runtime.gpusPerReplica=1 \
  --set aibrix.autoscaler.targetTtftMs=400

Wire heartbeats to the returned cluster ID

helm upgrade us-east ./infra/k8s-portable --reuse-values --set operator.clusterId=byc_xxx

Cluster prerequisites: the NVIDIA GPU operator (or device plugin); the KServe CRDs (serving.kserve.io) if kserve.enabled; the Gateway API Inference Extension CRDs plus the llm-d scheduler image if router.enabled; and the AIBrix CRDs (autoscaling.aibrix.ai, orchestration.aibrix.ai) if aibrix.autoscaler.enabled. Without AIBrix, set aibrix.autoscaler.enabled=false and rely on KServe's request-based scaling.
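The CRD prerequisites above can be checked before installing. This is a preflight sketch, assuming kubectl is on PATH and pointed at the target cluster. The helper names has_api_group and check_prereqs are hypothetical. The API group of the Gateway API Inference Extension is not listed here because it depends on the installed version; pass it as an extra argument once you know it.

```shell
# Succeeds when the given API group serves at least one resource.
has_api_group() {
  local group="$1"
  [ -n "$(kubectl api-resources --api-group="$group" -o name 2>/dev/null)" ]
}

# Print one status line per API group; return non-zero if any is missing.
check_prereqs() {
  local missing=0 group
  for group in "$@"; do
    if has_api_group "$group"; then
      echo "ok       $group"
    else
      echo "missing  $group"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (groups taken from the prerequisites above):
#   check_prereqs serving.kserve.io autoscaling.aibrix.ai orchestration.aibrix.ai
```

If only the AIBrix groups are reported missing, installing with --set aibrix.autoscaler.enabled=false falls back to KServe's request-based scaling, as described above.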

runtime.resources defaults to a modest single-GPU request - a placeholder, not a recommendation. For tensor-parallel serving, raise model.tensorParallelSize and the GPU count (runtime.gpusPerReplica) together, and pin replicas to a GPU type with runtime.gpuProduct (e.g. NVIDIA-H100-80GB-HBM3). The full knob set - vLLM image, KServe scaling envelope, EPP router scorers, LMCache backend, AIBrix toggles, and the operator schedule - is in the chart's values.yaml.
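One way to keep the tensor-parallel size and the GPU count in step is to derive both flags from a single number. This is a sketch under two assumptions: that the GPU count is the runtime.gpusPerReplica value used in the install command, and that it should equal model.tensorParallelSize. The helper name tp_flags is hypothetical.

```shell
# Print the --set flags for tensor-parallel serving, using the same value
# for the tensor-parallel size and the per-replica GPU count.
# Arguments: tensor-parallel size, GPU product name.
tp_flags() {
  local tp="$1" product="$2"
  case "$tp" in
    ''|*[!0-9]*|0)
      echo "tensor-parallel size must be a positive integer" >&2
      return 1
      ;;
  esac
  printf '%s\n' "--set model.tensorParallelSize=$tp --set runtime.gpusPerReplica=$tp --set runtime.gpuProduct=$product"
}

# Usage (flags are word-split on purpose, so the substitution is unquoted):
#   helm upgrade us-east ./infra/k8s-portable --reuse-values \
#     $(tp_flags 4 NVIDIA-H100-80GB-HBM3)
```

Check the chart's values.yaml before relying on this; if the chart expects a different relationship between the two values, adjust the helper accordingly.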

Observability

The vLLM runtime, EPP router, and GPUs expose Prometheus metrics. Scrape them with the label cluster=<byc_id> and reuse the BYOC runtime Grafana dashboard: GPU utilization, TTFT-vs-SLA, LMCache hit ratio, and replica count are the same series, while the queue-depth panel reads from the EPP router instead of Dynamo. See observability.
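The cluster label can be attached at scrape time. The sketch below prints a Prometheus scrape job that uses static_configs with a labels block. It assumes you manage scrape jobs by hand; in-cluster Prometheus setups often use Kubernetes service discovery instead. The helper name scrape_job and the target address in the usage comment are placeholders, not values from the chart.

```shell
# Print a Prometheus scrape job that attaches the cluster label to every
# series from one target. Arguments: job name, host:port, cluster ID.
scrape_job() {
  local job="$1" target="$2" cluster="$3"
  cat <<EOF
- job_name: "$job"
  static_configs:
    - targets: ["$target"]
      labels:
        cluster: "$cluster"
EOF
}

# Usage (the target address is a placeholder):
#   scrape_job vllm vllm.zumik-portable.svc:8000 byc_xxx >> scrape_jobs.yaml
```

The emitted entries belong under scrape_configs in the Prometheus configuration, one job per metrics source.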
