Portable Kubernetes

The EPP-owned BYOC alternative - llm-d / Gateway API Inference Extension + KServe + vLLM + AIBrix + LMCache, the infra/k8s-portable Helm chart, and how it differs from the Dynamo stack. Running it needs your own GPU cluster.

The portable Kubernetes profile is the cloud-portable, EPP-owned alternative to the NVIDIA Dynamo BYOC stack. Use it when customers already run Kubernetes and vLLM, want portability across cloud providers, or prefer llm-d / Gateway API Inference Extension scheduling over Dynamo. It ships as the infra/k8s-portable Helm chart and registers with the same control plane as the Dynamo stack.

When this stack serves a request, the response carries the header Agent-Execution-Profile: byoc_epp.
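One way to confirm which profile served a request is to read that header off the response. The helper below is a minimal sketch: the function name `execution_profile` is made up for illustration, and the endpoint URL and API-key variable in the usage comment are placeholders, not documented values.

```shell
# Hypothetical helper: print the Agent-Execution-Profile value found in raw
# HTTP response headers supplied on stdin.
execution_profile() {
  # Strip carriage returns, then match the header name case-insensitively
  # (HTTP/2 responses are printed by curl with lowercase header names).
  tr -d '\r' | awk -F': *' 'tolower($1) == "agent-execution-profile" { print $2; exit }'
}

# Usage (placeholders: $ZUMIK_ENDPOINT and $ZUMIK_API_KEY are assumptions):
#   curl -sS -D - -o /dev/null "$ZUMIK_ENDPOINT" \
#     -H "Authorization: Bearer $ZUMIK_API_KEY" | execution_profile
```

For a request served by this stack the helper should print byoc_epp; a request served by the Dynamo stack should print byoc_dynamo.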

Like the Dynamo stack, this profile is not hosted for you: you deploy it later onto a GPU Kubernetes cluster that you provide, and real accelerators must exist on the nodes at deploy time. This page is deployment and architecture documentation, not a description of a turnkey hosted runtime.

The stack

| Layer | Component | Role |
| --- | --- | --- |
| Endpoint selection | llm-d / Gateway API Inference Extension EPP | The single replica-selection owner: prefix-cache-aware + load-aware routing. |
| Model serving | KServe InferenceService | Wraps the runtime with an OpenAI-compatible protocol and request-based autoscaling. |
| Runtime | vLLM (PagedAttention) | The serving runtime. |
| Control plane | AIBrix | LLM-tailored autoscaler, distributed KV cache, and LLM gateway/routing. |
| KV management | LMCache | The designated KV-cache layer with pluggable backends (CPU, Redis, Mooncake, S3, NIXL). |
| Control-plane tie | Portable operator | Registers the cluster and heartbeats it, identical contract to the BYOC operator. |

One scheduler owns replica selection per path. This profile's owner is the llm-d / EPP router; never run it and the Dynamo router over the same path. The EPP router routes by prefix-cache locality and queue depth, and has demonstrated roughly 3x higher throughput and 2x faster time to first token (TTFT) than round-robin routing.
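To make "prefix-cache locality and queue depth" concrete, here is a toy scoring sketch. It is not the EPP router's actual scorer: the input format, the function name `pick_replica`, the linear score, and the default weight of 0.1 are all assumptions chosen for illustration.

```shell
# Toy illustration only: choose the replica with the best trade-off between
# prefix-cache hit ratio (higher is better) and queue depth (lower is better).
# stdin: one replica per line as "<name> <prefix_hit_ratio 0..1> <queue_depth>"
# $1:    hypothetical penalty weight per queued request (default 0.1)
pick_replica() {
  awk -v w="${1:-0.1}" '
    NF == 3 {
      score = $2 - w * $3                      # reward locality, penalize load
      if (best == "" || score > best_score) {
        best = $1
        best_score = score
      }
    }
    END { if (best != "") print best }
  '
}

# Usage:
#   printf 'a 0.9 8\nb 0.2 0\nc 0.8 1\n' | pick_replica
```

In the usage line, replica a has the warmest cache but a deep queue, and b is idle but cold, so the sketch picks c. A round-robin policy ignores both signals, which is the gap the throughput and TTFT figures above describe.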

How it differs from infra/byoc

Both share the same control-plane registration contract, so a portable cluster shows up in the registry exactly like a Dynamo one, tagged with its runtime stack.

| | infra/byoc (Dynamo) | infra/k8s-portable (this) |
| --- | --- | --- |
| Replica-selection owner | NVIDIA Dynamo router | llm-d / EPP router |
| Runtime | SGLang + FlashInfer | vLLM |
| Model serving | direct Deployment | KServe InferenceService |
| Target | NVIDIA-optimized GPU cluster | portable across clouds |
| Autoscaling | AIBrix PodAutoscaler | AIBrix PodAutoscaler + KServe KPA |
| Profile header | byoc_dynamo | byoc_epp |

Pick this profile for portability and an existing Kubernetes/vLLM footprint; pick the Dynamo stack for an NVIDIA-optimized cluster where Dynamo's KV-aware routing and disaggregated serving move the numbers.

The infra/k8s-portable Helm chart

Create the control-plane secret

kubectl create namespace zumik-portable
kubectl -n zumik-portable create secret generic zumik-control-plane \
  --from-literal=api-key=zk_live_xxx

Install, pointing at your model and GPU product

helm install us-east ./infra/k8s-portable \
  --namespace zumik-portable \
  --set model.served=meta-llama/Llama-3.1-8B-Instruct \
  --set runtime.gpusPerReplica=1 \
  --set aibrix.autoscaler.targetTtftMs=400

Wire heartbeats to the returned cluster id

helm upgrade us-east ./infra/k8s-portable \
  --namespace zumik-portable \
  --reuse-values \
  --set operator.clusterId=byc_xxx

Cluster prerequisites: the NVIDIA GPU operator (or device plugin); the KServe CRDs (serving.kserve.io) if kserve.enabled; the Gateway API Inference Extension CRDs plus the llm-d scheduler image if router.enabled; and the AIBrix CRDs (autoscaling.aibrix.ai, orchestration.aibrix.ai) if aibrix.autoscaler.enabled. Without AIBrix, set aibrix.autoscaler.enabled=false and rely on KServe's request-based scaling.
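A quick pre-flight check can catch missing CRDs before the install fails. The sketch below checks only the API groups named above; the function name `missing_api_groups` is hypothetical, and the Gateway API Inference Extension group is left out because this page does not name it.

```shell
# Hypothetical helper: print each required API group that the cluster does
# not serve. stdin is the output of `kubectl api-versions` (lines such as
# "serving.kserve.io/v1beta1"); arguments are the groups to require.
missing_api_groups() {
  # Reduce "group/version" lines to unique group names.
  available_groups=$(cut -d/ -f1 | sort -u)
  for group in "$@"; do
    # -F: fixed string, -x: whole-line match, so dots are not wildcards.
    if ! printf '%s\n' "$available_groups" | grep -Fxq -- "$group"; then
      printf '%s\n' "$group"
    fi
  done
}

# Usage:
#   kubectl api-versions | missing_api_groups \
#     serving.kserve.io autoscaling.aibrix.ai orchestration.aibrix.ai
```

Empty output means every listed group is present. If only the AIBrix groups are reported missing, installing with aibrix.autoscaler.enabled=false is the fallback described above.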

runtime.resources defaults to a modest single-GPU request - a placeholder, not a recommendation. For tensor-parallel serving, raise model.tensorParallelSize and the GPU count together, and pin a node with runtime.gpuProduct (e.g. NVIDIA-H100-80GB-HBM3). The full knob set - vLLM image, KServe scaling envelope, EPP router scorers, LMCache backend, AIBrix toggles, and the operator schedule - is in the chart's values.yaml.
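As a sketch of raising the two values together, the helper below emits matching --set flags. It assumes that "the GPU count" means runtime.gpusPerReplica (the key used in the install command) and that the GPU count should equal the tensor-parallel size; the function name `tensor_parallel_flags` is hypothetical. Check the chart's values.yaml before relying on either assumption.

```shell
# Hypothetical helper: print Helm --set flags that keep the tensor-parallel
# size and the per-replica GPU count equal, and pin a GPU product.
# $1: tensor-parallel size (positive integer)
# $2: GPU product label, e.g. NVIDIA-H100-80GB-HBM3
tensor_parallel_flags() {
  size="$1"
  product="$2"
  case "$size" in
    ''|*[!0-9]*|0)
      echo "tensor-parallel size must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ -z "$product" ]; then
    echo "GPU product is required" >&2
    return 1
  fi
  printf '%s\n' "--set model.tensorParallelSize=$size --set runtime.gpusPerReplica=$size --set runtime.gpuProduct=$product"
}

# Usage (the unquoted substitution is intentional so the flags word-split):
#   helm upgrade us-east ./infra/k8s-portable --namespace zumik-portable \
#     --reuse-values $(tensor_parallel_flags 2 NVIDIA-H100-80GB-HBM3)
```

Generating both values from one argument avoids the common mistake of bumping the tensor-parallel size without giving each replica enough GPUs.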

Observability

The vLLM runtime, EPP router, and GPUs expose Prometheus metrics. Scrape them labeled cluster=<byc_id> and reuse the BYOC runtime Grafana dashboard - GPU utilization, TTFT-vs-SLA, LMCache hit ratio, and replica count are the same series; the queue-depth panel reads the EPP router instead of Dynamo. See observability.
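For an ad hoc check outside Grafana, a TTFT query scoped to one cluster might look like the sketch below. It assumes the scrape config attaches the cluster label as described and that the runtime exports vLLM's `vllm:time_to_first_token_seconds` histogram; the function name `ttft_p95_query`, the p95 choice, and the `$PROM_URL` variable are illustrative.

```shell
# Hypothetical helper: build a PromQL expression for p95 time to first token
# on one cluster.
# $1: cluster id used as the `cluster` label value, e.g. byc_xxx
# $2: rate window (default 5m)
ttft_p95_query() {
  cluster_id="$1"
  window="${2:-5m}"
  printf '%s\n' "histogram_quantile(0.95, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket{cluster=\"${cluster_id}\"}[${window}])))"
}

# Usage against the Prometheus HTTP API ($PROM_URL is a placeholder):
#   curl -sS -G "$PROM_URL/api/v1/query" \
#     --data-urlencode "query=$(ttft_p95_query byc_xxx)"
```

The result is in seconds, so multiply by 1000 before comparing it with a millisecond target such as aibrix.autoscaler.targetTtftMs.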
