BYOC profile
Self-host the inference data plane in your own cloud while Zumik keeps owning policy, resolution, and purge evidence. This page covers when and why to do it, replay-gated activation, and the control-plane ownership split. Running it requires your own GPUs.
BYOC (bring your own cloud) runs the inference data plane inside the customer's own cloud while Zumik's global control plane keeps owning policy, alias resolution, diagnostics, billing, and purge evidence. It is the escalation path: it earns its complexity only on workloads where a few model paths dominate and replay has proven a material benefit over the managed-provider path.
When a request runs against a registered cluster, the Agent-Execution-Profile header reads byoc_dynamo (the NVIDIA Dynamo stack) or byoc_epp (the portable Kubernetes llm-d / EPP stack).
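As a minimal sketch of how a client might check which stack served a request, the snippet below reads the Agent-Execution-Profile header and maps the two values named above to a label. It assumes the header arrives on the response as an ordinary HTTP header; the function name `byoc_stack` and the label strings are illustrative, not part of the Zumik API.

```python
from typing import Mapping, Optional

# Header name and the two BYOC values come from this page.
# The human-readable labels are illustrative only.
PROFILE_HEADER = "Agent-Execution-Profile"
BYOC_STACKS = {
    "byoc_dynamo": "NVIDIA Dynamo",
    "byoc_epp": "llm-d / EPP (portable Kubernetes)",
}


def byoc_stack(headers: Mapping[str, str]) -> Optional[str]:
    """Return the BYOC stack label, or None if the request did not run on a BYOC cluster."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get(PROFILE_HEADER.lower(), "").strip()
    return BYOC_STACKS.get(value)
```

Any other header value, or a missing header, yields `None`, which a caller can treat as "not served by a BYOC cluster".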
BYOC requires GPU hardware in your own cloud to actually run. This page documents when to choose the profile and how the control plane and data plane split; the deploy and architecture details live on the BYOC stack page. Zumik does not provision GPUs for you.
When and why to self-host
There are concrete capabilities BYOC delivers that the managed and BYOK profiles cannot, because in those profiles the data plane is the provider's cloud:
Dedicated SLOs
A cluster the autoscaler holds to your time-to-first-token target, with sustained hot-model volume that justifies keeping replicas warm.
Private networking
The data-plane endpoint lives in your cloud and is never on the public internet. The broker dispatches to it directly.
Stronger purge evidence
Because you control the cache, a purge can reach verified_namespace_invalidation or verified_physical_purge instead of a provider's expiry-bound best effort.
Explicit KV orchestration
Custom models, regional isolation, and direct control over the KV-cache hierarchy (LMCache + Mooncake) for workloads where that orchestration moves the numbers.
Replay-gated activation
Zumik does not activate BYOC on prefix length or raw volume. It activates BYOC only when replay proves that it beats the managed-provider path, at equal reliability, after the full cost is counted: infrastructure, operations, the platform fee, and engineering burden.
Diagnose
A workload diagnostic that returns byoc_pilot_worth_evaluating is the entry point. If managed providers already capture most of the available reuse, the diagnostic says BYOC loses, and you stop here.
Replay
A replay run over recorded traffic is the gate. It compares the candidate BYOC profile against the managed path on the same traffic and reports the blended-cost verdict.
Confirm utilization
Confirm the workload sustains the utilization BYOC needs to break even (typically 60-70% GPU utilization during business hours) and that a managed-provider fallback exists for when cluster capacity is insufficient.
Register and activate
Register the cluster you are running through the BYOC clusters API; the broker routes to it once it heartbeats active.
The acceptance gate is a material improvement after including operations and support burden. For most workloads, Anthropic explicit caching (90%), Gemini implicit caching (75%), or Fireworks open-source routing eliminates the business case before this point is reached.
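The gate described above can be sketched as a blended-cost comparison. This is an illustration under stated assumptions, not Zumik's replay implementation: the `ByocCost` type, the `byoc_passes_gate` function, and the 15% `min_saving` threshold are hypothetical, and the 60% utilization floor is taken from the low end of the typical range quoted above. Reliability parity is assumed to have been established by the replay run and is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByocCost:
    """Full BYOC cost for the replayed period, in one currency unit."""
    infrastructure: float
    operations: float
    platform_fee: float
    engineering: float

    def total(self) -> float:
        return self.infrastructure + self.operations + self.platform_fee + self.engineering


def byoc_passes_gate(
    byoc: ByocCost,
    managed_cost: float,
    gpu_utilization: float,
    has_managed_fallback: bool,
    min_saving: float = 0.15,       # hypothetical "material improvement" threshold
    min_utilization: float = 0.60,  # low end of the typical break-even range
) -> bool:
    """Return True only if BYOC clears every check on the same replayed traffic."""
    if managed_cost <= 0 or not has_managed_fallback:
        return False
    if gpu_utilization < min_utilization:
        return False
    # Relative saving of BYOC over the managed path, with all costs counted.
    saving = (managed_cost - byoc.total()) / managed_cost
    return saving >= min_saving
```

For example, a BYOC total of 70,000 against a managed cost of 100,000 is a 30% saving and passes at 65% utilization; the same BYOC cost against a managed cost of 75,000 saves under 7% and fails.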
Control-plane ownership
BYOC splits responsibilities cleanly: the global control plane keeps everything Zumik-specific, and the customer-cloud data plane runs the GPUs. The control plane never touches the GPUs, and the data plane never sees another tenant's policy. That split is what lets a purge in the control plane invalidate a cache in the customer cloud without Zumik holding the keys to that cloud.
| Responsibility | Owner |
|---|---|
| Global auth, project policy, alias resolution | Product API Core |
| Provider / cluster selection | Execution Broker |
| Customer-cloud deployment controller | BYOC operator (registers + heartbeats) |
| Replica selection and KV-aware routing | NVIDIA Dynamo router (one scheduler per profile) |
| Default runtime | SGLang with FlashInfer GPU kernels |
| GPU-local KV retention | Runtime (SGLang RadixAttention + FlashInfer cascade attention kernels) |
| Hierarchical cache | SGLang HiCache with LMCache as the KV management layer |
| Remote KV transfer | Mooncake Transfer Engine (RDMA zero-copy cross-node) |
| Kubernetes infrastructure | AIBrix (distributed KV cache, LLM gateway, autoscaler, LoRA) |
| Purge workflow | BYOC operator plus State Service |
| Observability | Customer-cloud collector plus central telemetry |
One scheduler owns replica selection per profile. The Dynamo router owns it here; the portable profile hands that to the llm-d / EPP router. You never run both over the same path. See the BYOC stack for what each component does and the control-plane registry for how a cluster reports in.
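The one-scheduler-per-profile rule can be expressed as a small check. The sketch below is illustrative: the profile values come from this page, while the scheduler identifiers (`dynamo_router`, `llm_d_epp_router`) and both function names are hypothetical.

```python
from typing import AbstractSet

# Profile values are from this page; scheduler identifiers are hypothetical labels.
SCHEDULER_BY_PROFILE = {
    "byoc_dynamo": "dynamo_router",
    "byoc_epp": "llm_d_epp_router",
}


def scheduler_for(profile: str) -> str:
    """Return the single scheduler that owns replica selection for a profile."""
    try:
        return SCHEDULER_BY_PROFILE[profile]
    except KeyError:
        raise ValueError(f"not a BYOC profile: {profile!r}") from None


def path_is_valid(profile: str, active_schedulers: AbstractSet[str]) -> bool:
    """A path is valid only when exactly the profile's own scheduler is active on it."""
    return set(active_schedulers) == {scheduler_for(profile)}
```

Comparing against a one-element set rejects both the "no scheduler" case and the "both routers on the same path" case with a single test.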
Stronger purge, concretely
A live BYOC cluster participates in purge: a purge job adds a byoc_runtime processor and invalidates the cluster's KV namespace, so the signed receipt can reach verified_namespace_invalidation or verified_physical_purge for state the runtime confirms. This is one of the few capabilities that is strictly stronger under BYOC, because the guarantee is only as strong as your control over the cache. A managed provider that cannot actively clear its cache caps out at an expiry-bound best effort.
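One way to picture the cap is as an ordered evidence scale with a ceiling per data plane. The sketch below is an assumption-laden illustration: the two verified level names come from this page, but `EXPIRY_BOUND_BEST_EFFORT` as an identifier, the ordering of physical purge above namespace invalidation, and the `receipt_level` function are hypothetical.

```python
from enum import IntEnum


class PurgeEvidence(IntEnum):
    # Ordering is an assumption: higher value means stronger evidence.
    EXPIRY_BOUND_BEST_EFFORT = 1         # hypothetical identifier for the managed ceiling
    VERIFIED_NAMESPACE_INVALIDATION = 2  # level name from this page
    VERIFIED_PHYSICAL_PURGE = 3          # level name from this page


# Strongest level a receipt can carry, keyed by where the cache lives.
CEILING = {
    "managed": PurgeEvidence.EXPIRY_BOUND_BEST_EFFORT,
    "byoc": PurgeEvidence.VERIFIED_PHYSICAL_PURGE,
}


def receipt_level(data_plane: str, runtime_confirmed: PurgeEvidence) -> PurgeEvidence:
    """Cap what the runtime confirms at the ceiling for that data plane."""
    return min(runtime_confirmed, CEILING[data_plane])
```

Under this model a managed path reports the best-effort level regardless of what is claimed, while a BYOC path reports exactly what its runtime confirms.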
Related profiles
If the picture is "a few hot model paths plus everything else," the right shape is usually hybrid: managed providers for breadth, a BYOC hot lane for the concentrated traffic. The deploy mechanics for each BYOC stack are on the BYOC stack and portable Kubernetes pages.
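The hybrid shape can be sketched as a lane choice per request. This is a simplified illustration, not the Execution Broker's logic: the `Cluster` type, the `choose_lane` function, the lane labels, and the example model paths are hypothetical; only the "active" heartbeat status and the managed fallback on insufficient capacity come from this page.

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional


@dataclass(frozen=True)
class Cluster:
    status: str         # "active" once the cluster heartbeats, per this page
    has_capacity: bool  # hypothetical capacity signal


def choose_lane(
    model_path: str,
    hot_paths: AbstractSet[str],
    cluster: Optional[Cluster],
) -> str:
    """Send hot model paths to BYOC when the cluster can take them, else fall back."""
    if (
        model_path in hot_paths
        and cluster is not None
        and cluster.status == "active"
        and cluster.has_capacity
    ):
        return "byoc"
    # Everything else, including overflow from a full hot lane, stays managed.
    return "managed"
```

A hot path falls back to the managed lane whenever the cluster is absent, not yet active, or out of capacity, which matches the fallback requirement in the activation steps.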
BYOK profile
Bring your own provider key for any of the five first-class providers. Zumik calls the resolved provider with your sealed credential, you keep the billing relationship, and you inherit every provider-native optimization.
Hybrid profile
Managed providers for broad coverage and overflow, with BYOC hot lanes carrying the few model paths concentrated enough to justify dedicated infrastructure. The common shape for coding-agent platforms.