BYOC profile

Self-host the inference data plane in your own cloud while Zumik keeps ownership of policy, resolution, and purge evidence. This page covers when and why to self-host, replay-gated activation, and the control-plane ownership split. Running BYOC requires your own GPUs.

BYOC (bring your own cloud) runs the inference data plane inside the customer's own cloud while Zumik's global control plane keeps owning policy, alias resolution, diagnostics, billing, and purge evidence. It is the escalation path: it earns its complexity only on workloads where a few model paths dominate and replay has proven a material benefit over the managed-provider path.

When a request runs against a registered cluster, the Agent-Execution-Profile header reads byoc_dynamo (the NVIDIA Dynamo stack) or byoc_epp (the portable Kubernetes llm-d / EPP stack).
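As a minimal sketch of reading that header on the client side: the header name and the two values come from this page, while the function name, the mapping, and the assumption that the header arrives on the response are illustrative only.

```python
from typing import Mapping, Optional

# Header name and the two BYOC values are taken from this page.
PROFILE_HEADER = "Agent-Execution-Profile"

# Hypothetical lookup table for display purposes; not part of any Zumik SDK.
BYOC_STACKS = {
    "byoc_dynamo": "NVIDIA Dynamo",
    "byoc_epp": "llm-d / EPP (portable Kubernetes)",
}


def byoc_stack(headers: Mapping[str, str]) -> Optional[str]:
    """Return the BYOC stack name, or None if the request did not run on a BYOC cluster."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get(PROFILE_HEADER.lower(), "").strip()
    return BYOC_STACKS.get(value)
```

Any other header value, or a missing header, yields None, which a caller can treat as "not a BYOC execution".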

BYOC requires GPU hardware in your own cloud; Zumik does not provision GPUs for you. This page documents when to choose the profile and how the control plane and data plane split. Deployment and architecture details live on the BYOC stack page.

When and why to self-host

BYOC delivers four concrete capabilities that the managed and BYOK profiles cannot, because in those profiles the data plane runs in the provider's cloud:

Dedicated SLOs

A cluster the autoscaler holds to your time-to-first-token target, with sustained hot-model volume that justifies keeping replicas warm.

Private networking

The data-plane endpoint lives in your cloud and is never on the public internet. The broker dispatches to it directly.

Stronger purge evidence

Because you control the cache, a purge can reach verified_namespace_invalidation or verified_physical_purge instead of a provider's expiry-bound best effort.

Explicit KV orchestration

Custom models, regional isolation, and direct control over the KV-cache hierarchy (LMCache + Mooncake) for workloads where that orchestration moves the numbers.

Replay-gated activation

Zumik does not activate BYOC on prefix length or raw volume. It activates BYOC only when replay proves it beats the managed-provider path at equal reliability, after the full cost is counted: infrastructure, operations, the platform fee, and engineering burden.
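The rule above can be sketched as a small comparison. This is an illustration under stated assumptions, not Zumik's replay implementation: the class, the function, the reliability inputs, and the min_saving threshold (a stand-in for "material benefit") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByocCost:
    """The four cost components named on this page, in the same currency and period."""
    infrastructure: float
    operations: float
    platform_fee: float
    engineering: float

    def total(self) -> float:
        return self.infrastructure + self.operations + self.platform_fee + self.engineering


def byoc_passes_gate(
    byoc: ByocCost,
    managed_cost: float,
    byoc_reliability: float,
    managed_reliability: float,
    min_saving: float = 0.10,  # hypothetical threshold for a "material" improvement
) -> bool:
    """True only if BYOC is at least as reliable and cheaper by min_saving or more."""
    if managed_cost <= 0:
        raise ValueError("managed_cost must be positive")
    # BYOC must not trade reliability for cost.
    if byoc_reliability < managed_reliability:
        return False
    saving = (managed_cost - byoc.total()) / managed_cost
    return saving >= min_saving
```

Prefix length and raw volume do not appear as inputs, which mirrors the point that neither activates BYOC on its own.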

Diagnose

A workload diagnostic that returns byoc_pilot_worth_evaluating is the entry point. If managed providers already capture most of the available reuse, the diagnostic says BYOC loses, and you stop here.

Replay

A replay run over recorded traffic is the gate. It compares the candidate BYOC profile against the managed path on the same traffic and reports the blended-cost verdict.

Confirm utilization

Confirm the workload sustains the utilization BYOC needs to break even (typically 60-70% GPU utilization during business hours) and that a managed-provider fallback exists for when cluster capacity is insufficient.
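The three steps can be read as a short-circuiting checklist. The sketch below assumes the caller already has the diagnostic verdict and the replay result in hand; the function name, parameter names, reason strings, and the 0.60 default (the lower end of the typical range quoted above) are hypothetical.

```python
from typing import Tuple


def evaluate_byoc(
    diagnostic_verdict: str,
    replay_beats_managed: bool,
    business_hours_gpu_utilization: float,
    has_managed_fallback: bool,
    break_even_utilization: float = 0.60,  # assumed default; tune per workload
) -> Tuple[bool, str]:
    """Walk the diagnose, replay, and utilization steps in order; stop at the first failure."""
    # Step 1: the diagnostic is the entry point.
    if diagnostic_verdict != "byoc_pilot_worth_evaluating":
        return False, "diagnostic does not recommend a BYOC pilot"
    # Step 2: replay over recorded traffic is the gate.
    if not replay_beats_managed:
        return False, "replay did not beat the managed path"
    # Step 3: sustained utilization and a fallback path.
    if business_hours_gpu_utilization < break_even_utilization:
        return False, "utilization below break-even"
    if not has_managed_fallback:
        return False, "no managed-provider fallback"
    return True, "all gates passed"
```

Ordering the checks this way keeps the cheapest evidence first: a negative diagnostic ends the evaluation before any replay run is needed.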

The acceptance gate is a material improvement after including operations and support burden. For most workloads, Anthropic explicit caching (90%), Gemini implicit caching (75%), or Fireworks open-source routing eliminates the business case before this point is reached.

Control-plane ownership

BYOC splits responsibilities cleanly: the global control plane keeps everything Zumik-specific, and the customer-cloud data plane runs the GPUs. The control plane never touches the GPUs, and the data plane never sees another tenant's policy. That split is what lets a purge in the control plane invalidate a cache in the customer cloud without Zumik holding the keys to that cloud.

Responsibility	Owner
Global auth, project policy, alias resolution	Product API Core
Provider / cluster selection	Execution Broker
Customer-cloud deployment controller	BYOC operator (registers + heartbeats)
Replica selection and KV-aware routing	NVIDIA Dynamo router (one scheduler per profile)
Default runtime	SGLang with FlashInfer GPU kernels
GPU-local KV retention	Runtime (FlashInfer RadixAttention + Cascade Attention)
Hierarchical cache	SGLang HiCache with LMCache as the KV management layer
Remote KV transfer	Mooncake Transfer Engine (RDMA zero-copy cross-node)
Kubernetes infrastructure	AIBrix (distributed KV cache, LLM gateway, autoscaler, LoRA)
Purge workflow	BYOC operator plus State Service
Observability	Customer-cloud collector plus central telemetry

One scheduler owns replica selection per profile. In the byoc_dynamo profile shown in the table above, the Dynamo router owns it; the portable byoc_epp profile hands it to the llm-d / EPP router. You never run both over the same path. See the BYOC stack page for what each component does and the control-plane registry page for how a cluster reports in.
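The one-scheduler rule can be expressed as a simple invariant check. In this sketch the profile names come from this page, while the router identifiers and both function names are hypothetical labels chosen for illustration.

```python
from typing import Iterable

# Profile names are from this page; the router identifiers are hypothetical labels.
ROUTER_BY_PROFILE = {
    "byoc_dynamo": "dynamo_router",
    "byoc_epp": "llm_d_epp_router",
}


def router_for(profile: str) -> str:
    """Return the single scheduler that owns replica selection for a BYOC profile."""
    try:
        return ROUTER_BY_PROFILE[profile]
    except KeyError:
        raise ValueError(f"not a BYOC profile: {profile!r}") from None


def assert_single_scheduler(routers_on_path: Iterable[str]) -> str:
    """Raise if a model path has zero schedulers or more than one distinct scheduler."""
    distinct = set(routers_on_path)
    if len(distinct) != 1:
        raise ValueError(f"expected exactly one scheduler per path, got {sorted(distinct)}")
    return next(iter(distinct))
```

A deployment check could call assert_single_scheduler for each model path before a cluster is registered.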

Stronger purge, concretely

A live BYOC cluster participates in purge: a purge job adds a byoc_runtime processor and invalidates the cluster's KV namespace, so the signed receipt can reach verified_namespace_invalidation or verified_physical_purge for state the runtime confirms. This is one of the few capabilities that is strictly stronger under BYOC, because the guarantee is only as strong as your control over the cache. A managed provider that cannot actively clear its cache caps out at an expiry-bound best effort.
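A rough sketch of how the evidence level on a receipt might be derived follows. The two verified_* names come from this page; the expiry_bound_best_effort label, the ranking of physical purge above namespace invalidation, and the function itself are assumptions made for illustration.

```python
from typing import Optional

# Higher rank means stronger evidence. The ordering of the two verified levels
# is an assumption; this page does not state which one ranks higher.
EVIDENCE_RANK = {
    "expiry_bound_best_effort": 0,  # hypothetical label for the managed-provider ceiling
    "verified_namespace_invalidation": 1,
    "verified_physical_purge": 2,
}


def receipt_evidence(has_byoc_runtime_processor: bool, runtime_confirmed: Optional[str]) -> str:
    """Return the strongest evidence level a receipt can claim for one purge job."""
    baseline = "expiry_bound_best_effort"
    # Without a byoc_runtime processor, or without confirmation from the runtime,
    # the receipt cannot claim more than the baseline.
    if not has_byoc_runtime_processor or runtime_confirmed is None:
        return baseline
    if runtime_confirmed not in EVIDENCE_RANK:
        raise ValueError(f"unknown evidence level: {runtime_confirmed!r}")
    return max(baseline, runtime_confirmed, key=EVIDENCE_RANK.__getitem__)
```

The receipt never claims more than the runtime confirmed, which is the sense in which the guarantee is only as strong as your control over the cache.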

If the picture is "a few hot model paths plus everything else," the right shape is usually hybrid: managed providers for breadth, a BYOC hot lane for the concentrated traffic. The deploy mechanics for either BYOC stack are on the BYOC stack and portable Kubernetes pages.