Zumik
Execution profiles

Hybrid profile

Managed providers for broad coverage and overflow, with BYOC hot lanes carrying the few model paths concentrated enough to justify dedicated infrastructure. The common shape for coding-agent platforms.

Hybrid is not a third data plane. It is managed providers and BYOC running side by side, with the Execution Broker sending each request to whichever fits. A small number of hot model paths go to dedicated BYOC clusters; everything else - the long tail of models, burst overflow, anything that does not clear the replay bar - stays on the managed-provider path.

This is the shape most real platforms converge on, because real traffic is concentrated: a few model paths carry most of the volume, and the rest is a wide, thin tail that would never pay back a GPU.

Why the split works

                      ┌─────────────────────────────┐
   concentrated  ───▶ │ BYOC hot lane (your GPUs)    │  Dynamo + SGLang + FlashInfer + LMCache
   hot paths          └─────────────────────────────┘
Broker ─┤
                      ┌─────────────────────────────┐
   everything    ───▶ │ Managed providers (Bifrost) │  broad coverage, failover, native caching
   else + overflow    └─────────────────────────────┘

The hot lane earns dedicated infrastructure because its volume sustains the utilization a BYOC cluster needs to break even. The managed path absorbs everything that does not: it carries the breadth, it catches overflow when the hot lane is at capacity, and it serves the tail at no fixed cost. Each request still reports the profile that served it on Agent-Execution-Profile.

Who it fits

Coding-agent platforms

A handful of dominant model paths (one or two coding models running constantly) plus a long tail of occasional models. The dominant paths justify a hot lane; the tail does not.

Enterprise tenants

Strict isolation for the concentrated workload on dedicated clusters, with managed providers covering the rest without standing up infrastructure for every model.

Predictable high volume

Sustained, forecastable traffic on known paths is exactly the condition under which a BYOC lane breaks even, while bursts spill to managed overflow.

Gradual migration

Move one proven hot path to BYOC at a time, leaving the rest managed, instead of an all-or-nothing cutover.

Activation is still per-lane and replay-gated

A hybrid deployment does not relax the BYOC bar; it applies it one lane at a time. Each candidate hot path goes through the same gate: a workload diagnostic that flags it as worth evaluating, then a replay run proving the blended total cost beats the managed path at equal reliability. A path that fails the gate simply stays managed.

The managed-provider fallback is not optional in a hybrid setup - it is the safety net. When a BYOC hot lane is at capacity or unavailable, the broker routes that traffic back to the managed path rather than dropping it. The overview covers the full escalation order.

Putting it together

Stand up the managed path first (it is the default and needs nothing), prove a hot path with replay, deploy it from the BYOC stack or portable Kubernetes chart, and register the cluster through the BYOC clusters API. The control-plane ownership split is the same as for a pure-BYOC deployment - see BYOC for the table.

On this page