Uptime monitoring and self-healing

How Zumik's Cloudflare Worker probes every property from the edge and auto-remediates the VPS during an incident, paging a human only when automation is exhausted.

Zumik runs a Cloudflare Worker that watches every public property and self-heals the VPS when a company-hosted service goes down. The system has two cooperating pieces:

The uptime worker (infra/cloudflare/uptime-worker) - runs on Cloudflare's edge, probes, decides, and alerts.
The ops-agent (services/ops-agent) - a small daemon on the VPS that executes a fixed allowlist of remediation actions.

The loop

Every minute the worker runs five steps:

Probe from the edge

Each property - api.zumik.ai/healthz, the /v2 surface, zumik.ai, console.zumik.ai, plus auth and docs as alert-only - is probed, latency measured, and classified healthy / degraded / down. The probe bypasses the edge cache so a stale 200 never masks a dead origin.
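A minimal sketch of the probe step, under stated assumptions: the 2000 ms degraded threshold, the rule that 4xx counts as degraded and 5xx as down, and the `_probe` cache-busting query parameter are all illustrative and not taken from the worker's source. The real worker may bypass the cache with a Cloudflare-specific fetch option instead.

```typescript
type Health = "healthy" | "degraded" | "down";

// Assumed threshold: the text does not say what latency counts as degraded.
const DEGRADED_LATENCY_MS = 2000;

// Map an HTTP status (null = network error or timeout) and latency to a state.
function classify(status: number | null, latencyMs: number): Health {
  if (status === null || status >= 500) return "down";
  if (status >= 400 || latencyMs > DEGRADED_LATENCY_MS) return "degraded";
  return "healthy";
}

// Add a unique query parameter so a cached response cannot satisfy the probe.
// "_probe" is a hypothetical parameter name.
function withCacheBuster(url: string, nonce: string): string {
  const u = new URL(url);
  u.searchParams.set("_probe", nonce);
  return u.toString();
}

async function probe(url: string, timeoutMs = 10000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const started = Date.now();
  let status: number | null = null;
  try {
    const res = await fetch(withCacheBuster(url, String(started)), {
      signal: controller.signal,
    });
    status = res.status;
  } catch {
    status = null; // DNS failure, refused connection, or timeout
  } finally {
    clearTimeout(timer);
  }
  const latencyMs = Date.now() - started;
  return { url, status, latencyMs, health: classify(status, latencyMs) };
}
```

Keeping `classify` separate from the network call makes the classification rule easy to unit test without touching a live endpoint.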

Debounce in Workers KV

One blip is ignored; two consecutive failures open an incident.
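The debounce rule can be sketched as a small state machine. The threshold of two consecutive failures comes from the text; the state shape, the `state:` key prefix, and the choice to resolve an incident on the first healthy probe are assumptions. `KVLike` is a hypothetical interface that mirrors only the `get` and `put` calls of a Workers KV binding.

```typescript
interface DebounceState {
  consecutiveFailures: number;
  incidentOpen: boolean;
}
type Transition = "opened" | "resolved" | null;

const FAILURES_TO_OPEN = 2; // from the text: two consecutive failures
const INITIAL: DebounceState = { consecutiveFailures: 0, incidentOpen: false };

// Pure transition function: previous state + probe outcome -> next state.
function step(prev: DebounceState, probeOk: boolean): { state: DebounceState; transition: Transition } {
  if (probeOk) {
    // Assumption: a single healthy probe closes the incident.
    return {
      state: { consecutiveFailures: 0, incidentOpen: false },
      transition: prev.incidentOpen ? "resolved" : null,
    };
  }
  const consecutiveFailures = prev.consecutiveFailures + 1;
  const incidentOpen = prev.incidentOpen || consecutiveFailures >= FAILURES_TO_OPEN;
  return {
    state: { consecutiveFailures, incidentOpen },
    transition: incidentOpen && !prev.incidentOpen ? "opened" : null,
  };
}

// Minimal subset of a KV binding used by this sketch.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Load the per-property state, advance it, and persist the result.
async function debounce(kv: KVLike, property: string, probeOk: boolean): Promise<Transition> {
  const key = `state:${property}`;
  const raw = await kv.get(key);
  const prev: DebounceState = raw ? JSON.parse(raw) : INITIAL;
  const { state, transition } = step(prev, probeOk);
  await kv.put(key, JSON.stringify(state));
  return transition;
}
```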

Alert Discord

A message fires on every state transition.
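As an illustration of alerting only on transitions, the sketch below builds a message when the state changes and posts it to a Discord webhook. The message wording and function names are hypothetical. The request body uses the `content` and `allowed_mentions` fields of Discord's webhook API, with mentions disabled so that routine alerts never ping anyone.

```typescript
type Health = "healthy" | "degraded" | "down";

// Return a message only when the state actually changed; null means stay silent.
function transitionMessage(property: string, prev: Health, next: Health): string | null {
  if (prev === next) return null;
  const label = next === "healthy" ? "RECOVERED" : next.toUpperCase();
  return `[${label}] ${property}: ${prev} -> ${next}`;
}

// Post a plain message to a Discord webhook URL.
async function postToDiscord(webhookUrl: string, content: string): Promise<boolean> {
  const res = await fetch(webhookUrl, {
    method: "POST",
    headers: { "content-type": "application/json" },
    // parse: [] suppresses every mention in a routine alert.
    body: JSON.stringify({ content, allowed_mentions: { parse: [] } }),
  });
  return res.ok;
}
```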

Self-heal

A down VPS service triggers an OpenAI-driven agent on a tight leash: the model may only call the ops-agent's allowlisted actions. It reads diagnostics, restarts the narrowest thing likely to help, re-probes, and repeats for up to MAX_REMEDIATION_ROUNDS rounds.
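The bounded remediation loop might look roughly like the following. The three injected functions are hypothetical stand-ins: `chooseAction` for the model call, `runAction` for the HTTP request to the ops-agent, and `reprobe` for the edge probe. The diagnostics step is omitted, and treating a rejected proposal as a consumed round is an assumption.

```typescript
const ALLOWED_ACTIONS = [
  "restart_api_core", "restart_gateway", "restart_all",
  "recreate_api_core", "reload_nginx", "prune_docker",
] as const;
type Action = (typeof ALLOWED_ACTIONS)[number];

// The model's output is untrusted text until it passes this check.
function isAllowed(name: string): name is Action {
  return (ALLOWED_ACTIONS as readonly string[]).indexOf(name) !== -1;
}

interface Deps {
  chooseAction(history: string[]): Promise<string>; // stand-in for the model
  runAction(action: Action): Promise<string>;       // stand-in for the ops-agent call
  reprobe(): Promise<boolean>;                       // true when the service is healthy
}

async function remediate(deps: Deps, maxRounds: number) {
  const history: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const proposed = await deps.chooseAction(history);
    if (!isAllowed(proposed)) {
      // Rejected proposals use up a round and never reach the VPS.
      history.push(`rejected:${proposed}`);
      continue;
    }
    const output = await deps.runAction(proposed);
    history.push(`${proposed}:${output}`);
    if (await deps.reprobe()) return { healed: true, rounds: round, history };
  }
  // Automation is exhausted; the caller would escalate to a human.
  return { healed: false, rounds: maxRounds, history };
}
```

Validating the proposal in the worker as well as in the ops-agent means a misbehaving model wastes a round instead of generating a request.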

Page a human

The on-call Discord user is @mentioned exactly once per incident, when automation is exhausted or the host is unreachable.
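One way to guarantee a single page per incident is a flag on the incident record, sketched below. The `Incident` shape, the reason names, and the message text are hypothetical, and in practice the flag would have to be persisted between runs (for example alongside the debounce state). The `<@id>` mention syntax and the `allowed_mentions.users` field follow Discord's message format.

```typescript
interface Incident {
  id: string;
  paged: boolean;
}
type PageReason = "automation_exhausted" | "host_unreachable";

// Build the paging payload the first time only; later calls return null.
function buildPage(incident: Incident, reason: PageReason, userId: string) {
  if (incident.paged) return null;
  incident.paged = true;
  return {
    content: `<@${userId}> incident ${incident.id}: ${reason}, manual intervention needed`,
    // Only the on-call user may be pinged by this message.
    allowed_mentions: { users: [userId] },
  };
}
```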

edge probe ─▶ KV debounce ─▶ Discord alert ─▶ OpenAI agent ─▶ ops-agent action ─▶ re-probe
                                                  │                                   │
                                                  └────────── loop until healthy ◀────┘
                                                  │
                                         exhausted ─▶ @mention on-call (page)

Why an ops-agent instead of SSH

A Worker cannot open an SSH session. The VPS runs a separately supervised ops-agent on its own Cloudflare-proxied hostname (ops.zumik.ai). Because it is a different process from api-core, an API crash never takes the remediation channel down with it.

The agent is intentionally boring and locked down:

Bearer token on every privileged route, compared in constant time. It refuses to start without one.
No shell, ever. Each action maps to one fixed docker/docker compose argv. Request data only selects which allowlisted action runs - it is never interpolated into a command. There is no arbitrary-command path.
Per-command timeout, and exposure only through nginx + Cloudflare with a strict per-IP rate limit. The origin is never reachable directly.
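The bearer-token check could be sketched as follows. This is an illustration rather than the ops-agent's actual code: the function names are hypothetical, and a production service would more likely call a vetted primitive from its runtime's crypto library than hand-roll the comparison.

```typescript
const encoder = new TextEncoder();

// Compare two strings without exiting early on the first mismatch, so the
// loop's running time depends on the input lengths, not on where they differ.
function constantTimeEqual(a: string, b: string): boolean {
  const x = encoder.encode(a);
  const y = encoder.encode(b);
  let diff = x.length ^ y.length;
  const n = Math.max(x.length, y.length);
  for (let i = 0; i < n; i++) {
    diff |= (x[i] ?? 0) ^ (y[i] ?? 0);
  }
  return diff === 0;
}

// Accept only "Bearer <token>" where <token> matches the configured secret.
function isAuthorized(authorizationHeader: string | null, expectedToken: string): boolean {
  const prefix = "Bearer ";
  if (!authorizationHeader || authorizationHeader.indexOf(prefix) !== 0) return false;
  return constantTimeEqual(authorizationHeader.slice(prefix.length), expectedToken);
}
```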

Allowlisted actions

restart_api_core, restart_gateway, restart_all, recreate_api_core, reload_nginx, prune_docker.
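To make the "request data only selects" rule concrete, here is a hypothetical lookup table from action name to a fixed argv. The action names come from the list above; every argv, including the Compose service names `api-core`, `gateway`, and `nginx`, is an assumption for illustration and may differ from what the ops-agent actually runs.

```typescript
// Fixed argv per action. Nothing from the request is ever spliced into these arrays.
const ACTION_ARGV: Record<string, readonly string[]> = {
  restart_api_core: ["docker", "compose", "restart", "api-core"],
  restart_gateway: ["docker", "compose", "restart", "gateway"],
  restart_all: ["docker", "compose", "restart"],
  recreate_api_core: ["docker", "compose", "up", "-d", "--force-recreate", "api-core"],
  reload_nginx: ["docker", "compose", "exec", "-T", "nginx", "nginx", "-s", "reload"],
  prune_docker: ["docker", "system", "prune", "-f"],
};

// Resolve a requested action to its argv, or null if it is not allowlisted.
// The own-property check keeps inherited keys such as "constructor" out.
function resolveArgv(action: unknown): readonly string[] | null {
  if (typeof action !== "string") return null;
  if (!Object.prototype.hasOwnProperty.call(ACTION_ARGV, action)) return null;
  return ACTION_ARGV[action];
}
```

The resulting array would be handed to a process-spawn API that takes an argument vector directly, so no shell ever parses it.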

If nginx itself or the whole VPS is unreachable, ops.zumik.ai is unreachable too. The worker's remediation agent detects that, declines to pretend it can fix it, and pages a human instead.

Configuration

Worker secrets (set with wrangler secret put): DISCORD_WEBHOOK_URL, DISCORD_ALERT_USER_ID, SEPARATE_OPENAI_KEY (a dedicated key, isolated from the platform's customer-serving provider keys), OPS_AGENT_URL, OPS_AGENT_TOKEN, CONTROL_TOKEN.

VPS env: OPS_AGENT_TOKEN (must match the worker; ≥16 chars). The agent ships as a service in the deploy Compose stack with the Docker socket and the compose file mounted.
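A startup check for this configuration might look like the sketch below. The secret names and the 16-character minimum are from the text; the function names and error message are hypothetical.

```typescript
const WORKER_SECRETS = [
  "DISCORD_WEBHOOK_URL", "DISCORD_ALERT_USER_ID", "SEPARATE_OPENAI_KEY",
  "OPS_AGENT_URL", "OPS_AGENT_TOKEN", "CONTROL_TOKEN",
] as const;
type WorkerEnv = Record<(typeof WORKER_SECRETS)[number], string>;

// List every worker secret that is unset or empty.
function missingSecrets(env: Partial<WorkerEnv>): string[] {
  return WORKER_SECRETS.filter((name) => !env[name]);
}

const MIN_OPS_TOKEN_LENGTH = 16; // from the text: at least 16 characters

// VPS side: refuse to start when the token is absent or too short.
function requireOpsAgentToken(token: string | undefined): string {
  if (!token || token.length < MIN_OPS_TOKEN_LENGTH) {
    throw new Error(`OPS_AGENT_TOKEN must be set and at least ${MIN_OPS_TOKEN_LENGTH} characters`);
  }
  return token;
}
```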

Related

This loop watches whether the properties are up. For what they are doing while up - request rate, latency percentiles, error rate, spend, and the GPU/KV series from any registered cluster - see observability. The edge rate-limit rules the worker sits behind are managed in Terraform.
