Uptime monitoring and self-healing
How Zumik's Cloudflare Worker probes every property from the edge and auto-remediates the VPS during an incident, paging a human only when automation is exhausted.
Zumik runs a Cloudflare Worker that watches every public property and self-heals the VPS when a company-hosted service goes down. The system has two cooperating pieces:
- The uptime worker (infra/cloudflare/uptime-worker) - runs on Cloudflare's edge, probes, decides, and alerts.
- The ops-agent (services/ops-agent) - a small daemon on the VPS that executes a fixed allowlist of remediation actions.
The loop
Every minute the worker runs five steps:
Probe from the edge
Each property - api.zumik.ai/healthz, the /v2 surface, zumik.ai, console.zumik.ai, plus
auth and docs as alert-only - is probed, latency measured, and classified
healthy / degraded / down. The probe bypasses the edge cache so a stale 200 never masks a
dead origin.
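As a minimal sketch of the probe step, the classification could look like the following. The latency threshold, the rule that any non-2xx status counts as down, the `_probe` cache-busting query parameter, and every name here (`classify`, `probe`, `FetchLike`) are assumptions for illustration, not the worker's actual code.

```typescript
type Health = "healthy" | "degraded" | "down";

interface ProbeResult {
  url: string;
  status: number | null; // null when the request itself failed
  latencyMs: number;
  health: Health;
}

// Hypothetical threshold: a 2xx response slower than this counts as degraded.
const DEGRADED_AFTER_MS = 2000;

// Assumed rule: no response or a non-2xx status is "down"; a slow 2xx is "degraded".
function classify(status: number | null, latencyMs: number): Health {
  if (status === null || status < 200 || status > 299) return "down";
  if (latencyMs > DEGRADED_AFTER_MS) return "degraded";
  return "healthy";
}

// Minimal fetch shape so the sketch can be exercised without a network.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ status: number }>;

async function probe(
  url: string,
  fetchImpl: FetchLike,
  now: () => number = Date.now,
): Promise<ProbeResult> {
  // A unique query string plus a no-cache request header is one way to
  // avoid being served a cached 200; the real worker may do this differently.
  const target = new URL(url);
  target.searchParams.set("_probe", String(now()));

  const started = now();
  let status: number | null = null;
  try {
    const res = await fetchImpl(target.toString(), {
      headers: { "Cache-Control": "no-cache" },
    });
    status = res.status;
  } catch {
    status = null; // network error or timeout: treated as no response
  }
  const latencyMs = now() - started;
  return { url, status, latencyMs, health: classify(status, latencyMs) };
}
```

Injecting `fetchImpl` and `now` keeps the classification logic testable; inside a Worker the global `fetch` could be passed in directly.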
Debounce in Workers KV
One blip is ignored; two consecutive failures open an incident.
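The debounce rule can be sketched as a small state transition stored per property. The key format, the `DebounceState` shape, and the rule that a single passing probe resets the counter and closes the incident are assumptions; `KVLike` only mirrors the `get`/`put` subset of a Workers KV namespace.

```typescript
interface DebounceState {
  consecutiveFailures: number;
  incidentOpen: boolean;
}

// Minimal subset of a Workers KV namespace, so the sketch runs anywhere.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// From the text: one blip is ignored, two consecutive failures open an incident.
const FAILURES_TO_OPEN = 2;

function nextState(prev: DebounceState, probeFailed: boolean): DebounceState {
  if (!probeFailed) {
    // Assumption: one passing probe resets the counter and closes the incident.
    return { consecutiveFailures: 0, incidentOpen: false };
  }
  const consecutiveFailures = prev.consecutiveFailures + 1;
  return {
    consecutiveFailures,
    incidentOpen: prev.incidentOpen || consecutiveFailures >= FAILURES_TO_OPEN,
  };
}

async function debounce(
  kv: KVLike,
  property: string,
  probeFailed: boolean,
): Promise<DebounceState> {
  const key = `debounce:${property}`; // hypothetical key format
  const raw = await kv.get(key);
  const prev: DebounceState = raw
    ? (JSON.parse(raw) as DebounceState)
    : { consecutiveFailures: 0, incidentOpen: false };
  const next = nextState(prev, probeFailed);
  await kv.put(key, JSON.stringify(next));
  return next;
}
```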
Alert Discord
A message fires on every state transition.
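One way to express "fire only on a transition" is a pure function that returns a webhook body when the state changed and nothing otherwise. The message wording and the function name are hypothetical; `content` and `allowed_mentions` are standard fields of a Discord webhook body, and an empty `parse` list keeps routine alerts from pinging anyone.

```typescript
type PropertyState = "healthy" | "degraded" | "down";

interface DiscordMessage {
  content: string;
  allowed_mentions: { parse: string[] };
}

// Returns a webhook body only when the state actually changed.
function transitionAlert(
  property: string,
  prev: PropertyState,
  next: PropertyState,
): DiscordMessage | null {
  if (prev === next) return null; // steady state: stay quiet
  return {
    content: `${property}: ${prev} -> ${next}`,
    allowed_mentions: { parse: [] }, // no pings for routine transitions
  };
}
```

The returned object could then be sent as a JSON POST to the URL stored in DISCORD_WEBHOOK_URL.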
Self-heal
A down VPS service triggers an OpenAI-driven agent on a tight leash: the model may only call the
ops-agent's allowlisted actions. It reads diagnostics, restarts the narrowest thing likely to help,
re-probes, and repeats up to MAX_REMEDIATION_ROUNDS.
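The bounded loop might be structured roughly as below. The `Deps` interface, the `Outcome` shape, and the choice to count a rejected proposal as a used round are assumptions; the action names come from the allowlist on this page, and the model call is left abstract as `chooseAction`.

```typescript
const ALLOWED_ACTIONS = [
  "restart_api_core",
  "restart_gateway",
  "restart_all",
  "recreate_api_core",
  "reload_nginx",
  "prune_docker",
] as const;
type Action = (typeof ALLOWED_ACTIONS)[number];

type Outcome =
  | { healed: true; rounds: number }
  | { healed: false; reason: "exhausted" | "unreachable"; rounds: number };

// Hypothetical seams: the model call, the ops-agent call, and the re-probe.
interface Deps {
  chooseAction(history: string[]): Promise<string>; // untrusted model output
  runAction(action: Action): Promise<string>; // throws if ops-agent is unreachable
  reprobe(): Promise<boolean>; // true when the property is healthy again
}

function isAllowed(name: string): name is Action {
  return (ALLOWED_ACTIONS as readonly string[]).includes(name);
}

async function remediate(deps: Deps, maxRounds: number): Promise<Outcome> {
  const history: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    const proposed = await deps.chooseAction(history);
    if (!isAllowed(proposed)) {
      // Anything outside the allowlist is never executed.
      // Assumption: the wasted round still counts toward the cap.
      history.push(`rejected: ${proposed}`);
      continue;
    }
    let output: string;
    try {
      output = await deps.runAction(proposed);
    } catch {
      // No remediation channel: stop and let the caller page a human.
      return { healed: false, reason: "unreachable", rounds: round };
    }
    history.push(`${proposed}: ${output}`);
    if (await deps.reprobe()) return { healed: true, rounds: round };
  }
  return { healed: false, reason: "exhausted", rounds: maxRounds };
}
```

Here `maxRounds` plays the role of MAX_REMEDIATION_ROUNDS, and both non-healed outcomes map onto the paging step.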
Page a human
The on-call Discord user is @mentioned exactly once per incident, when automation is exhausted or
the host is unreachable.
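The "exactly once per incident" guarantee can be sketched with a `paged` flag carried on the incident record. The `Incident` shape and the message text are hypothetical; the `<@id>` mention syntax and the `allowed_mentions.users` field are standard Discord conventions, and the user ID would come from DISCORD_ALERT_USER_ID.

```typescript
interface Incident {
  id: string;
  property: string;
  paged: boolean; // set once the on-call user has been mentioned
}

interface PageMessage {
  content: string;
  allowed_mentions: { users: string[] };
}

// Returns the updated incident and, at most once per incident, a page to send.
function pageOnce(
  incident: Incident,
  onCallUserId: string,
  reason: "exhausted" | "unreachable",
): { incident: Incident; message: PageMessage | null } {
  if (incident.paged) return { incident, message: null }; // already paged
  const message: PageMessage = {
    content: `<@${onCallUserId}> ${incident.property} needs a human (${reason}), incident ${incident.id}`,
    allowed_mentions: { users: [onCallUserId] }, // only this user is pinged
  };
  return { incident: { ...incident, paged: true }, message };
}
```

Persisting the returned incident before sending the message is what would make a retry safe, under the assumption that incident state lives alongside the debounce state.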
edge probe ─▶ KV debounce ─▶ Discord alert ─▶ OpenAI agent ─▶ ops-agent action ─▶ re-probe
                                                   │                                   │
                                                   └────────── loop until healthy ◀────┘
                                                   │
                                               exhausted ─▶ @mention on-call (page)
Why an ops-agent instead of SSH
A Worker cannot open an SSH session. The VPS runs a separately supervised ops-agent on its own
Cloudflare-proxied hostname (ops.zumik.ai). Because it is a different process from api-core, an
API crash never takes the remediation channel down with it.
The agent is intentionally boring and locked down:
- Bearer token on every privileged route, compared in constant time. It refuses to start without one.
- No shell, ever. Each action maps to one fixed docker / docker compose argv. Request data only selects which allowlisted action runs - it is never interpolated into a command. There is no arbitrary-command path.
- Per-command timeout, and exposure only through nginx + Cloudflare with a strict per-IP rate limit. The origin is never reachable directly.
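To illustrate the bearer-token check, here is a sketch of a comparison that always walks the full length of both inputs instead of returning at the first mismatch. The ops-agent's implementation language is not stated on this page, so this is written in TypeScript purely as an example; both function names are hypothetical, and a production service would normally rely on its runtime's vetted constant-time primitive.

```typescript
// Compares two strings without an early exit on the first differing byte.
function constantTimeEqual(a: string, b: string): boolean {
  const x = new TextEncoder().encode(a);
  const y = new TextEncoder().encode(b);
  let diff = x.length ^ y.length; // a length mismatch already marks a difference
  const n = Math.max(x.length, y.length);
  for (let i = 0; i < n; i++) {
    diff |= (x[i] ?? 0) ^ (y[i] ?? 0); // out-of-range bytes are read as 0
  }
  return diff === 0;
}

// Accepts only "Bearer <token>"; an empty expected token never authorizes.
function isAuthorized(
  authorizationHeader: string | null | undefined,
  expectedToken: string,
): boolean {
  const prefix = "Bearer ";
  const presented =
    typeof authorizationHeader === "string" && authorizationHeader.startsWith(prefix)
      ? authorizationHeader.slice(prefix.length)
      : "";
  return expectedToken.length > 0 && constantTimeEqual(presented, expectedToken);
}
```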
Allowlisted actions
restart_api_core, restart_gateway, restart_all, recreate_api_core, reload_nginx,
prune_docker.
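The "request data only selects an action" rule can be pictured as a lookup from action name to a fixed argument vector. The specific argv values and the Compose service names (`api-core`, `gateway`, `nginx`) below are assumptions for illustration; only the six action names come from this page.

```typescript
// Each action maps to one fixed argv. Nothing from the request is spliced in.
// The argv values and service names here are hypothetical.
const ACTION_ARGV = new Map<string, readonly string[]>([
  ["restart_api_core", ["docker", "compose", "restart", "api-core"]],
  ["restart_gateway", ["docker", "compose", "restart", "gateway"]],
  ["restart_all", ["docker", "compose", "restart"]],
  ["recreate_api_core", ["docker", "compose", "up", "-d", "--force-recreate", "api-core"]],
  ["reload_nginx", ["docker", "compose", "exec", "-T", "nginx", "nginx", "-s", "reload"]],
  ["prune_docker", ["docker", "system", "prune", "-f"]],
]);

// Returns the fixed argv for an allowlisted action, or null for anything else.
function resolveAction(name: unknown): readonly string[] | null {
  if (typeof name !== "string") return null;
  return ACTION_ARGV.get(name) ?? null;
}
```

Because the result is an argv array rather than a command string, it could be handed to a process-spawning API without a shell, so request data has no path into the command line.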
If nginx itself or the whole VPS is unreachable, ops.zumik.ai is also unreachable. The worker's remediation agent
detects that, declines to pretend it can fix it, and pages a human instead.
Configuration
Worker secrets (set with wrangler secret put): DISCORD_WEBHOOK_URL, DISCORD_ALERT_USER_ID,
SEPARATE_OPENAI_KEY (a dedicated key, isolated from the platform's customer-serving provider
keys), OPS_AGENT_URL, OPS_AGENT_TOKEN, CONTROL_TOKEN.
VPS env: OPS_AGENT_TOKEN (must match the worker; ≥16 chars). The agent ships as a service in the
deploy Compose stack with the Docker socket and the compose file mounted.
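A startup check along these lines could catch a missing secret or a too-short token before the first probe runs. The function name and error wording are hypothetical; the secret names and the 16-character minimum are taken from the text above.

```typescript
const WORKER_SECRETS = [
  "DISCORD_WEBHOOK_URL",
  "DISCORD_ALERT_USER_ID",
  "SEPARATE_OPENAI_KEY",
  "OPS_AGENT_URL",
  "OPS_AGENT_TOKEN",
  "CONTROL_TOKEN",
] as const;

const MIN_TOKEN_LENGTH = 16;

// Returns a list of human-readable problems; an empty list means the config is usable.
function configProblems(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const name of WORKER_SECRETS) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  const token = env["OPS_AGENT_TOKEN"];
  if (token && token.length < MIN_TOKEN_LENGTH) {
    problems.push(`OPS_AGENT_TOKEN must be at least ${MIN_TOKEN_LENGTH} characters`);
  }
  return problems;
}
```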
Related
This loop watches whether the properties are up. For what they are doing while up - request rate, latency percentiles, error rate, spend, and the GPU/KV series from any registered cluster - see observability. The edge rate-limit rules the worker sits behind are managed in Terraform.