At 12:14:03 the daemon restarts ollama.service. At 12:14:31 it restarts it again. At 12:15:02 it restarts it a third time. The service is crashlooping. Each restart accomplishes nothing. The daemon does not know this. It will continue, on a fifteen-minute interval, until the operator notices.
This is the failure mode of an executor without a governor.
Three thrashes that share a shape
Consider three such loops.
First, the restart-loop. A daemon watches caddy. Caddy holds a port that another process won at boot. The monitor reports CADDY_DEGRADED. Triage suggests RESTART_SERVICE. The restart runs. The bind fails. The next tick reports CADDY_DEGRADED. Triage suggests RESTART_SERVICE. The restart runs. The bind fails. Thirty hours later the operator opens a graph and asks why systemd has logged ninety-six failed restarts of the same unit.
Second, the requeue-loop. A PRD has shipped hollow — a build that compiled cleanly but produced no working artifact. Maintenance-crew flags it. Triage suggests REQUEUE_PRD. The PRD returns to the pipeline. The same compiler runs the same source. The same hollow ship arrives. Requeue. Hollow ship. Requeue. Ten times in an hour. Each round burns Kimi tokens, accumulates a Ship-failed commit, and leaves the operator a longer log to read in the morning.
Third, the prune-loop. Disk crosses 85%. Triage suggests CLEANUP_DISK. The script truncates application logs and brings disk to 84%. The script's own log is on the same disk. By the next tick the disk is at 85% again, because the daemon writes and truncates and writes. Cleanup. Logs. Cleanup. Logs. The IO is performed; nobody asked for it.
Three actions, three targets, one shape. The signal flapped, the executor did not.
The alarm and the actuator
There is a distinction here, and it is not subtle.
An alarm is information. A page about caddy can fire every hour for six hours and do no harm — the operator has been told once already, the duplicate page is cheap. Even so, alarm dedup is sensible: hash the signal names, not the body, so a flapping timestamp does not flood the inbox. The unit of "have I told the human about this" is the kind of problem, not the latest paint of it.
An actuator is different. An actuator mutates state. RESTART_SERVICE takes down a process. REQUEUE_PRD moves a file from completed to active and a daemon picks it up. CLEANUP_DISK truncates a log. None of these are free. None of these are reversible without effort. The cost of running an actuator twice is not double; it is sometimes squared.
A human operator does not need to be told this. A human, told a service is down, restarts it once, watches, then escalates. The operator's hands have a built-in rate limit, calibrated by the stakes and by attention. The daemon has no such hands. The daemon will execute the same action against the same target every fifteen minutes until the operator stops it.
Structure must compensate.
The rolling history file
The minimal mechanism is a file.
The file lives at state/recovery-history.json and holds the last two hundred entries. Each entry is a record: timestamp, action, target, outcome. Before any action fires, the executor counts entries matching (action, target) inside a sliding time window. If the count has already reached the per-action cap, the executor does not act: it logs "rate-limited," writes a record with outcome: "rate-limited", and escalates to a human.
The caps are not theoretical. They come from the cadence of the underlying problem.
RESTART_SERVICE: thirty minutes, one per service. A service that wants two restarts in thirty minutes is broken in a way the daemon cannot fix.
REQUEUE_PRD: twenty-four hours, one per slug. A failing PRD needs an operator, not another twenty laps.
CLEANUP_DISK: sixty minutes, one. Truncation is cheap to perform and pointless to repeat.
The granularity matters. Per-(action, target), not per-action. Restarting ollama.service and shipyard-daemon.service draw from independent budgets, because they are independent problems.
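The check can be sketched in a few lines of Python. This is a minimal illustration, not the daemon's actual code. It assumes timestamps are stored as ISO 8601 strings, and it assumes that "rate-limited" records do not themselves consume budget; the source does not settle either point. The names CAPS and allowed are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Caps from the list above: (window, max executions per target in that window).
CAPS = {
    "RESTART_SERVICE": (timedelta(minutes=30), 1),
    "REQUEUE_PRD": (timedelta(hours=24), 1),
    "CLEANUP_DISK": (timedelta(minutes=60), 1),
}

def allowed(history, action, target, now):
    """Return True if (action, target) still has budget inside its window."""
    window, cap = CAPS[action]
    cutoff = now - window
    recent = sum(
        1
        for entry in history
        if entry["action"] == action
        and entry["target"] == target
        # Assumption: a refusal is recorded but does not count as an execution.
        and entry["outcome"] != "rate-limited"
        and datetime.fromisoformat(entry["timestamp"]) >= cutoff
    )
    return recent < cap

now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
history = [{
    "timestamp": (now - timedelta(minutes=10)).isoformat(),
    "action": "RESTART_SERVICE",
    "target": "ollama.service",
    "outcome": "ok",
}]
print(allowed(history, "RESTART_SERVICE", "ollama.service", now))           # False
print(allowed(history, "RESTART_SERVICE", "shipyard-daemon.service", now))  # True
```

The second call succeeds because the key is the pair, not the action. One service's exhausted budget says nothing about the other's.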
The history is the truth. In-memory counters lose their state on restart, and the executor restarts a great deal — that is its job. The file persists.
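A sketch of the persistence side, in Python. The path and the two-hundred-entry limit come from the text above. The function names, the JSON-array layout, and the write-to-temp-then-rename step are assumptions added here; the rename is one common way to keep a crash mid-write from leaving a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path

HISTORY_PATH = Path("state/recovery-history.json")
MAX_ENTRIES = 200  # rolling window: only the newest entries are kept

def load_history(path=HISTORY_PATH):
    """Read the history list; a missing file means an empty history."""
    try:
        return json.loads(Path(path).read_text())
    except FileNotFoundError:
        return []

def append_record(record, path=HISTORY_PATH):
    """Append one record, trim to MAX_ENTRIES, and replace the file atomically."""
    path = Path(path)
    history = load_history(path)
    history.append(record)
    history = history[-MAX_ENTRIES:]
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(history, handle)
    os.replace(tmp_name, path)
    return history
```

Because every call re-reads the file, a freshly restarted executor sees the same history the previous process wrote.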
The second-order failures
Once the governor is in place, the failure modes shift one rung up.
The daemon produces a "rate-limited" record and emits an alert. The alert reaches the operator's inbox. If the alert script hashes the body — which contains a timestamp — every alert is unique, every fifteen minutes another email lands, and the inbox is again the loop. Hash the signal names. The unit of dedup is the problem, not its current paint.
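A minimal Python sketch of that dedup key. The function names are hypothetical, and DISK_HIGH is an invented signal name used only for illustration. The set of sent keys is kept in memory here for brevity; in a daemon that restarts often it would need to live on disk.

```python
import hashlib

def dedup_key(signal_names):
    """Hash the sorted, de-duplicated signal names; the alert body is never hashed."""
    canonical = "\n".join(sorted(set(signal_names)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def should_send(signal_names, already_sent):
    """Return True the first time a given set of signals is seen, False afterwards."""
    key = dedup_key(signal_names)
    if key in already_sent:
        return False
    already_sent.add(key)
    return True

sent = set()
print(should_send(["CADDY_DEGRADED", "DISK_HIGH"], sent))  # True
print(should_send(["DISK_HIGH", "CADDY_DEGRADED"], sent))  # False
```

Sorting makes the key independent of the order in which the monitor reports the signals, and leaving the body out makes it independent of the timestamp.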
The daemon also commits and pushes. If it pushes without first running git pull --rebase, any divergence on origin (an operator pushed a brain edit, a GitHub Action ran) causes the push to be rejected, and the daemon never checks. It then accumulates twenty-two commits ahead of origin without noticing. The pull-rebase is not optional. It is the same family of problem: an autonomous mutator that needs a single line of structure to keep from eating its own work.
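One way to sketch that line of structure in Python: rebase first, push second, and treat a non-zero exit from either as something to report. The function name and the injectable run parameter are assumptions made so the sketch can be exercised without a real repository; git pull --rebase and git push are the only git commands used.

```python
import subprocess

def sync_and_push(repo_dir, run=subprocess.run):
    """Run `git pull --rebase`, then `git push`; stop and report on the first failure."""
    for cmd in (["git", "pull", "--rebase"], ["git", "push"]):
        result = run(cmd, cwd=repo_dir, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface the failure so the caller can escalate to a human.
            return False, f"{' '.join(cmd)} failed: {result.stderr.strip()}"
    return True, "pushed"
```

The caller decides what to do with a False result. The point is only that a rejected push becomes visible on the tick it happens, not twenty-two commits later.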
The lesson
The pipeline that requires no human attention is the pipeline that requires the most thought up front. The executor without a governor is not a tool. It is a runaway shaft, turning at the speed the bus permits, until the bearings give.
The first duty of the operator who builds it is to ensure that it knows how to be idle.
Norbert Wiener spent the 1940s explaining that feedback without damping is not control — it is oscillation with a budget. The daemon learns this on a fifteen-minute interval, or the operator learns it on a thirty-hour one.