It was 10pm. I was testing the contractor embed widget I'd shipped that afternoon — a small diagnostic tool designed to sit on a partner's website so homeowners can describe a symptom and get a diagnosis without leaving the page. I typed into the free-text field: "my garage door sounds like a gunshot."
No match. The widget returned the no-result state: a short message and a list of suggested phrases to try instead. Try again.
I stared at that for a moment. A garage door that sounds like a gunshot almost certainly has a broken torsion spring. That is the single most urgent thing the whole diagnostic system was built to catch. Spring failures are violent and the injury risk during a DIY repair attempt is real. And the widget had just shrugged at the most specific, most dramatic description of that exact emergency.
That was a bug. But fixing it forced a design decision I should explain, because the decision generalizes.
What the diagnose tool actually is
The diagnose function is a structured lookup. Nine canonical symptoms — things like "won't open," "makes grinding noise," "reverses before closing" — each with a list of normalized aliases. When someone types "wont close" or "door wont shut," the function does substring matching against those aliases, finds the canonical symptom, and returns the associated likely issues, cost range, and a DIY safety flag. No LLM. Zero cost per call. Microsecond latency.
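A minimal sketch of that lookup, in TypeScript. The symptom keys, aliases, issues, and costs below are invented for illustration — the real table has nine canonical symptoms — but the shape matches the description: normalized aliases, substring matching, structured results, no model anywhere.

```typescript
type Diagnosis = {
  symptom: string;
  likelyIssues: string[];
  costRange: [number, number]; // USD, low to high
  diySafe: boolean;
};

// A slice of the alias table; the real widget has nine canonical symptoms.
const SYMPTOMS: Record<string, { aliases: string[]; result: Diagnosis }> = {
  wont_close: {
    aliases: ["wont close", "won't close", "wont shut", "door wont shut"],
    result: {
      symptom: "wont_close",
      likelyIssues: ["misaligned safety sensors", "track obstruction"],
      costRange: [75, 150],
      diySafe: true,
    },
  },
  broken_spring: {
    aliases: ["spring broke", "spring snapped", "broken spring"],
    result: {
      symptom: "broken_spring",
      likelyIssues: ["broken torsion spring"],
      costRange: [200, 350],
      diySafe: false, // counterbalance work is never DIY-safe
    },
  },
};

// Normalize the input, then substring-match it against every alias.
function diagnose(input: string): Diagnosis | null {
  const text = input.toLowerCase().replace(/[^a-z0-9\s']/g, "").trim();
  for (const { aliases, result } of Object.values(SYMPTOMS)) {
    if (aliases.some((a) => text.includes(a))) return result;
  }
  return null; // the no-result state the widget renders
}
```

Note that `diagnose("sounds like a gunshot")` returns `null` here — the exact failure from the opening paragraph. The word never appears in an alias list, so the substring match has nothing to find.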
That design is intentional, and it's the right design for what it does. Safety-critical output should not depend on how an LLM interprets a string on any given day. When the system flags a symptom as unsafe to repair — broken springs, frayed cables, anything involving the counterbalance system — that flag needs to come from a deterministic function. Not from a model that might characterize "my spring snapped" differently than "my spring broke" on two separate calls.
The problem isn't that the lookup is deterministic. The problem is that "gunshot" isn't in any alias list. The lookup never had a chance.
Three things, wired together
The fix came in three parts, and each part addresses a different moment in the user's experience.
Chips first. The default view of the widget is nine clickable chips — one per canonical symptom. Click any chip and you get a guaranteed match. No alias matching, no substring search, no possibility of a dead-end. The diagnose problem becomes a multiple-choice question, and the UX does most of the work before the user types anything at all.
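The mechanism is worth making concrete: a chip carries the canonical key itself, so there is nothing to match. The endpoint and field names below are assumptions for illustration.

```typescript
// Chips are generated from the same canonical symptom list the lookup
// uses — one chip per key, labels invented here for illustration.
const CHIPS = [
  { key: "wont_open", label: "Won't open" },
  { key: "grinding_noise", label: "Makes grinding noise" },
  { key: "reverses_before_closing", label: "Reverses before closing" },
  // ...six more in the real widget
];

// A chip click submits the canonical key directly: no free text,
// no alias matching, no possibility of a dead end.
function chipRequestBody(key: string): string {
  return JSON.stringify({ symptomKey: key });
}
```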
LLM fallback on free text. Below the chips, there's an escape hatch: "Don't see it? Describe in your own words." That opens a textarea. Submissions through the textarea add a useLlmFallback: true flag to the API call. When the substring match returns nothing — which is exactly when "gunshot" fails — a Haiku classification pass runs. It maps the free text to the nearest canonical symptom key. "Sounds like a gunshot" → banging_noise. Then diagnose runs again with that canonical key and returns a structured result.
The LLM is in the loop for exactly one job: bridging the gap between how a homeowner describes something and what the lookup function expects. Once it does that translation, it steps back. The downstream output is identical to what a chip click would have returned. Deterministic. Consistent. Cost: roughly $0.00005 per fallback call, and only when the substring match misses, and only when the user opts into text mode.
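The fallback path can be sketched as follows. `classifyWithHaiku` here is a stand-in stub for the real model call (which would put the canonical keys in the prompt and make one cheap classification request); the control flow around it — deterministic match first, fallback only on miss and only with the flag, strict validation of the model's output — is the point.

```typescript
const CANONICAL_KEYS = [
  "wont_open", "wont_close", "grinding_noise", "banging_noise",
  "reverses_before_closing",
  // ...four more in the real widget
] as const;

// Stand-in for the Haiku classification pass. In production this is a
// single model call that maps free text to the nearest canonical key.
async function classifyWithHaiku(text: string): Promise<string | null> {
  if (/gunshot|bang|pop|snap/.test(text.toLowerCase())) return "banging_noise";
  return null;
}

async function diagnoseWithFallback(
  input: string,
  useLlmFallback: boolean,
  lookup: (key: string) => object | null,
  substringMatch: (text: string) => string | null,
): Promise<object | null> {
  const direct = substringMatch(input);
  if (direct) return lookup(direct);           // deterministic path: most calls end here
  if (!useLlmFallback) return null;            // chip clicks never reach this line
  const key = await classifyWithHaiku(input);  // the LLM's one job: translation
  // Validate the model's output against the canonical list before trusting it.
  if (!key || !(CANONICAL_KEYS as readonly string[]).includes(key)) return null;
  return lookup(key); // same structured result a chip click would have returned
}
```

The validation line matters: even the translation step is gated, so a model that returns something outside the canonical list produces a clean miss rather than a fabricated diagnosis.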
ZIP lookup after every match. Once the widget produces a diagnosis, it appends a ZIP-code field. Submit a ZIP and the routeByZip tool returns the local partner's name, phone number, and booking link. Every successful diagnosis becomes a routing opportunity. For a widget sitting on a contractor's website, that is the whole point of having the widget at all.
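A sketch of the routing step, assuming a ZIP-prefix partner table — the partner data, prefix scheme, and link format are all invented here; what the source specifies is the interface: a symptom key plus a ZIP in, a partner's name, phone, and booking link out.

```typescript
type Partner = { name: string; phone: string; bookingUrl: string };

// Hypothetical partner table keyed by 3-digit ZIP prefix.
const PARTNERS_BY_ZIP_PREFIX: Record<string, Partner> = {
  "973": {
    name: "Acme Garage Door",
    phone: "555-0142",
    bookingUrl: "https://example.com/book",
  },
};

// routeByZip takes the confirmed symptom key plus a ZIP and returns the
// local partner with a booking link prefilled from the diagnosis.
function routeByZip(
  symptomKey: string,
  zip: string,
): (Partner & { bookingLink: string }) | null {
  const partner = PARTNERS_BY_ZIP_PREFIX[zip.slice(0, 3)];
  if (!partner) return null;
  return {
    ...partner,
    bookingLink: `${partner.bookingUrl}?symptom=${encodeURIComponent(symptomKey)}&zip=${zip}`,
  };
}
```

Prefilling the symptom into the booking link is the closing of the loop: the partner receives not just a lead but the diagnosis that generated it.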
Why none of the three alone would have been enough
This is the part I want to be precise about, because the temptation when building AI-assisted tools is to reach for the LLM first and let it handle everything. That instinct is wrong more often than it's right.
Just the lookup: fast, cheap, deterministic, and completely blind to colloquial language. "Gunshot" doesn't match anything. "It squeaks when it rains" doesn't match anything. A real homeowner typing a real description into a text field will miss constantly.
Just the LLM: handles colloquial language fine, but the latency is wrong for a chip click, the cost is wrong at scale, and non-determinism in a safety-critical classification is a problem you can't test your way out of. A model that's confident "spring snapped" maps to broken_spring on 99.9% of calls is still a model that might not on the call where it matters.
Just routing: you have nothing to route about. routeByZip takes a symptom key as an input. Without a diagnosis, it's just a phone number lookup.
The widget works because each piece does the job the others couldn't do. The chips eliminate most of the matching problem before it starts. The lookup handles confirmed symptom keys deterministically. The LLM handles the translation step — the moment where a human describes a mechanical failure in their own words — and nothing else. Routing closes the loop.
In "One Registry, Seven Surfaces," I described how these tools all live in a single registry and fan out to every surface automatically. The widget is one more consumer of that registry. It calls diagnose and routeByZip the same way the chat interface does, the same way the Custom GPT does, the same way any agent calling /api/v1/diagnose over REST does.
What's new here isn't the tools. It's the shape of the UX around them — the decision about which kind of logic to put where.
At 10pm, the widget shrugged at "gunshot." The homeowner typing that word doesn't have a better one — that is exactly what a snapped torsion spring sounds like. The fix was narrow: one LLM call, one translation step, and the lookup does the rest. The widget knows what a gunshot sounds like now. That's all it needed to know.
Seth Shoultes builds things at garagedoorscience.com and writes about them occasionally.