The plugin was 3,600 lines. It had been written autonomously, committed to main, and was sitting in the repository when someone finally ran it. There were 121 calls to throw new Response(). Fourteen references to a global called rc.user. Several uses of process.env for reading configuration variables. None of these exist in Emdash's plugin sandbox. The plugin had been written for a runtime that did not exist.
Three sibling plugins had the same problem at smaller scale.
The agent that wrote them was not confused. It was not guessing. It was doing exactly what it was trained to do: producing code that looked like the code in its training data. And the training data — billions of tokens of JavaScript, PHP, Python, Cloudflare Workers, Express, Node — contains many environments where throw new Response() is correct, where process.env is how you read configuration, where a global user object named something like rc.user is available. The agent assembled an Emdash plugin out of conceptual fragments from all those other systems. The result was plausible. It was wrong. And the wrongness was invisible until someone tried to run it.
What hallucinating an API actually means
The phrase is worth being precise about, because it is easy to mistake for something else. When we say an agent hallucinated an API, we do not mean it introduced a bug — a logic error, a typo, a variable name that got swapped. We do not mean it misunderstood the requirements. We mean something more specific and more strange: the agent wrote code against an interface that does not exist, using a method name or a global or an event shape that belongs to some other system entirely, and the code is internally coherent. It would work. In the environment the agent implicitly assumed, it would work.
The wrongness is not in the code's structure. The wrongness is in the gap between the system the agent was imagining and the system the code will actually run in.
This is different from a hallucinated fact in a prose essay, where you can sometimes feel the seam — the oddly specific claim, the citation that doesn't quite ring true. Code is harder. Code is evaluated by a compiler or a runtime, not by a reader's intuitions, and the compiler does not know what environment the agent was imagining. The compiler just tries to run it. When it fails, the error message points at a missing method or an undefined variable, which looks exactly like any other missing method or undefined variable. The failure mode is invisible until execution.
Three cases, same week, same mechanism
The Emdash EventDash plugin was the first. The autonomous pipeline that wrote it — the office described in The Office Held a Vote — had produced competent code in dozens of other contexts. The build-gate counted thirteen source files and opened. The QA pass ran. Nobody grepped for throw new Response(), because nobody had thought to add it to the banned-patterns list, because nobody had yet encountered the failure.
The second case was smaller in scale but sharper in its illustration. The /build/ page needed to send a starter prompt to an ElevenLabs Conversational AI agent — pass a string from the browser into the agent's input stream so the avatar could begin a walkthrough without the user repeating themselves. The Claude Code agent writing the page reached for session.message(prompt), a method on the HeyGen LiveAvatar SDK's session object. The method existed. The autocomplete offered it. message on a conversational session object — the inference that it routed a string to the agent as a user turn was not unreasonable.
It sent AVATAR_SPEAK_RESPONSE. That command puts words in the avatar's mouth. The SDK exposes three commands: AVATAR_SPEAK_TEXT, AVATAR_SPEAK_RESPONSE, AVATAR_SPEAK_AUDIO. All three describe what the avatar does with output. There is no input command. The agent's ear is the microphone, managed by ElevenLabs on their servers, inaccessible to client code. The SDK is a mouth. The Claude Code agent had inferred it was also an ear, because a method named message on a session object usually means something like that.
The third case was the most recursive. The ElevenLabs conversational agent — the AI persona running in the session, not the developer agent writing the page — had been instructed via system prompt to call a show_blog_post tool with the slug of any post it cited. It did call the tool. With the slug register-elevenlabs-client-tool. There is no post on the site with that slug. There is a skill in the public skills repository with that name. The conversational agent had constructed a slug from the topic it was discussing, in the format that slugs take, and the slug happened to correspond to something that did not exist as a post. The call was structurally correct. The value was wrong.
All three failures, stacked: the developer agent imagined an ear onto an SDK surface that was only a mouth. The developer agent also read the tool-call event at data.tool_name rather than data.client_tool_call.tool_name¹ — the nesting the ElevenLabs WebSocket protocol actually uses. The conversational agent confabulated a post slug. Three layers of the same mechanism. Pattern-match from training distribution. Deploy against actual system. Discover the gap at runtime.
Why the model produces this
The key thing to understand about hallucinated APIs is that they are not a failure of the model's reasoning. They are the model's reasoning. This is what makes the failure mode hard to route around by asking for a smarter model or a longer context window or a better prompt that says "be careful."
A language model learns to produce text by compressing the statistical structure of an enormous corpus into parameters. When asked to write code against an API, it draws on every similar API it has seen — which is a very large number of APIs, many of which are similar in shape to the one you are asking about. The model does not have access to the actual source unless you put it in the context window. It has access to its compressed representation of everything similar to it. The output is what code against an API of that type, in a project of that shape, in a developer's hands at that level of expertise, would typically look like.
When the actual API happens to match that distribution — when the SDK you are using is well-documented, widely adopted, and heavily represented in the training corpus — the model writes code that works. Not because the model has read the source. Because the source happens to match the model's expectation. The success is not evidence of correctness. It is evidence of alignment between the model's training distribution and the actual interface.
When the API does not match — when it is new, or obscure, or has a quirk that differs from convention — the model produces what the distribution suggests should be there. The plausible thing. The thing that would be there in ninety percent of similar systems. And it produces it with equal confidence, because the model has no internal signal that distinguishes "I am pattern-matching from similar systems" from "I have read the source." The confidence is identical. The autocomplete does not know whether the method exists.
The defense is structural
It is tempting to reach for model-level solutions here — a bigger model, a model with more recent training data, a model that "knows" the SDK better. This is the wrong frame. The failure mode is not caused by insufficient intelligence. It is caused by the absence of the actual source from the context the model was reasoning in. A better model pattern-matches more plausibly from a larger distribution, which in some cases means it pattern-matches more confidently to the wrong answer.
The defense that works is structural.
First: put the actual source in the context before writing. Not documentation, which may itself be incomplete or out of date. The source. The SDK's JavaScript, the plugin framework's PHP, the event object the WebSocket protocol actually emits. If the model is writing against @heygen/liveavatar-web-sdk@0.0.17, open LiveAvatarSession.js. Put it in the context window. The ten minutes this takes is less than the time the debug session costs.
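One way to make that step mechanical, sketched in Node below: resolve the installed package to whatever entry file it actually ships and print it, so the file can be pasted into the context rather than recalled from memory. The package name is the one mentioned above; the script name and the assumption that the entry file is the right file to read are illustrative.

```js
// dump-entry.js (hypothetical helper): print an installed package's entry
// file so it can be pasted into the agent's context before any code is
// written against it.
// Usage: node dump-entry.js @heygen/liveavatar-web-sdk
const fs = require("fs");

const pkg = process.argv[2];
if (!pkg) {
  console.error("usage: node dump-entry.js <package-name>");
  process.exit(1);
}

// require.resolve follows the package's own main/exports field, so this reads
// whatever file the installed version actually declares as its entry point.
const entry = require.resolve(pkg, { paths: [process.cwd()] });
console.error(`# ${entry}`);
process.stdout.write(fs.readFileSync(entry, "utf8"));
```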
Second: maintain a banned-patterns file. After the Emdash incident, the office added BANNED-PATTERNS.md and wired it into the QA pass — a grep that fails the build if any of the known-bad patterns appear. throw new Response() is on the list now. process.env is on the list. rc.user is on the list. The next agent that tries to write an Emdash plugin from Node fragments will fail the build before the code lands. The cost of the first failure was the discovery of the pattern. The point of the banned list is to prevent the second failure.
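A minimal version of that gate, sketched in Node. BANNED-PATTERNS.md is the file named above; the one-pattern-per-line format, the src/ search root, and the file extensions are assumptions made for the sketch, not the office's actual implementation.

```js
// check-banned-patterns.js — fail the build if any known-bad pattern appears.
const fs = require("fs");
const path = require("path");

// Assumed format: one literal pattern per line, # lines treated as comments.
const patterns = fs
  .readFileSync("BANNED-PATTERNS.md", "utf8")
  .split("\n")
  .map((line) => line.trim())
  .filter((line) => line && !line.startsWith("#"));

const offenders = [];

function walk(dir) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) walk(full);
    else if (/\.(js|ts|php)$/.test(entry.name)) {
      const text = fs.readFileSync(full, "utf8");
      for (const p of patterns) {
        if (text.includes(p)) offenders.push(`${full}: ${p}`);
      }
    }
  }
}

walk("src");

if (offenders.length) {
  console.error("Banned patterns found:\n" + offenders.join("\n"));
  process.exit(1); // fail the build before the code lands
}
```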
Third: treat verification as a mandatory step, not a courtesy. The skills pattern — procedures written down in a form the agent can read before it writes — is one implementation of this. The skill file for register-elevenlabs-client-tool encodes the event shape, including the data.client_tool_call nesting that the obvious handler misses. An agent that reads the skill before writing the handler does not hallucinate the event shape, because the actual shape is right there, named, with a note about why the naive shape fails. The procedure is the source of truth. The agent executes the procedure.
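Reduced to a sketch, the handler shape the skill encodes looks something like the following. The client_tool_call nesting and the tool_name field are the shapes documented in the footnote; the dispatch and the showBlogPost helper are illustrative, not the page's actual code.

```js
// Sketch of the defensive handler the skill file encodes.
function handleToolCallEvent(data) {
  // The tool call may arrive nested under client_tool_call (the shape the
  // protocol actually emits) or, in the naive assumption, directly on the
  // event data. Resolve from whichever shape is present.
  const call = data.client_tool_call || data;
  if (!call || !call.tool_name) return; // not a tool-call event

  if (call.tool_name === "show_blog_post") {
    // Slug extraction and rendering are page-specific and omitted here.
    showBlogPost(call);
  }
}
```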
None of these are AI problems. They are workflow problems. The mandate to read source before writing, the banned-patterns grep in the build gate, the skill file that encodes the discovered gotcha — these are the same solutions a senior developer applies to a new team member who is talented and unfamiliar with the codebase. You do not tell them to be smarter. You give them the source. You tell them what breaks. You put the check in the process so the knowledge does not depend on who is in the room.
The recursive problem
There is something worth sitting with in the structure of this failure mode, which is that the agent's confidence is not informative. A model that has read the source and a model that has not read the source produce code with equal apparent confidence. The code looks the same. The method names are plausible in both cases. The event shapes are plausible in both cases. There is no syntactic marker for "I made this up from similar-looking systems."
This is different from most failure modes in software, where the failure produces a visible signal — an error, an exception, a test that fails. The hallucinated API produces code that is syntactically valid, that passes linting, that may even pass a superficial code review by someone who is also reasoning from the training distribution rather than the source. The failure is silent until execution. And on autonomous pipelines that run while no one is at the desk, execution may be the first moment a human is looking at the output at all.
The deeper recursion is this: the agent that wrote the Emdash plugins was trained on code written by developers. Some of those developers were themselves pattern-matching from similar APIs, writing code they were fairly sure would work without checking the source. Some of that code made it into production, got committed, got indexed. The model learned to produce plausible code partly by compressing a corpus that already contained plausible-but-wrong code. The plausibility of the hallucination is downstream of the plausibility of its training data. The autocomplete reflects the habits of everyone who ever wrote against an unfamiliar API without opening the source.
The defense — read the source, ban the patterns, verify before shipping — is not new advice. It is what careful developers have always done. What is new is the speed and scale at which the failure can propagate: 3,600 lines committed before anyone ran a function, on a pipeline that will happily produce four more plugins tonight if the cron fires and the build-gate doesn't catch it. The workflow defense has to operate at the same speed as the generation. The banned-patterns grep runs in the pipeline. The skill file is in the context before the first token. Verification is not a step you do later. It is a step you do before.
The full record of the first discovery is at brain/learnings/agents-hallucinate-apis.md. The event-shape discovery from the same debug session — the data.client_tool_call nesting — is documented alongside it. The banned-patterns list that came out of the Emdash incident is in the pipeline. The skill file for register-elevenlabs-client-tool has been updated to reflect the actual handler shape.
The next agent that writes an Emdash plugin will read the source first. Not because it learned to. Because the prompt that invokes it now says so.
1 The ElevenLabs WebSocket protocol wraps the tool call one level deeper than the obvious shape: the event data object contains a client_tool_call key, and the tool name lives at data.client_tool_call.tool_name, not data.tool_name. The HeyGen passthrough preserves this structure without normalizing it. A handler reading data.tool_name will silently miss every tool call — the agent fires, the conversation log records it, the page does nothing. The fix is defensive: read from data.client_tool_call || data and resolve the tool name from whichever shape is present. This was the second discovery from the same debug session that surfaced the session.message problem.