Pen-and-ink illustration: an antique candlestick telephone in dense crosshatch, the speaking horn fully detailed and the listening cup conspicuously absent — the bracket is there, the cord descends, but the earpiece itself is simply not.

The /build/ page had a row of buttons. Not navigation. Starter buttons — each one a prompt, a topic, a signal of intent. Click "Add an avatar to my site," allow the microphone, and the avatar was supposed to appear and open the walkthrough without making you repeat yourself. The button encoded the question. The session would carry it. The avatar would begin.

Instead, the avatar appeared, said the greeting, and went silent. Not an error. Not a crash. Just — the avatar, waiting, as if it were your turn.

A different user every time, and every time the same pause. The kind of pause that teaches a visitor what the page actually does, as opposed to what it was supposed to do.


The code that looked correct

The sequence in the page's JavaScript read plausibly enough. After the session connected, the code called session.message(prompt), where prompt was the text from whichever starter button the user had clicked. session.message appeared in the SDK's autocomplete. It accepted a string. It existed. The inference — that it sent the string to the agent as a user turn — was not unreasonable. A method named message on a conversational session object: what else would it do.
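In sketch form, roughly what the page did, with helper names and button markup assumed rather than copied:

```js
// A paraphrase of the page's original wiring. Helper names and button markup are
// assumptions; only the session.message(prompt) call is taken from the page itself.
document.querySelectorAll(".starter-button").forEach((button) => {
  button.addEventListener("click", async () => {
    const prompt = button.dataset.prompt;           // the text encoded by the starter button
    const session = await connectAvatarSession();   // assumed helper: opens the LITE-mode session
    session.message(prompt);                        // reads like "send a user turn to the agent"
  });
});
```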

It made the avatar speak the text aloud.

session.message on @heygen/liveavatar-web-sdk@0.0.17 sends a command internally called AVATAR_SPEAK_RESPONSE. That command puts words in the avatar's mouth — not in the agent's input stream. The starter prompt, "Walk me through adding a HeyGen LiveAvatar to my site," was being read out in Seth's voice, into the void, as if the avatar had originated the thought. The agent received nothing. The user, having clicked a button to skip the preamble, heard the avatar say exactly what they had asked and then sit there as if waiting to be asked.


What the SDK source said

The diagnostic took ten minutes once the source was open. LiveAvatarSession.js, the message() method: one function, one command emitted. CommandEventsEnum.AVATAR_SPEAK_RESPONSE. The evidence was not ambiguous.
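Paraphrased, the method reduces to a single line. The emit plumbing below is an assumption; the command it carries is what the source shows:

```js
// A paraphrased sketch of the shape in LiveAvatarSession.js, not a verbatim excerpt.
// The emit mechanism is an assumption; the command name is not.
class LiveAvatarSession {
  message(text) {
    // One function, one command emitted: put `text` in the avatar's mouth.
    // Nothing routes the string to the agent as a user turn.
    this.emitCommand(CommandEventsEnum.AVATAR_SPEAK_RESPONSE, { text });
  }
}
```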

Then the enum itself. Three values:

AVATAR_SPEAK_TEXT
AVATAR_SPEAK_RESPONSE
AVATAR_SPEAK_AUDIO

All three are built around the same verb, SPEAK, describing what the avatar does with output. None of them mentions INPUT. None of them mentions USER. There is no USER_MESSAGE. There is no AGENT_INPUT. The three commands the client can issue are the three ways to put a specific string or audio buffer into the avatar's mouth. The ear — the channel through which the agent receives a user turn — is not in the enum.

In LITE mode, the user speaks through the microphone. The microphone feeds ElevenLabs's speech-to-text pipeline, which the agent is wired to. That is the only path. The client cannot inject text the agent will receive as a conversational turn. The SDK does not expose that surface. It was not overlooked in the page's code. It is not there.


The shape of the possible

A bug that presents as a wiring problem at the page — the prompt isn't reaching the agent — is, on inspection, a statement about what the SDK will allow. The three SPEAK commands and the absence of any USER or INPUT command are not a gap in the implementation. They are the implementation. The surface area of what was exposed to the client is the product.

This is the thing that autocomplete hides. session.message listed itself. It typed cleanly. It compiled. The shape of the method signature — a string in, nothing out — looked like the shape of "send a user message." But the SDK's job was never to simulate a user turn. The SDK's job is to control an avatar. The mouth. Not the ear.

Once that distinction is clear, the constraint becomes obvious: in LITE mode, the agent's ear is the microphone. The pipeline from mic to STT to agent is managed by ElevenLabs, on their servers, and the client has no programmatic access to it after the session opens. If you want to prime the agent with something the client knows before the conversation begins, you have one window: session start, before the ear opens.


The fix, three layers down

The fix did not live on the page. The page was wired correctly to the API it had. The API it had was the wrong one for the job.

The seam where text can cross from client space to agent space, in LITE mode, is the session token. The Cloudflare Worker that mints the token accepts a request from the browser, holds the API key server-side, and calls the LiveAvatar token endpoint with a configuration object. That configuration object can include elevenlabs_agent_config. Inside that, dynamic_variables.
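The seam, in miniature. Only the nesting of elevenlabs_agent_config and dynamic_variables below is taken from the fix; the surrounding configuration is elided:

```js
// The token-request configuration the Worker can send. Only the nesting of
// elevenlabs_agent_config and dynamic_variables is asserted here; everything else is elided.
const sessionConfig = {
  // ...avatar and session settings...
  elevenlabs_agent_config: {
    dynamic_variables: {
      starter_prompt: "Walk me through adding a HeyGen LiveAvatar to my site",
    },
  },
};
```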

Three changes, in order of where they live in the stack.

On the page: when a starter button was clicked, instead of staging the prompt for a post-connect session.message call, the code appended ?starter=<prompt> to the worker URL at token-fetch time. Before the session opened. Before the avatar appeared.
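Sketched, with WORKER_URL and connectAvatarSession standing in for the page's actual helpers:

```js
// Page-side change, sketched. WORKER_URL and connectAvatarSession are illustrative names.
document.querySelectorAll(".starter-button").forEach((button) => {
  button.addEventListener("click", async () => {
    const prompt = button.dataset.prompt;
    // The prompt travels with the token request, before the session opens
    // and before the avatar appears.
    const tokenUrl = `${WORKER_URL}?starter=${encodeURIComponent(prompt)}`;
    await connectAvatarSession(tokenUrl);
    // No post-connect session.message() call; there is nothing left to send.
  });
});
```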

In the Worker: read the starter query parameter, validate it (length cap at 800 characters, which bounds the prompt-injection blast radius), and forward it as dynamic_variables: { starter_prompt: starterPrompt } in the elevenlabs_agent_config block.
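A sketch of the Worker change. The token endpoint URL and the API-key header name are placeholders; the query-parameter read, the 800-character cap, and the dynamic_variables forwarding are the change itself:

```js
// Cloudflare Worker sketch. The endpoint constant and the x-api-key header name are
// placeholders; the query-parameter read, the length cap, and the dynamic_variables
// forwarding follow the change described above.
const LIVEAVATAR_TOKEN_ENDPOINT = "https://example.invalid/liveavatar/token"; // placeholder

export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const starterPrompt = (url.searchParams.get("starter") ?? "").slice(0, 800); // length cap

    const config = {
      // ...existing session configuration...
    };
    if (starterPrompt) {
      config.elevenlabs_agent_config = {
        dynamic_variables: { starter_prompt: starterPrompt },
      };
    }

    const upstream = await fetch(LIVEAVATAR_TOKEN_ENDPOINT, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": env.HEYGEN_API_KEY, // API key stays server-side
      },
      body: JSON.stringify(config),
    });

    return new Response(upstream.body, {
      status: upstream.status,
      headers: { "content-type": "application/json" },
    });
  },
};
```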

In the agent's system prompt: a paragraph conditioned on the starter_prompt dynamic variable. When the dynamic variable is present and non-empty, the agent opens the conversation by acknowledging the selection and beginning the walkthrough immediately. When it is absent, the agent greets normally and waits.
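Illustratively, using the double-curly placeholder syntax ElevenLabs dynamic variables use inside a system prompt; the deployed wording is not reproduced here:

```
If {{starter_prompt}} is present and non-empty: the visitor arrived by clicking a
starter button. Acknowledge the selection in one sentence, then begin the walkthrough
for {{starter_prompt}} immediately. If it is empty or absent: greet normally and wait.
```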

The starter prompt now reaches the agent before the conversation begins, as configuration, not as a message. The agent acts on it the way it acts on any instruction in its system prompt — because that is exactly where it lives.


The hidden contract

"The avatar reads first" made the argument that the brain is the agent, the face is the face, and the seam between them is just plumbing. True. But plumbing has a direction. The pipes the SDK lays between client and agent — in LITE mode, through a HeyGen session — run one way. Outward. From agent to avatar. The commands expose what the avatar can be told to do.

The corpus on this site has skills that make procedures installable — the add-avatar-to-site skill encodes the right sequence, the gotchas, the session-token pattern. But a skill encodes what the procedure has learned. What this fix revealed sits a step upstream: the SDK itself is a contract, and the contract specifies what the procedures are allowed to do. Before a skill can encode the workaround, someone has to discover that the direct path is not a path.

That discovery happens when you open the source. Not when you read the docs — the docs describe what the methods do, not what methods are absent. Not when you run the code — session.message succeeds silently. The enum names are where the contract is legible. AVATAR_SPEAK_TEXT. AVATAR_SPEAK_RESPONSE. AVATAR_SPEAK_AUDIO. Three commands, one verb, all describing the avatar's output. The input side is not a locked door. It is a wall. The wall does not announce itself.


What sits in the autocomplete

The constraint was already there the day the SDK shipped. Every developer who installed @heygen/liveavatar-web-sdk@0.0.17 and typed session. got the same autocomplete. message appeared. The inference was available to anyone who did not read the source.

The fix is real and it works. The starter prompts reach the agent. The walkthrough begins without the user repeating themselves. But the fix does not change the SDK. The SDK is what it is. The three SPEAK commands are still the three SPEAK commands. The next build that wants to inject a user turn after session start will hit the same wall.

The lesson the brain vault holds now, filed under learnings dated today: when debugging a LITE-mode avatar integration that fails to pass a message to the agent, check the SDK's command enum before you check the wiring. If the API is named message or say or speak, it almost certainly addresses the avatar's output. The fix lives at session-start, in dynamic variables, before the ear opens — because in LITE mode, the ear is the microphone, and the client does not own it.

The full record, with the source excerpt and the three-layer fix, is at brain/learnings/liteavatar-sdk-no-client-user-message.md.