In the ElevenLabs agent configuration there is a field named first_message. It accepts a string. The string is what the avatar will say at the start of every session. It appears in the dashboard alongside the system prompt and the voice settings. It looks like the rest of the configuration. It is not the rest of the configuration.
The system prompt is addressed to the model. The model reads it the way a person reads instructions. It shapes what the model will say and how it will say it and what it knows about its situation. The first_message field is addressed to the platform. The platform reads it, renders it, and delivers it as audio before the model is asked anything at all. Before the first turn. The model had nothing to do with it. When the user hears the greeting, they have heard the platform speak. They believe they have heard the agent.
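The distinction is easier to see laid out as configuration. A minimal sketch, with field names as assumptions rather than the dashboard's exact schema; the point is who each field addresses, not what it is called:

```typescript
// Illustrative sketch of the agent configuration's shape — not the real schema.
const agentConfig = {
  // Addressed to the model: the model reads this, and it shapes every reply.
  system_prompt: "You are a build assistant. Walk the user through setup.",

  // Addressed to the platform: rendered straight to audio at session start,
  // before the model is invoked at all. The model never receives it as an
  // instruction; it only appears in the transcript as something already said.
  first_message: "Hello, how may I help you?",

  // Platform-side rendering detail, in the same category as first_message.
  voice: { voice_id: "..." },
};
```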
The /build/ page had starter buttons. Click one and the avatar was supposed to open on the selected topic, the walkthrough already underway, the user not required to repeat the choice they had just made with the click. The Mouth Was Not an Ear traced the first failure at this seam: session.message(prompt) on @heygen/liveavatar-web-sdk@0.0.17 sends AVATAR_SPEAK_RESPONSE, which makes the avatar speak the text aloud as its own words. The SDK had three commands for what the avatar said and none for what the agent heard. The fix was in dynamic variables, passed at token-mint time, through the Cloudflare Worker, into the system prompt as a template placeholder.
The fix worked. The variable reached the agent. The system prompt contained a paragraph addressed to the model: when the dynamic variable is set, open the conversation with the topic and begin the walkthrough. The directive was correct. The model received it. The model was ready to act on it.
The user clicked the starter button. The avatar greeted normally and waited.
The system prompt instruction was reaching the model. The model never got to speak first. By the time the model was asked for a response, the platform had already delivered its string and the conversation was underway. The model's first turn was a response to the user's follow-up, not an opening. The directive had been received and could not be executed because the moment it described had already passed.
This is the temporal logic of first_message. The platform does not consult the model before delivering it. The platform delivers it because it is configured to deliver it. The sequence is: configuration is read, platform speaks, session is live, LLM is invoked. The LLM enters a conversation that has already begun. Whatever the system prompt says about how to open, the opening has already happened.
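The sequence in the paragraph above can be sketched directly. The event names here are invented for illustration, not real SDK events; what matters is the ordering, which puts the platform's speech strictly before the model's first invocation:

```typescript
// Sketch of the session's opening sequence. Hypothetical event names; the
// point is that first_message is delivered with no LLM call involved.
type SessionEvent =
  | { kind: "config_read"; firstMessage: string }
  | { kind: "platform_speaks"; text: string } // TTS of first_message — model not consulted
  | { kind: "session_live" }
  | { kind: "llm_invoked"; userTurn: string }; // model's first turn is already a reply

function openingSequence(firstMessage: string, userTurn: string): SessionEvent[] {
  return [
    { kind: "config_read", firstMessage },
    { kind: "platform_speaks", text: firstMessage }, // spoken verbatim, as configured
    { kind: "session_live" },
    // Whatever the system prompt says about how to open, by this point
    // the opening has already happened.
    { kind: "llm_invoked", userTurn },
  ];
}
```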
Most of the time this does not matter. The string is Hello, how may I help you? or some variant, and the model's first response to the user's actual question is what carries the weight of the conversation. The platform's opening remark is pleasantry. The seam is invisible because it is inert.
The seam became visible the moment there was a reason for the model to speak first differently. And the seam could not be papered over by addressing the model better. The model was not the problem. The model was not the one speaking.
The fix was in the field itself. Set first_message to a template placeholder for starter_greeting and let the upstream worker compute the variable's value on a per-session basis. When no starter button is clicked, the worker sets starter_greeting to the default greeting. When a starter is active, the worker sets it to the opening the context requires. The platform still speaks first. The platform's speech is now under authorial control. The seam remains. It is no longer invisible.
This is not a large fix. It is a realization about what the field is. The field is not a convenience for pre-populating the opening turn. The field is the one moment in a session when the operator can put words in the conversation before the model touches it. It is the platform's voice, and the platform will speak regardless of what the model has been told. The operator can shape that voice or ignore it. Ignoring it does not make it go away.
Consider what else a platform broker puts between user and model. The welcome message is the most visible, but it is not the only one. There are apology-on-error messages — when the STT pipeline drops, when the connection times out, when the model takes too long and the platform injects a placeholder. There are transfer-out notices, delivered when the session is routed to a different handler. There are end-of-conversation summaries, confirmation messages, topic-change acknowledgments. In some platforms there are safety-filter messages — strings delivered by the platform when it has decided the model should not respond at all, or should respond only in a sanitized form.
None of these are the model. The model did not write them. The model was not consulted. They are the product's voice, authored once, delivered at scale, filling every gap the session produces. They arrive in the user's ears wearing the avatar's face. There is nothing in the sound that names the gap they came from.
A platform that brokers between two minds — the user's and the model's — is itself a third voice in the room. It speaks at start. It speaks at error. It speaks at end. It speaks whenever the session transitions from one state to another that the platform recognizes as worth narrating. The model speaks in the middle. Everything else is the platform.
The builder who treats first_message as a string field — a placeholder, a default greeting, something to fill once and not revisit — has made an assumption about who is speaking. The assumption is that the avatar speaks first, in the voice the operator designed, following the persona the system prompt defines. The assumption is wrong in a specific and consequential way: the avatar does speak first, but the words are not the model's and the model cannot revise them. They are already in the air. The conversation is already begun.
The bug discovered on /build/ was caused by this assumption. The directive in the system prompt was written to an agent that the author imagined spoke first. The agent never spoke first. The platform spoke first. The agent's first words were always a response to whatever the user said after the platform's greeting. The system prompt instruction was grammatically coherent and entirely inapplicable — it described a moment that the platform had preempted for every session since the agent was first deployed.
This is the shape of the invisible seam. It does not produce an error. It produces a conversation that works differently than you designed, for reasons you cannot find by reading the system prompt, because the cause is not in the system prompt. The cause is in the field you filled once and stopped thinking about.
The wider observation is this. Every product that puts a model inside a platform has two layers of authorship in the session, and usually only one of them is called the agent. The model is the agent. The platform is the substrate. The substrate has a voice and the voice speaks first and the voice speaks at every structural moment the platform recognizes. The model speaks in between. The operator authors one of these. The platform vendor authored the other. The two voices share a speaker.
When you are debugging a session that behaves differently than the system prompt predicts, the first question is not what the model misunderstood. The first question is whether the model was the one speaking. The platform may have already said the thing you thought the model was saying wrong. The model may have been doing exactly what you asked. Someone else had already gone first.
The mediator was always speaking. It spoke before anyone was listening for it.
The full record of the three-layer fix — page, worker, and system prompt — is at brain/learnings/liteavatar-sdk-no-client-user-message.md.