Architecture
The system is a streaming three-stage pipeline (speech-to-text, LLM, text-to-speech) wrapped in a LiveKit room, so the avatar, the audio, and the network transport share one synchronized timeline. The Anam 3D avatar consumes the TTS stream and lip-syncs in real time.
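To make "streaming" concrete, here is a minimal sketch of three chained stages in which each stage starts emitting before the one upstream has finished. The stage bodies are placeholder stubs and every function name is hypothetical; this is not the LiveKit or Anam API, only an illustration of the data flow.

```python
from typing import Iterable, Iterator


def speech_to_text(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stub STT (assumption): each audio chunk decodes to one transcript word."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def llm(words: Iterable[str]) -> Iterator[str]:
    """Stub LLM (assumption): emits one reply token per transcript word."""
    for word in words:
        yield word.upper()


def text_to_speech(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stub TTS (assumption): wraps each token in a fake audio frame."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def run_pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Chain the stages lazily so output frames flow as input chunks arrive."""
    return text_to_speech(llm(speech_to_text(audio_chunks)))
```

Because the stages are generators, pulling the first audio frame from run_pipeline consumes only the first input chunk. A real deployment would most likely use async streams and per-stage buffering rather than plain generators, but the shape is the same.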
The hard parts were not the individual stages but the seams between them: barge-in detection, end-of-utterance silence thresholds, and keeping the LLM grounded in the event's speaker context so the avatar didn't go off-topic when an executive free-associated.
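Two of those seams, barge-in detection and the end-of-utterance silence threshold, can be sketched as a small state machine fed by per-frame voice-activity flags. The class name, the event strings, and the 200 ms / 700 ms / 20 ms defaults below are illustrative assumptions, not the values or code used in the project.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnDetector:
    """Hypothetical turn detector driven by one VAD flag per audio frame."""
    end_of_utterance_ms: int = 700  # assumed: silence needed to close a turn
    barge_in_ms: int = 200          # assumed: contiguous speech needed to open one
    frame_ms: int = 20              # assumed frame duration
    _speech_ms: int = field(default=0, init=False)
    _silence_ms: int = field(default=0, init=False)
    _in_utterance: bool = field(default=False, init=False)

    def push(self, is_speech: bool, avatar_speaking: bool) -> Optional[str]:
        """Feed one frame; return "barge_in", "end_of_utterance", or None."""
        if is_speech:
            self._speech_ms += self.frame_ms
            self._silence_ms = 0
            if not self._in_utterance and self._speech_ms >= self.barge_in_ms:
                self._in_utterance = True
                # Only an interruption if the avatar is talking right now.
                if avatar_speaking:
                    return "barge_in"
            return None

        # Silence frame.
        self._silence_ms += self.frame_ms
        if not self._in_utterance:
            # Speech was too short to count (cough, click); discard it.
            self._speech_ms = 0
            return None
        if self._silence_ms >= self.end_of_utterance_ms:
            self._in_utterance = False
            self._speech_ms = 0
            self._silence_ms = 0
            return "end_of_utterance"
        return None
```

On "barge_in" the caller would stop TTS playback and cancel the in-flight LLM response; on "end_of_utterance" it would hand the transcript to the LLM. The trade-off sits in the two thresholds: a short silence threshold cuts speakers off mid-pause, while a long one makes the avatar feel slow to respond.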