October 22, 2025

Building Living Characters With Decart and ElevenLabs

Decart syncs the lips. ElevenLabs gives them soul. Together, they turn digital avatars into living characters.

You know the exact moment your AI demo dies: 300ms into the conversation, when the avatar's mouth keeps moving two seconds after the audio stops. Or worse, when it stares blankly while speaking. Your investor shifts uncomfortably. Your user hits refresh. Hours of vibe-coding reduced to a creepy deepfake meme.

Humans detect lip-sync misalignment in as little as 40 milliseconds. Miss that window and user delight fades fast. Here's how we solved both the uncanny valley and the latency problem with Decart and ElevenLabs, along with a batteries-included open-source app you can point Cursor at to merge with your own demo code.

The core ingredients

Decart's Lipsync service focuses on visual credibility. Feed it a looping portrait, real-time audio, and desired frame cadence, and it returns frames aligned to every phoneme. The service quietly manages latency buffering, frame interpolation, and graceful interruption so your character doesn't "pop" between syllables.
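To build intuition for the bookkeeping the service handles, here is a minimal sketch of how incoming audio maps onto frame slots at a fixed cadence. The sample rate and FPS are illustrative assumptions, not Decart's documented defaults:

```python
# Illustrative: map a stream of PCM audio samples onto video frame slots.
SAMPLE_RATE = 16_000  # assumed mono PCM rate (illustrative)
FPS = 25              # desired frame cadence (illustrative)
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS  # 640 audio samples back each frame

def frames_for(audio_samples: int) -> int:
    """Number of full video frames covered by this many audio samples."""
    return audio_samples // SAMPLES_PER_FRAME
```

One second of audio fills exactly one second of video (25 frames here), which is why the service can start emitting frames as soon as the first few chunks of audio arrive.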

ElevenLabs brings your character to life with the most expressive text-to-speech model at the lowest latency, across 29+ languages. Now that will get your users talking!

Pairing the two yields more than synchronized lips. You gain a programmable actor who can swap moods mid-conversation, whisper secrets, or project stage presence on demand, all while maintaining consistent video fidelity.

Sidekick as a reference build

Sidekick's sidekick.py is intentionally transparent: the Pipecat pipeline ingests microphone audio, sends transcripts through Groq's Llama 3.3, hands the reply to ElevenLabs, and leaves the last mile to Decart for lip-sync and frame delivery. The WebRTC wrapper proves the end-to-end viability of this stack on consumer hardware. Importantly, nothing in the code is hard-wired to Cleopatra; swapping voices, video clips, or even the conversation partner model takes a few YAML edits.
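A framework-agnostic sketch of that data flow follows. The real app uses Pipecat's pipeline abstractions; every stage below is a named stand-in, not the actual Groq, ElevenLabs, or Decart client:

```python
# Sketch of the Sidekick data flow: mic -> transcript -> LLM -> TTS.
# Each stage is a hypothetical stand-in for the real service client.
from typing import Callable

def make_pipeline(*stages: Callable[[str], str]) -> Callable[[str], str]:
    """Chain stages so each one's output feeds the next."""
    def run(payload: str) -> str:
        for stage in stages:
            payload = stage(payload)
        return payload
    return run

transcribe = lambda audio: f"transcript({audio})"  # stand-in for STT
llm_reply  = lambda text: f"reply({text})"         # stand-in for Llama 3.3
synthesize = lambda text: f"speech({text})"        # stand-in for ElevenLabs

pipeline = make_pipeline(transcribe, llm_reply, synthesize)
```

The point of the chain shape is exactly the swap-friendliness the reference build demonstrates: replacing any one stage leaves the rest of the pipeline untouched.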

This reference implementation surfaces a few learnings worth carrying into your own products:

  • Latency layering matters. ElevenLabs already streams partial phonemes; Decart accepts incremental audio frames, so you can begin video synthesis before the sentence ends.
  • Interruptions must feel polite. When voice activity detection fires, Sidekick sends Decart an interrupt_audio signal to flush pending phonemes. The user sees the character stop talking rather than finish an irrelevant thought.
  • Context is everything. Pipecat’s context aggregator keeps system prompts, user history, and assistant replies in a single buffer. That means the emotional direction you set through ElevenLabs also informs future gestures and pacing.
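The "polite interruption" behavior above can be modeled in a few lines. This is a toy buffer, not Decart's client; the class and method names are illustrative, standing in for the interrupt_audio signal Sidekick sends:

```python
from collections import deque

class LipSyncBuffer:
    """Toy model of the interrupt flow: queued phonemes are flushed
    the moment voice activity detection fires."""

    def __init__(self) -> None:
        self.pending: deque[str] = deque()
        self.interrupts_sent = 0

    def enqueue(self, phoneme: str) -> None:
        self.pending.append(phoneme)

    def on_user_speech(self) -> None:
        """VAD fired: drop unrendered phonemes and signal the renderer."""
        self.pending.clear()
        self.interrupts_sent += 1
```

Flushing rather than draining is the key design choice: the character stops mid-thought instead of finishing a reply the user has already talked over.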

Where this combo really shines

Adaptive storytelling

Imagine an AR history tour where visitors pick which revolution to explore. ElevenLabs can read local captions in the visitor's language while Decart renders a life-sized historian who reacts to follow-up questions. The content engine simply streams new scripts; the avatar doesn't need re-rendering for each locale. Picture a Union colonel brought to life, naturally telling the story of a battle lost to time.

Healthcare coaching

Patients respond better when health advice comes from empathetic voices. ElevenLabs can train on clinician-approved samples; Decart ensures the on-screen coach mirrors that empathy visually. Because both APIs support real-time streaming, the coach can pause, nod, and resume after a patient responds, simulating bedside manner.

Multiplayer experiences

Game studios can attach unique voices and avatars to NPCs. With Decart rendering lip-sync dynamically, cutscene production becomes text-driven rather than animation-heavy. Designers can tweak story beats minutes before launch and still ship cinematic-quality performances.

Designing for multiplicity

A single pair of APIs now produces an entire cast. Consider how you can:

  • Branch personalities by keeping multiple ElevenLabs voice IDs and Decart video loops in a content registry. Your matchmaking logic simply binds the appropriate pair at session start.
  • Layer effects such as echo or room tone on top of ElevenLabs' output before handing it to Decart, enabling characters who sound like they are broadcasting from space or whispering backstage.
  • Experiment with multilingual flows. ElevenLabs offers instant language switching; Decart’s lipsync operates on audio, regardless of language, as long as your video reference maintains neutral mouth positions.
  • Respond to sentiment by measuring user tone and swapping to calmer or more enthusiastic delivery mid-session. ElevenLabs’ real-time parameter adjustments and Decart’s interruption API make these pivots seamless.
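The content-registry idea from the first bullet is small enough to sketch. The voice IDs and file paths below are placeholders, not real ElevenLabs IDs or Decart assets:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    voice_id: str    # ElevenLabs voice ID (placeholder values below)
    video_loop: str  # path to the Decart reference clip (placeholder)

REGISTRY = {
    "historian": Persona("voice_historian", "historian_loop.mp4"),
    "coach":     Persona("voice_coach", "coach_loop.mp4"),
}

def bind(session_role: str) -> Persona:
    """Matchmaking logic: bind the appropriate pair at session start."""
    return REGISTRY[session_role]
```

Keeping personas as immutable data rather than code means branching a new character is a registry entry, not a deploy.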

Operational considerations

Combining streaming APIs poses practical questions:

  • Bandwidth budgeting: Decart returns JPEG or RGB24 frames at configurable FPS. For cellular deployments, consider downsampling to 360p while preserving facial detail.
  • Resilience: If ElevenLabs experiences packet loss, replay buffered phonemes so Decart doesn't animate silence. Conversely, if Decart stalls, fall back to an audio-only experience instead of dropping the call.
  • Compliance: Store voice clones and reference footage with clear consent trails. Provide users an opt-out path, especially if you deploy the same voice across multiple properties.
  • Cost awareness: ElevenLabs charges per generated character and Decart per processed frame. Smart gating—like idling avatars when nobody is present—keeps budgets predictable.
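The resilience bullet suggests a simple policy: if the newest video frame is stale, keep the audio flowing rather than freezing the call. A minimal sketch, with an illustrative staleness threshold rather than a recommended value:

```python
def deliver(frame_age_s: float, max_staleness_s: float = 0.5) -> str:
    """Degrade gracefully: drop to audio-only when the latest
    lip-sync frame is older than the staleness budget."""
    return "video+audio" if frame_age_s <= max_staleness_s else "audio-only"
```

The inverse direction (ElevenLabs packet loss) is handled by replaying buffered phonemes, so Decart never animates silence.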

Beyond video calls

The same recipe powers:

  • Interactive billboards that lip-sync promotional hooks when sensors detect nearby shoppers.
  • Virtual production assistants who guide film crews through complex setups with hands-free instructions.
  • Language-learning companions that demonstrate mouth formations visually while speaking target phrases.
  • Premium voicemail services where celebrities (with permission) deliver personalized messages in both voice and visual form.

The art isn't merely in stitching APIs together; it's in crafting experiences where the medium amplifies the message. Start by experimenting with Sidekick. Swap in a new persona, refine the cadence, shorten the response time. Once you see the avatar spring to life, you'll realize the potential canvas extends far beyond a single character on a browser page.

We’re entering an era where adding a convincing face to conversational AI is no longer a cinematic endeavor. With Decart handling photorealistic lip-sync and ElevenLabs lending emotional voices, any developer can sculpt believable digital actors. Sidekick shows the wiring; the next breakout application is yours to imagine.

Ready to build your own living character?

Sidekick is ready for you to clone, fork, or otherwise take inspiration from in your endeavors to create natural, interactive character experiences. For a detailed technical walkthrough of the application, check out our cookbook article.
If you have any questions, ask in our Discord! Happy coding 🧑‍💻

❤️ The Decart Team