Sesame AI Voice Technology
Discover the advanced technology behind our voice AI solutions
Sesame AI's Conversational Speech Model (CSM)
To create Sesame AI companions that feel genuinely interactive, Sesame AI's speech generation must go beyond producing high-quality audio; it must understand and adapt to context in real time. Traditional text-to-speech (TTS) models generate spoken output directly from text but lack the contextual awareness needed for natural conversations. Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Sesame AI addresses this challenge by incorporating context, including tone, rhythm, and the history of the conversation, giving our models the information they need to choose the best rendition. Capturing these nuances requires reasoning across multiple aspects of language and prosody, which is a core strength of Sesame AI's technology.

Sesame AI's End-to-End Multimodal Learning
To address these challenges, Sesame AI introduces the Conversational Speech Model (CSM), which frames speech generation as an end-to-end multimodal learning task using transformers. Sesame AI's CSM leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from Sesame AI's work. The first is that CSM operates as a single-stage model, improving both efficiency and expressivity. The second is an evaluation suite that measures progress on contextual capabilities, which is needed because common public evaluations are already saturated.
Sesame AI's Technical Background
One approach to modeling audio with transformers is to convert continuous waveforms into discrete audio token sequences using tokenizers. Like most contemporary systems, Sesame AI relies on two types of audio tokens: (1) Semantic tokens: compact, speaker-invariant representations of semantic and phonetic features. Their compressed nature lets models capture key speech characteristics at the cost of high-fidelity detail. (2) Acoustic tokens: encodings of fine-grained acoustic details that enable high-fidelity audio reconstruction. These tokens are typically generated using Residual Vector Quantization (RVQ).
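To make the acoustic-token side concrete, here is a minimal sketch of residual vector quantization in Python. The codebook count, codebook size, and embedding dimension are illustrative assumptions rather than values used by Sesame AI's tokenizer; the point is simply that each codebook quantizes the residual left over by the previous one.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(frame, codebooks):
    """Quantize one frame embedding into one token index per codebook."""
    residual = frame.copy()
    indices = []
    for codebook in codebooks:                   # codebook shape: (codebook_size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))              # nearest codeword to the current residual
        indices.append(idx)
        residual = residual - codebook[idx]      # pass what is left to the next codebook
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the frame embedding by summing the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

dim, n_codebooks, codebook_size = 16, 8, 256     # illustrative sizes, not Sesame AI's
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_codebooks)]
frame = rng.normal(size=dim)

tokens = rvq_encode(frame, codebooks)
recon = rvq_decode(tokens, codebooks)
print("tokens per frame:", tokens)               # one discrete index per codebook
print("reconstruction error:", np.linalg.norm(frame - recon))
# With trained codebooks the error shrinks as codebooks are added; the random
# codebooks here only illustrate the mechanics of residual quantization.
```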
Sesame AI's CSM Architecture
Sesame AI's CSM is a multimodal text-and-speech model that operates directly on RVQ tokens. Inspired by the RQ-Transformer, Sesame AI uses two autoregressive transformers. Unlike other approaches, Sesame AI splits the transformers at the zeroth codebook. The first, a multimodal backbone, processes interleaved text and audio to model the zeroth codebook. The second, an audio decoder, uses a distinct linear head for each codebook and models the remaining N – 1 codebooks to reconstruct speech from the backbone's representations. The decoder is significantly smaller than the backbone, enabling low-latency generation while keeping the model end-to-end.
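The backbone/decoder split can be sketched structurally. The PyTorch snippet below is a toy illustration of that arrangement, not Sesame AI's implementation: the layer counts, hidden sizes, vocabulary sizes, and the use of nn.TransformerEncoder in place of Llama-style transformers are all assumptions made for brevity.

```python
import torch
import torch.nn as nn

N_CODEBOOKS = 8          # assumed number of RVQ codebooks
AUDIO_VOCAB = 1024       # assumed tokens per codebook
TEXT_VOCAB = 32000       # assumed text vocabulary size
D_BACKBONE, D_DECODER = 512, 256

class TinyCSMSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Backbone: models interleaved text + audio and predicts codebook 0.
        self.text_emb = nn.Embedding(TEXT_VOCAB, D_BACKBONE)
        self.audio_emb = nn.Embedding(AUDIO_VOCAB, D_BACKBONE)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_BACKBONE, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.codebook0_head = nn.Linear(D_BACKBONE, AUDIO_VOCAB)

        # Decoder: much smaller, predicts codebooks 1..N-1 from the backbone
        # representation, with one distinct linear head per codebook.
        self.proj = nn.Linear(D_BACKBONE, D_DECODER)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_DECODER, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.codebook_heads = nn.ModuleList(
            [nn.Linear(D_DECODER, AUDIO_VOCAB) for _ in range(N_CODEBOOKS - 1)]
        )

    def forward(self, text_tokens, audio_tokens_cb0):
        # Concatenate text and audio embeddings along the sequence axis
        # (a real model would follow the interleaved training-sample layout).
        seq = torch.cat(
            [self.text_emb(text_tokens), self.audio_emb(audio_tokens_cb0)], dim=1
        )
        h = self.backbone(seq)
        logits_cb0 = self.codebook0_head(h[:, -1])       # next zeroth-codebook token

        z = self.decoder(self.proj(h[:, -1:]))           # decode the remaining codebooks
        logits_rest = [head(z[:, -1]) for head in self.codebook_heads]
        return logits_cb0, logits_rest

model = TinyCSMSketch()
text = torch.randint(0, TEXT_VOCAB, (1, 12))
audio = torch.randint(0, AUDIO_VOCAB, (1, 25))           # roughly 2 s of frames at 12.5 Hz
cb0_logits, rest_logits = model(text, audio)
print(cb0_logits.shape, len(rest_logits))                # torch.Size([1, 1024]) 7
```

The intent of the split shows up in the sizes: most of the capacity sits in the backbone, while the small decoder and per-codebook heads keep the per-frame decoding cost, and therefore latency, low.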

Sesame AI's Implementation Details
Both transformers in Sesame AI's system are variants of the Llama architecture. Text tokens are generated with a Llama tokenizer, while audio is processed with Mimi, a split-RVQ tokenizer that produces one semantic codebook and N – 1 acoustic codebooks per frame at 12.5 Hz. Sesame AI's training samples are structured as alternating interleaved patterns of text and audio, with speaker identity encoded directly in the text representation. This approach allows the model to maintain speaker consistency while adapting to different conversational contexts.
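As a rough picture of that layout, the snippet below assembles one interleaved, two-speaker sample. The "[speaker N]" tag format and the segment structure are assumptions for illustration only; the real representation comes from the Llama tokenizer and Mimi frames described above.

```python
# Hypothetical sketch of an interleaved text/audio training sample.
FRAME_RATE_HZ = 12.5     # Mimi emits one frame of RVQ codes every 80 ms

def text_segment(speaker_id, text):
    """Speaker identity is carried directly in the text stream."""
    return {"modality": "text", "content": f"[speaker {speaker_id}] {text}"}

def audio_segment(num_seconds):
    """Placeholder for Mimi frames: one semantic + N-1 acoustic codes per frame."""
    num_frames = int(num_seconds * FRAME_RATE_HZ)
    return {"modality": "audio", "frames": num_frames}

# Alternating text and audio, turn by turn, for two speakers.
sample = [
    text_segment(0, "Hey, did you catch the game last night?"),
    audio_segment(2.4),
    text_segment(1, "I did, that final play was unbelievable."),
    audio_segment(3.0),
]

for segment in sample:
    print(segment)
```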
How Sesame AI Overcomes Traditional Limitations
A common strategy first models semantic tokens and then generates audio from them using RVQ or diffusion-based methods. Decoupling these steps allows for a more structured approach to speech synthesis: the semantic tokens provide a compact, speaker-invariant representation that captures high-level linguistic and prosodic information, while the second stage reconstructs the fine-grained acoustic details needed for high-fidelity speech. However, this approach has a critical limitation: the semantic tokens become a bottleneck that must fully capture prosody, and ensuring this during training is challenging. CSM's single-stage design, described above, avoids routing everything through that bottleneck.
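The bottleneck argument can be made concrete with a toy two-stage pipeline: whatever prosodic detail the first stage fails to encode is simply unavailable to the second. The function names and fields below are hypothetical stand-ins, not real components.

```python
def semantic_stage(utterance):
    """Keeps only what the semantic tokenizer encodes (here: just the words)."""
    return {"words": utterance["words"]}

def acoustic_stage(semantic_tokens):
    """Can only reconstruct detail that survived the semantic stage."""
    return {"rendered_from": semantic_tokens}

utterance = {
    "words": "you did what",
    "prosody": {"emphasis": "what", "rising_pitch": True},  # nuance the bottleneck may drop
}
print(acoustic_stage(semantic_stage(utterance)))            # the prosody never arrives
```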
Sesame AI's Real-time Performance
RVQ-based methods introduce their own set of challenges: a model must account for the sequential dependency between codebooks within a frame. One common method, the delay pattern, shifts higher codebooks progressively so that their predictions are conditioned on lower codebooks within the same frame. A key limitation of this approach is that time-to-first-audio scales poorly, because an RVQ tokenizer with N codebooks requires N backbone steps before the first audio chunk can be decoded. While suitable for offline applications like audiobooks, this delay is problematic in real-time scenarios. Sesame AI's split backbone/decoder design, with its much smaller audio decoder, minimizes this delay while maintaining high-quality output.
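The scaling issue is easy to see by writing down when each token becomes available under a delay pattern. The sketch below assumes one token per codebook per backbone step and a shift of k steps for codebook k; the numbers are illustrative, not measurements of Sesame AI's models.

```python
import numpy as np

N_CODEBOOKS = 8            # illustrative; more codebooks mean higher fidelity
N_FRAMES = 12

# step_of[k, t] = backbone step at which codebook k's token for frame t is produced,
# given that codebook k is shifted right by k steps (the delay pattern).
step_of = np.arange(N_FRAMES)[None, :] + np.arange(N_CODEBOOKS)[:, None]

# Frame 0 can only be decoded once every codebook has emitted its token for it.
steps_before_first_audio = int(step_of[:, 0].max()) + 1
print(steps_before_first_audio)      # == N_CODEBOOKS: time-to-first-audio grows with N
```

Under these assumptions, doubling the number of codebooks doubles the wait before any audio can be played, which is acceptable for an audiobook but not for live conversation.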
Open-sourcing Our Work
We believe that advancing conversational AI should be a collaborative effort. To that end, we're committed to open-sourcing key components of our research, enabling the community to experiment, build upon, and improve our approach. Our models will be available under an Apache 2.0 license. This initiative reflects our commitment to transparency and collaborative innovation in the field of AI voice technology.
Current Limitations
CSM is currently trained on primarily English data; some multilingual ability emerges due to dataset contamination, but it does not perform well yet. It also does not take advantage of the information present in the weights of pre-trained language models. Additionally, while CSM generates high quality conversational prosody, it can only model the text and speech content in a conversation—not the structure of the conversation itself.
Future Development Plans
In the coming months, we intend to scale up model size, increase dataset volume, and expand language support to over 20 languages. We also plan to explore ways to utilize pre-trained language models, working towards large multimodal models that have deep knowledge of both speech and text. Our ultimate goal is to develop fully duplex models that can implicitly learn conversation dynamics from data, including turn taking, pauses, and pacing. These advancements will require fundamental changes across the stack, from data curation to post-training methodologies.