🎙️ ElevenLabs 🗣️ Text to Speech

Text to Speech — Technical Guide

Type any text and hear it spoken in a natural AI voice. Choose from thousands of voices in up to 74 languages (depending on the model), create multi-voice dialogues, and control emotion, speed, and delivery style.


Text to Speech turns written words into natural-sounding audio. Type what you want said, pick a voice from a library of thousands, and the AI generates speech with the rhythm, pauses, and expression of a real speaker. Language support ranges from 29 to 74 languages, depending on the model.

Four modes cover different needs. Create Speech generates audio from text with a single voice — the simplest and most common use. Speech with Timing adds character-level timestamps to the output, useful for syncing audio with subtitles or animations. Create Dialogue lets you assign different voices to different lines, producing a multi-voice conversation with up to 10 unique speakers. Dialogue with Timestamps combines multi-voice with timing data for precise sync workflows.
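The four modes map onto four request shapes. The sketch below only builds each request (it sends nothing) and assumes the page wraps ElevenLabs' public REST API: the endpoint paths, the `model_id` and `inputs` body fields, and the mode labels (`speech`, `speech_timing`, `dialogue`, `dialogue_timing`) are assumptions, and voice IDs are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TTSRequest:
    """A request description: endpoint path plus JSON body (nothing is sent)."""
    path: str
    body: dict


def build_request(mode: str, model_id: str, *, text: str = "", voice_id: str = "",
                  lines: list[tuple[str, str]] | None = None) -> TTSRequest:
    """Map one of the four modes to an endpoint path and JSON body.

    `lines` holds (voice_id, text) pairs and is used only by the dialogue modes.
    Paths are assumed from ElevenLabs' public REST API; the app may differ.
    """
    # Both timing variants add the same suffix to their base endpoint.
    suffix = "/with-timestamps" if mode.endswith("_timing") else ""
    if mode in ("speech", "speech_timing"):
        return TTSRequest(f"/v1/text-to-speech/{voice_id}{suffix}",
                          {"text": text, "model_id": model_id})
    if mode in ("dialogue", "dialogue_timing"):
        inputs = [{"text": t, "voice_id": v} for v, t in (lines or [])]
        return TTSRequest(f"/v1/text-to-dialogue{suffix}",
                          {"inputs": inputs, "model_id": model_id})
    raise ValueError(f"unknown mode: {mode!r}")
```

To actually send one of these, you would POST the body as JSON to `https://api.elevenlabs.io` plus the path, with your API key in the `xi-api-key` header; again, this assumes direct API access rather than the app page.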

Emotion and delivery controls make the speech feel human. On v3, the latest model, audio tags let you write direction directly into the text: mark a line as whispered or excited, or insert a sigh, and the voice responds naturally. Speed and stability sliders fine-tune how fast the voice speaks and how consistent its delivery stays.
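The two sliders accept bounded values. A minimal validation sketch using the ranges this page lists (speed 0.5-2x, stability 0-1); the default values and the dict key names are assumptions, and the underlying API may accept a narrower speed range than the page shows.

```python
def voice_settings(speed: float = 1.0, stability: float = 0.5) -> dict:
    """Check slider values against the page's stated ranges and return them.

    Defaults (1.0 speed, 0.5 stability) and key names are assumptions.
    """
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed {speed} is outside 0.5-2.0")
    if not 0.0 <= stability <= 1.0:
        raise ValueError(f"stability {stability} is outside 0-1")
    return {"speed": speed, "stability": stability}
```

Rejecting out-of-range values up front, rather than silently clamping them, makes a typo like `speed=12` fail loudly instead of producing unexpected audio.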

The generated audio works standalone for podcasts, voiceovers, and narration, or feeds directly into other tools — use it as the audio input for Avatar (photo to talking video) or Lip Sync (make someone in a video speak it). This is how you give your AI character a voice across all their content.

✦ Best Results Tips

🎧 Preview Voices Before Generating

Browse the voice library and listen to previews before committing. Different voices excel at different content — some sound warm and conversational, others sound authoritative and professional. Find the one that matches your character.

✍️ Use Punctuation for Natural Pauses

Commas create short pauses, periods create longer ones, and an ellipsis creates a trailing hesitation. Write the text the way you want it spoken; punctuation is the easiest way to control rhythm and pacing.

🎭 Audio Tags for Emotion (v3 Only)

On the v3 model, insert tags like [excited], [whispers], [sigh] directly in your text to change the delivery mid-sentence. Click any tag pill on the page to insert it at your cursor position.
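Clicking a tag pill amounts to a plain text insertion at the cursor. The sketch below mimics that, and adds a hypothetical `strip_tags` helper for reusing a tagged script on a model other than v3. How non-v3 models treat bracketed tags isn't stated here, so stripping them is a conservative assumption, and the pattern only recognizes lowercase tags.

```python
import re

# One bracketed audio tag such as "[whispers]" plus a following space.
# Lowercase letters and spaces only, so bracketed text like "[1]" survives.
AUDIO_TAG = re.compile(r"\[[a-z ]+\]\s?")


def insert_tag(text: str, cursor: int, tag: str) -> tuple[str, int]:
    """Insert "[tag] " at the cursor, as a tag pill would; return text and new cursor."""
    snippet = f"[{tag}] "
    return text[:cursor] + snippet + text[cursor:], cursor + len(snippet)


def strip_tags(text: str) -> str:
    """Remove audio tags, e.g. before sending the same script to a non-v3 model."""
    return AUDIO_TAG.sub("", text)
```

For example, `insert_tag("I have a secret.", 0, "whispers")` yields `"[whispers] I have a secret."` with the cursor moved just past the inserted tag.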

💬 Dialogue Mode for Conversations

Use Create Dialogue when you need multiple voices — each line gets its own voice assignment. Up to 10 unique voices per generation. Perfect for podcast-style content, interviews, or character interactions.
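Podcast-style scripts are often written as `Name: line` rows. A sketch that turns such a script into (voice, text) pairs and enforces the 10-unique-voice limit stated on this page. The script format and the `cast` mapping from speaker names to voice IDs are hypothetical conventions, not part of the app.

```python
MAX_UNIQUE_VOICES = 10  # per-generation limit stated on this page


def parse_script(script: str, cast: dict[str, str]) -> list[tuple[str, str]]:
    """Turn 'Name: line' rows into (voice_id, text) pairs.

    `cast` maps speaker names to voice IDs (hypothetical placeholders);
    an unknown speaker name raises KeyError.
    """
    pairs = []
    for row in script.strip().splitlines():
        if not row.strip():
            continue  # skip blank lines between turns
        name, sep, text = row.partition(":")
        if not sep:
            raise ValueError(f"missing 'Name:' prefix in {row!r}")
        pairs.append((cast[name.strip()], text.strip()))
    # The limit counts distinct voices; one speaker may have many lines.
    unique = {voice for voice, _ in pairs}
    if len(unique) > MAX_UNIQUE_VOICES:
        raise ValueError(f"{len(unique)} unique voices; the limit is {MAX_UNIQUE_VOICES}")
    return pairs
```

Because the check counts distinct voices rather than lines, a two-person interview can run to many turns without hitting the limit.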

⚡ Flash for Speed, Multilingual for Quality

Flash and Turbo models generate faster and cost less — great for drafts and testing. Multilingual v2 and v3 produce the most natural, expressive speech — use them for final content you plan to publish.

🔗 Feed Audio into Avatar or Lip Sync

Generate speech here, then use the audio file as input for Avatar (turn a photo into a talking video) or Lip Sync (make someone in an existing video speak it). This is the voice pipeline for your AI character.

Text to Speech — Available Models

Multilingual v2 (default)

eleven_multilingual_v2

29 languages. Best quality for non-English content; the default for dubbing.

v3 (latest)

eleven_v3

74 languages. The newest model.

Flash v2.5 (fast)

eleven_flash_v2_5

32 languages. Ultra-fast and cost-efficient.

Turbo v2.5

eleven_turbo_v2_5

32 languages. Low-latency streaming.
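The "Flash for speed, Multilingual for quality" tip can be written down as a small selection policy over the model IDs above. This is a hypothetical sketch: the draft/final split and the rule of switching to v3 whenever audio tags are used are one reading of the tips on this page, not an official recommendation.

```python
# Model IDs and language counts as listed on this page.
MODEL_LANGUAGES = {
    "eleven_multilingual_v2": 29,
    "eleven_v3": 74,
    "eleven_flash_v2_5": 32,
    "eleven_turbo_v2_5": 32,
}


def pick_model(stage: str, needs_audio_tags: bool = False) -> str:
    """Hypothetical policy: Flash for drafts, Multilingual v2 for final output,
    v3 whenever the text relies on audio tags (a v3-only feature)."""
    if needs_audio_tags:
        return "eleven_v3"
    if stage == "draft":
        return "eleven_flash_v2_5"
    if stage == "final":
        return "eleven_multilingual_v2"
    raise ValueError(f"unknown stage: {stage!r}")
```

A real policy might also check `MODEL_LANGUAGES` against the target language, since only v3 covers all 74.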

📥 You Give: 📝 Text to speak · 🎙️ Voice selection · 🎭 Emotion (optional) · 🌍 Language

🎵 You Get: 🎵 Audio

Modes: Speech · Speech + Timing · Dialogue · Dialogue + Timing
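The timing modes return per-character timestamps, which is what subtitle sync needs. A minimal sketch that turns them into SRT cues, assuming three parallel lists (characters, start times, end times in seconds), as in the `alignment` object (`characters`, `character_start_times_seconds`, `character_end_times_seconds`) of ElevenLabs' with-timestamps responses. Those field names are an assumption, and the 42-character cue width is an arbitrary choice.

```python
def words_with_times(chars, starts, ends):
    """Collapse parallel character/start/end lists into (word, start, end) tuples."""
    words, current = [], ""
    w_start = w_end = 0.0
    for ch, s, e in zip(chars, starts, ends):
        if ch.isspace():
            if current:
                words.append((current, w_start, w_end))
                current = ""
            continue
        if not current:
            w_start = s  # first character of a new word
        current += ch
        w_end = e
    if current:
        words.append((current, w_start, w_end))
    return words


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def to_srt(chars, starts, ends, max_chars: int = 42) -> str:
    """Group words into numbered SRT cues of at most `max_chars` characters."""
    cues, line = [], []
    c_start = c_end = 0.0
    for word, s, e in words_with_times(chars, starts, ends):
        # Close the current cue if adding this word would make it too long.
        if line and len(" ".join(line + [word])) > max_chars:
            cues.append((c_start, c_end, " ".join(line)))
            line = []
        if not line:
            c_start = s
        line.append(word)
        c_end = e
    if line:
        cues.append((c_start, c_end, " ".join(line)))
    return "\n".join(
        f"{i}\n{srt_time(a)} --> {srt_time(b)}\n{text}\n"
        for i, (a, b, text) in enumerate(cues, start=1)
    )
```

Cues break only at word boundaries, so a single word longer than `max_chars` still gets its own cue rather than being split mid-word.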

Output formats: MP3 · WAV · PCM · Opus
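PCM output is raw samples with no header, so most players can't open it directly. A minimal sketch that wraps it in a WAV container with Python's standard `wave` module, assuming 16-bit mono little-endian samples at 24 kHz. Match `sample_rate` to the PCM option you actually chose.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV header (assumes 16-bit mono by default)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave leaves a caller-supplied file object open, so the buffer is still readable.
    return buf.getvalue()
```

If the sample rate passed here doesn't match the one the audio was generated at, the WAV will play back too fast or too slow, so it's worth recording the chosen format alongside the file.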

🌍 74 languages: model maximum (v3)

📝 5,000 characters: maximum text per request

🗣️ 10 voices: up to 10 voice inputs per dialogue

⚡ Speed 0.5-2x: playback rate

🎯 Stability 0-1: voice consistency
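Text longer than the 5,000-character limit has to be split across requests. A sketch that packs whole sentences into chunks up to the limit and hard-splits any single sentence that is too long on its own. The sentence-splitting regex is a simple heuristic (it breaks after `.`, `!`, or `?` followed by whitespace), not something the app is documented to do.

```python
import re

MAX_CHARS = 5000  # per-request limit stated on this page


def chunk_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into pieces of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A sentence that alone exceeds the limit is cut into fixed-size pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence ends keeps each request's pacing natural; a hard split mid-sentence is only a fallback, since a cut word can sound abrupt at the seam.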

💰 Text to Speech — Pricing


Failed jobs are automatically refunded