Multi-voice TTS

AI Dialogue Generator

Generate multi-character conversations with 54 distinct voices. Free, commercial-use, no signup. Build NotebookLM-style podcasts, audiobook scenes, and animated dialogue in minutes.

No signup · 100% free · 54 voices · Instant WAV
Two-voice TTS

Multi-character dialogue from a flat script

Most AI dialogue tools lock multi-speaker mode behind a paid plan. FreeTextoSpeech runs the same workflow for free — generate each character with a different voice from the 54-voice catalog, splice the WAVs in any audio editor, ship the scene. Podcasts, audiobooks, animated shorts, language drills, training scenarios. One pipeline, no subscription.

The quick answer

Write the script with speaker tags for your own clarity. Generate Character A's lines as one WAV with one voice (Sarah), Character B's lines as a second WAV with a contrasting voice (Adam). Stitch them in Audacity or Reaper, leave 200–400 ms gaps between turns, and bounce a single dialogue track. Free, commercial-use, 24 kHz WAV per voice.
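The tag-stripping step can be scripted instead of done by hand. The sketch below is one way to do it, not part of the FreeTextoSpeech tool: it assumes tags are written in capitals as `SPEAKER: line` and that stage directions sit in square brackets. The function name `split_by_speaker` is made up for this example.

```python
import re
from collections import defaultdict

# Assumed tag format: an all-caps speaker name, a colon, then the line.
TAG = re.compile(r"^\s*([A-Z][A-Z0-9 _-]*):\s*(.+)$")
# Assumed stage-direction format: anything inside square brackets.
BRACKETS = re.compile(r"\[[^\]]*\]")

def split_by_speaker(script: str) -> dict[str, list[str]]:
    """Group lines by speaker, with tags and bracketed directions removed."""
    lines = defaultdict(list)
    for raw in script.splitlines():
        match = TAG.match(raw)
        if not match:
            continue  # skip blank lines and untagged text
        speaker, text = match.groups()
        text = BRACKETS.sub("", text)
        text = " ".join(text.split())  # collapse leftover whitespace
        if text:
            lines[speaker.strip()].append(text)
    return dict(lines)

script = """\
SARAH: So what makes two voices easy to tell apart?
ADAM: [leans in] Contrast. Gender, accent, or tone.
SARAH: Well... that sounds simple enough.
"""

for speaker, turns in split_by_speaker(script).items():
    print(f"--- paste for {speaker} ---")
    print("\n".join(turns))
```

Each printed block is what you would paste for that character, one generation per voice.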

Dialogue workflow

From script to two-voice scene

  1. Write the script with speaker tags

    Format every line as SPEAKER: line. Keep turns short (1–3 sentences) so handoffs feel natural. Strip the speaker tags before pasting — TTS reads them out loud otherwise.

  2. Generate each character with a different voice

    Paste only Character A's lines, pick a voice (e.g. Sarah), generate, and save the WAV. Repeat for Character B with a contrasting voice (e.g. Adam). Use the same speed for both unless one character is meant to speak faster.

  3. Stitch the WAVs in an audio editor

    Drop both WAVs onto separate tracks in Audacity, Reaper, or Logic. Cut each turn to its own clip and slide them into conversational order. Leave 200–400 ms gaps between turns — silence sells the back-and-forth.

  4. Export as a single dialogue track

    Bounce to one stereo WAV or MP3. Optional: pan Character A 15% left and Character B 15% right for clearer separation in headphones. Loudness-normalize to -16 LUFS and you have a publish-ready dialogue.
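If you would rather script the stitch than drag clips around, the gap arithmetic is simple: at 24 kHz, a 300 ms gap is 7,200 samples of silence. The sketch below uses only Python's standard `wave` module and assumes each turn has already been cut to its own file and that the files are mono 16-bit PCM. The page states the 24 kHz rate but not the channel count or bit depth, so check your downloads. All function names are illustrative.

```python
import wave

SAMPLE_RATE = 24_000  # the page states 24 kHz WAV output

def read_frames(path: str) -> bytes:
    """Return raw PCM frames from a WAV file, checking the assumed format."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
        if fmt != (SAMPLE_RATE, 1, 2):
            raise ValueError(f"{path}: expected 24 kHz mono 16-bit PCM")
        return wav.readframes(wav.getnframes())

def silence(gap_ms: int) -> bytes:
    """Digital silence: zero-valued 16-bit mono samples."""
    return b"\x00\x00" * (SAMPLE_RATE * gap_ms // 1000)

def join_turns(turns: list[bytes], gap_ms: int = 300) -> bytes:
    """Concatenate turns in conversational order with a gap between each."""
    return silence(gap_ms).join(turns)

def write_wav(path: str, frames: bytes) -> None:
    """Write raw frames as a 24 kHz mono 16-bit WAV."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(frames)

def stitch(turn_paths: list[str], out_path: str, gap_ms: int = 300) -> None:
    """Read one file per turn, join them with gaps, and write one track."""
    frames = [read_frames(path) for path in turn_paths]
    write_wav(out_path, join_turns(frames, gap_ms))
```

A call such as `stitch(["sarah_01.wav", "adam_01.wav", "sarah_02.wav"], "dialogue.wav")` (hypothetical file names, listed in conversational order) produces a single mono track. Panning and loudness normalization still happen in your editor.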

When to use it

What people build with multi-voice TTS

04 scenarios
01 / 04

NotebookLM-style podcast clones

Two-host podcast where one AI voice asks questions and the other explains. Build the script, generate Host A and Host B with contrasting voices, splice. No subscription, no waitlist.

02 / 04

Animated shorts and game prototypes

Voice every character in your storyboard or Twine prototype with a different Kokoro voice. With 54 voices you can cover a small cast without hiring voice actors for the rough cut.

03 / 04

Audiobook character dialogue

Narrate the prose with one voice (River or Daniel), then swap in distinct character voices for direct quotes. Listeners track who is speaking without a "she said" tag every line.

04 / 04

Language-learning conversations

Generate two-speaker dialogues in Spanish, French, Hindi, Japanese, Mandarin, and 4 more languages. Pair a male and female native voice for realistic A/B exchange drills.

Voice guide

Six voice pairings that contrast cleanly

Dialogue lives or dies on whether listeners can tell the two voices apart without thinking. These six pairings contrast on gender, accent, or character tone — pick one and start generating.

01 US English

Sarah + Adam

Host + expert guest

Best for

NotebookLM-style explainer podcasts. Sarah asks the questions, Adam delivers the authoritative answer. Most natural pairing for two-host US English shows.

02 US English

Bella + Liam

Casual conversation

Best for

Lifestyle pods, friend-chat formats, café-scene dialogue in audiobooks. Both voices read warm, so the back-and-forth feels relaxed instead of formal.

03 British English

Daniel + Emma

Period drama or BBC-style

Best for

Audiobook scenes set in the UK, prestige documentary two-handers, history podcasts with a host-and-narrator structure. The matched accent keeps the world consistent.

04 US English (character)

Puck + Sky

Animation / game characters

Best for

Animated shorts, Twine prototypes, indie game NPC dialogue. Puck is mischievous, Sky is bright — clear character voices, not narrator voices.

05 US English

River + Nova

Documentary narrator + interviewee

Best for

True-crime, history, science docs where a smooth narrator threads between recreated quotes. River carries the throughline, Nova plays the voice in the archive.

06 US English

Michael + Jessica

Co-host energy

Best for

Morning-show banter, news-and-chat formats, branded podcast intros where two hosts trade a cold open before the show kicks in.

Best practices

How to write dialogue that reads as natural TTS

The model handles the voice. The realism comes from how you script the beats, the silences, and the small reactions between characters. Six rules that cover the full pipeline.

  • 01

    Strip speaker tags before pasting

    Write the script as SARAH: line / ADAM: line for your own clarity, then delete the tags before generating. Otherwise the voice reads "Sarah colon" out loud. Same for stage directions in brackets — keep them in the script doc, not in the TTS input.

  • 02

    Use punctuation to script the beats between lines

    A comma is a quarter-beat, a period is a half-beat, an em dash is a real pause, an ellipsis is a held beat. If a character is hesitating, write "Well... I don't know if that's true." instead of "Well I don't know if that's true." The pause is what sells the hesitation, and Kokoro respects the punctuation.

  • 03

    Leave 200–400 ms of silence between turns

    Real conversational gap-time averages around 250 ms. Slap two TTS clips back-to-back and the dialogue sounds robotic; add 300 ms of true silence between turns in your editor and it sounds like two people talking. For tense scenes, drop to 100 ms. For thoughtful exchanges, push to 600 ms.

  • 04

    Fake interruptions with a 50–100 ms overlap

    Truncate Character A's clip mid-word, then drag Character B's clip to start 50–100 ms before A ends. The brief overlap is the universal audio cue for "they cut in." A short crossfade across the overlap zone (no longer than the overlap itself) smooths the splice, and the interruption reads as natural.

  • 05

    Pan voices slightly left and right for headphone clarity

    Pan Character A 15% left and Character B 15% right. Not enough to feel theatrical, enough that headphone listeners stop conflating who's speaking. Hard panning (50%+) only works for radio drama. For podcasts and audiobooks, 15% is the sweet spot.

  • 06

    Spell phonetic reactions for laughter and "uh-huh"

    No SSML, no <laugh> tag — write what you want spoken. "Hah hah hah" delivers a clean three-syllable laugh, "mmhm" reads as agreement, "ugh" lands as exasperation. Test the reaction as a 10-character solo generation, swap voices if one delivers it cleaner, then drop the WAV onto a separate track in your edit.
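The interruption trick in rule 04 comes down to sample offsets: at 24 kHz, an 80 ms overlap is 1,920 samples. Here is a minimal sketch of the mix, assuming both clips are already loaded as 16-bit mono samples and that Character A's clip has been truncated where the interruption should land. The linear crossfade is one reasonable choice, not the only one, and `interrupt` is an illustrative name.

```python
from array import array

SAMPLE_RATE = 24_000  # 24 kHz, per the page

def ms_to_samples(ms: int) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return SAMPLE_RATE * ms // 1000

def interrupt(a: array, b: array, overlap_ms: int = 80) -> array:
    """Start b before a ends, crossfading linearly across the overlap."""
    n = min(ms_to_samples(overlap_ms), len(a), len(b))
    out = array("h", a[: len(a) - n])  # a's audio before the overlap
    for i in range(n):
        fade_in = (i + 1) / (n + 1)    # b ramps up
        fade_out = 1.0 - fade_in       # a ramps down
        mixed = a[len(a) - n + i] * fade_out + b[i] * fade_in
        out.append(max(-32768, min(32767, round(mixed))))  # stay in int16 range
    out.extend(b[n:])                  # rest of b after the overlap
    return out

# Synthetic stand-ins for two 200 ms clips.
clip_a = array("h", [1000] * 4800)
clip_b = array("h", [-1000] * 4800)
scene = interrupt(clip_a, clip_b, overlap_ms=80)
print(len(scene))  # 4800 + 4800 - 1920 = 7680 samples
```

The result is shorter than the two clips laid end to end by exactly the overlap, which is why the cut-in feels fast.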

Honest comparison

FreeTextoSpeech vs PlayHT and Murf dialogue features

PlayHT and Murf both ship native multi-speaker modes: paste once, render one mix. The trade-off is monthly character caps, a signup, and a paid commercial license. Here is the honest read.

Multi-voice dialogue support

FreeTextoSpeech

Generate each character separately with any of 54 voices, then splice in your editor. There is no cap on the number of voices in a scene.

PlayHT / Murf free tiers

Native multi-speaker dialogue feature, but locked behind paid tiers, with a capped voice roster on the free tier.

Voice variety per scene

FreeTextoSpeech

54 voices across 9 languages — enough for an entire ensemble cast.

PlayHT / Murf free tiers

Smaller free voice pool with most distinct character voices paywalled.

Cost for a 5-minute dialogue

FreeTextoSpeech

Free. No cap on the number of generations; each generation is limited to 5,000 characters.

PlayHT / Murf free tiers

Counts against monthly character cap on free tier; longer scenes push you to a paid plan.

Commercial license on free tier

FreeTextoSpeech

Full commercial use allowed, no attribution.

PlayHT / Murf free tiers

Commercial use restricted on free tier — paid plan required for monetized podcasts and indie games.

Output format

FreeTextoSpeech

24 kHz WAV per voice — drop straight onto separate tracks in any DAW.

PlayHT / Murf free tiers

Compressed MP3 on free tier, often single-mix only.

Signup

FreeTextoSpeech

None. Open the page, paste, generate.

PlayHT / Murf free tiers

Email signup required, credit card on file for commercial features.

One-click multi-voice generation

FreeTextoSpeech

Not native — you generate each voice separately and splice.

PlayHT / Murf free tiers

Native multi-speaker dialogue mode on paid tiers (one paste, one render).

If a single-paste multi-speaker render is non-negotiable, PlayHT and Murf own that feature. If you can spend 10 extra minutes splicing in Audacity to skip the subscription, the math here is straightforward.

FAQ

Frequently Asked Questions

01 Can I generate two voices in a single request?
No — each generation uses one voice. The dialogue workflow is to generate Character A's lines as one WAV, Character B's lines as a second WAV, then splice them in Audacity, Reaper, or any editor. The whole thing takes 5–10 minutes for a 2-minute scene.
02 How do I keep voices from sounding similar?
Pick voices that contrast on at least two axes: gender, accent, or pitch. Sarah + Adam (US female warm + US male authoritative) reads cleanly. Bella + Daniel (US conversational + UK formal) reads even cleaner because the accent shift is unmistakable. Avoid pairing two same-accent same-gender voices — listeners lose the thread.
03 How do I fake an interruption between characters?
Cut Character A's clip mid-word in your editor (Audacity: select to the cut point, hit Delete), then start Character B's line 50–100 ms before A's clip ends so the audio overlaps briefly. The overlap is the cue. Real interruptions on radio average 80 ms of overlap — match that and it lands.
04 How do I handle laughter, sighs, or "uh-huh" reactions?
There is no non-verbal sound tag in the model. Two workable options: write phonetic stand-ins ("hah hah hah", "mmhm", "huh") and let the voice deliver them, or grab royalty-free reaction sound effects from Freesound.org and drop them on a third track. The phonetic approach is faster; the SFX approach sounds more cinematic.
05 Can I publish AI-generated dialogue commercially?
Yes. The 24 kHz WAVs you download from FreeTextoSpeech are licensed for full commercial use — podcasts, monetized YouTube videos, paid audiobooks, indie games, ad reads. No attribution required and no royalty share. The license covers two-voice dialogue scenes the same way it covers single-voice narration.
06 How long can a dialogue scene be?
Each generation is capped at 5,000 characters, but you can run as many generations as you need. A typical two-host 30-minute podcast scene works out to roughly 4,500 words of dialogue split across both characters — about 12 generations, six per voice. Splice them in one project file and you have the full episode.
07 Does the dialogue sound natural across the cut between turns?
Yes, because the Kokoro model is deterministic per voice. Character A always sounds like Character A across every clip. The seam is in the silence between turns, not in the voice itself — leave 200–400 ms of room tone or true silence between speakers and the conversation breathes the way human dialogue does.
08 Can I do three or more characters?
Yes. The workflow scales: pick a distinct voice per character, generate each character's lines separately, and splice them all on one timeline. Three voices is fine; four starts to strain listener tracking unless one is clearly the narrator. For ensemble scenes, lean on accent variety (US, UK, plus a voice from one of the other supported languages) to keep voices clearly identifiable.
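For the scene-length question (06), a quick estimate shows how the 5,000-character cap translates into generations. The pace of 150 words per minute and the average of 6 characters per word (including the space) are assumptions for this sketch, not figures from the tool; only the cap comes from the FAQ.

```python
import math

CHAR_CAP = 5_000          # per-generation cap stated in the FAQ
WORDS_PER_MINUTE = 150    # assumption: typical conversational pace
CHARS_PER_WORD = 6        # assumption: average English word plus a space

def estimate_generations(minutes: float, speakers: int = 2) -> dict[str, int]:
    """Lower bound on generations if the script splits evenly by speaker."""
    words = round(minutes * WORDS_PER_MINUTE)
    chars = words * CHARS_PER_WORD
    per_speaker = math.ceil(chars / speakers)
    minimum = speakers * math.ceil(per_speaker / CHAR_CAP)
    return {"words": words, "characters": chars, "min_generations": minimum}

print(estimate_generations(30))
# {'words': 4500, 'characters': 27000, 'min_generations': 6}
```

Under these assumptions the cap alone gives a floor of six generations for a 30-minute two-host scene. The figure of about 12 in the answer above leaves room for splitting at natural scene boundaries rather than filling every generation to the limit.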

Try it now

Build your first two-voice scene.

Generate it free, in under 10 minutes, with full commercial rights.