
Voxtral Transcribe — Technical Guide

Transcribe audio and video files in 13 languages with speaker diarization

Voxtral Transcribe converts audio and video files into written text using Mistral AI. Upload a recording — podcast, interview, meeting, voiceover, or any media file — and the AI produces a full text transcript with optional speaker identification and word-level timestamps.

Supports 13 languages: French, English, Spanish, Arabic, Russian, Japanese, Chinese, German, Portuguese, Italian, Korean, Hindi, and Dutch. Set the language manually or let the AI detect it automatically from the audio content.

Speaker diarization identifies individual speakers in multi-person recordings. When enabled, the transcript labels each segment — Speaker 1, Speaker 2 — so you can follow who said what in interviews, meetings, or dialogues. Word timestamps add precise timing data to every word, useful for subtitle creation or syncing text with video.
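
As a rough illustration of how diarized output might be turned into a readable transcript, the sketch below merges consecutive segments from the same speaker into labeled lines. The segment shape (a dict with `speaker` and `text` keys) is an assumption made for this example, not the documented output format of the tool.

```python
from itertools import groupby

def format_diarized(segments):
    """Merge consecutive segments from the same speaker into labeled lines.

    `segments` is a hypothetical list of dicts with "speaker" and "text" keys.
    """
    lines = []
    for speaker, group in groupby(segments, key=lambda s: s["speaker"]):
        text = " ".join(seg["text"].strip() for seg in group)
        lines.append(f"Speaker {speaker}: {text}")
    return "\n".join(lines)

segments = [
    {"speaker": 1, "text": "Welcome to the show."},
    {"speaker": 1, "text": "Thanks for joining."},
    {"speaker": 2, "text": "Glad to be here."},
]
print(format_diarized(segments))
# Speaker 1: Welcome to the show. Thanks for joining.
# Speaker 2: Glad to be here.
```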

Context bias lets you feed the AI a list of proper nouns, brand names, or technical terms that might otherwise be misheard. Add names like Voxtral, ArtCoreAI, or domain-specific jargon, and the AI boosts recognition accuracy for those words.
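
The settings described so far (language, diarization, word timestamps, context bias) could be collected in a single options object before submitting a job. The sketch below is purely illustrative: `TranscribeOptions` and its field names are hypothetical and do not correspond to a documented API. The language codes are the ISO 639-1 codes for the 13 languages listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# ISO 639-1 codes for the 13 languages named in this guide.
SUPPORTED_LANGUAGES = {
    "fr", "en", "es", "ar", "ru", "ja", "zh",
    "de", "pt", "it", "ko", "hi", "nl",
}

@dataclass
class TranscribeOptions:
    """Hypothetical settings container; not an official API object."""
    language: Optional[str] = None          # None means auto-detect
    diarize: bool = False
    word_timestamps: bool = False
    context_bias: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.language is not None and self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {self.language}")
        # Drop blank terms and case-insensitive duplicates, keeping first-seen order.
        seen = set()
        cleaned = []
        for term in self.context_bias:
            term = term.strip()
            if term and term.lower() not in seen:
                seen.add(term.lower())
                cleaned.append(term)
        self.context_bias = cleaned

opts = TranscribeOptions(
    language="fr",
    diarize=True,
    context_bias=["Voxtral", " voxtral ", "", "ArtCoreAI"],
)
print(opts.context_bias)  # ['Voxtral', 'ArtCoreAI']
```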

Accepted formats include MP3, WAV, M4A, FLAC, OGG, MP4, MOV, and WebM — up to 500 MB and 3 hours per file. A waveform visualization shows the uploaded audio with duration and file info before you submit. Results display as formatted text with a one-click copy button, and the transcript is saved for later reference.
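
A file can be checked against these limits before uploading. The following is a minimal sketch under two assumptions: that 500 MB means 500 × 1024 × 1024 bytes, and that the duration is already known (for example from a media probing tool). The function name `check_upload` is hypothetical.

```python
from pathlib import Path

ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".mp4", ".mov", ".webm"}
MAX_BYTES = 500 * 1024 * 1024   # assumption: 1 MB = 1024 * 1024 bytes
MAX_SECONDS = 3 * 60 * 60       # 3 hours

def check_upload(filename, size_bytes, duration_seconds):
    """Return a list of problems; an empty list means the file fits the limits."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ACCEPTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 500 MB")
    if duration_seconds > MAX_SECONDS:
        problems.append("recording exceeds 3 hours")
    return problems

print(check_upload("interview.MP3", 40 * 1024 * 1024, 1800))  # []
print(check_upload("notes.txt", 100, 10))  # ['unsupported format: .txt']
```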

Cost is based on audio duration, approximately $0.003 per minute, making it one of the most affordable transcription options available.

✦ Best Results Tips

🎧 Clean Audio Gives Clean Transcripts

Background noise, music, and echo reduce transcription accuracy. For best results, use recordings with clear speech and minimal interference. If transcribing from video, ensure the dialogue track is prominent.

🗣️ Enable Diarization for Multi-Speaker

If your recording has more than one person speaking, turn on speaker diarization. The AI separates and labels each speaker, making the transcript easy to follow — essential for interviews, meetings, and podcasts.

📌 Use Context Bias for Names

Add proper nouns, brand names, and technical terms to the context bias field. Words like Voxtral, ArtCoreAI, or industry jargon are often misheard without this hint — context bias dramatically improves accuracy for uncommon words.

🌍 Set the Language When Known

Auto-detect works well for single-language recordings, but if you know the language, set it manually. This avoids detection errors on short clips or recordings with accented speech.

⏱️ Word Timestamps for Subtitles

Enable word timestamps if you plan to create subtitles or sync the text with video. Each word gets a precise time marker, making it easy to align text with visual content.
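
To show how word-level timing could feed a subtitle file, the sketch below groups words into SRT cues. The input shape (dicts with `word`, `start`, and `end` in seconds) and the seven-word cue length are assumptions for illustration; the actual timestamp format returned by the tool may differ.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

def words_to_srt(words, max_words=7):
    """Group word timestamps into numbered SRT cues of up to max_words words.

    `words` is a hypothetical list of dicts with "word", "start", "end" keys.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0]["start"], chunk[-1]["end"]
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(cues)

words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
print(words_to_srt(words))
```

Each cue spans from the start of its first word to the end of its last word, so shorter cues track speech more tightly at the cost of more frequent subtitle changes.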

💰 Extremely Low Cost

At roughly $0.003 per minute, transcribing a full hour of audio costs about $0.18 in credits (60 × $0.003), which is less than $0.20. Test with a short clip first to verify quality, then process longer recordings confidently.
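
The per-minute arithmetic can be sketched as follows, using the $0.003 per minute rate listed for voxtral-mini-latest. This assumes cost scales linearly with duration and is not rounded up to whole minutes, which the guide does not confirm; `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.003")  # USD per minute, as quoted in this guide

def estimate_cost(duration_seconds):
    """Estimate cost in USD, assuming linear per-second proration (an assumption)."""
    minutes = Decimal(duration_seconds) / Decimal(60)
    return (minutes * RATE_PER_MINUTE).quantize(Decimal("0.0001"))

print(estimate_cost(3600))    # 0.1800  (one hour)
print(estimate_cost(10800))   # 0.5400  (the 3-hour maximum)
```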

Voxtral Transcribe — Available Models

Voxtral Mini Transcribe (voxtral-mini-latest)

Batch model, selected by default. Mode: transcribe.

State-of-the-art transcription with speaker diarization. 4% word error rate (WER) on the FLEURS benchmark. $0.003 per minute.

💰 Voxtral Transcribe — Pricing

Failed jobs are automatically refunded