Real-World Speech-to-Text Accuracy: Benchmarking AssemblyAI, Deepgram, WhisperX & Saaras on Production Audio

Every time a new AI speech-to-text model launches, we see impressive benchmark numbers.
Gemini, GPT-4o, Whisper, Nova — and recently Saaras — all report strong results on datasets like LibriSpeech and Common Voice.
But there’s a problem.
Those benchmarks don’t represent real production audio.
At Scribie, we run a professional transcription service. The files we process daily include legal depositions, interviews, podcasts, and multi-speaker calls — exactly the type of recordings where speech recognition systems struggle.
Instead of relying on academic benchmarks, we decided to test the leading speech-to-text systems on real production transcription audio.
We benchmarked:
- WhisperX
- AssemblyAI
- Deepgram
- Saaras
- VibeVoice-ASR
We also tested how much accuracy improves when the transcripts are proofread using a multimodal LLM (Google Gemini).
Full results are published on our live leaderboard:
https://scribie.com/leaderboard
Why Academic ASR Benchmarks Don’t Reflect Production Reality
Most speech recognition benchmarks rely on clean, scripted speech recorded in controlled environments.
Typical conditions include:
- Single speaker
- Studio-quality recording
- Minimal background noise
- Carefully spoken sentences
Real-world transcription looks very different.
Our benchmark dataset includes:
- Legal depositions with rapid attorney-witness exchanges
- Multi-speaker calls with 2–7 participants
- Zoom recordings with compression artifacts
- Phone recordings with inconsistent audio quality
- Long interviews lasting 60–90 minutes
- Specialized terminology and proper nouns
When models reporting sub-5% WER on LibriSpeech are tested on this type of audio, the numbers change significantly.
Baseline Speech-to-Text Accuracy on Production Audio
We evaluated five systems on a 26-file dataset drawn from real transcription jobs.
All metrics use N-WER (Normalized Word Error Rate), which removes punctuation and casing differences so we measure true content errors only.
| System | Type | N-WER | Notes |
|---|---|---|---|
| WhisperX (large-v3 + pyannote) | Open source | 12.81% | Whisper + wav2vec alignment + diarization |
| AssemblyAI Universal-3 | Commercial API | 15.13% | Word timestamps + speaker labels |
| Deepgram Nova-3 | Commercial API | 15.62% | Strong commercial alternative |
| Saaras | Commercial API | 17.23% | Files >1h chunked |
| VibeVoice-ASR (7B) | Open source | 21.72% | Microsoft multimodal model |
Even the best system gets roughly 1 in 8 words wrong.
On a 60-minute recording (~9,000 words), that translates to well over 1,000 corrections for a human proofreader.
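The back-of-envelope arithmetic behind that correction count (word count and rate are the approximations stated above):

```python
words_per_hour = 9_000   # ~150 words per minute over 60 minutes
best_nwer = 0.1281       # WhisperX, from the table above

corrections = round(words_per_hour * best_nwer)
print(corrections)  # → 1153
```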
But word accuracy is only part of the challenge.
Speaker Diarization Is Even Harder
Professional transcripts must identify who said what.
We measure this with DER (Diarization Error Rate).
| System | N-WER | DER |
|---|---|---|
| WhisperX | 12.81% | 26.31% |
| AssemblyAI | 15.13% | 26.61% |
| Deepgram | 15.62% | 27.23% |
Across every system we tested, speaker attribution errors were significantly worse than word errors.
Speaker diarization remains one of the most difficult problems in speech recognition.
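To make the metric concrete, here is a greatly simplified, frame-level sketch of diarization error. Real DER scoring (e.g. pyannote.metrics) also accounts for missed speech, false alarms, overlapping speakers, and a forgiveness collar; the labels below are purely illustrative.

```python
from itertools import permutations

def simple_der(reference, hypothesis):
    """Frame-level diarization error under the best one-to-one speaker
    mapping. A simplification: assumes aligned frames with exactly one
    speaker each, ignoring missed speech and false alarms."""
    assert len(reference) == len(hypothesis)
    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    # Pad the hypothesis label set so every permutation is a full mapping.
    while len(hyp_speakers) < len(ref_speakers):
        hyp_speakers.append(None)
    best_errors = len(reference)
    for perm in permutations(hyp_speakers, len(ref_speakers)):
        mapping = dict(zip(perm, ref_speakers))
        errors = sum(1 for r, h in zip(reference, hypothesis)
                     if mapping.get(h) != r)
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)

# Diarizers invent arbitrary labels ("1", "2"), so scoring must find the
# label mapping that best matches the reference before counting errors.
print(simple_der(["A", "A", "B", "B"], ["1", "1", "2", "2"]))  # → 0.0
print(simple_der(["A", "A", "B", "B"], ["1", "2", "2", "2"]))  # → 0.25
```

The permutation search is what makes diarization scoring different from word scoring: a system can be perfectly consistent yet use different label names, and that should not count as an error.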
Saaras: The New Entrant
Saaras is a recently released speech-to-text API designed for multilingual and Indian-language transcription.
Because the system has been gaining traction, we included it in our benchmark.
| System | N-WER |
|---|---|
| WhisperX | 12.81% |
| AssemblyAI | 15.13% |
| Deepgram | 15.62% |
| Saaras | 17.23% |
On our dataset — primarily English legal and interview recordings — Saaras performed slightly behind the leading commercial APIs.
However, Saaras appears optimized for multilingual and Indian language scenarios, where performance may differ significantly.
As always, accuracy varies heavily depending on the type of audio being tested.
Why Raw WER Can Be Misleading
A surprising discovery from our analysis:
A large percentage of reported transcription errors are not real content errors.
Example:
- Human transcript: "Yeah, I think so."
- ASR output: "Yeah. I think so."
Raw WER counts this as an error.
But the spoken words are identical.
Across our dataset:
Raw WER averaged 5.5 percentage points higher than N-WER.
That means 30–44% of reported errors were just punctuation differences.
For this reason, all results in this article use N-WER.
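As a concrete illustration, here is a minimal sketch (not our production scorer) of how normalization changes the measured error rate on the example above:

```python
import re

def normalize(text: str) -> list[str]:
    """Strip punctuation and casing before scoring (the 'N' in N-WER)."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate via Levenshtein edit distance over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref, hyp = "Yeah, I think so.", "Yeah. I think so."
print(wer(ref.split(), hyp.split()))            # raw: "Yeah," != "Yeah." → 0.25
print(wer(normalize(ref), normalize(hyp)))      # normalized: identical → 0.0
```

One punctuation mark inflates raw WER from 0% to 25% on this utterance, even though every spoken word was transcribed correctly.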
Our Approach: AI Proofreading with Gemini
Instead of replacing ASR systems, we added a correction layer on top of them.
A multimodal LLM listens to the audio and compares it against the transcript.
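A minimal sketch of what such a correction layer can look like. The prompt wording is our own illustration, and the commented-out call uses the google-generativeai package with an assumed model name; it is not Scribie's exact pipeline.

```python
def build_proofread_prompt(draft_transcript: str) -> str:
    """Assemble the instruction that pairs the draft transcript with the
    audio so the model fixes only genuine content errors."""
    return (
        "Listen to the attached audio and proofread this transcript. "
        "Fix misheard words, names, and terminology, but do not "
        "rephrase sentences that are already correct.\n\n"
        f"TRANSCRIPT:\n{draft_transcript}"
    )

# Hypothetical live call (requires an API key; model name is an assumption):
# import google.generativeai as genai
# genai.configure(api_key=API_KEY)
# audio = genai.upload_file("deposition.mp3")
# model = genai.GenerativeModel("gemini-1.5-pro")
# corrected = model.generate_content(
#     [audio, build_proofread_prompt(draft)]
# ).text
```

The key design choice is that the LLM receives both the audio and the draft, so it corrects against what was actually said rather than merely rewriting the text.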