Real-World Speech-to-Text Accuracy: Benchmarking AssemblyAI, Deepgram, WhisperX & Saaras on Production Audio

Every time a new AI speech-to-text model launches, we see impressive benchmark numbers.
Gemini, GPT-4o, Whisper, Nova — and recently Saaras — all report strong results on datasets like LibriSpeech and Common Voice.
But there’s a problem.
Those benchmarks don’t represent real production audio.
At Scribie, we run a professional transcription service. The files we process daily include legal depositions, interviews, podcasts, and multi-speaker calls — exactly the type of recordings where speech recognition systems struggle.
Instead of relying on academic benchmarks, we decided to test the leading speech-to-text systems on real production transcription audio.
We benchmarked:
- WhisperX
- AssemblyAI
- Deepgram
- Saaras
- VibeVoice-ASR
We also tested how much accuracy improves when the transcripts are proofread using a multimodal LLM (Google Gemini).
Full results are published on our live leaderboard:
https://scribie.com/leaderboard
Why Academic ASR Benchmarks Don’t Reflect Production Reality
Most speech recognition benchmarks rely on clean, scripted speech recorded in controlled environments.
Typical conditions include:
- Single speaker
- Studio-quality recording
- Minimal background noise
- Carefully spoken sentences
Real-world transcription looks very different.
Our benchmark dataset includes:
- Legal depositions with rapid attorney-witness exchanges
- Multi-speaker calls with 2–7 participants
- Zoom recordings with compression artifacts
- Phone recordings with inconsistent audio quality
- Long interviews lasting 60–90 minutes
- Specialized terminology and proper nouns
When models reporting sub-5% WER on LibriSpeech are tested on this type of audio, the numbers change significantly.
Baseline Speech-to-Text Accuracy on Production Audio
We evaluated five systems on a 26-file dataset drawn from real transcription jobs.
All metrics use N-WER (Normalized Word Error Rate), which removes punctuation and casing differences so we measure true content errors only.
| System | Type | N-WER | Notes |
|---|---|---|---|
| WhisperX (large-v3 + pyannote) | Open source | 12.81% | Whisper + wav2vec alignment + diarization |
| AssemblyAI Universal-3 | Commercial API | 15.13% | Word timestamps + speaker labels |
| Deepgram Nova-3 | Commercial API | 15.62% | Strong commercial alternative |
| Saaras | Commercial API | 17.23% | Files >1h chunked |
| VibeVoice-ASR (7B) | Open source | 21.72% | Microsoft multimodal model |
Even the best system gets roughly 1 in 8 words wrong.
On a 60-minute recording (~9,000 words), that translates to well over 1,000 corrections for a human proofreader.
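The back-of-envelope arithmetic behind that correction count (word count and rate are the approximations stated above):

```python
words_per_hour = 9_000   # ~150 words per minute over 60 minutes
best_nwer = 0.1281       # WhisperX, from the table above

corrections = round(words_per_hour * best_nwer)
print(corrections)  # → 1153
```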
But word accuracy is only part of the challenge.
Speaker Diarization Is Even Harder
Professional transcripts must identify who said what.
We measure this with DER (Diarization Error Rate).
| System | N-WER | DER |
|---|---|---|
| WhisperX | 12.81% | 26.31% |
| AssemblyAI | 15.13% | 26.61% |
| Deepgram | 15.62% | 27.23% |
Across every system we tested, speaker attribution errors were significantly worse than word errors.
Speaker diarization remains one of the most difficult problems in speech recognition.
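To make the metric concrete, here is a greatly simplified, frame-level sketch of diarization error. Real DER scoring (e.g. pyannote.metrics) also accounts for missed speech, false alarms, overlapping speakers, and a forgiveness collar; the labels below are purely illustrative.

```python
from itertools import permutations

def simple_der(reference, hypothesis):
    """Frame-level diarization error under the best one-to-one speaker
    mapping. A simplification: assumes aligned frames with exactly one
    speaker each, ignoring missed speech and false alarms."""
    assert len(reference) == len(hypothesis)
    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    # Pad the hypothesis label set so every permutation is a full mapping.
    while len(hyp_speakers) < len(ref_speakers):
        hyp_speakers.append(None)
    best_errors = len(reference)
    for perm in permutations(hyp_speakers, len(ref_speakers)):
        mapping = dict(zip(perm, ref_speakers))
        errors = sum(1 for r, h in zip(reference, hypothesis)
                     if mapping.get(h) != r)
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)

# Diarizers invent arbitrary labels ("1", "2"), so scoring must find the
# label mapping that best matches the reference before counting errors.
print(simple_der(["A", "A", "B", "B"], ["1", "1", "2", "2"]))  # → 0.0
print(simple_der(["A", "A", "B", "B"], ["1", "2", "2", "2"]))  # → 0.25
```

The permutation search is what makes diarization scoring different from word scoring: a system can be perfectly consistent yet use different label names, and that should not count as an error.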
Saaras: The New Entrant
Saaras is a recently released speech-to-text API designed for multilingual and Indian-language transcription.
Because the system has been gaining traction, we included it in our benchmark.
| System | N-WER |
|---|---|
| WhisperX | 12.81% |
| AssemblyAI | 15.13% |
| Deepgram | 15.62% |
| Saaras | 17.23% |
On our dataset — primarily English legal and interview recordings — Saaras performed slightly behind the leading commercial APIs.
However, Saaras appears optimized for multilingual and Indian language scenarios, where performance may differ significantly.
As always, accuracy varies heavily depending on the type of audio being tested.
Why Raw WER Can Be Misleading
A surprising discovery from our analysis:
A large percentage of reported transcription errors are not real content errors.
Example:
- Human transcript: "Yeah, I think so."
- ASR output: "Yeah. I think so."
Raw WER counts this as an error.
But the spoken words are identical.
Across our dataset:
Raw WER averaged 5.5 percentage points higher than N-WER.
That means 30–44% of reported errors were just punctuation differences.
For this reason, all results in this article use N-WER.
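As a concrete illustration, here is a minimal sketch (not our production scorer) of how normalization changes the measured error rate on the example above:

```python
import re

def normalize(text: str) -> list[str]:
    """Strip punctuation and casing before scoring (the 'N' in N-WER)."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate via Levenshtein edit distance over word lists."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref, hyp = "Yeah, I think so.", "Yeah. I think so."
print(wer(ref.split(), hyp.split()))            # raw: "Yeah," != "Yeah." → 0.25
print(wer(normalize(ref), normalize(hyp)))      # normalized: identical → 0.0
```

One punctuation mark inflates raw WER from 0% to 25% on this utterance, even though every spoken word was transcribed correctly.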
Our Approach: AI Proofreading with Gemini
Instead of replacing ASR systems, we added a correction layer on top of them.
A multimodal LLM listens to the audio and compares it against the transcript.
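A minimal sketch of what such a correction layer can look like. The prompt wording is our own illustration, and the commented-out call uses the google-generativeai package with an assumed model name; it is not Scribie's exact pipeline.

```python
def build_proofread_prompt(draft_transcript: str) -> str:
    """Assemble the instruction that pairs the draft transcript with the
    audio so the model fixes only genuine content errors."""
    return (
        "Listen to the attached audio and proofread this transcript. "
        "Fix misheard words, names, and terminology, but do not "
        "rephrase sentences that are already correct.\n\n"
        f"TRANSCRIPT:\n{draft_transcript}"
    )

# Hypothetical live call (requires an API key; model name is an assumption):
# import google.generativeai as genai
# genai.configure(api_key=API_KEY)
# audio = genai.upload_file("deposition.mp3")
# model = genai.GenerativeModel("gemini-1.5-pro")
# corrected = model.generate_content(
#     [audio, build_proofread_prompt(draft)]
# ).text
```

The key design choice is that the LLM receives both the audio and the draft, so it corrects against what was actually said rather than merely rewriting the text.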