
Real-World Speech-to-Text Accuracy: Benchmarking AssemblyAI, Deepgram, WhisperX & Saaras on Production Audio


[Figure: speech-to-text models benchmarking overview]

Every time a new AI speech-to-text model launches, we see impressive benchmark numbers.

Gemini, GPT-4o, Whisper, Nova — and recently Saaras — all report strong results on datasets like LibriSpeech and Common Voice.

But there’s a problem.

Those benchmarks don’t represent real production audio.

At Scribie, we run a professional transcription service. The files we process daily include legal depositions, interviews, podcasts, and multi-speaker calls — exactly the type of recordings where speech recognition systems struggle.

Instead of relying on academic benchmarks, we decided to test the leading speech-to-text systems on real production transcription audio.

We benchmarked:

  • WhisperX
  • AssemblyAI
  • Deepgram
  • Saaras
  • VibeVoice-ASR

We also tested how much accuracy improves when the transcripts are proofread using a multimodal LLM (Google Gemini).

Full results are published on our live leaderboard:

https://scribie.com/leaderboard

Why Academic ASR Benchmarks Don’t Reflect Production Reality

Most speech recognition benchmarks rely on clean, scripted speech recorded in controlled environments.

Typical conditions include:

  • Single speaker
  • Studio-quality recording
  • Minimal background noise
  • Carefully spoken sentences

Real-world transcription looks very different.

Our benchmark dataset includes:

  • Legal depositions with rapid attorney-witness exchanges
  • Multi-speaker calls with 2–7 participants
  • Zoom recordings with compression artifacts
  • Phone recordings with inconsistent audio quality
  • Long interviews lasting 60–90 minutes
  • Specialized terminology and proper nouns

When models reporting sub-5% WER on LibriSpeech are tested on this type of audio, the numbers change significantly.

Baseline Speech-to-Text Accuracy on Production Audio

We evaluated five systems on a 26-file dataset drawn from real transcription jobs.

All metrics use N-WER (Normalized Word Error Rate), which removes punctuation and casing differences so we measure true content errors only.
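The idea behind N-WER can be sketched in a few lines of Python. This is a minimal illustration, not our exact scoring pipeline: it strips punctuation and casing, then runs a standard word-level edit distance.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def wer(ref_words: list[str], hyp_words: list[str]) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    divided by the number of reference words."""
    d = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp_words, 1):
            cur = min(d[j] + 1,          # delete a reference word
                      d[j - 1] + 1,      # insert a hypothesis word
                      prev + (r != h))   # substitute, or match for free
            prev, d[j] = d[j], cur
    return d[-1] / len(ref_words)


def n_wer(reference: str, hypothesis: str) -> float:
    """WER after normalizing away punctuation and casing."""
    return wer(normalize(reference), normalize(hypothesis))
```

With normalization applied, two transcripts that differ only in punctuation score 0.0; raw WER on the same pair would not.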

System                           Type            N-WER    Notes
WhisperX (large-v3 + pyannote)   Open source     12.81%   Whisper + wav2vec alignment + diarization
AssemblyAI Universal-3           Commercial API  15.13%   Word timestamps + speaker labels
Deepgram Nova-3                  Commercial API  15.62%   Strong commercial alternative
Saaras                           Commercial API  17.23%   Files >1h chunked
VibeVoice-ASR (7B)               Open source     21.72%   Microsoft multimodal model

Even the best system is getting roughly 1 in 8 words wrong.

On a 60-minute recording (~9,000 words) that translates to over 1,000 corrections for a human proofreader.
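The arithmetic behind that estimate is simple. The 150 words-per-minute figure below is an assumed conversational speech rate, chosen to match the article's ~9,000-word estimate:

```python
# Rough proofreading workload implied by a word error rate.
words_per_minute = 150            # assumed conversational speech rate
duration_min = 60
n_wer = 0.1281                    # WhisperX, best system in the benchmark

total_words = words_per_minute * duration_min     # 9,000 words
expected_errors = round(total_words * n_wer)      # ~1,153 corrections
print(total_words, expected_errors)
```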

But word accuracy is only part of the challenge.

Speaker Diarization Is Even Harder

Professional transcripts must identify who said what.

We measure this with DER (Diarization Error Rate).

System       N-WER    DER
WhisperX     12.81%   26.31%
AssemblyAI   15.13%   26.61%
Deepgram     15.62%   27.23%

Across every system we tested, speaker attribution errors were significantly worse than word errors.

Speaker diarization remains one of the most difficult problems in speech recognition.
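What makes DER tricky is that diarization labels are arbitrary: a system that calls the attorney "Speaker 2" and the witness "Speaker 1" is still correct. Scoring therefore has to find the best mapping between hypothesis and reference speakers before counting errors. A simplified frame-level sketch (real DER scoring also handles overlapping speech, missed/false speech, and a forgiveness collar, and uses the Hungarian algorithm rather than brute force):

```python
from itertools import permutations


def toy_der(reference: list[str], hypothesis: list[str]) -> float:
    """Fraction of frames with the wrong speaker, under the best
    hypothesis->reference label mapping. Assumes equal-length frame
    sequences, no overlapping speech, and no more hypothesis speakers
    than reference speakers."""
    assert len(reference) == len(hypothesis)
    ref_speakers = sorted(set(reference))
    hyp_speakers = sorted(set(hypothesis))
    best_errors = len(reference)
    # Brute-force the optimal label mapping; fine for a handful of
    # speakers, but production scorers use the Hungarian algorithm.
    for perm in permutations(ref_speakers, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        errors = sum(mapping[h] != r
                     for r, h in zip(reference, hypothesis))
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)
```

For example, if the hypothesis swaps its two labels consistently, the error is zero; only genuinely misattributed frames count.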

Saaras: The New Entrant

Saaras is a recently released speech-to-text API designed for multilingual and Indian-language transcription.

Because the system has been gaining traction, we included it in our benchmark.

System       N-WER
WhisperX     12.81%
AssemblyAI   15.13%
Deepgram     15.62%
Saaras       17.23%

On our dataset — primarily English legal and interview recordings — Saaras performed slightly behind the leading commercial APIs.

However, Saaras appears optimized for multilingual and Indian language scenarios, where performance may differ significantly.

As always, accuracy varies heavily depending on the type of audio being tested.

Why Raw WER Can Be Misleading

A surprising discovery from our analysis:

A large percentage of reported transcription errors are not real content errors.

Example:

Human transcript

Yeah, I think so.

ASR output

Yeah. I think so.

Raw WER counts this as an error.

But the spoken words are identical.

Across our dataset:

Raw WER averaged 5.5 percentage points higher than N-WER.

That means 30–44% of reported errors were just punctuation differences.

For this reason, all results in this article use N-WER.

Our Approach: AI Proofreading with Gemini

Instead of replacing ASR systems, we added a correction layer on top of them.

A multimodal LLM listens to the audio and compares it against the transcript.
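The correction-layer idea can be sketched with the google-generativeai SDK. The model name, prompt wording, and file paths below are illustrative assumptions, not our production pipeline:

```python
def build_prompt(transcript: str) -> str:
    """Instructions for the proofreading pass (illustrative wording)."""
    return (
        "You are proofreading an ASR transcript against the attached audio.\n"
        "Fix misheard words, names, and terminology, but keep the original\n"
        "wording wherever the transcript already matches the audio.\n\n"
        f"Transcript:\n{transcript}"
    )


def proofread(audio_path: str, transcript: str) -> str:
    """Send the audio plus draft transcript to a multimodal model."""
    import google.generativeai as genai  # pip install google-generativeai
    model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
    audio = genai.upload_file(audio_path)            # multimodal input
    response = model.generate_content([build_prompt(transcript), audio])
    return response.text


if __name__ == "__main__":
    import google.generativeai as genai
    genai.configure(api_key="YOUR_API_KEY")          # requires a real key
    print(proofread("call.mp3", "Yeah. I think so."))
```

Because the model can hear the recording, it can catch misrecognized proper nouns and terminology that text-only post-processing would miss.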


Tags

accuracy, ASR, AssemblyAI, benchmarks, Deepgram, DER, Gemini, Saaras, speech-to-text, WER, WhisperX