Updated 2026-03-06

Transcription Accuracy Leaderboard: ASR Models + Gemini AI Proofreading on Real Audio

We benchmark leading ASR engines (AssemblyAI, Deepgram, WhisperX, Saaras, and more) on live production audio files, then measure how much accuracy improves when Gemini proofreads the raw transcript using custom prompts. Systems are scored by N-WER (normalized word error rate) and DER (diarization error rate); lower is better for both.
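For context on the headline metric: plain WER is the word-level edit distance (substitutions, insertions, and deletions) between a reference and a hypothesis transcript, divided by the reference word count. Below is a minimal Python sketch of that core calculation; the text normalization applied before scoring (casing, punctuation, numerals), which is what the "N" in N-WER refers to, is internal and not shown here.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over word tokens / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # match or substitution
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` returns 1/6 (one deleted word out of six reference words).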

8-File Golden Test Set

Proofreader Configs — sorted by N-WER

| System | N-WER ↓ | DER ↓ | Cost | Notes |
|---|---|---|---|---|
| 🥇 AAI + Gemini Pro (thinking=low) | 6.87% | 11.35% | ~$6 | Best N-WER; 7/8 files, native API |
| AAI + Gemini Flash (thinking=low, t=0) | 7.12% | 6.59% ★ | ~$0.60 | Frozen benchmark (cfe3524) |
| AAI + Gemini Flash (medium thinking) | 7.53% | 24.08% | ~$2 | Higher thinking hurts DER |
| AAI + Gemini Flash (thinking=low, t=1.0) | 8.17% | 7.07% | ~$0.60 | 3-run mean, stdev 0.19pp |

Raw ASR Systems — no proofreading

| System | Avg N-WER ↓ | DER ↓ | Notes |
|---|---|---|---|
| Deepgram Nova-3 | 15.26% | 14.52% | Commercial API |
| Saaras v3 | 17.64% | — | Commercial API |
| VibeVoice-ASR (7B) | 22.52% | — | Open-source; 1 outlier file (71.95%) |

9-File Benchmark Test Set

Nine diverse files were tested across 6 system configurations: 2 raw baselines (AAI and WhisperX (pyannote finetuned)) and 4 proofreader combinations (model × CTM source).

WhisperX (pyannote finetuned): WhisperX pipeline (Whisper large-v3 + wav2vec2 alignment) with the pyannote diarization model finetuned on 40 hours of production audio. The Whisper ASR model itself is stock large-v3.

DER note: AAI-based systems use AAI CTM for alignment; WhisperX-based systems use WhisperX (pyannote finetuned) CTM. Metrics are not directly comparable across CTM sources for DER (different diarization backends).
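For readers unfamiliar with CTM: a CTM file lists one recognized word per line with its start time and duration, which is what the proofreader and the DER scorer align against. Here is a minimal parser sketch, assuming the standard NIST CTM field layout (recording id, channel, start, duration, word, optional confidence); the exact schema of the AAI and WhisperX CTMs used in this benchmark is not shown in this post.

```python
from typing import NamedTuple


class CtmWord(NamedTuple):
    recording: str   # recording/file identifier
    start: float     # seconds from the start of the recording
    duration: float  # seconds
    word: str


def parse_ctm(lines):
    """Parse NIST-style CTM lines: <recording-id> <channel> <start> <dur> <word> [conf].

    Blank lines and ';;' comment lines are skipped.
    """
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        rec, _chan, start, dur, word = line.split()[:5]
        words.append(CtmWord(rec, float(start), float(dur), word))
    return words
```

With word-level timestamps in hand, each transcript word can be attributed to a diarization segment, which is how speaker labels (and thus DER) end up tied to the CTM source.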

Rankings — 9-file averages

| Rank | System | N-WER ↓ | DER ↓ |
|---|---|---|---|
| 🥇 | WhisperX (pyannote finetuned) + Gemini Pro | 6.20% ★ | 7.72% |
| 🥈 | AAI + Gemini Flash (prompt-v6) | 6.49% | 6.06% ★ |
| 🥉 | AAI + Gemini Pro | 7.06% | 8.89% |
| 4 | WhisperX (pyannote finetuned) Raw (no proofread) | 8.54% | 9.05% |
| 5 | AAI Raw (no proofread) | 9.43% | 10.63% |
| 5 | WhisperX (pyannote finetuned) + Gemini Flash (prompt-v4) | 9.43% | 12.39% |
| 7 | Deepgram Nova-3 (no proofread) | 10.74% | 12.54% |
| 8 | Saaras v3 (no proofread) | 16.87% | — |

Key Observations

1. Pro + WxFT wins N-WER: at 6.20%, WhisperX (pyannote finetuned), abbreviated WxFT below, plus Gemini Pro is the best of all six systems. Gemini Pro's stronger language model combined with finetuned pyannote diarization is the winning combination for word accuracy.

2. Flash + AAI wins DER: at 6.06%, the current prompt with AAI CTM gives the best speaker accuracy.

3. Flash + WxFT is surprisingly bad: 9.43% N-WER, 12.39% DER. prompt-v4 was developed against AAI CTM and doesn't handle WhisperX's different speaker-label format well.

4. Pro handles WxFT much better than Flash: Pro adapts to different CTM formats; Flash does not. Pro + WxFT achieves 6.20% / 7.72% versus Flash + WxFT's 9.43% / 12.39%.

5. 7-speaker file edge case: on a file with seven speakers, Flash + WxFT DER explodes to 29.66% while Flash + AAI holds at 9.79%. Pro + WxFT handles it fine at 5.89%.

6. WhisperX (pyannote finetuned) raw outperforms AAI raw: 8.54% / 9.05% versus 9.43% / 10.63%, a win on both metrics that confirms the value of pyannote finetuning.

Prompt Versions

| Version | Description |
|---|---|
| prompt-v4 | Production baseline proofreader prompt. |
| prompt-v6 | Latest iteration; addresses limitations in prompt-v4. |

Ready for accurate, AI-assisted transcription?

Our pipeline combines the best-performing ASR and proofreader configs to deliver industry-leading accuracy at scale.