We benchmark leading ASR engines (AssemblyAI, Deepgram, WhisperX, Saaras, and more) on live production audio files, then measure how much accuracy improves when Gemini proofreads the raw transcript using custom prompts. Accuracy is scored by N-WER (normalized word error rate) and DER (diarization error rate).
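For reference, N-WER is a word-level edit distance computed over normalized tokens, divided by the reference length. A minimal sketch of such a scorer; the benchmark's exact normalization rules are not specified here, so the `normalize` below (lowercasing, stripping punctuation) is an assumption:

```python
import re

def normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase, keep only word characters and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def n_wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over normalized tokens,
    divided by the number of reference tokens."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    # Rolling one-row dynamic-programming edit distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# e.g. n_wer("hello world", "hello word") -> 0.5 (1 substitution / 2 ref words)
```

Production scorers typically add number/abbreviation canonicalization before counting errors; the skeleton above is the common core.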
| System | N-WER ↓ | DER ↓ | Cost | Notes |
|---|---|---|---|---|
| 🥇 AAI + Gemini Pro (thinking=low) | 6.87% | 11.35% | ~$6 | Best N-WER; 7/8 files, native API |
| AAI + Gemini Flash (thinking=low, t=0) | 7.12% | 6.59%★ | ~$0.60 | Frozen benchmark (cfe3524) |
| AAI + Gemini Flash (medium thinking) | 7.53% | 24.08% | ~$2 | Higher thinking hurts DER |
| AAI + Gemini Flash (thinking=low, t=1.0) | 8.17% | 7.07% | ~$0.60 | 3-run mean, stdev 0.19pp |
| System | Avg N-WER ↓ | DER ↓ | Notes |
|---|---|---|---|
| Deepgram Nova-3 | 15.26% | 14.52% | Commercial API |
| Saaras v3 | 17.64% | — | Commercial API |
| VibeVoice-ASR (7B) | 22.52% | — | Open-source, 1 outlier file (71.95%) |
Nine diverse files were tested across 6 system configurations: 2 raw baselines, AAI and WhisperX (pyannote finetuned), and 4 proofreader combos (model × CTM source).
WhisperX (pyannote finetuned): WhisperX pipeline (Whisper large-v3 + wav2vec2 alignment) with the pyannote diarization model finetuned on 40 hours of production audio. The Whisper ASR model itself is stock large-v3.
DER note: AAI-based systems use AAI CTM for alignment; WhisperX-based systems use WhisperX (pyannote finetuned) CTM. Metrics are not directly comparable across CTM sources for DER (different diarization backends).
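DER sums missed speech, false-alarm speech, and speaker confusion over total reference speech time. A minimal frame-based sketch, assuming hypothesis speaker labels are already mapped to reference labels; real scorers (e.g. pyannote.metrics) compute an optimal one-to-one speaker mapping and apply a forgiveness collar, neither of which is shown here:

```python
def frame_labels(segments, n_frames, step):
    """Rasterize (start, end, speaker) segments onto fixed-step frames."""
    labels = [None] * n_frames
    for start, end, spk in segments:
        lo, hi = int(round(start / step)), min(int(round(end / step)), n_frames)
        for i in range(lo, hi):
            labels[i] = spk
    return labels

def der(reference, hypothesis, step=0.01):
    """DER = (missed + false alarm + confusion) / reference speech time."""
    total = max(end for _, end, _ in reference + hypothesis)
    n = int(round(total / step))
    ref = frame_labels(reference, n, step)
    hyp = frame_labels(hypothesis, n, step)
    missed = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    false_alarm = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    confusion = sum(1 for r, h in zip(ref, hyp)
                    if r is not None and h is not None and r != h)
    ref_speech = sum(1 for r in ref if r is not None)
    return (missed + false_alarm + confusion) / ref_speech
```

This also makes the CTM caveat concrete: the hypothesis segments come from whichever diarization backend produced the CTM, so DER numbers are only comparable within one CTM source.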
| Rank | System | N-WER ↓ | DER ↓ |
|---|---|---|---|
| 🥇 | WhisperX (pyannote finetuned) + Gemini Pro | 6.20%★ | 7.72% |
| 🥈 | AAI + Gemini Flash (prompt-v6) | 6.49% | 6.06%★ |
| 🥉 | AAI + Gemini Pro | 7.06% | 8.89% |
| 4 | WhisperX (pyannote finetuned) Raw (no proofread) | 8.54% | 9.05% |
| 5 | AAI Raw (no proofread) | 9.43% | 10.63% |
| 5 | WhisperX (pyannote finetuned) + Gemini Flash (prompt-v4) | 9.43% | 12.39% |
| 7 | Deepgram Nova-3 (no proofread) | 10.74% | 12.54% |
| 8 | Saaras v3 (no proofread) | 16.87% | — |
- WhisperX (pyannote finetuned) + Gemini Pro: 6.20% N-WER, the best of all 6 systems. Gemini Pro's stronger language model combined with finetuned pyannote diarization is the best combination for word accuracy.
- AAI + Gemini Flash (prompt-v6): 6.06% DER, the best speaker accuracy; the current prompt paired with AAI CTM.
- WhisperX (pyannote finetuned) + Gemini Flash (prompt-v4): 9.43% N-WER, 12.39% DER. prompt-v4 was developed against AAI CTM and does not handle WhisperX's different speaker label format well.
- Pro adapts to different CTM formats; Flash does not. Pro+WxFT achieves 6.20% / 7.72% vs Flash+WxFT's 9.43% / 12.39%.
- On the worst case, Flash+WxFT DER explodes to 29.66% while Flash+AAI holds at 9.79%; Pro+WxFT stays at 5.89%.
- Raw WhisperX (pyannote finetuned) beats raw AAI on both metrics (8.54% / 9.05% vs 9.43% / 10.63%), confirming the value of pyannote finetuning.
| Version | Description |
|---|---|
| prompt-v4 | Production baseline proofreader prompt. |
| prompt-v6 | Latest iteration, addresses limitations in prompt-v4. |
Our pipeline combines the best-performing ASR and proofreader configurations from these benchmarks to deliver this accuracy at production scale.