We benchmark leading ASR engines and audio proofreading models on real-world conversational and legal transcription data. Models are scored by N-WER (word accuracy) and cpWER (word accuracy plus speaker attribution).
| Model | Avg N-WER ↓ | RTFx ↑ | License | Golden (8) | Bench (7) | E2wK |
|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview 🥇 Best | 6.63% | ~1x | Proprietary | 7.37% | 6.82% | 5.71% |
| google/gemma-4-E4B-it | 10.43% | 6x | Open | 10.49% | 11.11% | 9.48% |
| assemblyai/universal (raw) | 11.44% | — | Proprietary | 11.44% | 11.70% | — |
| CohereLabs/cohere-transcribe-03-2026 | 19.59% | 35x | Open | 19.59% | — | — |
| XiaomiMiMo/MiMo-Audio-7B-Instruct | 42.43% | 10x | Open | 42.43% | — | — |
cpWER measures word accuracy and speaker attribution jointly; lower is better. Cohere is excluded because it does not produce speaker-diarized output.
| Model | Avg cpWER ↓ | License | Golden (8) | Bench (7) | E2wK |
|---|---|---|---|---|---|
| google/gemma-4-E4B-it 🥇 Best | 12.54% | Open | 12.54% | 14.91% | 53.55% |
| assemblyai/universal (raw) | 12.95% | Proprietary | 12.95% | — | — |
| google/gemini-3-flash-preview | 16.63% | Proprietary | 16.63% | 14.97% | ~29.0% |
| XiaomiMiMo/MiMo-Audio-7B-Instruct | 33.74% | Open | 33.74% | — | — |
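The "optimal permutation" behind cpWER means hypothesis speakers are matched to reference speakers in whichever assignment minimizes total word errors, so a correct transcript with swapped speaker labels is not penalized. A minimal brute-force sketch (illustrative only; production scorers typically use a Hungarian-style assignment rather than trying every permutation):

```python
from itertools import permutations

def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance (substitutions + deletions + insertions)."""
    d = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[-1]

def cpwer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    """Minimum total WER over all pairings of hypothesis to reference speakers."""
    refs = [t.lower().split() for t in ref_by_spk.values()]
    hyps = [t.lower().split() for t in hyp_by_spk.values()]
    total_ref = sum(len(words) for words in refs)
    # Pad the smaller side with empty "speakers" so every pairing lines up.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    best = min(
        sum(word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(total_ref, 1)
```

With S speakers this tries S! pairings, so it is only practical for small speaker counts; a stress case like the 28-speaker E2wK file needs the assignment-based formulation.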
Per-file N-WER (%) on the Golden Set. Column headers are abbreviated file IDs; see the file list below.
| Model | fPTn | de21 | rx7S | I0Tb | THnj | yeF0 | 5qLa | ag8P | AVG |
|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 5.48 | 4.37 | 6.55 | 7.34 | 7.50 | 10.31 | 8.98 | 8.42 | 7.37 |
| google/gemma-4-E4B-it | 6.76 | 5.35 | 6.76 | 13.70 | 13.08 | 8.46 | 15.77 | 14.06 | 10.49 |
| assemblyai/universal (raw) | 6.57 | 6.59 | 7.53 | 15.37 | 15.91 | 7.66 | 17.04 | 14.87 | 11.44 |
| CohereLabs/cohere-transcribe | 12.10 | 10.53 | 12.92 | 15.26 | 15.87 | 17.13 | 18.16 | 54.74 | 19.59 |
| XiaomiMiMo/MiMo-Audio-7B | 39.65 | 6.44 | 75.18 | 50.31 | 36.55 | 25.20 | 57.62 | 48.52 | 42.43 |
The Golden Set is our primary, hardest benchmark — legal depositions, medical interviews, and files with up to 6 speakers. This is the set we use for final model evaluation.
The Benchmark Test Set is broader and more representative of typical volume — mostly general conversational content with 2–3 speakers.
The E2wK Stress Test is a single 4-hour board meeting with 28 speakers — testing extreme multi-speaker handling.
All sets are strictly held out from training, tuning, and prompt development. The holdout is enforced programmatically.
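Programmatic enforcement can be as simple as a guard that every data-loading path calls before a file enters a training, tuning, or prompt-development batch. A minimal sketch (illustrative only; the real mechanism is internal, and the ID set shown is a two-entry sample, not the full held-out list):

```python
# Illustrative guard; the actual held-out ID list is maintained elsewhere.
HELD_OUT_IDS = {
    "fPTnbIxCoXrz",  # Golden Set: legal deposition
    "de217fa950c3",  # Golden Set: pediatric healthcare interview
}

def assert_not_held_out(file_id: str) -> None:
    """Fail fast if a training or tuning batch touches an evaluation file."""
    if file_id in HELD_OUT_IDS:
        raise ValueError(f"{file_id} is in the held-out evaluation set")
```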
| ID | Duration | Speakers | Domain | Content |
|---|---|---|---|---|
| fPTnbIxCoXrz | 44 min | 3 | Legal | Deposition — attorney, witness, court reporter |
| de217fa950c3 | 32 min | 2 | Healthcare | Pediatric healthcare needs interview |
| rx7SAmlqi2ZB | 35 min | 3 | General | Multi-party conversation |
| I0Tb5h9VWliu | 44 min | 2 | General | Two-speaker interview |
| THnjjF5Sy991 | 172 min | 6 | Legal | Long deposition — multiple attorneys, rapid Q&A |
| yeF0szcOajip | 17 min | 4 | General | Short multi-speaker discussion |
| 5qLa4TMcApPj | 57 min | 2 | General | Extended two-party conversation |
| ag8PupPUuwvj | 115 min | 3 | General | Long recording, similar-sounding speakers |
| Model | Type | Size | Audio Limit |
|---|---|---|---|
| google/gemini-3-flash-preview | Proofreader | Undisclosed | Long |
| google/gemma-4-E4B-it | Proofreader | 8B (4.5B eff) | 30 sec |
| assemblyai/universal | ASR | Undisclosed | Long |
| CohereLabs/cohere-transcribe | ASR | 2B | 35 sec |
| XiaomiMiMo/MiMo-Audio-7B | Proofreader | 8.2B | ~3 min |
ASR: audio → text. Pure transcription, with no correction against an existing transcript.
Proofreader: audio + ASR transcript → corrected transcript. The model listens to the audio and fixes the ASR errors in the draft.
| Metric | What it measures | Better |
|---|---|---|
| N-WER | Word accuracy (normalized: lowercased, punctuation stripped) | Lower ↓ |
| cpWER | Word accuracy + speaker attribution (optimal speaker permutation) | Lower ↓ |
| RTFx | Processing speed (minutes of audio per minute of wall clock) | Higher ↑ |
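N-WER is a few lines of dynamic programming once the normalization described above (lowercase, punctuation stripped) is applied to both sides. A minimal sketch:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation before comparing, per N-WER."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    r, h = normalize(ref), normalize(hyp)
    d = list(range(len(h) + 1))  # DP row of edit distances
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[-1] / max(len(r), 1)
```

RTFx needs no code: it is simply minutes of audio divided by minutes of wall-clock processing time.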
Our pipeline combines the best-performing ASR and proofreader models to deliver industry-leading accuracy at scale.
Maintained by the Superproofer team at Scribie. All evaluation files are held out and never used for training.