We benchmark leading ASR engines and audio proofreading models on real-world conversational and legal transcription data. Models are scored by N-WER (word accuracy) and cpWER (word accuracy plus speaker attribution).
| Model | Avg N-WER ↓ | RTFx ↑ | License | Golden (8) | Bench (7) | E2wK |
|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview 🥇 Best | 6.63% | ~1x | Proprietary | 7.37% | 6.82% | 5.71% |
| google/gemma-4-E4B-it | 10.43% | 6x | Open | 10.49% | 11.11% | 9.48% |
| assemblyai/universal (raw) | 11.44% | — | Proprietary | 11.44% | 11.70% | — |
| CohereLabs/cohere-transcribe-03-2026 | 19.59% | 35x | Open | 19.59% | — | — |
| XiaomiMiMo/MiMo-Audio-7B-Instruct | 42.43% | 10x | Open | 42.43% | — | — |
cpWER measures word accuracy and speaker attribution jointly; lower is better. Cohere is excluded because it does not produce speaker-diarized output.
| Model | Avg cpWER ↓ | License | Golden (8) | Bench (7) | E2wK |
|---|---|---|---|---|---|
| google/gemma-4-E4B-it 🥇 Best | 12.54% | Open | 12.54% | 14.91% | 53.55% |
| assemblyai/universal (raw) | 12.95% | Proprietary | 12.95% | — | — |
| google/gemini-3-flash-preview | 16.63% | Proprietary | 16.63% | 14.97% | ~29.0% |
| XiaomiMiMo/MiMo-Audio-7B-Instruct | 33.74% | Open | 33.74% | — | — |
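The "optimal permutation" behind cpWER means hypothesis speakers are matched to reference speakers in whichever assignment minimizes total word errors, so a correct transcript with swapped speaker labels is not penalized. A minimal brute-force sketch (illustrative only; production scorers typically use a Hungarian-style assignment rather than trying every permutation):

```python
from itertools import permutations

def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance (substitutions + deletions + insertions)."""
    d = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[-1]

def cpwer(ref_by_spk: dict[str, str], hyp_by_spk: dict[str, str]) -> float:
    """Minimum total WER over all pairings of hypothesis to reference speakers."""
    refs = [t.lower().split() for t in ref_by_spk.values()]
    hyps = [t.lower().split() for t in hyp_by_spk.values()]
    total_ref = sum(len(words) for words in refs)
    # Pad the smaller side with empty "speakers" so every pairing lines up.
    n = max(len(refs), len(hyps))
    refs += [[]] * (n - len(refs))
    hyps += [[]] * (n - len(hyps))
    best = min(
        sum(word_errors(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(total_ref, 1)
```

With S speakers this tries S! pairings, so it is only practical for small speaker counts; a stress case like the 28-speaker E2wK file needs the assignment-based formulation.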
Per-file N-WER (%) on the Golden Set. Column headers are abbreviated file IDs; see the file list below.
| Model | fPTn | de21 | rx7S | I0Tb | THnj | yeF0 | 5qLa | ag8P | AVG |
|---|---|---|---|---|---|---|---|---|---|
| google/gemini-3-flash-preview | 5.48 | 4.37 | 6.55 | 7.34 | 7.50 | 10.31 | 8.98 | 8.42 | 7.37 |
| google/gemma-4-E4B-it | 6.76 | 5.35 | 6.76 | 13.70 | 13.08 | 8.46 | 15.77 | 14.06 | 10.49 |
| assemblyai/universal (raw) | 6.57 | 6.59 | 7.53 | 15.37 | 15.91 | 7.66 | 17.04 | 14.87 | 11.44 |
| CohereLabs/cohere-transcribe | 12.10 | 10.53 | 12.92 | 15.26 | 15.87 | 17.13 | 18.16 | 54.74 | 19.59 |
| XiaomiMiMo/MiMo-Audio-7B | 39.65 | 6.44 | 75.18 | 50.31 | 36.55 | 25.20 | 57.62 | 48.52 | 42.43 |
The Golden Set is our primary, hardest benchmark — legal depositions, medical interviews, and files with up to 6 speakers. This is the set we use for final model evaluation.
The Benchmark Test Set is broader and more representative of typical volume — mostly general conversational content with 2–3 speakers.
The E2wK Stress Test is a single 4-hour board meeting with 28 speakers — testing extreme multi-speaker handling.
All sets are strictly held out from training, tuning, and prompt development. The holdout is enforced programmatically.
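Programmatic enforcement can be as simple as a guard that every data-loading path calls before a file enters a training, tuning, or prompt-development batch. A minimal sketch (illustrative only; the real mechanism is internal, and the ID set shown is a two-entry sample, not the full held-out list):

```python
# Illustrative guard; the actual held-out ID list is maintained elsewhere.
HELD_OUT_IDS = {
    "fPTnbIxCoXrz",  # Golden Set: legal deposition
    "de217fa950c3",  # Golden Set: pediatric healthcare interview
}

def assert_not_held_out(file_id: str) -> None:
    """Fail fast if a training or tuning batch touches an evaluation file."""
    if file_id in HELD_OUT_IDS:
        raise ValueError(f"{file_id} is in the held-out evaluation set")
```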
| ID | Duration | Speakers | Domain | Content |
|---|---|---|---|---|
| fPTnbIxCoXrz | 44 min | 3 | Legal | Deposition — attorney, witness, court reporter |
| de217fa950c3 | 32 min | 2 | Healthcare | Pediatric healthcare needs interview |
| rx7SAmlqi2ZB | 35 min | 3 | General | Multi-party conversation |
| I0Tb5h9VWliu | 44 min | 2 | General | Two-speaker interview |
| THnjjF5Sy991 | 172 min | 6 | Legal | Long deposition — multiple attorneys, rapid Q&A |
| yeF0szcOajip | 17 min | 4 | General | Short multi-speaker discussion |
| 5qLa4TMcApPj | 57 min | 2 | General | Extended two-party conversation |
| ag8PupPUuwvj | 115 min | 3 | General | Long recording, similar-sounding speakers |
| Model | Type | Size | Audio Limit |
|---|---|---|---|
| google/gemini-3-flash-preview | Proofreader | Undisclosed | Long |
| google/gemma-4-E4B-it | Proofreader | 8B (4.5B eff) | 30 sec |
| assemblyai/universal | ASR | Undisclosed | Long |
| CohereLabs/cohere-transcribe | ASR | 2B | 35 sec |
| XiaomiMiMo/MiMo-Audio-7B | Proofreader | 8.2B | ~3 min |
ASR: audio → text. Pure transcription, with no correction against an existing transcript.
Proofreader: audio + ASR transcript → corrected transcript. The model listens to the audio and fixes the ASR errors in the draft.
| Metric | What it measures | Better |
|---|---|---|
| N-WER | Word accuracy (normalized: lowercased, punctuation stripped) | Lower ↓ |
| cpWER | Word accuracy + speaker attribution (optimal speaker permutation) | Lower ↓ |
| RTFx | Processing speed (minutes of audio per minute of wall clock) | Higher ↑ |
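N-WER is a few lines of dynamic programming once the normalization described above (lowercase, punctuation stripped) is applied to both sides. A minimal sketch:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation before comparing, per N-WER."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    r, h = normalize(ref), normalize(hyp)
    d = list(range(len(h) + 1))  # DP row of edit distances
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[-1] / max(len(r), 1)
```

RTFx needs no code: it is simply minutes of audio divided by minutes of wall-clock processing time.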
Our pipeline combines the best-performing ASR and proofreader models to deliver industry-leading accuracy at scale.
Maintained by the Superproofer team at Scribie. All evaluation files are held out and never used for training.