Improved Automated Transcripts

We have released our latest speech and language models. This release includes the following new features:

Acoustic Model: This is our fourth acoustic model trained on our own data. The training dataset consisted mostly of accented speakers (e.g., Indian, African, and Irish accents) and also included some noisy files. The accuracy of automated transcripts on accented files should therefore be better now.

Language Model: We have added more data to our language model and doubled its size. The model has now been trained on around 46 million lines of text, which has improved the WER by around 2%.
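
For reference, WER (word error rate) is the word-level edit distance between the reference transcript and the hypothesis, normalized by the reference length; CER (character error rate) is the same measure at the character level. The sketch below is a generic illustration of how these metrics are computed, not our internal evaluation code:

```python
# Generic WER/CER sketch: Levenshtein edit distance between reference
# and hypothesis, normalized by reference length.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (0 if tokens match)
            )
    return dp[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat on the mat", "the cat on the mat"))
# 0.1667 -- one deleted word out of six
```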

Punctuation: The biggest feature of this release is expanded punctuation. We now support all punctuation marks, including quotes and hyphens. To our knowledge, no other service, including Google Web Speech, AWS Transcribe, and Speechmatics, supports quotes.
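
For illustration, here is a constructed before-and-after example (not actual model output) showing the kind of difference expanded punctuation makes:

```
before: she said well lets try the state of the art model
after:  She said, "Well, let's try the state-of-the-art model."
```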

Speaker Turns: We have also updated our speaker turns model, which is now around 80% accurate on long paragraphs, so automated transcripts will be better segmented. We are currently working on adding speaker diarization to the automated transcript, and it should be out soon. We handle speaker turns a bit differently and do not require the number of speakers as an input, which is another of our unique features: Google Web Speech does not support multi-speaker files, and AWS Transcribe and Speechmatics both require the number of speakers as an input for diarization.
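
To make the idea concrete, here is a minimal sketch of threshold-based turn segmentation that needs no speaker count up front. It assumes a model emits a speaker-change probability at each word boundary; the function name, probabilities, and threshold below are hypothetical illustrations, not our production code:

```python
# Hypothetical sketch: split a word stream into turns wherever the
# speaker-change probability at a word boundary crosses a threshold.
# No number of speakers is needed as input.

def split_into_turns(words, change_probs, threshold=0.5):
    """Group words into turns, starting a new turn at each likely change."""
    assert len(change_probs) == len(words) - 1
    turns, current = [], [words[0]]
    for word, p in zip(words[1:], change_probs):
        if p >= threshold:
            turns.append(" ".join(current))
            current = []
        current.append(word)
    turns.append(" ".join(current))
    return turns

words = "hello there hi how are you".split()
probs = [0.1, 0.9, 0.2, 0.1, 0.1]  # boundary after "there" looks like a turn
print(split_into_turns(words, probs))
# ['hello there', 'hi how are you']
```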

This release also fixes the issue of missing predictions, where some words, especially those near speaker turns, were not being transcribed. Automated transcripts should now capture all utterances except filler words. We also benchmarked our models on LibriSpeech Clean and our internal dataset. The following are our numbers:

Dataset             Type             WER      CER
LibriSpeech Clean   Read speech      14.53%   5.85%
Scribie Internal    Conversational   16.33%   8.82%

For comparison, the PaddlePaddle numbers on the same datasets are the following:

Dataset             Type             WER      CER
LibriSpeech Clean   Read speech      5.4%     1.9%
Scribie Internal    Conversational   28.34%   17.71%

As the tables show, our models outperform PaddlePaddle by a wide margin on conversational audio, a roughly 42% relative reduction in WER (16.33% vs. 28.34%). We are working on improving our models for non-conversational audio as well. Our ASR is a DeepSpeech-based system, so a comparison with PaddlePaddle's DeepSpeech implementation is a natural benchmark for us. The Continual Learning blog post has more details on how we trained our DeepSpeech models.

Automated transcripts are currently free, so try them out today!
