Continual Learning for Speech-to-Text

Flawless transcripts and fast turnaround times are the hallmarks of Scribie. Our transcripts are not only highly accurate but also reasonably priced. Have you ever wondered what makes that possible? The answer lies in constantly improving our speech-to-text engine, which assists our transcribers. We provide automatic word completion to our transcribers, and the better those autocompletes are, the less they have to type.

Our speech recognition engine is a Deep Learning system. For the uninitiated, Deep Learning is a subdomain of Machine Learning. It makes use of Artificial Neural Networks that, in a way, mimic the structure and function of the human brain. The engine is based on the DeepSpeech 2 network from Baidu and is written in PyTorch.

Scribie has a large dataset of audio and transcripts: over 100,000 hours at the last count. Training Deep Learning models on such a large dataset is very expensive in practice, as it requires a large number of GPUs and SSDs. For comparison, Baidu trained their models with 256 GPUs on custom hardware when they developed the DeepSpeech architecture. We don’t have the time or money to do that, so we developed an approach that we call Continual Learning.

Continual Learning

We first built and trained a large model with a 3,000-hour dataset. That took around three weeks on our rig. Since then, every month we have built a ‘corrections dataset’ of around 1,000 hours. This corrections dataset is made up of the previous model’s predictions that were wrong and were then manually corrected by our transcribers. In each iteration we remove an equal amount of data from the previous training set, add the corrections dataset in its place, and fine-tune the model on the newly combined data. This way the model keeps improving over time while the training set stays the same size.
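The data rotation step described above can be sketched in a few lines of Python. This is only an illustration, not our production code: the `Utterance` record, its field names, and the `rotate_training_set` helper are hypothetical, and dropping old utterances at random is an assumption, since any rule that removes a matching number of hours would fit the description.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One audio/transcript pair. The field names are hypothetical."""
    audio_path: str
    transcript: str
    hours: float


def total_hours(utterances):
    """Sum the audio duration, in hours, of a collection of utterances."""
    return sum(u.hours for u in utterances)


def rotate_training_set(previous, corrections, seed=0):
    """Return the training set for the next fine-tuning iteration.

    Drops roughly as many hours from `previous` as `corrections` adds,
    then appends the corrections. Choosing the dropped utterances at
    random is an assumption made for this sketch.
    """
    target = total_hours(corrections)
    shuffled = list(previous)
    random.Random(seed).shuffle(shuffled)

    removed_hours = 0.0
    kept = []
    for utt in shuffled:
        if removed_hours < target:
            removed_hours += utt.hours  # drop this utterance
        else:
            kept.append(utt)
    return kept + list(corrections)
```

The returned list would then be handed to whatever fine-tuning loop is in use, starting from the previous model’s weights rather than from scratch.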


We have completed three such iterations and the results are promising. We have been able to consistently decrease the Word Error Rate (WER), a common metric for automated transcription accuracy. The following is the chart of our WER.
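For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the automated transcript into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name `word_error_rate` is ours for illustration, and the sketch assumes both transcripts have already been normalized (case, punctuation), which real evaluations usually do first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on mat" gives one deletion over six reference words. Note that WER can exceed 1.0 when the hypothesis contains many extra words.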

We are providing free automated transcripts for a limited time, so please don’t hesitate to try out our online speech recognition system soon!

Deep Learning and AI have been in the news a lot lately, and there are concerns that Machine Learning will end up taking our jobs and replacing humans. We have taken a different approach and built a system to assist our transcribers instead. Eventually, we want to reach a point where a human has to spend just 10 minutes on a one-hour file and can still produce a highly accurate transcript of it. We still have a long way to go, and we are working hard at it!

Building a Custom Deep Learning Rig

Deep learning is a very exciting field to be part of right now. New model architectures, especially those trained with Graphics Processing Units (GPUs), have enabled machines to do everything from defeating the world’s best human Go players to composing “classical music”. We wanted to take advantage of its applications in speech and language modeling, and started with AWS G2 instances. We soon found that training even very simple models on a small portion of our data took days at a time, so we decided to build our own rig with specialized hardware.

Next Generation Technology Aiding Transcription

Transcription is an indispensable part of business. It helps with reporting, predictive analysis and much more. It also helps enhance a business’s web presence. To get accurate transcripts, many are turning to next-generation technology to help with the entire process and to reduce errors.

At Scribie, the motivation behind our service is to deliver perfectly transcribed files in the most convenient and hassle-free way.