Continual Learning for Speech-to-Text

Flawless transcripts and fast turnaround times are the hallmarks of Scribie. Our transcripts are not only highly accurate but also reasonably priced. Have you ever wondered what makes that possible? The answer lies in constantly improving our speech-to-text engine, which assists our transcribers. We provide automatic word completion to our transcribers, and the better those autocompletes are, the less they have to type.

Our speech recognition engine is a Deep Learning system. For the uninitiated, Deep Learning is a subdomain of Machine Learning. It makes use of Artificial Neural Networks that, in a way, mimic the structure and function of the human brain. Our engine is based on the DeepSpeech 2 network from Baidu and is written in PyTorch.

Scribie has a large dataset of audio and transcripts: over 100,000 hours at the last count. Training Deep Learning models over such a large dataset is very expensive in practice, as it requires a large number of GPUs and SSDs. For comparison, Baidu trained their models with 256 GPUs on custom hardware when they developed the DeepSpeech architecture. We don’t have the time or money to do that, so we developed an approach that we call Continual Learning.

Continual Learning

We first built and trained a large model with a 3,000-hour dataset. That took around three weeks on our rig. Since then, every month we have built a ‘corrections dataset’ of around 1,000 hours. This corrections dataset is made up of predictions that the previous model got wrong and that our transcribers then corrected manually. In each iteration we add the corrections dataset to the training set, remove an equal amount of older data from it, and fine-tune the model over the newly combined data. The training set therefore stays the same size while its contents shift toward the cases the model used to get wrong, which keeps the model improving over time.
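The data rotation described above can be sketched in a few lines of Python. This is only an illustration, not our production code: the Utterance record, the rotate_training_set function, and the choice to drop older clips at random are all assumptions made for the example, and the fine-tuning step itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One audio clip and its verified transcript (hypothetical schema)."""
    audio_path: str
    transcript: str
    hours: float


def rotate_training_set(previous, corrections, rng):
    """Add `corrections` and drop roughly as many hours from `previous`.

    Assumption: the clips to drop are chosen at random. The post does not
    say how the removed data is selected.
    """
    hours_to_drop = sum(u.hours for u in corrections)
    shuffled = list(previous)
    rng.shuffle(shuffled)

    kept = []
    dropped_hours = 0.0
    for utt in shuffled:
        if dropped_hours < hours_to_drop:
            dropped_hours += utt.hours  # this clip leaves the training set
        else:
            kept.append(utt)
    # The model would then be fine-tuned on this combined list.
    return kept + list(corrections)


# Example with one-hour clips: 3,000 hours of old data, 1,000 of corrections.
rng = random.Random(0)
previous = [Utterance(f"old_{i}.wav", "old text", 1.0) for i in range(3000)]
corrections = [Utterance(f"fix_{i}.wav", "fixed text", 1.0) for i in range(1000)]
combined = rotate_training_set(previous, corrections, rng)
print(len(combined))  # prints 3000
```

Keeping the total at a fixed number of hours is what keeps each monthly fine-tuning run affordable: the cost per iteration does not grow even though the pool of corrected data does.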

Results

We have completed three such iterations and the results are promising. We have been able to consistently decrease the Word Error Rate, a common metric for automated transcription accuracy. The following is the chart of our WER.
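As background on the metric: WER is conventionally the number of word substitutions, deletions, and insertions needed to turn the automated transcript into the reference transcript, divided by the number of words in the reference. The snippet below is a minimal sketch of that standard definition using a word-level edit distance. The function name word_error_rate is ours for illustration, and real evaluations usually normalize case and punctuation first, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four gives a WER of 25%.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # prints 0.25
```

A lower WER means fewer corrections for the transcriber, which is why we track it from one iteration to the next.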

We are providing free automated transcripts for a limited time, so please don’t hesitate to try out our online speech recognition system soon! Please note that we support only English at the moment and it works best for files with North American speakers and clean audio.

Deep Learning and AI have been in the news a lot lately, and there are concerns that Machine Learning will end up taking our jobs and replacing humans. We have taken a different approach and built a system to assist our transcribers instead. Eventually, we want to reach a point where a human has to spend just 10 minutes on a one-hour file and still produces a highly accurate transcript of it. We still have a long way to go, and we are working hard at it!

Price Cut

We are pleased to announce an across-the-board 20% drop in our pricing effective today. Our new transcription rates are as follows:

Scribie New Transcription Rates

Plan      Old         New         Savings
Budget    $0.75/min   $0.60/min   20%
Regular   $1.50/min   $1.20/min   20%
Express   $3.00/min   $2.40/min   20%

We started with the mission to build the best place for transcription, both for transcribers and for customers. We have been relentlessly pursuing that goal and have recently built technology that reduces the time and effort required of transcribers without compromising accuracy in any way. We are happy to pass on the savings to our customers.

We have always stood for accuracy and our goal has been to provide the highest quality transcript, at the lowest possible cost. However, we still want to compensate our transcribers fairly. The only way to solve this problem was with technology. Our tech has now been rolled out in production and we are happy to reach this milestone. This is the real test of whether our tech is good enough or not!

We will be talking more about our tech here in the coming days, so check back if you’re interested in the details. In the meantime, upload your files and order transcripts online to enjoy the benefits our tech can offer at our reduced pricing.

 

When YouTube Captions Go Wrong

Human Transcription > Computer Transcription

Have you ever used Google Voice’s visual voicemail option? How about YouTube’s closed captioning service? If so, you’ve probably encountered a wildly inaccurate and hilarious transcript.

Rhett McLaughlin and Link Neal, the comedy duo behind Rhett & Link, took this amusing side effect and turned it into a series of hilarious skits on YouTube.

The concept is similar to the Telephone Game. A message is passed from person to person until the original message is mostly unrecognizable.

Here’s what they did:

Step 1: Record a short script.

Step 2: Upload it to YouTube.

Step 3: Record a new video with the garbled transcripts that YouTube produced.

Step 4: Repeat.

The result is a funny, incoherent message similar to those in the famous “Bad Lip Reading” videos.

These skits were filmed between 2011 and 2013 and demonstrate just how inaccurate Google’s automatic transcription services used to be. Since then, Google’s automated voice transcription service has improved significantly, which is why the series eventually fizzled out.

Despite these improvements, automated transcription services still pale in comparison to the level of accuracy that human transcription services, such as Scribie, can provide.

We believe the English language, in all its complexity, nuance, and beauty, will never be completely mastered by artificial intelligence. And while this video is in jest, it’s an excellent example of why knowledge work will always require a human component to maintain quality.