How To Correct Automated Transcripts With Our Editor

We provide a browser-based editor which can be used to quickly correct the automated transcripts. Click the Edit Transcript button to launch it.

Scribie Editor

The first thing you will notice is the audio waveform at the top; that is the audio player. Clicking anywhere on it will take you to the corresponding word in the transcript.

The first row of buttons contains the playback controls. Each button also has a corresponding keyboard shortcut, so you don’t have to use the mouse, which saves a lot of time. The important shortcuts to remember are CTRL+P to play/pause and CTRL+O to rewind (CMD on Mac).

The second row of buttons holds the controls for the text editor. Hover the mouse over a button to get a description of what it does; they are mostly self-explanatory.

You will also notice some text underlined in blue and red. The red underlines are spelling mistakes; run the spell check to correct those. The blue underlines mark words where our speech recognition engine was not confident, so those may be mistakes. You can right-click on them and choose Play Word to check the corresponding audio.

The following corrections tend to be required in the automated transcripts:

  • Mistakes: These are words which have been transcribed incorrectly. Most of them will have blue underlines.
  • Speaker Turns: Our speech recognition engine misses around 40% of the turns, so some paragraphs may actually contain two speakers (we are working to improve this).
  • Punctuation: There may be some missing periods. Commas and other punctuation are mostly correct, although we only insert the opening quote; the closing quote has to be added manually.
  • Capitalization: Some capitalized words may be wrong, while other words may need to be capitalized.

We recommend a 2-pass approach to making the corrections. First, play and check the blue underlines. Those are the low-hanging fruit, and you can get them out of the way fast.

Next, play the audio from the beginning and make corrections as you go along. Whenever you notice a mistake, pause, make the correction, and resume play. Rinse and repeat till you reach the end of the file. Increasing the playback speed can also help in cases where the accuracy is more than 80%.

Once you are done with the edits, click the Download button at the bottom to get the Word document or other formats.

Editor download files

In practice, it takes around 3-4 times the duration of the file to correct the automated transcript if you include the time for replays; a 1-hour file, for example, takes around 3-4 hours. It is also easy to lose focus on long files, so remember to take breaks. Without the automated transcript, you may have to spend 8-10 times the duration of the file.

Of course, if you do not have the time, our transcribers will be happy to make the corrections for you. We guarantee 99% accuracy for our manual transcripts. Please do try it out.

Improved Automated Transcripts

Our latest speech and language models have been released. This release includes several new features:

Acoustic Model: This is our fourth acoustic model trained on our data. The dataset contained mostly accented speakers (e.g., Indian, African, Irish) and some noisy files. The accuracy of the automated transcript on accented files should be better now.

Language Model: We have added more data to our language model and doubled its size. The model has now been trained on around 46 million lines, which has improved the WER by around 2%.
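
For reference, WER is the word-level edit (Levenshtein) distance between the reference transcript and the model’s output, normalized by the length of the reference; the character error rate (CER) reported in the benchmarks below is the same computation over characters. Here is a minimal, generic sketch of both metrics in Python; the function names are illustrative, not our internal evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                              # delete all of ref[:i]
    for j in range(n + 1):
        dp[0][j] = j                              # insert all of hyp[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    words = reference.split()
    return edit_distance(words, hypothesis.split()) / len(words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

# One dropped word out of six gives a WER of ~16.7%.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```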

Punctuation: The biggest feature of this release is expanded punctuation support. We now support all types of punctuation, including quotes and hyphens. To our knowledge, no other service, including Google Web Speech, AWS Transcribe, and Speechmatics, supports quotes.

Speaker Turns: We have also updated our speaker turns model. Its accuracy is around 80% on long paragraphs, so the automated transcripts will be better segmented now. We are currently working on adding speaker diarization to the automated transcript, and it should be out soon. We do speaker turns a bit differently and do not require the number of speakers as an input; that is also one of our unique features. Google Web Speech does not support multi-speaker files, and AWS Transcribe and Speechmatics require the number of speakers as an input for diarization.

This release also fixes the issue of missing predictions, where some words, especially those near speaker turns, were not being transcribed. The automated transcripts should now capture all utterances except filler words. We also benchmarked our model on LibriSpeech Clean and our internal dataset. The following are our numbers:

Dataset             Type            WER      CER
LibriSpeech Clean   Read speech     14.53%   5.85%
Scribie Internal    Conversational  16.33%   8.82%

For comparison, the PaddlePaddle numbers are as follows:

Dataset             Type            WER      CER
LibriSpeech Clean   Read speech     5.4%     1.9%
Scribie Internal    Conversational  28.34%   17.71%

As you can see, for conversational audio, our models outperform PaddlePaddle by a wide margin. We are working on improving our models for non-conversational audio as well. Our ASR is a DeepSpeech-based system and therefore a comparison with PaddlePaddle is a good benchmark for us. The Continual Learning blog post has some more details on how we trained our DeepSpeech models.

The automated transcripts are currently free, so try them out today!

Continual Learning for Speech-to-Text

Flawless transcripts and fast turnaround times are the hallmarks of Scribie. Not only are our transcripts highly accurate, but they are also reasonably priced. But have you ever wondered what makes that possible? The answer lies in constantly improving our speech-to-text engine, which assists our transcribers. We provide automatic word completion to our transcribers, and the better those autocompletes are, the less they have to type.

Our speech recognition engine is a Deep Learning system. For the uninitiated, Deep Learning is a subdomain of Machine Learning. It makes use of Artificial Neural Networks that, in a way, mimic the structure and function of the human brain. Our speech recognition engine is based on the DeepSpeech 2 network from Baidu, and written in PyTorch.
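
For readers curious about what such a network looks like, here is a heavily simplified PyTorch sketch of the DeepSpeech 2 family: convolution over the input spectrogram, stacked bidirectional recurrent layers, and a linear layer producing per-timestep character log-probabilities for CTC training. The layer counts and sizes below are placeholders for illustration, not our production configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Like(nn.Module):
    """Toy DeepSpeech 2-style acoustic model (illustrative only)."""

    def __init__(self, n_mels=161, n_chars=29, hidden=512):
        super().__init__()
        # DeepSpeech 2 uses a small stack of 2D convolutions over the
        # spectrogram with a clipped-ReLU activation; one layer shown here.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        conv_features = 32 * (n_mels // 2 + 1)  # frequency axis after stride-2 conv
        self.rnn = nn.GRU(conv_features, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden * 2, n_chars)  # characters plus the CTC blank

    def forward(self, spec):                       # spec: (batch, n_mels, time)
        x = self.conv(spec.unsqueeze(1))           # (batch, 32, freq, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.rnn(x)                         # (batch, time, 2 * hidden)
        return self.fc(x).log_softmax(-1)          # log-probs for the CTC loss

model = DeepSpeech2Like()
out = model(torch.randn(4, 161, 200))  # 4 dummy spectrograms
print(out.shape)                       # torch.Size([4, 100, 29])
```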

Scribie has a large dataset of audio and transcripts — over 100,000 hours at the last count. Training Deep Learning models over such a large dataset is very expensive in practice, as it requires a large number of GPUs and SSDs. For comparison, Baidu trained their models with 256 GPUs on custom hardware when they developed the DeepSpeech architecture. We don’t have the time or money to do that. So we developed an approach which we call Continual Learning.

Continual Learning

We first built and trained a large model with a 3,000-hour dataset. That took around three weeks on our rig. Since then, every month we have built a ‘corrections dataset’ of around 1,000 hours. This corrections dataset is made up of predictions from the previous model that were wrong, and then manually corrected by our transcribers. In each iteration we remove an equal amount of data from the previous training set and fine-tune the model over the newly combined data. This ensures that our model keeps improving over time.
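
A minimal sketch of that monthly schedule is below. The dataset entries, the `fine_tune` helper, and all names here are hypothetical stand-ins for an ordinary training loop, not our actual pipeline code.

```python
import random

def monthly_iteration(model, train_set, corrections, fine_tune):
    """One continual-learning step: swap out as much old data as the new
    corrections dataset brings in, then fine-tune last month's model."""
    random.shuffle(train_set)
    retained = train_set[len(corrections):]  # drop an equal amount of old data
    new_train_set = retained + corrections   # ~1,000 hours of corrected predictions
    model = fine_tune(model, new_train_set)  # warm-start from previous weights
    return model, new_train_set
```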

Results

We have completed three such iterations and the results are promising. We have been able to consistently decrease the Word Error Rate, a common metric for automated transcription accuracy. The following is the chart of our WER.

We are providing free automated transcripts for a limited time, so please don’t hesitate to try out our online speech recognition system soon! Please note that we support only English at the moment and it works best for files with North American speakers and clean audio.

Deep Learning and AI have been in the news a lot lately, and there are concerns that Machine Learning will end up taking our jobs and replacing humans. We have taken a different approach and built a system to assist our transcribers instead. Eventually, we want to reach a point where a human has to spend just 10 minutes on a one-hour file and still produce a highly accurate transcript of it. We still have a long way to go, and we are working hard at it!

Is Automated Transcription Software Good Enough? Not for The New York Times

If you are ChristineMcM, a New York Times commenter, you probably know all too well how automatic transcription software can mess things up for you.

As reported by The Daily Dot, she had something to say about a recent Trump article but had to take a phone call in the middle of her comment. Her automatic transcription software heard and posted the whole conversation.

Yes, you read that right.

This is what it ended up posting.

This might be funny, but it shows where we currently stand with automated transcription.

Transcription is still mostly done by humans to avoid such gaffes.

Although she later clarified the mistake, it left her followers and those close to her confused. Some even suspected that she might be having a neurological episode.

Here is her clarification:

Understanding these problems, Scribie is not looking to go down the same route.

Instead, we use technology and AI to help humans transcribe faster and better.

The industry is far from completely eliminating the human factor in the transcription chain (unless you can afford such a gaffe).

For the time being, the best way to get your file transcribed is a human aided by cutting-edge technology that enables efficiency and high accuracy.


Building a Custom Deep Learning Rig

Deep learning is a very exciting field to be part of right now. New model architectures, especially those trained with Graphics Processing Units (GPUs), have enabled machines to do everything from defeating the world’s best human Go players to composing “classical music”. We wanted to take advantage of its applications in speech and language modeling, and started with AWS G2 instances. We soon found that training even very simple models on a small portion of our data took days at a time, so we decided to build our own rig with specialized hardware.

Google Introduces Dictation in Google Docs

Google provides us with a variety of services and tools to make our lives easier. One tool in particular, voice dictation, is now available in Google Docs. It’s an easy feature that makes the lives of those using it run a little smoother. Need to get an email sent? How about the notes for your next business meeting? Google Docs voice dictation makes that possible, without you having to really lift too many fingers.


To get started, you will need to have the latest version of Google Chrome installed and a microphone for your computer. With these tools set up, you’ll head to Google Drive and open a new Google Docs word processing document. You’ll go to the top menu and select Tools, then Voice typing. A pop-up window will appear with a dark microphone icon in the middle. Once you click on the microphone, it will turn red to signify that it’s recording, and you can start to speak.


It’s okay if you need to think about your words as you’re speaking; Google will wait. When you’ve completed your dictation, click the microphone to turn off the dictation. It is important to note that punctuation needs to be dictated.

An added benefit of voice dictation is that you can edit and format as well. Take the sentence, “I like pie.” To edit or format it, just say “select ‘I like pie’” and follow that with whatever formatting change you need to make. That could include “apply heading” or “apply underline.”

You can also create itemized lists by saying “create numbered list” or “create bullet list.” When you need to go to the next item on the list, just say “new line,” and say “new line” twice to finish the list. And no fear if you mess up! You can simply say “undo” to fix any mistakes.

For transcribers, these features can be a great time saver. Not only that, but they can reduce the amount of effort you have to put into typing up your latest project. Life made simple by Google. It’s as though Google just provided you with your own free secretary. For those of you who may wonder what all you can do with your voice, Google even made a complete list of commands for your viewing pleasure.

Why Do We Still Need Humans For Transcribing Speech?

So, how is Siri doing on your iPhone? Would you happily replace your secretary with her?

Personally, I won’t, because there are just too many ‘misses’ and ‘trouble spots’ that I wouldn’t want in my business.

The case is almost the same when you count upon software to transcribe your audio files instead of their ‘time-consuming’ human counterparts. Unfortunately, despite several attempts, science has not yet come up with a software solution that would act like Aladdin’s magic lamp. And from what it seems, the genie isn’t coming out any time soon. Why? The reasons are many.

The English language can be very tricky and hence very difficult to master, especially when the learner in question is transcription software. Homophones pose a problem that most software finds impossible to overcome. For instance, will it be sale or sail, no or know, fair or fare? The list continues. Unlike us humans, who are blessed with critical analytical skills, software cannot comprehend the difference. Plus, making these finer distinctions may be very difficult without context, which might not appear until further into the conversation.

The problem gets worse when the software needs to transcribe an interview or a dialogue involving many speakers. It is easy to guess why. Each of us has a unique style of speaking, and this distinction becomes far more complex as our personal style of speaking is influenced and shaped heavily by our geographical location, our culture, and our upbringing, to name a few. It is impossible to ‘teach’ any software to recognize speech that accurately.

Audio quality is yet another issue. And a very important one. Any speech recognition and transcription software needs a clear piece of audio, and anyone in the transcription business knows that an impeccable audio file is a rare phenomenon.

Talking about the accuracy rate of a human transcriptionist versus a software-driven one, Xuedong Huang, a senior scientist at Microsoft says, “If you have people transcribe conversational speech over the telephone, the error rate is around 4 percent. If you put all the systems together—IBM and Google and Microsoft and all the best combined—amazingly the error rate will be around 8 percent.”

Now the real question is, would you settle for something that is twice as bad as humans? We know the answer. That is why we offer a transcription service that is among the best in the industry. Start uploading your files now!

Automatic Audio Transcription

Humans Are Better at Transcribing Than Robots

Audio transcription can be a long process, especially if you are a newbie in the field. For many, automatic audio transcription offers an easy alternative. But is the shortcut worth taking? The statistics would say no.

Express Scribe, an automatic transcription tool, offers an accuracy of around 40-60% when integrated with Microsoft Speech Recognition. Google Voice, on the other hand, offers approximately 80% accuracy, but only when transcribing voicemails; that percentage drops significantly for conversational speech. The appalling performance of automatic audio transcription and speech recognition programs, even today, makes one wonder why that is. The reasons are plentiful.

The software fails to factor in the various styles of speaking

A language changes its character depending on who speaks it. For instance, the way English is spoken in the US is different from how people in India speak it. Teaching a software program to recognize the variations in human intonations and accents can be very challenging, and the problem multiplies when groups of speakers are involved. Analyzing voices can be equally frustrating for a program: the ease with which the human ear deciphers words spoken in a variety of voice qualities, such as hoarse, soft, or deep, does not carry over to software. In the ideal world, the speaker would speak clearly and carefully in order to be accurately transcribed by an automatic audio transcription system. But unfortunately, we don’t get to work in an ideal world.

English can be a tricky language

Sale, sail. Year, ear. Feet, feat. You get the drift. Homophones can be quite tricky, and sometimes they are impossible to distinguish in spoken language if we don’t understand the context. Quite obviously, this is a high expectation of software, and it naturally leads to undesirable mistakes.

The better alternative

Hiring a transcription service with a team of experienced transcribers is still the best option. Old is gold when it comes to accuracy, at least in this context. Scribie is completely powered by humans and is hence able to consistently maintain an accuracy level of 99% or higher.

Want to find out for yourself? Start uploading your files now.