Calum has a lifelong love of technology and is passionate about working with new technologies. Calum is an experienced professional in the Internet of Things market. He has done everything from helping Fortune 500 companies build their IoT devices to building Xively's IoT Platform for connected devices.
About the talk
Google Cloud’s Speech-to-Text API provides incredible accuracy out of the box. What you might not know is that it also has new tools for enhancing accuracy and customizing the model for your industry, domain, or use case.
Come learn how we measure accuracy at Google and how you can use our tools to customize your model and improve accuracy. We will walk through the basic concepts and introduce a lab that you can complete later on your own time.
Speaker: Calum Barnes
Google Cloud Next ’20: OnAir → https://goo.gle/next2020
Subscribe to the GCP Channel → https://goo.gle/GCP
My name is Calum Barnes, from Google Cloud. I'm the product manager for Cloud Speech, and today I'm going to be talking about measuring and improving speech-to-text accuracy with the Google Cloud Speech product. First, I'll give you a little bit of an overview of Cloud Speech-to-Text. Then I'm going to talk about how you can measure the accuracy of speech-to-text on your own data, then what you can do using our tools to improve the accuracy once you've measured it. And finally, I will do a quick worked example at the end and link you out to some of the companion tools and examples that we've created to go with this Next OnAir presentation. So, Google Cloud Speech-to-Text is an API which accepts audio, identifies speech within that audio, and returns the text representation of that speech. This happens in real time or in batch mode; we support both in all 71 languages and 127 locale variants that we support. We also have, for each one of these languages, a massive vocabulary that goes far beyond the dictionary and covers local vernacular, proper nouns, et cetera. Today we're going to be especially focused on the tools that we have for customizing the API and improving recognition on specific terms, as well as doing things like changing the output format. So, let's jump in and start talking about speech accuracy. Accuracy is pretty much the most important thing when it comes to speech-to-text. So whether you're doing post-processing on the text to determine someone's intent and extract entities, or displaying the text directly to users, like for captions, having a higher-quality transcript that's more accurate is going to result in a better user experience either way. There are a whole number of factors that can affect speech accuracy, from background noise to audio quality to various different accents or strange verbalizations, and all of these things can have a big impact. At Google Cloud, and Google as a whole, we have tried to make our models as robust as possible, and this means that they should work really, really well right out of the box on a wide variety of types of speech and other scenarios. But all speech
recognition systems are very sensitive to input data. And that means when you think about accuracy, you really need to be thinking about: what is the accuracy of this system on my specific data, in my recording environment, with the types of speakers I have and the types of speech they use? And to do that, you have to measure it. To measure it, we need a common language for talking about what the quality level is, and in speech-to-text that's usually word error rate, or WER. WER is not the only way to measure speech accuracy.
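As a concrete reference, WER can be computed from a reference transcript and a hypothesis transcript with a short dynamic-programming script. This is a minimal sketch for illustration, not Google's actual scoring tool:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance table over words: d[i][j] is the minimum number of
    # edits to turn the first i reference words into the first j
    # hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,            # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25 (one substitution)
```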
But it certainly is the most common, and even if you're also looking at other metrics, which can be valuable, usually they're looked at in conjunction with word error rate as well. Word error rate is composed of three different types of errors that can happen during transcription. One is the number of insertions: these are words which are present in the hypothesis (the machine-generated transcript, in this case) that are not present in the actual ground-truth, accurate transcript. Substitutions are the number of words which are present but are mis-transcribed as a different word, or in an incorrect way. And finally, deletions are the number of words which are completely missing from the hypothesis transcript but are supposed to be present per the source audio. You add all three of these numbers and divide by the total number of words that exist in the ground-truth reference, and that gives you the overall word error rate. This means that it is actually possible for word error rate to be greater than 100% in situations where you have very, very poor quality. To measure word error rate, there are a few steps you can follow, and we've also created a bunch
of tools to make it as easy as possible. But it all starts with getting your audio files. Like I said before, if you want to measure accuracy in terms of how the system is going to perform in the way you're using it, you must have in-domain audio which is similar to what you're going to be transcribing. Generally, we recommend having about three hours in this test set, but you can still get fairly statistically significant results with as little as 30 minutes of in-domain audio. It's more important that the audio is similar and representative of what you're trying to transcribe than it is to have a ton of it. The second step is probably the hardest of all of them: you have to get the ground truth for this audio. This means you must have a human-transcribed, 100% (or as close to 100% as possible) accurate version of the transcribed audio, in order to compare the various hypotheses against it. And then finally, you can compute the word error rate with our simple WER script, where you feed in the hypothesis and the ground truth, and it will show you what the word error rate is. That script, in addition to telling you the number of insertions, deletions, and substitutions, will also give you a pretty-printed HTML report that will show you exactly where each error is and what type of error it is. And you can see a sample of that here. So now that you've measured the accuracy on your audio, it's time to think about what you can do to improve that accuracy on your test set and on your wider data. And we do this with a set of tools that allow us to customize the model, or customize the system, toward the type of audio that we're being sent. Broadly speaking, there are three different ways to think about improving speech accuracy through model customization. The first is to customize the model to your domain by providing contextual information. An example of this would be: if you knew that people were going to be talking about ordering pizza, you could bias towards different types of crust, different types of cheese, different toppings, et cetera. Next is to tweak weights to address specific word or phrase issues. Commonly this can occur with proper nouns, people's names, or individual product names: words which, generally speaking, occur very rarely in everyday speech. Number three is to use context that you have to bias toward specific types of information or specific situations. So, for example, if you had a phone system, an IVR, you might know that a user is about to tell you an account number or phone number, and you can tell the ASR system to only look for a digit sequence or alphanumerics. We support doing all three
types of these customizations in Google Cloud Speech through the use of our speech adaptation tools. These tools, if used correctly, can be extremely powerful and can shift quality very significantly, depending on the situation and the type of biasing you're doing. Before we get into the features of speech adaptation and exactly how to use them, though, I want to talk a little bit about how speech adaptation works and how it is somewhat different from other types of model customization. In order to talk about how we customize the model and customize the results, we need to understand a little bit more about how ASR, or automatic speech recognition, systems work. Now, this is an extremely simplified diagram, and not all speech recognition systems work in this exact same way, but it will help us to illustrate the point here. So, a very simple system would take audio input from a user into what's called an acoustic model. The acoustic model looks at the audio waveforms, or the spectrogram, and it converts these waveforms into the sounds it thinks are there. These sounds are phonemes. The language model then looks at the groupings and series of phonemes and tries to understand from those sounds what words or phrases are being said. Most language models also take context and other factors into account when determining the text which the phonemes represent. Some models will produce a single output, or perhaps an n-best list of alternatives that could be determined from that grouping of phonemes. Now, let's take a look at a simplified version of the Google Speech-to-Text API. The audio comes into an acoustic model and is converted to phonemes that are sent to a language model, just like we talked about before. The difference is two things. One is that instead of just a single highest-confidence hypothesis, or a series of n-best alternatives, the language model produces an entire lattice of potential word alternatives. The other difference is the speech adaptation API, which allows users to send context and word hints about what the audio might contain. This information is then used to operate on the word lattice and determine, based on these hints as well as the confidences from the language model, what the highest-confidence hypothesis text is, as well as n-best alternatives if that's interesting to you. This differs slightly from other types of model customization, where you might have a custom language model, where the entire language model itself is changed to account for specific proper nouns or various different sequences which might be expected. But the great part of speech adaptation is that there's no training or retraining required. So it's much easier to experiment, and much cheaper to adapt the model to a variety of different needs.
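To build some intuition for operating on the lattice without retraining, here is a toy illustration. The real service's lattice encodes full scored paths and its scoring is far more sophisticated; this sketch just shows the idea of re-scoring word alternatives using hints, with made-up words and scores:

```python
# Toy "lattice": each position holds acoustically plausible word
# alternatives with confidence scores. (Illustrative values only.)
lattice = [
    {"the": 0.9, "a": 0.1},
    {"speak": 0.4, "speaker": 0.6},   # acoustically confusable pair
]

def best_path(lattice, hints=None, boost=2.0):
    """Pick the top-scoring word at each position, multiplying the
    score of any hinted word by `boost` before comparing. No model
    retraining happens; only the lattice scores are adjusted."""
    hints = set(hints or [])
    out = []
    for alts in lattice:
        scored = {w: s * (boost if w in hints else 1.0)
                  for w, s in alts.items()}
        out.append(max(scored, key=scored.get))
    return " ".join(out)

print(best_path(lattice))                   # "the speaker"
print(best_path(lattice, hints=["speak"]))  # hint flips it to "the speak"
```

Because the adjustment is just arithmetic on scores the model has already produced, a different set of hints can be applied to every request at no extra training cost.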
Next is that it happens completely in real time, so we're able to compute these changes without adding any latency to the end-to-end ASR pipeline. And this also means, because there's no training and it happens in real time, that you can have a different context with every single request that you send. One request can bias towards digit sequences, and the next request can bias towards somebody's name, for example. All of this can be accomplished with the speech adaptation API that I was talking about before. The speech adaptation API today has three different components or options. First is phrase hints. This is the ability to send words or long phrases which you think may be present in the speech, and the system will do its best to look at the words and combinations of words and bias towards that particular piece of information. In some cases, though, that might not be enough, or might not be giving you granular enough control, and that's what the Boost API is for. The Boost API allows you to specify an actual weight value for a specific word or bigram that will then take effect on that word or bigram in the word lattice. This is especially useful for things like proper nouns or rare words, because you can significantly boost the likelihood of them being recognized. Finally, we have the classes feature, which is essentially pre-built phrase hints and boost values for common scenarios, for example an alphanumeric sequence or a digit sequence, so you're able to recognize those without having to set the values yourself. Coming soon, we'll have some additions to the speech adaptation API. Namely, custom classes, which will allow users to create and share their own pre-made classes that can then be used with boost or speech adaptation the same way classes can today. This doesn't make the system necessarily any more accurate, but it gives users much greater composability when they are working on these problems, in the way that they can specify the phrases or words that they want to bias. Next is saved contexts, which allows you to save one or many known-good biasing configurations and specify that entire phrase list or boost list with just an ID on subsequent calls. This can be very useful in saving bandwidth overhead for users that need to send thousands or tens of thousands of words in each context. So that gives you an overview of the biasing tools and the model customization tools that we have available within the Google Speech API. We're always trying to improve these, but ultimately these tools are just how you send the biasing information. The real hard part is figuring out
what the right biasing information to send to the API is. And that's what we're going to talk about next. So, broadly, some things to consider are: what am I doing with this transcript? When you are transcribing the audio, is the result going into some NLU system where you need to extract entities, or is it being displayed directly to a user? Are there specific things that the downstream systems are going to be sensitive to? If your goal is to capture a phone number, then you have to be absolutely positive that you're getting that digit sequence right every single time. The next thing to consider is: are there rare words or proper nouns? Rare words and proper nouns are very difficult for ASR systems because, statistically speaking, they occur very, very uncommonly in everyday speech. So it's less likely that a speech-to-text system is going to decide that that is the word in a series of phonemes, as opposed to a homonym of a more common word. This can be complicated, but this is why biasing towards these words very, very heavily can cause them to be recognized at the rate that you want. The next is: what contextual information can I use? That is, what external information, outside of the audio itself, can I use to figure out what the person might be talking about? Some examples of this would be known context or state in a chatbot application, or maybe you have user history, so you know the types of queries that the user usually makes, and you can use that to help increase accuracy on future queries. And that feeds into the final point, which is: do I have strong or weak context? When you think about the contextual information that you have, you should be thinking about: do I know exactly what the user is going to say, and I'm pretty sure I know how they're going to verbalize it? Or do I just know the broad categories of what this is about? Some examples of strong context would be an IVR, like a phone-answering bot scenario, where you know exactly that the user is about to give a phone number, or say yes or no, or something like that. Next would be a system for giving commands, a system where you know that the user is going to say
"change the channel" or "play songs by" some artist. In these systems you have a very, very constrained vocabulary, and this can help you to bias and increase accuracy on that specific vocabulary. And finally, as I talked about before, important words or entities: if you have a proper noun that's very important to transcribe correctly, you're almost always going to need to bias towards it very, very strongly. Next, we have weak context, which is a situation like captions or dictation, or perhaps a conversation between two people, or multiple people in a meeting. These are situations where you don't know exactly what somebody's going to say at any one moment, but you know broadly what they are talking about. For example, this recording is all about speech-to-text technology, and that might give you hints about what words are going to be said that you can use to increase accuracy on those words. So now that we've looked at the biasing tools that are available, as well as what to think about when you're starting to think about biasing and model customization, I'm going to take you through a demo, or really more of a worked
example of what it looks like to do biasing for real. Now, this is going to be a very simple example. We've also created a bunch of content to go along with this Next OnAir session, where you'll actually be able to try all of this for yourself in a quick lab. For the purpose of this demo, I'm going to be focusing on improvement type two that I talked about, where we tweak weights to address specific word and phrase issues, especially focused on rare words. To get started here, I'm going to do basically exactly what I told you not to do before, which is that I'm going to give this example based on just one phrase, one sentence. I don't have a whole corpus of audio here, but the goal is not necessarily to vastly improve this one phrase, but to just show you how I think about the biasing problem and the signals that you can use and apply to a larger corpus when you're doing this yourself or trying it out in the lab. So I recorded a single sentence of me speaking here that I'll play for you now. So you can see, I say my name,
a proper noun, as well as a totally made-up word, the Speak-O-Tron. Now, even with these rare words, like I said, the Google recognizer works very well out of the box, so I actually did not have any issues recognizing this audio originally. I had to make the problem a little bit harder for the speech recognition system, and to do that, I added some noise to the same audio recording. This is the audio recording I'll be using for all the future examples in the demo, so I'll play that for you now. So, as you can hear, it's the exact same sentence with some white noise added in the background, and that made it so that the ASR system wasn't recognizing it right out of the box. On the right-hand side, I've set up some Python code to try out sending this to the Cloud Speech recognizer. I specified my input file; I'm using US English, since I'm an American English speaker; and it is a basic linear PCM WAV file recorded at 16 kilohertz. I'm not sending any speech contexts yet, because I just want to see what the bare usage of the system gives. Next, I need to set up my simple
WER script, where I'm putting in my ground truth: "Hi, this is Calum, I'm talking about the Speak-O-Tron." I haven't included any punctuation in this case, but you could include punctuation if you're expecting punctuation in the results as well. The hypothesis will come directly from the Google speech transcript that I just set up, and then we will compute the word error rate. So, let's take a look at what the accuracy looks like right out of the box on this noisy file. I had a word error rate of 37.5%, driven entirely by substitution errors here. So we can see it got my name wrong; not that wrong, but it is spelled incorrectly. More concerning, instead of "the Speak-O-Tron" it was transcribed as "this ecotron", which is not good. If we were really worried about the Speak-O-Tron product here, then we would have not even captured that this phrase was about the Speak-O-Tron at all. So let's look at what we can do using the phrase hints API. I can do something very simple, like put the exact right transcription in as a phrase hint: "Hi, this is Calum, I'm talking about the Speak-O-Tron." Now, this did work: it resulted
in 0% word error rate and the phrase being transcribed perfectly. This is always a good thing to try out if you have a single phrase: it basically just shows you that the speech adaptation tools work, that we are actually operating on the lattice and changing things. But this isn't that helpful to you as you think about improving recognition across an entire corpus, because if you knew what the correct transcription of the phrase was, you wouldn't be sending it to the speech API to begin with. So this is not that helpful, and sending the whole phrase is also not going to be helpful for other wordings, for somebody saying "hey, this is" some other name, or verbalizing the phrase a little bit differently. So what's something we can do to increase recognition without just providing the full correct phrase? We could use the Boost API and look at the rare words, like I talked about doing before. Calum and Speak-O-Tron here are the two rare words, and by boosting them fairly significantly, I think we can get pretty good results on those words. So let's take a look at how that performed on our noisy version of the audio. I was able to
decrease the word error rate substantially just by doing this: I've got it to spell my name correctly now, and instead of "ecotron" it has correctly recognized Speak-O-Tron. But there is still an error: it actually thinks I'm saying "this Speak-O-Tron" and not "the Speak-O-Tron". Now, this is an easy fix on the phrase side too, but it got me thinking more broadly about what you can do to increase accuracy on not just this one phrase, but all phrases that are talking about the Speak-O-Tron, or where somebody is saying their name is Calum. And one thing we can do with Boost
is to boost not just the unigrams that are important to us, but also the bigrams. We wouldn't want to boost really anything longer than a bigram, because you're just going to significantly confuse things; longer phrases are never going to exist like that in the lattice. But by thinking about what the bigrams are that people are actually going to use with these rare words, we can greatly increase recognition. So when somebody is saying their name, they're usually going to say, you know, "my name is Calum" or "this is Calum". And you can also think of other verbalizations. When I thought about Speak-O-Tron, I figured people are probably going to say "the Speak-O-Tron", the article with it, basically. And we can use that to boost those bigrams, as well as the unigrams even more strongly, and that's going to help us get even better recognition in some of those scenarios. And so, in this case, I boosted bigrams and unigrams. I've chosen boost values of 5 and 10; I think you'll have to play with exactly what boost value works best for you. But generally speaking, the bigram with the article should be boosted at a
lower rate than the unigram, because you wouldn't want every single word after "the" to be over-weighted. With this, I was able to actually get down to 0%, where we have an accurate transcript. Like I said, this is just working on one single phrase, and you would never get to 0% word error rate on an entire corpus or on multiple different variations of the same phrase. But hopefully this gives you a helpful idea about the way to think about these problems and what you can do to decrease overall word error rate across your corpus. So if you are interested in trying this out for yourself, or finding out more information about improving speech accuracy on your data, you can check out our docs. We've recently updated the speech adaptation documentation to include a number of the best practices that I've talked about here. If you want to actually try it out for real, we've created a quick lab specifically to go along with this content, called Measuring and Improving Speech Accuracy, where you can try out using the sample WER script and tuning the biasing with boost and phrase hints and classes, and try to achieve the greatest word error rate reduction yourself. If you're interested in the features that I talked about before, the custom classes or the saved contexts, you can apply for alpha access. Thanks very much, and I wish you the best of luck with all of your speech accuracy.