Duration 31:39

[Arm DevSummit - Session] Reinventing Live Communications and Collaboration Around AI Speech

Richard Burton
Senior Software Engineer at Arm
Chris Rowen
VP, Engineering, Voice Technology at Cisco
Arm DevSummit 2020
October 6, 2020, Online, San Jose, USA

About speakers

Richard Burton
Senior Software Engineer at Arm
Chris Rowen
VP, Engineering, Voice Technology at Cisco

Machine learning software engineer experienced in solving ML problems using Python with TensorFlow and PyTorch, preparing models for deployment on edge devices, and developing machine learning demo applications for Arm-based systems. Skills: Python, C++, MATLAB.


Chris is a Silicon Valley entrepreneur and technologist, now cofounder and CEO of BabbleLabs, a deep learning technology company focused on speech. Most recently, he has led Cognite Ventures, a specialized analysis and investment company for deep learning start-ups. Prior to Cognite, he served as CTO for Cadence's IP Group. Chris joined Cadence after its acquisition of Tensilica, the company he founded in 1997 to develop extensible processors. He led Tensilica as CEO and later CTO, developing one of the most prolific embedded processor architectures. Before that he was VP and GM of the Design Reuse Group at Synopsys. Chris was a pioneer in developing RISC architecture and helped found MIPS Computer Systems. He holds an MSEE and PhD in electrical engineering from Stanford and a BA in physics from Harvard. He holds more than 40 US and international patents. He was named an IEEE Fellow in 2015 for his work in the development of microprocessor technology. Chris is an avid runner who has completed a number of marathons, including the Boston Marathon in 2013, 2014, 2017 and 2018.


About the talk

Abstract: In the past 6 months, we have all been thrust into a strange new world of remote interaction where perfect speech communication is now essential to all work and social collaboration. This talk describes how BabbleLabs systematically applies deep learning to noisy, reverberant real-world audio using microphone arrays, neural networks and training methods to transform live speech experiences.

Presenters: Richard Burton, Senior Software Engineer, Arm, Chris Rowen, CEO, BabbleLabs

Technical Level: Intermediate

Target Audience: C-Level/Executive, Architect, Hardware Engineer, Software Developer

Topics: #ArtificialIntelligence, #Automotive, Industrial, IoT, Laptops, Linux, Performance Analysis, Tech for Good, Android, DSP, Machine Learning, #ArmDevSummit

Type: Technical Session

Conference Track: AI in the Real World: From Development to Deployment

Air Date: 2020-10-06

Transcript

I'm Chris Rowen, and it's a pleasure to talk today about live speech and speech interfaces in the context of AI. There's really a revolution going on, and my colleagues at Arm and I are here to talk about what we're doing to push this forward. I represent BabbleLabs, but BabbleLabs is right in the midst of being acquired by Cisco, so all the technology you see here is really becoming part of the Cisco Webex portfolio.

If we think about speech, it really is an important user interface. In fact, it is the ultimate human user interface: you can think of it as the interface that has 7.5 billion users today and has been in development for more than a hundred thousand years. Speech is essential to every aspect of society and commerce. It is this rich encoding of words, of text, of identity, of emotion, of location. It captures much of what makes human culture. Yet speech communication and speech user interfaces face major limitations, despite years of effort. This is in some ways one of the original electronics problems, and it's still unsolved in many ways. The limitations come from noise at the source; from conflicting speakers overriding one another; from bandwidth limitations in the communication channel; and from channel impairments, that is, noise or interruptions or dropouts that occur in communicating the captured speech. Latency is a big issue, especially in two-way conversations, because of the lag that causes people to talk over one another. And in audio-only communication you lose the visual cues that are a secondary but real contributor to being able to hold a conversation.

If you look at what it means to improve speech, there are really two major areas that BabbleLabs has been working on: speech enhancement and speech user interfaces. Speech enhancement is simply the problem of capturing or recovering clean speech, at maximum comprehensibility, from noisy, reverberant environments and channels. Its applications are pretty broad: obviously telephony, but also conference calls and contact centers, video recordings, and forensics, where we want to recover what was really happening from body cameras or court reporting or any of these environments where it's otherwise impossible or exhausting to figure out what's going on. The cases of live speech are the hardest, because you have a tiny latency budget. You have to get this done within a few milliseconds or maybe a few tens of milliseconds; you don't have hundreds of milliseconds, and you can't look ahead in the stream to figure out what should have been said. And it very often has to occur at the edge, that is, not on the server but at the edge, because that's where you have the greatest privacy and the least impact on latency.

Just to give you an example, I invite you to click on this simple before-and-after for a series of statements, in this case some commands, to give you an idea of what you'd like to do in very noisy conditions. Click on the first one to hear the speech as recorded in the original, and the second one to hear that same speech after it has been cleaned up using deep-neural-network-based speech enhancement.

So how do people attack this problem? Well, for some time people have been attacking it using signal processing algorithms.

In fact, we recently looked at what has been done over the last 40 years: we took the best DSP-based algorithms from that period, re-implemented them consistently, and ran a consistent set of speech enhancement tests on them to compare what kind of improvement was made from the original to the output of each algorithm. We chose a fairly common metric, the PESQ score, which is an ITU standard for the quality of speech communication, and we looked at how much the speech improved as a result of applying the algorithm. What we found is that for realistic test cases there has been no meaningful improvement over the course of the last 40 years in these algorithms. But then, just in the last couple of years, as neural networks have been applied to the problem, we've seen a dramatic hockey stick in performance improvement. Suddenly we're able to virtually eliminate the noise and the reverberation, and so we are at the start of a revolution.
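As a rough illustration of that evaluation methodology, the sketch below scores a noisy recording and its enhanced version against a clean reference with PESQ, using the open-source `pesq` package; the file names and the 16 kHz wide-band assumption are mine, not from the talk.

```python
# pip install pesq soundfile
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("clean.wav")           # clean reference speech (assumed 16 kHz mono)
noisy, _ = sf.read("noisy.wav")          # the unprocessed noisy capture
enhanced, _ = sf.read("enhanced.wav")    # output of the speech enhancement algorithm

# Wide-band PESQ (ITU-T P.862.2); scores range roughly from -0.5 to 4.5.
score_before = pesq(fs, ref, noisy, "wb")
score_after = pesq(fs, ref, enhanced, "wb")
print(f"PESQ before: {score_before:.2f}, after: {score_after:.2f}, "
      f"improvement: {score_after - score_before:+.2f}")
```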

Now why is that? I think we can look inside the nature of noise and see how these new algorithms are fundamentally different and more versatile. If you think about noise, you can broadly categorize it into two classes. There are stationary noises: these are steady in frequency and often reasonably low frequency, so fans, air conditioning, and the rumble of equipment fall into this category. In fact, you can see in this spectrogram of time versus frequency that this fan produces a set of fairly narrow, low frequency bands, and if you click on the icon you can hear what classic stationary noise sounds like. The DSP algorithms tend to be pretty good at this category, because they identify the steady frequency bands and work to subtract those out, so that the fundamental variability of speech is what differentiates it from the noise.

But what happens when you have other kinds of noise? Maybe the worst, in some ways, is human babble noise, that background rumble made up of speech. It has exactly the same frequency characteristics as speech and some of the same variability over time. You can see in the spectrogram that there's a broad range of frequencies and everything is time-varying, and if you click on that one, you can hear how different it is. The classic DSP algorithms have struggled with this kind of noise, but it is the kind that's often most important to remove, in a crowd or a street scene or a busy office. So the ability to handle this broader category of noises, the non-stationary noises, is really what particularly distinguishes the neural network algorithms, because they are learning all the structures of human speech: when it is still comprehensible versus when it has been reduced to a puree of sounds, as in that babble noise.
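If you want to reproduce that kind of time-versus-frequency view yourself, a minimal sketch with SciPy is below; the file name and the 16 kHz mono assumption are illustrative, not from the talk.

```python
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt
from scipy import signal

audio, fs = sf.read("fan_noise.wav")                 # e.g. a stationary-noise recording (assumed mono)
f, t, sxx = signal.spectrogram(audio, fs, nperseg=512)

plt.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12))     # power in dB
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.title("Stationary noise: energy sits in narrow, low frequency bands")
plt.show()
```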

Within this revolution there really is a lot of progress being made. In fact, as you saw in the chart, what we were doing two years ago was already a big step forward relative to the historical DSP algorithms. But as we have pushed towards broad-based deployment, we care enormously about not only the effectiveness, how much noise we can remove, but also the efficiency, how well this is going to run on mainstream, low-cost platforms at low energy. So we recently developed and put into product use two variations of these algorithms: a full-performance algorithm, shown in purple here, which improves the noise reduction by another 50 percent over where we were less than two years ago while reducing the compute load by about 15x, and a low-power version, which is still better than the old 2018 version but needs about 100x less compute than where we were just two years ago. I think this is symptomatic of what we see more broadly in these speech AI problems: there's an enormous amount of leverage that you get from advanced training using sophisticated data sets, and from tuning both the algorithms and the implementations, to deliver remarkable experiences at very low power and very low cost.

You can get an idea of the effectiveness by looking at these six pairs of the original sound and the result after our Clear Edge speech enhancement is applied. I invite you to click on a few of them, maybe the baby crying, and hear the baby sound eliminated, or the dog barking, or the work-from-home sounds. I think many of us are eager to present a professional face even when we're working in a rather non-professional environment, so eliminating the typing, the baby, the street noise, and the dogs are really important goals. Play around with that and look at the before and after for a couple of cases.

We can go on and say something about how speech user interfaces are transformed by this same attention to differentiating the substance of speech, the important part of speech, from everything else around it. This problem of building a speech user interface is really important because not every environment is a quiet living room. We know the great success of the Amazon Echos of the world, which happened when you have a really clean environment, but take that onto the street, or into a train, or onto a factory floor, or out into a crisis situation, and you do not have control over the environment. So you have to make speech user interfaces work in all those environments. In fact, speech user interfaces are particularly important when the hands and the eyes are unavailable for conventional buttons and screens. So we are particularly interested in those chaotic, noisy, hands-free environments: places like health care, public safety on the street, and shopping environments where you don't want everybody touching the same touchpad. In those environments you also often have mobility, so you want a tiny compute budget, tiny memory allocations, and low network bandwidth budgets.

So we want to build very robust user interfaces using neural networks, and that needs several dimensions of robustness. Robustness means always available, so not dependent on the network being up; completely private, so they're not always listening and sending everything you say to the cloud; working under worst-case noise conditions; speaker-independent, so even people with significant variations in accent experience the same high-quality, robust, reliable recognition; and supporting significant grammars, so that you don't need speaker training and the vocabulary of the user interface is sufficiently general to satisfy people's need to just use an intuitive command to do something.

So how do we go about this? Well, there's an inherent problem in existing systems today. Many user interfaces rely on the same automatic speech recognition (ASR) systems that are used generally for transcription.

But what happens in most of those systems is that there's a significant dependency on the noise level. I'm just taking one of the most popular and successful speech interfaces here: IBM Watson is a great ASR engine, and I could get very similar results if I looked at Google or Microsoft Cortana. They're all great ASR transcription engines, but in the context of command recognition they suffer. If you listen to how it performs under clean conditions, that is, a high signal-to-noise ratio, you can see on the spectrogram and hear that the commands are quite clear, and in fact IBM Watson does a great job of picking them up. If you go to higher and higher noise, that is, worse and worse signal-to-noise ratio, you can see visually how the structure of speech is lost in this muddle of noise, you can hear it, and you can see it in the accuracy of the speech recognition: by the time you get to 0 dB you're pretty much lost. There is no ability for these systems to extract any useful information.
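The controlled degradation behind that kind of accuracy-versus-noise curve is typically produced by mixing clean speech with noise at a chosen signal-to-noise ratio; a minimal sketch of that mixing step is below (the function and variable names are mine, not from the talk).

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Example sweep from fairly clean (+20 dB) down to 0 dB, where recognition collapses:
# for snr in (20, 10, 5, 0):
#     degraded = mix_at_snr(clean_commands, babble_noise, snr)
```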

But why is that? What is happening that causes that problem in recognition in the presence of noise? When things are quiet, you've got relatively clean speech coming in. You may go through a noise filter, then into an acoustic model which takes that waveform and maps it to a string of phonemes. For example, if somebody says "power on", you get those phonemes; then a language model, for English in this example, maps from the string of phonemes to the most probable string of words, "power" and "on"; and that text stream, that word stream, goes to a natural language application which is looking to find the interesting statements from its grammar, so it knows what "power on" and "power off" and other commands may be. If there's some noise, however, we might have confused the acoustic model so that it produces "power in"; then the language model gets the wrong thing, and the natural language application doesn't know what to do with it, because "power in" isn't part of its grammar, so it ignores it or does the wrong thing.

A better way of approaching this problem is a more unified model, in which the noise filter, the acoustic model, the language model, and the natural language application, the grammar match-up, are all done together. By doing that we overcome this problem of ambiguity. In fact, the network is only looking for the commands of interest, and if "power in" isn't one of those commands, it will never produce that wrong answer. It's much more likely to say, well, the closest thing to "power in" is "power on" in the context of the grammar that we have here, and so unified models can handle the ambiguity of noise in a significantly smarter way. So when we build a user interface using a neural network for command set recognition, we can do a better job, and in fact here is our Clear Command technology, which for exactly that same audio does a significantly better job than a large cloud-based ASR system that may be a hundred to a thousand times bigger and take vastly more compute than something which is targeting a specific vocabulary.
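As a toy, text-level illustration of what constraining recognition to a fixed grammar buys you (the real system works on audio, end to end, not on text), the sketch below snaps an uncertain hypothesis onto the closest command in a small grammar instead of passing arbitrary text downstream; the grammar and threshold are made up for the example.

```python
import difflib

GRAMMAR = ["power on", "power off", "volume up", "volume down"]

def snap_to_grammar(hypothesis):
    """Return the closest command in the grammar, or None if nothing is close enough."""
    matches = difflib.get_close_matches(hypothesis.lower(), GRAMMAR, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(snap_to_grammar("power in"))     # -> "power on": recovered despite the noisy hypothesis
print(snap_to_grammar("open sesame"))  # -> None: outside the command set, so rejected
```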

There's always a point where there's so much noise that even a very noise-robust system starts to make a few mistakes, but the knee of the curve is much further out, at much higher noise, before it really becomes problematic.

If we look at this down in the guts, we find that this unified model really implements a whole family of recognizers. In this particular case we're doing spectral-domain processing, so we really are doing a kind of image processing on the frequency spectrogram, using a set of techniques derived from residual networks, with separable convolutions and one-by-one convolutions. With that, we can typically build fairly sophisticated recognizers with tens to as many as a hundred different commands or intents, and do it in a relatively small network. Depending on the demands of the vocabulary and the level of noise immunity that's required, we typically range from 50,000 to 250,000 parameters, all 8-bit, and depending again on the scale of that network, roughly 20 to 100 million multiplies per second, something that can be done on almost any CPU, DSP, or microcontroller. There is a wide range of choices, and we have done optimizations that mean the working memory footprint is really small, something like 128 kilobytes of memory.

Here's an example from a 130-command command set where we look at the accuracy, in this case the F1 score, as a function of signal-to-noise ratio. As you'd expect, under quiet conditions it's very accurate, well above 95 percent even for the smallest models, but as the noise level increases and the signal-to-noise ratio gets worse, you see this gradual decline.
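To make the architecture description concrete, here is a minimal Keras sketch in the same style: a spectrogram input feeding separable and one-by-one convolutions with residual connections, ending in a softmax over a fixed command set. The input shape, layer widths, and command count are illustrative guesses, not BabbleLabs' actual network, but with these widths the parameter count lands in the tens of thousands, consistent with the sizes discussed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_COMMANDS = 30            # size of the command vocabulary (assumed)
INPUT_SHAPE = (98, 40, 1)    # ~1 s of audio as a 98-frame, 40-bin spectrogram (assumed)

def residual_block(x, filters):
    """Separable 3x3 conv + 1x1 conv with a residual (shortcut) connection."""
    shortcut = x
    x = layers.SeparableConv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 1, padding="same")(x)             # one-by-one convolution
    if shortcut.shape[-1] != filters:                            # match channel count
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([x, shortcut]))

inputs = layers.Input(shape=INPUT_SHAPE)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
for filters in (64, 96, 128):
    x = residual_block(x, filters)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(NUM_COMMANDS, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()   # prints the parameter count; 8-bit quantization would come next
```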

It's certainly much better than those ASR-based systems, which may be only fifty or sixty percent accurate under high-noise conditions, but there is some dependency on the model size, ranging here from about 110,000 up to 199,000 parameters. You can see that for this particular model it really pays to get above 130,000 parameters: where the smaller model is pretty much completely lost, the larger one is pretty much completely correct. So what we've really got is a technique for user interface design that allows us to be very robust and still have quite a small footprint.

In general, as we look at these neural network applications, we see that there is a wide range of choices of compute platform, with some of the classic trade-offs. For the greatest flexibility, to handle not only a very wide variety of neural network problems but any other problem as well, you can go to general-purpose CPUs; x86 and Arm Cortex both fit that category, and they generally become even more efficient when you add in their respective vector and instruction set extensions, like AVX2 and Neon. They don't have the same multiply efficiency as the other machines, so you have general, flexible solutions at higher and higher efficiency, but as you push towards really low-power operation or really high throughput, you move into the domain of DSPs and GPUs, and ultimately we see a significant role for neural network accelerators. You give up something in flexibility: a neural network accelerator won't be good for other computing tasks, and you won't use it effectively to run general-purpose control, but in the context of compute-intensive convolutions it will be great, and you can get to very high efficiencies. In fact, we've been working with Arm directly on evaluating how neural network acceleration works on general-purpose microcontrollers and on specialized accelerator engines. So I'm pleased to pass the ball to Richard Burton, who's going to talk about the work that we've done together on speech user interfaces for these kinds of engines. Go ahead, Richard.

Thanks so much, Chris. My name is Richard, and I work in Arm's machine learning ecosystem team. Recently we had the great pleasure of working with Chris and BabbleLabs to evaluate some of the ML workloads you've just heard about on our latest IP and software, and to see how we can help accelerate them and get great performance from them. As you've seen, running non-trivial ML models on embedded devices is very much a reality today, and there's no doubt the number of devices running ML at the edge is only going to keep growing. We are super excited about this and really keen to help.

Now, squeezing out all the performance possible from these tiny embedded devices is essential if you want to run any serious kind of ML workload, like speech recognition, on such a device. One way we do this is with CMSIS-NN, our library of optimized kernels designed to accelerate the key ML operators found in modern neural networks while also helping to reduce memory footprint. It's completely open source, linked in the slides, and we welcome contributions. One thing that has recently improved is much tighter integration with TensorFlow Lite Micro. As a result, it is now easier than ever to take your trained and quantized model, deploy it on a Cortex-M device, and have it accelerated for you without having to deal with the low-level details yourself. Together with BabbleLabs we profiled the kind of speed-up we could get with CMSIS-NN, and some of the graphs on this slide show the results: taking the plain reference kernels as our baseline, by using CMSIS-NN on a Cortex-M7 we were able to achieve a large improvement in inference speed.
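For reference, producing that kind of "trained and quantized" model is typically done with full-integer post-training quantization in the TensorFlow Lite converter before deployment to TensorFlow Lite Micro; a minimal sketch is below, reusing the `model` from the earlier architecture sketch and random data as a stand-in for real calibration spectrograms.

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # A few hundred typical input spectrograms let the converter calibrate int8 ranges;
    # random data is used here only to keep the sketch self-contained.
    for _ in range(200):
        yield [np.random.rand(1, 98, 40, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)   # `model` from the sketch above
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("kws_int8.tflite", "wb") as f:
    f.write(converter.convert())
```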

We then tested our latest devices, and compared with the Cortex-M7 you can see a further improvement. For many use cases these microcontrollers will be more than enough to satisfy your tinyML needs. But what if the model you want to run requires more compute than a microcontroller can offer, or your application can't wait that long for inference to complete? That's where a dedicated neural processing unit becomes very important, and this is where our latest NPU product, the Ethos-U55, comes in: what we believe to be the first NPU of its kind in the microcontroller space. A nice thing about it is that the application code remains essentially identical to the previous example. One thing it does require is that you first optimize your model with our offline optimizer, called Vela. Vela will inspect your model and look to see where it can optimize it, for example by reordering or fusing operators.
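Vela is distributed as a command-line tool (pip install ethos-u-vela); a minimal sketch of driving it is below, assuming the kws_int8.tflite file from the previous step and a 128-MAC Ethos-U55 configuration, both of which are my choices for illustration.

```python
import subprocess

# Compile the quantized TFLite model for the Ethos-U55; Vela rewrites supported
# operators so they run on the NPU and emits an optimized .tflite in vela_out/.
subprocess.run(
    [
        "vela",
        "kws_int8.tflite",
        "--accelerator-config", "ethos-u55-128",   # 128-MAC Ethos-U55 variant
        "--output-dir", "vela_out",
    ],
    check=True,
)
```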

Using the same application code as before, we deployed the BabbleLabs model on the Ethos-U55, and this slide shows the results we got this time: in the region of a 14x speed-up in inference time compared with the previous devices. That lets you run more complex networks where latency is important, but also lets you service inferences from multiple networks on one device, which is one of the other common use cases.

By utilizing the Ethos-U55 we can also achieve large reductions in the energy used by ML workloads like speech recognition compared with using just a Cortex-M CPU, by as much as 90 percent in some cases when comparing to older Cortex-M7 devices. We also wanted to see how efficient the NPU is at keyword spotting, so using a model similar to the BabbleLabs one we evaluated the energy used per inference for different configurations. It helps to demonstrate the kind of performance gains you can expect with real-world networks. The Cortex-M55 and the Ethos-U55 are great choices for running ML workloads at the edge. And with that, I'll hand back to Chris for some closing thoughts.

What we hope you take away is, first of all, that speech is really an ideal target for AI. It's a very complex, nuanced representation of many kinds of information that are otherwise very hard to extract, analyze, and clean up, but neural networks, because they can be trained on massive amounts of speech, make it possible.

Nor is the problem computationally intractable: you can run it across a very wide range of platforms, and in fact, with things like the Ethos-U55, you can now run it even in very low-energy environments. Second, we are just at the beginning of a revolution. We expect that the pace of speech innovation using AI that we've seen in the last couple of years is going to accelerate and persist for some years to come. We have seen very little yet of what is going to be possible, and I think we're going to see a diversity of new algorithms, especially at the edge, that can handle the real-time demands, the mobility demands, and the robustness demands that we expect from ubiquitous devices, whether they're in our phones or they're wearables, where we really care a great deal not only about how accurately they perform but about how mobile and how energy-efficient they are. So buckle up: I think we're entering a really exciting time for speech AI. Thank you.
