MLconf Online 2020
November 6, 2020, Online
Efficient BERT: Optimal Multimetric Bayesian Optimization

About the talk

With the introduction of BERT, we suddenly have a strong-performing, generalizable model that can be transferred to a variety of tasks. But BERT is really, really large. During this talk, we will explore how to reduce the size of BERT while retaining its capacity in the context of Question Answering tasks. We will pair distillation with Multimetric Bayesian Optimization. By concurrently tuning metrics like model accuracy and number of model parameters, we will be able to distill BERT and assess the trade-offs between model size and performance. This experiment is designed to address two questions through this process:

By combining distillation and Multimetric Bayesian Optimization, can we better understand the effects of compression and architecture decisions on model performance? Do these architectural decisions (including model size) or distillation properties dominate the trade-offs?

Can we leverage these trade-offs to find models that lend themselves well to application-specific systems (e.g., productionization, edge computing, etc.)?

About the speaker

Meghana Ravikumar
Machine Learning Engineer at SigOpt

Meghana has worked with machine learning in academia and in industry, and is happiest working on natural language processing. Prior to SigOpt, she worked in biotech, employing NLP to mine and classify biomedical literature. When she’s not reading papers, developing models/tools, or trying to explain complicated topics, she enjoys doing yoga, traveling, and hunting for the perfect chai latte.

Transcript

I will go ahead and get started. This is another presentation from one of our Platinum sponsors, SigOpt. We have Meghana Ravikumar. She is an ML engineer with SigOpt, and she'll be talking about Efficient BERT today. So, without further ado, Meghana, the floor is yours.

Thanks for tuning in, everyone. It's been a long week, so we'll just get right into it with multimetric Bayesian optimization. Again, I'm Meghana, I'm a machine learning engineer, and here are my handles if you want to chat. During this talk we'll explore how to reduce the size of BERT while retaining its capacity in the context of question answering tasks. We'll be pairing multimetric Bayesian optimization with distillation and show that by concurrently tuning competing metrics, like accuracy and model size, we are able to distill BERT and assess the trade-offs between model size and performance in a granular fashion. Before we jump into the use case, I'm just going to quickly go over what we do at SigOpt and the product features we'll use for this use case.

At SigOpt, our mission is to amplify and accelerate your model development and to help you gain insights throughout your experimentation process. We understand that modeling is messy and difficult to standardize, that training can be a crapshoot and difficult to debug, and that tuning is expensive and difficult to scale. We build solutions to help with these problems, including experiment management to organize and collaborate with your team throughout your modeling process, insights into your training runs and tuning sessions, and intelligent, scalable hyperparameter optimization. We know that you're iterating on tooling along with iterating on your model development, so we make sure that we are easy to use and stack agnostic. For this talk we'll focus on hyperparameter optimization at scale.

Here's how our general optimization loop works. We set up our parameter space, SigOpt samples the space and provides parameter values, we run our model training with these suggestions, and we return the model's performance or other metrics back to SigOpt. The optimization proceeds in this loop-like fashion, where over time we should see our model performing better and better.
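To make the loop concrete, here is a minimal, self-contained sketch of a suggest-train-report cycle. It is not SigOpt's actual client API: the random sampler below simply stands in for SigOpt's Bayesian suggestion service, and train_and_evaluate is a hypothetical placeholder for a real training run.

import random

# Hypothetical parameter space; the real experiment also tunes architecture
# and distillation parameters (see later in the talk).
PARAMETER_SPACE = {
    "learning_rate": (1e-5, 1e-3),
    "num_layers": (2, 12),
    "dropout": (0.0, 0.3),
}

def suggest():
    """Stand-in for the optimizer's suggestion step (random search here)."""
    return {
        "learning_rate": random.uniform(*PARAMETER_SPACE["learning_rate"]),
        "num_layers": random.randint(*PARAMETER_SPACE["num_layers"]),
        "dropout": random.uniform(*PARAMETER_SPACE["dropout"]),
    }

def train_and_evaluate(params):
    """Hypothetical training run; returns the metrics we report back."""
    accuracy = random.random()               # placeholder for SQuAD 2.0 accuracy
    size = params["num_layers"] * 7_000_000  # placeholder parameter count
    return {"accuracy": accuracy, "model_size": size}

observations = []
for _ in range(10):                          # observation budget
    params = suggest()                       # 1. optimizer suggests a configuration
    metrics = train_and_evaluate(params)     # 2. we train with that configuration
    observations.append((params, metrics))   # 3. metrics are reported back to the optimizer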

Now that we're a bit more familiar with the tools we'll be using, let's talk about distilling BERT. BERT is great and is a very pivotal architecture. It is generalizable and transferable, which means that the model performs strongly across a variety of NLP tasks with very minimal changes required, and its architecture enables widely used techniques like transfer learning, leveraging large pre-trained models to solve specific and niche problems, which is really convenient. But on the other hand, it is very large and difficult to put into production systems and other memory-constrained applications. Many teams are trying to solve this problem by compressing or distilling BERT, including Hugging Face, Rasa, and others, to list a few, and we'll be focused on expanding Hugging Face's work on distillation. Specifically, we ask two main questions: when distilling BERT, can we understand the trade-offs between model size and performance? And given these trade-offs, can we make informed decisions on the model architecture that best works for our needs?

To answer these questions, we focus on our use case: distilling BERT for question answering. The dataset we'll be using is SQuAD 2.0. It is comprised of 35 topics, ranging from the chemical properties of oxygen to the history of the Yuan Dynasty.

Each topic has a set of questions the model needs to understand and answer. You can think of this as your standardized reading comprehension test, where you're given a paragraph and a set of questions and you need to find the answers to these questions within the paragraph. Unlike SQuAD 1, SQuAD 2.0 introduces the concept of unanswerable questions, making it more challenging for models. The dataset is split 50/50 between answerable and unanswerable questions. For answerable questions, to be marked correct the model has to find the exact string match within the passage, but for unanswerable questions the model really only has to classify that data point as unanswerable. This is not great, as a model can randomly guess that all answers are unanswerable and get 50% accuracy.
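As a rough illustration of the scoring scheme just described, here is a hedged sketch of per-question exact-match scoring with a no-answer option. The normalization is simplified relative to the official SQuAD 2.0 evaluation script, and the names are illustrative assumptions.

def normalize(text):
    """Crude normalization; the official SQuAD script also strips articles."""
    return " ".join(text.lower().replace(".", "").replace(",", "").split())

def exact_match(prediction, gold_answers):
    """Score one question: 1 if the prediction matches any gold answer.

    For unanswerable questions, gold_answers is empty and the model must
    predict the empty string ("no answer") to be counted as correct.
    """
    if not gold_answers:  # unanswerable question
        return int(normalize(prediction) == "")
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

# Example: a model that always predicts "no answer" scores ~50% on a
# dataset split evenly between answerable and unanswerable questions.
questions = [("", []), ("", ["Napoleon's army"])]
score = sum(exact_match(p, g) for p, g in questions) / len(questions)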

Now, let's take a look at how distillation works. Distillation is a technique used to reduce the size of a model, where a large, cumbersome model, the teacher model, is distilled into a smaller model, the student model. During distillation, the teacher model is typically already trained on the dataset, and the student model is trained through the distillation process. Unlike most model training processes, the student model's loss function is a weighted loss over hard and soft target losses. The hard target loss is just your classic loss, where the model is trained on the dataset, and the soft target loss is an added component that allows the student to learn properties that the teacher model has already learned. By combining both loss types, the student uses this information to generalize the same way as the teacher model and reach higher model performance than if it were to just train on its own. So the overall goal of distillation is to get a trained student model that is smaller than the teacher model and performs strongly on the given dataset.
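A minimal PyTorch-style sketch of the weighted hard/soft loss just described, following standard knowledge-distillation conventions rather than the exact loss used in this project; alpha is the hard/soft weighting and temperature softens the distributions (both are tuned later in the talk).

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Hard-target loss: standard cross-entropy against the dataset labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Soft-target loss: KL divergence between softened teacher and student
    # distributions; the T^2 factor keeps gradient magnitudes comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Weighted combination of the hard and soft components.
    return alpha * hard_loss + (1.0 - alpha) * soft_loss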

We will be using distillation as our technique to reduce the size of BERT, and we'll be running it multiple times with different student model architectures to really understand which architectures work best. Our teacher model is BERT pre-trained and fine-tuned on SQuAD 2.0, and we perform an architecture search during our optimization process to design the student model for each distillation cycle. Due to how distillation works, we also expect the student model to inherit properties from the teacher model that it would not have been able to learn on its own. And a quick shout-out to Hugging Face: without their Transformers package and their prior research, this project would not have been possible, so I highly recommend checking them out.

Now that we've defined our distillation process at a high level, let's look at how we're going to define the student model. We design the student model's architecture through suggestions from SigOpt, where SigOpt provides suggestions for the student model architecture as well as the hyperparameters. It'll tell us, basically, how many layers we should use for this training cycle, as well as what the other parameters should be, for example whether to initialize weights from the pre-trained model, which was trained on the Toronto Book Corpus and English Wikipedia.

So now that we've designed our distillation process, what is our baseline? For the baseline, we use the student model architecture from the DistilBERT paper. The distillation process itself uses a low temperature and an equally weighted loss between the hard and soft target losses, and uses the default hyperparameters. As a result, our baseline has around 66 million trainable model parameters and reaches 67% accuracy on SQuAD 2.0. By optimizing this distillation process, we're going to see if we can beat the baseline's accuracy and come up with a student architecture that is smaller than 66 million parameters.

So let's take a look at our optimization cycle. As I said earlier, we're going to use SigOpt's HPO solution, and specifically multimetric Bayesian optimization, to optimize the distillation process and perform our architecture search simultaneously. With this hyperparameter optimization technique, we're able to optimize for two competing metrics at the same time, and at the end of the optimization process it populates a Pareto frontier, where each point on the frontier is an optimal trade-off point at which you cannot improve one competing metric without sacrificing the other. More concretely, what that means for us is that we'll be concurrently optimizing for model performance versus model size, where we want to increase model performance and decrease model size of the student model. On the graph, we again see the baseline values for each metric that we are trying to beat.
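For intuition about what "cannot improve one metric without sacrificing the other" means, here is a small, hedged helper that extracts the Pareto-optimal points from a list of (accuracy, size) results. It is an illustration only, not part of the actual experiment code.

def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (accuracy, size); we want higher accuracy and lower size.
    A point is dominated if another point is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for acc, size in points:
        dominated = any(
            (a >= acc and s <= size) and (a > acc or s < size)
            for a, s in points
        )
        if not dominated:
            front.append((acc, size))
    return front

# Example: (0.66, 60e6) is dominated by (0.67, 55e6) and drops out.
print(pareto_front([(0.67, 55e6), (0.66, 60e6), (0.70, 70e6), (0.62, 40e6)]))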

As we already know, the SQuAD 2.0 dataset is split across answerable and unanswerable questions. This just means that if the model is trained poorly, it's really easy for it to fall into the trap of classifying everything as unanswerable and randomly being correct 50% of the time. There are a couple of ways to deal with this. We could create a better composite metric that measures model performance across the two question types more accurately, or we could use a SigOpt optimization feature called metric thresholds, which is what we chose, set at our 50% accuracy point. What this does is tell the optimizer to focus its efforts on parameter space values that result in a model accuracy over 50%, and that means we can successfully avoid configurations where the model is randomly guessing.

So what are we tuning? We'll be tuning training parameters, including parameters for SGD, the batch size, and weight initialization. We'll be tuning architecture parameters, which include the number of transformer blocks, the number of attention heads within each block, attention head pruning, and dropout for the network. And we'll be tuning the distillation parameters: the temperature and the weights for the soft and hard target loss components. At the end of the optimization experiment, our Pareto frontier will consist of optimal sets of model architecture and hyperparameter configurations where we cannot improve in size without sacrificing performance, and vice versa.
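Putting the last two steps together, here is a hedged sketch of how such a search space and the two thresholded metrics might be written down as plain Python data. The parameter names, types, and bounds are illustrative assumptions, not the exact experiment configuration.

# Illustrative search space for the distillation experiment (assumed bounds).
parameter_space = {
    # training parameters
    "learning_rate":          {"type": "double",      "bounds": (1e-5, 1e-3)},
    "batch_size":             {"type": "int",         "values": [8, 16, 32]},
    "init_from_teacher":      {"type": "categorical", "values": [True, False]},
    # architecture parameters
    "n_transformer_blocks":   {"type": "int",    "bounds": (2, 12)},
    "n_attention_heads":      {"type": "int",    "bounds": (2, 12)},
    "attention_head_pruning": {"type": "double", "bounds": (0.0, 0.5)},
    "dropout":                {"type": "double", "bounds": (0.0, 0.3)},
    # distillation parameters
    "temperature":            {"type": "double", "bounds": (1.0, 10.0)},
    "soft_loss_weight":       {"type": "double", "bounds": (0.0, 1.0)},
}

# Two competing metrics; the accuracy threshold encodes the "better than
# random guessing on SQuAD 2.0" constraint discussed above.
metrics = [
    {"name": "squad2_accuracy", "objective": "maximize", "threshold": 0.50},
    {"name": "n_parameters",    "objective": "minimize"},
]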

This is just an overview of our optimization cycle. SigOpt provides distillation, architecture, and other hyperparameter suggestions. We create the student model and run the distillation process given these parameters. The resulting trained model reports back its validation performance and size. SigOpt then takes these performance metrics and suggests the next set of parameters, which again includes distillation, architecture, and other hyperparameter suggestions, and we continue in this loop-like fashion until the end of the experiment.

In order to conduct this experiment, we use Ray to orchestrate our EC2 cluster. Ray manages the cluster orchestration, and we use SigOpt for the parallelized Bayesian optimization algorithm.
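As a rough illustration of that setup, here is a minimal Ray sketch that runs several distillation trials in parallel. run_distillation_trial is a hypothetical placeholder for the real training job, and the resource settings are assumptions.

import ray

ray.init()  # connects to an existing cluster if one is configured

@ray.remote  # in practice each trial would reserve a GPU, e.g. ray.remote(num_gpus=1)
def run_distillation_trial(params):
    """Hypothetical placeholder: distill a student model and return its metrics."""
    # ... build the student from `params`, run distillation, evaluate on SQuAD 2.0 ...
    return {"params": params, "accuracy": 0.0, "n_parameters": 0}

# Launch a batch of suggested configurations in parallel and gather the results.
suggested = [{"n_transformer_blocks": n} for n in (4, 6, 8)]
results = ray.get([run_distillation_trial.remote(p) for p in suggested])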

So what were our results? This is our resulting Pareto frontier. The yellow dots are the optimal points and the pink is the baseline that we had previously established. Using this, we're able to understand the trade-offs between model performance and size during distillation. We're also able to identify the architecture configurations for the student model that result in these trade-offs. Essentially, instead of relying on a single architecture performing well for question answering, we're able to leverage these trade-offs and choose from a set of architectures. These are just a few optimal points I wanted to highlight: on the far left, we have our smallest student model architecture, which performs as well as the baseline, and on the far right, we have our best performing model, which is slightly larger than the baseline.

Here are the respective architectures for those optimal points. Some common features found among the optimal configurations include the student model preferring no dropout and an emphasis on the soft target loss, so it is weighting learning from the teacher model more heavily than learning from the dataset. We also see that the architecture is compressed more often by reducing the number of layers than by pruning the number of attention heads.

This is fairly intuitive, because we're trying to reduce the size of the model and layers are larger than attention heads. So what we see here is that by leveraging multimetric Bayesian optimization to conduct an architecture and hyperparameter search during a compression or distillation process, we're able to identify sets of viable models for our specific problem, and we're able to choose the model architecture that best suits our needs. Now we're going to take a quick look at our dashboard.

Here we're able to see our Pareto frontier. We're able to analyze patterns between specific parameters and the metrics that we chose, and see which parameters have a higher influence on each of the metrics that we care about. Here we see the number of layers being the parameter that mostly differentiates both what we're getting as accuracy and what the model size is, which makes sense. And down here, we can visualize sweet spots for parameter values that we can leverage either for this experiment or for future experiments that might have similar parameter spaces.

So this is all great, but we really want to know: was the model able to answer questions? Let's take a look at our best performing model. Our best performing model has 70% accuracy. As we see in the graph, the model performs better or worse depending on the category, with topics such as the Yuan Dynasty, Warsaw, and steam engines being more challenging than others.

To understand what's really going on, we're going to categorize the misclassifications into four categories. Mostly right: the model is off by either including extra context words or things like punctuation in its exact-match identification of the answer. Mostly wrong: the model has gotten the answer completely wrong. Label no answer: the model predicts an answer despite the question being unanswerable. Label has answer: the model predicts no answer when the question is answerable. Although different topics have different majority categories, most of the errors we saw were label no answer, which means that the model was answering questions that are in fact unanswerable. This is interesting because, going back to why we set the metric threshold, we were really worried about the model randomly predicting that everything is unanswerable. It seems to have taken care of that problem, but it is now running into this new one. Let's take a closer look at why the model tries to answer questions that are in fact unanswerable.

This section on Warsaw is a pretty good example of the patterns that I saw. The section spans a long period of time with complicated and convoluted historical events, where very similar nouns take on different roles in each passage. The main entity, the city of Warsaw, goes through transitions, and many different terms are used to describe the same place. In addition to that, the questions are tricky and ask about relationships between these overloaded terms. For example, in the passage above, one answerable question is "Whose army liberated Warsaw?" versus one unanswerable question asking whose army liberated the Duchy of Warsaw, and the model guesses Napoleon for both. These entities are related but have nuances between them that the model is unable to pick up. It's not able to differentiate between Warsaw and the Duchy of Warsaw, or to really understand that these differences are due to a temporal transition that is described in the passage.

Many of the missed questions similarly follow this pattern, and it's definitely something to look into for future work. So why does this matter? By using scalable and intelligent hyperparameter optimization, we're able to easily understand the trade-offs made during a compression or distillation process, and by understanding these trade-offs we're able to choose a model architecture that best suits our needs. For more on this project, please go to our blog, which has the full analysis. There is also a GitHub repo and published PyTorch model checkpoints that you can use for your own projects, and you can explore the experiment dashboard as well; the link is public, so please feel free to play around with it. And for a limited time you can sign up to use SigOpt for free, so if you're interested, please join our free beta. Thank you, and now I'll open it up for any questions.

I don't see any questions popping up for you right now. Alternatively, if questions do show up in the chat, I would encourage you to answer them one-on-one via the chat.

So I want to thank you very much for the informative presentation. We appreciate the support with your sponsorship as well. This actually concludes our business track sessions for the day. However, we have another presentation on the main stage starting at 3:45, and we also have one more session in the other ML track as well that starts in the 3:45 time frame, and we will have a startup showcase at 4:15 at the conclusion of those other two presentations. I hope that you guys stick around for that, and I believe we're going to have some giveaways during that time frame as well.

So, Meghana, thank you once again, we appreciate it. I hope you guys enjoy the rest of the conference.
