About the talk
Large scale deep learning training workload runtime optimization is computationally expensive, requires contributions from a multidisciplinary team, and relies on complex hyperparameter optimization techniques. Habana Labs developed a computationally cost efficient methodology to reduce the MLPerf training workload’s runtime, namely the reduction of the number of training epochs required to reach target accuracy.
In this talk, Basem Barakat, Large Scale Machine Learning Engineer, and Evelyn Ding, Senior Machine Learning Engineer, both from Habana Labs, discuss how their team used hyperparameter optimization (HPO) to optimize runtime and tested two different methods in the process: a home-grown grid search and the SigOpt optimization library. In this comparison, SigOpt provided a clear advantage over the home-grown grid search in multiple ways. The Habana Labs team shares their experience working through these challenging workloads and the insights they gained in the process.
As a large scale machine learning engineer at Habana, Basem leads efforts to accelerate AI workloads on Habana Gaudi and other Habana AI accelerator hardware, including collaboration with the team on MLPerf submissions. Previously, Basem was a senior performance architect at Samsung. Before Samsung, Basem was a senior systems engineer at Freescale. Basem holds a Ph.D. in Physics from the University of Houston.
Evelyn is a senior machine learning engineer at Habana Labs and previously occupied the same role for Intel. She focuses on data analytics, machine learning, application development, optimization and deployment for at-scale machine learning and deep learning projects. Previously, Evelyn held advanced software engineering roles at Heristar, ExxonMobil, Dow and ENGlobal, among other companies. Evelyn graduated with a Ph.D. in Electrical Engineering from the University of Sheffield in the United Kingdom and received her bachelor's degree in Electrical Engineering from the University of Science and Technology in China.
Hi, my name is Basem Barakat. I'm here with my colleague Evelyn Ding to present the hyperparameter optimization work that we did together with many other individuals. I would like to provide an overview of the work we are presenting in these slides: the goal is to reduce our workload runtime. The optimization applies to two workloads from MLPerf training, ResNet50 and BERT. The results are available for those who would like to look at them; however, those are for version 1.0. We have the submission done for version 1.1, and those results are not public yet. For that reason, I will not present the results as absolute numbers; we will show relative performance until those results become public.

The method is applied to ResNet50 and BERT, but it is really general and could be applied to any other workload. I will focus on ResNet50, because the approach is identical. I would like to emphasize, and make sure everybody understands, that many factors contribute to optimizing workload execution time. We will be focusing on and discussing the hyperparameter optimization contribution to the performance of those workloads. I would also like to mention that we are starting from known values. These workloads have been run several times, so we start from the known values and then go ahead and apply our methods. In addition, we don't have infinite compute resources. Our group has resources, but they are shared with other groups and other objectives, so our goal is to do the optimization with the least number of nodes, using the least amount of time. We will compare the optimization work done manually against the one done with SigOpt, and these constraints apply to both.

I'm providing references here to what we are optimizing. For ResNet and BERT, we are not optimizing the model itself; we are optimizing the training optimizer, namely LARS and LAMB. These reference papers describe the optimizers used for ResNet and BERT and provide more details for you. You can think of it as optimizing the optimizer. The hyperparameters are highlighted in these papers, so I will not go over them here. Basically, we are optimizing the scheduler and the weight decay: the optimizer has a schedule with several hyperparameters, and we optimize those. So I will provide the details about those hyperparameters as we use them, and I will also be discussing the home-grown manual grid search.

We have four hyperparameters. We have the number of epochs, which we call the training epochs. Then we have the number of warm-up epochs. That is important for the scheduler, which ramps up the learning rate during warm-up to its maximum value. The training epochs and the warm-up epochs are integers. We have two other hyperparameters: the base learning rate and the weight decay. The base learning rate is unconstrained, basically any real number. But MLPerf puts a constraint on the weight decay, as shown: it has to be a constant times 2 to the power n, where n is any integer. So that's the constraint. The training epochs and the warm-up epochs obviously must be positive integers.

Now, what we measure, the response: we measure the evaluation accuracy, and this is a float. It's a percentage, or a fraction, however you want to think about it. The target is set by MLPerf: 0.759 for ResNet50. When the evaluation reaches that accuracy, we read the epoch number, and that is what we call the convergence epochs: how long it took us to reach that accuracy. Obviously our goal here is to minimize that number. We would like it to be shorter because that affects how long we train; the training time is directly correlated with it.

The manual search is a standard manual grid search, with the standard exploration and exploitation. You start with a list of values for the training epochs, the warm-up epochs, the base learning rate, and the weight decay, and loop over them. For each combination, you submit a run to the cluster and get the results back. The results tell you what accuracy you got, and from that you identify the convergence epoch number. You evaluate after you do a bunch of those runs, and you adjust the boundaries and the increments. So basically, you adjust the search, and you keep doing that until you don't get a better, that is shorter, convergence epoch.
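The manual grid search described above can be sketched as follows. This is a minimal illustration, not the team's launcher: `fake_convergence_epoch` is a hypothetical synthetic stand-in for submitting a run to the cluster, the candidate values are invented, and the weight decay constant `1e-4` is an assumption (the talk only says the value must be a constant times a power of two).

```python
import itertools
import math

def weight_decay_grid(exponents, base=1e-4):
    """Weight decay values of the form base * 2**n; base is an assumed constant."""
    return [base * 2.0 ** n for n in exponents]

def fake_convergence_epoch(train_epochs, warmup_epochs, base_lr, weight_decay):
    """Synthetic stand-in for a cluster run (not real training).

    Returns the epoch at which the target accuracy is reached, or None if the
    run ends before reaching it.
    """
    needed = (30
              + 2 * abs(warmup_epochs - 3)
              + 4 * abs(base_lr - 9.0)
              + 1e4 * abs(weight_decay - 2e-4))
    needed = math.ceil(needed)
    return needed if needed <= train_epochs else None

def grid_search(train_epochs_list, warmup_list, lr_list, wd_list,
                run=fake_convergence_epoch):
    """Loop over every combination and keep the shortest convergence epoch."""
    best = None
    for te, wu, lr, wd in itertools.product(
            train_epochs_list, warmup_list, lr_list, wd_list):
        epoch = run(te, wu, lr, wd)
        if epoch is None:
            continue  # this run never reached the target accuracy
        if best is None or epoch < best[0]:
            best = (epoch, {"train_epochs": te, "warmup_epochs": wu,
                            "base_lr": lr, "weight_decay": wd})
    return best

best_epoch, best_params = grid_search(
    train_epochs_list=[35, 40],
    warmup_list=[2, 3, 5],
    lr_list=[8.0, 9.0, 10.0],
    wd_list=weight_decay_grid([-1, 0, 1, 2]),
)
print(best_epoch, best_params)
```

In practice each pass would be followed by a manual step: inspect the results, tighten the value lists around the best combination, and run the loop again until the convergence epoch stops improving.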
With that, we were able to lower the convergence epochs by about 20% from the known value. For compute resources, we used about 85,000 Gaudi hours; Gaudi is our neural network accelerator. We ran on two clusters: we ran on Kubernetes and on bare metal, and it is basically the same if you have no control over the scheduler. We thought that was very good: we lowered the runtime by 20%. Then we decided to use an optimizer, and we did not know whether SigOpt would give us a better number than another optimizer, or even improve on the 20%.

We chose SigOpt because it is an API that was easily integrated. We have a lot of launching scripts, and we did not want to deal with any other install on our system. So it was an easy integration: just a few Python lines in our launchers, and run. We also found out that SigOpt supports parallel evaluation. That is really important for speeding up our runs: if we have enough resources, we can run several runs in parallel, report the results, and speed up the search.
Once we started using SigOpt, every time we got stuck we got very good customer support, with quick turnaround and examples, which got us going fast. That is why things moved quickly. I just wanted to share this experience with you.

Now, how we run. This is our run and deployment setup. On the right-hand side of the slide, you will find that we have a big cluster sitting somewhere; for us, it's probably somewhere in the cloud. And we have a launcher, and that's the place we launch our scripts from. That part is the same for the manual search and for SigOpt; it's exactly the same. With SigOpt, which also sits in the cloud, all we have to do is get a suggestion from SigOpt. The launcher will go ahead and pack the suggested parameters for the run, for ResNet for example, and launch the job on the cluster. We read the results back, relay them to SigOpt, and keep going. So that's the cycle of our measurement. As for the details of our SigOpt implementation, I will turn it over to Evelyn Ding to walk through those details.

Okay. Thank you, Basem. Hello, everyone. My name is Evelyn, and I'm a senior machine learning engineer at Habana Labs. In the next few slides, I'm going to talk about the details of our implementation with SigOpt. On this slide, we developed a process flowchart to show the software execution plan from the user's perspective. First, we start SigOpt, on the left-hand side. Then we initialize SigOpt by defining a few important parameters, listed below. The first one is the hyperparameter initial boundaries. The next one is the evaluation metric threshold, and also the experiment budget; for our implementation, we use a budget of 100 for each experiment. The next one is the computation resource, which depends on the availability and the number of accelerators we are planning to use. The last one is the parallel run budget. For example, if we plan to use 64 accelerators, we can define the parallel bandwidth as two; then we can run two sets of 32 accelerators in the experiment, which helps us speed up the experiment. Then we create and configure the experiment. From this point, this is our main execution plan, and the plot is divided into two major loops. The one inside the center dashed line is our inner loop, and the bigger one, inside the other dashed line, is our outer loop.
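The experiment settings listed above can be collected in a small configuration object. This is only a sketch: the class name `ExperimentConfig`, its field names, and the example boundary values are hypothetical and are not part of SigOpt's API. The budget of 100 and the 64-accelerator, two-way parallel example come from the talk.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    """Hypothetical container for the settings defined before an experiment."""
    bounds: dict                  # initial (low, high) boundary per hyperparameter
    target_accuracy: float        # evaluation metric threshold
    budget: int = 100             # number of runs per experiment
    total_accelerators: int = 64  # depends on what is available
    parallel_runs: int = 2        # how many runs execute at the same time

    def accelerators_per_run(self):
        """Split the accelerators evenly across the parallel runs."""
        if self.total_accelerators % self.parallel_runs != 0:
            raise ValueError("accelerators must divide evenly across parallel runs")
        return self.total_accelerators // self.parallel_runs

config = ExperimentConfig(
    bounds={"base_lr": (1.0, 20.0), "warmup_epochs": (1, 10)},  # illustrative values
    target_accuracy=0.759,
)
print(config.accelerators_per_run())
```

Keeping the parallel setting next to the accelerator count makes the trade-off explicit: more parallel runs finish the budget sooner, but each run gets fewer accelerators.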
In the inner loop, which is the interface between the user and SigOpt, we run the experiment. We fetch the hyperparameter suggestions from SigOpt. Then we execute those hyperparameters by training our deep learning model, and evaluate whether the performance passes the metric threshold or not. Then, continuing on to the left, we transfer the metrics back to SigOpt. During this process, we track progress from the SigOpt dashboard. Then we get to the point where we evaluate whether we have reached the maximum budget or not. If not, we continue running this loop, iterating until we reach the maximum budget that we defined.

Out of the inner loop, we start the outer loop. The main purpose of this outer loop is to further adjust the boundaries, to further narrow down the search space and expedite the search process. Once we get into the outer loop, the first condition we evaluate is whether the difference between the new convergence epochs and the old convergence epochs is more than 1% or not. If it is more than 1%, we continue in the outer loop. The next step is to transfer the new convergence epochs to become the old convergence epochs, to prepare for the next experiment. Then we continue to the next step: we developed our own algorithm, using the K-means unsupervised machine learning algorithm as an evaluator block, to further adjust the boundaries to send to the next experiment.
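The two loops described above can be sketched as follows. This is a simplified illustration under several assumptions: `suggest` is a uniform random sampler standing in for the optimizer's suggestion call (it is not SigOpt's API), `train_and_evaluate` returns synthetic convergence epochs instead of training a model, hyperparameters are treated as continuous for brevity, and `narrow_bounds` is a placeholder that shrinks each range around the best point rather than the K-means evaluator block.

```python
import random

def suggest(bounds, rng):
    """Stand-in for the optimizer's suggestion call: uniform random sampling."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}

def train_and_evaluate(params):
    """Stand-in for a training run; returns synthetic convergence epochs."""
    return (30.0
            + 4.0 * abs(params["base_lr"] - 9.0)
            + 2.0 * abs(params["warmup_epochs"] - 3.0))

def run_experiment(bounds, budget, rng):
    """Inner loop: suggest, train, record the metric, until the budget is spent."""
    observations = []
    for _ in range(budget):
        params = suggest(bounds, rng)
        observations.append((train_and_evaluate(params), params))
    return observations

def narrow_bounds(bounds, best_params, shrink=0.5):
    """Placeholder evaluator: shrink each range around the best point found."""
    new_bounds = {}
    for name, (lo, hi) in bounds.items():
        half = (hi - lo) * shrink / 2.0
        center = best_params[name]
        new_bounds[name] = (max(lo, center - half), min(hi, center + half))
    return new_bounds

def optimize(bounds, budget=100, tolerance=0.01, max_outer=10, seed=0):
    """Outer loop: rerun experiments until the best result changes by <= 1%."""
    rng = random.Random(seed)
    epochs_old = None
    history = []
    for _ in range(max_outer):
        observations = run_experiment(bounds, budget, rng)
        epochs_new, best_params = min(observations, key=lambda item: item[0])
        history.append(epochs_new)
        if epochs_old is not None and abs(epochs_old - epochs_new) / epochs_old <= tolerance:
            break  # no meaningful change between experiments: stop
        epochs_old = epochs_new  # the new value becomes the old one
        bounds = narrow_bounds(bounds, best_params)
    return history, bounds

history, final_bounds = optimize(
    {"base_lr": (1.0, 20.0), "warmup_epochs": (1.0, 10.0)}, budget=50, seed=1)
print(history, final_bounds)
```

The `max_outer` cap is an addition for safety so the sketch always terminates; the talk only describes the 1% stopping condition.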
The whole outer loop is iterated a number of times, until we reach the condition where the new convergence epochs cannot be improved anymore. Then we get out of the outer loop, save all the data points and the last best suggestion from SigOpt, and we are ready to stop SigOpt.

Okay, in this slide and the next slide, I'm going to give some details of how we built our evaluator block using the K-means clustering method. In the top-left corner: after we run SigOpt for the experiment, we collect all the data points, with the hyperparameters and also the metrics. Then we pick only the data points that meet the accuracy target, and also pick the top 75% of the data points by convergence epochs. Then we split all these good data points of the hyperparameters into different clusters. The baseline for dividing into different clusters is that, for each cluster, we keep at least 10 observations per cluster.
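The filtering and clustering steps above can be sketched as follows, together with the selection of the cluster with the lowest mean that the talk describes next. This is an interpretation, not the team's code: the K-means routine is a plain pure-Python Lloyd's algorithm, the data layout `(params, accuracy, convergence_epochs)` is hypothetical, and the rule "try fewer clusters until each has at least 10 observations" is one reading of the minimum-size condition.

```python
import random
from statistics import mean

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(mean(dim) for dim in zip(*members))
    return labels

def best_cluster(runs, target_accuracy, keep_fraction=0.75, min_size=10, max_k=4):
    """runs: list of (params_tuple, accuracy, convergence_epochs)."""
    good = [r for r in runs if r[1] >= target_accuracy]  # met the accuracy target
    if not good:
        raise ValueError("no run reached the target accuracy")
    good.sort(key=lambda r: r[2])                        # shortest convergence first
    good = good[: max(1, int(len(good) * keep_fraction))]
    points = [r[0] for r in good]
    # Try fewer clusters until every cluster is large enough (falls back to k=1).
    for k in range(min(max_k, len(points)), 0, -1):
        labels = kmeans(points, k)
        clusters = [[r for r, label in zip(good, labels) if label == c]
                    for c in range(k)]
        if all(len(cluster) >= min_size for cluster in clusters):
            break
    # Best cluster: lowest mean convergence epochs.
    return min(clusters, key=lambda cluster: mean(r[2] for r in cluster))

# Synthetic example: two groups of runs plus a few that miss the accuracy target.
runs = [((1.0 + 0.01 * i, 1.0), 0.76, 30.0 + 0.1 * i) for i in range(20)]
runs += [((10.0 + 0.01 * i, 10.0), 0.76, 40.0 + 0.1 * i) for i in range(20)]
runs += [((50.0, 50.0), 0.70, 20.0) for _ in range(5)]
winner = best_cluster(runs, target_accuracy=0.759, max_k=2)
print(len(winner))
```

Real hyperparameters have very different scales (a weight decay near 1e-4 next to a much larger learning rate), so in practice the points would need to be normalized before clustering; that step is omitted here for brevity.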
Then we select the best cluster, which is highlighted in the black box on the right-hand side, using the criterion of the minimum runtime mean of the cluster. The bottom-left chart is a box plot, and it is a good illustration that lets us easily see which cluster has the minimum runtime mean. That's the one we select as our best cluster, for example.

In the next slide, we continue with our evaluator block. In the top-left corner, once we select the best cluster, we use the boundary of this cluster, shown in the red circle. Translating this red circle into the search space on the right, we can see the space between the red brackets; that is our search space from our best cluster. We compare this range with the SigOpt recommended range, which is highlighted in light blue. Of the red and the black brackets, the black bracket is the widest one, which is our maximum hyperparameter boundary. So we adjust the SigOpt recommended boundary, the light blue range, between the red and the black.

The bottom-left part shows the logic of how we adjust the boundary. It follows two paths. On the right path, we compare the best cluster's upper boundary with the SigOpt upper boundary, to see whether they are close to each other or not. If the difference is less than alpha, which is a hyperparameter (in our case we use 30%), that is, if it is less than a 30% difference, we move the SigOpt upper boundary toward the maximum hyperparameter boundary, toward the right. The second path compares the best cluster's lower boundary with the SigOpt lower boundary. If it is far away, by more than the same 30%, then we move the SigOpt lower boundary toward the lower boundary of the best cluster; we shift the left-hand side toward the best cluster's lower boundary. This is the way our evaluator block adjusts the SigOpt boundary, so that we further narrow down the search space to help expedite the search process. And this is what we prepare to send to the next experiment.

We really like this feature of SigOpt, because it gives the user the flexibility to define their own optimization method on top of SigOpt for different implementations. With that, I'm going to hand the presentation back to Basem, and he is going to give some comparison of the results and provide the conclusion at the end. Thank you.

Thank you, Evelyn. I would like to share some insight about some of the run results. I would like to say what we like about the interface. We like that we can see and monitor the history of the runs and the results. Of course, we get all of those in files and log files, but it is nice to see them all together. And we like the clear visualization of the metrics, so we can see the dependencies, how they correlate, and how they interact with each other.

As for the results, as Evelyn explained, we ran a hundred runs and then adjusted the SigOpt range, that is, what we tell SigOpt the boundary range is for those parameters. We adjusted it, and then we ran a new set of experiments. We got an improvement compared to the manual search. Then we ran another round of a hundred runs and reached a 6% improvement. Then we ran another hundred, and we didn't get any further gain, so we stopped the cycle, as Evelyn discussed. As for how much computational resources we used: we used about 21,000 Gaudi hours for the entire search. We thought this was very good for the computational cost as well as the improvement; I mean, 6% on top of the 20% is not bad.

I would like to go ahead and compare the two side by side: the manual grid search runs we did and the search with SigOpt. Before I start comparing, I would like to mention that I'm not comparing apples to apples. When we started with our manual grid search, we started from a really wide range. We started with known values, and we went ahead and pushed it down as far as we were able to with that method, the manual one. By then the boundaries were pretty tight, since we knew in advance where the good boundaries are. So with the SigOpt search, we were really challenging SigOpt at that point to find better results. I mean, the methods are not the same, although the objectives were the same; we started with one and we let SigOpt finish.

Nevertheless, I would like to go ahead and compare the two, because of what we found out: SigOpt explores the space using an algorithm that locks quickly onto the minimum, much better than our manual explorations. That translated into a 6% reduction on top of the already optimized search space. That is also a reduction in runtime, so it basically goes straight to the bottom line. We also felt it made better use of resources. We used 85,000 Gaudi hours for the manual search; nevertheless, we were happy with the resource consumption at that point.

So, what's next? Obviously, we provided feedback to SigOpt about our experience. We have already done that, and we continue to talk to them about it. We are looking forward to updating to the new version, whose features should give us a better implementation and deployment. We would also like to evaluate a more automated workflow on more workloads, and we would like to use it on more workloads with more analysis and metrics. With that, I would like to thank you. That's the conclusion of this talk. Thank you.
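The boundary-adjustment rule that Evelyn describes in the talk (comparing the best cluster's range with the optimizer's current range using alpha = 30%) can be sketched as follows. Several details are assumptions because the talk does not state them: the difference is measured relative to the width of the current range, and a boundary that moves is moved all the way to its target. Only the two paths described in the talk are implemented, and the function name `adjust_bounds` is hypothetical.

```python
def adjust_bounds(current, cluster, absolute, alpha=0.30):
    """Adjust one hyperparameter's (low, high) range for the next experiment.

    current:  range used by the optimizer in the last experiment
    cluster:  range spanned by the best cluster
    absolute: widest allowed range for this hyperparameter
    """
    cur_lo, cur_hi = current
    clu_lo, clu_hi = cluster
    abs_lo, abs_hi = absolute
    width = cur_hi - cur_lo
    if width <= 0:
        raise ValueError("current range must have positive width")

    new_lo, new_hi = cur_lo, cur_hi
    # Path 1: the best cluster sits close to the upper boundary, so widen the
    # upper boundary toward the maximum allowed value.
    if abs(cur_hi - clu_hi) / width < alpha:
        new_hi = abs_hi
    # Path 2: the best cluster starts far above the lower boundary, so raise
    # the lower boundary to the cluster's lower boundary.
    if abs(clu_lo - cur_lo) / width > alpha:
        new_lo = clu_lo
    return new_lo, new_hi

print(adjust_bounds(current=(2.0, 10.0), cluster=(6.0, 9.5), absolute=(0.0, 20.0)))
print(adjust_bounds(current=(2.0, 10.0), cluster=(3.0, 5.0), absolute=(0.0, 20.0)))
```

In the first call both paths fire: the cluster hugs the upper boundary and starts well above the lower one, so the range shifts to the right. In the second call neither condition holds and the range is left unchanged.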