About the talk
DLRM (Deep Learning Recommendation Model) is a deep learning-based recommendation model introduced and open-sourced by Facebook. It is one of the state-of-the-art models and part of the MLPerf training benchmark. The DLRM workload poses unique challenges for single-socket and multi-socket distributed training because it has to balance a mixture of compute-bound, memory-bound, and I/O-bound operations. To tackle this, we implemented an efficient scale-out solution for DLRM training on Intel Xeon clusters that includes innovative data and model parallelization, new hybrid splitSGD + LAMB optimizers, efficient hyperparameter tuning for model convergence with a much larger global batch size, and novel data loader techniques to support scale-up and scale-out. According to the MLPerf v1.0 training results, we can train DLRM with 64 Xeon Cooper-Lake 8376H processors in 15 minutes, a 3X improvement over our MLPerf v0.7 submission with 16 Xeon Cooper-Lake 8380 processors. In this talk, Ke will discuss DLRM, the unique challenges associated with it, and the optimizations that drive training performance acceleration.
Ke has 16 years of experience in machine learning and platform software development at Intel. He is currently a Principal AI Engineer and Engineering Director in the Machine Learning Performance group under the Intel Software and Advanced Technology Group, responsible for applied machine learning end-to-end workload development, framework optimization, and new AI technology exploration for the latest Intel Xeon CPU and upcoming discrete GPU platforms. He has 20+ granted patents in the domains of machine learning, multimedia, and context awareness.
Hi, good afternoon. It's great to be here and to give a talk at the SigOpt Summit. My name is Ke. I am a principal engineer at Intel, and I am currently leading the Machine Learning Performance (MLP) team in the Software and Advanced Technology Group. My talk today is about faster DLRM training for recommendation systems, based on our MLPerf v1.0 training submission this year.
Here is today's agenda. First, I will give an overview of recommendation systems and, in particular, of Facebook's DLRM. In the second part, I will talk about our MLPerf v1.0 training work, including data and model parallelization, an efficient data loader, and the optimizers. After that, I'll talk about how to use SigOpt hyperparameter optimization to improve model convergence. I will conclude the talk with a summary.
If you work on AI workloads, you probably already know that recommendation is one of the biggest AI workloads. Things like personalized Amazon order recommendations, nearby restaurants, and vacation plans all run on recommendation systems that generate real-time, personalized results for you. A typical recommendation system takes two types of data as input. One is the so-called numerical dense features, for example age, number of purchases per month, or active minutes per day in an application. The other is categorical sparse features, for example gender or product ID, where the number of categories can be very large. Both the dense and the sparse input data pass through the recommendation model, which generates an output. The most commonly used output of a recommendation system is the click-through rate: given all the dense and sparse context features, it predicts whether the user will end up accepting the recommendation or not.
You can already imagine why it is so important to have the best recommendation system, because it directly drives the business. People have been working very hard in this domain, and as a result there are many state-of-the-art models for recommendation systems, such as Wide & Deep. There are also many industry benchmarks; popular examples include Criteo and MLPerf.
Recommendation systems are this important, yet there are many technical challenges to solve. Every accuracy improvement translates directly into business value, so the accuracy requirement is very high. At the same time, the recommendation models are very big and the datasets are huge in order to cover more scenarios. Retraining is another challenge: to get the most up-to-date models based on user data, the models have to be retrained, so the question is how to optimize training to get the best scale-out performance.
DLRM is a state-of-the-art model and part of the MLPerf training benchmark. At a high level, the model is actually quite straightforward. The dense features go to the bottom MLP, and the sparse features go to the embedding lookup tables. After that, the two interact and are fed into the top MLP, which outputs the click-through rate.
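The data flow just described (bottom MLP for dense features, embedding lookups for sparse features, an interaction step, then a top MLP) can be sketched in a few lines of plain Python. This is a minimal illustration, not the MLPerf model: the single-layer MLPs, the tiny sizes, the random weights, and the pairwise dot-product interaction are all assumptions made for the sketch, and the names `dlrm_forward`, `init_linear`, and `params` are hypothetical.

```python
import math
import random

def linear(x, w, b):
    """Dense layer: y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_linear(rng, n_in, n_out):
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def dlrm_forward(dense, sparse_ids, params):
    # Bottom MLP maps the dense features into the embedding dimension.
    z = relu(linear(dense, *params["bottom"]))
    # One embedding lookup per categorical feature (one table per feature).
    vectors = [z] + [table[i] for table, i in zip(params["tables"], sparse_ids)]
    # Interaction (assumed here): dot product between every pair of vectors.
    pairs = [sum(a * b for a, b in zip(vectors[i], vectors[j]))
             for i in range(len(vectors)) for j in range(i)]
    # Top MLP consumes the dense vector plus the interactions and emits a logit.
    logit = linear(z + pairs, *params["top"])[0]
    return 1.0 / (1.0 + math.exp(-logit))  # predicted click probability

rng = random.Random(0)
dim, n_dense, table_sizes = 4, 3, [5, 3, 7]   # toy sizes, not the real model
n_vec = 1 + len(table_sizes)
params = {
    "bottom": init_linear(rng, n_dense, dim),
    "tables": [[[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]
               for n in table_sizes],
    "top": init_linear(rng, dim + n_vec * (n_vec - 1) // 2, 1),
}
ctr = dlrm_forward([0.5, 1.0, 2.0], [4, 0, 6], params)
```

Even at this toy scale the three kinds of work are visible: the MLPs are arithmetic-heavy, the table lookups are memory accesses, and once the tables live on different ranks the lookups turn into communication.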
Each block in this model has its own characteristics, as color-coded on the slide: some are communication-bound, some compute-bound, and some memory-bound. The training dataset is huge, about 1.2 TB, and when running on a single instance the required memory is more than 150 GB. On the other side, the 26 embedding tables are imbalanced: the largest tables have more than 14 million entries, while the smallest ones have only three or four entries. All of this creates unique challenges for scalability. The training needs to balance compute, memory, and communication, and the model is so large that we need to use both data and model parallelization in order to speed up the training. However, when we do so, we are increasing the global batch size and reducing the number of weight updates, which introduces another challenge for model convergence.
We implemented an efficient solution using data and model parallelization, a novel data loader, the new hybrid split SGD plus LAMB optimizers, and efficient hyperparameter tuning for model convergence. I will go through them in the following slides.
Now let's talk about data and model parallelization. It is straightforward to use data parallelization to split the dataset and speed up the time to train. The problem with this is the high memory and communication requirement caused by those large embedding tables. To reduce the communication overhead and the memory requirement on each device, we use a hybrid parallel distributed training solution.
The 26 embedding tables are divided into 10 small and 16 large embedding tables. Each model instance holds a local copy of only part of the large embedding tables: with 8 sockets, every instance holds two large embedding tables, and on a 16-socket system every instance holds only one. All-to-all communication is used to exchange embedding information between ranks. The top and bottom MLPs and the 10 small embedding tables are kept on every instance, and allreduce communication is used to average their gradients between ranks.
The limitation here is that the number of instances cannot exceed the number of large embedding tables. We only have 16 large embedding tables, which means that we cannot scale to more than 16 instances. To improve the scaling, we use vertical split embedding for model parallelization. With vertical split, an embedding table is split vertically into multiple embedding tables with the same number of entries as the original, and each of them holds a subset of the columns of the original table. As a result, each model instance holds one of the split tables, and all-to-all communication is still used to exchange the lookup results. Each split table is looked up with the global batch, and with, say, eight model instances, the eight partial entries that belong to the same original embedding table are put back together.
From this we can clearly see the advantage of vertical split embedding over the plain hybrid approach. By splitting the big tables, we reduce the communication overhead of the model-parallel embedding tables, which reduces the time to train and allows more efficient scaling to more model instances. It also reduces the memory requirement. The 26 embedding tables require about 100 GB of memory for single-node training. With the simple hybrid data and model parallelization that drops, but each rank still requires more than 20 GB. With the solution just described, the per-rank requirement drops much further. Vertical split embedding is therefore also a general solution for training workloads with oversized embedding tables.
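To make the vertical split concrete, here is a minimal sketch in plain Python. It assumes the embedding width divides evenly by the number of model instances, and it replaces the all-to-all exchange with an in-process concatenation; `vertical_split`, `lookup`, and `sharded_lookup` are hypothetical names, not functions from the actual implementation.

```python
def vertical_split(table, num_shards):
    """Split a table (list of rows) column-wise into equally wide shards."""
    dim = len(table[0])
    assert dim % num_shards == 0, "sketch assumes the width divides evenly"
    width = dim // num_shards
    return [[row[s * width:(s + 1) * width] for row in table]
            for s in range(num_shards)]

def lookup(table, indices):
    return [table[i] for i in indices]

def sharded_lookup(shards, indices):
    # Every shard (one per model instance) looks up the whole global batch...
    partial = [lookup(shard, indices) for shard in shards]
    # ...and the column slices of each sample are concatenated again, which
    # stands in for the all-to-all exchange between ranks.
    return [sum((p[b] for p in partial), []) for b in range(len(indices))]

# Toy table: 6 entries, 8 columns; entry r holds the values r*10 + column.
table = [[r * 10 + c for c in range(8)] for r in range(6)]
shards = vertical_split(table, num_shards=4)
global_batch = [5, 0, 3, 3]
rebuilt = sharded_lookup(shards, global_batch)
```

Each shard keeps all six entries but only two of the eight columns, so the per-instance memory shrinks with the number of shards while the number of shards is no longer tied to the number of tables.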
So far, I have described how the vertical split embedding table lets us scale to more ranks and reduces the memory footprint. With model parallelization, however, each rank needs to look up global-batch-size entries, which means that each rank would also need to read the global-batch-size input. That is a potential bottleneck on the storage system. Our data loader optimization is proposed to reduce this: each rank only reads its local batch of input, which is a fraction of the global-batch-size input, and then uses all-to-all communication to get the global-batch-size input it needs.
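The data loader idea can be simulated without any communication library. The sketch below is an illustration under simplifying assumptions: one embedding table per rank, equal-sized local batches, and an in-process `all_to_all` function that only mimics the semantics of the collective. All names are hypothetical.

```python
def read_local_batch(global_batch, rank, world_size):
    """Each rank reads only its contiguous slice of the global batch."""
    per_rank = len(global_batch) // world_size
    return global_batch[rank * per_rank:(rank + 1) * per_rank]

def all_to_all(send):
    """Simulated all-to-all: recv[dst][src] is what src sent to dst."""
    world_size = len(send)
    return [[send[src][dst] for src in range(world_size)]
            for dst in range(world_size)]

world_size = 4
# 8 samples, each with one categorical index per table; table t lives on rank t.
global_batch = [[100 * t + s for t in range(world_size)] for s in range(8)]

send = []
for rank in range(world_size):
    local = read_local_batch(global_batch, rank, world_size)
    # Bucket the local indices by the rank that owns each table.
    send.append([[sample[t] for sample in local] for t in range(world_size)])

recv = all_to_all(send)
# Each rank now holds the indices for its own table across the global batch.
indices_per_rank = [sum(chunks, []) for chunks in recv]
```

Every rank reads only two of the eight samples from storage, yet after the exchange it has the eight indices it needs for the table it owns.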
I don't want to go into detail on this because of the time constraint, but if anybody is interested in knowing more, we can discuss it offline.
The next optimization area is the optimizers. In order to support large scale-out training on the cluster, we need to make certain improvements: to speed up the time to train, in particular by using a lower-precision format, BF16, for performance, and also to ensure proper model convergence at large scale-out.
So let's talk about BF16 support using split SGD. BF16 computation requires only half of the memory size. However, we have to keep an FP32 master weight in order to maintain the accuracy necessary for model convergence. You can see that this gives us a memory consumption challenge, because it actually requires more memory than FP32-based training. The solution here is to use the split version of the SGD optimizer, proposed by our Intel Labs team. It exploits the fact that BF16 only sacrifices part of the mantissa, so we only need to store the lower 16 mantissa bits of the original FP32 weight and then combine the higher and lower bits together to form the 32-bit FP32 number. This way we keep the full-precision master weight and also reduce the memory consumption and traffic.
The default SGD works perfectly for DLRM when the scale-out is not large. When the system scales out to more nodes, either you keep the global batch size and reduce the local batch size, which is not good for compute efficiency, or you increase the global batch size and scale out more.
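The bit-level idea behind split SGD, as described above, can be tried out with Python's `struct` module. This is a sketch under assumptions: it handles one scalar weight, uses plain SGD for the update, and adds a round-to-nearest BF16 baseline (`round_to_bf16`) purely for comparison. The function names are hypothetical and not taken from the Intel implementation.

```python
import struct

def f32_bits(x):
    """Bit pattern of x after rounding it to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits):
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def split(x):
    """FP32 value -> (upper 16 bits, i.e. a BF16 number, lower 16 bits)."""
    bits = f32_bits(x)
    return bits >> 16, bits & 0xFFFF

def combine(hi, lo):
    """Rebuild the FP32 master weight from its two 16-bit halves."""
    return bits_f32((hi << 16) | lo)

def split_sgd_step(hi, lo, grad, lr):
    """Plain SGD on the rebuilt FP32 weight, then split the result again."""
    return split(combine(hi, lo) - lr * grad)

def round_to_bf16(x):
    """Baseline without the low half: round to the nearest BF16 value."""
    bits = f32_bits(x)
    bits += 0x7FFF + ((bits >> 16) & 1)
    return bits_f32(bits & 0xFFFF0000)

w0 = bits_f32(f32_bits(0.1))     # 0.1 as an FP32 number
hi, lo = split(w0)
bf16_only = round_to_bf16(w0)    # weight kept in BF16 with no low half

for _ in range(1000):            # 1000 small updates of size 1e-6
    hi, lo = split_sgd_step(hi, lo, grad=1.0, lr=1e-6)
    bf16_only = round_to_bf16(bf16_only - 1e-6 * 1.0)

split_result = combine(hi, lo)   # close to w0 - 1e-3
```

The high half can be handed to BF16 compute as is, while the low half lives in the optimizer state. In this toy run the split weight tracks the FP32 result, whereas the BF16-only weight never moves because each update is smaller than half a BF16 step. That is why some form of full-precision master weight is needed.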
The latter case, increasing the global batch size, introduces a convergence challenge. When we increase the batch size from 32K to larger sizes, the default SGD optimizer has trouble converging: increasing the learning rate for a larger batch size makes training extremely unstable, and scaling the learning rate only works up to a certain batch size. This is the reason for introducing the LAMB optimizer. LAMB is a layer-wise adaptive version of the Adam optimizer, proposed by Google Brain, which brings in a new concept called trust ratio. With it, learning rate warm-up is not needed, or does not help much, because the learning rate is scaled by the trust ratio. The trust ratio is defined as l1 divided by l2, where l1 is the norm of the weights and l2 is the norm of the Adam update for the layer, and the learning rate is multiplied by this trust ratio. What it means is that when the weights are small, we take small steps so as not to make big mistakes initially, and when the weights are large, we take bigger steps in order to accelerate convergence. The reason is that we want the gradient and the weight to be in a similar dynamic range: when the batch size is large, the gradient on a mini-batch is sometimes much larger or much smaller than the weights. This is exactly the phenomenon that Google Brain observed, and they figured out a way to solve it. In summary, LAMB is just another level of adaptation on top of Adam: layer-wise adaptation plus adaptive moments, which is useful for convergence in large-batch training. So we have split SGD for BF16 and memory optimization, and the LAMB optimizer for a large global batch size.
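The trust ratio is easy to see in a single-layer sketch. The code below follows the commonly published form of the LAMB update (Adam moments with bias correction, then a step scaled by the ratio of the weight norm to the update norm). The hyperparameter defaults and the function name `lamb_step` are assumptions for illustration, not the settings used in the submission.

```python
import math

def l2_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def lamb_step(w, g, m, v, step, lr=0.01, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB update for a single layer; returns (w, m, v, trust_ratio)."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** step) for mi in m]
    v_hat = [vi / (1 - beta2 ** step) for vi in v]
    # Adam direction (plus optional weight decay) for this layer.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    w_norm, u_norm = l2_norm(w), l2_norm(update)
    # Trust ratio: norm of the weights over norm of the Adam update.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v, trust

grad = [0.5, -0.5, 0.5, -0.5]
small = lamb_step([0.01] * 4, grad, [0.0] * 4, [0.0] * 4, step=1)
large = lamb_step([1.0] * 4, grad, [0.0] * 4, [0.0] * 4, step=1)
```

With the same gradient, the layer with small weights gets a trust ratio of about 0.01 and the layer with large weights about 1.0, so the effective step follows the scale of each layer's weights.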
With both in hand, we form hybrid optimizers in order to get all the benefits. LAMB is applied to the dense layers, including the top and bottom MLPs, which are smaller than the embedding layers, and split SGD is applied to the big sparse embedding layers. The code snippet in the figure is straightforward. By doing this we optimize for memory consumption, convergence, and performance, which leads to a speedup in the time-to-train metric that MLPerf cares about most.
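The slide's code snippet is not part of this transcript, so the following is only a sketch of how such a hybrid could be wired. It routes each parameter by name to one of two update rules. Both rules are simplified stand-ins (sparse SGD on touched rows instead of split SGD, and a momentum-free trust-ratio step instead of full LAMB), and `HybridOptimizer` and its arguments are hypothetical.

```python
import math

def sgd_rows(table, row_grads, lr):
    """Sparse SGD: only the embedding rows touched by the batch are updated."""
    for row, grad in row_grads.items():
        table[row] = [w - lr * g for w, g in zip(table[row], grad)]

def trust_ratio_step(weights, grad, lr):
    """Dense stand-in for LAMB: a step scaled by ||w|| / ||g|| (no moments)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grad))
    trust = w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
    return [w - lr * trust * g for w, g in zip(weights, grad)]

class HybridOptimizer:
    """Routes each parameter to the update rule chosen for its kind."""
    def __init__(self, params, is_sparse, dense_lr, sparse_lr):
        self.params, self.is_sparse = params, is_sparse
        self.dense_lr, self.sparse_lr = dense_lr, sparse_lr

    def step(self, grads):
        for name, grad in grads.items():
            if self.is_sparse(name):
                sgd_rows(self.params[name], grad, self.sparse_lr)
            else:
                self.params[name] = trust_ratio_step(
                    self.params[name], grad, self.dense_lr)

params = {
    "emb.0": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # embedding table
    "top_mlp.weight": [3.0, 4.0],                    # flattened dense weights
}
opt = HybridOptimizer(params, lambda n: n.startswith("emb."),
                      dense_lr=0.1, sparse_lr=0.5)
opt.step({"emb.0": {1: [0.2, -0.2]}, "top_mlp.weight": [0.6, 0.8]})
```

Only embedding row 1 changes in the sparse table, which keeps the update cheap for very large tables, while the dense weights take a single norm-scaled step.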
And now let's talk about hyperparameter tuning for the model. In practice, we have many design choices that need to be finalized, and there are many hyperparameters to choose. So even if you train a state-of-the-art model, when you retrain it or fine-tune it for a downstream task, or when it has to be optimized for a different setup, you cannot use the original hyperparameter set, and you will need to tune a new one in order to meet your business requirements, such as accuracy, or training with limited resources. To speed up the training, you now have a bigger batch size, you have to use a new precision format such as BF16, sometimes you change the optimizer, as I just described, and sometimes you want to change the parameters further to get a better time to train. This leads to a very strong requirement for a good hyperparameter optimization tool to serve the data scientist.
What do I expect from an HPO tool? First, it should provide a flexible search space definition, supporting different data types, constraint conditions, and so on.
It should also cope with a large search space in order to tackle complicated problems. Second, the search is expected to be built on an efficient model, so that it can converge and find the global optimum much quicker; this is where black-box optimization has an advantage. Also, in order to speed up the execution, a parallel execution feature is required, as well as early stopping of unpromising trials, because by comparing with earlier results we can already predict whether the current one is going to be much worse or much better. In some situations, the optimization may contain multiple objectives, such as optimizing performance and accuracy together, so multi-objective optimization is required. And last but not least, nice support for summaries and insights is needed.
Luckily, we have SigOpt, which is part of Intel. It is a service that can support all the requirements I just mentioned. So let's talk in more detail about how I used SigOpt to help with this DLRM training challenge.
This page shows the HPO tuning process between the HPO tool and the workload. The HPO tool takes the search space definition for the tunable parameters. The workload gets a suggestion from the HPO library, runs the tuning workload with these parameters, and sends back the result as an observation, and then the optimization process repeats until it is done. The workload here is the DLRM training, and the tunable parameters are the learning rate, warm-up, decay, and optimizer choices. DLRM is a big workload, so I ran it on an internal cluster, and in this tuning exercise I used the parallel experiment feature.
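The suggest, run, and report cycle described above has the same shape regardless of the tool. The sketch below shows that shape with a random-search tuner as a plain stand-in. It is not SigOpt's API, and it does none of the model-based search the talk credits SigOpt with. The search space bounds, the toy objective in `run_workload`, and all names are made up for illustration.

```python
import random

SEARCH_SPACE = {  # hypothetical bounds, for illustration only
    "learning_rate": ("double", 0.1, 20.0),
    "warmup_steps": ("int", 0, 4000),
    "optimizer": ("categorical", ["sgd", "lamb"]),
}

class RandomSearchTuner:
    """Stand-in for an HPO service: suggest() then observe(), repeated."""
    def __init__(self, space, seed=0):
        self.space, self.rng, self.history = space, random.Random(seed), []

    def suggest(self):
        params = {}
        for name, spec in self.space.items():
            if spec[0] == "double":
                params[name] = self.rng.uniform(spec[1], spec[2])
            elif spec[0] == "int":
                params[name] = self.rng.randint(spec[1], spec[2])
            else:
                params[name] = self.rng.choice(spec[1])
        return params

    def observe(self, params, value):
        self.history.append((value, params))

    def best(self):
        return max(self.history, key=lambda item: item[0])

def run_workload(params):
    """Toy objective standing in for a training run that reports AUC."""
    penalty = abs(params["learning_rate"] - 8.0) / 100.0
    bonus = 0.002 if params["optimizer"] == "lamb" else 0.0
    return 0.80 - penalty + bonus

tuner = RandomSearchTuner(SEARCH_SPACE)
for _ in range(20):                      # observation budget
    suggestion = tuner.suggest()
    tuner.observe(suggestion, run_workload(suggestion))
best_auc, best_params = tuner.best()
```

Swapping the tuner for a real service changes how `suggest` picks the next point, not the loop around it. Because the loop body only needs a suggestion and returns one observation, several workers can also run it in parallel.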
Running experiments in parallel improves the tuning efficiency, which is super helpful for reducing the cost as well as the time to tune. Other features were used in this tuning exercise as well. I also used a pruner to stop unpromising experiments early; there are pruners such as median and percentile. SigOpt also provides very nice support for summaries and insights to help you make sense of the results.
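As an illustration of what a median-style pruner does, here is a minimal sketch of the rule: a running trial is stopped when its latest metric falls below the median of what earlier trials had reached at the same checkpoint. The curves, the `min_trials` threshold, and the function name are hypothetical, and this is not the code of any particular HPO library.

```python
from statistics import median

def should_prune(curve, completed_curves, min_trials=3):
    """Median rule: stop if the latest value is below the median of what
    earlier trials had reached at the same step (higher is better)."""
    step = len(curve) - 1
    peers = [c[step] for c in completed_curves if len(c) > step]
    if len(peers) < min_trials:
        return False          # not enough evidence yet
    return curve[-1] < median(peers)

# Hypothetical validation-AUC curves from earlier trials, one value per checkpoint.
completed = [
    [0.760, 0.785, 0.798, 0.8030],
    [0.750, 0.780, 0.795, 0.8010],
    [0.770, 0.790, 0.800, 0.8026],
]
promising = [0.765, 0.788]
unpromising = [0.740, 0.770]
decisions = (should_prune(promising, completed),
             should_prune(unpromising, completed))
```

For a workload as expensive as DLRM training, stopping a run after its second checkpoint rather than letting it finish is where most of the saving comes from.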
For this study, let me start with the problem setup: the tuning target is to optimize the AUC, so it is actually a single-metric optimization problem. In the figure on the left, SigOpt gets to the result much quicker than the other, open-source tool. The MLPerf convergence threshold is an AUC of 0.8025, and SigOpt takes far fewer experiments to exceed that threshold. Within the experiment budget the optimization improves further, and eventually SigOpt finds a very good set of hyperparameters that helps DLRM converge quicker.
The figure on the right shows the parameter importance. This is very useful for data scientists to understand which parameters should be tuned and which might not need to be. In this particular case, the learning rate and the warm-up are very important, because when you scale out, the batch size increases and you need to retune them to get good convergence. The other parameters are less important, which indicates that in the future, for similar problems, you don't need to tune them, so you can reduce the search space and save computing cycles.
So, putting together all the above optimizations, we implemented this DLRM training on a Cooper Lake cluster of multi-socket nodes. The software stack used IPEX, which stands for Intel PyTorch Extension, and the oneCCL library. In the chart, the x-axis is the number of sockets and the y-axis is the time to train in minutes. The first column shows the closed-division result, which takes about 2 hours to converge. Our MLPerf v0.7 training submission last year used a 32K batch size with the split SGD optimizer. When we used LAMB to enable a 256K global batch size and the vertical split embedding table to scale out to 64 sockets, it took 15 minutes to converge, which is a 3x speedup compared to last year's v0.7 result. This is without the upcoming Sapphire Rapids and AMX, so we are expecting more improvement.
OK, now let's summarize what I talked about today. First, we demonstrated how we optimized the DLRM training workload on a Xeon cluster to strike a balance among compute, memory, and communication and achieve efficient scale-out. And even though we have hybrid data and model parallelization, an efficient data loader, and hybrid optimizers as individual improvements, the final result could not easily be achieved without a great HPO tool. So the SigOpt HPO tool is crucial for faster and better training of the DLRM recommendation system. This concludes my talk today. Thanks so much for listening.