About the talk
For more talks and to view corresponding slides, go to scaledml.org, select [media archive].
Presented at the 5th Annual Scaled Machine Learning Conference 2020
Venue: Computer History Museum
My name is Savin and I work on the machine learning infrastructure team at Netflix. My team is tasked with providing support for all of the machine learning activities that Netflix is involved in, and over the last few years my team has been working diligently on the intersection of humans and machine learning, which led to Metaflow, which we open-sourced a while ago. Over the next 30 minutes, I'm going to talk about the philosophy behind it, as well as give a brief introduction to Metaflow. Since we are at a conference with "scaled" in its name, it makes sense to start with a hat tip to Michael Stonebraker, who
is a Turing Award winner who did amazing work in database systems. He had a very handy categorization of data systems along three dimensions: volume, velocity, and variety. Modern machine learning systems map onto these as well. By volume, I mean machine learning systems that process a huge amount of data. By velocity, I mean systems that need to train on a firehose of data or need to deliver very low-latency predictions. And then there is the third one, which
not a lot of people talk about: variety, the wide range of machine learning systems that don't really fall into either of those two buckets. Now, when it comes to Netflix, I'm sure that most of you here have at some point subscribed to our service, and this is the first thing you see when you log into Netflix on either your TV or your computer: the Netflix home page. Every single aspect that you see here is driven by an algorithm: the shows that we recommend to you, the artwork that is
displayed over here, the rows, the ordering of every single thing. To assemble this page we definitely rely on a huge volume of data, and we need to serve these recommendations to more than 170 million members all around the globe at very low latency. So that covers the volume and the velocity buckets. But behind the scenes, Netflix uses machine learning in a wide variety of other cases, right? We are one of the world's biggest studios. We spend more than 16 billion dollars every single year deciding what content to buy, and somebody needs to decide: what is the value of any given piece of IP that comes
our way? When we are producing a show, before we greenlight it, we need to estimate how many people are going to watch it and how many subscribers we are going to bring in. Is this actually an efficient use of the money that our subscribers are giving us? We also happen to be one of the world's biggest content distribution networks. Every single time you hit play, that content is streamed from a physical hardware device that sits close to you, for example at your ISP. So we need to decide what content is going to be popular in your neighborhood so that we can seed that box
with that particular piece of content, so that you don't run into any buffering. As one of the biggest subscription services, we also need to be at the forefront of questions like: how do we fight financial fraud? How do we make sure that the way we process payments is as efficient as possible? As you can see, all of these use cases are very different from each other, and as it happens, there isn't one piece of machine learning infrastructure that can necessarily solve all of them. That was the problem my team ran into as well. And
that's when we started to think: okay, what is the infrastructure that can help all of these data scientists, all of these people who are tasked with solving these problems day in and day out, be more successful at their jobs? To begin with, let's talk about how a data scientist functions. What does the day-to-day look like for them? I'm pretty sure you know the story. A data scientist starts by figuring out a solution to a problem; at the end of the day you need to drive value by optimizing a certain metric. And what do
you do? You spin up a notebook, maybe Colab on GCP or a Jupyter notebook; I think we have folks from Facebook who will talk about notebooks later today. It's all good and fine as long as you are training on your laptop, but this data needs to come from somewhere, right? You need to figure out how to write the ETL that arranges your data in a format you can consume, and how exactly to write that SQL. Then there will be
a point in time when the resources on your laptop are just not enough. So do you move to a bigger laptop? Do you move to the cloud? How do you actually do that? These days people rely on Docker, so then you start thinking: okay, how do I containerize this? How do I write this Dockerfile? Figuring that out is difficult for somebody who doesn't really have a formal background in software engineering. And most times, a model by itself doesn't really mean much, right? It has to be put into production. This model has to be used
in some way or form, which often means either hosting your model as a microservice, or doing some batch compute with your model and then storing the results somewhere. All of this will ultimately, at the end of the day, power some sort of UI. Some business stakeholders will then take a look at these results, and either they will need you to iterate further, or they will give it a thumbs down. And then you would be like: hell, why did I go through all of this? Couldn't there be something that can just make
my life easy, so that I don't have to spend months and months just getting to the results? As you can see, a data scientist is an individual with training and deep expertise in data science. But if you look at the actual work they are doing, most of their time is consumed by engineering, which ultimately, at the end of the day, is a productivity drain. At Netflix, and I'm pretty sure at a lot of other organizations, people are often the most expensive resource. Machine time is super cheap: the amount of money that you pay any of
these cloud vendors is just dwarfed by the salaries and the other resources these data scientists are afforded. So the next question is: what can we do to make sure that this iteration cycle, from prototype into production and back to prototyping, is made as quick and as efficient as possible, in a manner that doesn't impose a significant tax on data scientists, so that they can focus just on data science and not on engineering? Now, another way of looking at the same problem is through the lens of: what does the data science stack look like?
The story that I told you was a high-level one that did not get into the specifics, so let's get into the weeds. When you start with any machine learning problem, the very first thing that comes in is data. As a data scientist, you often don't really care too much whether your data is stored in S3 or HDFS, or whether it's a CSV file stored in somebody's cabinet. The next thing that comes in is compute resources: you need access to GPUs, you need access to bigger boxes, and you
don't want to go to five or six different GUIs and figure out some special, fancy way of scheduling your compute on a cluster. These days, I'm sure a lot of people use some sort of container orchestration system; internally at Netflix we use our own, but the concepts are pretty much the same. On top of this, you need some sort of orchestration layer, because you need to make sure that when you move to production there is a scheduler,
a workflow orchestrator, that is orchestrating your compute on top of this container orchestration system. All of these container orchestration systems deal in containers, so you need some way of generating these Docker images, and the infrastructure needs to take care of packaging your code and being able to deliver and deploy it. Now, machine learning happens to be a very iterative task, which means that experimentation needs to be a first-class citizen. How many times has it happened that you trained a model and then thought: if only I
had versioning and metadata management, if only I had kept track of this metadata and this exact version of the code? But then you can't really go back and fix things in the past. The thing that comes on top of that is your actual development environment. Are you using a notebook, or RStudio, or PyCharm, and which languages are supported? At the very top of the stack are the machine learning libraries, right? As a data scientist, you want to exercise your full freedom: hey, which library do I want to use? As a
data scientist, you care about the top of the stack. You really care about which modeling library you are using, but you don't care whether the data is coming from S3 or from HDFS. But as infrastructure engineers, we deeply care about how the data is laid out. Are you making efficient use of resources? How is your container orchestration system set up? How is packaging being done? How are the versions managed? Is this experimentation management system in
order? That is very, very hard work that needs to be done, and data scientists would be much better off if systems engineers like me took it on and made their lives easier, while making sure they keep the maximum amount of freedom at the higher levels of the stack. To solve this, we built Metaflow, the machine learning framework that has been in use at Netflix for the last two years. Metaflow is essentially an opinionated wrapper around this deep stack of machine learning tooling. For every problem in the world
that machine learning practitioners are running into, there are solutions, but there is no one coherent story around them. So instead of reinventing the wheel, we try to distill all of the knowledge that has already been built and present it in such a way that data scientists can just focus on data science and don't really have to worry about things like: is my code getting versioned right now? Am I making the best use of resources right now or not? At the end of the day, it boils down to two
main machine learning activities: one is building models, and the other is deploying them in real-world scenarios. Both of these involve humans at the end of the day, right? Somebody has to understand the problem and figure out a solution. So what is it that data scientists really want to do, and how best can we design something that actually helps them rather than imposing a steep learning curve? So let's get into the weeds of it. Let me give you a brief introduction to what Metaflow looks like. What is it like to play with
it? This is, in essence, what any piece of machine learning code looks like: you have some function in which you are importing libraries, getting some data, training a model, and storing it somewhere; five hundred lines of code, a thousand lines of code. How do we make improvements to that? One thing we can do is allow users to express their computation as a graph of steps. You can think of these steps as follows: in the start step you are getting data from somewhere; then maybe you are training two different versions of a model, A and B; then in the join step you are figuring out which model is better. And then in
the end step, you finish up your computation. A lot of workflow systems let you define your compute as a graph like this, but one thing that isn't nice about them is that transferring state between steps is left as an exercise for the user, which is something Metaflow provides support for. So in this case, let's say you assign an instance variable, self.x. Its value is available to every single subsequent step of the flow. In the background, we actually snapshot the entire state and
store it in the datastore, which happens to be S3 for us. Now, this has a lot of different benefits. For example, say you decide that the code you wrote in step B doesn't really work well. You can fix it and then just resume your computation right from step B. Or it could be a scenario where one of your steps is loading a huge dataframe, which took 20 minutes against your query engine. You don't have to pay that tax anymore, because that data is already snapshotted for you and we bring it back for you every single time.
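To make the snapshot-and-resume idea concrete, here is a minimal plain-Python sketch. It is not Metaflow's API or implementation: `ToyFlow`, `run`, and the in-memory `DATASTORE` dictionary (standing in for a remote store such as S3) are hypothetical names made up for illustration, and the flow is linear rather than branching.

```python
import copy

# Stand-in for a remote datastore such as S3: (run_id, step_name) -> state dict.
DATASTORE = {}


class ToyFlow:
    """A linear flow; each method is one step and steps share state via self."""

    steps = ["start", "load", "train", "end"]

    def start(self):
        self.x = 1

    def load(self):
        # Imagine this is the expensive 20-minute data load.
        self.rows = [self.x, 2, 3]

    def train(self):
        self.mean = sum(self.rows) / len(self.rows)

    def end(self):
        self.report = f"mean={self.mean}"


def run(flow_cls, run_id, resume_from=None, origin_run=None):
    """Execute the steps in order, snapshotting instance state after each one.

    With resume_from set, earlier steps are not executed: their snapshots are
    cloned from origin_run and the state is restored from the last of them.
    """
    flow = flow_cls()
    names = list(flow_cls.steps)
    executed = []
    if resume_from is not None:
        first = names.index(resume_from)
        for name in names[:first]:
            DATASTORE[(run_id, name)] = copy.deepcopy(DATASTORE[(origin_run, name)])
        if first > 0:
            restored = DATASTORE[(run_id, names[first - 1])]
            vars(flow).update(copy.deepcopy(restored))
        names = names[first:]
    for name in names:
        getattr(flow, name)()
        DATASTORE[(run_id, name)] = copy.deepcopy(vars(flow))
        executed.append(name)
    return flow, executed
```

Calling `run(ToyFlow, run_id=1)` executes all four steps; a later `run(ToyFlow, run_id=2, resume_from="train", origin_run=1)` skips `start` and `load` and picks up `rows` from the stored snapshot, which is the "don't pay that tax again" behavior described above.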
The other problem that people run into is: okay, this works fine on my laptop, but now I need access to much bigger resources. I need 500 gigs of RAM. How do I actually go ahead and do that? At that point there is a huge gap that data scientists need to cross, and oftentimes they need to figure out how to schedule their compute on a cluster, which slows them down significantly. With Metaflow, people can decorate a step with an annotation, @resources, which makes it easy to specify: for training the
model that I have, I need ten CPU cores and access to a GPU, or: to perform this other operation, I need two hundred gigs of RAM. What we are going to do is take this compute and actually schedule it on the cloud. In our case, inside of Netflix, that happens to be our own container orchestration system; in the open-source version, AWS Batch plays that role. That's really helpful for people, right? Because now, all of a sudden, if they run into any resource limits, there is an easy
way to use the cloud and get their work done. The other important concept that Metaflow introduces here is the notion of foreach. One of the things people say is: I have a bunch of hyperparameters, and I just want to train all possible combinations of a model and then figure out which is best. That requires a significant amount of resources, so how do I do that very easily? You can have a step that holds a list, and you can very easily fan out over that list and run a task for every
single element. If you define specific resource requirements, we will actually launch an individual container in the cloud for each one. So, for example, in this case, let's say you are sweeping over some arbitrary number of parameters, say three; then you can very easily request three different instances with GPUs and train in parallel. We have internal use cases where people request 10,000 containers in parallel for training some of their models, because we have country-, language-, and title-specific models. And if you'd like
to do the math, that is the cardinality you end up with. Another point I want to mention here: a lot of people might be wondering, when I say 10,000 containers and so on, that's a huge amount of compute, and that costs a lot of money as well, right? But we live in a world where spot instances are really, really cheap, and the fact that Metaflow is snapshotting your state, and that you can add automatic retries and everything, just makes your compute a lot more efficient than it used to be. And you can very easily direct
specific pieces of your workflow to specific dedicated instances. So compared to a setup where you might need an instance with a GPU for the entire length of your workflow, you can reserve the GPU just for the step that actually needs it. Now, one big problem we observed in our user base was around data access. It's all good and fine that I can do a hyperparameter sweep and fan out with a foreach, but then each of those nodes also needs data, and usually people have been trained to get that data with SQL. That means
that you need to go against a query engine, and most organizations have a fixed-size cluster, so it's no fun when you run a wide foreach side by side with everybody else doing the same thing. At the end of the day, as a data scientist you just say: this is my data partition, and I want all of that data as quickly as possible. So Metaflow ships with its own S3 client that uses the entire network bandwidth available on the instance, so that we can fetch the Parquet files, load them, and store them as an in-memory representation. That way, users don't have to pay the tax of going against a
query engine over and over again. We have been looking at offering this feature externally as well. The other big issue that people run into is around third-party dependencies. Machine learning is a very fast-moving field; every single day, every single week, there are new advances being made and new libraries coming out, and more often than not they don't even pin the specific versions of their dependencies. So even if you install TensorFlow today, and you install the same version of TensorFlow tomorrow, it might
end up picking up different versions of its transitive dependencies. So even though, from your side, you didn't actually change anything, the results you get might be entirely different: things might fail, or things might just behave differently. With the tooling available right now, it is really hard to capture what those dependencies are. One common approach is to bless a specific version of each library and make available a Docker image with all of these blessed versions, which people are supposed to use exclusively. Now, what happens when
you actually have to move to a new version? What happens to the reproducibility of the work done on a previous version? What happens if you have to be on the bleeding edge and try out a new feature that was released just recently? Installing many of these packages is not really trivial, right? So we rely on this amazing package manager called conda, which has most of the data science packages available. Behind the scenes, what Metaflow does is actually snapshot the entire conda dependency graph for you
and store it in the datastore. So now we inherently have a closure: you have defined your dependencies, and we have captured what your environment looked like at that specific moment, so that with confidence, at any point in the future, we can regenerate, or rehydrate, that environment for you. We also know what resource requirements you need. Coupled with both of these, and with the fact that we version and snapshot your code every single time, we can take this unit of compute, package it, and schedule it anywhere, be it
your laptop, the cloud, or anywhere else, and that helps a lot. Behind the scenes, every single time you execute your workflow, Metaflow assigns it a unique identifier, which just increments by one on every execution. You can then go back in time and access the state, because we have been snapshotting it for every single run. I can ask: this model that I trained a month ago, what was the state of that workflow? What were some of its attributes? How does that compare with the one
that I just trained with this new code? Or, say my coworker was doing some experiments more than a year ago, and I want to see how my new line of research compares with that. It pleases our users that they don't really have to actively think about versioning their models or versioning their code; all of that is handled seamlessly behind the scenes. Now, the other component of machine learning is: once you have trained a model and you are happy with it, what do you do with it?
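The bookkeeping described here (an identifier that increments by one per execution, with state and code recorded for each run so that old and new runs can be compared) can be sketched in plain Python. This is only a toy illustration under stated assumptions, not Metaflow's client API: `RunRegistry`, `record`, `get`, and `compare` are hypothetical names, and hashing the source text is just one possible way to fingerprint a code version.

```python
import hashlib
import itertools


class RunRegistry:
    """Toy record of runs: each execution gets the next integer ID."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._runs = {}

    def record(self, flow_name, code, artifacts):
        """Store a fingerprint of the code plus a copy of the run's artifacts."""
        run_id = next(self._next_id)
        self._runs[(flow_name, run_id)] = {
            "code_sha": hashlib.sha1(code.encode("utf-8")).hexdigest(),
            "artifacts": dict(artifacts),
        }
        return run_id

    def get(self, flow_name, run_id):
        return self._runs[(flow_name, run_id)]

    def compare(self, flow_name, old_id, new_id):
        """Return {artifact: (old, new)} for artifacts whose values differ."""
        old = self.get(flow_name, old_id)["artifacts"]
        new = self.get(flow_name, new_id)["artifacts"]
        return {
            key: (old.get(key), new.get(key))
            for key in sorted(set(old) | set(new))
            if old.get(key) != new.get(key)
        }
```

With a registry like this, "how does last month's run compare with the one I just trained" becomes a lookup by run ID followed by `compare`, without the user having to version anything by hand.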
Metaflow allows people to put their workflows into production easily. Up until now, the situation in many organizations has been that a data scientist would work on some prototype workflow, and then it would be the responsibility of a data engineer or software engineer to turn it into production-quality code and deploy it on top of some scheduler. With Metaflow, we take the workflow and translate it into a language that the production scheduler can understand. Internally, we use our own scheduler. And the
open-source version has integrations with AWS Step Functions, and other schedulers can be integrated easily as well. Now, once you have deployed your model into production, like anything that goes into production, it is bound to fail at some point. At that point, again, given the fact that we are snapshotting the entire state of the workflow every single time, you can very easily verify: what is the state of the workflow? What is happening with it? A lot of times
people have tried to build a kind of universal UI for machine learning use cases. But in practice, we have seen that all of these machine learning workflows are very different from each other, and every data scientist has a very different notion of what it is that they actually want to monitor. So we just give them the tools: they can use a notebook and, in a few small steps, figure out what the state of their workflow is, what the parameter values are, and so on. And since we are snapshotting every single aspect,
they don't have to go back and say: hey, now I need to log this parameter as well. That is all taken care of. The other aspect is model hosting. Workflows vary; you might be doing batch scoring, but a lot of times you just want these models to sit behind an API so that something else can call them. It could be another service; it could be an internal UI. We have found in practice that a much better interface between data scientists and engineers, instead of handing over the model, is just a simple API. Metaflow also has a hosting component, where the data scientist writes a function and Metaflow takes care of hosting it as
a microservice and scaling it up and down as needed in response to traffic. In this particular example, the function accesses the instance variable x, which comes from the workflow, so its value is already known to the service. These are some of the features of Metaflow. Metaflow was open-sourced at AWS re:Invent in December, and it has been in production inside of Netflix for two and a half years. Most of the machine learning at Netflix, except for the recommendations use cases, uses it internally.
We have seen a good amount of adoption, and there are a whole bunch of big and small enterprises that are using Metaflow. So if you want to give it a try, I would highly recommend going to our website. We also have a hosted installation of Metaflow, so in case you don't have access to cloud resources, you can go to the Metaflow sandbox and we will provision resources for you. To summarize: as we saw before, models are a very tiny part of an end-to-end machine learning system, and it is really important that data scientists are allowed to
focus on data science, and that the tooling they use focuses on the actual human aspects of data science. It is actually possible to do that. Thank you. Audience question: what is the main thing that Metaflow can do that Airflow cannot do? They are kind of complementary, in the sense that Airflow is a scheduler; you can think of it as another production scheduler, while Metaflow is a machine learning framework, right? In that space, I think they have some projects
coming in, but I don't remember the name. Another way of thinking about the interaction between the two is that, once you are ready to go into production, you can export your Metaflow workflow to run on top of a production scheduler; Metaflow is not a replacement for the scheduler. Audience question: building on that question, how does it compare with other frameworks like Kubeflow and TensorFlow Extended? And as a corollary: who is the typical user for this tool? Because those are tools that you don't see normal users using; it's mostly large companies in the field. I think
one thing is that those frameworks are tied to their own ecosystems, and if those are not your choices, then they are not a good fit for you. When we started building Metaflow, we looked at all of these projects as well, and we failed to find an end-to-end solution that worked for our use cases. If Kubeflow and TensorFlow Extended work for your use cases, then that's great. But for us, it was really important that we don't tie our future to one particular set of technologies and
all the tooling that comes with it. And I think there was one more question, about who the intended users are. It is for anyone looking for something that has been proven out well in production to manage machine learning projects end to end. Given the fact that there are so many competing products out there, if you are in that frame of mind, we would be happy to have you try Metaflow. Audience question: we are moving to Kubeflow right now. I like a lot of
the things that you showed, but for some of them I have two problems. The first one is that our online services are basically in Scala, so I want to hear what you have to say about that. So, Metaflow has many more features than I could show here. Also, we have R support: our internal users are split across Python and R, and everything that you saw here is supported in both. As for Scala, most of the serving
and the low-latency online serving systems that we have are JVM-based, for example for the recommendation use cases. You can still pick the pieces that are useful to you: you can use Metaflow to train your models and store them in a datastore, and then you can have your online scoring system written in any other language. As long as you can deserialize those models and make use of them, you should be good to go. All right, let's thank Savin one more time.