Productizing Machine Learning over Big Data with AWS tools By Maximo Gurmendez, Lead Engineer DataXu


About speaker

Maximo Gurmendez
Chief Engineer at Montevideo Labs

Maximo holds a master’s degree in computer science/AI from Northeastern University, where he attended as a Fulbright Scholar. As Chief Engineer of Montevideo Labs he leads data science engineering projects for complex systems in large US companies. He is an expert in big data technologies and co-author of the popular book ‘Mastering Machine Learning on AWS.’ Additionally, Maximo is a computer science professor at the University of Montevideo and is director of its data science for business program.


About the talk

In this presentation we will focus on the problem of serving machine learning models that are trained over very large data sets (terabytes and above). In particular, we will show how some AWS tools, including Apache Spark on EMR and SageMaker, can aid such a process. We will use notebooks to illustrate the ideas with real business use cases, and we will share some success stories behind the development of smart data products along with the lessons learned. Throughout the presentation we will try to address some of these questions:
  • What do we do when our training takes too long, or is too expensive?
  • Are "deployable notebooks" a good idea?
  • How can we integrate big data tools such as EMR/Spark with ML services such as SageMaker?
  • Why are model serving endpoints not enough?


What is a smart data product? It's a product that has a purpose: it makes decisions, or it helps people make decisions. It has impact, and therefore it must be built with quality in mind; something that is a product is not just something you use once. The journey of building smart data products is, as an industry, not as mature as it is for other kinds of products, so I'm going to discuss things that can go wrong, and we can pick those discussions up, along with any questions you have, at the end.

The first thing I'm going to do is a small demo of the stack we have the most experience with, the one most of our clients work with, and the one my co-author and I wrote about in 'Mastering Machine Learning on AWS'. That will set the stage for the rest of the conversation. In particular I want to show SageMaker, so let me switch to the AWS console and go through a few examples of things you can do with it. In the console there are a bunch of different tools you can use, from training jobs to inference endpoints. Today I'm going to work from a notebook instance, one of those services that gives you a managed notebook machine where you can start working on your particular problem. I'm going to open Chapter 3 of the book, where we show a few different ways you can go about solving one particular problem, and since this is the Boston edition of the conference, the example is a Boston one.

The problem is estimating the price of houses, using the classic Boston Housing dataset, which you can download publicly. In this dataset we have a bunch of properties of houses being sold in the Boston area, and the label is the median value of houses in a specific area, of a specific age and a specific size. We want to estimate that median value. Think of it as building a smart data product that helps you assess whether the asking price of a house is in the right range or not; that would be the smart thing it provides.

I define a dataframe here, and I'm using the simplest trainer, a linear regression. The first thing we usually do is split off a testing set, in this case 20%, and then we import scikit-learn and fit on the training data against the label attached to each instance, or row. After that we predict values for the test dataframe and compare them to the actual values. So that's what we're doing here: we fit, we add a new column with the predicted median value, and since this is a linear model we can look at the coefficients, the intercept and the score to see how well our model is doing. The catch is that this runs on one single box; it doesn't scale to more than one machine, at least not as scikit-learn is written.
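For reference, here is a minimal sketch of what that scikit-learn cell looks like; the local file name and the 'medv' label column are assumptions for illustration, not taken verbatim from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical local copy of the Boston Housing data with a 'medv' label column.
df = pd.read_csv("boston_housing.csv")
features = [c for c in df.columns if c != "medv"]

# Hold out 20% of the rows as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["medv"], test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Inspect the linear model and score it on the held-out rows.
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
print("R^2 on test set:", r2_score(y_test, model.predict(X_test)))
```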

AWS also has EMR, Elastic MapReduce, which is a service that lets you distribute compute and run a bunch of different big data frameworks. In particular, Spark is a very popular framework for working over large amounts of data, so we can distribute the training we just did on a single box across many boxes. Since we only have limited time, I'm going to show you what the Spark version of this looks like, even though I'm not running it on EMR; I'm still running it locally on this SageMaker notebook instance, because I just don't have enough time to show all the plumbing necessary. But it's super simple, and we work much like we did with scikit-learn: we create a dataframe, we read it from CSV, we define the features we want to train on. What is different in Spark is that you need to assemble your features into a single vector column; that's why we have a VectorAssembler. As usual we do a split into testing and training, we call fit using a very similar API to the one we had with scikit-learn, and we can then transform to get predictions for each of the entries in our test set and see how well we did. The R squared score is very similar to the one we had with scikit-learn; we're working with small data to keep this quick. We can either use the built-in model summary or use the RegressionEvaluator to get a sense of how we're doing on the predictions for the testing data.
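As a rough sketch of that Spark ML version (assuming the label column is called 'medv'; the S3 path is illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.getOrCreate()

# Hypothetical location of the same Boston Housing CSV.
df = spark.read.csv("s3://my-bucket/boston_housing.csv",
                    header=True, inferSchema=True)

# Spark ML expects all features assembled into a single vector column.
feature_cols = [c for c in df.columns if c != "medv"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = assembled.randomSplit([0.8, 0.2], seed=17)

lr = LinearRegression(featuresCol="features", labelCol="medv")
model = lr.fit(train)

# Either the built-in training summary ...
print("R^2 (training summary):", model.summary.r2)

# ... or a RegressionEvaluator over the test predictions.
predictions = model.transform(test)
evaluator = RegressionEvaluator(labelCol="medv", predictionCol="prediction",
                                metricName="r2")
print("R^2 on test set:", evaluator.evaluate(predictions))
```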

Beyond that, you usually don't want to use just an estimator like LinearRegression on its own; you want to build a Spark ML pipeline that includes other stages: not only the estimator, which is the algorithm that trains the model, but also all the feature transformations, the feature engineering, all the data preparation stages necessary for machine learning. Spark has a bunch of transformers for this, like one-hot encoding and normalizers, and they all go into the pipeline. The nice thing about a pipeline is that you can fit the entire pipeline, and that gives you back a transformer capable of applying to any particular row all the transformations necessary to produce a prediction. The same transformations that were done to get to a dataframe prepared for learning are applied when the model is used. That is the advantage of deploying a pipeline versus deploying a model on its own, especially when building a smart data product: when an incoming request arrives, say someone submits the details of a new house and you need to get a prediction, you need to do at prediction time all the same transformations you did at training time, and having a pipeline that does that automatically is a very good thing.
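A minimal sketch of what such a pipeline can look like; the categorical 'neighborhood' column and the file path are hypothetical (the Boston Housing data is numeric) and are only there to show where one-hot encoding would sit, and the encoder API shown is the Spark 3.x one:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()
raw_df = spark.read.csv("s3://my-bucket/boston_housing_with_neighborhood.csv",
                        header=True, inferSchema=True)   # hypothetical enriched data
raw_train, raw_test = raw_df.randomSplit([0.8, 0.2], seed=17)

# Feature engineering stages: index and one-hot encode a categorical column,
# assemble everything into a vector, then normalize it.
indexer = StringIndexer(inputCol="neighborhood", outputCol="neighborhood_idx")
encoder = OneHotEncoder(inputCols=["neighborhood_idx"], outputCols=["neighborhood_vec"])
assembler = VectorAssembler(inputCols=["rm", "age", "neighborhood_vec"],
                            outputCol="raw_features")
normalizer = Normalizer(inputCol="raw_features", outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="medv")

pipeline = Pipeline(stages=[indexer, encoder, assembler, normalizer, lr])

# Fitting the whole pipeline returns a PipelineModel: a Transformer that applies
# every feature-engineering step plus the trained model to any incoming row.
pipeline_model = pipeline.fit(raw_train)
predictions = pipeline_model.transform(raw_test)
```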

This same model can also be trained on SageMaker. SageMaker is not just the infrastructure that lets you launch notebooks; it also has services. We can create SageMaker training jobs: we give SageMaker the training data and it runs the training for us. It even lets us run jobs that get us predictions in bulk, at scale, and it lets us deploy an endpoint capable of serving those predictions. So here is the exact same example, but using SageMaker as opposed to Spark or scikit-learn. We have the same dataframe, we select the training features and the label we want to predict. The next thing we do is upload this dataframe to S3, in two slices: one for testing and one for training. You can see here that we are using the boto3 API to upload the two CSVs so that they are on S3. S3 is a distributed repository of files and blobs, and once the data is in that distributed store, SageMaker can read it and run the training job.
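A sketch of that upload step with boto3; the bucket name, key prefixes and local file names are made up for illustration:

```python
import boto3

bucket = "my-sagemaker-demo-bucket"    # hypothetical bucket

# train.csv / test.csv were written locally from the two dataframe slices.
s3 = boto3.client("s3")
s3.upload_file("train.csv", bucket, "boston/train/train.csv")
s3.upload_file("test.csv", bucket, "boston/test/test.csv")
```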

To launch the training job we specify what we want to use for training: what kind of machine, where we want the output to go, and where the training data is located. Then it starts training the model, we get some stats about the run, and we can also see all of this in the SageMaker UI: under training jobs you can see the ones I ran a couple of hours ago, because I didn't want to wait for the whole thing to run live. The linear learner job created a model, and its output includes some metrics around the training; all of this is visible in the SageMaker UI as well as through the API. Another thing we can do is run a batch transform job that reads all of our testing data, say a bunch of new houses, and gets us predictions for huge amounts of data in one go. If we go to the UI we can see the batch transform jobs over here; this is another one I ran today with the linear learner, to get predictions for a whole batch of rows.
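A sketch of launching the linear learner training job and the batch transform with the SageMaker Python SDK, written against the v2-style API; the IAM role, instance types and S3 paths are assumptions and may differ from what the notebook in the talk uses:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"   # hypothetical role
bucket = "my-sagemaker-demo-bucket"                      # same bucket as above

# Built-in linear learner container for the current region.
container = image_uris.retrieve("linear-learner", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,                               # what machine(s) to train on
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/boston/output",     # where the model artifact goes
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="regressor", mini_batch_size=32)

# Where the training data lives; for CSV input the label must be the first column.
train_input = TrainingInput(f"s3://{bucket}/boston/train/", content_type="text/csv")
estimator.fit({"train": train_input})

# Batch transform: score a whole file of new houses in one job.
transformer = estimator.transformer(instance_count=1, instance_type="ml.m5.xlarge")
transformer.transform(f"s3://{bucket}/boston/test/",
                      content_type="text/csv", split_type="Line")
transformer.wait()
```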

In addition to creating predictions with batch transforms, we can deploy an endpoint for serving predictions in real time. In particular, if I go to the training job, or to the model that run created a while ago, I can give it an endpoint name and an endpoint configuration and create an endpoint. It takes a few minutes to come up, so I'm not going to wait for it here, but once the endpoint is up and running I can use the real-time predictor classes in the SageMaker SDK by providing the name of the endpoint I just created and the serializer I'll be using to submit the data. I can then call this real-time predictor and get predictions for a vector of the different properties of a house, its age and so on, from any application.
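A sketch of deploying the endpoint and calling it, again with the v2-style SDK (the predictor class was called RealTimePredictor in older SDK versions); the endpoint name and instance type are illustrative:

```python
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

# Create the endpoint from the trained estimator (takes a few minutes to come up).
estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",
    endpoint_name="boston-linear-learner",   # hypothetical endpoint name
)

# Anywhere else, attach to the running endpoint just by its name.
predictor = Predictor(
    endpoint_name="boston-linear-learner",
    serializer=CSVSerializer(),              # we submit a CSV row of features
    deserializer=JSONDeserializer(),
)

# One feature row in the same column order used at training time (no label).
row = "0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14"
print(predictor.predict(row))

# Tear the endpoint down when done to stop paying for it.
predictor.delete_endpoint()
```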

Going back to the presentation: it's very easy to use these tools and to deploy services, and having the ability to move really fast is great. But sometimes, by rushing, and by forgetting that what we are building is a smart data product, things can go wrong. I've seen that happen oftentimes, so I'm going to go through some pitfalls, and maybe we can take some time to discuss whether these are things that have happened in your experience too. I have made some of these mistakes myself in the past.

The first pitfall is assuming that the assumptions behind a model will keep holding in the future. A data scientist presents a new strategy, through a notebook or a presentation to a group of people, about how a particular model or a particular piece of data might help the business in some way. Then people leave the company, the business changes, the schedules change: we should keep challenging the assumptions that were true at the moment the methodology was designed, but instead we forget about the underlying assumptions that made the data and the model valuable at that particular moment in time. So we can't just launch models and then forget about them; we need to challenge the assumptions while designing and developing, as well as cater for model refresh. Another common error, in my opinion, is this: engineers focus on stability, latency, throughput, concurrency issues, on using the cloud in the most efficient way; data scientists, on the other hand, care about experimentation, and a successful prototype is usually judged on model accuracy or business metrics.

Many times these objectives are in tension in production. We need to ensure that the models being served are truly fast, that they don't have memory leaks, that they don't cause concurrency issues; at the same time, chances are the best models require a large memory footprint and are prone to getting out of hand. The best setup, in my opinion, is to have teams in which engineers and data scientists collaborate, in whatever form of collaboration makes sense for the business culture. The engineer may say the model is not production-ready, and the data scientist may say the engineering constraints seem very risky, but it is possible; we've seen it work. The next pitfall is metrics that don't drive decisions. Once our smart data products are in production, we might be under the illusion that the models are behaving the way we intended, but without watching the right signals we can't know. We may be tempted to rely only on the offline metrics that we data scientists understand very well.

What really matters is a live metric that can show we're doing the right thing, that we're making the right final decisions. A/B test metrics are really useful because they prove that our model is working better than a good, well-understood alternative strategy. User acceptance testing is also important, as it helps uncover unintuitive or confusing aspects of our data products: a model expanded to a greater region might look accurate on the data, but checks for things like monotonic behavior can catch such issues and also prevent regressions. The next pitfall: there are usually several upstream transformations in ETL jobs that undergo modifications and improvements, and we need to make sure those changes don't affect the assumptions baked into our models. We need to make sure that as new data arrives, both old and new models can leverage it without breaking the system in production. For this reason the upstream datasets need to be versioned and decoupled from the models.

A model can then keep reading from the version of the data it was trained against. We can evolve the data separately and later deploy new models that leverage the new version, while the old model keeps using the previous version of the data and keeps running. These are really typical issues: someone changes the data upstream, the models start to behave in an odd fashion, and it takes a while to detect and correct. The next pitfall is not incorporating data science artifacts into the CI/CD process. Imagine a data scientist works in a notebook on SageMaker, like the one we just saw, and the model that comes out of that notebook is connected to the system users actually use, as part of a larger, more complex flow of continuous integration and deployment.

There's a reason why we have unit tests and integration tests, and why we invest in all of that infrastructure to guarantee that our code is reliable; the same applies to the artifacts produced by data science. It also matters to enable data scientists to run things locally so they can troubleshoot when things go wrong. It's not always possible to have a complex, large system running on a laptop, but a version with reduced scale is typically very useful: it can help catch plumbing issues and shed light on inconsistencies. Oftentimes productivity grows a lot if we have the ability to deploy locally and run a mini version of the system on a laptop. The next pitfall I want to discuss is not having proper environments and experimental groups. If your system is not well designed for this, it's very hard to ensure that our products can run on a laptop or in a staging environment. And this is not just a software engineering good practice, modularizing the environments; it is a data science concern as well: we have to have the ability to experiment with different ideas before they reach production, and we have to do this very carefully.

This leads me to the next pitfall: subjecting the same users to every experiment. With an A/B test we try out a new algorithm on a small percentage of users to see if it does better than the incumbent, and if our product is very complex we may run many of these at once. We need to make sure the same set of users aren't always the guinea pigs, or we will make them unhappy. Furthermore, we need to be careful not to abuse overlapping experiments and extract incorrect conclusions due to the effect of one experiment on another. This is very important for products that lean heavily on experimentation, where data scientists are constantly trying different new ideas in different areas. The last pitfall is launching a product without proving its value with a prototype, without first learning about the potential benefits. [Moderator] Sorry to interrupt you, we're a bit over time. Can you finish up in a couple of minutes? [Maximo] Okay.

One last thing: a notebook is not the spec of the product. The spec should be a full document that communicates all the preconditions, the assumptions, and the expected behavior of the algorithm. Thank you, that is all I have to share.
