In 23 years in the data management industry, Paige Roberts has worked as an engineer, a trainer, a support technician, a technical writer, a marketer, a product manager, and a consultant. She has built data engineering pipelines and architectures, documented and tested open source analytics implementations, spun up Hadoop clusters, picked the brains of stars in data analytics and engineering, worked with a lot of different industries, and questioned a lot of assumptions. Now, she promotes understanding of Vertica, MPP data processing, open source, high scale data engineering, and how the analytics revolution is changing the world.
About the talk
Getting Python data science work into large scale production at companies like Uber, Twitter or Etsy requires a whole new level of data engineering. Economies of scale, concurrency, data manipulation and performance are the bread and butter of MPP analytics databases. Learn how to take advantage of MPP scalability and performance to get your Python work into production where it can make an impact.
I use Vertica examples in my presentation because they're the customers who will talk to me these days. My presentation is all about how to get your Python work, your machine learning work, into production faster — because we all know that it doesn't start bringing money back in until then. It's just a cost center until you can get it into production and earning money for your company, and then you can do some pretty amazing things. So getting it into production is the big hurdle. My presentation has absolutely nothing to do with space, but I put stars all over it because I'm kind of a space geek, and I'll figure out some way to incorporate it.

So let's start with Philips. Philips is actually on one of my slides, and it's a good example of somebody who needs to get machine learning into production — and who has actually been in production for quite a while now. In fact, they predate some of the technologies that are out there, and yet they have tweaked their architecture bit by bit as they go along to take advantage of cutting-edge capabilities. So they have one foot in the past and one foot in the future, and they're a pretty cool example. They're also a pretty generic example — this is a good general idea of how this usually works in production at a company.

It starts with the devices they create: MRI machines, CAT scanners, ventilators — really important machines that we absolutely don't want broken. All of the components of these devices have phone-home capabilities, so they send information back constantly about how they're doing: what the temperature is, how many times they've been used, all that kind of information. It's a stream of data just coming back, and they've been collecting it for years. They take that information and combine it with some good contextual information, like: when was this device purchased? When was it last maintained? What was replaced or fixed the last time it was maintained? What's the geographical location of this device — is it in Minnesota, where it's January and minus 20, or in Texas, where it's 105 in August? What kinds of things might affect this machine? Then they take their algorithms — Python algorithms, good, sophisticated machine learning algorithms — throw them against that mass of data, refine the data, and figure out: okay, this is the data that indicates something important. In particular, it's going to tell me the pattern for when a particular component in a particular device is about to break.

So what they're doing is moving from a model where the hospital calls up and says, "my thing is broken" — and this MRI machine that everybody needs is down for a week or two while they troubleshoot, figure out what the problem is, maybe go back to the warehouse and get a part, oh, that's the wrong part, try again — to a model where they know that a particular part is about to break. So what do they do? They send somebody out to fix it at 3 a.m., when the machine is not being used. The machine is never down. It's never unavailable when someone needs it, and that's really powerful — that's incredibly valuable for their customers, for Philips, and for the general public, who might like to have that machine available when they need it.

So how are they doing that? This is a high-level look at their architecture, and I'm not going to look at the edge piece here — there are lots of presentations on how to do IoT edge stuff. This is specifically about how to get your machine learning into production, especially if you're doing it in Python. In this case, they're pulling all that data in and putting it in a data lake — but they're not actually throwing their algorithms at the data lake. That's kind of a surprise: a lot of people expect that the only place you can do data science is in the data lake, with that raw data or some refined version of it. But they're not doing that. They're refining the data and putting it into a distributed analytics database, and when they go, in real time, to try to detect whether something is wrong right now — after they've trained the algorithm in the database — they use the model to predict that this thing is about to break, schedule maintenance, and take care of it. They're also doing BI, because just because you're doing data science doesn't mean the business stops needing BI — you have to do both to run a business. But the main thing I see when I look at this is that they're doing it all in that distributed MPP analytical database. Why? Why are they doing that?

Well, back in the day, we had this idea that a data warehouse, a regular database, couldn't really handle this. It couldn't handle the multiple types of contextual data, like geospatial shapefiles and things like that. It couldn't handle the mass and volume of storing five, six, seven years' worth of information. It just wouldn't work — it wasn't possible in the past. Well, things have changed. Now databases are distributed, and the analytical database is now MPP. It's capable of scaling just like Hadoop — you add another computer and it gets bigger — and it can handle the scale that you have, or that you need, in one place, and that's pretty cool. But it's still a SQL database: everything you do in it, you do in SQL. That's fine for BI, but data science you usually want to do in Python or R. Well, this database has distributed R built in. It has Python capabilities built in. So I can take a Python algorithm, put it in the database, and it will distribute it — it will apply the algorithm to slices of the data across the cluster.
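As a rough sketch of what "apply the algorithm to slices of the data" means, here's the idea in plain, single-process Python. The device rows, the hash partitioning, and the over-temperature rule are all invented for illustration — in the real thing, the database does this partition-and-apply for you across the nodes of the cluster:

```python
# Sketch: how an MPP database parallelizes a Python function.
# The cluster hash-partitions rows across nodes, runs the same
# function on every node's slice, then combines the results.
# (Plain single-process Python here; the database handles the
# actual distribution.)

def partition(rows, num_slices):
    """Hash-partition rows into slices, as the database would across nodes."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[hash(row["device_id"]) % num_slices].append(row)
    return slices

def score_slice(rows):
    """The per-node work: flag components running hot (hypothetical rule)."""
    return [r["device_id"] for r in rows if r["temp_c"] > 90]

readings = [
    {"device_id": "mri-001", "temp_c": 95},
    {"device_id": "vent-042", "temp_c": 70},
    {"device_id": "ct-007", "temp_c": 91},
]

# "Distribute" to 2 slices, score each, combine -- what the database
# does transparently when you register a Python function with it.
flagged = []
for s in partition(readings, 2):
    flagged.extend(score_slice(s))

print(sorted(flagged))  # ['ct-007', 'mri-001']
```

The point is that the scoring function itself doesn't change; each slice just runs on a different node, so you get parallelism without writing any distribution code.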
Well, that's weird. Is it because Vertica is the super cool, awesome, unique database? I'd love to say that's true — and we certainly are super cool and awesome — but we are not alone. There are a lot of databases out there that can do this. Google BigQuery ML is a good example: machine learning capability inside Google BigQuery. Oracle Autonomous has ML capabilities built in. This is real, it's happening, and for a lot of companies trying to get past that hurdle and get their stuff into production, this is how they're doing it.

So let's have a look at this: what do you need to make machine learning work, to get it into production? You've got to have speed, especially in an IoT kind of use case, or a use case like real-time ad targeting on the web: if I click on something, somebody wants to accept bids on that ad and place the ad before the website loads. They need that in microseconds. I need fast response. I need it to be easy to use — something like SQL or Python, a high-level abstraction; that's something I've got to have. I've got to be able to do a lot of different things: machine learning has hundreds of applications, so I've got to be able to use it for whatever it is I need. I've got to have the flexibility to put it where my production system needs it to be — one company may need it on-prem, another needs it on the Amazon cloud, another on Google Cloud; you've got to be able to put it where it needs to go. You need to be able to handle change: all the sources you have to pull in, and changing algorithms, because the algorithms are constantly changing. And you've got to have an open architecture, because there isn't just one thing that does all of this. An architecture almost never has one technology — it has a whole lot of technologies, and all of those technologies not only have to be good at whatever they're doing, they have to be good at connecting to all the other pieces to make the whole thing work. You can't just have one good component; you have to have all the components working well together.

So what does Python have going for it? Well, it has a lot of functionality — it can do a lot of different things. It's flexible — it can go pretty much anywhere. It's fairly easy to use: as programming languages go, it's a high-level abstraction, and if you want an even higher level, you can use a Jupyter notebook or something like that. Python is really easy, as forgiving languages go. And there's a strong community — there's a lot of capability, there's scikit-learn and pandas and all of these packages that keep coming out. So this is cool: Python is awesome. Python can do all the things, machine-learning-wise; it can handle all of these great use cases. But it has some problems, in particular with scaling — and not just scaling to large data. It has the global interpreter lock and CPU thread management and all these technical issues behind it, but what it really comes down to is: once you get past a certain scale, Python doesn't work well. And once you get past a certain number of concurrent users, it also doesn't work, because it doesn't divide up resources well. If I have five people on my data science team, that might work; if I have 1,500, forget it — not even close.

And we all know data science isn't a linear thing — it doesn't work that way for us. You've got to figure out what's going on: look at the data, do some exploration, figure out how you're going to answer the question. Then you prep the data so it's just right for answering that question. Then you train a model — and the model is terrible, so you start over, get a different model, try again, evaluate the new model, and you're like, hey, this one looks like a good one. Now I've got it, and — bam — deploy. Well, that's where people keep hitting the wall, over and over again: making that final jump into deployment is difficult. And I've got to ask: why? Why is it hard? Well, there are several reasons. One is that people are subsampling. They're taking a small amount of the data and going off to work somewhere else, because they're working in some kind of technology that can't handle it all — they can't move all of that data somewhere else. Well, then what do I do once this works? How do I get it back to where I started from? You run into some real difficulties there. And you've got to have that speed of response, which a lot of times can be difficult to get.

So, the answer to how to improve my accuracy is pretty obvious: use all the data, don't use a sample. Okay — easier said than done. We all know machine learning models do better with more data. That's why machine learning is kind of taking over the world right now: all of a sudden we can process that amount of data, and that amount of data is available to us, where it wasn't before. I studied artificial intelligence and machine learning and things like that 20, 30 years ago, but it was largely theoretical, because for a lot of what we needed to do, we didn't have the processing power — unless you spent some ridiculous amount of money. Only now do you have the capability to use the full data set and get the maximum accuracy. And sometimes a single point of accuracy can make the difference in your business between being lucrative this quarter or not, and between looking like a rock star or looking like maybe you don't deserve your salary.

Well, here's the other problem: big data is not easy to move. So what do you do? Bring the analytics to the data. Don't try to bring the data to the analytics. (So I did get that space thing in there — I knew I'd work it in somewhere.) Data has gravity: the bigger the data, the harder it is to do things somewhere else, and the more you're pulled toward that data, as opposed to pulling the data somewhere else. The more data you have, the harder it is to move, and the more it makes sense to leave it where it is and bring other things to it, instead of the other way around. So: bring the models to the data; don't bring the data to the models. I might repeat that a few times, because I think it's a fundamental principle to keep in mind whenever you're architecting data science, machine learning, AI, all of those things. You not only get the parallelization capabilities in place, you not only avoid having to move your data and avoid the slowdowns — you also get the security and the provenance and governance, all the things that a mature technology is likely to bring. If you've got all your data sitting in the database in your production system, why would you go do your data science somewhere else? Why would you go somewhere else when that's where the data you need is sitting? It makes so much more sense to take the work to where the data is.

The only other thing is that you need speed. A lot of use cases require sub-second response, or a few seconds' response, and that's just not what some of the other technologies are built for. Nobody built a data lake with the idea that you could look at ten petabytes of data and respond in less than two seconds — that's just not what it's built for. Analytical databases? That's exactly what they were built for. That is their goal in life: to give you a fast response.

Let's look at another example: Uber. Now, this is actually Uber's architecture from about a year ago, because they've done a lot of work on it since. One of their hobbies is tweaking their architecture, and the reason is that they have over a hundred petabytes of data they're managing, and every teeny tweak that squeezes a little more efficiency out of that system is money in their pocket, or satisfied customers — it makes a huge difference to their business model, so it's worth it for them. So what they're doing is taking data in, using Spark to do some ETL, and putting it all together in a data lake. And here's a cool thing they've done: for anything where they need a response time between about one minute and five minutes, they've created something called Hudi that puts ACID transactional capability on the data lake itself. They can put all this data in Parquet — which is a really efficient way to store your data long-term, because it's columnar and compressed; it's really good — and they can do ACID transactions on it, and they can do data exploration, experimentation, machine learning, all of that if they want to, directly, using Presto or Spark or Hive or something like that.
Or if they need a response faster than that — within a few seconds — they take the data they're going to need those high-speed responses on: the most recent data, the hot data, all the data they're really querying quickly. They put that into a distributed analytical database. Like I said, my examples are my customers, but you could use something else. I mean, they had a warehouse — it was pretty good. They went to a data lake. Now they have a combination.
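A minimal sketch of that routing decision — the lake for minutes-scale questions, the analytical database for seconds-scale ones. The threshold and the tier names are hypothetical; this just illustrates the split:

```python
# Hypothetical sketch of the hot/cold split described above: queries
# that can wait go to the data lake; queries that need answers in
# seconds go to the distributed analytical database holding hot data.

def route(required_response_seconds):
    """Pick a storage tier by how fast the answer is needed."""
    if required_response_seconds >= 60:
        return "data_lake"           # Parquet + Hudi, minutes-scale
    return "analytical_database"     # recent/hot data, seconds-scale

print(route(300))  # data_lake
print(route(5))    # analytical_database
```

Uber baked this decision into their ingestion pipeline, so each piece of data lands in the tier that matches its required response time.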
So their progression is pretty natural. This happens a lot; you see it over and over again. Now, here's one of the things that's important: they needed workload isolation. They needed to be able to do multiple things at once, so they created multiple databases. Now they can query data in one database, and if it's overloaded with queries and can't take any more, they can move the queries to the other database, which has duplicate data. What about other data that they might not query quite as often, but need every now and then? There's another database for that. And the query sets never interfere with each other — they're all isolated workloads. They can do BI, they can do data science, and they can execute Python directly in there. One of the cool things they're doing, actually, is geospatial analysis, because they found out it works better in the database and they get a quicker response for things like route optimization — so they can get a car to you in under seven minutes, or get your food to you while it's still hot, or whatever. That sort of thing they can do right there.

And at the time, the database actually had a Python interface, but it was kind of terrible — I mean, it was very performant, but it didn't have all the capabilities, and they build all their applications in Python. So they said, we want to be able to use Python directly on the database, and they created something called vertica-python in open source. That happens a lot — there are capabilities that companies need, so they fill the gap in open source, and Uber is one of the companies that does that sort of thing.

Now, moving forward, they're looking at this and saying: man, we had to build all this code just to decide where to put our data, and then we had to build this proxy on top. So they're actually moving to Google Cloud in the future, and they're looking at using subclusters, so they can spin up a full cluster for one isolated workload, another cluster for another isolated workload, another cluster for an ephemeral workload — spin it down and spin it back up, that sort of thing — without having to build all of that themselves. That's going to be more efficient for them, and they're working on it right now.

In the meantime: what did the database bring to the table? Why are they using a database at all? Why didn't they just do everything on the data lake, or in some data science lab or something else? What do you get out of it? Well, a couple of things. First of all, there's no single point of failure in that database. There's no name node, no master node.
There's nothing that will make one node go down and take the whole thing with it — that doesn't happen. You get fast response, because that's what these databases are built for: high-speed response — a few seconds, sub-second, milliseconds. And concurrency: databases are used to somebody building a business intelligence dashboard, giving it to a thousand people, and a thousand people drilling down at the same time, sending SQL at the database all at once. They have to handle concurrent users, so they're built for concurrency. And it's got all these cool features. It can handle the ML: it's got built-in machine learning algorithms, it can parallelize your R or your Python or whatever, it can do moving windows, geospatial analysis, time series joins, fast data prep, all that kind of stuff. And it has an open architecture, which means it connects with Tableau and Looker and all the folks on the visualization and BI side, and the ETL folks, and Kafka, and it can interact with Spark dataframes or Python dataframes. So it gives you a lot of flexibility.

So what can you do? You can explore the data — that circular thing you're trying to do. You can sample the data, join two different time series, compute stats, and impute missing values — the kinds of data prep you need to get that data ready. Then you can train your algorithms. You can use one of the built-in machine learning algorithms that are already in there, or you can build one in Python, put it in, and it will distribute it, which is just cool. Or maybe you've already trained your model and you want to score it, check it out, and deploy it. Well, you can export it as PMML or something like that, put it in, and it will score it, compare it with other algorithms, do the things you need to do. And then, most importantly, it will deploy it — because the database is the same in dev, in test, and in production, deployment takes one line of SQL. Production in minutes instead of six months. That's huge, because that's the big wall a lot of people run into.

So: you have the power and the flexibility of Python, and the scalability of the MPP database. And recently we released VerticaPy, our new Jupyter notebook capability, which is open source — you can go out there and get it and use it yourself now if you want to. You put in your Python or your R, and it distributes it — it takes care of running it against slices of the data. So you get that parallelization without having to write any extra code, and that's huge.
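For a flavor of what that one-line deployment looks like, here's a sketch of the kind of SQL involved. The function names (`IMPORT_MODELS`, `PREDICT_PMML`) are as I recall them from Vertica's in-database ML API and may differ by version, and the table, column, and model names are invented — treat this as illustrative, not exact:

```sql
-- Illustrative only: bring a PMML model trained elsewhere into the
-- database, then score live rows with it in plain SQL.
SELECT IMPORT_MODELS('/models/failure_model.pmml'
                     USING PARAMETERS category='PMML');

SELECT device_id,
       PREDICT_PMML(temp_c, hours_used
                    USING PARAMETERS model_name='failure_model')
         AS about_to_break
FROM device_telemetry;
```

Because dev, test, and production are the same database, "deploying" really is just running statements like these against the production cluster.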
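And the data prep steps mentioned above — joining two time series and imputing missing values — can be sketched in plain Python. The readings are invented; in the database you'd express the same thing in SQL:

```python
# Plain-Python sketch of the prep steps above: join two time series on
# timestamp, then impute the gap with the mean of the observed values.
# (In the database this would be SQL; this just shows the idea.)

temps = {1: 70.0, 2: None, 3: 74.0, 4: 76.0}   # temperature readings, one gap
loads = {1: 0.3, 2: 0.5, 3: 0.4, 4: 0.9}       # usage readings

# Join the two series on timestamp.
joined = {t: (temps[t], loads[t]) for t in sorted(temps)}

# Impute the missing temperature with the mean of the observed ones.
observed = [v for v, _ in joined.values() if v is not None]
mean_temp = sum(observed) / len(observed)
prepped = {t: (v if v is not None else mean_temp, w)
           for t, (v, w) in joined.items()}

print(round(prepped[2][0], 2))  # 73.33 -- gap filled, ready for training
```

Once the gaps are filled and the series are joined, the data is in the shape a training algorithm expects.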
You also get support for things like TensorFlow, and that can make a big difference. Neural nets are better off training on GPU systems, but your production systems are almost always on CPUs — or on the clouds, which usually give you CPUs — and that causes you some problems. How do I get my TensorFlow into production? The same way as your Python, except in this case you put it in frozen graph format instead of PMML. So that's the idea.

Check out VerticaPy on GitHub if you want to see what people are saying about it — they say nice things; it looks good. And check out the free Community Edition: up to a terabyte, up to three nodes. It's got all the machine learning capabilities in it — you can do everything that you can do with the larger version, just on a smaller scale, to give you an idea of how it works. Or if your data is under a terabyte, just use it; it's yours. And if you want training on it, the Vertica Academy is free, on demand.