Duration 41:08

Data Science at Scale with R on GCP (Cloud Next '19)

Mikhail Chrestkha
Machine Learning & Data Science at Google
Google Cloud Next 2019
April 9, 2019, San Francisco, United States

About speakers

Mikhail Chrestkha
Machine Learning & Data Science at Google
Christopher Crosbie
Product Manager, Dataproc and Open Data Analytics (ODA) at Google
Gregory Mikels
Customer Engineer - Machine Learning Specialist at Google

Mikhail Chrestkha is a machine learning specialist within Google Cloud’s customer engineering team. He focuses on designing and developing end-to-end machine learning platforms including data processing, model training, model serving, and model management. He has experience with various frameworks (scikit-learn, R, TensorFlow, XGBoost, Spark MLlib, PyTorch) to help customers leverage ML to solve business problems. Prior to Google Cloud he spent seven years in consulting helping customers drive value from data and analytics. Mikhail holds a B.S. and M.Eng. in Operations Research from Cornell University.

Greg is a Machine Learning Specialist for Google Cloud and holds a Bachelor’s Degree in Mathematics and Master’s Degree in Applied Statistics from Loyola University Chicago. Greg has 6 years of professional experience working in technology, ranging from systems engineering to data engineering and data science. He specializes in helping customers deploy end-to-end machine learning solutions on Google Cloud Platform.

About the talk

Cloud computing has opened up new opportunities for the R programming community. Google Cloud lets R developers enhance and scale their data science work to large, complex datasets. We will cover (1) managed GCP services for R, (2) training and serving architectures, and (3) end-to-end R pipelines on GCP.

Transcript

I think everyone in the room will agree that R is a very active community, with some of the best data scientists, statisticians, and mathematicians out there. It has also seen very healthy, steady growth year after year over the last two decades. But I keep hearing: can R scale? Can it deal with very large datasets? That's why we're here today: to talk to you about how you can scale your R workloads on Google Cloud. Our agenda covers the evolving R ecosystem on Google Cloud and three reference workflows for training and serving machine learning models: one for TensorFlow, two for Spark, and three for any R library that might be a favorite of yours.

We'll close with a Q&A using an online Dory app. What is the Dory? It's a Q&A messaging app that we use internally for Q&A at scale, to make sure everyone has an opportunity to ask questions. If you go to the Google Cloud Next app and find our session, you'll see a Dory Q&A link. It lets you compose a question as well as upvote the questions you most want to see answered. It will be open for the entire week, so if we don't get to all your questions, we'll make sure to answer them directly in the app itself.

So let's get started with the R ecosystem on Google Cloud. I posed the question: can R scale? One response might be: does it really matter? Is there enough demand for it? Two areas to look at are distributed data processing and deep learning, workloads that require lots and lots of computation. If I do a quick analysis of CRAN logs, comparing download counts from Q4 2017 to Q4 2018, the tensorflow and keras libraries, two deep learning frameworks, have doubled in usage. That is a real signal that data scientists are exploring deep learning to solve novel problems in new problem domains.

sparklyr has actually grown 6x over the same period, a great signal that data scientists are pushing past the limits of a single machine and the memory it restricts you to, and are looking for distributed options on clusters. So clearly R should scale. I've been at Google Cloud for two years, and I'll admit I felt neglected as an R user at first. As a cloud provider, developers were our primary audience: we wanted to make sure developers could build applications on the cloud, so we provided client libraries for Java, Node.js, and Go.

Next we focused on helping data engineers build pipelines and data stores, providing a SQL interface and Java APIs for big data frameworks like Apache Beam. And finally, TensorFlow is still central to our ML ecosystem, so naturally Python was a first-class language choice for ML engineers. But after two years of feeling neglected, I'm really happy to talk about how Google Cloud has evolved to support the R community natively. We now have a lot of notebook options, and we give you the option to run Spark on a managed, auto-scaling Spark service.

We let you serve your trained TensorFlow models, or models from any custom R library. And, a really big thing for the R community, a lot of packages have been contributed by Mark Edmondson, Hadley Wickham, and the RStudio team that truly interface with the great big data and ML services Google Cloud has to offer. So let's talk about what this actually looks like. At the very top, we have lots of different notebook solutions: Zeppelin, everyone's favorite RStudio, as well as a managed Jupyter notebook with an R kernel. Moving to the very bottom left, we have ingestion and storage options.

The two main ones we'll cover today are Google Cloud Storage, an object storage service really meant for unstructured data, for working with images, free-text files, and audio samples; and, for structured and semi-structured data, BigQuery, our petabyte-scale, fully serverless data warehouse. As we move into exploratory work and data profiling, what are the options to explore and prepare that data? We have SQL syntax on top of BigQuery that lets you crunch through billions of rows. We also have Spark, again as the managed auto-scaling cluster I mentioned, if you want to use some of the more statistical techniques like PCA and the other dimensionality-reduction techniques available through Spark's native MLlib library.

Now this is where things get interesting. Usually I would move on to the right side and build a custom ML pipeline, but I want to stop and think about the bottom middle there. Before you invest time in that, you should look at whether some of our pre-trained APIs can offer up solutions that get you a good part of the way. We have a lot of different pre-trained APIs in the natural language space, the image space, translation, and audio transcription, and we even have an ML service built directly into our data warehouse, BigQuery.

On the right side is where we have invested, and where we believe we can add value, in building a robust ML pipeline. The three workloads we're going to cover today are TensorFlow and Keras on a managed service called Cloud Machine Learning Engine; Spark's MLlib library, beyond just the data processing techniques, on Dataproc; and then the number one question I've gotten over the last year: what about any R library? I like to use my statistics libraries, my caret, my rpart. How can we use a container-based approach to run those at scale as well?

Before I jump into the workflows we'll primarily focus on, I want to give you two quick examples of how to embed some of these services upstream in your workflows. Let's say I want to query lots and lots of data. On the left side, I'm loading bigrquery, a package by Hadley Wickham. With a few lines of code, nothing special, your normal SQL syntax, this exact query crunched through a hundred billion rows, three and a half terabytes, in 45 seconds, and then downloaded the result as a dataframe for you to then run ggplot and various other analytics and visualizations on top.

So it's very powerful for quick exploratory work, and it really lets you push all that scaled-out work to our managed service and then bring just what you need into your R dataframe.
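
A minimal sketch of that pattern with bigrquery (the billing project, query, and column names here are placeholders, not the exact ones from the session):

```r
library(bigrquery)
library(ggplot2)

billing <- "my-gcp-project"  # placeholder billing project

sql <- "
  SELECT year, COUNT(*) AS births
  FROM `bigquery-public-data.samples.natality`
  GROUP BY year
  ORDER BY year
"

# BigQuery does the heavy lifting; only the small aggregated
# result is downloaded into an R dataframe.
tb <- bq_project_query(billing, sql)
df <- bq_table_download(tb)

ggplot(df, aes(year, births)) + geom_line()
```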

On the other side of things, let's say I work in customer service and I want to analyze all the customer chat logs for sentiment and for what our customers are actually calling about. Usually I would probably need to go back to school and get a deep learning degree, then look into word embeddings and understand the relationships between different words, not to mention creating the labeled dataset that machine learning requires. So before I invest in all that, or hire a team member who can do it, here are a few lines of code that access our Natural Language API through googleLanguageR, a package contributed by Mark Edmondson. In a few lines of code it lets you analyze text, extract entities ranked by salience (essentially, how important they are), classify the text into various categories, and look at sentiment.
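
A few lines in that spirit, using googleLanguageR (the credentials file and example strings are placeholders):

```r
library(googleLanguageR)

gl_auth("service-account-key.json")  # placeholder credentials file

texts <- c("The support agent resolved my billing issue in minutes.",
           "I have been on hold for an hour and nobody answers.")

# A single call returns entities, sentiment, and more for each string.
nlp <- gl_nlp(texts)

nlp$entities            # entities, ranked by salience
nlp$documentSentiment   # overall sentiment per document
```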

Now, this might not be your end-all be-all. This might be an upstream data enrichment exercise where you then combine some of these outputs with your structured data to build a more traditional structured-data model. Next is TensorFlow. The RStudio team has been working alongside our TensorFlow team to build a great set of R APIs and interfaces on top of the TensorFlow Python modules. On the very left side, we have our three core R packages. Starting from the bottom, we have tensorflow, the lowest-level API, which lets you build the graph of operations and gives you the most flexibility and customization.

Moving to the middle, tfestimators lets you use out-of-the-box algorithms: regression, classification, some of the common neural-net architectures, and even areas like support vector machines. And finally we have keras, our top-level, high-level API that's really built for usability, quick prototyping, and rapid experimentation.
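
For a sense of what that looks like, here is a toy MNIST-style model definition with keras in R (a sketch, not the session's exact model):

```r
library(keras)

# A small classifier, just to show the shape of the keras API in R.
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 784) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss      = "categorical_crossentropy",
  metrics   = "accuracy"
)

# fit() would then train it; x_train / y_train prepared elsewhere.
# model %>% fit(x_train, y_train, epochs = 5, batch_size = 128)
```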

Then on the right side, we have a lot of complementary supporting libraries. There's tfdatasets: a lot of times, when you're building a machine learning model, the training process isn't the bottleneck; it's how quickly you can feed in all the data in the appropriate file formats. tfdatasets lets you work with different file types and build parallel data input pipelines. tfruns handles tracking, visualizing, and monitoring those loss curves and looking at evaluation metrics; for those of you who have explored TensorFlow in Python, it's very analogous to the TensorBoard tool, directly in your R UI. Next is tfdeploy: while you use R to build and train your TensorFlow models, one of the main powers of TensorFlow is that it is a language-agnostic system.

It's a framework where any downstream application, whether it's in Java or C#, can consume all those predictions. tfdeploy lets you export, save, and manage all those models going forward. And finally, the core library cloudml is an interface to Cloud Machine Learning Engine, a managed service that lets you provision a cluster of machines with GPUs or CPUs, and also host the models as cloud endpoints for you to consume for instance-level or batch predictions. So what does this look like in a more holistic architecture? We have RStudio Server Pro available with one-click deployment through the GCP Marketplace.

The code can be version-controlled within a Cloud Source Repository, which is a private git repo, though you're welcome to use GitHub or Bitbucket as well. Since we'll be working on a deep learning model, we store all our images in the bottom left there, in Google Cloud Storage. We then package the TensorFlow code, and the main piece here, again, is that you might do development work in your RStudio notebook environment, but what you actually submit is the ML training as a job to Cloud Machine Learning Engine.

You provision resources for only the number of minutes or hours you need, with the appropriate hardware accelerators at that time. So it's a very job-focused mindset, as opposed to an always-running development environment. Next, we save and store our results, as well as our model binaries, in a Google Cloud Storage bucket, in the hierarchy of your choice, to retrieve and version them as needed. And finally, ML Engine also has a second module, for serving.

This is where a model essentially gets automatically containerized and hosted as a common endpoint that you can access as a REST API for instance-level, one-at-a-time predictions, maybe for a web or mobile application; or, if you need batch predictions run through millions or billions of rows of a table, you can run the batch option as well. Interestingly, while ML Engine is a cloud solution you have to be connected to, RStudio Connect is a very similar offering that can be hosted on-premises on a server and host some of those models; you might use our burst compute to train the model but then still serve it on-premises.

And finally, the API call can be consumed by various downstream applications, whether you stay in the R ecosystem with Shiny or go to other mobile and web applications. Let's dive a little now into the actual code for training on Cloud ML Engine. Again, we just load the cloudml library. cloudml_train is the main function: you simply point it at your TensorFlow code sitting in an R file. There are a lot of optional parameters, and this is where the power is: the master type, or the scale tier, lets you access those resources for only the time the training takes.

A standard GPU tier gives you access to an NVIDIA K80, and a P100 is also an option. If you want to scale even further and have multiple GPUs on one machine, we offer up to eight V100s on a single virtual machine for a large, complex model. So with quick plug-and-play, you really have access to various choices to optimize for price as well as performance when training models, thinking about this with a job mindset.
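
A sketch of what submitting that training job looks like (the script name and machine-type strings are illustrative):

```r
library(cloudml)

# Submit the R script containing the TensorFlow/Keras training code as a
# job to Cloud ML Engine. master_type picks the hardware; the values
# below are examples of the legacy ML Engine machine types:
#   "standard_gpu"  -> one NVIDIA K80
#   "standard_p100" -> one NVIDIA P100
cloudml_train("train.R", master_type = "standard_gpu")
```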

Next we have prediction. We load the cloudml library once again. Once you've trained the model, you'll have a directory where that model is saved, and you can explore it; in this case I'm calling it savedmodel. Then I deploy and host it as an endpoint, in this case called keras_mnist. At that point you have a lot of different options for downstream applications, but if you want to run batch predictions or sample predictions directly in the R environment, you can use cloudml_predict to cycle through and see whether the images are being predicted correctly, whether audio samples are being transcribed correctly, or whatever other traditional structured-data problem you may be solving.
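
A sketch of the deploy-and-predict side (the directory and endpoint names mirror the talk; the input instance is a placeholder):

```r
library(cloudml)

# Deploy the exported SavedModel directory as a hosted endpoint.
cloudml_deploy("savedmodel", name = "keras_mnist")

# Run a sample prediction from R against that endpoint. `instances` is a
# list of records shaped like the model's input; `mnist_image` here is a
# stand-in for one flattened test image.
preds <- cloudml_predict(list(as.numeric(mnist_image)),
                         name = "keras_mnist")
```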

The great thing is that JJ was actually at Next two years ago introducing a lot of these libraries, so this has been around for almost two years now. It's a rich ecosystem, and RStudio has great resources, tutorials, and gallery samples for you to look at use cases across industries, everything from healthcare to traditional time-series forecasting, which gets a boost from deep learning techniques, as well as natural language. So please check that out; it's all quite mature and available today.

So now I'll kick it off to Chris to talk about Spark with R on GCP. Thanks. I'm a product manager here at Google Cloud on the open data analytics team, and I'm here to talk about how Spark, combined with Cloud Dataproc, can help you scale your R analysis. If you're not familiar with Cloud Dataproc, it is our managed Hadoop and Spark service. Many of you probably hear the word Hadoop and zone out instantly, thinking, okay, that old legacy technology, and that's valid in some ways.

But Spark is still a very active community for machine learning. In fact, in about two weeks, right here, there's going to be a Spark AI conference. The reason is that Apache Spark is a computing framework that lets you scale across a cluster of computers in a generalizable way, and that's perfect for building a rich ecosystem of ML libraries. The other great thing about Spark is that it lets us write R code, and that means Cloud Dataproc can scale your existing R analysis without requiring substantial code rewrites.

Cloud Dataproc is often the fastest way to move your analysis and ML with R into the cloud. Our goal with Cloud Dataproc is to let you take the open-source tools, algorithms, and programming languages like R that you use today, but make it easy to apply them to cloud-scale datasets, without having to manage things like clusters and computers. Now, there are three ways to run R on Dataproc today. There's RStudio, the standard web IDE, which can be auto-installed on the Dataproc master node.

And then there are two different packages for running R on Spark, sparklyr and SparkR. They have a lot of similarities, but there are some differences that I'm going to explain as well. sparklyr is a package for working with R on Spark that was developed by RStudio, so you're going to see a lot of great hooks and integration between RStudio and sparklyr. SparkR is built directly into Spark itself, so you're going to find great hooks there with things like the Cloud Dataproc jobs API.

That jobs API makes it really easy to submit R jobs through things like the gcloud command or an HTTP endpoint, so you can build automated tooling around your R code. What I'd like to do now is take a quick peek under the hood of both SparkR and sparklyr, to help you understand how these tools actually scale your R analysis. This is a very high-level SparkR architecture. There are two pieces to Spark: the Spark driver and the Spark workers.

Again, very high level, but oftentimes the Spark driver, depending on what mode of Spark you're in, runs exactly where your RStudio is running, and you spin up what's called a SparkR context. You can think of that essentially as a client for calling into Spark. Under the hood, what happens in that SparkR context is that there's a bridge, done over a socket layer, that translates your R code into Java code, and then there's a Java SparkContext that actually scales your work across the various Spark executors, which you can think of as nodes in the cluster, at a high level.

Now, what happens on those Spark executors? Each one actually makes a call out to a local version of R, and so that's real R. That means any package you can use with R, you can scale with SparkR, because it is real R code that you're running, just broken up across the cluster. Now, the alternative architecture to that is sparklyr.

What sparklyr does instead is serialize the information, using Apache Arrow, and then use it to call out to the Spark ML package. That means, in terms of machine learning, you're going to be limited to the functionality that exists in the spark.ml package. Despite that, there are some pretty cool features I've found in sparklyr that are missing from SparkR. For instance, it has a direct integration with dplyr, it has an XGBoost extension, and it also integrates with broom.

If you're not familiar with broom, it's a pretty neat package that takes the output of a lot of unstructured, messy results (say we run a bunch of t-tests and they come back as a pile of unstructured information) and turns it into a nice tidy dataframe. That lets you look at the results of multiple models at once pretty quickly, which is pretty interesting.
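
A rough sketch of that sparklyr workflow, with dplyr verbs and broom (the connection settings and data are placeholders):

```r
library(sparklyr)
library(dplyr)

# Connect to Spark on the cluster (on Dataproc, typically via YARN).
sc <- spark_connect(master = "yarn")

# dplyr verbs against a Spark table; sparklyr translates them for Spark.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

fit <- mtcars_tbl %>%
  filter(hp > 50) %>%
  ml_linear_regression(mpg ~ wt + hp)

# broom turns the fitted Spark model into a tidy dataframe.
broom::tidy(fit)

spark_disconnect(sc)
```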

Of course, this begs the question I'm sure everyone has right now: okay, which do you want me to use, SparkR or sparklyr? SparkR, I find folks use a little more for automation. First of all, it's built into Cloud Dataproc and Spark itself, so you get a little bit better performance. sparklyr, again, comes from the tidyverse RStudio team, so it tends to have a little better integration with RStudio. SparkR comes from AMPLab, the people who did Spark, so you'll find they tend to follow things in a little more Spark-first kind of way; in other words, it mostly follows the Scala way of doing things. Whereas sparklyr thinks a little more R-first, and sometimes they'll do an R integration at the sacrifice of something like performance. But they're both active projects, and when there are gaps like that, what you find is they tend to get resolved pretty quickly.

So often it just comes down to where you want to spend your time: if it's automating R code, you use SparkR; if it's working with interactive RStudio-style analysis, you go with sparklyr. But if you're still a little confused about the difference and what to use, don't worry, I probably am too. Luckily, I've found it's not really an issue to use both. All you do is load the sparklyr package, and you can then just call into SparkR. sparklyr does mask some of the deeper-layer functions, but as long as you handle the masking correctly, you can go ahead and use both, and a set of code that has both sparklyr and SparkR can be submitted to the Dataproc jobs API.
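
One way that mixed loading might look (a sketch; whether the two packages share a Spark session depends on your setup, and the :: qualifiers are just one way to handle the masking):

```r
library(sparklyr)
library(dplyr)
library(SparkR)   # loaded last, so its names win any conflicts

# SparkR masks several dplyr/sparklyr generics (filter, count, collect...).
# Qualifying calls with :: keeps both usable in one script:
sc  <- sparklyr::spark_connect(master = "yarn")
tbl <- dplyr::copy_to(sc, mtcars, overwrite = TRUE)
out <- tbl %>% dplyr::filter(mpg > 20) %>% dplyr::collect()

# ...while SparkR functions remain available explicitly, for example
# SparkR::spark.lapply and SparkR::dapply against its own session.
```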

So what I'd like to do now is quickly walk through an example of how you can use Cloud Dataproc to scale your model analysis. We're going to build a model like the one you see here, running through a workflow where the data starts in BigQuery, which is often a common store for semi-structured or unstructured datasets.

We're going to spin up an auto-scaling Cloud Dataproc cluster to actually do our machine learning, and we're going to save the output of that model into Google Cloud Storage. That way, later on, when I want to rerun this, I can use a Cloud Dataproc workflow template, which is essentially a graph of jobs that you submit to Dataproc: it spins up a right-sized cluster to run those jobs, runs them, makes sure the cluster is torn down, and then saves all the output of what happened, all the logs, into Stackdriver. So, do you want to take over on your laptop, flip over? Okay.

Okay. So, as a product manager here at Google Cloud, a lot of my time is spent building articles or blog posts that I put out onto the internet, and what I would really love is a machine learning model that would read my blog post and tell me how successful it is going to be on the internet. That might give me an opportunity to go back and tweak some things before I actually try to publish it. Whenever I'm looking for datasets like this, I always start with the BigQuery public dataset program.

That program has over a hundred eighty different publicly available datasets, stored for free, that I can use to augment my analysis. What I found here is a great dataset for Hacker News. It has, basically, a score for how well each article did on the internet, a combination of upvotes and downvotes, and it also has the full text of the article, so I can use that to train.

What you're seeing in this query is that I'm taking scores greater than 0 and text length greater than zero, just to get a nice, clean dataset. What I'm trying to do here is, although BigQuery is awesome at SQL, take advantage of R and some of the machine learning capabilities that Cloud Dataproc opens up for me. So I'm going to put this into a table. I could work directly with the query, but I'm going to make it easy on myself, take a thousand-row sample, and put it into a Hacker News sample table. At the bottom of the screen here, you can see what I'm getting back, which is a score that I'm going to use as a label, along with the full text.

Once that job has completed, I jump back over to Cloud Dataproc, click on my cluster, and come into a web interface using our Component Gateway. We expose the various web UIs that are running on your cluster and make sure that you can get into them using your Cloud IAM credentials. So you saw me jump into Zeppelin, which is a Spark-friendly notebook, but it also works with R.

The first thing you see me do here is just rerun that BigQuery query. I could have run it directly in Zeppelin, there is an interpreter for that, I'm just showing it. But what I really want to do is run a Python function that lets me take an arbitrary table, my Hacker News sample, and pull it into a Spark dataframe. Once it's in a Spark dataframe, I do a quick count to establish that, hey, there are a thousand rows in that sample; that matches my limit, so I'm probably on the right track.

Now what I want to do is register this dataframe, and it's one line: I register my dataframe as a temp table, and that lets me work with this exact same dataframe across all the languages of Spark, that's Scala, SQL, Python, or R. I can go back and forth without having to move the structure or data around; it's all the same dataframe. And you can see now that I can start to do some of my basic R explorations, things that look completely normal, like filtering on score greater than 50, but I can do this on cloud-scale datasets instead of just on the R frontend of my laptop.
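
A hedged sketch of those steps in SparkR (the table name is a placeholder, and it assumes a Spark BigQuery connector is available on the cluster):

```r
library(SparkR)
sparkR.session()

# Pull the BigQuery sample table into a distributed Spark DataFrame.
df <- read.df(source = "bigquery",
              table = "my-project.demo.hacker_news_sample")

count(df)  # sanity check: should match the 1,000-row sample

# Register it so Scala, SQL, and Python cells can share the same data.
createOrReplaceTempView(df, "hacker_news_sample")

# Ordinary-looking R exploration, now running on the cluster.
head(filter(df, df$score > 50))
```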

Now, there are two ways I can actually scale analysis using Spark. First, here's an example where I'm just going to tune a bunch of models: I can send a glm across Spark with a lot of different hyperparameters, tune a lot of smaller models at once, get those all back in a list, and then analyze which model performed the best. But in this example, I'm actually not going to do that list-apply; I'm going to do a dapply, because I just want to run one NLP package that I love in Spark across that whole dataset of Hacker News.
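
The two patterns side by side, as a sketch on toy data rather than the Hacker News corpus:

```r
library(SparkR)
sparkR.session()

# Pattern 1: spark.lapply fans many small, independent model fits out
# across the cluster and returns the fitted models in a list.
formulas <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + disp)
fits <- spark.lapply(formulas, function(f) glm(f, data = mtcars))

# Pattern 2: dapply runs an arbitrary R function over each partition of
# a distributed DataFrame; dapplyCollect gathers the results locally.
df  <- createDataFrame(mtcars)
out <- dapplyCollect(df, function(part) {
  part$power_to_weight <- part$hp / part$wt   # stand-in for an NLP step
  part
})
```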

In this case, I just have my thousand samples that came back, but that's enough for me to say, yep, I think I'm on the right track with this model. So what I want to do now is go back up to the function where I was calling my sample table and change that to the full Hacker News dataset, the entire corpus of posts, not just my sample. Then, just by kicking this off, I rerun all that same code you just saw, except this time on the full Hacker News dataset. And as this kicks off, what I want to do while it runs is come back into Dataproc and take a look at what's happening underneath the hood.

I started with three VMs, which is totally fine for my thousand-row sample size, manipulating the data and getting ready. But about two minutes later, on the large dataset, as I'm pulling that data in from BigQuery, the Dataproc autoscaler says, hey, you need a little bit more help, let me add a couple of nodes to the cluster. Then a few minutes later, there's another spike in the YARN memory usage, which is because I'm now actually training on that text data, so the cluster got even bigger.

But once the model training completes, I just refresh the page, and the Dataproc autoscaler has said, okay, you've used all the memory you need, and it scales my cluster back down for me. All of that was able to occur without me ever leaving the notebook or changing environments; I just clicked run, and the Cloud Dataproc autoscaler knew how to respond. Finally, if I want to save this model off, and not keep a running cluster but still have something around to rerun it, I can create a workflow template.

I'm not going to step through all of the steps here, but essentially it would end up looking like this in the console, where any time I want to rerun all of this with the associated cluster, I would just click run, and that would be it. All right, let's flip back to the slide deck now. If you want to go and get started and do that exact same analysis, this code is what I used to spin up the cluster. It's pretty straightforward: I'm calling into Cloud Dataproc and saying use the optional components Jupyter, Zeppelin, and Anaconda in this example; you could also throw in RStudio.

I used Zeppelin because I wanted to go across languages, but you could definitely use RStudio with an initialization action. Then I installed the BigQuery connector and enabled the Component Gateway; that's what exposed those web UI links for me and made sure I had permission to get into them. And finally, the autoscaling policy. You can actually tune the autoscaling policy; there are a lot of knobs exposed. If you don't care about those knobs, you don't have to use them, but they are there if you want to tune the autoscaler's settings for how aggressively you want it to scale, or how often.

So with that, I'm going to hand it over to Greg. Thanks, everyone, for joining us this morning. We've heard about some great ways to leverage the TensorFlow and Spark ecosystems using R on GCP, but what about the cases where you need to train and deploy a model using any R library? What's the solution for that? First we need to set up our development environment, and Cloud ML Notebooks is a great tool on Google Cloud for data scientists to do this.

We can get started with one click, easily scale up with Compute Engine machine types, add and remove GPUs as necessary, leverage both R and Python from the notebook, enable team collaboration with git integration, and, with AI Hub, publish notebooks for others. Specifically for R users, we can get started in four steps. We launch a PyTorch ML notebook, which has conda installed, and that makes it easier to install our R base package and libraries. We also load the rpy2 extension to enable the R cell magic, and this is what allows us to run both R and Python in the same notebook.

We create a directory in our home folder for libraries, so we can install our own libraries somewhere we have write access. But what should we do in those situations where the resources on our development workstation aren't adequate, like running out of memory when working with a large dataset? Let's look at how we can package our code in containers for running on Cloud ML Engine and Kubernetes Engine. For those just getting started with containers, you can think of them as a method for packaging an app with its dependencies so it can run on a variety of platforms, not just on ML Engine.

Next, I'll walk through building a pipeline based on this architecture. We're going to import and analyze our data with BigQuery, using standard SQL to query our dataset, and stage it as a CSV file in Google Cloud Storage. We'll write a training application in R with Jupyter in ML Notebooks, package it up with Docker, and send it over to ML Engine. The model assets created by the training app are stored in Google Cloud Storage; we'll take those assets and package them in another container that we can deploy for serving, and this will leverage Google Kubernetes Engine, Seldon, and s2i.

All right, first we'll navigate over to ML Engine and ML Notebooks. I could create a new PyTorch instance, but I already have one created, so we're going to open this one to start. I pulled some code from GitHub, so I'm going to start here with a notebook I've already created. I have conda, and I use conda to install R and rpy2, so I'll go ahead and load the extension.

I created the directory for my libraries, and I can check it and see that I've installed some already and have access to them. Then I can use the R cell magic to install whatever library I need. We want to get data from BigQuery, so we can use the BigQuery cell magic, and our data will be exported into a dataframe which we can then import into R. Once it's in R, we can look at the structure of the data we pulled in; here we're looking at public baby weight (natality) data from BigQuery. We can also build a model, just a traditional linear model, to get started.

But we want to do more than that, so let's take this data and stage it in Cloud Storage so we can run some other models and scale it up to ML Engine. I'm going to go into the BigQuery console and paste that query in there. This can be done from within R using the API, but it's just as easy to use the console. First, I'm querying about 68,000 rows just for testing, so I'm going to have two CSV files, and I'll export this one directly to Google Cloud Storage. This is my storage bucket, and train_data_small is going to be that file.

Then I'll change this query to sample a bit more data, and we'll use that dataset when we scale to ML Engine. When it finishes, we'll see that we have about 13 million rows. You can also see that we're doing a little bit of preprocessing in the query, and it's good to keep in mind that BigQuery is very performant for preprocessing data with SQL, so this would be a good case to use SQL for some of that instead of R, potentially. So I export the CSV to Cloud Storage.

And now I've got my data staged for training. I'll head back over to Jupyter and start working on writing my training app. First, I create the trainer folder that you can see in the left sidebar there, and I create a file for installing my R dependencies: it sets the library path based on an environment variable and then specifies the packages I need.
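
A sketch of what that install script might contain (the file name and environment variable are assumptions):

```r
# install_packages.R -- sketch of the dependency script described above.

# Install into a user-writable library chosen via an environment variable.
lib_path <- Sys.getenv("R_LIBS_USER", unset = "/opt/r-libs")
dir.create(lib_path, recursive = TRUE, showWarnings = FALSE)
.libPaths(lib_path)

pkgs <- c("optparse", "caret", "glmnet")
install.packages(pkgs, lib = lib_path, repos = "https://cloud.r-project.org")
```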

Then, in the training app, I specify that same library path and configure command-line arguments for the Google Cloud Storage locations of the training data and the export directory. I use caret to impute missing values and to center and scale my data. I change the features I need into factors, partition the data into test and train sets, set up hyperparameter tuning with cross-validation, and then I can kick off my training job, using glmnet for this example. When the training is complete, it saves my model file as well as my preprocessing weights to Google Cloud Storage, and I could also export any additional assets I wanted, like plots or other results.
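
A condensed sketch of such a training app (the column names assume the natality data; flags, paths, and defaults are illustrative, not the session's exact code):

```r
# train.R -- condensed sketch of the training app described above.
.libPaths(Sys.getenv("R_LIBS_USER", unset = "/opt/r-libs"))
library(optparse)
library(caret)

opts <- parse_args(OptionParser(option_list = list(
  make_option("--train_csv",  default = "train_data_small.csv"),
  make_option("--export_dir", default = "export")
)))

df <- read.csv(opts$train_csv)
df$is_male <- as.factor(df$is_male)   # convert needed features to factors

idx       <- createDataPartition(df$weight_pounds, p = 0.8, list = FALSE)
train_set <- df[idx, ]
test_set  <- df[-idx, ]

# caret handles imputation, centering/scaling, cross-validation, and the
# glmnet hyperparameter grid in one call.
fit <- train(weight_pounds ~ ., data = train_set,
             method     = "glmnet",
             preProcess = c("medianImpute", "center", "scale"),
             trControl  = trainControl(method = "cv", number = 5))

dir.create(opts$export_dir, showWarnings = FALSE)
saveRDS(fit, file.path(opts$export_dir, "model.rds"))
# The demo then copies these assets to Cloud Storage (e.g. with gsutil).
```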

Next, I turn to Docker to package this up. I take the base R image, install the Google Cloud SDK so I can download my files from Cloud Storage, copy over the script that installs packages, set the library environment variable, and copy over the training app, and then I use ENTRYPOINT to tell ML Engine, hey, this is the script to run when the container kicks off. Before we go ahead and deploy the container to Container Registry, let's just make sure the training app runs locally, that we get the model file saved, and let's also test inference. This is an end-to-end deployment; we want to make sure inference works as well as training before we go ahead and start training on ML Engine.

We see we get a result returned, so we'll go ahead and run the Docker build to package up our training container, and then we can deploy it to Google Container Registry. Once it's in Container Registry, it's accessible to ML Engine, so we'll submit a job referencing that container, referencing our training data in the storage bucket and the output directory, and we can also specify a machine type, or a cluster of machines with accelerators, if necessary.

When the job is submitted, we can go back over to the console and use ML Engine to view the job running. We'll be able to see our utilization metrics and our logs, and you can add additional logging as needed, based on your training app. When the model is complete, the model assets are written to Cloud Storage, so we can go into Cloud Storage and see the assets from the prior jobs that have run. I've got a few folders here from prior jobs; let's look at the most recent. Here we can see the model RDS file and the preprocessing weights.

So then let's go back into Jupyter and serve our completed model. We're going to install s2i, which is Source-to-Image, a tool that's just going to make it easier to build the container image without a Dockerfile. Essentially, Source-to-Image and Seldon require an environment file, and they require an install.R file, which installs the dependencies. We're going to download the model assets from Google Cloud Storage and include them in the container, and then we're going to set up a runtime file.

The runtime file loads the RDS model file and the preprocessing file; it takes the input from a prediction request, does the preprocessing, and then evaluates against the model. Once we have all these files, we can call s2i to build our serving image. So very easy: we don't even have to build a Dockerfile here, we just build our R files for serving, and it gets packaged up for us. Then we can test it locally and run requests against it locally. We see we get what we were expecting; we can also view logs if we need to debug, and then stop the container before restarting it.
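
A generic sketch of that runtime logic (the function and file names are hypothetical; Seldon's R wrapper defines its own entry-point conventions, so treat this as the shape, not the exact API):

```r
# runtime.R -- load the saved model and preprocessing weights, then
# transform each prediction request before scoring it.

model   <- readRDS("model.rds")
preproc <- readRDS("preprocessing.rds")   # e.g. a caret preProcess object

predict_instance <- function(newdata) {
  x <- predict(preproc, newdata)   # same preprocessing as at training time
  predict(model, x)
}
```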

So we go ahead and push the working container that we have to Container Registry. And if we didn't have a cluster already started, we would create a Kubernetes cluster; we can see that we already have one running. So we'll deploy the application to the cluster we have running, and then verify that it's been deployed and that the pods are running for the application. Then we'll expose the deployment on port 80 and get an external IP address for the app. With our external IP address, we can make another request, and we see the result we were expecting, so it looks like we're good to go.

A nice thing with Kubernetes Engine, our managed Kubernetes service, is that we can go into the console and view the pods and the application running. We can see utilization metrics, and we can view logs from the container as well. So with that, I'll pass it back over to Mikhail to wrap up for us. Thank you. Thanks, Greg. So, just in closing, hopefully this image now makes a little bit more sense. At the very top, starting with the notebooks, we saw Chris talk about the new web interfaces that come with Dataproc and allow you to quickly spin up these notebooks.

ML Notebooks is actually a managed Jupyter service that also went beta on March 1st, a new service that a lot of you may not have heard of before. I believe both demos as well as my reference architecture showcased all three, so you got a sense of what each of those three looks like. We also walked through different storage techniques, both BigQuery as well as GCS. From a Spark perspective, I think the autoscaling is something that I've really been using quite a bit myself, as well as when working with a lot of customers, because there's no need to constantly right-size your workloads or have a little more capacity than you need sitting idle; that autoscaling feature has the logic built in for when the cluster needs to grow.

On the bottom right, for what Greg walked through, we'll make sure to provide the code on GitHub, but the key takeaway again is hybrid and portable. These are all containers: you can use them on Google Cloud to train or to create a serving application, and they are fully portable to be moved over to another cloud or on-premises.

And Seldon specifically, the serving approach Greg walked through, is part of the broader Kubeflow ecosystem, where we are making it very easy to connect into the Kubernetes ecosystem. Here's a quick cheat sheet of both the services and the R libraries; it abbreviates quite a bit, and we'll try to get some resources out there online. The first two rows cover the data ingestion GCP services, as well as the R packages that allow you to access these very quickly, directly in R code. We walked through different ways of doing that, within your R code as well as directly in the console, whatever you're more comfortable with; there are various options there.

Then the bottom three rows cover the training service, the serving service, and the actual R library packages you can leverage to access them in a familiar environment. The last column does, of course, require a bit of container knowledge, Docker and Kubernetes, but it really does future-proof you and makes your cloud R workloads portable moving forward. So what now? We have a couple of bit.ly links for the Spark ecosystem.

Chris published a blog post a few months back that's a great primer on how to get running with Spark on Dataproc. tensorflow.rstudio.com is a great repository for resources, and RStudio's JJ actually partnered with the creator of Keras to publish the book Deep Learning with R, as well as the great cheat sheets RStudio is known for. For the examples that Greg walked through, there's a lot of great information at the GitHub link there, under his GitHub name, with Google Cloud R examples.

There are also some bit.ly links for a new feature within Cloud Machine Learning Engine for custom containers, as well as for Kubeflow and Seldon.
