TensorFlow in production: TF Extended, TF Hub, and TF Serving

Andrew Gasparovic
Lead Developer at Google
2018 Google I/O
May 9, 2018, Mountain View, USA

About speakers

Andrew Gasparovic
Lead Developer at Google
Jeremiah Harmsen
Lead of Brain Applied Zurich at Google
James Pine
Software Engineer at Google

Andrew leads the TensorFlow Hub team at Google Research Europe in Zurich. Before building machine learning infrastructure, he worked on distributed storage systems like Bigtable at Google in New York and low-latency transaction processing at ITA Software in Cambridge, MA. Andrew is a private pilot and enjoys exploring Europe with his wife.


Jeremiah Harmsen joined Google in 2005 where he has founded efforts such as TensorFlow Hub, TensorFlow Serving, and the machine learning ninja rotation. He currently leads the Applied Machine Intelligence group at Google AI Zurich. The team increases the impact of machine learning through consultancy, state-of-the-art infrastructure development, research and education.


About the talk

This session will introduce TensorFlow Extended (TFX) and TensorFlow Hub, and announce new innovations and features in TensorFlow Serving. As machine learning evolves from experimentation to serving production workloads, so does the need to effectively manage the end-to-end training and production workflow, including model management, versioning, and serving. TFX provides this solution at Google, and you'll hear about the release plans to deliver it to the community. TensorFlow Hub is a central repository of reusable parts of TensorFlow models. With its libraries, you can incorporate these parts into your models for transfer learning and package them up to be served with TensorFlow Serving.

Transcript

Welcome everyone. I'm Jeremiah, and this is TensorFlow in production. I'm excited that you're all here, because that means you're excited about production, and that means you're building things that people actually use. So the talk today has three parts, and I want to start by quickly drawing a thread that connects all of them. The first thread is the origin of these projects. These projects really come from our teams that are on the front lines of machine learning. So these are real problems that we've come across doing machine learning at Google scale, and these are the

real solutions that let us do machine learning at Google. The second thing I want to talk about is this observation: if we look at software engineering over the years, we see this growth as we discover new tools and we discover best practices. We're really getting more effective at doing software engineering, and we're getting more efficient. We're seeing the same kind of growth on the machine learning side, right? We're discovering new best practices and new tools. The catch is that this

growth is maybe 10 or 15 years behind software engineering, and we're also discovering a lot of the same things that exist in software engineering but in a machine learning context. So we're doing things like discovering version control for machine learning, or continuous integration for machine learning. So I think it's worth keeping that in mind as we move through the talks. The first one up is going to be TensorFlow Hub, and this is something that lets you share reusable pieces of machine learning much the same way we share code. Then we'll talk a little bit about

deploying machine learning models with TensorFlow Serving, and we'll finish up with TensorFlow Extended, which wraps a lot of these things together in a platform to increase your velocity as a machine learning practitioner. So with that, I'll hand it over to Andrew to talk about TF Hub. Thanks, Jeremiah. Hi everybody. I'm Andrew Gasparovic, and I'd like to talk to you a little bit about TensorFlow Hub, which is a new library that's designed to bring reusability to machine learning. So software repositories have been a

real benefit to developer productivity over the past 10 or 15 years, and they're great first of all because when you're writing something new, if you have a repository, you think, maybe I'll check whether there's something that already exists and reuse that rather than starting from scratch. But a second thing that happens is you start thinking, maybe I'll write my code in a way that's specifically designed for reuse, which is great because it makes your code more modular, but it also has the potential to benefit a whole community if

you share that code. What we are doing with TensorFlow Hub is bringing that idea of a repository to machine learning. In this case, TensorFlow Hub is designed so that you can create, share, and reuse components of ML models. And if you think about it, it's even more important to have a repository for machine learning, even more so than for software development, because in the case of machine learning, not only are you reusing the algorithm and the expertise, but you're also reusing the potentially enormous amount of compute power that went into training the model, and all of the training data as well.

So all four of those, the algorithm, the training data, the compute, and the expertise, all go into a module, which is shareable on TensorFlow Hub, and then you can import those into your model. And those modules are pre-trained, so they have the weights and the TensorFlow graph inside. And unlike a model, they're designed to be composable, which means that you can put them together like building blocks and add your own stuff on top. They're reusable, which means that they have common signatures so that you can swap one for another, and trainable, which means that you can actually backpropagate through a

module that you've inserted into your graph. So let's take a quick look at an example. In this case we'll do a little bit of image classification: we want to classify rabbit breeds from photos, but we only have a few hundred example photos, probably not enough to build a whole image classifier from scratch. But what we could do is start from a general-purpose model, take the reusable part of it, the architecture and the weights, take off the classification layer, and then add our own classifier on top and train it with our own examples. We'll keep that

reusable part fixed and we'll train our own classifier on top. So if you're using TensorFlow Hub, you start at tensorflow.org/hub, where you can find a whole bunch of newly released, state-of-the-art, research-oriented, and well-known image modules. Some of them include the classification layers, and some of them chop off the classification layers and just output feature vectors. That's what we want in our case, because we're going to

add classification on top. So maybe we'll choose NASNet, which is an image module that was created via neural architecture search, and we'll choose the large variant that outputs feature vectors. So we just paste the URL for the module into our TF Hub code, and then we're ready to use that module just like a function. In between, the module gets downloaded and instantiated into your graph. So all you have to do is get those feature vectors, add your own classification on top, and output

the new categories. So specifically, what we're doing is training just the classification part while keeping all of the module's weights fixed. But the great thing about reusing a module is that you get all of the training and compute that has gone into that reusable portion. So in the case of NASNet, it was over 62,000 GPU hours that went into finding the architecture and training the model, plus all of the expertise, the testing, and the research that went into NASNet. You're reusing all of that in that one line of code.
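
To make that concrete, here is a minimal sketch assuming the TF Hub `hub.Module` API and an illustrative NASNet feature-vector module URL; the input size and the number of rabbit breeds are made up for the example.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Illustrative module handle; any image feature-vector module from tfhub.dev would work here.
MODULE_URL = "https://tfhub.dev/google/imagenet/nasnet_large/feature_vector/1"

images = tf.placeholder(tf.float32, shape=[None, 331, 331, 3])  # NASNet-large input size

# The module is downloaded and instantiated into the graph, then called like a function.
module = hub.Module(MODULE_URL)
features = module(images)            # pre-trained feature vectors, weights kept fixed

# Our own classifier on top; only these weights are trained.
logits = tf.layers.dense(features, units=5)  # e.g. 5 rabbit breeds
```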

As I mentioned before, those modules are trainable, so if you have enough data, you can do fine-tuning with the module. If you set that trainable parameter to true, and you select that you want to use the training graph, what you'll end up doing is training the entire thing along with your classification. The caveat being, of course, that you have to lower the learning rate so that you don't ruin the weights inside the module. But if you have enough training data, it's something that you can do to get even better accuracy.
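
And a hedged sketch of that fine-tuning setup, again with an illustrative module URL; the `trainable` flag and the `train` tag are the knobs being described here.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the same (illustrative) module with trainable weights and the training-specific graph,
# so its variables are updated during backprop along with our classifier.
module = hub.Module(
    "https://tfhub.dev/google/imagenet/nasnet_large/feature_vector/1",
    trainable=True, tags={"train"})
features = module(tf.placeholder(tf.float32, shape=[None, 331, 331, 3]))
# Use a lower learning rate than usual (e.g. tf.train.AdamOptimizer(1e-5))
# to avoid ruining the pre-trained weights.
```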

And in general, we have lots of image modules on TF Hub. We have ones that are straight out of research papers, like NASNet; we have ones that are great for production, even ones made for on-device usage, like MobileNet; plus all of the industry-standard ones that people are familiar with, like Inception and ResNet. So let's look at one more example, in this case doing a little bit of text classification. We'll look at some restaurant reviews and decide whether they're

positive or negative sentiment. And one of the great things about TF Hub is that all of those modules, because they're TensorFlow graphs, can include things like pre-processing. So the text modules that are available on TF Hub take full sentences and phrases, not just individual words, because they have all of the tokenization and pre-processing stored in the graph itself. So we'll use one of those, and it's basically the same idea: we're going to select the sentence embedding module, we'll add our own classification on top, and

we'll train it with our own data, but we'll keep the module itself fixed. And just like before, we'll start by going to tensorflow.org/hub and take a look at the text modules that are available. In this case, maybe we'll choose the Universal Sentence Encoder, which was just recently released, based on a research paper from last month. The idea is that it was trained on a variety of tasks, and it's specifically meant to support being used with a variety of tasks. It also takes just a very small amount of training data to use it in your

model, which is perfect for our example case. So we'll use that Universal Sentence Encoder, and just like before we'll paste the URL into our code. The difference here is we're using it with a text embedding column. That way we can feed it into one of the high-level TensorFlow Estimators, in this case the DNNClassifier, but you could also use that module like I showed in the earlier example, calling it just as a function. And if you are using the text embedding column, that also, just like in the other example, can be trained as well.
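
A minimal sketch of that combination, assuming TF Hub's `text_embedding_column` helper and an illustrative Universal Sentence Encoder URL; the feature key and hidden-unit sizes are made up.

```python
import tensorflow as tf
import tensorflow_hub as hub

# An embedding column backed by an (illustrative) sentence-embedding module from tfhub.dev.
review_column = hub.text_embedding_column(
    key="review",
    module_spec="https://tfhub.dev/google/universal-sentence-encoder/1",
    trainable=False)  # keep the module itself fixed

# Feed it into a high-level Estimator for positive/negative sentiment.
classifier = tf.estimator.DNNClassifier(
    hidden_units=[64, 16],
    feature_columns=[review_column],
    n_classes=2)
```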

And just like in the other example, it's something that you can do with a lower learning rate if you have a lot of training data, and it may give you better accuracy. We have a lot of text modules available on TF Hub. We actually just added three new languages to the NNLM modules: Chinese, Korean, and Indonesian. Those are all trained on Google News training data. And we also have a really great module called ELMo, from some recent research, which understands words

in context, and of course the Universal Sentence Encoder, as I talked about. Just to show you for a minute some of those URLs that we've been looking at, maybe we'll take apart the pieces here. tfhub.dev is our new source for Google and selected-partner published modules. In this case, Google is the publisher, universal-sentence-encoder is the name of the module, and the one at the end is a version number. So TensorFlow Hub considers

modules to be immutable. And so the version number is there so that if you're doing one training round and then another, you don't have a situation where the module changes unexpectedly. So all modules on tfhub.dev are versioned. And one of the nice things about those URLs is that if you paste them into a browser, you get the module documentation. The idea being that maybe you read a new paper, and there's a URL for a TF Hub module in it: you paste it into your browser, you see the documentation, you paste it into some code, and in one line you're able to try out the new research.

And speaking of the Universal Sentence Encoder, the team just released a new lite version which is a much smaller size, about 25 megabytes, and it's specifically designed for cases where the full text module wouldn't work, for doing things like on-device classification. Also today we released a new module from DeepMind: this one you can feed in video, and it will classify and detect the actions in that video. So in this case, it correctly guesses that the video is of people playing cricket.

And of course, we also have a number of other interesting modules. There's a generative image module which is trained on CelebA; it has a progressive GAN inside. And also a Deep Local Features (DELF) module, which can identify the key points of landmark images. Those are all available now on tfhub.dev. And last but not least, I wanted to mention that we just announced our support for TensorFlow.js. So using the TensorFlow.js converter, you can directly convert a TF Hub module into a format that can be used on the web. It's a really simple

integration to be able to take a module and use it in the web browser with TensorFlow.js, and we're really excited to see what you build with it. So just to summarize, TensorFlow Hub is designed to be a starting point for reusable machine learning, and the idea is, just like with a software repository, before you start from scratch, check out what's available on TensorFlow Hub, and you may find that it's better to start with a module and import that into your model rather than starting the task completely from

scratch. We have a lot of modules available, and we're adding more all the time, and we're really excited to see what you build. So thanks. Next up is Jeremiah to talk about TF Serving. All right, thank you Andrew. So next, TensorFlow Serving. This is going to be how we deploy models. Just to get a sense for where this falls in the machine learning process: we start with our data, we'll use TensorFlow to train a model, and the output artifacts are these models, right?

These are saved models; they're a graph representation of the dataflow. And once we have those, we want to share them with the world. That's where TensorFlow Serving comes in. It's this big orange box. This is something that takes our models and exposes them to the world through a service, so clients can make requests. TensorFlow Serving will take them, run the inference, run the model, come up with an answer, and return that in a response. So TensorFlow Serving is actually the libraries and binaries you need to do this production-grade

inference over trained TensorFlow models. It's written in C++, supports things like gRPC, and plays nicely with Kubernetes. Digging in, it has a couple of features. The first and most important is that it supports multiple models. So on one TensorFlow model server, you can load multiple models, right? And just like most folks probably wouldn't push a new binary right to production, you don't want to push a new model right to production either. So having these multiple models in memory lets you be serving one

model on production traffic, and load a new one and maybe send it some canary requests, make sure everything's all right, and then move the traffic over to that new model. And this supports doing things like reloading: if you have a stream of models you're producing, TensorFlow Serving will transparently load the new ones and unload the old ones. We've built in a lot of isolation: if you have a model that's serving a lot of traffic in one thread and it's time to load a new model, we make sure to do that in a separate thread, so that way we

don't cause any hiccups in the thread that's serving production traffic. And again, this entire system has been built from the ground up to be very high throughput. Things like selecting those different models based on the name, or selecting different versions, that's very, very efficient. It also has some advanced batching; this way we can make use of accelerators. We also see improvements on standard CPUs with batching. And then there are lots of other enhancements, everything from protocol buffer magic to lots more. And this is really what we

use inside Google to serve TensorFlow. I think there are over 1,500 projects that use it. It serves somewhere in the neighborhood of 10 million QPS, which ends up being about a hundred million items predicted per second. And we're also seeing some adoption outside of Google. One of the new things I'd like to share today is distributed serving. So looking inside Google, we've seen a couple of trends. One is that models are getting bigger and bigger; some of the ones inside Google are over a terabyte in size. The other thing we're seeing is this sharing of subgraphs, right,

TF Hub is producing these shared pieces of models. We're also seeing more and more specialization in these models as they get bigger and bigger, right? If you look at some of these model structures, they look less like a model that would belong on one machine and more like an entire system. So this is exactly what distributed serving is meant for, and it lets us take the single model and basically break it up into microservices. So to get a better feel for that, we'll say that Andrew has taken his rabbit classifier and is serving it on a model server. And we'll say that I want to create

a similar system to classify cat breeds, and so I've done the same thing. I've started from TensorFlow Hub, so you can see I've got the TensorFlow Hub module in the center there. And you'll notice that since we both started from the same module, we have the same bits of code, we have the same core to our machine learning models. So what we can do is start a third server, put the TensorFlow Hub module on that server, remove it from the servers on the outside, and put in its place this placeholder we call a remote op. You can think of this as a portal. It's kind of a

forwarding op: when we run the inference, it forwards, at the appropriate point in the processing, to the model server in the middle. There the computation is done, the results get sent back, and the computation continues on our classifiers on the outside. There are a few reasons we might want to do this. We can get rid of some duplication: now we only have one model server loading all those weights. We also get the benefit that it can batch requests that are coming from both sides. And also, we can set up different configurations. You can imagine we might have this model server

loaded with TPUs, Tensor Processing Units, so that it can do what are most likely convolutional operations and things like that very efficiently. Another place where we use this is with large sharded models. So if you're familiar with deep learning, there's this technique of embedding things like words or YouTube video IDs as a string of numbers, right? We represent them as a vector of numbers, and if you have a lot of words, or you have a lot of YouTube videos, you're going to have a lot of data, so much that it

won't fit on one machine. So we use a system like this to split up those embeddings for the words into these shards, and we can distribute them. Of course, the main model, when it needs something, can reach out and get it and then do the computation. Another example is what we call triggering models. So we'll say we're building a spam detector, and we have a full model which is a very, very powerful spam detector; maybe it looks at the words and understands the context. It's very powerful, but it's very expensive, and we can't afford to run it on

every single email message we get. So what we do instead is we put the triggering model in front of it. As you can imagine, there are a lot of cases where we're in a position to very quickly say yes, this is spam, or no, it's not. So, for instance, if we get an email that's from within our own domain, maybe we can just say that's not spam, and the triggering model can quickly return that. If it's something that's difficult, it can go ahead and forward that on to the full model, where it will process it. A similar

concept is a mixture of experts. So let's say we want to classify the breed of either a rabbit or a cat. What we're going to do is have two models we'll call expert models, right? So we have one that's an expert at rabbits and another that's an expert at cats. And here we're going to use a gating model that gets a picture of either a rabbit or a cat, and the only thing it's going to do is decide if it's a rabbit or a cat and forward it on to the appropriate expert, which will process it and will send that result

back. All right, there are lots of use cases, and we're excited to see what people will start to build with these remote ops. The next thing I'll quickly mention is a REST API. This was one of the top requests, and we're happy to be releasing this soon. This will make it much easier to integrate things with existing services, and it's nice because you don't actually have to choose: on one model server, for one TensorFlow model, you can serve either the REST endpoint or the gRPC one. There are three APIs. There are some

higher-level ones for classification and regression. There's also a lower-level predict; this is more of a tensors-in, tensors-out API for the things that don't fit into classify and regress. So looking at this quickly, you can see the URI here: we can specify the model, right, this might be rabbits or cats; we can optionally specify a version; and our verbs are classify, regress, and predict. To give two examples: in the first one, you can see we're asking the iris model to classify something, and here we aren't giving it a model version, so it'll just use

the most recent, the highest, version automatically. And in the bottom example, we're using the MNIST model, specifying the version to be 314, and asking it to do a prediction. This lets you easily integrate things, easily version models, and switch between them. I'll quickly mention the API: if you're familiar with tf.Example, you know that representing it in JSON is a little bit cumbersome, as you can see it's pretty verbose here, and there are some other warts like needing to encode things in base64.
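
To make the shape of those calls concrete, here is a hedged sketch of what requests against the REST API could look like from Python; the host, port, model names, feature keys, and version number are all illustrative.

```python
import json
import requests  # third-party HTTP client, assumed installed

SERVER = "http://localhost:8501/v1/models"  # illustrative host and port

# Predict against a specific version of a hypothetical "mnist" model.
body = {"instances": [{"images": [0.0] * 784}]}
resp = requests.post(SERVER + "/mnist/versions/314:predict", data=json.dumps(body))
print(resp.json())

# Classify with the highest available version of a hypothetical "iris" model.
body = {"examples": [{"sepal_length": 5.1, "sepal_width": 3.5,
                      "petal_length": 1.4, "petal_width": 0.2}]}
resp = requests.post(SERVER + "/iris:classify", data=json.dumps(body))
print(resp.json())
```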

Instead, with TensorFlow Serving, the REST API uses a more idiomatic JSON, which is much more pleasant and much more succinct. And this last example kind of pulls it all together, where you can use curl to actually make predictions from the command line. So I encourage you to check out the TensorFlow Serving project. There's lots of great documentation and things like that, and we also welcome contributions, code, discussion, and ideas on the project page. So let me finish with James to talk about TensorFlow Extended.

All right, so I'm going to start with a single, non-controversial statement. This has been shown true many times by many people. In short, TFX is our answer to that statement. I'll start with a simple diagram: this core box represents your machine learning code. This is the magic bits of algorithms that actually take the data in and produce reasonable results. The blue boxes represent everything else you need to actually use machine learning reliably and scalably in an actual, real production setting. The blue

boxes are going to be where you're spending most of your time; they comprise most of the lines of code. They're also going to be the source of most of the things that are setting off your pagers in the middle of the night. In our case, if we squint at this just about correctly, the core ML box looks like TensorFlow, and all of the blue boxes together comprise TFX. So I'll quickly run through four of the key principles that TFX was built on. First is flexibility: TFX can be flexible in three ways. First of all, we're going to take advantage of the flexibility built into

TensorFlow: using it as our trainer means that we can do anything TensorFlow can do at the model level, which means you can have wide models, deep models, supervised models, unsupervised models, tree models, anything that we can whip up together. Second, we're flexible with regards to input data: we can handle images, text, sparse data, multimodal models where you might want to train on images and surrounding text, or something like videos. Third, there are lots of ways you might go about actually training a model. If your goal is to build a kitten detector, you may have all of your data up

front, and you're going to build one model of sufficiently high quality, and then you're done. In contrast to that, if your goal is to build a viral kitten video detector, or a personalized kitten recommender, then you're not going to have all of your data up front. So typically you'll train a model, get it into production, and then as data comes in, you'll throw away that model and train a new model, and then throw away that model and train a new model. We're actually throwing out some good data along with those models, though. So we can try a warm-starting strategy instead, where we'll continuously

train the same model, but as data comes in, we'll warm-start based on the previous state of the model and just add the additional new data. This will frequently result in higher-quality models with faster convergence. Next, let's talk about portability. So each of the TFX modules represented by the blue boxes doesn't need to do all of the heavy lifting itself; they're part of an open-source ecosystem, which means we can lean on things like TensorFlow and take advantage of its native portability. This means we can run locally; we can scale up and run in cloud

environments; we can scale to devices that you're thinking about today, and to devices that you might be thinking about tomorrow. A large portion of machine learning is data processing, so we rely on Apache Beam, which is built for this task. And again, we can take advantage of Beam's portability as our own, which means we can use the direct runner locally, where you might be starting out with a small piece of data, building small models to affirm that your flows are actually correct, and then scale up into the cloud with a Dataflow runner, or also utilize something like the

Flink runner, or things that are in progress right now like a Spark runner. We'll see the same story again with Kubernetes, where we can start with Minikube running locally, scale up into the cloud or to clusters that we have for other purposes, and eventually scale to things that don't yet exist but are still in progress. Portability is only part of the scalability story. Traditionally, we've seen two very different roles involved in machine learning: you have the data scientists on one side and the production infrastructure engineers

on the other side. The differences between these are not just amounts of data, but the key concerns that each has as they go about their daily business. With TFX, we can specifically target use cases that are in common between the two, as well as things that are specific to each. So this will allow us to have one unified system that can scale up to the cloud and down to smaller environments, and actually unlock collaboration between these two roles. Finally, we believe heavily in interactivity: being able to get quick results with

responsive tooling and fast debugging, and this interactivity should remain even at scale, with large sets of data or large models. This is a fairly ambitious goal. So, where are we now? Today we've open-sourced a few key areas of responsibility: we have Transform, Model Analysis, Serving, and Facets. Each one of these is useful on its own, but is much more so when used in concert with the others. So let's walk through what this might look like in practice. Our goal here is to take a bunch of data we've accumulated and do

something useful for the users of our product. These are the steps we want to take along the way. So let's start at step one, with the data. We're going to pull this up in Facets and use it to actually analyze what features might be useful predictors, look for any anomalies, so outliers in the data or missing features, to try to avoid the classic garbage-in, garbage-out problem, and to try to inform what data we're going to need to further pre-process before it's useful for ML training. Which leads into our next step, which is to actually use Transform to transform our

features. So TF Transform will let you do full-pass analysis and transforms of your base data, and it's also very firmly attached to the TF graph itself, which will ensure that you're applying the same transforms in training as in serving. From the code you can see that we're taking advantage of a few ops built into Transform, and we can do things like scale, generate vocabularies, or bucketize our base data, and this will look the same regardless of our execution environments. And of course, if you need to define your own operations, you can do so.
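
As a rough sketch of what such a preprocessing function could look like, assuming the tensorflow_transform library and made-up feature names:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Illustrative tf.Transform preprocessing; these analyzers do full passes over the data."""
    return {
        # Scale a numeric feature to zero mean and unit variance.
        "age_scaled": tft.scale_to_z_score(inputs["age"]),
        # Generate a vocabulary over a string feature and map it to integer IDs.
        "query_ids": tft.compute_and_apply_vocabulary(inputs["query"]),
        # Bucketize a numeric feature into quantile buckets.
        "income_bucket": tft.bucketize(inputs["income"], num_buckets=10),
    }
```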

So this puts us at the point where we're strongly suspicious that we have data we can actually use to generate a model. So let's look at doing that. We're going to use an Estimator, which is a high-level API that will let us quickly define, train, and export a model. There's a small set of Estimators that are present in core TensorFlow, there are a lot more available, and you can also create your own. We're going to look ahead to some future steps, and we're going to purposefully export two graphs in our SavedModel: one specific to serving and one specific to model evaluation. And again, from the code you can see that in this case we're going to use a wide-and-deep model; we're going to define it, we're going to train it, and we're going to do our exports.
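
And a minimal sketch of that flow with the canned wide-and-deep Estimator; the feature columns, toy input function, and export path are all illustrative.

```python
import tensorflow as tf

age = tf.feature_column.numeric_column("age")
query = tf.feature_column.categorical_column_with_hash_bucket("query", hash_bucket_size=100)

# Wide-and-deep model: the linear (wide) part takes the sparse column directly,
# the DNN (deep) part takes dense columns, including an embedding of the sparse one.
estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=[query],
    dnn_feature_columns=[age, tf.feature_column.embedding_column(query, dimension=8)],
    dnn_hidden_units=[64, 16])

def train_input_fn():
    # Tiny in-memory dataset standing in for real training data.
    features = {"age": [23.0, 31.0, 45.0], "query": ["rabbit", "cat", "rabbit"]}
    labels = [1, 0, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

estimator.train(input_fn=train_input_fn, steps=100)

# Export a SavedModel that TensorFlow Serving can load.
feature_spec = tf.feature_column.make_parse_example_spec([age, query])
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)
estimator.export_savedmodel("/tmp/wide_deep_model", serving_input_fn)
```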

So now that we have a model, we could just push this directly to production, but that would probably be a very bad idea. So let's try to gain a little more confidence in what would happen if we actually did so for our end users. So we're going to step into TF Model Analysis. We're going to utilize this to evaluate our model over a large dataset, and then we're going to define, in

this case, one, but possibly many, slices of this data that we want to analyze independently from the others. This will allow us to actually look at subsets of our data that may be representative of subsets of our users, and see how our metrics actually track between these groups. For example, you may have sets of users in different languages, maybe accessing through different devices. Or maybe you have a very small but passionate community of rabbit aficionados mixed in with your large community of kitten aficionados, and you want to make sure that your model will actually give a positive experience to both groups equally.
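
A hedged sketch of what a sliced evaluation could look like with the tensorflow_model_analysis package; the paths and the slicing column are illustrative, and the exact API may differ between releases.

```python
import tensorflow_model_analysis as tfma

# Evaluate the model exported for evaluation over a large dataset, both overall
# and sliced by an illustrative "language" feature.
eval_result = tfma.run_model_analysis(
    model_location="/tmp/eval_model",
    data_location="/tmp/eval_data.tfrecord",
    slice_spec=[
        tfma.SingleSliceSpec(),                      # overall metrics
        tfma.SingleSliceSpec(columns=["language"]),  # one set of metrics per language
    ])

# In a notebook, render the per-slice metrics for inspection.
tfma.view.render_slicing_metrics(eval_result, slicing_column="language")
```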

So now we have a model that we're confident in, and we want to push it to serving. So let's get this up and serving some queries. This is quick: now we have a model up, and we have a server listening on port 9000 for gRPC requests. Now we're going to go back out into our actual product code, where we can assemble individual prediction requests and then send them out to our server.
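
For a feel of what that product-side code could look like, here is a hedged sketch of a gRPC client for the TensorFlow Serving prediction service; the address, model name, input key, and tensor shape are illustrative.

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the model server started above (illustrative address).
channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Assemble an individual prediction request.
request = predict_pb2.PredictRequest()
request.model_spec.name = "kitten_model"
request.inputs["image"].CopyFrom(
    tf.make_tensor_proto([[0.0] * 784], dtype=tf.float32))

# Send it out to the server and read the response.
response = stub.Predict(request, timeout=5.0)
print(response.outputs)
```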

If that slide doesn't look like your actual code, and this one looks more similar, then you'll be happy to see that this is coming soon. I'm cheating a little by showing you this now as current state, but we're super excited about this, and this is one of those real-soon-now scenarios. So that's today; what's coming next? First, please contribute and join the TensorFlow community. We don't want the only time that we're talking back and forth to be at summits and I/Os. Second, you may have seen the TFX paper at KDD last year that specifies what we believe

an end-to-end platform actually looks like. Here it is, and by "we believe this is what it looks like," I mean this is what it looks like. This is actually what's powering some of the pretty awesome AI-first products that you've been seeing at I/O and that you've probably been using yourselves. But again, this is not where we are, open-source-wise, right now. This is not the full platform, but you can see what we're aiming for, and we'll get there eventually. Please download the software, and

use it to make good things, and send us feedback. And thank you from all of us for being current users and for choosing to spend your time with us today.
