A machine learner, software developer, and researcher at Google Research.
My work involves designing and building large-scale and on-device machine learning solutions grounded in research that spans various theoretical and practical problems related to the fields of Machine Learning, Natural Language Processing (NLP), Computer Vision, and Information Retrieval. I am specifically interested in large-scale and on-device inference using deep learning, graph learning, unsupervised and semi-supervised methods and their applications to structured prediction problems in natural language processing, conversation modeling, information extraction, user modeling in social media, graph optimization algorithms for summarizing noisy data, image understanding, multimodal learning, and computational advertising. I am passionate about research, designing innovative and efficient ML solutions, and applying this to real-world problems and large-scale data, thereby helping to directly shape solutions and products used by millions of users daily.
About the talk
Neural Structured Learning is an easy-to-use, open-sourced TensorFlow framework that both novice and advanced developers can use for training neural networks with structured signals. Neural Structured Learning (NSL) can be applied to construct accurate and robust models for vision, language understanding, and prediction in general.
Presented by: Da-Cheng Juan, Sujith Ravi
Hi everyone. I'm Sujith Ravi from Google AI. We work a lot on how to do deep learning and build machine learning systems at scale on the cloud with minimal supervision, working on language understanding, computer vision, and multimodal applications. But we also do things on the edge, meaning: how do you take all these algorithms and fit them into compute- and memory-constrained devices on the edge? I'm here today with my colleague Da-Cheng Juan, and we work on graph learning and other related topics. So let's get started. You all have the honor of being here for the last session of the last day, so kudos and bravo; you're really dedicated. So let us begin. I'm very excited to talk to you about Neural Structured Learning, which is a new framework in TensorFlow that allows you to train neural networks with structured signals. But first, let's go over some basics. If you're in this room, you know deep learning and you care deeply about it, so you know how a typical neural network works. Say you want to take a neural network and train it to recognize images and distinguish between concepts like cats and dogs. What would you do? You feed it images like the one on the left, which looks like a dog, give it the label "dog", and pass it to the network. The way this works is that you're adjusting the network's weights so that it learns to distinguish and discriminate between different concepts and correctly assign an image to a category. This is all great; all of you in this room have probably heard of a neural network, but how many of you have actually built one? Great. I know this is the last session of the last day, but I'm still very happy that you're all with me here. So what is the one core ingredient that we need when we build a network? The majority of the applications that we work on require labeled, annotated data. So it's not one image that you're feeding to this network: you're actually taking a bunch of images paired with labels, cats and dogs in this case, though of course it could be anything depending on the application. You feed thousands to hundreds of thousands or even millions of examples into the network to train a good classifier.
Today we're introducing Neural Structured Learning, a framework supported in TensorFlow 2.0 and Keras that allows you to train better and more robust neural networks by leveraging structure in the data. The core idea behind this framework is that we're going to take a neural network and feed it structured signals in addition to feature inputs. So think of the example I showed you earlier: in addition to the images paired with labels, you're going to feed it connections, or relationships, between the samples themselves. I'll get to what these relationships might mean in a bit. So you have these structured signals and the labels, and you feed both into the network. You might ask: what do you mean by structure? Structure is everywhere. In the example I showed you earlier with images, just take a look at this graph here; the images that are connected to each other in this picture basically represent that there's some visual similarity between them. So it is actually pretty easy to construct structured signals from day-to-day sources of data. In the case of images here, it's visual similarity. But suppose you tag your images and create albums that represent specific concepts: everything within a photo album has some sort of connection, interaction, or relationship, and that represents another type of structure. It's not just for images, either; we can go to completely different applications. Say you want to take publications or news articles and tag them with their topic.
As a simple example, take biomedical literature: all the papers that are published, whether in Nature or at any of the conferences, have references and citations to other papers. That represents another type of structure, links. These are the kinds of structure we're talking about here: relationships that are exhibited, or modeled, between different types of objects, and they occur everywhere. If you're talking about search, everybody has heard of the Knowledge Graph, which is a rich source of information capturing relationships between entities. If I talk about the concepts Paris and France, the relationship is that one is the capital of the other. These sorts of relationships are not typically captured in what you feed to a neural network, but they are available in day-to-day data sources, so why not leverage them? That is what we're trying to do with Neural Structured Learning. Before we talk about what it does and how, what are the key advantages? Why would you even care? One of them, as I mentioned, is that it allows you to take the structure and use it to train networks with less labeled data. Labeling is a costly process: for every application you want to train, if you have to collect rich annotated data at scale for millions of examples, it's going to be tedious. Instead, if you can use a framework that automatically captures the data's structure and relationships, and with minimal supervision train classifiers or production systems with the same accuracy, that would be a huge win; who wouldn't want that? That's one type of benefit. The other one is that typically, when you deploy these systems in real-world applications, you want the systems, or networks, to be robust. That means once you train them, you don't want a change in the input distribution, or slightly changed data, or somebody corrupting the images slightly, to make the network flip its predictions and go bonkers. So this is another benefit: using structured learning, you can improve both the quality of the network and its robustness. Let me dive a little deeper and give you a little more insight into the first scenario:
document classification. Here's an example you probably have in your home right now: you have a catalog, a library of books, in digitized form, and you want to organize them into specific topics or categories. One person might want to categorize them based on genre. A different person might want to group books that belong to the same series. A third person might want to capture different books that have the same kind of content, or phrases or words that appear in them, or are by the same author, or share some other aspect, say a particular plot twist that is captured across different books. Not everybody organizes by the same criteria. So are you going to collect and annotate enough labeled data for each of those tasks to create a network that can distinguish and classify the books into these categories with very high accuracy? Probably not. On the one hand you have plenty of data: raw book content is available, raw news articles are available, raw text is available. But it's hard to construct the labeled, annotated data. This is where Neural Structured Learning, or NSL, comes in. Going back to the previous example, we can model the relationships between these different inputs, or samples, using structure (again, I'll tell you shortly what the structure means), pair it with a few labeled examples, and then train a network that is almost as good as a network trained on millions of examples. For your application, you might use only 5% or 10% of the labeled data and train just as good a classifier or prediction system. That's what we're trying to help you do with Neural Structured Learning. Now, I said "structure"; how do you come up with the structure, and how do you do this for document classification? I'll give you a forward reference here: Da-Cheng is going to talk more about that. There are hands-on tutorials for the exact example I described, document classification with a graph, and you can go to the Neural Structured Learning website and try this out for yourself, all within just a few lines of code. All you need to do is put your data in the standard format, and with a few lines of code you should be able to run the system and train. Switching to the second scenario: it's great to have high-quality models, and we have seen that neural networks have the capacity to train really good models, especially as you increase the number of parameters. But robustness is an important concept, and it really matters when we deploy these in real-world scenarios. For example, if you have an image recognition system and suddenly the network
flips its predictions because the images are corrupted, that is not a good thing. Let's look at an example. Take the image on the left: what do you think it is? It's a panda. Take the image on the right: what do you think it is? Anyone who disagrees that it's a panda? Basically, your own neural network, your eyes, are saying that both of these images are pandas. But that's not what happens if you use a real system: if you take a state-of-the-art network and apply it to the two images, the first one will be correctly recognized as a panda, while the second one will be recognized as a completely different concept, a gibbon in this case. If you zoom in really closely, the second image is actually an adversarial example: it's created by adding some noise to the original image, but the changes are so tiny that they're imperceptible to the human eye. Yet based on these changes in the pixels, the network flips its prediction completely. You don't want this happening in a real-life system, right? You want the network to be robust. So this is where NSL comes in again. We're going to use the same concept: use the structure in the data to train more robust models. It's slightly different from what I mentioned earlier; earlier, relationships were modeled as a graph constructed from explicit structure. Here, we take the original image and we also generate a perturbed image, an adversarial example, for it. These two images are considered joined via a link, so there's a relationship between them. The difference is that this structure is dynamically generated during the learning process, as opposed to the earlier case where somebody gives you a knowledge graph or structured signals, or you construct them from some data source. What we try to do here, using the structure, is: we already know the label for the first image is "panda", and we force the network to learn that the perturbed image should also be classified as a panda. That's, at a very high level, how this works. Again, if you want to try this out for any network or application, there's a tutorial which lets you use the API; NSL enables both the graph type of learning and the adversarial learning, and you can just go to the website and run the code example, all with just a few lines of code. You'll see more details later in the talk. So this is, at a high level, why we would want a framework like NSL, and the power of using it to enable more robust networks, and networks with high accuracy trained with minimal supervision. This is very handy when you want to build applications on the fly, and very custom applications that do not fit the regular mold. Let me dive a
little deeper into how the framework does the training. In the first part, as I said, structure is going to be used as an additional input to the network; we call this graph learning. The core idea is that in addition to the feature inputs you're familiar with, like pixels for image classification, or word, phrase, or sentence features for document classification, you feed in the structure, modeled as a graph. You might ask at this point: where is the graph coming from? In some cases it might be given to you, like the citation graph or the Knowledge Graph I mentioned. In other cases you can actually construct a graph, and we're happy to say that we provide tools (again, you will hear more about this) that allow you to construct these graphs from sources of data like word embeddings or image embeddings. Now, the goal in graph learning is that the network is forced to jointly optimize both the feature input and the structured signal simultaneously. Let's see how that happens by diving in deeper. If you look at what exactly the network is learning: every network is trying to optimize some loss. In image classification, what is the loss? You take the pixels, pass them through the network, and get some prediction; the loss is the error computed between the prediction and the true label. When a network trains in this mode, the graph learning machinery optimizes two components. One is the standard loss, which for image classification is the loss incurred when you pass the pixels through the network, get the predictions, and measure the error against the true labels. The other component is based on the structured signal, or graph, that you provided. Where that comes in is: if I have an image, say the pit bull dog that's labeled as a pit bull, and I have a different image which, through my structured signal, has an edge to that image, then the network is forced to learn that the source image and its neighbor in the graph should have similar representations. That means you're saying: respect the structure that you provide as input, and also optimize the supervised loss. This is very flexible, as you can imagine. You can change the formulation: instead of a supervised loss, if you want to do unsupervised learning or use a very different kind of loss, you can change the first component very easily. Just a quick note (we don't have time for all of it, but we're happy to answer more questions at the end): the losses themselves are all customizable too. You can use L2 loss or cross entropy depending on the application; you can even use ranking losses. This makes it very easy for you to train a
wide range of applications in different learning settings, whether supervised, semi-supervised, unsupervised, ranking, or classification, while at the same time passing in structure in a seamless manner. Here's an example of how NSL does graph learning. Take image classification: you start with some samples, the pixels as I said, and you also have structure. In this case, the images are connected in a graph based on some user interaction signal; for example, as I showed you, they belong to the same album, or there's some sort of structure tying them together. This is given to you. Both the sample and its neighbors are passed simultaneously through the same network, and the network learns to optimize, within each layer (and this is all configurable, by the way), to push the embeddings for neighbors closer to each other. That means two images that are connected in the graph should learn similar embeddings when passed through the network. Simultaneously, it should also optimize for the correct prediction: if one of them was labeled as a panda, then you also want the prediction error to be minimal. Both of these parts are optimized jointly. Okay, so hopefully this gives you an idea of how we use graph learning, enabled with the Neural Structured Learning framework. As I mentioned at the start, structure can come in different forms. That was explicit structure, provided as a graph input, but we can also use implicit structure, and this is where the adversarial learning type of training is enabled in the NSL framework. Here again you jointly optimize features and structure, except that the structure is now induced during the learning process by constructing adversarial examples. If you have an example x_i as input, you create an x_i', an adversarial version of it, and the two are connected with some sort of weight; again, this is configurable. This structure is now passed through the network, and the network is forced to optimize both of them toward the same embeddings, or representations, inside. These are adversarial neighbors. So, as we mentioned, this opens up a whole set of new kinds of applications and training scenarios. The best part is, if you're wondering how this works with Transformers, or ResNets, or different kinds of network architectures: the network structure doesn't matter here. You can use this with any type of network architecture; Transformers, ResNets, convolutions, combinations of CNNs and LSTMs, it doesn't matter. These are structured learning strategies with which you can build a network, enabled with NSL, very easily, with very
few lines of code in TF2. Now, to tell you more about that, I'm turning it over to Da-Cheng Juan. All right, thank you, Sujith. Next we are going to introduce the libraries, tools, and trainers provided by the Neural Structured Learning framework. Everything here is compatible with TensorFlow 2.0, so you can train neural networks with structured signals while enjoying all the great features of TensorFlow 2.0. This is the training workflow we just mentioned previously; the highlighted parts are the new steps introduced into the workflow to train with structured signals, and Neural Structured Learning provides libraries and tools for these steps. Let's first take a look at the left part of the workflow: the training samples and neighbors from the same neighborhood are packed together to form the new input batch, so each training sample is extended to include the neighborhood information. To achieve this, the Neural Structured Learning framework provides standalone tools such as build_graph and pack_nbrs that the user can invoke directly. We also provide functions that users can integrate into their own custom pipelines. You may notice build_graph and pack_nbrs are listed here both as binaries and as functions; this is not a typo, it means they can be invoked as a binary or as a function. Next, let's take a look at the right part of the figure. Again, we provide libraries for the new steps introduced to enable graph regularization. Both the training sample and its neighbors are fed to the neural network, and unpacking the neighbor features is for this purpose.
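As a rough sketch of what this packing step produces, each sample's feature dictionary is extended with its neighbors' features and the corresponding edge weights. This is illustrative plain Python only, not NSL's actual implementation; the feature names are made up, and the real tool for this is pack_nbrs:

```python
# Illustrative sketch of "packing" graph neighbors into each training sample.
# Feature names like "nbr_0_x" are invented here, not NSL's serialization format.

def pack_neighbors(samples, edges, max_nbrs=3):
    """Extend each sample's feature dict with up to max_nbrs neighbor features.

    samples: dict sample_id -> feature dict, e.g. {"x": [...], "label": ...}
    edges:   dict (src, dst) -> edge weight (treated as directed here)
    """
    # Build an adjacency list keyed by source, sorted by descending edge weight.
    adj = {}
    for (src, dst), w in edges.items():
        adj.setdefault(src, []).append((w, dst))
    packed = {}
    for sid, feats in samples.items():
        out = dict(feats)  # copy the sample's own features
        nbrs = sorted(adj.get(sid, []), reverse=True)[:max_nbrs]
        for i, (w, nid) in enumerate(nbrs):
            out[f"nbr_{i}_x"] = samples[nid]["x"]   # neighbor's input features
            out[f"nbr_{i}_weight"] = w              # edge weight, used by the loss term
        packed[sid] = out
    return packed

samples = {
    "a": {"x": [1.0, 0.0], "label": 0},
    "b": {"x": [0.9, 0.1], "label": 0},
    "c": {"x": [0.0, 1.0], "label": 1},
}
edges = {("a", "b"): 0.8, ("a", "c"): 0.3}
packed = pack_neighbors(samples, edges, max_nbrs=1)
# "a" keeps only its strongest neighbor, "b"; "c" has no outgoing edges
```

The wrapped model can then look up each sample's neighbor features by name when computing the regularization term.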
The model in this illustration is a convolutional network, but it can be any type of network, not just limited to CNNs. Then the difference between the sample embedding and its neighbor embedding is calculated and added to the final loss as a regularization term. In addition, we also provide libraries to generate adversarial neighbors as implicit structured signals for regularization. Finally, we provide a Keras API for users to easily build a Keras trainer with graph regularization or adversarial regularization. The Keras API from Neural Structured Learning supports all three styles of model building: via the sequential API, via the functional API, or via subclassing. This is just a subset of the tools and libraries we provide in the Neural Structured Learning framework; please visit our website to learn more about the tools and APIs. The first step, if you want to use Neural Structured Learning, is a pip install. Here we have a code example demonstrating the API from the Neural Structured Learning library. We first read the training data; note that the data here has been preprocessed by the tools or functions to incorporate the graph into the training samples. Next, the user builds a custom model, which we call the base model here; the base model can be built in any of your favorite styles, as we just mentioned: sequential, functional, or subclassing. After the base model is built, we use the API to wrap the base model to enable graph regularization. There are several knobs to configure it; for example, we need to specify the maximum number of neighbors considered during regularization. For each hyperparameter we provide a default value, set to a number that we know empirically works well. After we enable graph regularization in the Keras model, the rest is just the standard Keras workflow: compile, fit, and then evaluate. That's it: within five lines we are able to enable graph regularization, and those five lines actually include one line that is a comment, not actual logic.
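To make the two components of the objective concrete, here is a minimal plain-Python sketch of a graph-regularized loss. This is illustrative only: NSL computes this inside the wrapped Keras model, and the squared-L2 distance and the multiplier value used below are assumptions, not NSL's defaults:

```python
# Sketch of the graph-regularized objective:
#   total = supervised_loss + multiplier * sum_over_neighbors(weight * dist(emb, nbr_emb))

def squared_l2(u, v):
    # Squared Euclidean distance between two embedding vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def graph_regularized_loss(supervised_loss, sample_emb, neighbors, multiplier=0.1):
    """neighbors: list of (edge_weight, neighbor_embedding) pairs."""
    reg = sum(w * squared_l2(sample_emb, nbr) for w, nbr in neighbors)
    return supervised_loss + multiplier * reg

# Example: one sample embedding with two graph neighbors.
loss = graph_regularized_loss(
    supervised_loss=0.5,
    sample_emb=[1.0, 0.0],
    neighbors=[(1.0, [1.0, 0.0]),   # identical neighbor: contributes 0
               (0.5, [0.0, 0.0])],  # differing neighbor: 0.5 * 1.0
    multiplier=0.1,
)
# loss = 0.5 + 0.1 * (1.0 * 0.0 + 0.5 * 1.0) = 0.55
```

Minimizing the second term is what pushes connected samples toward similar embeddings while the first term keeps the predictions accurate.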
Here, let us show some results of a model trained with structured signals. The task is sentiment analysis on IMDb movie reviews. We want to point out that this result is just from one of our internal experiments; your actual mileage may vary from task to task, from dataset to dataset, or from model to model. The x-axis represents the amount of supervision, which can be read as the number of labeled examples, and the y-axis represents the model accuracy. The left figure shows the performance of a bidirectional LSTM and the right figure shows the performance of a feed-forward neural network. As you can see, when we have lots of training examples, when the amount of supervision is high, there is actually not much difference in performance. But as soon as the amount of supervision drops to 5% or even 1%, training with structured signals leads to more accurate models, usually with an improvement of more than 10%. If you are interested in more results, please refer to our paper. So far, training with structured signals sounds really great, but sometimes we do not have a structure; we do not have a graph to begin with. So what should we do? Neural Structured Learning provides two methods. The first is to construct the structure via data preprocessing, and the second is to construct the structure via adversarial neighbors. Let's focus on the data preprocessing one first. Again, take document classification as an example: given a sample document, how do we know whether another document is similar enough to be its neighbor?
These documents are projected into an embedding space. For example, we could use the pre-trained embeddings mentioned in earlier talks to project all these documents into the embedding space; documents that are closer in the embedding space are assumed to have similar semantics. Next we examine the similarity between them; cosine similarity or another metric can be used here. If the similarity is higher than a predefined threshold, we treat these two documents as similar enough, and therefore we add an edge between them to make them neighbors. By repeating this process, we can construct a structure, a graph, over all the data. After we have the graph, the rest of the training flow is exactly the same as the one we described for the case where the graph is given. Let's again take a look at the actual code example. We first load the training and test samples from the IMDb dataset.
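Before walking through the library calls, the edge-construction rule just described can be sketched in plain Python. This is a simplified stand-in for NSL's build_graph tool, with tiny made-up vectors and an illustrative threshold; the real tool works on serialized examples and uses a similar similarity cutoff:

```python
import math

# Sketch: build graph edges by thresholding cosine similarity of embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_edges(embeddings, threshold=0.8):
    """embeddings: dict doc_id -> vector. Returns {(id1, id2): similarity}."""
    ids = sorted(embeddings)
    edges = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:  # all pairs; real tools use faster neighbor search
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                edges[(a, b)] = sim
    return edges

docs = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],   # nearly parallel to d1 -> edge
    "d3": [0.0, 1.0],   # orthogonal to d1 -> no edge
}
edges = build_edges(docs, threshold=0.8)
# ("d1", "d2") is the only edge
```

The resulting edge list, together with the labeled and unlabeled samples, is exactly the kind of input the packing step consumes.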
Next we load the pre-trained embedding model from TF Hub. The embedding model used here is a Swivel model, but feel free to replace it with your favorite pre-trained embedding model, such as BERT. Next we project the text of each review from IMDb into the embedding space so we can calculate the similarity between reviews; remember, when two reviews are closer in the embedding space, we assume they share similar semantics. After we project the text into embeddings, we use the build_graph function provided by Neural Structured Learning to construct a graph. When invoking this function, we also need to provide a similarity threshold, 0.8 in this case. After we have the graph, we call the pack_nbrs function to incorporate the neighbor samples into each training sample; here, for each sample, three neighbors are considered. After we have augmented the training data with graph signals, everything is just like the first code example we showed: we build a base model via the functional API or subclassing, and use the Neural Structured Learning API to wrap the base model to enable graph regularization. Again, the rest of the workflow is just the standard Keras workflow: compile, fit, and evaluate. We also provide a hands-on, step-by-step tutorial on our website, so feel free to visit the website and try it yourself. The second method to construct a structure, or a graph signal, is to build the graph dynamically by adding adversarial neighbors. For each training sample, we find a malicious perturbation based on the gradient direction: in other words, the perturbation is designed to confuse the model the most, which means to maximize the loss. Then this malicious perturbation is added to the original training sample to create an adversarial neighbor; again, the design of the adversarial neighbor targets confusing the model the most, that is, maximizing the loss. Then we add an edge between this adversarial neighbor and the original training example, and therefore we have constructed a graph, a structure. This is the code example using the adversarial Keras model from Neural Structured Learning; again, it should feel familiar. Apart from the three new lines, everything else follows the same workflow introduced before. Neural Structured Learning has been widely used in many products and services at Google, for example for learning image semantic embeddings. Here we provide examples for each semantic granularity to illustrate the difference from coarse to ultra-fine granularity. The object on the right
is the Golden Gate Bridge. The Golden Gate Bridge is a steel red bridge, but not every steel red bridge is the Golden Gate Bridge; that is the difference between the coarse and the ultra-fine granularity. Learning such embeddings is a challenging task, probably due to the large variations among images that belong to the same class or category. Learning image embeddings that capture fine-grained semantics, however, is at the core of many image-related applications, such as image search from an example query image. This is the overall architecture used to learn the image embeddings, and again it should feel familiar: exactly the same workflow we have introduced again and again in this talk. Since this talk focuses on learning with structure, we will not go into detail on the other techniques, such as the sampling strategy, used to train the model; if you're interested, please refer to our paper. Let's zoom in on the structure part. The graph used here is a co-occurrence graph. Specifically, the co-occurrence is trying to answer the following question: given that one image is selected, what are the other images that are sufficiently similar that they would also be selected? Say the query is the white English bulldog: if two images co-occur, we form an edge between these two images, making them neighbors. Here are some experimental results. For each query image, we show the top three nearest neighbors based on the learned image embedding. Images marked in green are rated as strongly similar to the query image by human raters, whereas images marked in red are not so similar. For example, in the last figure, when the query is a white squirrel, squirrels can be correctly retrieved by using the embedding learned with the Neural Structured Learning framework. In other words, learning with structure is able to capture image semantics much closer to actual human perception. So, to recap: training with structure is very useful, as less labeled data is required to effectively train a model. Also, learning with structured signals leads to more robust models, and it works for all types of neural networks, whether feed-forward, convolutional, recurrent, or any custom architecture. This is probably the most informative slide of this talk: you can learn about the libraries and the hands-on tutorials in detail on our website. Also, please star us on GitHub; we do take GitHub issues, and we would love to hear from you. We are looking forward to developing this framework with all of you to make it more comprehensive. We will be waiting for your pull requests on GitHub. Thank you.
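The adversarial-neighbor construction described in the talk, perturbing each input along the gradient direction that increases the loss, can be sketched on a toy model. Here the gradient is derived by hand for a one-layer linear model rather than obtained via backprop, and the step size epsilon is an assumed hyperparameter, so this is only a conceptual sketch:

```python
# FGSM-style sketch: x_adv = x + epsilon * sign(dLoss/dx).
# Toy model: loss(x) = (w . x - y)^2, so dLoss/dx = 2 * (w . x - y) * w.

def sign(v):
    return [(z > 0) - (z < 0) for z in v]

def adversarial_neighbor(x, w, y, epsilon=0.1):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y   # model error on x
    grad = [2 * err * wi for wi in w]                # gradient of the loss w.r.t. x
    # Step along the sign of the gradient: the direction that increases the loss.
    return [xi + epsilon * s for xi, s in zip(x, sign(grad))]

x = [1.0, 2.0]
w = [1.0, -1.0]
y = 0.0   # the model errs on x: w . x = -1.0 while the target is 0.0
x_adv = adversarial_neighbor(x, w, y, epsilon=0.1)
# err = -1, grad = [-2, 2], sign = [-1, 1], so x_adv = [0.9, 2.1]
```

During training, x_adv is linked to x as a neighbor carrying x's label, which forces the model to classify both the clean and the perturbed input correctly.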
Do we still have time for a question? Okay, I think we still have some time for questions. You may have to use the mic.

Q: The build_graph function that you mentioned, does it do a comparison across all the items? So far I see it as all-pairs.

A: I see. You don't really have to do the all-pairs comparison; there are much faster techniques, and we're in the process of releasing those tools. That should make it much faster, even on single machines: if you have a million examples, you don't need to do the million-by-million comparison.

Q: Music to my ears.

A: Great, you're probably going to be the first user.

Q: Very nice. Just so that I'm sure I understand what you're doing: it looks like you're taking images, and then GAN-generated images from a generator, associating them, and thereby negating the ability of GANs to deceive a neural network. Is that correct?

A: It doesn't have to be a GAN structure. It's whatever network you're trying to learn with: using the gradients that are backpropagated, and reversing the gradients, you construct a noisy example, if you will. The reason to do this during training is that the next time the network sees this noisy example, it will still learn to identify it correctly, rather than flipping the prediction; it makes the intermediate layers and also the predictions robust.

Q: The loss function that you're trying to minimize: the second component is the loss over neighbors, like source images and their neighbors, right? And in the adversarial case, the neighbor image is constructed, and the weight is whatever weight you assign to the adversarial example. Really neat idea. Thanks.

Q: The figures for the graph results looked odd to me, slides 33 and 34, when you compare them with the standard model: how many samples did you use at each point, and I suppose those are error bars in there?

A: Basically, those are the error bars from the test data.

Q: Not from the number of training trials? How many times did you run with different seeds?

A: I do believe we had five trials. It also depends on the network you're training; we'd expect the gap to increase even if the network is really powerful. There are some points where you see results lower than the 5% mark; that's just based on this dataset.

Q: Hi, how are you doing? I was curious how you might apply this to, say, segmentation or video classification.

A: I think there are a couple of different ways. In video classification, you can look at videos that are related, for example from the same channel, and with that kind of matching you can create links between these different videos. Assuming that you have a way of processing and compressing all the frames into some representation, you can then apply this regularization to tell the network that these related videos should all be optimized toward the same prediction.
Q: If we're talking about perturbations like distortion, blurring, or stretching, compared to the adversarially generated samples: is there any benefit to using one versus the other? Why adversarially generate the samples versus using some classical perturbation?

A: The adversarial approach depends on the data and the network: you're using the gradients that are backpropagated to create a confusing example, a hard example, if you will. The classical transformations everybody knows: this is a rotation, so you apply rotations, or you blur. But what if somebody attacks you with something that is not a rotated image, but some different kind of transformation function? Here, you're trying to cover this across the board, using the learning process and the gradients that are being learned for the network at that layer.

Q: In comparison, when the network uses the feedback from the graph: the adversarial neighbor gives you a completely low score on the example, but the network itself is pretty sure that the class is there. How does this resolve when you have a pretty convinced model?

A: I think the question is referring to the label given during the process of constructing the example on the fly. We just treat the label as given: say the given label is "panda", but by this point the network is already pretty much convinced the perturbed example is something else, while the graph says the adversarial neighbor is still the closest. A quick add-on to this: in the experiments, we apply the correct label to the adversarial example. So the adversarial example will have the panda's label, and the model will be more robust. Thank you very much.