Martin is passionate about science, technology, coding, algorithms and everything in between. He graduated from Mines ParisTech, enjoyed his first engineering years in the computer architecture group of STMicroelectronics and then spent the next 11 years shaping the nascent eBook market, starting with the Mobipocket startup, which later became the software part of the Amazon Kindle and its mobile variants.
About the talk
On the forefront of deep learning research is a technique called reinforcement learning, which bridges the gap between academic deep learning problems and ways in which learning occurs in nature in weakly supervised environments. This technique is heavily used when researching areas like learning how to walk, chase prey, navigate complex environments, and even play Go. This session will teach a neural network to play the video game Pong from just the pixels on the screen. No rules, no strategy coaching, and no PhD required.
00:42 Dense neural network
11:30 Policy gradients
17:10 Training dataset
21:45 Playing a game
26:47 Training loop
31:03 Things from the lab
36:30 Neural architecture search
37:41 Move 74
Hello, welcome to this TensorFlow session about reinforcement learning. So this is Yu-Han, I'm Martin, and today we would like to build a neural network with you. Do you know about neural networks? Serious question: if I say softmax or cross-entropy, raise your hand if you know what that is. OK, so just a very quick primer. This is a neural network: layers of neurons. Those neurons always do the same thing. They do a weighted sum of all of their inputs.
And then you can stack them in layers. The neurons in the first layer do a weighted sum of, let's say, pixels, if we are analyzing images. The neurons in the second layer will be doing weighted sums of outputs from the first layer. And if you're building a classifier, let's say here you want to classify those little square images into airplanes and not airplanes, you will end up on a last layer which has as many neurons as you have classes. And if those weights are configured correctly, we'll get to that,
then one of those neurons will have a strong output on certain images and tell you this is an airplane, or the other one will tell you this is not an airplane, OK? So a little bit more detail about what a neuron does, for those who like it: you have the full formula here. You see it's a weighted sum plus something called a bias. That's just an additional degree of freedom. And then you feed this through an activation function, and in neural networks that is always a nonlinear function. And to simplify, for us here, there are only two activation
functions that count. For all the intermediate layers, ReLU: really, that's the simplest function you can imagine. You have it on the graphic. OK? It's nonlinear, we love it, everyone uses that. Let's not go any further. On the last layer, though, if we are building a classifier, typically what you use is a softmax activation function, and that is an exponential followed by a normalization. So here I have a different classifier that classifies in 10 classes, and you have your weighted sums coming out of
your 10 final neurons. And what you do in softmax is that you elevate all of them to the exponential, compute the norm of that vector of 10 elements, and divide everything by that norm. The effect of that, since the exponential is a very steeply increasing function, is that it will pull the winner apart. I made a little animation to show you. This is after softmax, this is before softmax: after softmax you see much more clearly which of those neurons is indicating the winning class. That's why it's called softmax.
OK. So just those two activation functions; with that we can build the stuff we want to build. So now, coming out of our neural network, we have, let's say here, 10 final neurons producing values which have been normalized between zero and one. We can say those values are probabilities, OK? The probability of this image being in this class. How are we going to determine the weights in those weighted sums? Initially it's all just random, so initially our network doesn't actually do anything useful.
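The ingredients described so far (a weighted sum plus a bias, ReLU on intermediate layers, softmax on the last layer) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code shown in the talk: the function names `relu`, `neuron` and `softmax` and the example numbers are assumptions made for this example, and the normalization is taken to be the sum of the exponentials.

```python
import math

def relu(x):
    """ReLU: pass positive values through, clamp negative values to zero."""
    return max(0.0, x)

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs, plus a bias, fed through ReLU."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(weighted_sum)

def softmax(scores):
    """Exponentiate every score, then divide by the sum of the exponentials."""
    # Subtracting the maximum first is a common numerical-stability trick;
    # it does not change the resulting probabilities.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values, chosen only to illustrate the two activation functions.
hidden = neuron(inputs=[1.0, 2.0], weights=[0.5, -1.0], bias=0.25)
probs = softmax([1.0, 2.0, 5.0])
print(hidden, probs)
```

The raw scores 1, 2 and 5 are fairly close to each other, but after softmax the largest one takes most of the probability mass, which is the "pulling the winner apart" effect mentioned above.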
We do this through supervised learning. You provide images which you have labeled beforehand, so you know what they are, and on those images your network is going to output this set of probabilities. But you know what the correct answer is, because you are doing supervised learning. So you encode your correct answer in a format that looks like what the network is producing. It's the simplest encoding you can think of. It's called one-hot encoding, and basically it's a bunch of zeros with just one 1 at the index of the class you want. So here, to represent the 6, I have a vector of
zeros with a one in the sixth position. And now I can compute the distance between them. And the people who studied this in classifiers tell us: don't use just any distance, use the cross-entropy distance. Why? I don't know, they are smarter than me, I just follow. And the cross-entropy distance is computed like this: you multiply, element by element, the elements of the vector of the known answer from the top by the logarithms of the probabilities you got from your neural network, and sum that up across the vector.
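As a small worked example of one-hot encoding and the cross-entropy distance, here is a plain-Python sketch. The helper names `one_hot` and `cross_entropy` are hypothetical, the three-class probabilities are made up, and the sketch assumes the usual convention of taking the logarithm of the predicted probabilities and negating the sum so that the distance is positive.

```python
import math

def one_hot(index, num_classes):
    """A vector of zeros with a single 1.0 at the index of the correct class."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(label, probs, eps=1e-12):
    """Multiply the label element-wise by log(probabilities), sum, and negate."""
    # eps only guards against log(0); it is an implementation detail of this sketch.
    return -sum(y * math.log(p + eps) for y, p in zip(label, probs))

label = one_hot(1, 3)                                  # correct class is index 1
good = cross_entropy(label, [0.1, 0.8, 0.1])           # confident and correct
bad = cross_entropy(label, [0.7, 0.2, 0.1])            # confident and wrong
print(label, good, bad)
```

Because the label is zero everywhere except at the correct class, only the predicted probability of that class contributes, and the distance grows as that probability shrinks.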
This is the distance between what the network has predicted and the correct answer. That's what you want if you want to train a neural network. It's called an error function or loss function. From there on, TensorFlow can take over and do the training for you; we just need an error function. So in a nutshell, those are the ingredients in our pot that I want you to be aware of. You have neurons. They do weighted sums. You have only two activation functions that you can use: either the ReLU activation function on intermediate layers, or, if you
build a classifier, softmax on the last layer. And the error function that we are going to use is the cross-entropy error function that I had on the previous slide. OK. There's a high-level API in TensorFlow called layers, where you can instantiate an entire layer at once. You see the first layer here has 200 neurons and is activated by the ReLU activation function. It's this layer here. And this also instantiates the weights and biases for this layer in the background. You don't see that; it's in the background. I have a second layer here, which is
this one, just 20 neurons, again with the ReLU activation function. And then this little last layer: even if you don't see it on the screen, it is activated by softmax. You don't see it because I use its output in the cross-entropy function, which has softmax built in. So it is a softmax layer; it's just that you don't see the word softmax in the code. This here is my error function, which is the distance between the correct answer, one-hot encoded, and the output from my neural network. Once I have that, I can give it to TensorFlow, pick an
optimizer, ask it to optimize this loss, and the magic will happen. So what is this magic? TensorFlow will take this error function and differentiate it relative to all the weights and all the biases, all the trainable variables in the system. That gives it something that is mathematically called a gradient, and by following the gradient it can figure out how to adjust the weights and biases in the neural network in a way that makes this error smaller, that makes the difference between what the network predicts and what we know to be true smaller. That's supervised
learning. So that was the primer. Now, what do we want to build today? Today we would like, with you, to build a neural network that plays the game of Pong, just from the pixels of the game. It's notoriously difficult to explain to a computer the rules of the game and the strategies and all that. So we want to do away with all that, just give it the pixels, and find some learning algorithm that will learn to play this game. And of course that is not a goal in itself, because
it's super easy. You just always stay in front of the ball, you know, and you win all the time. And actually that is what we will train against, such a computer-controlled agent. The goal is to explore learning algorithms, because this has applications, hopefully, way beyond Pong. In Pong there are three possible outcomes: this is a position in which you want to go up, stay still, or go down. OK, let's try to do this by the book, as a classifier. So we have a single intermediate layer of neurons activated with
ReLU, and then a last layer of three neurons activated by softmax. We use the cross-entropy loss. So you have the function here. It's the distance between the probabilities that this policy network (it's called a policy network) predicts, the probability of going up, staying still, or going down, and the correct move. But what is the correct move? I don't know. Do you know, Yu-Han? Yeah, you're right. Here we just don't know what the correct move is. However, the environment requires that we keep making moves, going up, staying still, or moving down, for the game to progress.
So what we're going to do is sample the move, which means picking one of the three possible moves, moving up, staying still, or going down, randomly, but based on the output of the network: the output probabilities decide how likely each move is to be our next move. All right. So now we know how to play the game. We roll a loaded die, pick from that, and we know what next move to play. But initially this network is initialized with random weights, so it will be playing random moves. How does that inform us about the
correct move to play? I need to put the correct move in my formula. That's right. Here we want the correct moves to mean the moves that will lead to winning. Only when someone, our paddle or the opponent's, has scored a point do we know whether we played well or not. So whenever somebody scores, we give ourselves a reward: a plus-one reward point if we scored, and a minus-one reward point if the opponent did. Over here you see something that looks very much like the cross-entropy function we saw before. In the middle here is the main difference: instead of the correct label of a supervised learning problem, we're just going to put the sampled move in
there, the move that we happened to play. Well, some of those moves were good and some were bad, and that's why every per-move loss value is multiplied by the reward out front. This way, moves that eventually lead to a winning point will get encouraged and moves that lead to a losing point will be discouraged over time. OK, so you do this for every move. I can see how it could lead to some learning. But putting my mathematician's hat on, I see a big problem here. You have this sampled move. That is a picking operation,
a sampling operation: you pick one out of three. That is not differentiable, and to apply gradient descent and all that, the loss function must be differentiable. You're right. The sampled move here does depend on the model's weights and biases, but the sampling operation is not differentiable, so we are not going to differentiate it. We play many, many moves to get a lot of data, and only differentiate the probabilities that are output by the model, in blue on the screen. Those probabilities directly depend on the model's weights and biases,
and we can differentiate those with respect to the weights and biases. This way, we still get a valid gradient and can apply gradient descent techniques. Oh, so you kind of cheat: the part that is problematic, you just regard as constant. You're going to play many games with the same neural network, accumulate those played moves, accumulate those rewards whenever you know a move scored a point, and then you plug that in and you only differentiate relative to the predicted probabilities. And yes, you're right, that still gives us a valid gradient, so we should be
able to do that. OK, I get it. This is clever. So this will actually train, probably very slowly. We want to show you the minimum amount of stuff you need to do to get it to train, but in that minimum amount there are still two little improvements that you always want to do. The first one is to discount the rewards. Probably, if you lost the point, you did something wrong in the three, five, seven moves right before you lost that point. But probably before that, you bounced the ball back correctly a
couple of times, and that was correct. You don't want to discourage that. So it's customary to discount the rewards backwards through time with some exponential discount factor, so that the moves you played closest to the scoring point are the ones that count the most. Here you see we discounted them with a factor of 1/2, for instance. And then there is this normalization step. What is that? There are multiple ways to think about
this. The way I like to think about it is in the context of playing the game. At the beginning of training, the model only has randomized weights and biases, so it is going to make random moves. Most of the time it's not going to make the right move. Only once in a while is it going to score a point, by accident, and most of the time it is going to lose points. Performing this normalization step gives a very nice boost to the very rare winning moves, so that they do not get lost in a ton of losing moves. OK, so let's play the game and accumulate enough data to compute everything that we
need for this function. OK. So we are going to collect the pixels during gameplay. That's our training dataset. The collected data looks like this: you have one column with the move you actually played, one column where you would like to see the probabilities predicted at that point, but you are actually going to store just the game board and run the network to get the probabilities, and a last column with the rewards. You see that there is a plus-one or minus-one reward on every move that scored a point, and on all the other moves you discount
that reward backwards in time with some exponential discount. So that's what we want to do. And once you have this, you may notice in the formula that you just multiply those three columns together and sum all of that. That is how the loss is computed. Let's build it. You implemented this demo, so can you walk us through the code? Up here you see a few placeholders. Think of them as function arguments that are required to compute output values from our model. We have three placeholders for the input: one for the
observation. Remember, this really means the difference between two consecutive frames in gameplay, because you don't see the direction of the ball from just one frame. You train from the delta between two frames, because there you see the direction of the ball. That's the only wrinkle here that is specific to Pong; everything else is vanilla reinforcement learning that applies to many other problems. The actions placeholder holds the sampled moves, and the rewards placeholder will collect the discounted rewards. With those, we are ready to build the model. It is like the one before:
a single dense layer with the ReLU activation function, 200 neurons, followed by a softmax layer. You don't see us calling the softmax function here, and that is because the next step, the sampling operation, already takes in the logits, which are the values you get right before you call softmax. It performs multinomial sampling, which just means it outputs a random number, zero, one or two, for the three classes, based on the probabilities specified by the logits. OK, a little parenthesis for those not familiar with TensorFlow: TensorFlow builds a node graph of operations in memory. So that's why
out of tf.multinomial we get an operation, and then we will have an additional step to run it and actually get those predictions out. And placeholders are the data you need to feed in when you actually run a node to be able to get a numerical result out. OK. So we have everything we need to play the game. Still nothing to train, so let's do the training part. For training we need a loss function: our beloved cross-entropy loss function, computing the distance between our actions, so the moves we actually played,
and the logits, which are from the previous screen. That's what the network predicts from the pixels. And then we modify it for reinforcement learning by multiplying this per-move loss by the per-move rewards. Now, with those rewards, moves leading to a scoring point will be encouraged and moves leading to a losing point will be discouraged. So now we have our error function and TensorFlow can take over. We pick one of the optimizers in the library and simply ask this optimizer to minimize our loss function, which gives us a training operation. And we will
on the next slide run this training operation, feeding in all the data we collected during gameplay. That is where the gradient will be computed, and that is the operation that, when run, will modify the weights and biases in our policy network. So let's play this game. This is what you need to do to play one game in 21 points. The technical wrinkle is that in TensorFlow, if you want to actually execute one of those operations, you need a session. And then we play a game. So first we get the pixels from the game state,
compute the delta between two frames. OK, that's technical. And then we run this sample operation. Remember, the sample operation is what we got from our tf.multinomial function; that's what decides the next move to play. Then we use a Pong simulator here from OpenAI Gym. We can give it this move to play, and it gives us back the reward, if we got one, and information on whether this game in 21 points is finished. OK, that's what we need. So we know how to play
one game, and we will go on and play many of those games to collect a large backlog of moves, right? That's right. Now that we have collected a lot of data by playing games, we can start processing the rewards. As we planned, we discount the rewards so that moves that did not get any reward during gameplay now get a discounted reward, based on whether or not they eventually led to winning or losing points and how far they are from the moves that actually won or lost the point. Then we normalize the rewards, as explained
before. Now we're ready. We have observations, that's the differences between frames, we have the actions we played, and we have rewards, so we know whether those actions were good or bad. And then we are ready to call the training op, the one we saw a couple of slides before. This is going to do the heavy lifting, compute the gradients for us, and modify the weights just slightly. So we play some games, modify the weights slightly, and repeat the process. And the expectation is that the model
learns to play a little better every time. I'm a bit skeptical. You really think this is going to work? Well, let's go. Let's run this game. This is a live demo. I'm not completely sure that we are going to win, but we shall see. So brown, on this side, is the computer-controlled paddle. Very simple algorithm: it just stays in front of the ball at all times. So there's only one way to win. Its vertical velocity is limited, so you have to hit the ball on the very side of the paddle
to send it at a very steep angle, and then you can overcome the vertical velocity of the opponent. That's the only way to score. On the right, in green, we have our neural-network-controlled agent. Let's see if it wins. And if you want, I want this side of the room to cheer for brown and this side of the room to cheer for AI, OK? It's even right now. One. AI is winning, AI is winning. I'm happy, because this is a live demo; there's no guarantee that AI will actually win. One thing that is interesting here,
actually, is that this is learning from just the pixels. So initially the AI had no idea of even what game it was playing, what the rules were, and even no idea which paddle it was playing. And we didn't have to explain that. We just give it the pixels, and on scoring points we give a positive or a negative reward, and that's it. From that it learns. And you see those emerging strategies: like I said, hitting the ball on the side and sending it at a very steep angle is the only way of winning, and it picked that up. We never
explained it. It's just an emergent strategy. This is looking good. I think it will win. OK, next point. I want a loud cheer when it wins, because I hope this is going to work. This is fantastic. All right. So what was going on during gameplay, actually? Remember how this network was built, right here. The neurons in this very first layer have a connection to every pixel of the board. They do a weighted sum of all the pixels on the board. All right, so they have a weight for every
pixel, and it's fairly easy to represent those weights on the board and see what those neurons are seeing on the board. So let's try to do that right here. We pick some of those 200 neurons, and here we visualize, superimposed on the board, the weights that have been trained. The untrained eye doesn't see much in here, so maybe you can enlighten us. What we see here resembles the game pixels that we gave it, and the model seems to put
a lot of weight on the ball's trajectories across the game board and on the paddle on the right. It seems to have learned that those are important pieces of information to be able to play the game well, and when you think about it, we too see them as important information to be able to play the game well. So we solved Pong, although it looked a lot like supervised training. Well, sometimes when we teach people it looks like supervised training, probably in class: the teacher says this is the Eiffel Tower, and the people say that's the Eiffel Tower,
and so on. But if you think about a kitten jumping on a furball, missing it, and jumping again until it catches it, there's no teacher there. It has to figure out a sequence of moves, and it gets a reward from catching the ball or not catching it. So it looks like in nature there are multiple ways of training our own neural networks, and one of them is probably quite close to this reinforcement learning way of teaching. Is there some thought this inspires in you? Yeah, I have a technical
insight, maybe, for you as a takeaway message. In the Pong case, sampling moves from the network's probability output, playing the game, and getting rewards back are steps that we cannot differentiate with respect to the model's weights. So even with powerful tools like TensorFlow, you wouldn't be able to naively write a loss function and run gradient descent training. Reinforcement learning is a way of getting around some non-differentiable step that you find in your problem. That's great. So where is this going? We want to show you a couple of things from the lab, because so far this has had mostly lab applications. And then one
last thing. What is this? This is very interesting, everyone. What we're witnessing here is a human expert at pancake flipping trying to teach a robotic arm to do the same thing. There is a model in the back whose output controls the joint movements, for example what angle each joint must move to. And the goal is to flip a pancake. The pancake can land on the floor, on the table, or back in the frying pan; landing back in the pan gets a positive reward, and otherwise a negative reward. In one experiment, the experimenters set the reward to be
a small amount of positive reward for every moment the pancake is not on the floor, and what the robot arm learned was to fling the pancake as high as possible, to maximize pancake airborne time. It's a nice illustration of the fact that you can change the learned behavior by changing your loss function. Exactly. So we know how to play Pong and flip pancakes. That's significant progress. DeepMind also published this video. They used reinforcement learning: the neural network is predicting the power to send to the simulated muscles and joints of these models, and the reward is basically a positive reward whenever you manage
to move forward, and a negative reward when you either move backward, fall into a hole, or just crumple to the ground. The rest is just reinforcement learning as we have shown you today. All of these behaviors are emergent behaviors. Nobody taught those models. And look, you have some wonderful emergent behaviors. It's coming in a couple of seconds. Look at this jump. Those are nice jumps, with the model swinging its arms to get momentum, then lifting one leg, cushioning the landing. Right here, look at this. This is a fantastic athletic jump. It looks like something from the Olympics, and it's a
completely emergent behavior. Here it has a weird way of walking around: this loss function didn't have any factor discouraging, you know, useless movements. Again, by modifying the loss function you get different behaviors. And one last one, not this one. This one is kind of funny. It's still flailing around, but it figured out how to run sideways. There are two ways of running, and it did figure out how to move sideways. This one you have probably seen: this is move 74
in game 4 of AlphaGo versus Lee Sedol, and that's the one where Lee Sedol played what was called the God move. He is world famous for just that: he played one correct move and managed to win one game against AlphaGo. He lost four to one, which is fantastic. And AlphaGo also uses reinforcement learning, though not exactly in the same way; it wasn't entirely built out of reinforcement learning. OK. For turn-based games, the algorithm for winning is actually quite easy: you just play all the moves to the end and then pick the ones that
lead to a positive outcome. The only problem is that you can't compute that; there are too many of them. So you use what is called a value function: you unroll only a couple of moves, and then you use something that looks at the board and tells you this is good for white or good for black. That's what they built using reinforcement learning. And I find it interesting because it's kind of the way we humans solve this problem. This is a very visual game, and we have a very powerful visual cortex. When we look at the board, Go is a game of influence, and we see that
in this region black has a strong influence and in this region white has a strong presence, so we can kind of process that. And what they built is a value function that does kind of the same thing and allows them to unroll the moves to a much shallower depth, because after just a couple of moves their value function, built using reinforcement learning, tells them this is good for white or good for black. So these are results from the lab. Let's try to do something real. What if we build a neural
network (for what you have to know, it still has weights and biases in the middle), OK? So let's say we build one that outputs sequences of characters, and we structure it so that those characters actually represent a sequence of layers. So you could say: this is my first layer, this is how big it is, and blah, blah, blah. What if we then train the neural network it describes on some problem we care about, let's say spotting airplanes in pictures? It will train to a given accuracy. What if we now take this accuracy and make it a reward in a reinforcement learning algorithm?
So we apply reinforcement learning, which allows us to modify the weights and biases in our original neural network so that it produces a better architecture. It's actually changing the shape of the generated network so that it works better for the problem we care about. We get a neural network that is generating a neural network for our specific problem. It's called neural architecture search, and we actually published a paper on this. And I find this a very nice application of a technology designed initially to beat Pong.
So we have neural networks that build other neural networks. Yu-Han, you built this demo, so can you tell us a word about the tools that you used? Yeah. I used TensorBoard for tracking the training, and I used Cloud Machine Learning Engine for the training itself. The model that was playing the game that you saw before took maybe about one day of training. It is convenient that you have this job-based view, and you can launch 20 jobs with different parameters and just let them
run. That's very practical, right? Yeah, I use ML Engine a lot as well. There are other tools in Cloud for doing machine learning, and one we just launched is AutoML Vision. That's one where you do not program: you just put in labeled data and it figures out the model for you. And now you know how it works. This also uses a lot of CPU and GPU cycles, so Cloud TPUs are useful when you are doing neural architecture search, and they are available to you as well.
Thank you. That's all I wanted to show you today. Please give us feedback. We just released this code to GitHub, so you have the GitHub URL here if you want to train a Pong agent yourself. Go and do it; you can take a picture there. And if you want to learn machine learning, I'm not going to say it's easy, but I'm not going to say it's impossible either. We have this relatively short series of videos, code samples and codelabs called "TensorFlow without a PhD"
that is designed to give you the keys to the machine learning Kingdom. So you go through those videos. This talk is one of them and it gives you all the vocabulary and all the concepts and we are trying to explain the concepts in a language that developers understand because we are developers. Thank you very much.
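For readers who want to experiment before looking at the released code, the reinforcement learning steps described in the talk (rolling a loaded die to sample a move, discounting rewards backwards through time, normalizing them, and weighting each per-move cross-entropy by its reward) can be sketched in plain Python. This is a hedged sketch, not the speakers' TensorFlow implementation: every function name here is hypothetical, and resetting the running reward whenever a point is scored is an assumption about how the Pong rewards are discounted.

```python
import math
import random

def sample_move(probs, rng=random):
    """Roll a loaded die: pick move 0, 1 or 2 according to the predicted probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def discount_rewards(rewards, gamma):
    """Spread each +1/-1 reward backwards in time with an exponential discount."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # assumption: a scored point starts a new rally
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero
    return [(v - mean) / std for v in values]

def policy_loss(probs_per_move, moves, rewards):
    """Sum over moves of reward * cross-entropy(sampled move, predicted probabilities).

    The sampled moves and the rewards are treated as constants; only the
    probabilities would be differentiated with respect to the model weights.
    """
    return sum(-r * math.log(p[m]) for p, m, r in zip(probs_per_move, moves, rewards))

# Two short rallies: we win the first point and lose the second.
raw = [0, 0, 1, 0, 0, -1]
discounted = discount_rewards(raw, gamma=0.5)  # a factor of 1/2, as on the slide
print(discounted)
print(normalize(discounted))
```

In a real training loop the probabilities would come from the policy network and an optimizer would minimize `policy_loss`; here the functions only show how the three columns of the collected dataset combine into one number.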