Duration 14:21

Distributed TensorFlow model training on Cloud AI Platform (TF Dev Summit '20)

Fan Yang
Software Engineer in Machine Learning at Google
TensorFlow Dev Summit 2020
March 11, 2020, Sunnyvale, USA

About speaker

Fan Yang
Software Engineer in Machine Learning at Google

I am a Software Engineer at Google working on computer vision and machine learning. Before that, I was a Research Scientist at eBay Inc. I have 10+ years of experience and 30+ publications and patents (2000+ citations) in premier conferences and journals, including PAMI, CVPR, ICCV, KDD, MM, and ACCV. I am an active reviewer for 30+ journals and conferences. My work on visual search has been deployed to production, supporting multiple products across teams, such as eBay ShopBot and Close5.

About the talk

In 2019, the Cruise machine learning platform team worked with the Google CMLE team to enable distributed TensorFlow model training with Horovod. We will present the work we have done and what we learned about training performance analysis, fault tolerance, monitoring, and cost management.

Transcript

I'm Yang Fan, an engineer at Cruise. Today I'd like to tell you about some work we did last year: how we, as a platform team, built the machine learning training infrastructure. Cruise is a San Francisco-based company where we are building the world's most advanced self-driving ridesharing technology, operating on San Francisco streets. If you visit San Francisco, you might have a chance to see our test cars running on the streets.

In order to operate a ridesharing service in San Francisco, our cars have to handle many complex situations every day. They have to interact with double-parked cars, cyclists, delivery trucks, emergency vehicles, and pedestrians. We see many interesting things on streets all over the city. Our cars are designed to handle most of those situations on their own. They carry multiple cameras, lidars, and other sensors to detect the surrounding environment and then make decisions. Many of these tasks are mission critical and driven by machine learning models.

Those models rely on a lot of data, and the data format is very complex and high dimensional. We have two workstreams in model development: one is to continuously retrain models using new data, and the other is to develop experiments and new models. The chart shows the number of model training jobs each week, and the number varies from week to week. In the meantime, we want to train machine learning models fast, but not at too high a cost.

As the platform team, we want to fulfill requirements for both our machine learning engineers and the company. Machine learning engineers want to train models at any time. The tools and supported frameworks should be flexible, without too many constraints. They want their jobs to start as soon as they submit them, and to see the results as early as possible. More importantly, for a company of our size, we need to make sure that we spend every penny wisely.

That means we need to run all of our model training jobs efficiently. We also need to let our engineers focus on building the mission-critical software that impacts the car's performance. To fulfill those requirements, we started to build our model training infrastructure on top of Google Cloud AI Platform. AI Platform offers a fully managed training service through command-line tools and web APIs, so we can launch our jobs on a single machine or on as many machines as our quota allows.

Therefore, we can let our machine learning engineers submit their training jobs whenever they want, with different GPU, CPU, and memory requirements. To run training on multiple machines efficiently, we also need a distributed training strategy. Our framework uses Uber's Horovod to scale training jobs beyond a single machine. So far, we have tested training models on 16 to 256 GPUs.

As we scale up the training cluster, we notice an efficiency decrease because of communication overhead: the more GPUs there are, the more communication happens between them. To find the most efficient setup for a training job, at least two factors need to be considered. One is the unit cost; the other is the training time. The chart on the right demonstrates one training example. Suppose we are training on one million images at one image per second per GPU, using NVIDIA GPUs.

The machine is an instance with 64 cores. Using too many GPUs, however, hurts efficiency. The blue line on the chart shows the training time flattening out from left to right as the number of GPUs increases. While we approach a stable training time, the average cost keeps increasing: the red line shows the total cost going up from left to right. Some point on the chart could be the most cost-effective. With these insights, we are building an automation system to provide the best setup to our users.
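The trade-off described above, where training time flattens while total cost keeps rising, can be made concrete with a toy model. The following is a minimal sketch and not the model used in the talk: the efficiency curve, the per-GPU throughput, and the hourly price are all assumed values chosen only to reproduce the shape of the two lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    gpus: int
    hours: float
    cost: float


def scaling_efficiency(gpus: int, overhead: float = 0.05) -> float:
    """Assumed efficiency curve: 1.0 on one GPU, lower as more GPUs communicate."""
    return 1.0 / (1.0 + overhead * (gpus - 1))


def estimate(gpus: int,
             num_images: int = 1_000_000,
             images_per_sec_per_gpu: float = 1.0,
             price_per_gpu_hour: float = 2.5) -> Estimate:
    """Estimate wall-clock hours and total cost for one pass over the data."""
    throughput = gpus * images_per_sec_per_gpu * scaling_efficiency(gpus)
    hours = num_images / throughput / 3600.0
    return Estimate(gpus, hours, hours * gpus * price_per_gpu_hour)


def cheapest_within(candidates, max_hours: float) -> Estimate:
    """Pick the lowest-cost GPU count whose estimated time meets the deadline."""
    feasible = [e for e in map(estimate, candidates) if e.hours <= max_hours]
    if not feasible:
        raise ValueError("no candidate meets the deadline")
    return min(feasible, key=lambda e: e.cost)


if __name__ == "__main__":
    for n in (1, 8, 16, 64, 256):
        e = estimate(n)
        print(f"{e.gpus:>4} GPUs  {e.hours:8.1f} h  ${e.cost:10.2f}")
```

Under these assumptions the time per pass approaches a floor set by the communication overhead, while the bill grows roughly linearly with the number of GPUs, so the cost-effective choice is the smallest cluster that still meets the deadline.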

This is a diagram of how the parts of our system interact. The framework packages the code and dependencies into a training job, then submits the job to our governance service. The service examines the job parameters to prevent any accidental misconfiguration, for example, using too many CPUs or too much memory. At the same time, the service translates the compute requirements into an actual machine type on AI Platform, so users don't need to memorize all the different machine types or optimize the cost themselves.
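A governance step of this kind can be sketched as a validation followed by a table lookup. Everything below is hypothetical: the limits, the `ComputeRequest` and `MachineSetup` types, and the small machine catalogue are illustrative stand-ins, and the listed machine sizes should be checked against the provider's current documentation before use.

```python
from dataclasses import dataclass

# Hypothetical policy limits and machine catalogue, for illustration only.
MAX_GPUS_PER_JOB = 256
GPUS_PER_MACHINE = 8
MACHINE_CATALOGUE = [  # (machine type, vCPUs, memory in GB), smallest first
    ("n1-standard-8", 8, 30),
    ("n1-standard-16", 16, 60),
    ("n1-highmem-32", 32, 208),
    ("n1-highmem-64", 64, 416),
]


@dataclass(frozen=True)
class ComputeRequest:
    gpus: int
    cpus_per_machine: int
    memory_gb_per_machine: int


@dataclass(frozen=True)
class MachineSetup:
    machine_type: str
    machine_count: int
    gpus_per_machine: int


def translate(req: ComputeRequest) -> MachineSetup:
    """Validate a request, then map it to the smallest machine type that fits."""
    if not 1 <= req.gpus <= MAX_GPUS_PER_JOB:
        raise ValueError(f"gpus must be between 1 and {MAX_GPUS_PER_JOB}")
    if req.gpus > GPUS_PER_MACHINE and req.gpus % GPUS_PER_MACHINE:
        raise ValueError(
            f"multi-machine jobs need a multiple of {GPUS_PER_MACHINE} GPUs")
    for machine_type, cpus, memory_gb in MACHINE_CATALOGUE:
        if cpus >= req.cpus_per_machine and memory_gb >= req.memory_gb_per_machine:
            gpus_per_machine = min(req.gpus, GPUS_PER_MACHINE)
            return MachineSetup(machine_type,
                                req.gpus // gpus_per_machine,
                                gpus_per_machine)
    raise ValueError("request exceeds the largest allowed machine: "
                     "too many CPUs or too much memory")
```

With this shape, a user only states what the job needs, for example `translate(ComputeRequest(gpus=16, cpus_per_machine=16, memory_gb_per_machine=60))`, and the service decides the machine type and count.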

All training jobs are registered in our internal tracking system, where we keep a reference to each job's source code and its parameters. When we want to analyze model performance, we can trace back to the job information if needed. The training input and output are stored in Google Cloud Storage, which gives jobs efficient storage and high throughput. Once a job starts, it produces metrics and results for our monitoring service.

While the job is running, our monitoring service constantly pulls the job metrics through the AI Platform APIs and feeds them into Datadog. So far, we care about GPU and CPU utilization and job duration. When utilization is too low, it could mean that the job got stuck or doesn't need so many resources. Then we can notify the job owner to inspect the job, or adjust the machines to save some cost. Our internal framework is a bridge between the training service, our internal systems, and our users.
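The utilization check described above can be sketched as a small rule over recent metric samples. This is an assumed design, not the monitoring service from the talk: the metric names, the 20% threshold, the six-sample window, and the 72-hour duration limit are all placeholders.

```python
from statistics import mean
from typing import Dict, List, Sequence


def low_utilization(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True when the average of the most recent `window` samples is below threshold."""
    if len(samples) < window:
        return False  # not enough data yet to judge
    return mean(samples[-window:]) < threshold


def findings(metrics: Dict[str, Sequence[float]], duration_hours: float,
             threshold: float = 0.2, window: int = 6,
             max_hours: float = 72.0) -> List[str]:
    """Return human-readable warnings for one job; an empty list means healthy."""
    out = []
    for name in ("gpu_utilization", "cpu_utilization"):
        if low_utilization(metrics.get(name, ()), threshold, window):
            out.append(
                f"{name} below {threshold:.0%}: job may be stuck or over-provisioned")
    if duration_hours > max_hours:
        out.append(f"job has run longer than {max_hours:g} hours")
    return out
```

Averaging over a window rather than reacting to a single sample avoids flagging jobs during short pauses such as checkpoint writes.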

The framework hides all the interactions with the training service, so users can focus on writing model code. They can turn on the distributed training strategy by changing one line in the configuration. As shown on the previous slide, the framework automatically packages the code and submits the training job. Users specify the compute resources by the number of GPUs, and the framework figures out the actual machine setup. The framework also provides the interface to the governance and monitoring services. This slide demonstrates how users turn on Horovod training in the config.

The framework also changes some other parts of the model's behavior. One important change is how the process starts, or how the main process sends commands to the other workers in the cluster. In order to use Horovod on AI Platform, we wrote a bootstrap script and run it on the master. On the master, the bootstrap script sends the commands to the workers.

The script first reads the cluster configuration and uses this information to set up an SSH remote command, which it then uses to fetch the GPU information on the workers. Because we have NVIDIA GPUs on the workers, it also needs the NVIDIA driver path. However, different NVIDIA driver versions are placed in different locations, so we have to find the path before we execute. Once we have all of this, we use it in the mpirun command.
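The last step of such a bootstrap script, turning the cluster description into an mpirun invocation, might look like the sketch below. It rests on several assumptions that are not stated in the talk: that the cluster description arrives as a `TF_CONFIG`-style JSON document, that every host has the same number of GPUs, and that the driver library directory has already been discovered. The mpirun flags follow the commonly documented way of launching Horovod with Open MPI; the function names are hypothetical.

```python
import json
from typing import List, Sequence


def hosts_from_tf_config(tf_config_json: str) -> List[str]:
    """Extract host names from a TF_CONFIG-style cluster description."""
    cluster = json.loads(tf_config_json)["cluster"]
    hosts = []
    for role in ("master", "chief", "worker"):
        for address in cluster.get(role, []):
            hosts.append(address.rsplit(":", 1)[0])  # drop the port
    return hosts


def build_mpirun_command(hosts: Sequence[str], gpus_per_host: int,
                         driver_lib_dir: str,
                         train_cmd: Sequence[str]) -> List[str]:
    """Build an mpirun argument list that starts one process per GPU."""
    total_processes = len(hosts) * gpus_per_host
    host_list = ",".join(f"{host}:{gpus_per_host}" for host in hosts)
    return [
        "mpirun", "-np", str(total_processes), "-H", host_list,
        "-bind-to", "none", "-map-by", "slot",
        # Export the driver location found on the workers to every process.
        "-x", f"LD_LIBRARY_PATH={driver_lib_dir}",
        "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        *train_cmd,
    ]


if __name__ == "__main__":
    example = json.dumps({
        "cluster": {"master": ["master-0:2222"],
                    "worker": ["worker-0:2222", "worker-1:2222"]},
        "task": {"type": "master", "index": 0},
    })
    # "/path/to/nvidia/lib64" is a placeholder, not a real driver location.
    cmd = build_mpirun_command(hosts_from_tf_config(example), 8,
                               "/path/to/nvidia/lib64",
                               ["python", "train.py"])
    print(" ".join(cmd))
```

Returning an argument list instead of a shell string keeps the command safe to pass to `subprocess.run` without extra quoting.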

Different from a regular TensorFlow job, we don't want to pause the training process to periodically run the evaluation. That's because we have to keep all the workers running at the same pace. If one worker pauses for evaluation, the other workers have to wait for it, and then the whole job fails. Instead, we fork a separate process on the master that watches for new checkpoints, and each new checkpoint triggers the evaluation process.
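Decoupling evaluation from training in this way comes down to a polling loop over the checkpoint location. The sketch below is one assumed way to structure it; the `ckpt-<step>` naming, the callback signatures, and the 60-second polling interval are hypothetical, and `list_checkpoints` stands in for whatever lists the checkpoint directory.

```python
import time
from typing import Callable, Iterable, List, Set


def step_of(checkpoint: str) -> int:
    """Parse the step from a name such as 'ckpt-1500' (hypothetical naming)."""
    return int(checkpoint.rsplit("-", 1)[1])


def evaluate_new(list_checkpoints: Callable[[], Iterable[str]],
                 evaluate: Callable[[str], None],
                 seen: Set[str]) -> List[str]:
    """Evaluate every checkpoint not seen before, oldest first."""
    fresh = sorted(set(list_checkpoints()) - seen, key=step_of)
    for checkpoint in fresh:
        evaluate(checkpoint)
        seen.add(checkpoint)
    return fresh


def watch(list_checkpoints: Callable[[], Iterable[str]],
          evaluate: Callable[[str], None],
          training_done: Callable[[], bool],
          poll_seconds: float = 60.0) -> None:
    """Poll until training finishes, then make one final pass."""
    seen: Set[str] = set()
    while not training_done():
        evaluate_new(list_checkpoints, evaluate, seen)
        time.sleep(poll_seconds)
    evaluate_new(list_checkpoints, evaluate, seen)
```

Because the watcher runs in its own process, a slow evaluation only delays later evaluations and never stalls the synchronized training workers.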

Another feature we provide in the framework is error handling. This is one example of a long-running job, which ran for more than 24 hours. In the middle of the training, one of the workers shut down. Because all workers have to be synchronized, a single failing worker fails the whole job. Our framework detects the error, waits for the worker to come back, and then automatically restarts the job, restoring the progress from the last written checkpoint. The job eventually completed without human rescue.
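The recovery behavior described above, detect the failure, wait, and resume from the last checkpoint, can be sketched as a retry wrapper. This is an illustration under assumptions: `WorkerLostError`, the callback signatures, the retry limit, and the wait time are hypothetical names and values, and the real framework may detect failures differently.

```python
import time
from typing import Callable, Optional


class WorkerLostError(RuntimeError):
    """Hypothetical error raised when a worker disappears mid-training."""


def run_with_restarts(train: Callable[[Optional[str]], str],
                      latest_checkpoint: Callable[[], Optional[str]],
                      max_restarts: int = 3,
                      wait_seconds: float = 300.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Run `train`, resuming from the newest checkpoint after each worker loss."""
    restarts = 0
    while True:
        try:
            # None means "start from scratch"; otherwise resume from the checkpoint.
            return train(latest_checkpoint())
        except WorkerLostError:
            if restarts >= max_restarts:
                raise  # give up and surface the failure to a human
            restarts += 1
            sleep(wait_seconds)  # give the lost worker time to come back
```

Bounding the number of restarts keeps a persistently broken job from retrying forever and spending money without making progress.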

This year, we started exploring distributed training in TensorFlow 2, which is natively supported and provides a better experience. It supports collective ops with gRPC or the NCCL library. NCCL, for example, now scales further than before, and its collective reduction uses a ring or a tree depending on the topology of the interconnections. Google Cloud AI Platform is going to support TensorFlow 2.x out of the box, so you no longer need to run the bootstrap script.

If you're building model training infrastructure with a small team, I have four takeaways for you. First, try not to reinvent the wheel. There are a lot of solutions available on the market; reach out to them, and I believe they are willing to solve the problem for you. We received a significant amount of support from the AI Platform team. The collaboration between us allows Cruise to focus on building the best self-driving technology. Meanwhile, the AI Platform team can build a better machine learning service by learning the customer needs from us.

Secondly, be aware of cost. Part of our responsibility is to help the company operate sustainably. Although model training can be very expensive, there are many ways to save cost. The platform team not only needs to provide the hardware and software solutions, but also the best practices for improving efficiency. Thirdly, rely on the community, which is answering questions, fixing bugs, and delivering new features. We also want to thank the excellent open source projects. We have learned a lot from them all, so we are looking forward to giving back.

Last but not least is the developer experience. As the platform team at Cruise, we serve our internal customers, the machine learning engineers. They already face many challenges day to day, and using our platform shouldn't be another one. Our service should help our customers work more smoothly and in less time. At Cruise, we're building the world's most advanced self-driving technology in San Francisco. If you're interested, please check out our website, and please feel free to reach out to us. Thank you for listening.
