About the talk
Google Lens is a visual browser for the world around us. Bringing Lens's server-side scene understanding capabilities on-device can improve the Lens user experience by lowering latency and increasing reliability in poor network conditions. In this talk, we discuss a path to leveraging TFLite to achieve the goal of powering Lens features on-device.
Hi, my name is Gloria and I'm a software engineer on Google Lens. Today I'll be talking about how TensorFlow, and particularly TFLite, helps us bring the powerful computer vision capabilities of Google Lens on-device.

First, what is Google Lens, and what is it capable of? Lens is a mobile app that allows you to search what you see. It takes input from your phone's camera and uses advanced computer vision technology to extract semantic information from image pixels. As an example, if you point Lens at your train ticket, which is in Japanese, it will automatically translate it and show it to you in English in the live viewfinder. You can also use Lens to calculate the tip at the end of dinner by simply pointing your phone at the receipt. Aside from searching for answers and providing suggestions, Lens also integrates live AR experiences into your phone's viewfinder using an optical motion-tracking and on-device rendering stack. As you saw in the examples, Lens utilizes the full spectrum of computer vision capabilities, starting with image quality enhancement such as denoising, as well as motion tracking to enable AR experiences, and in particular deep learning models for object detection and semantic understanding.

To give you a quick overview of Lens's computer vision pipeline today: on the mobile client, we select an image from the camera stream to send to the server for processing. On the server side, the query image gets processed using a stack of computer vision models in order to extract text and object information from the pixels. These semantic signals are then used to retrieve search results from our server-side index, which then get sent back to the client and displayed.
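That server-side flow can be sketched roughly as follows. This is an illustrative outline only, not Lens's actual code; every function here is a hypothetical stand-in for a whole subsystem.

```python
# Illustrative sketch of the server-side query flow described in the talk.
# All functions are hypothetical stand-ins, not Lens's real implementation.

def run_text_recognition(image_bytes):
    # Stand-in for the server-side OCR stack.
    return ["train ticket"]

def run_object_detection(image_bytes):
    # Stand-in for the server-side object detection stack.
    return ["ticket"]

def retrieve_results(text, objects):
    # Stand-in for the search-index lookup keyed on semantic signals.
    return [f"result for {signal}" for signal in text + objects]

def process_query_image(image_bytes):
    """Extract text and object signals, then retrieve search results."""
    text = run_text_recognition(image_bytes)
    objects = run_object_detection(image_bytes)
    return retrieve_results(text, objects)
```

The point of the shape, rather than the stubs, is that the client only ever sees the final result list; all semantic extraction happens after the image payload crosses the network.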
Lens's current computer vision architecture is very powerful, but it has some limitations. Because we send a lower-resolution image to the server in order to minimize the payload size, the quality of the computer vision predictions is lower due to compression artifacts and reduced image detail. Also, queries are processed on a per-image basis, which can sometimes lead to inconsistencies, especially for visually similar objects. You can see on the right that a moth was misidentified by Lens as a chocolate cake, which is just one example of how this may impact the user.

Finally, Lens aims to provide great answers to all of our users instantly after opening the app. We want Lens to work extremely fast and reliably for all users, regardless of device type and network connectivity. The main bottleneck to achieving this vision is the network round-trip time with the image payload. To give you a better idea about network latency: it goes up significantly with poor connectivity as well as with payload size. In this graph, you can see latency plotted against payload size, with the blue bars representing a 4G connection and the red bars a 3G connection. For example, sending a 100 KB image on a 3G network can take up to 2.5 seconds, which is very high from a user experience standpoint.

In order to achieve our goal of less than one second end-to-end latency for all Lens users, we're exploring moving server-side computer vision models entirely on-device. In this new architecture, we can stop sending pixels to the server by extracting text and object features on the client side. Moving machine learning models on-device eliminates the network latency.
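As a back-of-the-envelope check on why that graph is so discouraging (illustrative arithmetic only, not measured data): 100 KB in roughly 2.5 s implies an effective 3G throughput of about 40 KB/s, so a one-second end-to-end budget leaves room for at most a ~40 KB payload before the server has done any work at all.

```python
# Back-of-the-envelope arithmetic from the latency graph discussed above
# (100 KB taking ~2.5 s on 3G); illustrative only, not measured data.

payload_kb = 100
observed_seconds = 2.5

effective_kb_per_s = payload_kb / observed_seconds    # ~40 KB/s on 3G
budget_seconds = 1.0                                  # the sub-second goal
max_payload_kb = effective_kb_per_s * budget_seconds  # ~40 KB, before any
                                                      # server compute time

print(f"{effective_kb_per_s:.0f} KB/s -> at most {max_payload_kb:.0f} KB in {budget_seconds:.0f} s")
```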
But this is a significant shift from the way Lens currently works, and implementing this change is complex and challenging. Among the main technical challenges: mobile CPUs are much less powerful than specialized server-side hardware architectures like TPUs. We've had some success porting server models on-device using deep learning architectures optimized for mobile CPUs, such as MobileNets, in combination with quantization for mobile hardware acceleration. Retraining models from scratch is also very time-consuming, but training strategies like transfer learning and distillation significantly reduce model development time by leveraging an existing server model to train the mobile model. Finally, the model inference infrastructure needs to be inherently mobile-efficient and manage the trade-off between quality and latency; to that end, we use TFLite in combination with MediaPipe as an execution framework to deploy an optimized ML pipeline on mobile devices.

Our high-level developer workflow to port a server model on-device is: first, pick a mobile-friendly architecture such as a MobileNet, and then train the model using a TensorFlow training pipeline, distilling from the server model.
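The distillation step can be sketched generically like this. It's a framework-free illustration, not Lens's actual training code: the student model is trained to match the server (teacher) model's temperature-softened output distribution instead of hard labels.

```python
# Generic sketch of knowledge distillation: train a small "student" model
# against a large "teacher" model's softened predictions. Not Lens's code.
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax; higher T yields softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between softened teacher targets and student outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
```

In practice this term is usually combined with an ordinary hard-label loss; the softened targets carry the teacher's "dark knowledge" about which wrong classes are almost right.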
We then evaluate its performance using TensorFlow's evaluation tools, and finally save the trained model at a checkpoint we like and convert it to the TFLite format in order to deploy it on mobile. Here's an example of how easy it is to use TensorFlow's command-line tools to convert the SavedModel to TFLite.
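A conversion invocation can look like the following; the paths are hypothetical, and this assumes a TensorFlow SavedModel exported at the chosen checkpoint.

```shell
# Convert a SavedModel to TFLite with TensorFlow's tflite_convert tool.
tflite_convert \
  --saved_model_dir=/tmp/lens_demo/saved_model \
  --output_file=/tmp/lens_demo/model.tflite
```

The same conversion is available programmatically through `tf.lite.TFLiteConverter.from_saved_model` if you want it inside a Python pipeline.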
Switching gears a little to how Lens uses on-device computer vision to bring helpful suggestions instantly to the user: we can use on-device ML to determine whether the user's camera is pointed at something that Lens can help them with. You can see here in this video that when the user points at a block of text, a suggestion chip is shown; when pressed, it brings the user to Lens, which then allows them to select the text and use it to search the web.

In order to enable these kinds of experiences on-device, multiple visual signals are required. To generate a signal, Lens uses a cascade of text, barcode, and visual detection models, implemented as a directed acyclic graph, parts of which can run in parallel. The raw text, barcode, and object detection signals are further processed using various on-device annotators and higher-level semantic models, such as fine-grained classifiers and embedders. This graph-based framework of models allows Lens to understand the scene's content as well as the user's intent. To further optimize for low latency, Lens on-device uses a set of inexpensive ML models which can run within a few milliseconds on every camera frame. These perform functions like frame selection and coarse classification, optimizing for latency and compute by carefully selecting when to run the rest of the ML pipeline.

In summary, Lens can improve the experience of all our users by moving computer vision on-device, and TensorFlow and TFLite are critical in enabling this vision. We can rely on cascading multiple models to scale to many device types and to tackle reliability and latency. You too can add computer vision to your mobile product: first, you can try Lens to get some inspiration for what you could do; then you can check out the pre-trained mobile models that TensorFlow publishes; and you can also follow something like the MediaPipe tutorials to help you build your own custom cascade.
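The frame-gating pattern behind such a cascade can be illustrated like this; all names here are hypothetical stand-ins, not Lens's real code.

```python
# Hypothetical sketch of gating an expensive model cascade behind a cheap
# per-frame check, as described in the talk; not Lens's real implementation.

def cheap_gate(frame):
    # Stand-in for a coarse classifier that runs in a few milliseconds.
    return "text" in frame

def heavy_cascade(frame):
    # Stand-in for the full text/barcode/object detection cascade.
    return {"signal": "text_block", "suggestion": "select text"}

def process_frame(frame):
    """Run the expensive cascade only when the cheap gate fires."""
    if not cheap_gate(frame):
        return None  # most frames are skipped, saving latency and compute
    return heavy_cascade(frame)
```

Here `process_frame("a page of text")` would surface a suggestion, while frames the gate rejects are dropped without the heavy models ever running.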