Mustafa is a software engineer at Snap Inc. He currently leads the Core Camera team, improving the performance and reliability of the Snapchat camera, which captures billions of moments every day. Prior to Snap, Mustafa spent 8 years at Amazon launching new AWS services (CloudFront, Aurora) and improving Elastic Block Store (EBS). Mustafa has a Bachelor's in Systems Design Engineering from the University of Waterloo and lives in Seattle.
Vinit is a product manager on the Android camera platform team. In his role he focuses on the camera framework and advancing the camera ecosystem. Prior to Google, Vinit led product at multiple startups and developed products for companies including Apple, Amazon, Nokia, and Qualcomm. Vinit earned a bachelor's degree in electrical engineering from the University of California, San Diego.
Developer advocate at Google working with multimedia partners; former camera engineer at Microsoft.
About the talk
Snapchat and Android camera engineers talk about developing camera-first experiences across the different surfaces available, including phones, laptops, and other form factors.
Hi everyone, thank you all for coming. My name is Oscar, and I'm from the Developer Relations team. Today we're going to talk about the camera. This session will cover three major topics. First, we'll talk about the inner workings of the camera on Android. Second, Snap is here with us today; they're going to showcase how they ship a camera app to millions of users. And we'll close with our vision of the future of the camera APIs and the ecosystem that goes with them. Let's jump right in.

The camera is a series of compromises. Some are outside of your control as a developer, but understanding them will help you make better decisions where you do have control in your app. Let's go over how the camera works at the physical level. It is a pipeline of steps: visible light goes through a lens that directs it to filters, which pass the light on to millions of light sensors, which finally convert the light into a 2D matrix of pixels. When you think of light sensors, the first name that may come to mind is CCD; nowadays we have other kinds, such as CMOS (APS) sensors, which are actually more common in today's cameras, DSLRs, and phones. From now on we'll refer to this component as the camera module. The output of the camera module is a raw picture, a raw frame. It is not entirely unprocessed, but it has not been processed outside of the camera module at this stage. You can think of the camera module as a black box that dispatches these frames.

In early hardware we had three physical pipes for the different kinds of pictures a device produces. Nowadays we have a single pipe, thanks to tricks like multiplexing, but the conceptual model remains one of separate pipelines, with separate endpoints, for each kind of image. Generally speaking, we have three broad use cases: video preview, photo, and video recording. The camera module also stays a black box, and it contains an increasing amount of smarts within it, some of which run in a closed feedback loop that you as a developer have very little visibility into. A simple example: when you enable, say, autofocus or auto white balance, the camera module may be looking for faces to see how to better optimize those settings, and you don't really have any control over that. From a framework standpoint, what developers do is send one configuration in and get one frame out. This is very, very important: you send one configuration in, you get one frame out; for each frame, there is a configuration that goes with it. To help you with this we have a set of template configurations; we'll cover those later as well.

Now let's zoom into the preview pipeline, the one on top. As I said, each pipeline shown is a broad use case. When it comes to preview, you can think of it as the viewfinder: whenever somebody is holding their phone up and wants to see what the camera sees, this is the use case we're talking about. We need to compromise on something, because the camera can only do so many things at once, and for preview we choose to compromise on quality; we care much more about latency. We want the user to be able to see what the camera sees as fast as possible. This generally results in better latency because we're able to save on quality. Some of those compromises come in the form of, for example, reduced resolution, less stabilization, or less noise reduction.

Moving on to the photo pipeline: you may want to enable your users to capture high-quality images. Keep in mind what I said before: one frame configuration in, one frame out. A different frame configuration may mean that you will not be able to reuse the same pipeline. An example of this is if you are requesting a different exposure, so a different exposure time. If you are mindful while you set up your pipelines, you may be able to avoid this compromise entirely. When it comes to photos, we care much more about quality and a lot less about speed, so we choose to compromise on latency in exchange for a much better quality image.

Last, we have the video pipeline, for the video recording use case. In theory this pipeline should be flying; in practice, there are a number of bottlenecks, and frames can only come so fast out of the camera. Some of those bottlenecks include the exposure time; others can be processing power or memory, among other things. And unlike the previous two pipelines, we can't compromise on either quality or speed: we definitely need frames coming consistently out of the camera at a certain rate, and at the same time we also care about quality. So in this kind of use case it is very important to find the right balance.

Generally speaking, we can only test for so many things. Android runs on many, many devices, not just high-end ones, and unfortunately we can only test so many combinations. In general, the best-tested scenario is the preview pipeline running plus one other; you're going to have a really hard time if you want all three pipelines running simultaneously.

So, we discussed trade-offs made by the hardware, we discussed trade-offs made by the framework, and now we're going to dive into the application layer: what we have to do
and what we can do. Let's walk through the end-to-end process of a frame, from the creation of a request until the picture is finally in memory, ready for us to use. All the steps are easy as one-two-three, and then, profit; if you don't believe me, ask Snapchat. The first decision you have to make as a developer is selecting the right device out of the ones available on your list. "Device" here means the camera module, not the phone; each phone may have multiple cameras.
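To ground the walkthrough that follows, here is a rough sketch of the whole flow as Camera2 code. This is an illustrative fragment, not a runnable sample: `context` and `previewSurface` are assumed to already exist, and permission checks, exception handling, and handler threads are elided.

```java
// Illustrative Camera2 fragment (API 21+); assumes `context` and `previewSurface` exist.
CameraManager manager =
        (CameraManager) context.getSystemService(Context.CAMERA_SERVICE);

// Step 1: select a camera device, e.g. the first back-facing one.
String cameraId = null;
for (String id : manager.getCameraIdList()) {
    Integer facing = manager.getCameraCharacteristics(id)
            .get(CameraCharacteristics.LENS_FACING);
    if (facing != null && facing == CameraMetadata.LENS_FACING_BACK) {
        cameraId = id;
        break;
    }
}

manager.openCamera(cameraId, new CameraDevice.StateCallback() {
    @Override public void onOpened(CameraDevice device) {
        // Step 2: the output targets (here, just the preview Surface) were set up
        // beforehand; the capture session is specific to this device.
        device.createCaptureSession(Arrays.asList(previewSurface),
                new CameraCaptureSession.StateCallback() {
                    @Override public void onConfigured(CameraCaptureSession session) {
                        // Step 3: build a request from a template matching the use case.
                        CaptureRequest.Builder request =
                                device.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW);
                        request.addTarget(previewSurface);
                        // Steps 4-5: make it repeating (one config, many frames) and
                        // listen on the capture callback for per-frame metadata.
                        session.setRepeatingRequest(request.build(), null, null);
                    }
                    @Override public void onConfigureFailed(CameraCaptureSession session) { }
                }, null);
    }
    @Override public void onDisconnected(CameraDevice device) {
        device.close(); // e.g. a higher-priority client took the camera
    }
    @Override public void onError(CameraDevice device, int error) { device.close(); }
}, null);
```

For a photo, the same flow would use TEMPLATE_STILL_CAPTURE, an ImageReader Surface as an extra output target, and a one-shot `capture()` instead of `setRepeatingRequest()`.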
This is especially important now that we have multi-camera APIs: you may have multiple devices that are, say, back-facing that you can choose from, so you have to go through those devices and see which one fits your use case best. Once you've successfully opened the camera, you get a CameraDevice instance, and that is necessary to proceed. In the openCamera callback you can use the state callbacks to monitor the status of your camera device; it is entirely possible that the camera may be taken away if a higher-priority process requires the device, which could happen, for example, in a multi-window environment.

The second step in our end-to-end journey is your camera output targets. This is where each of the pipelines we covered earlier will deliver your frames, in the form of buffers. The memory may already be allocated, in which case all you need to do is get a reference to the underlying Surface; otherwise, you may have to wait for a callback. Only then can the capture session be created. It is worth noting that the camera capture session is specific to the device that you opened earlier.

Third, you can build a capture request. One of the predefined templates must be used; if you don't want to take any of the hints from the system, you can use the manual template for full control. Otherwise, the framework provides a set of templates that closely align with the camera conceptual model we discussed earlier: there are templates for preview, for photo, and for video recording. The capture request contains a specific frame configuration and the output target; recall, frame configuration in, frame out. The output target must be one of the previously defined ones. You cannot just use another Surface here; it needs to be one of the surfaces that you declared as part of the session.

To recap: we've chosen the camera that we want, we've created the capture session for that particular device, and now we've built a capture request that will be used in that session. We can finally ask for a frame. We send the capture request to the session, and then we wait for the callback. We can do this in two ways: we can send a request once, or we can make it a repeating request. The latter makes more sense in use cases like preview or video recording: you don't want to be sending a configuration for every single frame, and this is a very easy way to send one configuration and ask the framework to keep repeating it.

Now, which callback do you want to listen to? That depends on what you want to get out of it. If you want the frame metadata, you should look inside the capture callback. If you want the frame for CPU processing, one of your options is ImageReader; generally speaking, you can get the pixel information out of whatever output Surface you have set up, and ImageReader is one of the ways to do it. If you want to get it onto the GPU, a great option is RenderScript Allocations; obviously that's not the only choice, as you can also use an OpenGL texture. And this is the cycle, steps one through five, that we need to go
through every single time: one configuration in, one frame out, and a bunch of callbacks in the process.

Thanks, Oscar. My name is Mustafa; I'm a software engineer at Snap, and I work on the Snapchat app. Today what I want to do is take what Oscar said about the lower layers of the camera and work upward to your app: the design choices that you make, and how they influence how you interact with the camera. We'll talk about some trade-offs in terms of architecture, how to architect the camera framework inside of your app, and we'll talk
about specific challenges in our more narrow use case. So: nowadays, if you've ever used Snapchat (I'm hoping everyone has), the first thing that happens is that it opens directly into the camera. The reason for this is that people now use the camera as a means of communication: they take a picture, they add metadata and creativity, and then they send it. It's really like the cursor: what do you have to say? Getting that out as quickly as possible is the thing we really value.

There are a few key camera design decisions that we've made, and these influence how we interact with the camera. The first of those is full screen: video and pictures, front camera and rear camera, all of them are full screen. You typically don't see this in a lot of camera apps. If you take your camera out, picture mode is usually a 4:3 aspect ratio, because the camera is actually optimized for specific aspect ratios. Since your display is usually 16:9 or some other aspect ratio that might not be available for picture mode, we have to make a trade-off.
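One way to sketch that trade-off in code (class and method names here are mine, not Snapchat's): given the output sizes a sensor supports, pick the cheapest one that still matches the display's aspect ratio.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy picker: cheapest supported size whose aspect ratio matches the display. */
final class FullScreenSizePicker {
    static final class Size {
        final int width, height;
        Size(int w, int h) { width = w; height = h; }
        double aspect() { return (double) width / height; }
        long area() { return (long) width * height; }
    }

    /** Smallest-area size within `tolerance` of the target aspect ratio, or null. */
    static Size pick(List<Size> supported, double targetAspect, double tolerance) {
        return supported.stream()
                .filter(s -> Math.abs(s.aspect() - targetAspect) < tolerance)
                .min(Comparator.comparingLong(Size::area))
                .orElse(null);
    }

    public static void main(String[] args) {
        // Typical sensor sizes: a 4:3 full-sensor still size plus 16:9 video sizes.
        List<Size> supported = Arrays.asList(
                new Size(4032, 3024),   // 4:3, full sensor, slow to JPEG-encode
                new Size(3840, 2160),   // 16:9, still expensive
                new Size(1920, 1080));  // 16:9, cheap
        Size best = pick(supported, 16.0 / 9.0, 0.01);
        System.out.println(best.width + "x" + best.height);
    }
}
```

A real implementation would also enforce a minimum quality floor and fall back to cropping a 4:3 size when no matching size exists at all.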
Usually the device will have some output size that matches, but it might be a really high resolution, and just to encode that and create a JPEG out of it takes a long time, so we have to make trade-offs in terms of the resolution we select for taking pictures and video and so forth.

The second part is single mode: there's no video mode. You don't swipe to enter a video mode. When you make this design choice, it rules out a set of optimizations you could otherwise take advantage of. For example, for video there's a recording hint on Android where you can tell the system that the user is about to record, and there's a lot of optimization the system can do with that. But we don't know the user's intent: if they long-press, it's a video; if they single-tap, it's a picture. You don't know ahead of time, so you can't decide up front which configuration is always going to work. We try to use static configurations that are good in a lot of scenarios, and then we have certain ones that are dynamic, which we update depending on what the user's intent is.

The other part is that for a long time Snapchat has offered lenses, a lot of interactive features inside the camera, and each of those brings certain trade-offs. For example, if there's a lot of motion, your face is moving and you're trying to track the user's face, optical image stabilization might not work well with these other features that want to keep track of where the user is. There's also a lot of memory overhead and a lot of GPU work going on, and you need to balance all the work that you do.

The last part is that you can't cheat: you have to capture the picture in front of the user so they can get their message out as fast as possible. Most camera apps, when you take a picture, hide the latency: they save it in the background, so you don't see it; it just kind of falls away, and you don't really see how long it actually takes. For us, the real meat is when users edit the canvas of their image or video, and we have to do it right there and then, so you have to make certain trade-offs to get low latency.

Now, to support these and some other features, you need to think about what the overall architecture of your app is going to be if you truly want it to be universal. There are a few best practices that I'll share. Some of them are really about broadening the range of devices that you're able to
target, and how well you perform on those devices. And at some point you're going to add features, lenses, interactions, so you want a framework that's extensible, so that you can plug and play different things.

The first thing you have to interact with is the API that Oscar mentioned, the part where you talk to the camera. There are two versions of the API: camera1, which has been around for many years, and camera2, a newer API. Although camera1 is deprecated, when you have hundreds of millions of users, a big chunk of them are still on devices where you need camera1, so you have to support it regardless of what the device says. The second part is that there are different levels of camera2 support. To let OEMs adapt slowly, there's a LEGACY mode, which essentially means "we've made the camera2 API work, but under the hood it's really camera1." Camera2 behavior also varies by device: some devices say, "yep, camera2 works great here," but in reality, deep down, it's still camera1, so the performance you get may vary even across devices that claim the same support level. You really need to know each of the different products that you want to target.

So what we do is support both: since 2016 we've had both camera1 and camera2 code paths, and then we add a layer on top that unifies, for the application, what it looks like underneath. If you have shared code, for example for picking a resolution, it doesn't really matter which version you're using; the logic for the full-screen behavior that I mentioned can be shared by both. So there's some shared code, but the application deals with one unified API on top.
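One way to sketch that unification layer (all names here are mine, purely illustrative): a single app-facing interface with camera1 and camera2 backends behind it, chosen at runtime from the OS version plus device knowledge, so shared logic like resolution picking sits above the split.

```java
/** App-facing camera abstraction; the backend is chosen once at startup. */
interface AppCamera {
    String backendName();
    // In a real app: open(), startPreview(), takePhoto(), etc.
}

final class Camera1Backend implements AppCamera {
    public String backendName() { return "camera1"; } // wraps android.hardware.Camera
}

final class Camera2Backend implements AppCamera {
    public String backendName() { return "camera2"; } // wraps android.hardware.camera2
}

final class CameraBackendChooser {
    /** Device allowlists (remote config in practice) matter as much as the OS version:
     *  a device can advertise camera2 yet behave like camera1 underneath (LEGACY). */
    static AppCamera choose(int sdkInt, boolean deviceKnownGoodOnCamera2) {
        return (sdkInt >= 21 && deviceKnownGoodOnCamera2)
                ? new Camera2Backend() : new Camera1Backend();
    }

    public static void main(String[] args) {
        // A device with flaky camera2 support stays on camera1 even on a modern OS.
        System.out.println(choose(26, false).backendName());
        System.out.println(choose(26, true).backendName());
    }
}
```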
The next part is that there are millions of devices, and a lot of them have bugs: they say they support a certain feature, but it doesn't work, or you zoom beyond a certain point and it breaks. So we have a lot of configuration that allows us to decide, for specific devices and models, which behavior and which configuration to adopt. This remote configuration is really key, and over time you have to groom it, you have to maintain it; it's a very key part, not something that's just code.

Now let's build up this architecture. I've talked a bit about the camera interaction code that talks to the camera server, which is the Android system process that manages the camera, over binder IPC, a form of inter-process communication. But if you expose this to applications, it's a little too raw, because you might have a video chat feature, a camera, and video notes, and if they all try to talk to the camera directly, they could do things that conflict with each other. There's lifecycle and a lot of other things you need to be aware of. So what we do is have an operation queue that allows us to coalesce operations that are redundant and to manage invalid states, and thereby avoid a lot of conflicting operations across features, by having that queue and a thread that processes it.
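A toy version of such an operation queue (class names are mine): redundant pending operations are coalesced so that, for example, a burst of zoom updates collapses to the last one before the worker applies them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Single-threaded sketch of a coalescing camera operation queue. */
final class CameraOpQueue {
    enum Kind { OPEN, SET_ZOOM, CLOSE }

    static final class Op {
        final Kind kind;
        final int arg;
        Op(Kind kind, int arg) { this.kind = kind; this.arg = arg; }
    }

    private final Deque<Op> queue = new ArrayDeque<>();

    /** Enqueue, coalescing with any pending op of the same kind (last write wins). */
    void submit(Op op) {
        queue.removeIf(pending -> pending.kind == op.kind);
        queue.addLast(op);
    }

    /** In the real app a dedicated thread drains this; here we drain inline. */
    int drain() {
        int applied = 0;
        while (!queue.isEmpty()) {
            queue.removeFirst(); // a real queue would execute the camera call here
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        CameraOpQueue q = new CameraOpQueue();
        q.submit(new Op(Kind.OPEN, 0));
        q.submit(new Op(Kind.SET_ZOOM, 2));
        q.submit(new Op(Kind.SET_ZOOM, 5)); // coalesces with the previous zoom
        q.submit(new Op(Kind.SET_ZOOM, 9)); // only zoom=9 survives
        System.out.println(q.drain());      // OPEN plus one SET_ZOOM
    }
}
```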
The next part: everything on the right, the gray side, is really about the UI, the shutter button, the autofocus animations, all of that. It has a state machine that says, OK, the user tapped to autofocus: let's run the animation, let's ask the camera to autofocus. It saves the user's intent in memory and drives the asynchronous process to get it done. So the right-hand side and the left-hand side have different responsibilities. Over time your Surface comes in; that's where the preview frames Oscar was talking about start showing up on your screen. If you have interactive features enabled, like lenses, they get composited on top of it; that's done in OpenGL, and we try to do it synchronously and avoid copying anything large. Then, ultimately, your UI is composited on top. As far as the user is concerned it all looks integrated, but in reality it's multiple layers coming together.

Now, if you look at this architecture, what you notice is that the left and the right have different concerns, and one way to think about it is as a client-server architecture. The right-hand side is like the client; the left-hand side is your runtime, your processing, and the concerns are different. From a performance perspective, the left-hand side needs to worry about processing frames really fast and about stability, while the right-hand side is really about animations, new features, a lot of eye candy in terms of how the user interacts with the camera. And one cool thing you can do, if you've ever worked with client-server systems where sometimes the server is slow or has issues, is that we have ways of replacing
components. You can use dependency injection or other approaches where you say: let's abstract out the camera so that the UI thinks it's talking to the real thing, and we'll just mock the APIs to the Android system and maybe no-op them or send black frames, whatever we want to do. One place where we take advantage of this is performance testing. For example, if you add new UI logic and you want to see whether it regressed, you don't care about the camera at all; you don't care how long the camera operations take, because they're variable. If it takes, say, a hundred milliseconds to open the camera, and that's variable because it's a system service, then over time you might not be able to catch ten-millisecond regressions. In this example, what I'm showing is a startup regression test with the mock camera: the camera has been mocked out, the app is started repeatedly, and we collect metrics to see how long it takes to get to a stable point. Then we can actually catch regressions across revisions and see which change caused them.

Another example: instead of replacing the whole camera, you can replace part of it. You can say, the frames that are coming out, let's just replace those with something else; rather than mocking the whole camera, you keep the other functionality intact. In this case, we replace the frames with a video file. This is very useful in situations where, say, you're automatically testing your face tracking in a lab environment where you run integration tests: there are no faces, there are no people in the lab looking at the camera. So we actually use video files to drive that interaction. Here's a picture of me running some lens testing where the frames have been mocked out with a video file. There's me, hello, and you can see the app is showing it; it's not the camera, it's just a video, but everything else is real: it took a snap, it edits it, it does everything. We can mock out parts of the camera to let you add new features and test things in isolation.
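The frame-injection seam can be sketched as an interface (all names here are mine): production wires the real camera in; tests substitute a source that replays frames decoded from a fixture video.

```java
import java.util.Iterator;
import java.util.List;

/** Where frames come from is hidden behind an interface so tests can substitute it. */
interface FrameSource {
    byte[] nextFrame(); // null when exhausted (the real camera never exhausts)
}

/** Test double: replays pre-recorded frames, e.g. decoded from a fixture video. */
final class ReplayFrameSource implements FrameSource {
    private final Iterator<byte[]> frames;
    ReplayFrameSource(List<byte[]> recorded) { this.frames = recorded.iterator(); }
    public byte[] nextFrame() { return frames.hasNext() ? frames.next() : null; }
}

/** Consumer under test: here it just counts frames; real code would run tracking. */
final class FaceTracker {
    int processAll(FrameSource source) {
        int processed = 0;
        for (byte[] f = source.nextFrame(); f != null; f = source.nextFrame()) {
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        FrameSource fake = new ReplayFrameSource(
                List.of(new byte[]{1}, new byte[]{2}, new byte[]{3}));
        System.out.println(new FaceTracker().processAll(fake));
    }
}
```

The production path would implement FrameSource on top of the camera's output Surface, and nothing downstream needs to know which one it's talking to.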
So, in terms of architecture, this is one way to do it; there are many ways, but this gives you the performance considerations on the left, the remote configuration to support a wide range of devices with different cameras (multiple cameras, front-facing and rear), and extensibility in the UI.

Now let's jump to a specific case. In Snapchat, the majority of our media is just images, and everyone always asks: how do you guys take photos, how does it all work? We break it up into several stages. The first stage is capturing: quickly capturing the content so that you can process it and let the user be creative. Once they're done being creative, you composite all the different layers of creativity they've added and transcode the result into something that is efficient to transport over the network to the recipient. The last part is rendering; that's where the recipient looks at the content, and the choices you make in your player can really affect the quality of the images they see. All of these are complicated, and we have teams dedicated to each one, but I'll just touch on capturing today.

Now, Oscar mentioned that you tell the camera, "hey, capture a frame," you wait, and your callback is called, but there's a lot of latency involved in this step. In camera1 the API is called takePicture, and it takes around 400 milliseconds on a really high-end phone from last year. Whereas if you grab the preview frame, which is optimized for low latency, it's an order of magnitude faster. So what we do is periodically optimize our code as much as we can for specific devices where we feel the latency is pretty good, and we enable the high-quality path remotely for users on those devices. Users are delighted: wow, the quality went up, it looks really cool; it just happens one day and people notice it right away. But they also notice delay, and if you're trying to communicate and there's any lag, you'll notice that right away too. So we make a trade-off: on certain devices we do both, takePicture and a "screenshot" of the preview, and if takePicture takes too long, we fall back to the screenshot. We have a lot of tricks to work around this. You can see examples here where the quality difference, when you really zoom in, is noticeable. But it's not always a slam dunk: in this case, takePicture on the left is introducing noise, and sometimes the algorithms overcompensate and you'll see speckling. So it's not always "just turn it on": on certain front-facing cameras we disable it, even though it can sometimes give higher quality, especially in low-light conditions. We balance that.

So it really is a hierarchy of trade-offs. Oscar talked about the low level; you've got the design choices you've made in your application; and ultimately there's user intent: is the user sending a disappearing picture to someone, where you want to favor latency, or is the user preserving something? Hopefully some of the tips we've shared will help you build a better camera app. Thank you.
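The takePicture-or-screenshot fallback described above can be sketched with a timeout. This is a simplified model, not Snapchat's code; the strings stand in for real image buffers.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: try the slow high-quality path, fall back to the fast one on timeout. */
final class CaptureWithFallback {
    static String capture(CompletableFuture<String> takePicture,
                          String previewFrame, long timeoutMs) {
        try {
            return takePicture.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            return previewFrame; // viewfinder "screenshot": lower quality, instant
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Simulated takePicture that is slower than our latency budget.
        CompletableFuture<String> slow = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) { }
            return "high-quality JPEG";
        }, pool);
        System.out.println(capture(slow, "preview frame", 100)); // budget exceeded
        pool.shutdownNow();
    }
}
```

In practice the per-device latency budget itself would come from remote configuration, per the earlier discussion.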
Thanks, Mustafa. Hi everyone, my name is Vinit Modi, and I'm the Android camera platform PM. Oscar earlier identified a series of trade-offs required to use the camera2 API, and Mustafa further elaborated on that theme and how to apply those trade-offs in real-world situations. I'm going to change the conversation a bit and talk about how, working together, we can truly elevate both the developer and the end-user experience.

The camera is evolving. Today it has moved on from mere photography to an immersive medium; a great example of this is what was demonstrated in the keynote by a partner, where the camera is now used for navigation. In addition, cameras are everywhere today: from IoT devices (there's a great example in the Android Things booth) to your laptops, to multiple cameras on phones. And that's the trend I want you all to take away: devices with more than two cameras are becoming the norm.

Recall the camera model that Oscar talked about earlier in the talk. How do you extend that model in a multi-camera situation? Let's walk through an example. Imagine a device with three back cameras. The native camera app today has access to the physical streams from each of these sensors. In addition, the native camera app has something called the logical camera. This is a virtual camera, made up of all the physical sensors; it is a combined, fused stream, and it takes care of some of the trade-offs that Oscar and Mustafa alluded to in terms of power, performance, and latency.

So in the native camera app you actually have four cameras, one virtual and three physical, on a three-camera device. But as developers, you often get access to only one of them; this depends on the type of device, and in that case you end up making trade-offs, using the APIs differently from the native camera app. So we're pleased to announce that starting with Android P, you'll get access to all the cameras, from the logical camera to all the physical streams. This applies to both the front
and back sensors, as well as many different form factors. Here's an example use case: the image of the lady in the picture was taken in a bokeh mode. You can use this new multi-camera API for depth, for optical zoom, for getting monochrome frames directly from the sensor, and many more; most of all, we're excited about the use cases that you are going to build. Let's walk through this API real quick. The first thing you would do is check the camera characteristics to see whether the device supports the logical multi-camera API.
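This check, and the follow-up checks described next, look roughly like this in code. This is an illustrative Android P (API 28) fragment that assumes a CameraManager `manager` and a `String cameraId` already exist, with exception handling elided.

```java
// Illustrative Android P (API 28) fragment; assumes `manager` and `cameraId` exist.
CameraCharacteristics chars = manager.getCameraCharacteristics(cameraId);

// Check 1: does this device expose a logical multi-camera?
boolean isLogical = false;
for (int capability :
        chars.get(CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES)) {
    if (capability
            == CameraMetadata.REQUEST_AVAILABLE_CAPABILITIES_LOGICAL_MULTI_CAMERA) {
        isLogical = true;
    }
}

if (isLogical) {
    // Check 2: which physical sensors make up this logical camera?
    Set<String> physicalIds = chars.getPhysicalCameraIds();

    // Check 3: are the physical streams synchronized?
    Integer sync = chars.get(
            CameraCharacteristics.LOGICAL_MULTI_CAMERA_SENSOR_SYNC_TYPE);
    boolean calibrated = sync != null
            && sync == CameraMetadata.LOGICAL_MULTI_CAMERA_SENSOR_SYNC_TYPE_CALIBRATED;
}
```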
Next, you would check which physical cameras make up this logical device. Today we start with support for RGB and monochrome sensors, and we're looking to add more here. It gets very interesting when you have a logical camera that is abstracting one front and one back sensor. Finally, you would check whether the frames from all of these sensors are synchronized.

So, I know what you're thinking: great, a new API, but which devices will support it? Is it worth your time to invest in this new API? The answer is yes. Starting with Android P, all new devices will support this new API, and we're working with many partners to ensure that upgraded devices also support it. Let me call out a few partners we work very closely with. We worked with Huawei to ensure that monochrome sensors are supported; they actually helped us test this API. We worked with Xiaomi to ensure that a majority of their devices support this API as well. So later this year, you'll see devices from both Huawei and Xiaomi supporting this new multi-camera API on upgraded and new devices. And finally, the Android One team is working with manufacturers to ensure that this API works across all tiers of Android and is not exclusive to just the high tier.

I want to share that together we really can elevate the camera experience. Working with you, our partners, and manufacturers, we're really trying to make sure that we can bring amazing experiences, and most of all, we're very excited to see what you're going to build with these new APIs, knowing what trade-offs you need to make when using them. We'd love to continue the conversation in the after-session meeting space. Thank you, Oscar and Mustafa.