SIGCOMM 2020
August 11, 2020, Online, New York, NY, USA
On-Camera Filtering for Resource-Efficient Real-Time Video Analytics

About the talk

To cope with the high resource (network and compute) demands of real-time video analytics pipelines, recent systems have relied on frame filtering. However, filtering has typically been done with neural networks running on edge/backend servers that are expensive to operate. This paper investigates on-camera filtering, which moves filtering to the beginning of the pipeline. Unfortunately, we find that commodity cameras have limited compute resources that only permit filtering via frame differencing based on low-level video features. Used incorrectly, such techniques can lead to unacceptable drops in query accuracy. To overcome this, we built Reducto, a system that dynamically adapts filtering decisions according to the time-varying correlation between feature type, filtering threshold, query accuracy, and video content. Experiments with a variety of videos and queries show that Reducto achieves significant (51-97% of frames) filtering benefits, while consistently meeting the desired accuracy.

00:17 Trends in Video Analytics

01:52 Frame filtering

02:06 Approaches to Filtering Based on Prior Work

05:59 Reducto overview

10:06 How Long Does It Take To Update the Hash Table

15:49 How Long Does It Take To Update the Hash Table for New Video Content

17:20 Moving Cameras

About speaker

Arthi Padmanabhan
PhD Student at UCLA

Hi, my name is Arthi Padmanabhan, and I'm going to talk about Reducto, a system for filtering frames on cameras to make real-time video analytics more efficient. This is joint work with my colleagues from UCLA. Recently, there have been two major trends in video analytics. One is that cameras are becoming pervasive, and as a result, the amount of video data being generated is increasing rapidly. Simultaneously, what we're able to do with that data, in terms of deep-learning-based vision processing, is getting more advanced. These trends have given rise to a type of pipeline that

allows users to ask questions, or queries, on live video. We have a camera sitting at, say, a traffic intersection, and it sends all frames to a server, which runs the frames through a neural network; the result can be used to answer queries like "return the bounding boxes of all cars," where responses are given per frame. The major goals of this kind of system are two: one, meet the user-specified accuracy target for the query results, as compared to the ground truth or an expensive DNN configuration; and two, respond with low latency. Cameras typically generate 30 frames per second, so to keep up with real time,

each frame would need to be processed within 33.3 milliseconds of being reported by the camera. The challenge is that achieving these two goals is extremely resource-intensive. Sending a 1080p stream requires about 2 megabits per second, which adds up quickly with many cameras on the same network, and processing it with Faster R-CNN, a state-of-the-art object detector, takes about six seconds per one second of video. These costs, in networking and compute, make it hard to respond to queries with low latency.
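As a quick back-of-the-envelope check of these numbers, the real-time budget and the gap to full-DNN processing can be computed directly; the frame rate, bitrate, and detector speed come from the talk, while the camera count below is purely illustrative.

```python
# Back-of-the-envelope costs for a naive "send and process every frame" pipeline.
# Numbers from the talk: 30 fps cameras, ~2 Mbit/s per 1080p stream,
# Faster R-CNN at ~6 s of compute per 1 s of video. Camera count is illustrative.

FPS = 30
per_frame_budget_ms = 1000 / FPS                      # ~33.3 ms to stay real-time
print(f"per-frame budget: {per_frame_budget_ms:.1f} ms")

stream_mbps = 2.0
num_cameras = 50                                      # hypothetical deployment size
print(f"uplink needed for {num_cameras} cameras: {stream_mbps * num_cameras:.0f} Mbit/s")

# Faster R-CNN: ~6 seconds of compute per 1 second of 30-fps video.
detector_ms_per_frame = 6000 / FPS                    # ~200 ms per frame
print(f"Faster R-CNN: ~{detector_ms_per_frame:.0f} ms/frame, "
      f"{detector_ms_per_frame / per_frame_budget_ms:.1f}x over the real-time budget")
```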

The predominant technique used for improving efficiency is filtering frames before they reach the expensive DNN. The goal of filtering is to remove frames that wouldn't change the query result. For instance, if we're tracking cars and there are no cars at night on a street, most frames can be dropped and we can reuse results from prior frames. The three main approaches to filtering based on prior work are: one, run a smaller object detection model as an approximation of the full DNN and only send frames to the DNN if that model is not confident; two, run a cheaper model, a small binary classifier, that only sends frames containing certain

objects, so if the query is related to cars, send this frame that has a car; and three, the computationally lightest one, calculate pixel-level differences between a frame and the one before it and only send a frame if it's sufficiently different, to the point where we expect a different query result. For all these approaches, the benefits we get from filtering are amplified when it's done closer to the source. For example, we save the bandwidth needed to send all those frames that were filtered out by any of these methods. The question we're addressing in this project is, given the rise of

smart cameras, can we take this to the logical extreme and filter frames directly on the camera? There are two things we need to know to answer that: what kind of resources are available on existing cameras, and how well do the existing solutions work? To better understand what resources are on existing cameras, we did a study of available and deployed cameras in the LA region, and we found that there's a wide range of cameras available, from the $20 Wyze Cam with a modest CPU to the pricier DNNCam, which ships with a full GPU on board. Our goal is to design a filtering solution that can operate

across the spectrum of cameras, so we focus on the low end, the wimpy cameras, though these techniques would help on more expensive cameras too by freeing up resources. Now we look at which filtering approach works best in this wimpy-camera setting. We found that approximate models are still too expensive to run on these cameras; Tiny YOLO ran at 2.6 frames per second, and recall that real time is 30. Binary classifiers are fast enough, but we miss a lot of filtering opportunities by only looking at whether an object is present or not. Say the query is counting cars. A classifier would send

both of these consecutive frames because they both contain a car, even though the count hasn't changed. Pixel-level differences are able to sidestep these two issues, so we focus on those, though the paper contains more information comparing all three approaches. Pixel-level differences come with a different challenge, which is that it can be tricky to meet accuracy targets, because they can pick up moderate amounts of noise. A naive approach would be frame differencing that sends a frame whenever the difference is above some static threshold.
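A minimal sketch of this naive baseline (not Reducto itself): a fixed-threshold pixel-differencing filter, where the 20-gray-level change test and the 5% frame threshold are chosen purely for illustration.

```python
import numpy as np

def pixel_diff(prev: np.ndarray, curr: np.ndarray) -> float:
    """Fraction of pixels whose grayscale value changed noticeably."""
    changed = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > 20
    return float(changed.mean())

def static_filter(frames, threshold=0.05):
    """Yield only frames that differ enough from the immediately preceding frame
    (the naive baseline with a single fixed threshold)."""
    prev = None
    for frame in frames:
        if prev is None or pixel_diff(prev, frame) > threshold:
            yield frame            # send to the server-side DNN
        # otherwise drop the frame and reuse the last query result
        prev = frame
```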

Experiments showed that with this approach, accuracy can fall 9 to 15% short of the target, and there are two main reasons for this. One is that, because video content is highly dynamic, it's really hard to pick a single threshold that works for a significant portion of a video. To demonstrate this, we considered two queries, counting cars and detecting bounding boxes for cars, and we plotted the optimal threshold, that is, the one that filters out the most frames while meeting accuracy. We see that the threshold changes rapidly over the course of the video, so the threshold is something we want to be able to pick dynamically. Another reason accuracy can drop

is relying solely on straight pixel comparison, because there are other low-level frame differences that are potentially more effective at picking up on changes between frames. For example, we could also consider looking at areas of motion and comparing those. Here's an example: let's take the query to be counting cars. When the same car moves across the road, as in the top two photos, the value of the Area feature is quite low, because the area of motion and its size remain relatively steady. When a new car enters, as in the bottom two photos, the value of Area is significantly higher, because there's a new area of motion.
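One plausible way to compute such an area-of-motion feature, sketched here with OpenCV and not necessarily the paper's exact extractor, is to threshold the frame difference and sum the areas of the resulting motion regions.

```python
import cv2
import numpy as np

def area_feature(prev_gray: np.ndarray, curr_gray: np.ndarray) -> float:
    """Fraction of the frame covered by regions that moved between two frames."""
    diff = cv2.absdiff(curr_gray, prev_gray)                 # per-pixel change
    blurred = cv2.GaussianBlur(diff, (5, 5), 0)              # suppress sensor noise
    _, mask = cv2.threshold(blurred, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    moving_area = sum(cv2.contourArea(c) for c in contours)  # total moving region
    return moving_area / float(curr_gray.shape[0] * curr_gray.shape[1])
```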

So different low-level difference features pick up on different characteristics of videos. To overcome these limitations we built Reducto. Its main challenges are: how do we dynamically choose the threshold, and how do we know which feature to use? To answer those, we use a system design where the wimpy camera does only the cheapest work, the frame differencing, but its decisions on how to do that are guided by a server, as I'll explain in the next few slides. Starting with the

threshold: what we want to know is, what is the threshold that filters the most frames while meeting the accuracy target? To get this, we collect complete information for a small period of unfiltered video. We split the video into one-second segments, and for each, we run the DNN over every frame and extract the difference values between pairs of frames. Then, for a wide sweep of thresholds, we record the threshold, the percentage of frames filtered, and the accuracy. And since we only care about thresholds that meet accuracy, we eliminate all entries that don't, in this case 90%. We aggregate

the rest of the table by clustering the diff values, getting a simplified hash table that maps diff-value ranges to their optimal threshold. We make a slightly conservative choice for the optimal threshold to make sure that accuracy is met. Building this table is clearly expensive: we have to run the full DNN over every frame for a while and collect all this information, so that's a job for the server. Looking up the threshold, though, is super cheap. It's just a simple hash-table lookup, so it can be done quickly on the camera, and that allows the camera to keep up with the rapidly changing threshold we saw, without waiting for instructions from the server.
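A simplified sketch of this server-side profiling step, assuming the query is counting (so per-frame accuracy is just "does the reused count match the ground-truth count") and bucketing segments by their maximum diff value; the real system's clustering and accuracy metrics are more involved.

```python
from collections import defaultdict

def best_threshold(diffs, gt_counts, thresholds, target_acc=0.9):
    """For one segment, pick the threshold that filters the most frames while
    keeping a (simplified) per-frame query accuracy above the target."""
    best = None
    for t in thresholds:
        last, correct, filtered = None, 0, 0
        for d, gt in zip(diffs, gt_counts):
            if last is None or d > t:
                last = gt                  # frame "sent": result is exact
            else:
                filtered += 1              # frame dropped: reuse last result
            correct += int(last == gt)
        if correct / len(gt_counts) >= target_acc:
            frac = filtered / len(diffs)
            if best is None or frac > best[1]:
                best = (t, frac)
    return best                            # (threshold, fraction filtered) or None

def build_hash_table(segments, thresholds, target_acc=0.9, bucket_width=0.02):
    """Cluster segments by a coarse summary of their diff values (here, max diff)
    and map each bucket to a conservative threshold for the camera to look up."""
    buckets = defaultdict(list)
    for diffs, gt_counts in segments:      # gt_counts come from running the full DNN
        choice = best_threshold(diffs, gt_counts, thresholds, target_acc)
        if choice:
            buckets[round(max(diffs) / bucket_width)].append(choice[0])
    return {key: min(ts) for key, ts in buckets.items()}   # conservative: smallest
```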

The next question is, how do we know which feature to use? To answer this, we similarly take unfiltered frames for a short period. We get the ground-truth DNN results and calculate each difference feature for all successive frames; then we consider a sweep of thresholds, and for each feature, we pick the threshold that filters the most frames while meeting accuracy. Then, across features, we pick the one that filters the most frames. This is also fairly expensive, again because we need to run the DNN over

every frame, so we run this process on the server when a new query comes in. One optimization to point out here: this process could take over 10 minutes, depending on how much video we want to use, so it's not something we want to do often, and we don't want to block each new query for that long. To that end, we did a study across different videos and found that, while the best feature changes across query types (bounding-box detection versus counting), it remains stable across videos, and this goes back to the idea that each of these features captures different characteristics of videos.
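The feature-selection step can be sketched in the same style, reusing the best_threshold helper from the profiling sketch above; feature_extractors is a hypothetical mapping from feature names (pixel, edge, area, and so on) to functions that compute a diff value for a pair of frames.

```python
def choose_feature(frame_pairs, gt_counts, feature_extractors,
                   thresholds, target_acc=0.9):
    """Profile each candidate low-level feature on a short unfiltered clip and
    keep the one whose best threshold filters the most frames."""
    winner, winner_frac = None, -1.0
    for name, extract in feature_extractors.items():
        diffs = [extract(prev, curr) for prev, curr in frame_pairs]
        choice = best_threshold(diffs, gt_counts, thresholds, target_acc)
        if choice and choice[1] > winner_frac:
            winner, winner_frac = name, choice[1]
    return winner        # e.g. "pixel", "edge", or "area", depending on the query
```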

So importantly, we only have to do this profiling once per query type, not per query or per video. Putting it together to form the Reducto system design: we first get unfiltered frames from the camera, and we use those to both decide the best feature and generate a hash table mapping diff values to thresholds. We send both of those back to the camera. Now, on the camera, we can extract the feature, use the hash table to look up the threshold, and send only frames whose diff is above the threshold to the full ML pipeline.
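On the camera side, the per-segment runtime work is intentionally cheap; here is a sketch, assuming the same max-diff bucketing as the server-side sketch above and a hypothetical extract_feature function.

```python
def filter_segment(frames, extract_feature, hash_table,
                   bucket_width=0.02, fallback_threshold=0.0):
    """On-camera filtering of one ~1-second segment: compute the cheap diff
    feature, look up this segment's threshold in the server-provided hash table,
    and return only the frames worth sending to the server-side DNN."""
    if len(frames) < 2:
        return list(frames)
    diffs = [extract_feature(prev, curr) for prev, curr in zip(frames, frames[1:])]
    key = round(max(diffs) / bucket_width)          # same bucketing as the server
    threshold = hash_table.get(key, fallback_threshold)
    kept = [frames[0]]                              # always keep the first frame
    kept += [curr for d, curr in zip(diffs, frames[1:]) if d > threshold]
    return kept
```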

The paper has details about how we handle previously unseen video characteristics and trigger hash-table updates. We tested Reducto across several publicly available traffic video clips and over three queries, bounding box, counting, and a binary decision about whether an object exists, and found that it filtered out at least half of the frames while meeting accuracy. With the network and compute savings from sending and processing fewer frames, we saw a latency improvement of 22%, and it runs at 47.8 frames per second on a Raspberry Pi, which matches the cameras we target. With that, I'll

conclude; please email me or my co-author at these addresses with any questions. Thank you. Let's take one question from Slack: how long does it take to update the hash table for new video content, and how often does this have to happen in real life? The update can be done in several seconds; we use 10 seconds of video to update the hash table, and we found that it has to be done for longer-term changes, such as daytime to nighttime or sunny to rainy weather. So if we created the hash table during the day, we're saying: if you see a segment in

the future that looks like this one, use this threshold. But at nighttime, we would see that the segments don't look like anything we saw when we built the hash table, so we would have to trigger a hash-table update, which means sending unfiltered frames for several seconds. One thing to note is that, while the hash-table update is happening, we're still responding to the query, and actually responding at 100% accuracy, because we have the ground truth. It's just that during that time, we're not doing any filtering.
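A sketch of how that trigger might look on the camera; the out-of-range test (a segment whose diff bucket is absent from the table) and the server object with send_unfiltered, request_new_table, and send methods are assumptions for illustration, not the paper's exact protocol.

```python
def process_segment(frames, extract_feature, hash_table, server, bucket_width=0.02):
    """If this segment's diff summary falls outside every diff range the table was
    built from (e.g. day -> night), send the segment unfiltered so the server can
    answer at full accuracy and ship back an updated table; otherwise filter."""
    diffs = [extract_feature(p, c) for p, c in zip(frames, frames[1:])]
    key = round(max(diffs) / bucket_width) if diffs else 0
    if key not in hash_table:
        server.send_unfiltered(frames)              # no filtering during the update
        hash_table.update(server.request_new_table())
        return
    threshold = hash_table[key]
    kept = [frames[0]] + [c for d, c in zip(diffs, frames[1:]) if d > threshold]
    server.send(kept)
```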

The next question is about the design space itself. You target really wimpy cameras for your use case, right? But the trend is clearly that cameras are becoming more and more capable in terms of computation. So if I look at this two years from now, five years from now, how do you see Reducto still giving value? That's a good question. In this paper we targeted the near future, where cameras are getting more powerful, but it's not realistic to update every

deployment. But if we look farther out and say that now we have updated every deployment, I think these techniques could still be useful for a couple of reasons. One is that they would free up resources on even more expensive cameras and allow them to scale to more queries. The other is that even if we had a camera that was slightly more computationally powerful, so we could run, say, a small neural network on it, these techniques could be used in combination with that. So it could be: use Reducto to filter out the frames that are very similar to past ones, and then use the

neural network to get a sense of what is in the frame, and then decide whether to send it. And so, by the time a frame passes through this kind of cascade and gets to the expensive DNN, we're pretty sure, on multiple levels, that it's actually important. I think these techniques could be combined if we have more processing power. Another question from the Zoom chat, which is excellent by the way: could you comment on how well this would perform on a moving camera? Presumably the hash

tables would have to be updated? Yes, I think that's exactly what we would see: the hash tables would have to be updated more often. The amount of motion, and the feature values themselves, are going to be much higher. But I think the main idea remains, that when we build the hash table we get a sense of how much motion there is and what these values look like, and that still holds when we're using it. But yes, I agree that the content would change the table more often if there's a wider range of

values being seen. Let me take one more question. This one is probably a good question for others as well, but since we're short on time, let me ask it here, and I hope you can also respond on the Slack channel. The question is: the general architecture of the design for both Reducto and the DDS paper seems to have some similarities, in terms of using the server to reduce the bandwidth and doing the filtering on the camera or the device. What are the

differences? Yes, I think the design space is very similar, in that we have a camera sending to a server, and the camera is not strong enough to do anything particularly powerful, but I think they're targeting different redundancies. So Reducto says: we have a lot of frames, and we don't actually need all the frames. And I think DDS is saying: we have a lot of pixels within a frame, and we don't actually need all those pixels. So I think it would be interesting to see if they

could be combined to get more savings: use Reducto to decide which frames are important, and use DDS to decide which parts of the important frames are important. It seems like these two approaches could work together to save more bandwidth. Sounds good. I guess there are a few more questions, but you can follow up on Slack. Great, thank you. There are more questions on Slack, so I'll pick up

on how long it takes to update the hash table for new video content, and on what happens when you perform this on a moving camera. Yes. As far as updating the hash table: when we build a hash table, what we're saying is, when you see a one-second segment in the future that looks like this one, here's the threshold you should use. But over some long period of time that might change, and what we found is that it changes at the level of

daytime to nighttime, or rainy weather to sunny weather. At that point you might start seeing segments that don't look like anything you had when the table was made, and so at that point we trigger a hash-table update, which means that we send unfiltered frames to the server for several seconds (we use 10), and the server uses the old and new data and sends back an updated hash table. One note about that process is that, while it is happening, we're still responding to the query; we're actually responding at 100% accuracy, because

we have the ground truth. It's just that, while that hash-table update is happening, we're not filtering any frames, so we lose the resource savings during that time. And I think the second question was about moving cameras. There, the assumption that a one-second segment here should resemble a one-second segment there, with similar thresholds, still holds, but the range of the feature values is much greater, because we're seeing a lot more movement, and so it's much more likely that you're going to find segments that

don't look like anything seen before. So I think it's exactly what the question suggests: you'll see a lot more hash-table updates with a moving camera. I see, so you'd also send more, and you kind of adapt how much you send to the server to the content itself? Yeah, it depends on what the best threshold is for that movement; if there is a lot of movement and it's actually relevant to the query, then we would send a lot. So, we've seen a lot of activity on Slack,

so let me pick a few more questions. All right: have you thought about how you would need to change your design as cameras become more capable? Yes, I think in this case we targeted the wimpiest cameras and the near future, where cameras are getting more powerful but it's not realistic to update every camera. But let's say we did have more compute to work with. I think what would be interesting is combining these approaches, because there is value in dropping frames that don't affect the query result, but there's also clearly value in understanding what's going on in the

frame. So if you could run a neural network, you would also want to get a sense of what is in the frame. Maybe a combination of those, using them as a cascade with increasing levels of computational intensiveness, so that by the time a frame gets to the final expensive DNN, you're very sure that it is important and warrants processing by the full DNN.
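A minimal sketch of the cascade idea being discussed here (hypothetical, not something Reducto implements): cheap differencing first, then a small on-camera classifier, and only then the expensive server-side DNN.

```python
def cascade_filter(frames, cheap_diff, small_classifier, send_to_dnn,
                   diff_threshold=0.05, confidence_threshold=0.5):
    """Stages of increasing cost: a frame must pass the cheap difference check and
    then a small on-camera classifier before it reaches the expensive server DNN."""
    prev = None
    for frame in frames:
        changed = prev is None or cheap_diff(prev, frame) > diff_threshold
        prev = frame
        if not changed:
            continue                                # stage 1: nothing new, drop
        if small_classifier(frame) < confidence_threshold:
            continue                                # stage 2: no relevant object
        send_to_dnn(frame)                          # stage 3: worth full processing
```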

Okay, and there are a few questions about comparing the two papers, this one and the DDS paper. Do you envision any environment where one works better than the other? Could you briefly talk about what the differences are in terms of the environment? Yes, I think the setup of these two papers is very similar: we have a camera sending frames to a server, and the camera is pretty wimpy, it can't do much, and I think these two papers focus on different redundancies. So Reducto says: we have a lot of frames, we don't need all of them, so it filters frames. Whereas DDS says: within a frame we have many pixels, we don't need all of them, so it

only sends the parts that we need. So I think they're actually pretty well suited to be combined, where you can picture using Reducto to decide which frames are important, and then using DDS to say, within those important frames, which parts are important, and it would be interesting to see the bandwidth savings you could get from doing that. Yeah, I'd also be interested. So, a final question: in both of these papers, Reducto and DDS, you send partial information, right, so you don't send the whole thing. But let's say later on someone wants to

retrieve the full high-quality version of the video, because they want to inspect the whole thing, and at the same time you also want to save bandwidth. Could you still get away with that? I think in this scenario, if you want the original video in its full quality with every frame, you'd probably have to have an on-camera storage solution for that, because these systems do inherently say that we're only going to send part of the video, whether it's only some pixels in high quality or only

some frames. I think at some level you're going to have less than the full video being sent.
