SIGCOMM 2020
August 11, 2020, Online, New York, NY, USA

Video
Server-Driven Video Streaming for Deep Learning Inference

About the talk

Video streaming is crucial for AI applications that gather videos from sources to servers for inference by deep neural nets (DNNs). Unlike traditional video streaming that optimizes visual quality, this new type of video streaming permits aggressive compression/pruning of pixels not relevant to achieving high DNN inference accuracy. However, much of this potential is left unrealized, because current video streaming protocols are driven by the video source (camera), where compute is rather limited. We advocate that the video streaming protocol should be driven by real-time feedback from the server-side DNN. Our insight is two-fold: (1) the server-side DNN has more context about which pixels maximize its inference accuracy; and (2) the DNN's output contains rich information useful for guiding video streaming. We present DDS (DNN-Driven Streaming), a concrete design of this approach. DDS continuously sends a low-quality video stream to the server; the server runs the DNN to determine where to re-send in higher quality to increase the inference accuracy. We find that, compared to several recent baselines on multiple video genres and vision tasks, DDS maintains higher accuracy while reducing bandwidth usage by up to 59%, or improves accuracy by up to 9% with no additional bandwidth usage.

About the speaker

Kuntai Du
PhD Candidate at University of Chicago

Transcript

Today, I'm going to present server-driven video streaming for deep learning inference. I am a first-year PhD student from the University of Chicago, and this is joint work with Google Brain. After these ten minutes, you will be able to learn about the new server-driven approach to streaming for video analytics scenarios. So now let's start. Nowadays, video streaming for video analytics is more and more pervasive. For example, people use wildlife cameras to monitor animals, people deploy traffic cameras to monitor the traffic, and people use drone cameras to efficiently collect information over a wide area. So many video streams are waiting for video analytics to generate valuable insights, and our goal is to scale out video streaming for video analytics. To achieve this goal, we need to design a video streaming protocol that streams the video from the camera, through the network, to the server-side DNN. To make the video streaming protocol scalable, we focus on two critical design goals. First and foremost, the protocol must preserve high inference accuracy, as if the video were directly fed into the server-side DNN. Second, the protocol should also save bandwidth. This is because bandwidth costs add up: for example, the cost of a cellular link is proportional to the total data sent, so by saving bandwidth we save more and more cost as the video streaming time grows. On the other hand, the streaming delay of a frame is proportional to the size of the frame divided by the bandwidth, so reducing the size of each frame also reduces the overall delay.
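
To restate the two cost terms in simple notation (our paraphrase; the talk states this in words): for a frame of encoded size $s_i$ streamed over bandwidth $B$,

$$\text{bandwidth cost} \propto \sum_i s_i, \qquad \text{delay}_i \propto \frac{s_i}{B},$$

so shrinking each encoded frame reduces both the cumulative bandwidth cost and the per-frame streaming delay.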

But where is the research opportunity for us to meet these two design goals? Luckily, video streaming for video analytics, as a new type of streaming, opens up new research opportunities in terms of bandwidth saving. Traditional video streaming cares about the overall visual quality: as shown on the left, every pixel contributes to the overall visual quality, so the protocol has to stream all pixels with nearly the same quality, including those pixels, like the street and the trees, that are not related to the objects of interest. But in our scenario, the video is streamed from the camera to the server-side DNN, so we can aggressively compress those pixels that are not related to the inference accuracy. In the right figure, we mark in gray those pixels that can be aggressively compressed; in short, we can enable aggressive compression on non-object pixels such as the tree pixels and the street pixels. However, we are not the first to see this new streaming opportunity, so now let's see how previous work explores it.

One common approach is real-time camera-side heuristics. These approaches use cheap camera-side heuristics as a filter to drop pixels that are probably irrelevant to the final analytics results. Concretely, the camera first captures the frame and runs the heuristics in real time to filter out pixels that are irrelevant to the inference accuracy. The camera then sends the remaining part to the server, and the server performs inference and returns the result. But these camera-side heuristics tend to be suboptimal. For example, let's compare the frame sent by real-time camera-side heuristics against the ideal frame that includes all object-related pixels: the heuristics miss several objects, like the four objects in the top-right corner and the two objects in the bottom-left corner. This is because the camera's compute is too limited to support accurate heuristics. The obvious observation is that, to be accurate, we must run the real-time heuristics on the server instead of on the camera.

It seems simple to meet this requirement at first look. Surprisingly, we find that there is a critical chicken-and-egg problem lying behind it that makes the whole thing difficult. To shorten the pipeline delay, the camera needs the server's feedback to encode the current video, but that feedback can only be derived from the current video itself: the feedback needs the video, and the video needs the feedback. Our solution is to let the server see the video first, and then iteratively tell the camera how the video should be compactly re-encoded and streamed. Specifically, the camera first buffers several frames to form a video segment. Then the camera encodes this video segment in low quality and sends it to the server. Upon receiving the segment, the server feeds it to the server-side DNN and obtains both the inference results and the feedback regions. Here comes the key of DDS: different from previous works, DDS performs the iteration by re-encoding the server-identified feedback regions in high quality. Concretely, the server sends the feedback regions back to the camera, the camera re-encodes those regions in higher quality and sends the video back to the server, and last, the server runs inference on the re-encoded regions to update the results.
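
To make the loop concrete, here is a minimal sketch of this camera-server iteration in Python. Everything here (encode, run_dnn, feedback_regions, and the quality values) is a hypothetical stand-in for a codec and a detector, not the actual DDS implementation:

```python
# Minimal sketch of the DDS iteration described above: a low-quality
# pass, server-side feedback, then a high-quality pass over the
# feedback regions only. All helpers are hypothetical stand-ins.

def encode(frames, quality, regions=None):
    # Placeholder: a real camera would invoke a video codec (e.g. H.264),
    # raising quality only inside `regions` when they are given.
    return b"encoded-video"

def run_dnn(video):
    # Placeholder: the server would decode the video and run a detector,
    # returning (box, confidence) results.
    return []

def feedback_regions(results):
    # Placeholder: the server derives regions worth a second look from
    # the DNN's output (sketched separately below).
    return []

def dds_stream_segment(frames, low_q=30, high_q=80):
    # Pass 1 (camera): encode the buffered segment in low quality.
    low_quality_video = encode(frames, quality=low_q)

    # Server: run the DNN once; most results are already final here,
    # and the rest become feedback regions.
    results = run_dnn(low_quality_video)
    regions = feedback_regions(results)

    # Pass 2 (camera): re-encode only the feedback regions in high
    # quality, and let the server re-run inference on them.
    if regions:
        patched_video = encode(frames, quality=high_q, regions=regions)
        results = results + run_dnn(patched_video)
    return results
```

Note that the camera only ever encodes video; all the decision-making stays on the server, which is what lets DDS sidestep the camera's limited compute.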

With this illustration in mind, let's take a look at the demo. Here we use green bounding boxes to mark the objects detected before the iteration, and orange bounding boxes to mark the objects detected after the iteration. The first video shows the detection results before the iteration; the second video shows the video that DDS sends for the iteration; the third video shows the detection results after the iteration. So now let's watch the videos. From these videos, we can see that when the first video misses an object, DDS tries to recall it in the second video and ultimately recalls it in the third video. From the demo, we can get a sense that DDS incrementally recalls undetected objects through the iteration.

Why can DDS recall the undetected objects? The secret lies in the way we generate the video sent for the iteration. We take object detection as an example application. From the inference process, DDS gets all the regions that may contain objects, as shown by the proposed bounding boxes in the figure. These proposed regions are generated from the intermediate output of the DNN, so there is little overhead. DDS then eliminates those regions that the DNN is already confident about, so only the regions that are almost detected, but not quite, are re-encoded in higher quality. This is a key differentiator between previous work and DDS: instead of pursuing complete coverage of all objects, DDS covers the subset of objects that are not yet detected.
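
A minimal sketch of this filtering step, assuming the detector returns (box, score) pairs; the box format and thresholds are illustrative assumptions, not the exact values DDS uses:

```python
# Sketch of feedback-region selection: keep region proposals that may
# contain an object but are not already covered by a confident
# detection. Thresholds are illustrative, not DDS's actual settings.

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def feedback_regions(proposals, detections, conf=0.8, overlap=0.3):
    """proposals, detections: lists of (box, score) from the DNN.
    Keep proposals not yet covered by a confident detection."""
    confident = [box for box, score in detections if score >= conf]
    return [box for box, score in proposals
            if not any(iou(box, c) > overlap for c in confident)]
```

Because the proposals come out of the DNN's intermediate output, this step adds little overhead on top of the inference the server runs anyway.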

Having covered the technical details of DDS, let's take a look at the experimental results. We demonstrate that DDS achieves a better bandwidth-accuracy trade-off. The X axis is the normalized bandwidth consumption, the Y axis is the accuracy, and the top-left direction is the better direction; the ellipses show the standard deviation of DDS and the baselines across the datasets. We can see that DDS can save up to 59% of the bandwidth while achieving even higher accuracy.

To wrap up, our contribution is that, for a better bandwidth-accuracy trade-off, we need a real-time server-driven approach that eliminates the chicken-and-egg problem. We presented DDS, a DNN-driven protocol that lets the server see the video first and then iteratively guide how the video should be encoded. Experimental results demonstrate that DDS achieves a better bandwidth-accuracy trade-off. You can check my personal web page for DDS-related resources and the lessons I learned from DDS. That's all, thank you, and you can send in questions now.

The first question is: to contrast this design, how might the solution change if you could actually run a small neural network at the camera? Cameras often are fixed, and if so, there is a sufficient amount of temporal consistency, which means that a lightweight model, even if it's not very accurate to start with, can be fine-tuned over time to run on the camera itself. Given that design space, where does DDS fit?

Okay, so my intuition is that it really depends on the video content. If the video content is simple and has strong temporal correlation and redundancy, then it is a good choice to use some cheap model at the camera side, because it will be pretty accurate and can efficiently filter out many frames. But if the content is really complicated, as with a drone camera, say a drone flying over a city going from street to street, then the content changes very fast and there is little temporal redundancy. In that case, we might not want to put extra compute power on the camera side, because it does not have sufficient compute to handle such a complex scene, and it might be better to just stick to the server and run the server-side models there.

A follow-up to your answer, if it makes sense: in that more dynamic setting, where the camera itself is moving around like a drone, isn't there also an open question about the connectivity between the camera and the server? It is much more unreliable compared to a static camera: as the drone moves around, connectivity to the server might not always be available. Yeah, we completely agree with that point; that is why we want DDS to be robust and still deliver the video stream even when the available bandwidth is very low.

The next question is from Slack: what happens if the DNN inference at the server cannot detect the regions and the bounding boxes at all, simply because of the low-quality video in the first iteration? There could be a drop in recall; more specifically, because an object is not even detected, there is no chance for it to be re-encoded in better quality at all. How do you deal with that?

That's actually a pretty good question, and it illustrates one potential limitation of DDS. In our experiments we find that in the first iteration the video quality cannot be too low; otherwise the DNN cannot provide reasonable feedback, it will miss all the small objects, and the result will be inaccurate. If the video has mediocre quality, the DNN is, surprisingly, able to locate nearly all the objects, and in that case they get covered pretty well. I hope that answers the question.

Okay, the next question is: what is the conceptual difference between what you've done with DDS and the big body of work that exists on region-of-interest-based encoding? The main difference is that DDS has to propose regions away from the video source, while region-of-interest methods propose regions with the whole video already in their hands. So in DDS there is some communication overhead between the video source and the region proposal process, and we have to overcome that limitation through the iterative approach. I think that is also the main difficulty in DDS. Cool. I think there are a few more questions on both Zoom and Slack, but in the interest of time we should move on. Thank you, Kuntai, and please feel free to keep engaging with the questions. Thank you.

Let's start off with a question about the encoding and inference on the second iteration: how much additional overhead does the system pose to both the server and the sender side? Okay, so basically DDS only requires encoding capability on the camera, so on the camera side there is little overhead. On the server side, it typically requires running inference twice per frame, so compared to the naive approach that simply compresses and sends the whole video, that is a 2x overhead in server-side compute. But if the video goes through a cellular link, like a 5G plan, then saving half of the bandwidth saves half of the cost. And compared to other approaches that also leverage server-side compute, DDS typically has less overhead: for example, prior work like AWStream takes approximately three to four times the inference overhead on the server side. So DDS still hits a desirable cost point when bandwidth is relatively expensive and server-side compute is relatively cheap. Great. So I think I can combine Harry's question with the next one: there are several questions about the latency, and about the data storage required on the camera side to store the captured data until the server sends back its feedback. Have you evaluated how much data storage is required on the camera? Okay, let me answer them one by one.

Could you please repeat the first part of the question? Sure: there is a latency concern, because you run a second iteration, and you also need to store some of the recent video on the camera side. How do you deal with this overhead? Okay. First, the latency is indeed somewhat higher, because DDS initially buffers some frames on the camera side to form a video segment, which itself incurs a buffering delay, and DDS also has to run an extra inference pass to further improve accuracy. But we find that DDS generates about 90% of the results before the iteration, and since this part of the results is delivered from the low-quality video, its streaming delay is actually shorter; so the average response time to the first result is actually pretty good. As for the storage, we evaluated our implementation, and it typically requires storing two to three seconds of video, which is quite manageable for the cameras we use. Yep, thank you.

Another question: can DDS handle quickly changing videos? Yes, DDS can handle quickly changing videos. The rationale is that DDS does not rely on temporal consistency: it runs inference directly on the current frames, and it does not try to learn from previous video and apply that experience to future video. We do this without that kind of adaptation.

So DDS is able to recover by itself, because it merely looks at the current frames instead of learning from previous ones, which could lead to erroneous results. I think that is basically how DDS handles content change. I see. So I think we have time for one more question, and I'll use that to my advantage and ask one myself. I've seen a lot of papers in video analytics that use multiple cameras working together; is there anything you have thought of to save more bandwidth there? Okay, so for multiple cameras, at least from my perspective, the main research opportunity lies in how to deal with inter-camera redundancy. Exploring this opportunity is related to tasks like re-identification and tracking, that is, how we map an object seen in one camera to another camera. From this perspective, I think those techniques operate on an orthogonal plane, another design dimension compared to DDS. So if there were a general approach that could translate results across cameras, we could extract extra benefit from multiple cameras, but we are not quite there yet.


Similar talks

Arthi Padmanabhan
PhD Student at UCLA

Jaehong Kim
Master's student at KAIST

Gautam Kumar
Staff Software Engineer at Google
Nandita Dukkipati
Principal Engineer at Google
