D3
Internet Architecture

Towards a Cross-layer Video Streaming Optimization

Coordinator: Mirko Palmer

Video streaming is omnipresent and, due to recent global events, the number of people at home watching streamed video has increased even further. The main issue is that the network conditions under which users stream video are not always ideal. This results in lowered visual quality, because "adaptive bitrate" (ABR) algorithms try to select a video quality that is small enough, in terms of video bitrate and thus amount of data, to be streamed under the current network conditions. As those algorithms are not perfect, the visual quality degrades unnecessarily. In the worst case, the ABR algorithm misjudges the network conditions, the video is not downloaded in time, and playback stalls, typically presenting the viewer with a spinning indicator until enough new video has been downloaded to continue playback. While there is a vast literature on optimizing video streaming, virtually all prior work follows a piecemeal approach: either "tweaking" the transport layer or making the client "smarter."
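To make the ABR behavior described above concrete, the following is a minimal sketch of a simple throughput-based selection rule together with a stall check. It is not the algorithm of any particular player; the bitrate ladder, the safety factor, and all function names are assumptions chosen for illustration.

```python
from typing import Sequence


def select_bitrate(ladder_kbps: Sequence[int],
                   throughput_kbps: float,
                   safety: float = 0.8) -> int:
    """Pick the highest bitrate that fits a discounted throughput estimate.

    `safety` is a hypothetical margin against overestimating the network.
    """
    budget = throughput_kbps * safety
    candidates = [b for b in sorted(ladder_kbps) if b <= budget]
    # If nothing fits, fall back to the lowest available bitrate.
    return candidates[-1] if candidates else min(ladder_kbps)


def download_time_s(segment_s: float, bitrate_kbps: int,
                    actual_kbps: float) -> float:
    """Time needed to fetch one segment at the actual network throughput."""
    return segment_s * bitrate_kbps / actual_kbps


def stalls(segment_s: float, bitrate_kbps: int,
           actual_kbps: float, buffer_s: float) -> bool:
    """Playback stalls if the download takes longer than the buffered video."""
    return download_time_s(segment_s, bitrate_kbps, actual_kbps) > buffer_s


# Hypothetical bitrate ladder in kbit/s.
LADDER = [300, 750, 1500, 3000, 6000]
chosen = select_bitrate(LADDER, throughput_kbps=2000)
```

In this sketch, a misjudged estimate shows up directly: choosing 3000 kbit/s for a 4-second segment while the network actually delivers 1000 kbit/s takes 12 seconds to download, which exceeds an 8-second buffer and stalls playback.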

With our system, which we call VOXEL, we follow a more holistic approach. First, we recognize that some video frames are more important than others, i.e., dropping certain frames does not degrade visual quality and thus does not influence the end user's quality of experience (QoE). We therefore start at the transport layer and avoid TCP's need to transfer every single byte, even when this results in head-of-line blocking.

We go further, however: rather than only distinguishing video frames by type, we analyze the entire video and rank each individual frame by its actual influence on the overall visual quality of the video. With this fine-grained information, instead of blindly reducing the video bitrate and hoping that the visual impact will not degrade the QoE, we can reduce the required amount of data precisely to match the network conditions while knowing exactly what the impact on the QoE will be. We therefore created a new kind of ABR algorithm that aims to maximize not the bitrate but the visual quality. This synergy of a transport tailored to video streaming, a one-time in-depth video analysis, and a visual-quality-aware ABR algorithm results in VOXEL reducing rebuffering by at least 25% and up to 90%, even in challenging network conditions, all while providing a visual quality that is at least on par with state-of-the-art streaming solutions.
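The idea of using a per-frame ranking to fit a segment into a byte budget can be sketched as follows. This is an illustrative greedy heuristic, not VOXEL's actual method: the `Frame` fields, the `quality_loss` score, and the assumption that losses simply add up (ignoring dependencies between frames) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    index: int
    size_bytes: int          # must be positive
    quality_loss: float      # hypothetical: estimated quality drop if omitted
    droppable: bool = True   # e.g. key frames would be marked non-droppable


def plan_drops(frames: Sequence[Frame],
               budget_bytes: int) -> Tuple[List[int], float, int]:
    """Greedily drop the frames with the least quality loss per byte saved.

    Returns (dropped frame indices, summed estimated loss, remaining bytes).
    Assumes losses are additive, which is a simplification.
    """
    total = sum(f.size_bytes for f in frames)
    dropped: List[int] = []
    loss = 0.0
    candidates = sorted((f for f in frames if f.droppable),
                        key=lambda f: f.quality_loss / f.size_bytes)
    for f in candidates:
        if total <= budget_bytes:
            break  # the segment already fits the budget
        total -= f.size_bytes
        dropped.append(f.index)
        loss += f.quality_loss
    return sorted(dropped), loss, total


# Made-up example segment: one key frame and four droppable frames.
segment = [
    Frame(0, 5000, 10.0, droppable=False),
    Frame(1, 1000, 0.1),
    Frame(2, 1200, 0.6),
    Frame(3, 900, 0.05),
    Frame(4, 1100, 0.3),
]
dropped, loss, remaining = plan_drops(segment, budget_bytes=7500)
```

Because the estimated loss is returned along with the drop plan, a caller knows the expected quality impact before any data is sent. If the returned byte count still exceeds the budget, dropping frames alone is not sufficient in this sketch.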

In addition to measuring the QoE with objective quality metrics such as SSIM, VMAF, and PSNR, we also conducted a user study, for which we recruited 54 participants from different universities and asked them to watch short video clips recorded from streaming experiments under identical, challenging network conditions with VOXEL and with the state of the art. Of the participants, 84% preferred watching the version streamed with VOXEL. When asked whether they would continue watching a stream that behaves like the clips shown, 74% of participants would have abandoned the video when it was streamed via the state of the art. In contrast, only 36% would have stopped watching a VOXEL clip.

One reason for this preference is the vastly reduced rebuffering, as confirmed by the participants. As the dropping of frames in VOXEL can introduce visual artifacts, the Mean Opinion Scores (MOS) for "glitches" and "clarity" were slightly lower for VOXEL; the MOS for the overall watching experience, however, was much higher for VOXEL. Lastly, to ease adoption, VOXEL is entirely backwards compatible with existing streaming solutions, and each component can be deployed incrementally.

This project is currently under submission to the SIGCOMM 2021 conference.

Application of VOXEL to 360-degree video

Coordinator: Mirko Palmer

We want to apply VOXEL to 360-degree video, commonly referred to as VR video. Avoiding rebuffering is a primary goal there, so as not to confuse or discomfort users when the video suddenly stops and rebuffers. The main problem is that, compared to regular video, one does not have a single flat video stream but a spherical projection of several so-called tiles, arranged in a grid, each of which is a video in its own right, that are stitched together to form the 360-degree sphere. This vastly increases the complexity of selecting the quality of each individual video tile or, in the case of VOXEL, of deciding where to drop which frames in order to avoid rebuffering.

Another aspect that differs from regular video is that viewers can freely rotate their head and thus focus on different parts of the 360-degree scene. On the one hand, this eases streaming, as video data that is behind the user’s head, so to speak, does not need to be transferred in the highest quality. On the other hand, if the user suddenly turns, they expect the quality to be as high as possible.

As a result, to avoid rebuffering, we have to anticipate where the user will look next and maximize the quality of each tile given the current network situation, i.e., the network transfer budget available for selecting tile qualities. As before, we also have to decide which fraction of frames not to transfer at all, because their absence would not negatively influence the user’s quality of experience (QoE).
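One way to picture the tile quality selection problem is as a budgeted allocation driven by predicted viewing probabilities. The sketch below is a simple greedy heuristic under stated assumptions, not the approach the project will necessarily take: the viewing probabilities, the per-tile size table, and the function name are hypothetical, and frame dropping within tiles is left out.

```python
from typing import List, Sequence


def allocate_tile_qualities(view_prob: Sequence[float],
                            sizes: Sequence[Sequence[int]],
                            budget_bytes: int) -> List[int]:
    """Choose a quality level per tile under a transfer budget.

    view_prob[t] is the (assumed) probability that tile t will be viewed.
    sizes[t][q] is the size in bytes of tile t at quality level q, assumed
    to be strictly increasing in q. Every tile starts at the lowest level;
    upgrades go to the tile with the best probability per extra byte.
    """
    n = len(view_prob)
    levels = [0] * n
    spent = sum(sizes[t][0] for t in range(n))
    while True:
        best_tile, best_gain, best_extra = None, 0.0, 0
        for t in range(n):
            nxt = levels[t] + 1
            if nxt >= len(sizes[t]):
                continue  # tile is already at its highest quality
            extra = sizes[t][nxt] - sizes[t][levels[t]]
            if spent + extra > budget_bytes:
                continue  # this upgrade would exceed the budget
            gain = view_prob[t] / extra
            if gain > best_gain:
                best_tile, best_gain, best_extra = t, gain, extra
        if best_tile is None:
            break  # no affordable upgrade is left
        levels[best_tile] += 1
        spent += best_extra
    return levels


# Made-up example: four tiles, three quality levels each.
probs = [0.5, 0.3, 0.15, 0.05]
tile_sizes = [[100, 300, 700]] * 4
levels = allocate_tile_qualities(probs, tile_sizes, budget_bytes=1200)
```

In this example, the tile most likely to be viewed receives the highest quality, its neighbor a medium quality, and the tiles the user is unlikely to look at stay at the lowest level. A sudden head turn would still show a low-quality tile until the next allocation, which is why accurate prediction matters.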