For a company that revolves around video calling, it is useful to measure the difference between the video quality that a publisher (someone who sends out a video) produces and the quality that a subscriber receives. This helps not only with testing but also with optimizing video quality under different constraints, such as limited network bandwidth.
So what defines “video quality”, and how can we evaluate it? To best evaluate video quality, we try to replicate a user’s experience and perception of quality. This is called objective video quality analysis. At Airtime, we want to perform full-reference objective quality analysis, which means that the entire reference image is known. The previous intern at Airtime, Caitlin, researched and implemented a source quality analysis tool called Fresno. Fresno takes in two frames: a reference frame and a distorted frame. It then passes both frames into VMAF (Video Multi-method Assessment Fusion), an open-source software package developed by Netflix. VMAF calculates a quality score on a scale of 0 to 100, where a higher score represents higher video quality. VMAF’s analysis considers the human visual system, as well as display resolution and viewing distance. More about Caitlin’s work can be found here: https://blog.airtime.com/objective-video-quality-analysis-at-airtime-6543dac7bc1b
However, a source quality analysis tool is not sufficient to conduct video quality analysis at Airtime. If Airtime only had 1-to-1 video calls, source analysis would be sufficient. However, Airtime is a multi-party real-time video chatting app where each subscriber gets a unique video. Thus, we need to implement end-to-end analysis to understand the experience each user is getting.
There are several challenges in using Fresno for end-to-end analysis. To start solving them, I created Clovis, an application that takes in a reference video file and a distorted video file. Clovis produces an overall quality score from 0 to 100 that represents the objective video quality of the distorted video relative to the reference video.
How can Clovis use Fresno to analyze the quality of these two video files? Since Fresno takes in individual video frames, the first challenge would be to break down both video files into individual frames. To do this, Clovis needed to be designed such that breaking down the video files into individual frames and analyzing them were done efficiently.
Clovis needed to be broken down into separate modules to simultaneously break down the input files into individual frames, and send frames through Fresno to generate a VMAF score for each frame pair.
After careful consideration, Clovis was designed as shown in the diagram above. The Clovis App takes in the file paths for both the reference and distorted video files and sends them to the FrameController. The FrameController creates two FFmpegFrameSources (one for each video file) and an Analyzer.
FFmpegFrameSource is a class that uses the FFmpeg library to break a video down into separate frames. For each frame, the FFmpegFrameSource sends an on_frame signal to the FrameController. The Analyzer receives these signals and stores each frame in a queue. When a matching reference frame and distorted frame both exist, the Analyzer feeds them into VMAF to generate a score. Since Fresno expects frames of the same resolution, the Analyzer is also responsible for scaling the distorted frames to match the resolution of the reference video if the resolutions differ. With this design, Clovis can decode video files into individual frames and analyze already-decoded frames at the same time.
Once an FFmpegFrameSource has finished sending frames, it sends a signal to the FrameController. Once the FrameController has received a finished signal from both FFmpegFrameSources, it signals to the Analyzer that there are no more incoming frames. The Analyzer then returns a score, which is the average of the VMAF scores of all frame pairs. The FrameController reports the returned score and signals to the main Clovis App that it has finished executing.
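The queueing and averaging behavior described above can be sketched roughly as follows. This is a simplified, single-threaded illustration rather than Clovis's actual code: the method names, the `Frame` type, and the `toy_score` function (a stand-in for the Fresno/VMAF call) are all hypothetical, and frames are paired purely in arrival order.

```python
from collections import deque
from statistics import mean
from typing import Callable, Deque, List

Frame = bytes  # hypothetical stand-in for a decoded video frame


class Analyzer:
    """Pairs reference and distorted frames in arrival order and scores each pair."""

    def __init__(self, score_pair: Callable[[Frame, Frame], float]) -> None:
        self._score_pair = score_pair  # stand-in for the Fresno/VMAF call
        self._reference: Deque[Frame] = deque()
        self._distorted: Deque[Frame] = deque()
        self._scores: List[float] = []

    def on_reference_frame(self, frame: Frame) -> None:
        self._reference.append(frame)
        self._drain()

    def on_distorted_frame(self, frame: Frame) -> None:
        self._distorted.append(frame)
        self._drain()

    def _drain(self) -> None:
        # Score pairs for as long as both queues have a frame waiting.
        while self._reference and self._distorted:
            ref = self._reference.popleft()
            dist = self._distorted.popleft()
            self._scores.append(self._score_pair(ref, dist))

    def finish(self) -> float:
        """Called once both sources are done; returns the mean per-pair score."""
        if not self._scores:
            raise ValueError("no frame pairs were analyzed")
        return mean(self._scores)


def toy_score(ref: Frame, dist: Frame) -> float:
    # Toy metric for this sketch only: 100 for identical frames, 0 otherwise.
    return 100.0 if ref == dist else 0.0


analyzer = Analyzer(toy_score)
for frame in (b"frame-0", b"frame-1"):
    analyzer.on_reference_frame(frame)
for frame in (b"frame-0", b"damaged"):
    analyzer.on_distorted_frame(frame)
overall_score = analyzer.finish()
```

Because each queue is drained as soon as its counterpart has a frame waiting, decoding and scoring can overlap instead of running as two separate passes.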
Now that we’re able to perform objective video quality analysis on any two video files, what else needs to be done? To make Clovis work in practice, we would need to be able to generate a video file for both a publisher and a subscriber.
Eastwood simulates Airtime’s experience. It is capable of both publishing videos and subscribing to them. Eastwood sends the publisher’s video to Airtime’s media server, which is responsible for receiving videos from the publisher and sending the respective video to each subscriber. Before sending the publisher’s video to the subscriber, the media server takes one of three actions:
- Forward the video untouched to the subscriber.
- Forward video frames untouched to the subscriber, but reduce the frame rate.
- Re-encode the video and then send it to the subscriber.
The re-encoding that the media server performs depends on the network constraints of the subscriber. Since the media server may further reduce video quality, being able to analyze the difference in video quality before and after the video goes through Airtime’s media server is the focal point of the project. To do this, Eastwood was modified to write the video to a file both before and after it was sent through the media server.
After implementing this feature, wouldn’t we have a complete end-to-end video quality analysis system? There was one more thing to consider. The media server’s re-encoding could drop frames in scenarios where the subscriber has restrictive network constraints, or when the subscriber doesn’t subscribe for the entire duration of the publisher’s video. This would lead to a difference in the number of frames between the reference video and the distorted video, so how would we know which reference frame each distorted frame corresponds to?
Imagine that we have a reference video of 100 seconds and 10 frames per second. The duration of our distorted video after going through the media server is also 100 seconds, but only 5 frames per second. This would leave us a total of 1000 frames in the reference video, but only 500 frames in the distorted video. How would we find out which 500 of the 1000 reference frames correspond to the 500 distorted frames? Would it be the first 500 frames, the first 250 and last 250 frames, or somewhere in between? To find out which reference frame each distorted frame corresponds to, we would need a way to consistently pass the frame number (or something that represents the frame number) through the media server’s re-encoding process.
After some research, I found three potential solutions to our tricky frame correlation problem.
- Encoding the frame number into the encoder’s header or payload. This method would provide a simple and efficient method to retrieve the frame number. The drawback is that there are multiple encoder formats (VP8, VP9). We would need to consider all possibilities, and ensure that there is a suitable way to store the frame number in each encoder format.
- Each frame has an RTP header, so a possibility would be to store a 16-bit value that represents the frame number in the RTP header’s padding. With this method, forwarding the frame number through to the distorted video would be troublesome. We would have to change code in the media server, making this feature dependent on any future changes to the media server. We would also need to edit WebRTC code to include this field. WebRTC is an open-source project that provides web browsers and mobile applications with real-time communication.
- Stamping a barcode onto each reference frame, and then reading the barcode from each distorted frame to map it to an original frame. The disadvantage of using a barcode is that, unlike options one and two, there is no guarantee that it will survive the media server’s encoding. However, there would be less modification of existing code, and functionality should not be impacted if the media server code is modified. A barcode should be able to survive some degree of re-encoding, as a barcode is still readable even if the frame undergoes quality loss.
After serious consideration, I decided that going with the barcode option was optimal. I did some further research on barcodes to investigate different ways to implement them for our use case.
1. Using a 1D Barcode
This is likely not a viable option, because the lines are very thin and the barcode most likely will not survive distortion in all scenarios. This was tested with a sample image with a 1D barcode stamped onto it. FFmpeg was used to scale the image down to a significantly lower resolution (by a factor of 25) and then back up to the original resolution. The original and distorted images were fed into a simple online barcode reader (it is assumed that the barcode reader has capabilities similar to those of a C++ library that can decode 1D barcodes), and only the original image was recognized. In the images below, 1dout.jpeg is the distorted image.
As you can see, the image quality of the distorted image is still decent, but the barcode is not decodable.
2. Using a QR Code
A Quick Response (QR) code seems like a more viable option than a 1D barcode because it is two-dimensional, so there is no issue with reading extremely thin lines. Additionally, there are open-source C++ libraries that can successfully read QR codes from images. The drawback of this method is that the minimum size for a QR code is 21×21, which is unnecessarily large for indexing frames. A 21×21 QR code is less resistant to scaling than a smaller counterpart. For example, if our “code” takes up a constant percentage of the frame, a code with fewer bits per side (such as 10×10) has larger individual bits, making it easier to read and more resistant to scaling.
3. Using a Data Matrix
A data matrix is a similar alternative to a QR code. The difference is that the minimum size for a data matrix is 10×10. A data matrix also has a larger margin for error correction than a QR code. Its ability to survive scaling was tested by scaling a data matrix image down by a factor of 25 (the same factor used for the 1D barcode). Unlike with the 1D barcode, the reader was still able to decode the distorted image. In the images below, dataout.jpeg is the distorted image.
Comparing Data Matrices to QR Codes
The first image shows a data matrix, and the second image shows a QR code. As you can see, the individual bits for the data matrix are significantly larger than the QR code bits, meaning that it will be able to survive the media server’s re-encoding process more easily. Below is a table comparing Data Matrices to QR codes.
Although the QR code can encode a large range of data, a Data Matrix is more than sufficient for our use case, as we are simply encoding the frame number in the matrix. After some more research on data matrices, I was able to find a suitable C++ library that is capable of encoding and decoding data matrices. Therefore, I decided to use data matrices in encoding frame numbers into the reference video frames.
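To make the size difference concrete, here is a small worked example. It assumes a hypothetical code region 210 pixels wide and ignores quiet zones (the blank margin around a code); the function name is invented for this illustration.

```python
def module_size_px(code_side_px: int, modules_per_side: int) -> float:
    """Pixels available to each individual bit (module) along one side."""
    return code_side_px / modules_per_side


# Hypothetical 210-pixel-wide code region, quiet zones ignored.
qr_module = module_size_px(210, 21)           # smallest QR code: 21x21 modules
data_matrix_module = module_size_px(210, 10)  # smallest data matrix: 10x10 modules
```

In the same space, each bit of the smallest data matrix is about twice as wide as each bit of the smallest QR code, which leaves more room for blurring before neighboring bits run together.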
The frame that is passed through the media server is in YUV format (specifically I420 format), so we would need to write a data matrix that encodes the frame number using this video frame format.
In a YUV frame, the Y-plane represents the luminance (brightness) component of the frame, and the U and V plane represent the chrominance (color) component of the frame. When Fresno conducts its analysis on a video frame pair, it only uses the Y-plane to generate its score. The images below show what a frame would look like with, and without values in the UV planes.
Initially, I implemented encoding and decoding of the data matrix in Fresno. Eastwood would use Fresno to stamp a data matrix onto each reference frame before sending the video to Airtime’s media server. Clovis would then use Fresno to decode the matrix values. This implementation proved successful for basic use cases. However, when severe resolution or bitrate restrictions were put on the distorted video, the decoder failed to read several barcodes. Both the matrix encoder and decoder needed to be optimized to account for more restrictive scenarios.
One thing that I noticed was that the barcodes were always 70px by 70px. For larger resolutions, this meant that the barcode was often less than 1% of the total frame. If we were to increase the barcode size before passing it through the media server, it would likely survive the re-encoding process more easily. However, we would not want to increase the barcode size so much that it took over a significant portion of the frame. After careful consideration, I decided to make the barcode as large as possible while keeping its width and height within ⅓ of the smallest dimension of the frame. The barcode can only be scaled up by whole-number factors, so the only possible dimensions are multiples of 70 (e.g. 70×70, 140×140, 210×210). For example, a 1280×720 video frame would have a barcode size of 210×210. This is because dividing the smallest dimension of the video frame (720) by 3 gives 240, and the highest multiple of 70 that is less than 240 is 210.
Additionally, neutralizing the UV planes of the data matrix makes it more resilient against the types of distortion introduced by the media server’s video encoding process. Below are examples of video frames with and without a neutralized barcode.
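The sizing rule can be written as a short function. This is a sketch rather than Fresno's actual code: the function and constant names are hypothetical, and two details the text does not specify are assumptions here, namely that a barcode exactly equal to one third of the smallest dimension is allowed, and that the barcode never shrinks below the 70px base size on very small frames.

```python
BASE_BARCODE_PX = 70  # side length of the unscaled barcode


def barcode_side(frame_width: int, frame_height: int) -> int:
    """Largest multiple of 70 that fits within a third of the smallest dimension."""
    limit = min(frame_width, frame_height) // 3
    multiples = limit // BASE_BARCODE_PX
    # Assumption: fall back to the base size when even one multiple does not fit.
    return max(1, multiples) * BASE_BARCODE_PX


side_720p = barcode_side(1280, 720)    # 720 // 3 = 240, 240 // 70 = 3, so 210
side_1080p = barcode_side(1920, 1080)  # 1080 // 3 = 360, 360 // 70 = 5, so 350
```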
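As a rough illustration of what stamping and neutralizing could look like on a raw I420 buffer, here is a minimal pure-Python sketch. It is not Fresno's implementation. The function name, the top-left placement, the use of 0 and 255 for dark and light bits, and the reading of "neutralizing" as setting U and V to the mid-gray value 128 are all assumptions made for this example. The buffer layout itself (a full-resolution Y plane followed by quarter-resolution U and V planes) is standard I420.

```python
from typing import Sequence

NEUTRAL_CHROMA = 128  # U = V = 128 carries no color in 8-bit YUV


def stamp_matrix_i420(frame: bytearray, width: int, height: int,
                      bits: Sequence[Sequence[int]], module_px: int) -> None:
    """Draw a square bit matrix into the top-left corner of an I420 frame, in place."""
    side = len(bits) * module_px
    if width % 2 or height % 2:
        raise ValueError("I420 frames in this sketch must have even dimensions")
    if len(frame) != width * height * 3 // 2:
        raise ValueError("buffer size does not match an I420 frame")
    if side > min(width, height):
        raise ValueError("matrix does not fit inside the frame")

    chroma_w = width // 2
    u_offset = width * height                     # U plane follows the Y plane
    v_offset = u_offset + chroma_w * (height // 2)  # V plane follows the U plane

    # Y plane: one byte per pixel; 1 bits are drawn black, 0 bits white.
    for row in range(side):
        for col in range(side):
            bit = bits[row // module_px][col // module_px]
            frame[row * width + col] = 0 if bit else 255

    # U and V planes: one byte per 2x2 pixel block; gray out the same region.
    chroma_side = (side + 1) // 2
    for row in range(chroma_side):
        for col in range(chroma_side):
            frame[u_offset + row * chroma_w + col] = NEUTRAL_CHROMA
            frame[v_offset + row * chroma_w + col] = NEUTRAL_CHROMA


# Usage: an 8x8 frame filled with an arbitrary value, stamped with a 2x2 matrix.
frame = bytearray([100] * (8 * 8 * 3 // 2))
stamp_matrix_i420(frame, 8, 8, bits=[[1, 0], [0, 1]], module_px=2)
```

With neutral chroma, the barcode region is pure black and white regardless of the colors that were originally behind it.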
After performing these optimizations, you may be curious about how well the barcode survives distortion, as well as its effect on our final VMAF score.
Limitations of Data Matrices
Two main factors impact how well data matrices survive distortion:
- The change in resolution.
- The bitrate constraint.
I ran some tests to see how many barcodes would be undecodable under given resolution and bitrate constraints. I used a sample 720×720 video file with 256 frames generated by Eastwood. The tables below show the independent effects of resolution and bitrate constraints on barcode decoding.
Below are frames from the video file that was used to generate the data sets above. The reference frame, the 100kbps frame, and the 90×90 frame are shown respectively.
We can see that decoding the barcode region in our frame is more resistant to changes in resolution than bitrate. Even when we shrink the distorted video to approximately 1% of the reference video’s size, we are still able to decode about 90% of the frames. In the scenario of limiting bitrate, more frames are unable to be decoded for some extreme scenarios. However, even if several frames are unable to be decoded, the rest of the frames would still get passed into VMAF, generating a score that is likely very similar to the score that would’ve been generated if all frames were analyzed by VMAF.
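The idea of scoring only the frames whose barcode could be decoded can be sketched as follows. This is an illustration rather than Clovis's actual code: the function name, the `Frame` type, the use of `None` for an undecodable barcode, and the toy scoring function standing in for Fresno/VMAF are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Frame = bytes  # hypothetical stand-in for a decoded video frame


def score_matched_frames(
    reference_by_number: Dict[int, Frame],
    distorted: Iterable[Tuple[Optional[int], Frame]],
    score_pair: Callable[[Frame, Frame], float],
) -> Tuple[float, int]:
    """Return (mean score over matched frames, number of skipped frames).

    Each distorted entry carries the frame number decoded from its barcode,
    or None when the barcode could not be read.
    """
    scores: List[float] = []
    skipped = 0
    for frame_number, frame in distorted:
        reference = None
        if frame_number is not None:
            reference = reference_by_number.get(frame_number)
        if reference is None:
            skipped += 1  # unreadable barcode or unknown frame number
            continue
        scores.append(score_pair(reference, frame))
    if not scores:
        raise ValueError("no distorted frame could be matched to a reference frame")
    return mean(scores), skipped


def toy_score(ref: Frame, dist: Frame) -> float:
    # Toy metric for this sketch only: 100 for identical frames, 0 otherwise.
    return 100.0 if ref == dist else 0.0


reference = {0: b"a", 1: b"b", 2: b"c", 3: b"d"}
received = [(0, b"a"), (None, b"?"), (3, b"x")]  # frames 1 and 2 were dropped
average, skipped = score_matched_frames(reference, received, toy_score)
```

Reporting the skipped count alongside the average makes it possible to tell a score based on nearly every frame apart from one based on only a handful.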
It’s also important to note the impact the barcode itself has on the VMAF score. Since we are writing the barcode region into the Y-plane of the frame, it’s only natural for this to affect the VMAF score, which also depends on the Y-plane of the frame. To investigate this, I ran two sets of 100 frame pairs (one set without the barcode and one with it) through Fresno, with both sets scaled from 720×720 to 360×360. The table below shows the VMAF score of every tenth frame pair.
To simulate the effect of the media server’s re-encoding process on the VMAF, more tests were run to find the independent effects of bitrate and resolution on VMAF scores. Using the same reference video as the test above, the tables below illustrate how the VMAF score changes under resolution and bitrate constraints respectively.
We can see that for both resolution and bitrate constraints, the VMAF score starts dropping significantly under more severe constraints. Both constraints seem to have a logarithmic relationship with the VMAF score, although the VMAF score seems to drop more quickly for resolution constraints. This is the opposite of what we saw for the number of unreadable data matrices under these constraints, as decoding data matrices is more resistant to resolution changes than to bitrate changes.
For resolution constraints, the VMAF score drops to 0 when the distorted resolution is approximately 1% of the original size. In these scenarios, some data matrices are unable to be decoded as well. Therefore, it is safe to conclude that whenever data matrices are unreadable due to resolution constraints, the VMAF score would have been 0, or an extremely low value anyway.
In contrast, for bitrate constraints, the VMAF score does not drop as low under severe conditions, but more data matrices become unreadable. When a few data matrices are unreadable due to bitrate constraints, it is still entirely possible to get a valid VMAF score (see the 200kbps example). However, when a significant number of data matrices cannot be decoded due to bitrate constraints, the VMAF score would likely have been very low anyway (see the 100kbps example).
To simulate a more realistic re-encoding of a video file, I used my phone to take a 720×1280 video of myself and simultaneously restricted the bitrate and resolution. The reference video and distorted video were then run through Clovis. Below is the table that shows the results of this test.
The results in this table closely reflect the trends found in the independent tests.
Finally, we’ve made the dream of end-to-end objective video quality analysis at Airtime a reality! Now that we’re able to analyze the video quality that users experience under different network constraints, what else needs to be done?
Clovis still needs to be integrated into Airtime’s testing environments. Being able to determine a score for videos under different constraints will allow Airtime’s testers and developers to further optimize the media server’s encoder, improving the app experience for all users of Airtime!