As a company whose users are constantly making calls and watching videos, Airtime benefits greatly from being able to objectively analyze the audio quality that a user receives. Such a tool can be used in automated testing, allowing us to easily observe how any change to our encoding process affects audio quality under various constraints such as packet loss. If we attempt to optimize our encoding process to give users a better experience, an audio analysis tool lets us measure how our changes affected audio quality, rather than judging the output by ear.
How Do We Measure Audio Quality?
So what is the best way to analyze audio quality? The numerous methods of audio analysis each have their pros and cons, and the best analysis method may be different in each use case. The most common type of audio analysis model is the media layer model. A media layer model is one that takes audio signals as inputs. Generally, other analysis models are computationally cheaper than media layer models. Examples of other models are:
- Packet layer models — use header information to estimate audio quality
- Parametric models — use pre-stored values to estimate audio quality
- Bitstream layer models — estimate the quality before decoding
However, since precision is the most important factor for Airtime, a media layer model is the optimal choice.
Media Layer Models
- Full-Reference — Analyzes a decoded audio file relative to an original audio file. The full-reference model has received the most research and development of any measurement model and is very accurate. The drawbacks are that it requires the original audio sample, and that it can only measure the difference in quality between the two audio samples.
- Reduced-Reference — Analyzes a decoded audio file by using features of the original sound. In practice, this method is usually only used when access to the entire original audio sample is unavailable.
- No-Reference — Analyzes a standalone audio file, and does not require input from an original audio file. Extracts distortion that originates from sources other than the human voice (e.g. network constraints). This is not as accurate as a full-reference model.
Fortunately for Airtime, we have access to the entire original audio sample. Additionally, we only care about the difference in quality between the 2 audio samples, rather than the standalone audio quality. If the publisher sends out low-quality audio (the original audio), there is nothing our encoding process can do to improve the subscriber’s audio experience. We only care about giving the subscriber the best possible experience, and we do that by ensuring their audio quality is as close as possible to that of the original input.
Methods of the Full-Reference Audio Model
When choosing an audio analysis method, we must consider that audio quality analysis is different for:
- telephony audio (speech)
- high-fidelity audio (applicable with all kinds of sound such as music)
For Airtime, telephony audio is more important as real-time voice and video chat is a signature feature of the Airtime experience. However, high-fidelity audio is still relevant, as there will be scenarios where audio other than speech is the focal point, such as watching a video of a music concert. It would be ideal to choose a method in which the analysis can account for both types of audio.
Full-reference audio analysis methods generally return a MOS (mean opinion score) between 1 and 5 to indicate how good the decoded audio quality is. Although several full-reference audio analysis methods exist for telephony-type audio, two currently stand out as candidates for the best one.
- POLQA (Perceptual Objective Listening Quality Analysis) — POLQA has been the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) recommendation since 2011. It is the successor to PESQ, the previous ITU-T standard for full-reference audio analysis. POLQA compares the differences between the original and decoded signals using a perceptual psycho-acoustic model based on models of human perception. POLQA assumes a temporal alignment of the original and decoded signals.
- ViSQOL (Virtual Speech Quality Objective Listener) — ViSQOL is a more recent but similar audio quality analysis method. It is developed by Google and is open-source. It uses a spectro-temporal measure of similarity between a reference and a test speech signal to produce a MOS.
ViSQOL vs POLQA
The charts above show how ViSQOL performs compared to POLQA at different bitrates. The y-axis shows the MOS, and the x-axis shows the complexity setting of the Opus encoder, the codec that Airtime’s audio is encoded with. It is important to note that most modern devices can handle the CPU cost of the maximum algorithmic complexity, as the complexity is set to 10 by default. We can see that POLQA is more sensitive to changes at lower bitrates, and the original ViSQOL is more sensitive to changes at higher bitrates. Although there is no subjective data for this dataset, the developers of ViSQOL expected MOS to be less sensitive at higher bitrates, meaning that POLQA is a better match than the original ViSQOL, but similar to the ViSQOL v3 that we would be using.
The above chart shows the correlation coefficient and standard error of each audio analysis method when compared to subjective scores from a database of audio files. The NOIZEUS database focuses on audio files with background noise (e.g. cars driving by), and E4 focuses on IP degradations such as packet loss and jitter. On NOIZEUS, PESQ performs best, followed by ViSQOL. On E4, POLQA performs best, with the other two performing similarly and slightly worse.
Although PESQ seems to be the best overall choice, several factors make it unviable compared to the other two. POLQA is tuned to respect modern codec behavior such as error correction, whereas PESQ is not. PESQ cannot evaluate speech above 7 kHz, but several codecs reach 8 kHz in wideband mode. Lastly, PESQ cannot properly resolve time-warping and will give MOSs far lower than expected. POLQA and ViSQOL perform quite similarly: ViSQOL performs better on the NOIZEUS database, and POLQA performs better on E4.
Despite these differences, the more important point is that ViSQOL is an open-source library with C++ compatibility, whereas POLQA is not, as it is primarily used in the telecommunications industry. With regard to telephony-type audio, both would be viable choices, but ViSQOL is more accessible.
Additionally, ViSQOL has a speech mode and an audio mode, the latter of which could be used for high-fidelity audio. The only other tool for general audio quality analysis is PEAQ (Perceptual Evaluation of Audio Quality). Comparing ViSQOL and PEAQ, the difference in performance at lower bitrates still stands, as PEAQ would struggle more than ViSQOL.
All in all, ViSQOL seems like the best overall choice for a full-reference audio analysis method. It performs extremely well, is the most accessible, and is the only tool capable of analyzing both telephony and high-fidelity audio so we wouldn’t have to simultaneously use 2 different tools.
The system diagram for ViSQOL is shown below:
First, the 2 audio signals are globally aligned. Then spectrogram representations of the signals are created. The reference signal's spectrogram is then divided into patches for comparison. The Neurogram Similarity Index Measure (NSIM) is used to time-align the patches: for each reference patch, the point in the degraded signal with the maximum NSIM score is the one that is used. ViSQOL then accounts for time warp by temporally warping the spectrogram patches. Time warp is when a degraded patch is shorter or longer (typically by 1% to 5%) than its reference patch, due to “compression” or “stretching”. If a warped version of a patch has a higher similarity score, that score is used for the patch. This is because NSIM is more sensitive to time warping than a human listener, so it must be accounted for. Finally, the NSIM scores are passed into a mapping function that generates a MOS.
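The patch alignment step can be sketched roughly as follows. This is only an illustration under loud assumptions: `similarity` is a simple stand-in and is not NSIM, the toy "spectrograms" are made-up lists of band energies, and the function names are hypothetical rather than taken from ViSQOL's code.

```python
def similarity(patch_a, patch_b):
    """Stand-in for NSIM (NOT the real measure): 1.0 for identical patches,
    approaching 0 as the patches diverge."""
    diffs = [abs(a - b)
             for frame_a, frame_b in zip(patch_a, patch_b)
             for a, b in zip(frame_a, frame_b)]
    return 1.0 / (1.0 + sum(diffs) / len(diffs))


def best_alignment(ref_patch, degraded_frames):
    """Slide ref_patch over degraded_frames and return (offset, score)
    for the position with the highest similarity."""
    width = len(ref_patch)
    best_offset, best_score = 0, -1.0
    for offset in range(len(degraded_frames) - width + 1):
        candidate = degraded_frames[offset:offset + width]
        score = similarity(ref_patch, candidate)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


# Toy spectrograms: each frame is a list of band energies (hypothetical data).
reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
degraded = [[9.0, 9.0]] + reference  # same content, delayed by one frame

ref_patch = reference[1:3]  # a two-frame patch starting at reference frame 1
offset, score = best_alignment(ref_patch, degraded)
```

Here the patch that starts at frame 1 of the reference is found at frame 2 of the degraded signal. A time-warp step would repeat the same search with slightly stretched or compressed versions of the patch and keep the best score.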
One problem that we must tackle is audio alignment. What would happen if our original audio file contained 10 seconds of audio, but our degraded audio file contained only 5 seconds? Would we use the first 5 seconds of the original audio, the last 5 seconds, or somewhere in between for comparison? We need a method to align the 2 audio files, such that only the common portions of each file are passed into ViSQOL for comparison.
Although there are plenty of methods to find the delay between 2 audio files, cross-correlation was the best fit for our use case. Other options we considered were convolution and autocorrelation. Autocorrelation measures the similarity of a signal with a time-lagged copy of itself, and convolution time-reverses one of its inputs, so neither directly measures how well two different signals line up. Cross-correlation measures the similarity between 2 signals at each time lag, even if they are not identical when lined up. Since WAV files store periodic samples of the analog sound wave, the cross-correlation must be computed discretely.
The general cross-correlation formula for discrete functions is as follows:
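In its standard textbook form, for two discrete signals f and g (the index names m and n are chosen here for illustration), the cross-correlation at lag n is:

$$
(f \star g)[n] = \sum_{m=-\infty}^{\infty} \overline{f[m]}\, g[m+n]
$$

The overline denotes complex conjugation, which has no effect on real-valued audio samples. For finite signals, samples outside either array are treated as zero, so the sum only runs over the indices where both signals are defined.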
Essentially, to find the cross-correlation at any given lag, we multiply each sample of f by the correspondingly shifted sample of g and sum the products over the whole array. However, to find the time at which our audio signals align, we must compute the cross-correlation for every possible lag and pick the one with the highest value.
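The brute-force search over every lag can be sketched as follows. This is a minimal illustration rather than Airtime's implementation: the function names and toy signals are assumptions, and the sign convention (a positive lag means the degraded signal's content appears later) is chosen here for the example.

```python
def cross_correlation(f, g, lag):
    """Sum of f[m] * g[m + lag] over every m where both samples exist."""
    total = 0.0
    for m in range(len(f)):
        k = m + lag
        if 0 <= k < len(g):
            total += f[m] * g[k]
    return total


def find_delay(f, g):
    """Return the lag (in samples) that maximizes the cross-correlation."""
    # Every lag at which the two signals overlap by at least one sample.
    lags = range(-(len(f) - 1), len(g))
    return max(lags, key=lambda lag: cross_correlation(f, g, lag))


# Hypothetical toy signals: the same pulse, two samples later in `degraded`.
original = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

delay = find_delay(original, degraded)  # 2 samples
```

This direct approach does work proportional to the product of the two signal lengths, which becomes slow for real audio files with hundreds of thousands of samples.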
An example of how cross correlation works is shown below.
We can observe that whether we correlate f against g or g against f, the point of maximum correlation identifies the same alignment; only the sign of the lag flips. For simplicity’s sake, we always pass in the degraded file as ‘g’. When our audio files are not of equal length, zero padding is added to the shorter file, as the cross-correlation algorithm expects both signals to have the same length. Although there are several ways to implement cross-correlation, the most efficient way to find the delay between 2 audio files is to use the Fast Fourier Transform (FFT). Cross-correlation is equal to convolution when one of the input signals is conjugated and time-reversed, and convolution in the time domain is multiplication in the frequency domain. We therefore transform both signals, multiply one transform by the complex conjugate of the other, and take the inverse Fourier transform of the result to get the cross-correlation of the 2 signals.
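A minimal sketch of the FFT approach, in plain Python so that it is self-contained. The names `fft` and `find_delay_fft` and the toy signals are assumptions for illustration, not Airtime's code, and a production version would more likely call an optimized FFT library. Both signals are zero-padded to at least the sum of their lengths, because the FFT computes a circular correlation that would otherwise wrap around.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT. len(x) must be a power of two.
    The inverse transform is left unscaled (caller divides by len(x))."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def find_delay_fft(f, g):
    """Lag (in samples) at which g best matches f, via FFT cross-correlation."""
    # Zero-pad to a power of two >= len(f) + len(g) - 1 to avoid wrap-around.
    size = 1
    while size < len(f) + len(g) - 1:
        size *= 2
    F = fft(list(f) + [0.0] * (size - len(f)))
    G = fft(list(g) + [0.0] * (size - len(g)))
    # conj(F) * G is the transform of sum_m f[m] * g[m + lag] for real f.
    product = [a.conjugate() * b for a, b in zip(F, G)]
    corr = [value.real / size for value in fft(product, inverse=True)]
    peak = max(range(size), key=lambda i: corr[i])
    # Indices past len(g) - 1 hold the negative lags, wrapped around.
    return peak if peak < len(g) else peak - size


# Hypothetical toy signals: the same pulse, two samples later in `degraded`.
original = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
degraded = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]

delay = find_delay_fft(original, degraded)  # 2 samples
```

The result matches a direct lag-by-lag search, but the cost grows as N log N instead of N squared, which is what makes this practical for full-length audio files.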
However, finding the delay is only the first step of audio alignment! The next step would be to cut off non-common parts of both signals using the delay. For example, if we have an original signal that is 8s long, and a degraded signal that is 9s long, but the degraded signal has a delay of -3s, which parts of which signals do we cut off? Below is a visual of how it would work.
1. Compute the delay given the original, untrimmed signals.
2. Once we find the delay of -3s, we move the degraded signal 3 seconds to the right in the time domain.
3. We must now cut off the non-common parts of each signal: the first 3 seconds of the original signal and the final 4 seconds of the degraded signal. The final 5 seconds of the original signal and the first 5 seconds of the degraded signal are then passed into ViSQOL, which returns a score.
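The trimming steps above can be sketched as follows. This is an illustrative sketch only: `trim_to_overlap` is a hypothetical helper, the delay is expressed in samples rather than seconds, and the sign convention (a negative delay means the degraded signal must move to the right) is assumed to match the example.

```python
def trim_to_overlap(original, degraded, delay):
    """Return the common parts of both signals.

    `delay` is the lag in samples for which degraded[i + delay]
    lines up with original[i] (assumed convention).
    """
    if delay >= 0:
        # Degraded has extra samples at its start.
        orig_start, deg_start = 0, delay
    else:
        # Original has extra samples at its start.
        orig_start, deg_start = -delay, 0
    length = min(len(original) - orig_start, len(degraded) - deg_start)
    length = max(length, 0)  # no overlap at all
    return (original[orig_start:orig_start + length],
            degraded[deg_start:deg_start + length])


# Toy data at one sample per second: an 8 s original and a 9 s degraded
# signal whose content starts 3 s into the original (delay of -3).
original = [10, 11, 12, 13, 14, 15, 16, 17]
degraded = [13, 14, 15, 16, 17, 0, 0, 0, 0]

orig_part, deg_part = trim_to_overlap(original, degraded, -3)
```

With these inputs, the first 3 samples of the original and the final 4 samples of the degraded signal are dropped, leaving 5 aligned samples in each, as in the worked example. For real audio, the delay in seconds would first be multiplied by the sample rate.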
With this, we now have all the tools needed to objectively analyze quality at Airtime. In our next post, we dive deeper into how we made this work!