

  • Bixby Load Test Analysis


    Bixby is the core media server backbone for Airtime’s distributed real-time communications framework. It routes audio/visual data to clients as needed and generates raw metrics that are utilized in creating usage and billing reports for third-party applications.

    A Bixby Load Test is manually started on Jenkins with a given Bixby version to run against and does the following:

    1. Deploys the specified Bixby version to the automation host and runs tests that collect probe data on how CPU, network, and memory are utilized in different publisher-to-subscriber scenarios
    2. Parses and averages each publisher-to-subscriber test for an intuitive reading of how the Bixby server handles a test over a set period of time
    3. Generates a CSV file that contains the Bixby Load Test result for the specific build
    4. Passes the CSV file to Jenkins, which plots figures similar to the following for CPU, network, and memory usage


    Image of the current Jenkins static CPU graph

    The graphs generated by Jenkins, as shown above, are neither user-friendly nor interactive. Although the data is displayed, it is difficult to interpret how each Bixby version changed the probed attributes or to compare behavior between different Bixby branches. Additionally, Jenkins does not have an intuitive way to switch between test types. Lastly, there is currently no way to disregard a build result or mark it as invalid, so functionality that can archive a build from the front end is desired.


    Part One: store load test results in a persistent fashion.
    Part Two: generate a data visualization that allows users to compare performance between Bixby versions, as well as against historical data.
    Part Three: mathematically categorize whether recent builds conform to the trends of previous builds.

    Part One: Using AWS RDS to store test results
    In order to store test results in a persistent manner, I wrote a Python script that is called at the end of a successful Jenkins build. The script connects to AWS Relational Database Service (AWS RDS) and appends the most recent build result to the Bixby Load Test table. MariaDB is the underlying database run by AWS RDS. Automated daily snapshots of the database are captured and stored for 30 days in case of an upload failure or data corruption.

    The table’s columns currently store the type of test being performed, the timestamp of when the data was uploaded, the Bixby version being built against, four separate CPU usage values (idle, nice, system, user), two network behavior values, one memory-available value, the specific Jenkins URL associated with the build, and a notes attribute that can contain any additional information about the build.
    We decided to integrate data persistence at the end of a load test and to use the CSV file it outputs to update the database.
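
    As a rough illustration of this upload step, the sketch below appends the rows of a load test CSV to a results table. It is not the actual script: sqlite3 stands in for the MariaDB instance on AWS RDS so the example runs anywhere, and the table name, column names, and CSV header are all hypothetical, modeled only on the attributes listed above.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone

# Hypothetical column names modeled on the attributes described in the text.
COLUMNS = [
    "test_type", "uploaded_at", "bixby_version",
    "cpu_idle", "cpu_nice", "cpu_system", "cpu_user",
    "net_in", "net_out", "mem_available",
    "jenkins_url", "notes",
]


def create_table(conn):
    """Create the results table (sqlite3 stands in for MariaDB on AWS RDS)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bixby_load_test ("
        "test_type TEXT, uploaded_at TEXT, bixby_version TEXT, "
        "cpu_idle REAL, cpu_nice REAL, cpu_system REAL, cpu_user REAL, "
        "net_in REAL, net_out REAL, mem_available REAL, "
        "jenkins_url TEXT, notes TEXT)"
    )


def append_results(conn, csv_text, bixby_version, jenkins_url, notes=""):
    """Append one row per test type found in the load test CSV.

    Returns the number of rows inserted.
    """
    uploaded_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        rows.append((
            rec["test_type"], uploaded_at, bixby_version,
            float(rec["cpu_idle"]), float(rec["cpu_nice"]),
            float(rec["cpu_system"]), float(rec["cpu_user"]),
            float(rec["net_in"]), float(rec["net_out"]),
            float(rec["mem_available"]),
            jenkins_url, notes,
        ))
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO bixby_load_test ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})",
        rows,
    )
    conn.commit()
    return len(rows)
```

    Using parameterized placeholders rather than string formatting for the values keeps free-text fields such as the notes attribute from breaking the insert statement.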

    Certain builds of the Bixby Load Test were identified as less meaningful during data analysis (Part Three); these are annotated with the reason they are disregarded (for now).

    Automatic Backups and Database Recovery
    Our AWS RDS instance is configured to create an automatic backup each day, and it can easily be restored through the AWS RDS console. Furthermore, we decided to keep the script that outputs a CSV file as a fail-safe: in the event that both the database and its backups cannot be recovered, the data can still be found in individual Jenkins jobs.

    Part Two: Data Visualization
    The main purpose of a data visualizer for the Bixby Load Test is to show how the Bixby server’s performance behaves over time and to compare one specific branch with another.

    To be intuitive, the Bixby load test visualizer should be interactive without overloading the user with controls.

    The load test visualization must provide the following capabilities:

    • Each test must have its own set of three graphs (CPU, network, memory), for a total of twelve graphs that need to be accounted for
    • Comparison is only needed between different branches of the same test type, so at any point at least six graphs must be present to display the release branch versus other branches (develop and feature)
    • Users should be able to see the average behavior of each Bixby version as well as the individual tests that were run
    • Users should be able to flag a build in the database as one to be disregarded for analysis (Part Three)

    Bokeh is a free and open source Python data visualization library. We chose it for this portion of the project because its dedicated developers are communicative on the Bokeh forums. It also abstracts the process of serving Bokeh content and does not require any HTML/CSS to display; instead, graphs are constructed from objects that build upon one another.

    Pandas and NumPy were used heavily alongside Bokeh to read files and databases and to group database attributes for analysis.

    Screenshot of a portion of the finalized version of the Bixby Load Test Dashboard internal webpage.

    The resulting data visualization is hosted on an internal webpage, called the Bixby Load Test Dashboard, as shown above.

    Detecting Failure and Restarting Host
    Since the Bixby visualization dashboard is hosted on its own VM, sysctl is used to monitor the Bokeh server and will kill and restart the program if any anomaly is detected. Additionally, a crontab file checks whether new changes have been made in the data visualization repository; if so, it automatically pulls all changes to the host and restarts the dashboard.

    An abridged visual of how the Bixby Load Test Dashboard is structured.

    Part Three: Bixby Load Test Pass/Fail
    There are currently no metrics to categorize whether a Bixby build is in line with the recent trend line. Bixby Load Test results have historically been assessed visually for validity and manually disregarded or archived if they were erroneous.

    The process can be laborious and riddled with human subjectivity. To automate the comparison of a load test result against historical data, a Python script runs at the end of a successful Jenkins build. The script parses load_test.csv, which is generated automatically at the end of a Jenkins build and contains the load test data. The historical median is then used to determine whether the most recent run conforms to the trend of previous data.
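
    A minimal sketch of such a median-based check is shown below. The text above only states that the historical median is used; the relative tolerance band, its 10% default, and the function name are assumptions made for illustration.

```python
from statistics import median


def conforms_to_trend(history, latest, tolerance=0.10):
    """Return True if `latest` lies within `tolerance` of the historical median.

    `tolerance` is a relative band (0.10 means +/- 10%); this threshold is an
    illustrative assumption, not the value used by the real script.
    """
    if not history:
        raise ValueError("need at least one historical value")
    baseline = median(history)
    if baseline == 0:
        # Avoid dividing by zero; only an identical value conforms.
        return latest == 0
    return abs(latest - baseline) / abs(baseline) <= tolerance
```

    A median is less sensitive than a mean to the occasional erroneous build, which matters here because some historical results are known to be less meaningful.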

    An abridged visual of the new workflow for the Bixby Load Test.

    List of some of the things I did and learned this summer:

    • Utilized a database that is constantly updated and queried to create the front end components of the data visualization.
    • Learned how Ansible playbooks describe the environment a host should have when it is created, and how Ansible can automatically (and synchronously!) deploy hosts that follow the specified playbook “template”.
    • Learned how VPC security groups are created and defined to limit access to an IP.
    • Learned how Jenkins can be utilized to run tests by allocating a parameterized build host environment and obtaining a specific GitHub branch to build with.
    • Practiced creating technical documentation.
    • Learned about AMIs and how hosts are automated to check system health and redeploy if needed using cron and sysctl.
    • Gained a better understanding of how credentials needed in scripts can be protected using vaults and hidden config files.
    • Learned about crontab files and cron processes.
    • Utilized Pandas, NumPy, Bokeh, and more to create the load test pass/fail and the data visualization.
    • Created a team-wide wiki that lists the names of major active projects and how they fit into the Airtime media stack.
  • Automating Network Constraint Tests in the Cloud

    Reliable, high-quality audio and video are key to an outstanding video call experience. At Airtime, we strive to give users the best experience possible. Thus it is crucial that we verify that a client’s real-time video quality adapts as expected when impeding network conditions are encountered. Conditions covered by our testing include limited bandwidth, packet loss, packet latency, and combinations of these. We refer to these as network constraint tests. Network constraint testing can be performed manually using a router, but this is time-consuming due to the unique setup required by each test. Testing productivity can be greatly increased by creating APIs and cloud infrastructure that enable us to automate the majority of our existing manual network constraint tests. In summary, the following were accomplished to get this project up and running:

    • Researched several constraint tools and selected Linux Traffic Control (TC) for deeper evaluation on usability and reliability
    • Modified automation infrastructure in AWS to integrate TC
    • Developed APIs to allow TC to be remotely configured from within tests
    • Wrote automated tests that utilized the new infrastructure and APIs

    Potential Constraint Tools for Automation

    In the past, several network constraint tools had been evaluated as potential candidates for creating constraints in automation. Some of these tools displayed unwanted behaviors during testing, such as long round-trip times when a simple bandwidth constraint was applied. A successful tool is one that only applies the network constraints needed for tests without introducing negative side effects that would invalidate automated test results.

    Network Link Conditioner (NLC)

    NLC runs on client devices (iPhone and macOS) and we use it as part of our manual test procedures. While NLC is also available on macOS, our manual test environments currently run on Linux virtual servers. We might have considered NLC further had no other promising solution been found at the time.


    pfSense

    pfSense is an open source firewall that can be installed on a physical computer or virtual machine and uses DummyNet as its traffic shaping tool. To clarify, traffic shaping is a bandwidth management technique used in computer networking. When pfSense was previously evaluated, unexpected network latency was discovered and this option was rejected as a result.


    Soekris

    Soekris is a physical router that can be configured to apply constraints and is used in manual testing. However, Soekris was challenging to work with since configurations weren’t reliable, which led to regular factory resets. This caused unreliable test results. Other concerns with Soekris were that it was not easily programmable, had no API, and is no longer available for sale. Since Soekris is a physical device, it cannot be used in a cloud context.

    Linux Traffic Control (TC)

    TC is a command in Linux that produces constraints by modifying the settings on a Network Interface Card (NIC). TC has a variety of command line options and add-ons like netEm (network emulator) to support the use of different rate limiting and queuing algorithms to produce the desired constraints.

    We saw TC as a good candidate because it offered the ability to apply the different constraints we needed to test, is simple and quick to install, and can be programmatically configured for automation. One downside of working with TC is the scarcity of updated documentation, examples, and discussion of best practices for the different configuration options and combinations. This required us to investigate and experiment further to determine what actually worked for us.

    Evaluation of TC as a Reliable Tool

    We evaluate every constraint tool prior to adding it to our testing toolset to ensure the validity of our test results. Similarly, we verified that TC is able to reliably and consistently apply the necessary constraints without introducing unwanted side effects that would tamper with automated test results.

    Evaluation Criteria

    In order to objectively judge a tool, we have existing performance tests to evaluate a tool’s reliability when applying different constraints. Assumptions were made about the implementation of a typical rate shaper algorithm. One recognized implementation is to use a queue and the token bucket algorithm as shown in Fig. 1 below.

    Fig. 1. A Token Bucket With Parameters r and b. [1]

    The token bucket algorithm generates the configured limited rate r by filling the bucket with tokens at a rate of r tokens per second. Queued packets are transmitted when tokens are available in the bucket; otherwise they remain queued until sufficient new tokens have been generated. When the queue is saturated, the token fill rate r becomes the transmission rate, i.e. the limited rate. The bucket depth b limits the size of a traffic burst, which is the amount of traffic that can pass through immediately. Packets still travel at the maximum link rate R when transmitted. Over time, the transmissions at rate R and the idle periods between them together average out to a rate of r.
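
    The behavior described above can be sketched with a small simulation. This is a simplified model for illustration only, not how TC implements its shaper: time advances in fixed ticks, every packet costs exactly one token, and the queue limit and function name are assumptions.

```python
def simulate_token_bucket(rate, depth, arrivals, queue_limit, tick=1.0):
    """Simulate a token bucket shaper in fixed ticks.

    rate        -- token fill rate r, in tokens (one per packet) per second
    depth       -- bucket depth b, the largest burst that can pass at once
    arrivals    -- number of packets arriving in each tick
    queue_limit -- packets the queue can hold before tail dropping starts
    Returns (sent_per_tick, dropped_total).
    """
    tokens = depth  # the bucket starts full
    queued = 0
    dropped = 0
    sent_per_tick = []
    for arriving in arrivals:
        # Refill the bucket, never beyond its depth.
        tokens = min(depth, tokens + rate * tick)
        # Enqueue arrivals; anything beyond the queue limit is tail dropped.
        accepted = min(arriving, queue_limit - queued)
        dropped += arriving - accepted
        queued += accepted
        # Transmit one queued packet per available token.
        sent = min(queued, int(tokens))
        tokens -= sent
        queued -= sent
        sent_per_tick.append(sent)
    return sent_per_tick, dropped
```

    With r = 10, b = 20, a queue limit of 30, and 50 packets arriving every tick, the first tick passes a burst of 20 packets (the bucket depth) and every later tick passes 10 (the fill rate), while the excess is tail dropped.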

    A bandwidth constraint is fully enforced when a queue becomes saturated (the input rate is much higher than the maximum bitrate), which leads to packets being dropped (tail dropping behavior). This is also known as conditional packet loss. The depth of the queue will stay within a range when the test packets sent are a constant size, which allows expected results to be determined for bandwidth constraint performance tests. Unconditional packet loss can be simulated by dropping packets either periodically or randomly. Ping is used as a traffic generator and can send packets at a high rate. Its flood mode can be used to monitor the status of each sent packet to determine whether packets are being dropped at the configured loss rate and whether the dropping is periodic or random. Unconditional delay is the duration of time for which packets are held before being forwarded. We can determine that a delay has been applied by using ping to see that the delay of packets has increased by the configured amount.

    Table 1 details some of the performance tests and expected results used to evaluate TC for different constraint behaviors.

    Table 1. Sample of Performance Tests for Bitrate Constraint, Packet Delay, and Loss Scenarios

    For the expected results in Table 1, we were willing to tolerate packet loss rates and received bandwidth values that were within 5% of the configured values. To ensure that the network simulator’s queues were not introducing traffic bursts, we accepted additional variations in delay of no more than the time required to send a small number of packets.

    Setting Up a Local Physical Environment to Evaluate TC

    To ensure a successful evaluation of TC, we started with a clean physical test environment. External network interference needed to be mitigated so no additional influence would be introduced to the evaluation test results. This was achieved with the private subnet setup shown in Fig. 2 below. Test traffic packets were routed through network interfaces on the Linux router box and no other traffic was passing through those used interfaces. TC also configures the same network interfaces to introduce constraints.

    Fig. 2. Diagram of local setup for performance evaluation of TC.

    Tools Used for Performance Testing

    iPerf3 was run on the client and server side with the client side generating test traffic and statistics collection occurring on the server side. Ping was used to check a packet’s RTT.

    Adjustments Made for Performance Tests

    iPerf3 does not include packet headers like UDP or IPv4 when calculating bitrates whereas the network simulator does. Thus the packet length and bitrate sent by iPerf3 on the client side were adjusted to account for a 28 byte overhead (8 bytes from the UDP header and 20 bytes from the IPv4 header). Packets arriving at TC included a 14 byte Ethernet header, which meant the constraint configured using TC also needed to be adjusted to account for this additional overhead. Further experimentation was also done with changing the queue size on the network interface to investigate its effect on achieving a more accurate constraint. Performance tests were re-run with queue sizes of 50, 100, and the default size of 1000 packets.
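
    The header arithmetic can be made concrete with a short sketch. The header sizes come from the paragraph above; the function names and the choice to express the target as an IP-level packet size and bitrate are assumptions made for illustration, and the direction of each adjustment reflects one reading of the text.

```python
UDP_HEADER = 8        # bytes
IPV4_HEADER = 20      # bytes
ETHERNET_HEADER = 14  # bytes


def iperf_settings(ip_packet_bytes, ip_bitrate_bps):
    """Payload length and bitrate to request from iPerf3.

    iPerf3 counts only the UDP payload, so both values are scaled down so
    that the traffic measured at the IP level matches the target.
    """
    payload = ip_packet_bytes - (UDP_HEADER + IPV4_HEADER)
    if payload <= 0:
        raise ValueError("packet too small to carry any payload")
    bitrate = ip_bitrate_bps * payload / ip_packet_bytes
    return payload, bitrate


def tc_rate(ip_packet_bytes, ip_bitrate_bps):
    """Rate to configure in TC, which also sees the Ethernet header."""
    return ip_bitrate_bps * (ip_packet_bytes + ETHERNET_HEADER) / ip_packet_bytes
```

    For 1000-byte IP packets at a 1 Mbps target, this gives a 972-byte iPerf3 payload at 972 kbps and a TC rate of 1.014 Mbps. The relative size of the correction grows as packets get smaller, which is why it matters for low-bitrate tests.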

    Further Evaluation of TC In a Cloud Environment

    After testing in the local environment was complete and the performance results were promising, we recreated a model of the local setup in AWS and repeated the evaluation process. We wanted to confirm that TC worked the same way and that the cloud environment didn’t introduce unexpected problems. We also suspected that there could be an additional VLAN (virtual LAN) header in the cloud, but none turned up.

    TC Configurations Chosen for Automation

    TC supports a variety of rate limiting algorithms and setup combinations. Different features and configurations were researched, attempted and considered. One of the simplest rate limiting configurations available in TC is the Token Bucket Filter (TBF). TBF was used along with netEm (network emulator) to create the final set of configurations used in our automation environments. TBF limits the rate while netEm is an additional software tool that adds packet loss and delay to TC.

    Fig. 5. Example TBF and netEm TC configurations.

    We chose these configurations for their reliability and simplicity in the automation environment compared to the other configurations and features we tried. Table 2 shows successful performance results for these configurations.
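
    As a rough sketch of how TBF and netEm can be combined on one interface, the Python helpers below build two tc command lines: a TBF qdisc at the root for rate limiting, with a netEm qdisc attached beneath it for delay and loss. The interface name, rate, burst, latency, and the use of netEm’s limit option to set a 100-packet queue are illustrative assumptions and not necessarily the values used in the automation environment. The commands are only built, not executed, since running tc requires root privileges.

```python
def tbf_command(dev, rate_kbit, burst_kbit=32, latency_ms=400):
    """Build a tc command that installs a TBF rate limiter at the root."""
    return [
        "tc", "qdisc", "add", "dev", dev, "root", "handle", "1:",
        "tbf", "rate", f"{rate_kbit}kbit",
        "burst", f"{burst_kbit}kbit", "latency", f"{latency_ms}ms",
    ]


def netem_command(dev, delay_ms=0, loss_pct=0, limit_packets=100):
    """Build a tc command that attaches netEm beneath the TBF qdisc."""
    cmd = [
        "tc", "qdisc", "add", "dev", dev, "parent", "1:1", "handle", "10:",
        "netem", "limit", str(limit_packets),
    ]
    if delay_ms:
        cmd += ["delay", f"{delay_ms}ms"]
    if loss_pct:
        cmd += ["loss", f"{loss_pct}%"]
    return cmd


def clear_command(dev):
    """Build a tc command that removes every qdisc from the interface."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]
```

    Returning argument lists rather than shell strings means the commands can be handed directly to subprocess.run without shell quoting concerns.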

    Table 2. Sample of results for TBF and netEm commands successfully passing performance evaluation on cloud environment

    Intermediate Challenges with TC and Additional Experiments

    Experimentation was also done with TC’s HTB (Hierarchy Token Bucket) option to see if different rate limits could be configured for different IP addresses on a single network interface. TC’s filtering option was used in HTB to direct packets destined for different IP addresses to the correct queuing node. We found the constraints for HTB to be less reliable as they didn’t always apply as expected. Occasionally, we would also see a delay in the constraint taking effect (iPerf3 server results often showed 0% packets received for a period of time initially). We thought this could have been due to the queue size, so we experimented with that as well. However, since the setup was more complex and it did not work well in combination with netEm for including packet loss and delay, we decided to forgo the use of HTB. Thus a tradeoff was made to modify the infrastructure of the automation environment in order to be able to use TBF and netEm instead. This choice also simplified the design of the APIs needed to control and access TC.

    Adjusting the queue size within TC also required experimentation to verify how it affected the adjusted bitrate received. By default, the queue size used by TBF is 1000 packets. However, the queue size can be configured by either configuring the TC command options or configuring the network interface using network commands. It was observed that the same TC configurations that worked in the testing environment caused slightly different results when actually tested using a real mobile client that measured the results using an internally built tool. Further experimentation was done with different queue sizes and evaluation criteria tests were run with queue sizes of 50 and 100 packets. We decided on a queue size of 100 packets based on results.

    Getting it Working

    Integration into the Automation Environment

    The structure of the automation environment was influenced by the type of TC configuration used and how our existing testing tools functioned and communicated back and forth with the media servers. Instead of relying on TC IP filtering to constrain the correct packets for specific IP addresses, we opted to have the packets of a constrained stream be routed differently by using an IP packet marking script called cgexec.

    Fig. 3 shows how the automated cloud infrastructure is set up. VPC stands for virtual private cloud. The automation host sets up and invokes the automation tests. TC was introduced into our automation environment by adding an AWS instance between our automation host and media server. When the automation tests are invoked, only a publisher or a subscriber stream can be constrained at a time and not both at once. In Fig. 3, constrained publisher packets route through the eth2 network interface to the media server using cgexec and subscribing clients route through eth1. Similarly, constrained subscriber packets route through the eth1 network interface. Packets traveling to a specific IP address are marked using cgexec so that they will be routed through the TC virtual instance, undergoing a constraint. Packets that are not marked using cgexec will travel directly to the media server without being constrained.

    Fig. 3. Diagram of automation environment with TC.


    A high-level overview of an automated test is as follows. When testing with a constrained publisher, the mock client first starts a process that creates a publisher stream. Traffic from that process is routed through the TC instance, where the constraint is applied, and then reaches the media server. In this case, the test subscriber client must be unconstrained due to how the automation infrastructure is implemented, and the subscriber client receives stream data back from the media server.

    APIs needed to be created in order to programmatically invoke and set up TC. We chose to modularize the implementation by organizing the API into two separate repositories: a high-level API that is more general and abstracts away the TC details, and a low-level API that handles the specific details of automatically setting up TC configurations. This also allows for flexibility in the future if TC commands are modified.

    These APIs allowed us to integrate the ability to run static and dynamic uplink and downlink tests into our existing automation. Static constraints refer to tests that have only one constraint applied for the entire test. Dynamic constraints refer to tests that change the constraint multiple times in succession throughout the test and collect the video quality statistics after each constraint.
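
    The split between a high-level and a low-level API, and the difference between static and dynamic tests, can be sketched as follows. All class and function names here are hypothetical; the backend below only records requests and stands in for the low-level layer that would actually configure TC.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Constraint:
    """High-level description of a constraint; no TC details leak out."""
    bandwidth_kbit: Optional[int] = None
    loss_pct: float = 0.0
    delay_ms: int = 0


class RecordingBackend:
    """Stand-in for the low-level API; it only records what was requested."""

    def __init__(self) -> None:
        self.applied: List[Constraint] = []

    def apply(self, constraint: Constraint) -> None:
        self.applied.append(constraint)

    def clear(self) -> None:
        # An empty Constraint represents an unconstrained link.
        self.applied.append(Constraint())


def run_constraint_test(
    backend: RecordingBackend,
    schedule: List[Constraint],
    collect_stats: Callable[[], dict],
) -> List[Tuple[Constraint, dict]]:
    """Apply each constraint in turn and collect statistics after each one.

    A static test is a schedule with one entry; a dynamic test has several.
    """
    results = []
    try:
        for constraint in schedule:
            backend.apply(constraint)
            results.append((constraint, collect_stats()))
    finally:
        backend.clear()  # always leave the link unconstrained
    return results
```

    Because the test code only ever sees Constraint objects, the low-level layer could change the TC commands it generates without any test needing to be rewritten.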

    Fig. 6. Flow of automated tests.


    Automation of network constraint tests has introduced numerous benefits. We’ve increased the efficiency and regularity with which we test how poor network conditions affect real-time video quality. The analysis of test results is more standardized and objective compared to manual analysis, which can be prone to subjectivity. The tests are part of regular nightly tests to catch potential regressions and problems sooner. With regular and more frequent monitoring, we can respond sooner to the issues discovered. Manual network testing time has been reduced by more than 50%, and testers can focus on creative testing tasks. Previously, running the tests manually would take several hours; now the tests run in several minutes.

    Future Improvements

    During the time that the automation tests have been active, network adaptation behavior has varied, causing a broad range of numeric results. In the future, we could start averaging multiple test runs or use other measures of central tendency.



  • How Airtime Utilizes Objective Audio Quality Analysis

    Previously, I discussed my research on the most effective way to objectively analyze audio quality at Airtime. But how can we use this information on audio quality analysis to actually help us maintain and improve Airtime’s audio quality? Simply cutting and pasting audio files that went through Airtime’s encoding process into ViSQOL, our audio quality analysis API, would not be sufficient. Creating a robust and flexible tool that leverages ViSQOL is paramount to making beneficial and impactful changes to Airtime.


    Huron is an application that takes in an original audio file and its degraded counterpart as inputs and outputs the results in JSON. The JSON file contains the MOS (mean opinion score) indicating the quality score of the degraded file relative to the original. The workflow of Huron is as follows:

    Huron Workflow Diagram

    At the beginning of execution, the Huron main application will create an AudioController, which is responsible for returning the MOS generated by ViSQOL’s API given the original and degraded audio files passed in from the main app. The AudioController will then create two WavParsers: one for the original audio sample and one for the degraded audio sample. Each of these WavParsers will asynchronously decode the audio into a discrete array of numeric values. The WavParsers are also responsible for resampling the audio and converting it from stereo to mono if necessary. When each WavParser finishes decoding its respective audio, it will send a signal back to the AudioController to indicate that it is finished. When the AudioController has received a finished signal from both WavParsers, it will trim the parsed audio data such that only the common parts of both remain. Finally, the AudioController will pass the trimmed data to ViSQOL’s API and receive the MOS as a promise. The AudioController will output the result as JSON and then fire a signal to the main app, letting it know that everything is finished and that it can exit safely.
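
    The control flow described above can be sketched in Python with asyncio. This is an analogy rather than Huron’s actual code: the function names are hypothetical, decoding is faked with in-memory lists, trimming is assumed to keep the common leading portion of both signals, and the scorer is a placeholder that is not ViSQOL and not a real quality metric.

```python
import asyncio
import json


async def parse_wav(samples):
    """Stand-in for a WavParser: pretend to decode audio into numbers."""
    await asyncio.sleep(0)  # yield control, as asynchronous decoding would
    return list(samples)


def trim_to_common(original, degraded):
    """Keep only the span both signals share (assumed: the common length)."""
    n = min(len(original), len(degraded))
    return original[:n], degraded[:n]


async def placeholder_scorer(ref, deg):
    """Placeholder for the ViSQOL call; NOT a real quality metric."""
    diff = sum(abs(a - b) for a, b in zip(ref, deg)) / max(len(ref), 1)
    return round(max(1.0, 5.0 - diff), 2)


async def score_audio(original, degraded, scorer=placeholder_scorer):
    """Stand-in for the AudioController: parse, trim, score, emit JSON."""
    # Both parsers run concurrently; gather resumes once both have finished.
    parsed_original, parsed_degraded = await asyncio.gather(
        parse_wav(original), parse_wav(degraded)
    )
    ref, deg = trim_to_common(parsed_original, parsed_degraded)
    mos = await scorer(ref, deg)
    return json.dumps({"mos": mos})
```

    Here asyncio.gather plays the role of waiting for the finished signal from both parsers before trimming begins.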

    Now that we have a tool that allows us to generate a MOS given an original audio file and a degraded audio file as inputs, we can see how network and CPU constraints affect audio quality at Airtime.


    I gathered data sets using an iPhone 8 by recording audio that went through Airtime’s encoding process and applied network and CPU constraints to the iPhone. Different results are expected under different constraints, because the encoding process takes into account available CPU & network resources before generating the encoded output. Both one-variable and two-variable tests were used to observe how constraints impact audio quality both alone, and in conjunction with other constraints. Additionally, test scenarios for both speech and music were covered.

    Although we can evidently see that the score decreases as we have less bandwidth, the scores are all very similar until the last column. It is entirely possible that a moderate constraint can have a higher MOS than an output with no constraints due to variance. It appears that the audio quality only starts to noticeably decrease once the bandwidth drops below a certain threshold but once the bandwidth drops below this threshold, the audio quality drops extremely quickly as shown in the 100Kbps example.

    Initially, the score seems to decrease as packet loss increases. However, the moderate packet loss and severe packet loss have a similar average. This is likely due to variance. The 1.53 outlier in the moderate packet loss example was one where a very large portion of the beginning of the file is just completely missing. The outliers in the severe packet loss example (3.71, and 3.27) were due to insignificant parts of the audio being cut off. Since the packets being lost are random, there is a lot of variance in MOS, as it is unknown whether significant or insignificant packets of the audio will be lost. However, there is still an overall trend where MOS decreases as packet loss increases.

    It seems like there is a weak correlation between the CPU utilization and MOS when only audio is being transmitted. However, as CPU utilization increased, there were very occasional short hiccups in the degraded audio file, which would probably explain the cases where the score was 3.32 and 3.21. Overall, even at relatively severe CPU utilization the audio quality is only sometimes affected, and when it is, it is not severe.

    The above table is an example of a 2-variable test where both bandwidth and packet loss constraints were applied. We can observe that adding a moderate bandwidth constraint has essentially no effect on the MOS. The table where we have severe packet loss and moderate bandwidth constraint is very similar to the severe packet loss one-variable table. It appears that the MOS strongly gravitates towards the scores generated in the 1-variable test case with lower scores. In the above example’s bottom-right cell, one might expect this 2-variable example to have a significantly lower score than 1.9. However, the average score is still around 1.9, because the score is gravitating towards the lower one-variable score. This is likely due to the fact that MOSs are not calculated using a linear scale, so the variable that individually affects the MOS less will seem almost insignificant compared to the other. The other 2-variable tests that have been performed further reinforce this idea.

    We can observe from the above table that publishing a 180×180 video has essentially no impact on the MOS. Even a 1920×1080 video has minimal impact. For the moderate bandwidth and packet loss values, we can see that the scores are very similar to the constraint’s 1-variable MOS values. The exception here seems to be the CPU constraint. Publishing a video at the same time with a heavy CPU constraint seems to lower the score further than without the video. This is likely because when there is a video, the already limited CPU has to both render the video and process the audio. The reason that publishing a video does not affect the other constraints to nearly the same degree is probably due to the fact that bandwidth and packet loss constraints don’t strain the CPU.


    It appears that the MOS significantly drops even with no constraints when a complex music sample is passed through the encoder. For the bandwidth constraints, similar to speech, the bandwidth only seems to noticeably affect the MOS when it is a severe constraint.

    Overall, the music MOSs show that the score doesn’t drop nearly as much given constraints. This could partially be due to the fact that the audio quality score is already quite low with no constraints, and adding some constraints wouldn’t make it much worse. Despite having a smaller effect, the scores follow the same pattern as the speech examples. We can observe that the MOS also gravitates to the more affected single variable constraint score when there are 2 variables.

    Because we learned how different constraints affect audio quality in different ways at Airtime, we now have a basis to refer to, allowing us to evaluate if our encoding process could be improved under specific conditions! Another application of Huron would be ensuring that any changes to our encoding process do not negatively impact the audio received by users.

    Using Huron in Automation Tests

    To ensure that we maintain our current level of audio quality, Huron was added to Airtime’s automation environment. When features are added to Airtime’s encoding process, the tests in the automation environment are run to verify that everything is working as intended. The new tests recorded an audio file that went through our encoding process, and passed it into Huron with the original file to receive a MOS. The MOS was then checked against a pre-defined value based on the results gathered from the tables above. The tests added were as follows:

    Huron Automation Tests

    The goal is for these tests to have high accuracy, while not having an excessively long runtime. To do this, we ran each test case three times, and if the median score was above the expected MOS, it would be considered as a passing test.
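
    A minimal sketch of this pass rule is shown below. The function name and the callable used to obtain a score are hypothetical; in the real tests each score would come from recording audio and passing it through Huron.

```python
from statistics import median


def median_mos_passes(run_once, expected_mos, runs=3):
    """Run a test case several times and pass if the median MOS is above target.

    run_once     -- callable that performs one run and returns its MOS
    expected_mos -- threshold the median score must exceed
    """
    scores = [run_once() for _ in range(runs)]
    return median(scores) > expected_mos
```

    Taking the median of three runs means a single outlier, such as a run where random packet loss happened to remove a significant part of the audio, cannot fail the test on its own.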

    These tests serve as a baseline for Airtime’s expected audio quality under different circumstances. With these tests in place, we can ensure that our level of audio quality is maintained when any changes are made to Airtime’s encoding process. Additionally, if we are looking to improve audio quality under certain constraints, utilizing these test cases is a simple and effective method to check if these optimizations are working as intended!

  • Objective Audio Quality Analysis at Airtime

    As a company where users are constantly making calls and watching videos, having the ability to objectively analyze the audio quality that a user receives is extremely advantageous for Airtime. Such a tool can be utilized in automated testing, allowing us to easily observe how any changes to our encoding process affect audio quality under various constraints such as packet loss. If we were to attempt to optimize our encoding process to give users a better experience, an audio analysis tool would allow us to verify how our changes affected audio quality, rather than trying to listen to the output by ear.

    How Do We Measure Audio Quality?

    So what is the best way to analyze audio quality? The numerous methods of audio analysis each have their pros and cons, and the best analysis method may be different in each use case. The most common type of audio analysis model is the media layer model. A media layer model is one that takes audio signals as inputs. Generally, other analysis models are computationally cheaper than media layer models. Examples of other models are:

    • Packet layer models — use header information to estimate audio quality
    • Parametric models — use pre-stored values to estimate audio quality
    • Bitstream layer models — estimate the quality before decoding

    However, since precision is the most important factor for Airtime, a media layer model is the optimal choice.

    Media Layer Models

    • Full-Reference — Analyzes a decoded audio file relative to an original audio file. The full-reference model has the most research and development compared to other measurement models and is very accurate. The drawbacks are that you require the original audio sample, and would only be able to analyze the difference in quality of the two audio samples.
    • Reduced-Reference — Analyzes a decoded audio file by using features of the original sound. In practice, this method is usually only used when access to the entire original audio sample is unavailable.
    • No-Reference — Analyzes a standalone audio file, and does not require input from an original audio file. Extracts distortion that originates from sources other than the human voice (ex. network constraints). This is not as accurate as a full-reference model.

    Fortunately for Airtime, we will have access to the entire original audio sample. Additionally, we only care about the difference in quality of the 2 audio samples, rather than the standalone audio quality. This is because if the publisher sends out low-quality audio (original audio), there is nothing our encoding process can do to improve the subscriber’s audio experience if the input itself is low-quality. We only care about how we can give the subscriber the best possible experience, and that is by ensuring their audio quality is as close as possible to the original’s input.

    Methods of the Full-Reference Audio Model

    When choosing an audio analysis method, we must consider that audio quality analysis is different for:

    • telephony audio (speech)
    • high-fidelity audio (applicable with all kinds of sound such as music)

    For Airtime, telephony audio is more important as real-time voice and video chat is a signature feature of the Airtime experience. However, high-fidelity audio is still relevant, as there will be scenarios where audio other than speech is the focal point, such as watching a video of a music concert. It would be ideal to choose a method in which the analysis can account for both types of audio.

    Full-reference audio analysis methods generally return a MOS (mean opinion score) between 1 and 5 to determine how good the decoded audio quality is. Although there have been several full-reference audio analysis methods for telephony-type audio, there are currently two that can be potentially considered as the best full-reference audio analysis method.

    • POLQA (Perceptual Objective Listening Quality Analysis) — POLQA has been the ITU-T (International Telecommunication Union Telecommunication Standardization Sector) recommendation since 2011. POLQA is the successor of PESQ, which was the previous ITU-T full-reference audio analysis standard. POLQA compares the differences between the original and decoded signal. The model used to determine the difference is a perceptual psycho-acoustic model that is based on similar models of human perception. POLQA assumes a temporal alignment of the original and decoded signal.
    • ViSQOL (Virtual Speech Quality Objective Listener) — ViSQOL is a more recent but similar audio quality analysis method. It is developed by Google and is open-source. It uses a spectro-temporal measure of similarity between a reference and a test speech signal to produce a MOS.


    The charts above show how ViSQOL performs compared to POLQA at different bitrates. The y-axis shows the MOS, and the x-axis shows the complexity of the Opus-encoding, the encoding that Airtime’s audio is encoded with. It is important to note that most modern devices can handle the CPU intensity of using the maximum algorithmic complexity, as the complexity is set to 10 by default. We can see that POLQA is more sensitive to changes at lower bitrates, and the original ViSQOL is more sensitive to change at higher bitrates. Although there is not subjective data for this dataset, the developers of ViSQOL expected that MOS should be less sensitive in higher bitrates, meaning that POLQA is a better match than the ViSQOL original, but similar to the ViSQOL v3 that we would be using.

    The above chart shows the correlation coefficient and standard error of our audio analysis methods when compared to subjective scores from a database of audio files. The NOIZEUS database focuses on audio files with background noise (ex. cars driving by), and E4 focuses on IP degradations such as packet loss and jitter. We can see that PESQ performs the best with ViSQOL following for the NOIZEUS database, and POLQA performs the best with the other two similarly performing slightly worse for E4.

    Although PESQ seems to be the best overall choice, there are several factors that make it an unviable option when compared to the other two. POLQA is tuned to respect modern codec behavior such as error correction, whereas PESQ is not. PESQ cannot evaluate speech above 7kHz, but multiple codecs are 8kHz in wideband mode. Lastly, PESQ cannot properly resolve time-warping and will give MOSs far lower than expected. POLQA and ViSQOL perform quite similarly: ViSQOL performs better with the NOIZEUS database, and POLQA performs better with E4.

    Despite these differences, it is more important to note that ViSQOL is an open source library with C++ compatibility, whereas POLQA is not, as it is primarily used in the telecommunication industry. With regards to telephony-type audio, both would be viable choices, but ViSQOL is more accessible.

    Additionally, ViSQOL has a speech mode, and an audio mode which could be used for high-fidelity audio. The only other tool for general audio quality analysis is PEAQ (Perceptual Evaluation of Audio Quality). Comparing ViSQOL and PEAQ, the difference in performance at lower bitrates still stands, as PEAQ would struggle more than ViSQOL.

    All in all, ViSQOL seems like the best overall choice for a full-reference audio analysis method. It performs extremely well, is the most accessible, and is the only tool capable of analyzing both telephony and high-fidelity audio so we wouldn’t have to simultaneously use 2 different tools.


    The system diagram for ViSQOL is shown below:

    First, the 2 audio signals are globally aligned. Then the spectrogram representations of the signals are created. The reference signal is then divided into patches for comparison. The Neurogram Similarity Index Measure (NSIM) is used to time align the patches: for each patch, the point with the maximum NSIM similarity score is the one that is used. ViSQOL then predicts time warp by temporally warping the spectrogram patches. Time warp occurs when a degraded patch is shorter or longer (typically by 1% to 5%) than its reference patch (due to “compression” or “stretching”). If a warped version of a patch has a higher similarity score, that score is used for the patch. This is because NSIM is more sensitive to time warping than a human listener, so it must be accounted for. The NSIM scores are then passed into a mapping function and a MOS is generated.

    Audio Alignment

    One problem that we must tackle is audio alignment. What would happen if our original audio file contained 10 seconds of audio, but our degraded audio file contained 5 seconds of audio? Would we use the first 5 seconds of the original audio, the last 5 seconds, or somewhere in between for comparison? We would need a method to align the 2 audio files, such that only the common portions of each file are passed into ViSQOL for comparison.

    Although there are plenty of methods to find the delay of 2 audio files, cross correlation seemed the best out of the options. Some other options included convolution and autocorrelation, however cross correlation would be the best in our use case. This is because convolution and autocorrelation are the measure of similarity of a signal with the same signal but with a time-lag. Cross-correlation is used for finding the similarity between 2 signals, even if they are not identical when lined up. Since wav files take periodic samples from the analog sound wave, cross-correlation would need to be done discretely.

    The general cross-correlation formula for discrete functions is as follows:
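
    In standard notation, for a lag n and with the bar denoting complex conjugation (which has no effect on real-valued audio samples), the discrete cross-correlation of f and g is commonly written as:

```latex
(f \star g)[n] = \sum_{m=-\infty}^{\infty} \overline{f[m]}\, g[m+n]
```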

    Essentially, to find the cross-correlation at any given point, we must compute the sum of the element-wise products of f and the shifted g (written here as f(g(x))) at every point of the array. However, to find the time at which our audio signals align, we must compute the cross-correlation for every possible alignment.

    An example of how cross correlation works is shown below.

    We can observe that regardless of whether we compute f(g(x)) or g(f(x)), the point of max correlation will be the same. Due to this, for simplicity’s sake we will always pass in the degraded file as ‘g’, such that f(g(x)) is computed when calculating a point’s cross correlation. When our audio files are not of equal length, padding will be added to the shorter audio file. Zero padding is a common method to align audio files of unequal length, as the cross correlation algorithm expects both signals to have the same length. Although there are several ways to implement cross correlation, the optimal method when trying to find the delay between 2 audio files would be to use the Fast Fourier Transform. The cross correlation integral is equal to the convolution integral if one of the input signals is conjugated and time reversed, so in the frequency domain we multiply the Fourier transform of one signal by the complex conjugate of the other’s. We then just need to take the inverse Fourier transform of the result to get the cross-correlation between the 2 signals.
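
    To make the idea of "computing the cross-correlation for every possible alignment" concrete, here is a minimal Python sketch. It uses the direct summation rather than the FFT approach described above, so it is only practical for short signals. The function names, the sample values, and the sign convention (a positive lag means the degraded signal is later) are assumptions made for illustration.

```python
def cross_correlation(f, g, lag):
    """Sum of f[m] * g[m + lag] over every m where both samples exist.

    Samples outside either list are treated as zero, which is equivalent
    to zero padding.
    """
    total = 0.0
    for m, value in enumerate(f):
        k = m + lag
        if 0 <= k < len(g):
            total += value * g[k]
    return total


def find_delay(original, degraded):
    """Return the lag (in samples) at which the two signals correlate best."""
    lags = range(-(len(original) - 1), len(degraded))
    return max(lags, key=lambda lag: cross_correlation(original, degraded, lag))


original = [0, 0, 1, 3, 2, 0, 0, 0]
degraded = [0, 0, 0, 0, 1, 3, 2, 0]  # the same pulse, two samples later
print(find_delay(original, degraded))  # 2
```

    The direct version costs on the order of N² multiplications for signals of N samples, which is why an FFT-based implementation is preferable for real audio files.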

    However, finding the delay is only the first step of audio alignment! The next step would be to cut off non-common parts of both signals using the delay. For example, if we have an original signal that is 8s long, and a degraded signal that is 9s long, but the degraded signal has a delay of -3s, which parts of which signals do we cut off? Below is a visual of how it would work.

    1. Compute the delays given the original signals

    2. Once we find the delay of -3s, we move the degraded signal 3 seconds to the right in the time domain.

    3. We must now cut off the non-common parts of each signal. The first 3 seconds of the original signal, and the final 4 seconds of the degraded signal would need to be cut off. Finally, the final 5 seconds of the original signal and the first 5 seconds of the degraded signal will be passed into ViSQOL, which returns a score.
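
    The trimming step can be sketched as follows. This is an illustrative Python function, not Airtime's implementation; it assumes the convention of the example above, where a negative delay means the degraded signal is shifted to the right by that amount, and it works in whatever unit the lengths are given in (seconds here, samples in practice).

```python
def common_ranges(original_len, degraded_len, delay):
    """Return ((o_start, o_end), (d_start, d_end)) for the overlapping part."""
    # Position of the degraded signal's start on the original's timeline.
    offset = -delay
    start = max(0, offset)
    end = min(original_len, offset + degraded_len)
    if end <= start:
        raise ValueError("signals do not overlap")
    # Express the same overlap in each signal's own coordinates.
    return (start, end), (start - offset, end - offset)


# The 8s original and 9s degraded signal with a delay of -3s.
print(common_ranges(8, 9, -3))  # ((3, 8), (0, 5))
```

    The result matches the worked example: seconds 3 to 8 of the original and seconds 0 to 5 of the degraded signal are kept.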

    With this, we now have all the tools needed to objectively analyze quality at Airtime. In our next post, we dive deeper into how we made this work!

  • End-to-End Objective Video Quality Analysis at Airtime

    For a company that revolves around video calling, it is beneficial to find the difference in video quality of a publisher, someone who sends out a video, versus a subscriber. This is beneficial for not only testing purposes but to potentially optimize video quality under different constraints, such as limited network bandwidth.


    So what defines “video quality”, and how can we evaluate it? To best evaluate video quality, we try to replicate a user’s experience and perception of quality. This is called objective video quality analysis. At Airtime, we want to perform full-reference objective quality analysis, which means that the entire reference image is known. The previous intern at Airtime, Caitlin, researched and implemented a source analysis tool called Fresno. Fresno is a tool that is capable of taking in two frames: a reference frame and a distorted frame. Fresno will then pass both of these frames into VMAF (Video Multi-method Assessment Fusion), an open-source software package developed by Netflix. VMAF calculates a quality score on a scale of 0 to 100, where a higher score represents a higher video quality. VMAF’s analysis considers the human visual system, as well as display resolution and viewing distance. More about Caitlin’s work can be found here:


    However, a source quality analysis tool is not sufficient to conduct video quality analysis at Airtime. If Airtime only had 1-to-1 video calls, source analysis would be sufficient. However, Airtime is a multi-party real-time video chatting app where each subscriber gets a unique video. Thus, we need to implement end-to-end analysis to understand the experience each user is getting.

    There are several challenges for Fresno to be used in end-to-end analysis. To embark on my journey of solving these challenges, I created Clovis, an application that would take in a reference video file, and a distorted video file. Clovis would produce an overall quality score from 0 to 100 that would represent the objective video quality of the distorted video relative to the reference video.

    How can Clovis use Fresno to analyze the quality of these two video files? Since Fresno takes in individual video frames, the first challenge would be to break down both video files into individual frames. To do this, Clovis needed to be designed such that breaking down the video files into individual frames and analyzing them were done efficiently.

    Clovis Workflow

    Clovis needed to be broken down into separate modules to simultaneously break down the input files into individual frames, and send frames through Fresno to generate a VMAF score for each frame pair.

    Clovis Workflow Diagram

    After careful consideration, Clovis was designed as shown in the diagram above. The Clovis App would take in the file paths for both the reference and distorted video file, and send them both to the frame controller. The frame controller would create two FFmpegFrameSources (one for each video file), and an analyzer class. FFmpegFrameSource was a class that was designed to use the library FFmpeg to break down the video into separate frames. For each frame, FFmpegFrameSource would send an on_frame signal to the FrameController. The Analyzer class would receive these signals, and store the frame in a queue. When there exists a matching reference and distorted frame, the analyzer would feed them into VMAF to generate a score. Since Fresno expects frames of the same resolution, the Analyzer was also responsible for scaling the distorted frames to match the resolution of the original video if the resolutions differed. With this design, Clovis is able to decode video files into individual frames and analyze existing frames at the same time. Once an FFmpegFrameSource has finished sending frames, it will send a signal to the FrameController. Once the frame controller has received a finished signal from both FFmpegFrameSources, it will signal to the analyzer that there are no more incoming frames. The analyzer will then return a score, which is an average of all VMAF scores of frame pairs. The frame controller will then report the returned score, and signal to the main Clovis App that it has finished executing.
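
    The analyzer's queue-and-average behavior can be illustrated with a small single-threaded Python sketch. It is a simplification under stated assumptions: the real Clovis is described as decoding and analyzing simultaneously, the method names here are hypothetical, and `score_fn` is only a stand-in for the Fresno/VMAF call.

```python
from collections import deque


class Analyzer:
    """Pairs reference and distorted frames in arrival order and averages their scores."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # stand-in for the Fresno/VMAF call
        self.reference = deque()
        self.distorted = deque()
        self.scores = []

    def on_frame(self, frame, is_reference):
        """Queue an incoming frame and score any pair that is now complete."""
        (self.reference if is_reference else self.distorted).append(frame)
        while self.reference and self.distorted:
            pair = (self.reference.popleft(), self.distorted.popleft())
            self.scores.append(self.score_fn(*pair))

    def finish(self):
        """Return the average score, or None if no pair was ever scored."""
        return sum(self.scores) / len(self.scores) if self.scores else None


# Toy frames are plain numbers and the toy score is 100 minus their difference.
analyzer = Analyzer(lambda ref, dist: 100 - abs(ref - dist))
for frame, is_reference in [(10, True), (20, True), (10, False), (15, False)]:
    analyzer.on_frame(frame, is_reference)
print(analyzer.finish())  # 97.5
```

    Keeping one queue per source lets either decoder run ahead of the other without losing frames, since a pair is only scored once both halves have arrived.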

    Now that we’re able to perform objective video quality analysis on any two video files, what else needs to be done? To make Clovis work in practice, we would need to be able to generate a video file for both a publisher and a subscriber.


    Eastwood simulates Airtime’s experience. It is capable of both publishing videos, and subscribing to them. Eastwood sends the publisher’s video to Airtime’s media server, which is responsible for receiving videos from the publisher, as well as sending the respective video to the subscriber. Before sending the publisher’s video to the subscriber, the media server will do one of three actions.

    1. Forward the video untouched to the subscriber.
    2. Forward video frames untouched to the subscriber, but reduce the frame rate.
    3. Re-encode the video and then send it to the subscriber.

    The re-encoding of the video that the media server performs is dependent on the network constraints of the subscriber. Since the media server may further reduce video quality, being able to analyze the difference in the quality of the video before and after it goes through Airtime’s media server is the focal point of the project. To do this, Eastwood was modified to write to a video file before, and after the video was sent through the media server.

    After implementing this feature, wouldn’t we have a complete end-to-end video quality analysis system? There was one more thing to consider. The media server’s re-encoding could drop frames in scenarios where the subscriber has restrictive network constraints, or when the subscriber doesn’t subscribe for the entire duration of the publisher’s video. This would lead to a difference in the number of frames between the reference video and the distorted video, so how would we know which reference frame each distorted frame corresponds to?

    Frame Correlation

    Imagine that we have a reference video of 100 seconds and 10 frames per second. The duration of our distorted video after going through the media server is also 100 seconds, but only 5 frames per second. This would leave us a total of 1000 frames in the reference video, but only 500 frames in the distorted video. How would we find out which 500 of the 1000 reference frames correspond to the 500 distorted frames? Would it be the first 500 frames, the first 250 and last 250 frames, or somewhere in between? To find out which reference frame each distorted frame corresponds to, we would need a way to consistently pass the frame number (or something that represents the frame number) through the media server’s re-encoding process.

    Potential Solutions

    After conducting sufficient research, I discovered potential solutions for our tricky frame correlation problem.

    1. Encoding the frame number into the encoder’s header or payload. This method would provide a simple and efficient method to retrieve the frame number. The drawback is that there are multiple encoder formats (VP8, VP9). We would need to consider all possibilities, and ensure that there is a suitable way to store the frame number in each encoder format.
    2. Each frame has an RTP header, so a possibility would be to store a 16-bit value that represents the frame number in the RTP header’s padding. This method would be troublesome to forward the frame number through to the distorted video. We would have to change code in the media server, making this feature reliant on any changes to the media server. We would also need to edit WebRTC code to include this field. WebRTC is an open-source project that provides web browsers and mobile applications with real-time communication.
    3. Stamping a barcode onto each reference frame, and then reading the barcode from each distorted frame to map it to an original frame. The disadvantage of using a barcode is that there is no guarantee that it will survive the media server’s encoding, unlike options one and two. However, there would be less modification of existing code, and functionality should not be impacted if the media server code is modified. A barcode should be able to survive some degree of re-encoding, as a barcode is still readable even if the frame undergoes quality loss.


    After serious consideration, I decided that going with the barcode option was optimal. I did some further research on barcodes to investigate different ways to implement them for our use case.

    1. Using a 1D Barcode.

    This is likely not a viable option, because it most likely will not be able to survive the distortion in all scenarios, due to the lines being very thin. This was tested with a sample image with a 1D barcode stamped onto it. FFmpeg was then used to convert it to a significantly lower resolution and then scale it back to the original resolution. The original and distorted images were fed into a simple online barcode reader (it is assumed that the barcode reader has a similar capability to a C++ library that can decode 1D barcodes), and only the original image was recognized. The distorted image was compressed by a factor of 25 in pixel count (a factor of 5 in each dimension, from 930×565 to 186×113). In the images below, 1dout.jpeg is the distorted image.

    Distorted Image (186×113)
    Original Image (930×565)

    As you can see, the image quality of the distorted image is still decent, but the barcode is not decodable.

    2. Using a QR Code

    A Quick Response (QR) code seems like a more viable option than a 1D barcode because there isn’t the issue of struggling to read extremely thin lines, since it is 2 dimensional. Additionally, there are open source C++ libraries that can successfully read QR codes from images. The drawback of this method is that the minimum size for a QR code is 21×21, which is unnecessarily large for indexing the frames. Having a 21×21 QR code will make it less resistant to scaling than a smaller counterpart. For example, if our “code” takes up a constant percentage of the frame, a barcode with fewer bits (such as 10×10) will make the code easier to read, and more resistant to scaling.

    3. Using a Data Matrix

    A data matrix is an alternative and similar option to a QR code. The difference is that the minimum size for a data matrix is 10×10. A data matrix also has a larger margin for error correction than a QR code. Its ability to survive scaling was tested by scaling the resolution of a data matrix image down by a factor of 25 (the same as for the 1D barcode). The reader was still successfully able to decode the distorted image, unlike the 1D barcode. In the images below, dataout.jpeg is the distorted image.

    Original Image (745×540)
    Distorted Image (149×108)

    Comparing Data Matrices to QR Codes

    Data Matrix
    QR Code

    The first image shows a data matrix, and the second image shows a QR code. As you can see, the individual bits for the data matrix are significantly larger than the QR code bits, meaning that it will be able to survive the media server’s re-encoding process more easily. Below is a table comparing Data Matrices to QR codes.

    Data Matrix vs QR Code [1]

    Although the QR code can encode a large range of data, a Data Matrix is more than sufficient for our use case, as we are simply encoding the frame number in the matrix. After some more research on data matrices, I was able to find a suitable C++ library that is capable of encoding and decoding data matrices. Therefore, I decided to use data matrices in encoding frame numbers into the reference video frames.

    Data Matrices

    The frame that is passed through the media server is in YUV format (specifically I420 format), so we would need to write a data matrix that encodes the frame number using this video frame format.

    In a YUV frame, the Y-plane represents the luminance (brightness) component of the frame, and the U and V planes represent the chrominance (color) components of the frame. When Fresno conducts its analysis on a video frame pair, it only uses the Y-plane to generate its score. The images below show what a frame would look like with, and without values in the UV planes.

    Original Image
    Image with neutralized UV Planes

    Initially, I implemented encoding and decoding the data matrix in Fresno. Eastwood would use Fresno to encode a data matrix onto reference videos before sending them to Airtime’s media server. Clovis would then use Fresno to decode the matrix values. This implementation proved successful for basic use cases. However, when severe resolution or bitrate restrictions were put on the distorted video, the decoder failed to read several barcodes. Optimizations needed to be made for both the matrix encoder and decoder to account for more restrictive scenarios.

    One thing that I noticed was that the barcodes were always 70px by 70px. For larger resolutions, this meant that the barcode was often less than 1% of the total frame. If we were to increase the barcode size before passing it through the media server, it would likely survive the re-encoding process more easily. However, we would not want to increase the barcode size so much that it took over a significant portion of the frame. After careful consideration, I decided to increase the barcode size until the barcode’s width and height reach ⅓ of the smallest dimension of the frame. The barcode size can only be increased in whole multiples of the base 70×70 size, so the only possible dimensions are 70×70, 140×140, 210×210, and so on. For example, a 1280×720 video frame would have a barcode size of 210×210. This is because if we divide the minimum dimension of the video frame (720) by 3, we would have 240. The highest multiple of 70 that is less than 240 is 210, so our barcode size would be 210×210.
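
    The sizing rule can be written as a short Python function. The function name is hypothetical, and two details are assumptions rather than something stated above: a multiple that lands exactly on the ⅓ limit is allowed, and frames too small for a scaled-up barcode keep the base 70×70 size.

```python
def barcode_size(width, height, base=70):
    """Largest multiple of `base` that fits within a third of the smaller dimension."""
    limit = min(width, height) / 3
    multiples = int(limit // base)
    # Assumption: never go below the base size, even on very small frames.
    return max(1, multiples) * base


print(barcode_size(1280, 720))   # 210
print(barcode_size(1920, 1080))  # 350
```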

    Additionally, neutralizing the UV planes of the data matrix makes it more resilient against the types of distortions introduced by the video encoding process by the media server. Below are examples of video frames with and without a neutralized barcode.

    Barcode Region’s UV Planes neutralized
    Barcode Region’s UV Planes not neutralized
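
    As an illustration of what neutralizing means for an I420 frame, the Python sketch below sets the U and V samples that cover a square region to 128, the neutral chroma value for 8-bit video. It assumes the standard I420 layout (a full-resolution Y plane followed by U and V planes subsampled by 2 in each dimension) and even frame dimensions. The function name and the barcode position are hypothetical.

```python
def neutralize_uv(frame, width, height, x, y, size):
    """Set the chroma samples covering a size x size luma region at (x, y) to 128."""
    chroma_w, chroma_h = width // 2, height // 2  # chroma planes are subsampled 2x2
    u_start = width * height                      # U plane follows the Y plane
    v_start = u_start + chroma_w * chroma_h       # V plane follows the U plane
    for row in range(y // 2, min(chroma_h, (y + size + 1) // 2)):
        for col in range(x // 2, min(chroma_w, (x + size + 1) // 2)):
            frame[u_start + row * chroma_w + col] = 128
            frame[v_start + row * chroma_w + col] = 128
    return frame


# A tiny 8x8 I420 frame (96 bytes) with every sample set to 200.
frame = bytearray([200]) * (8 * 8 * 3 // 2)
neutralize_uv(frame, 8, 8, x=0, y=0, size=4)
print(frame[64], frame[66])  # 128 200: inside and outside the region in the U plane
```

    The Y plane is left untouched, so the barcode's black and white modules are preserved while the region loses any color information.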

    After performing these optimizations, you may be curious about how well the barcode survives distortion, as well as its effect on our final VMAF score.

    Limitations of Data Matrices

    Two main factors impact how well data matrices survive distortion.

    1. The change in resolution.
    2. The bitrate constraint

    I ran some tests to see how many barcodes would be undecodable given resolution and bitrate constraints. I used a sample 720×720 video file with 256 frames generated by Eastwood. The tables below show the independent effects of resolution and bitrate constraints on barcode decoding.

    Resolution vs # Frames Decoded
    Max Bitrate vs # Frames Decoded

    Below are frames from the video file that were used to generate the data sets above. The reference frame, the 100kbps frame, and the 90×90 frame are shown respectively.

    Original Frame
    Frame at 100kbps
    Frame at 90×90

    We can see that decoding the barcode region in our frame is more resistant to changes in resolution than bitrate. Even when we shrink the distorted video to approximately 1% of the reference video’s size, we are still able to decode about 90% of the frames. In the scenario of limiting bitrate, more frames are unable to be decoded for some extreme scenarios. However, even if several frames are unable to be decoded, the rest of the frames would still get passed into VMAF, generating a score that is likely very similar to the score that would’ve been generated if all frames were analyzed by VMAF.
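
    The way undecodable frames drop out of the analysis can be sketched in Python. The names are hypothetical, and `decode_frame_number` stands in for the data matrix decoder, returning None when a barcode cannot be read.

```python
def pair_frames(reference_frames, distorted_frames, decode_frame_number):
    """Match each distorted frame to its reference frame via the decoded frame number."""
    pairs, skipped = [], 0
    for frame in distorted_frames:
        number = decode_frame_number(frame)
        if number is None or not 0 <= number < len(reference_frames):
            skipped += 1  # unreadable barcode: leave this frame out of the score
            continue
        pairs.append((reference_frames[number], frame))
    return pairs, skipped


reference = ["ref0", "ref1", "ref2", "ref3"]
distorted = [{"barcode": 0}, {"barcode": None}, {"barcode": 3}]
pairs, skipped = pair_frames(reference, distorted, lambda f: f["barcode"])
print(len(pairs), skipped)  # 2 1
```

    Only the pairs that are returned would be passed on for scoring, so the final average is computed over the frames whose barcodes survived.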

    It’s also important to note the impact the actual barcode has on the VMAF score as well. Since we are writing the barcode region in the Y-plane of the frame, it’s only natural for this to affect the VMAF score, which also depends on the Y-plane of the frame. To investigate this, I ran 2 sets of frame pairs (one without and one with the barcode) that were both scaled from 720×720 to 360×360, each with 100 frames through Fresno. The table below shows the VMAF score of every tenth frame pair.

    VMAF Score: Without Barcode vs With Barcode


    To simulate the effect of the media server’s re-encoding process on the VMAF, more tests were run to find the independent effects of bitrate and resolution on VMAF scores. Using the same reference video as the test above, the tables below illustrate how the VMAF score changes under resolution and bitrate constraints respectively.

    Resolution vs VMAF Score
    Bitrate vs VMAF Score

    We can see that for both resolution and bitrate constraints, the VMAF score starts dropping significantly under more severe constraints. Both constraints seem to have a logarithmic relationship with the VMAF score, although the VMAF score seems to drop more quickly for resolution constraints. This is the opposite of the number of unreadable data matrices given these constraints, as decoding data matrices is more resistant to resolution changes than bitrate changes.

    For resolution constraints, the VMAF score drops to 0 when the distorted resolution is approximately 1% of the original size. In these scenarios, some data matrices are unable to be decoded as well. Therefore, it is safe to conclude that whenever data matrices are unreadable due to resolution constraints, the VMAF score would have been 0, or an extremely low value anyway.

    On the contrary, for bitrate constraints, the VMAF score does not drop as low for severe conditions, but more data matrices become unreadable. When a few data matrices are unreadable due to bitrate constraints, it is still entirely possible to get a valid VMAF score (see 200kbps example). However, when a significant number of data matrices are unable to be decoded due to bitrate constraints, the VMAF score would likely have been a very low number (see 100kbps example).

    To simulate a more realistic re-encoding of a video file, I used my phone to take a 720×1280 video of myself and simultaneously restricted the bitrate and resolution. The reference video and distorted video were then run through Clovis. Below is the table that shows the results of this test.

    VMAF Score under Resolution and Bitrate Constraints

    The results in this table very accurately reflect the trends found in the independent tests.


    Finally, we’ve made the dream of end-to-end objective video quality analysis at Airtime a reality! Now that we’re able to analyze the video quality that users experience under different network constraints, what else needs to be done?

    Clovis still needs to be integrated into Airtime’s testing environments. Being able to determine a score for videos under different constraints will allow Airtime’s testers and developers to further optimize the media server’s encoder, improving the app experience for all users of Airtime!

  • Interning at Airtime

    This past Fall I worked at Airtime as a software engineering intern on the media team. Specifically, I implemented a video quality analysis framework into the media team’s testing applications. See my other article, Objective Video Quality Analysis at Airtime, to learn more about my project in detail!

    As a third year University of Waterloo biomedical engineering student with a specialization in software, I was drawn to Airtime by the opportunity to experience:

    • Meaningful and impactful work
    • Cool tech stack
    • Great location
    • Nurturing environment
    • Balance
    • Fun!

    Meaningful and impactful work

    When interviewing for Airtime, my manager and coworker described my potential project in great detail. I learned the history of its conception and its importance to Airtime. The video quality analysis system was vital to our testing suite, providing the company with meaningful data and allowing our manual testers to focus on other tasks.

    My project also had numerous extensions, eliminating the possibility of finishing the project early and twiddling my thumbs for the rest of the term.

    Me at Airtime!

    Cool Tech Stack

    Another important part of being an intern is exposure to neat tech stacks! Airtime’s media team predominantly codes in C++, which is a foundational programming language. Yes, type casting and memory management can be a pain, but these considerations will whip you into a more thoughtful programmer and will make coding in other languages seem like a breeze.

    Airtime’s code base is composed of many different applications, libraries, and frameworks that communicate with each other. Because of these dependencies, our engineers must be mindful of code compatibility, maintainability, and readability. Even though these tools are used internally, we continue to implement best practices and industry standards to ensure the highest quality and mitigate future problems.

    Airtime starter pack and work station

    Nurturing Environment

    Airtime is an incredibly nurturing environment. Daily stand-ups allow the media team to stay connected and are a great place to ask for expert advice on road-blockers from the previous day. Weekly or bi-weekly one-on-ones are scheduled for all employees and provide an opportunity to ask questions, reflect on your performance, or chat about your weekend plans. Plus, the office is an open concept, encouraging people to consult each other throughout the work day.

    Airtime is a family composed of incredibly talented and kind individuals. As a family, we celebrate company successes and reflect on recent failures through our biweekly “All-hands” meetings. Holiday and birthday parties also bring the company together for fun activities and delicious treats!

    Birthday cannoli!

    Great Location

    Airtime has offices in Brooklyn, New York and Palo Alto, California! As a member of the Palo Alto office, I was able to experience life in the Bay Area and extensively explore downtown Palo Alto. On my one-on-ones with my manager, we’d walk through the residential areas and stop to pet curious cats. Every Friday, the office takes lunch trips to Sancho’s, our favorite local taco shop. Our office is also a short 5 minute walk to the PA Caltrain station, which is awesome for commuting from SF or from South Bay.

    Sunrise at the Santa Clara Caltrain station — Note, usually I took the 8:47am train 🙂

    Balance


    Work-life balance is important for sustainable productivity and satisfaction. Flexible work hours allow everyone to beat Bay Area traffic. Online communication also provides the opportunity to work remotely on occasion.

    Well-being is emphasized with a kitchen stocked with fresh snacks and drinks. Also, every Tuesday and Thursday, one of my mentors and I would run 8 miles to Mountain View, where we would then play basketball with other coworkers. On our runs, we chatted about my project, software topics, and history.

    Sweaty with hard work after basketball (Left). Shoreline Trail run with Jenny and Matthew (Right)!

    While interning, I was also able to take many weekend trips! Work-hard, play-hard. Over the course of 16 weeks, I traveled to San Francisco (NOT “San Fran”), Napa Valley, Santa Cruz, Monterey Bay, Lake Tahoe, Los Angeles, and Tucson, Arizona.

    Weekend travels to SF, Tahoe, LA, and Arizona.

    Fun!


    Every day in the Airtime office was a joy. From planting plastic cockroaches on desks of unsuspecting coworkers to reviving a succulent plant with La Croix, each day was filled with fun and adventure. FIFA on Fridays fuels a friendly workplace competition.

    Outside of the office, we went on numerous team bonding outings, including a hike and go-kart racing! With delicious food and company-wide competitions for best costumes and ugly sweaters, holiday parties were also a blast.

    Palo Alto Halloween and Christmas Parties 2019.

    Interning at Airtime was an amazing and incredibly impactful experience. I loved every day in the office and looked forward to working on my project. I will truly miss the Airtime family.

    Although I had to return to school, thus ending my internship at Airtime, you can still join! Airtime is hiring: .

    Me and my manager Jim at the holiday party!
  • Objective Video Quality Analysis at Airtime

    How far is New York City from Los Angeles? Far.
    How cold is the North Pole? Freezing.
    How hard is a diamond? Hard.

    If you are inquisitive, or have a knack for technicalities, the above qualitative responses are insufficient answers. You’d prefer to know that:

    NYC and LA are 2,446 miles apart, measuring along a direct flight route.
    In January, the North Pole averages a high of 2°F and a low of -13°F.
    A diamond has a hardness of 10 on the Mohs Hardness Scale.

    This leads us to the question:
    How good is the video quality on Airtime?

    As humans, we like to know the trends of change, and we like to know the delta, or quantitative measure, for this change. The data of a video stream slightly changes, too, as it is sent from one device and received by another. However, video quality is still predominantly measured with subjective analysis, where adjectives like “bad”, “okay”, or “great” are assigned to video streams after they have traversed the network.

    Subjective evaluation, typically measured as a Mean Opinion Score, is costly, time-consuming, and inconvenient. Furthermore, there may be disagreements in how to interpret the viewers’ opinions and scores.

    At Airtime, we recognized the need to objectively quantify the quality of our real-time video chatting and streaming services. As a result, we embarked on a quest to implement objective video quality analysis into our real-time test applications.

    In the Airtime app, a user publishes video captured by their camera to other members subscribing to the video chat. As the video frames are transferred, the original image can become distorted by the video encoder as it reacts to constraints like network bandwidth and the processing power of the subscribing device. Therefore, the image rendered on the subscriber’s screen can appear differently than what was originally captured on the publisher’s camera. Formally, the original image is referred to as the reference image, and the received image is known as the distorted image.

    High level work flow of video traversal from a publisher to the receiving clients.

    The perceived difference in quality between a reference image and a distorted image is dependent on the human visual system. Different types of errors have different weightings in their effect on the perception of video quality. For instance, we accept changes in contrast more readily than added blurring or blockiness.

    Below is an example of a reference image (top left corner) and five distorted images. All of the distorted images have the same Mean Squared Error (MSE) quality score. MSE averages the squared intensity differences of the reference image pixel to the corresponding pixel in the distorted image. Although the images have the same MSE score, when looking at the image collection, it is clear that a manual tester would classify some of the distortions as worse than others.

    Comparison of five distorted images to a reference image, displaying the difference between algorithmic measurement of error and human perceived error [1].
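
    As a minimal sketch of why MSE alone can mislead, the Python snippet below computes MSE for two made-up distortions of a tiny grayscale patch. The pixel values and the helper name `mse` are illustrative assumptions, not part of Airtime's tooling or of the figure above.

```python
def mse(reference, distorted):
    """Mean of squared per-pixel intensity differences between two equal-size images."""
    if len(reference) != len(distorted):
        raise ValueError("images must have the same number of pixels")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


# A flattened 2x4 grayscale patch (hypothetical values in the 0-255 range).
reference = [50, 80, 120, 160, 60, 90, 130, 170]

# Distortion A: uniform brightness shift of +10 on every pixel.
brighter = [p + 10 for p in reference]

# Distortion B: alternating +10 / -10 noise.
noisy = [p + (10 if i % 2 == 0 else -10) for i, p in enumerate(reference)]

print(mse(reference, brighter))  # 100.0
print(mse(reference, noisy))     # 100.0
```

    Both distortions receive the same MSE, even though a viewer would likely judge a uniform brightness shift and pixel-level noise very differently.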

    To best replicate a user’s experience and perception of quality, it is necessary that the video quality analysis algorithm implemented by Airtime considers the human visual system.

    Objective Quality Analysis

    Objective image quality analysis can be split into three categories of implementation:

    1. Full-Reference: The entire reference image is known.
    2. Reduced-Reference: Only features of the reference image are known and used in computation.
    3. No-Reference: Nothing is known about the reference image. This is a “blind” analysis.


    Full-reference models allow for complete comparison of the distorted image to the reference image. They are characterized by having high accuracy; however, they require a backchannel to pass the original image to where the comparison is taking place.

    Common algorithms include PSNR, histogram analysis, SSIM, and MSE, which was illustrated above.

    Peak Signal-to-Noise Ratio (PSNR) is an objective algorithmic analysis that treats the change in photo quality as an error signal overlaying the original image. This method does not assign weightings to different types of errors; thus, all errors are handled as if they have the same visual impact.
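
    For reference, PSNR is conventionally derived from MSE as 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. The sketch below assumes 8-bit pixels (MAX = 255); the function name `psnr` is illustrative.

```python
import math


def psnr(mse_value, max_pixel=255.0):
    """PSNR in decibels for a given mean squared error and peak pixel value."""
    if mse_value == 0:
        return math.inf  # identical images: there is no error signal
    return 10.0 * math.log10(max_pixel ** 2 / mse_value)


# A lower MSE yields a higher PSNR.
print(round(psnr(100.0), 2))
print(round(psnr(25.0), 2))
```

    Because the score depends only on MSE, two distortions with equal MSE always receive equal PSNR, regardless of how visible each one is.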

    Histograms can be used for similarity analysis. The reference and distorted images are graphically represented as histograms where the x-axis is tonal variation and the y-axis is the number of pixels for that particular tone. The two histograms are then compared for similarity using various mathematical formulas such as correlation, chi-square, intersection, Bhattacharyya distance, and Kullback-Leibler divergence. Multiple histograms may be used to analyze other features within the image such as luminance.
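
    To make the comparison concrete, here is a small sketch of two of the measures named above: histogram intersection and a Bhattacharyya distance. It assumes normalized histograms with four tonal bins and uses the Hellinger form sqrt(1 - BC) for the distance; the pixel values and function names are made up for illustration.

```python
import math


def histogram(pixels, bins=4, max_value=256):
    """Normalized tonal histogram: fraction of pixels falling into each bin."""
    counts = [0] * bins
    width = max_value / bins
    for p in pixels:
        counts[min(int(p / width), bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def bhattacharyya_distance(h1, h2):
    """sqrt(1 - BC), where BC is the Bhattacharyya coefficient; 0.0 means identical."""
    bc = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - bc))


reference = [10, 20, 70, 90, 130, 150, 200, 250]  # hypothetical pixels
distorted = [10, 20, 30, 90, 130, 150, 160, 250]

h_ref = histogram(reference)
h_dis = histogram(distorted)
print(intersection(h_ref, h_dis))
print(bhattacharyya_distance(h_ref, h_dis))
```

    Note that histograms discard spatial layout, so two images with the same tonal distribution but different structure would compare as identical under these measures.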

    The Structural Similarity Index (SSIM) is a variation of MSE analysis that incorporates the human visual system. Weightings are assigned to luminance, contrast, and image structure. The algorithm is computationally quick and considers human perception; however, it does not consider viewing distance or screen size.
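
    The following is a simplified sketch of the SSIM formula from Wang et al. [1], computed over one global window. Production implementations evaluate the index over local sliding windows and average the results, so treat this as an illustration of the luminance, contrast, and structure terms rather than a replacement for a library function. The pixel values are hypothetical.

```python
def ssim_global(x, y, dynamic_range=255.0):
    """Single-window SSIM over two equal-size grayscale images (flattened lists)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


reference = [50, 80, 120, 160, 60, 90, 130, 170]
brighter = [p + 10 for p in reference]  # uniform brightness shift
noisy = [p + (10 if i % 2 == 0 else -10) for i, p in enumerate(reference)]

# Both distortions have the same MSE, but SSIM separates them.
print(ssim_global(reference, brighter))
print(ssim_global(reference, noisy))
```

    In this example the brightness shift keeps the structure term at 1 and only slightly lowers the luminance term, while the alternating noise lowers the structure term, so it receives the lower score.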

    When researching, we also came across Video Multi-method Assessment Fusion (VMAF), an open-source software package developed by Netflix. The VMAF algorithm calculates a quality score on a scale of 0 to 100, which serves to replicate the 0 to 100 opinion scoring that subjective evaluation typically uses. A score of 20 maps to “bad” quality, and a score of 100 maps to “excellent” quality. VMAF analysis considers the human visual system, as well as display resolution and viewing distance. Additionally, VMAF is capable of calculating PSNR and SSIM scores.


    Reduced-Reference models require information about the reference image, such as structural vectors, but they do not require the entire image. Therefore, they are most useful in cases with limited bandwidth. Since the full reference image is not available for analysis, there is a decrease in accuracy in comparison to full-reference models.

    Both ST-RRED and SpEED-QA are reduced-reference models.


    No-Reference models do not use any information from the reference image. Instead, a machine learning model is trained with a supplied dataset of images. The model is then able to identify features in the distorted image (blockiness, contrast levels etc.) and correspond these errors to examples found in the image dataset. A resulting quality score is then calculated.

    This method has lower accuracy than full-reference models, since the computation relies on a provided training dataset. Accuracy can be improved by increasing the variation of errors found in the data set. No-Reference models do not require a backchannel to the reference image which allows for easier integration into the service provision chain.

    Examples of no-reference models are Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) and Naturalness Image Quality Evaluator (NIQE), which differ in how their models are trained.

    The BRISQUE model is trained with a dataset of images and their corresponding subjective scores. This allows for low computational complexity, which makes BRISQUE highly compatible with real-time applications. However, because the algorithm calculates scores based on previous image-to-score pairings, the algorithm is unable to detect errors that it has not previously been exposed to during training.

    On the other hand, the NIQE model is trained with a dataset of images; it is not trained with subjective analysis data. This is beneficial, because the system can begin to recognize errors that it is not explicitly exposed to during training by noting vectorized similarities between types of errors. NIQE, however, does not take into account the human visual system.

    In order to ensure that the quality analysis framework that Airtime implements is accurate and truly quantifies the video quality of a given stream, there are multiple considerations that need to be made during development, including frame synchronization, pooling strategy, and region of interest.

    Frame Synchronization

    Because full-reference and reduced-reference models require the reference image for evaluation, there is a need to synchronize the frames of the reference and distorted stream, in addition to adding a backchannel to grab the frames. It is important to note that Airtime’s encoder will drop frames when placed under network constraints; therefore, our video quality analysis system must be able to only match frames that will have a decoded counterpart.

    Common methods for synchronization are:

    1. Optical Character Recognition (OCR) — Text, such as a frame number, or an image, like a QR code or barcode, is included in the reference video.
    2. Pattern Match — Comparison of the structure and/or coloration of a small region of interest in both the reference and distorted frames — note, this method is only useful for pattern changes and generated examples.

    Both OCR and pattern match methods require extensive work to implement, which would detract from our main focus of getting an initial quality analysis system up and running. Therefore, it was determined that we would create our own matching solution in the case of proceeding with a full-reference or reduced-reference model.

    Pooling Strategy

    The majority of the evaluation methods described above are designed to be applied to an image or single frame in a video. In order to measure the quality of an entire video segment, analysis is conducted on each frame, or on a sample of frames in a video. The data is then pooled to calculate a single quality metric. The pooling strategy that leads to the most accurate results differs with each quality analysis model. Recommended pooling strategies can be found in: Study of Temporal Effects on Subjective Video Quality of Experience.
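
    As a rough illustration of pooling, the sketch below collapses a list of per-frame scores with three common strategies: arithmetic mean, harmonic mean, and worst frame. The scores and the function name `pool_scores` are hypothetical, and which strategy suits a given model is a question for the study referenced above.

```python
from statistics import harmonic_mean, mean


def pool_scores(frame_scores, strategy="mean"):
    """Collapse per-frame quality scores into a single number."""
    if not frame_scores:
        raise ValueError("need at least one frame score")
    if strategy == "mean":
        return mean(frame_scores)
    if strategy == "harmonic":
        return harmonic_mean(frame_scores)  # weights low-scoring frames more heavily
    if strategy == "min":
        return min(frame_scores)  # worst-frame pooling
    raise ValueError(f"unknown pooling strategy: {strategy}")


scores = [98.0, 95.0, 60.0, 55.0, 70.0]  # hypothetical per-frame scores
for name in ("mean", "harmonic", "min"):
    print(name, round(pool_scores(scores, name), 2))
```

    The harmonic mean never exceeds the arithmetic mean, so it is one way to let brief quality dips pull the pooled score down more strongly.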

    Region of Interest

    An important consideration when conducting quality analysis is the region being analyzed. The decision to crop an image and only analyze a region of interest or to analyze the entire image will greatly affect quantitative scoring.

    At Airtime, our general use-case is a video of a user’s face centered in the screen, since our app is built for video chatting. Background imperfections are not as noticeable as errors in the center of the screen, since the viewer’s focus is on the person. The effect of selecting a region of interest is illustrated below by running an image of a cat through MATLAB and conducting image quality analysis using a built-in SSIM function. All grayscale images are compared to the corresponding original image and scored from 0 to 1, where 1 is a perfect match.

    The effects of analyzing different regions of interest during objective quality analysis.

    The scores greatly differ depending on the analyzed region. If we take the full image, the distortion with the highest quality is Distortion 2; however, if we use the cropped images, then the distortion with the highest quality is Distortion 3. Perceptually, Distortion 3 seems to be the best quality. This indicates that isolating a region of interest may be the most accurate method when implementing Airtime’s objective video quality analysis.
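
    The same ranking flip can be reproduced on toy data. The sketch below swaps in MSE for the MATLAB SSIM function (lower is better here) and uses a made-up 4x4 image with two artificial distortions; the helper names are illustrative.

```python
def center_crop(image, crop_h, crop_w):
    """Return the centered crop_h x crop_w region of a 2D list of pixel rows."""
    h, w = len(image), len(image[0])
    if crop_h > h or crop_w > w:
        raise ValueError("crop is larger than the image")
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def mse_2d(a, b):
    """Mean squared error between two equal-size 2D pixel grids."""
    diffs = [(p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


reference = [[100] * 4 for _ in range(4)]

# Distortion 1: only the border pixels change (+10).
border = [[p + 10 if r in (0, 3) or c in (0, 3) else p
           for c, p in enumerate(row)] for r, row in enumerate(reference)]

# Distortion 2: only the central 2x2 block changes (+15).
center = [[p + 15 if r in (1, 2) and c in (1, 2) else p
           for c, p in enumerate(row)] for r, row in enumerate(reference)]

for name, img in (("border", border), ("center", center)):
    full = mse_2d(reference, img)
    roi = mse_2d(center_crop(reference, 2, 2), center_crop(img, 2, 2))
    print(name, full, roi)
```

    On the full image the center distortion has the lower error, but inside the cropped region the border distortion is error-free, so the choice of region decides which distortion ranks as better.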


    After preliminary research, it was determined that Airtime’s quality scoring algorithm must:

    • Provide a single quality score for a specified stream — using an accurate pooling strategy.
    • Be able to run on real-time video streams (and allow for delay in input material retrieval).
    • Reflect the human visual system in its scoring.
    • Be embedded into Airtime’s testing environments (C++ compatible).
    • Not be too computationally expensive.
    • Handle dropped frames (and match frames in the cases of full or reduced reference).
    • Handle frame rescaling.
    • Run on Mac, Linux, and iOS.

    With these requirements in mind, a decision was made to select the quality analysis algorithm that best fit Airtime’s needs.

    The Decision

    In the end, the open-sourced VMAF analysis was selected for Airtime’s video quality analysis algorithm.

    The full-reference VMAF static C++ library best fit Airtime’s requirements. The software calculates a VMAF score, which is based on data from subjective analysis, and accurately reflects human perception. Moreover, it also has the capacity to calculate SSIM and PSNR scores, which is useful for result validation.

    However, many personalized modifications would need to be made to the library, including the analysis of real-time frame inputs rather than predetermined files. A backchannel in our video system must also be constructed in order to extract the original, reference frames.


    To handle the needed changes, an application programming interface (API) was developed to act as the communication link between Airtime’s testing tools and the modified VMAF. This was the birth of the Fresno project, whose name follows Airtime’s media team’s California landmark project naming convention — fun fact, Fresno is the largest raisin producer in the world!

    Fresno makes it into Airtime’s Jira project tracking dashboard!

    To accurately pair corresponding reference and distorted frames, we dove deep into WebRTC, the open source video chat framework that Airtime is built upon — see here for more details about Airtime’s use of WebRTC.

    In our use of WebRTC, frames are dropped before the encoding process when there is not enough bandwidth to send every frame. Because of this, we are able to intercept the pre-encoded, or reference, frame before the encoding process and after the potential frame drop checkpoint. This timing ensures that every reference image stored for Fresno will have a matching distorted frame. By accessing the reference and distorted frames in WebRTC, we were able to overcome the hurdle of frame synchronization.
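
    The pairing idea can be sketched as follows. This is not Fresno's code: the class name, the method names, and the use of a timestamp as the matching key are all assumptions made for illustration. The only property taken from the description above is that reference frames are stored after the drop checkpoint, so every stored frame should eventually meet a decoded counterpart.

```python
class FramePairer:
    """Pairs reference frames with decoded frames by a shared timestamp key."""

    def __init__(self):
        self._pending = {}  # timestamp -> reference frame awaiting its decoded twin

    def on_reference_frame(self, timestamp, frame):
        # Hypothetical hook: called after the frame-drop checkpoint, before encoding.
        self._pending[timestamp] = frame

    def on_decoded_frame(self, timestamp, frame):
        # Hypothetical hook: returns (reference, distorted), or None if unmatched.
        reference = self._pending.pop(timestamp, None)
        if reference is None:
            return None
        return reference, frame

    def pending_count(self):
        return len(self._pending)


pairer = FramePairer()
# The frame at t=33 was dropped before the checkpoint, so it is never stored.
for ts in (0, 66, 99):
    pairer.on_reference_frame(ts, f"ref-{ts}")

pairs = [pairer.on_decoded_frame(ts, f"dec-{ts}") for ts in (0, 66, 99)]
print(pairs)
print(pairer.pending_count())
```

    Because dropped frames never enter the pending map, no reference frame is left waiting for a distorted frame that will not arrive.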

    In addition to dropping frames, frame resolution will be decreased by our encoder in cases with limited bandwidth. Fresno is able to detect this difference between the reference and distorted frames and rescale the distorted frame to match the dimensions of the reference frame.
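
    A minimal sketch of that rescaling step is shown below, using nearest-neighbor sampling on 2D lists of pixels. The interpolation method Fresno actually uses is not stated here, so nearest-neighbor and the function names are assumptions chosen for brevity.

```python
def rescale_nearest(image, out_h, out_w):
    """Resize a 2D list of pixels to out_h x out_w using nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def match_dimensions(reference, distorted):
    """Return the distorted frame rescaled to the reference frame's dimensions."""
    ref_h, ref_w = len(reference), len(reference[0])
    if (len(distorted), len(distorted[0])) == (ref_h, ref_w):
        return distorted  # already the same size, nothing to do
    return rescale_nearest(distorted, ref_h, ref_w)


reference = [[0] * 4 for _ in range(4)]  # 4x4 reference frame
distorted = [[1, 2], [3, 4]]             # encoder reduced the resolution to 2x2

for row in match_dimensions(reference, distorted):
    print(row)
```

    Once both frames share the same dimensions, a per-pixel comparison can be run on them directly.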

    Fresno itself is written in C++ and utilizes process synchronization to ensure proper timing and data collection. It allows user specified start and stop of video quality analysis, and upon completion, it returns the aggregate VMAF quality score. A series of command line options allows for user specified configuration of the analysis. A JSON data log also fills with frame by frame analysis data, including PSNR, SSIM, MS-SSIM, and VMAF scores.

    Sample of the quality analysis data report.

    Fresno is currently compatible with OS X and is integrated into Airtime’s video publication and subscription test application. Here, Fresno analyzes the frames of the publishing stream for the specified duration of the testing session.

    Fresno in Action

    To test the integrity of Fresno’s analysis, quality analysis was conducted on a pattern generated video stream for differing constraints, specifically analysis duration, frame rate, and bitrate. VMAF scores can be interpreted on a linear scale where a score of 20 corresponds to “bad” quality and 100 to “excellent” quality.

    VMAF score interpretation.

    Bitrate is the speed at which data are transmitted along a network. Quality analysis plots for 250kbps, 1Mbps, and 2Mbps can be seen below.

    Quality scores per frame for generated video streams at 250kbps, 1Mbps, and 2Mbps.

    Across the board, the video quality is initially high. The first 25 to 50 frames have scores ranging from 95–100, which translates to excellent quality. Once the network is initially probed, the encoder responds to the available bandwidth and the quality drastically decreases.

    At 250kbps, the network never improves, and the scores fluctuate between 50 and 70 for the rest of the video stream. At 1Mbps, the quality increases from 75 to 85 as the network is probed and bandwidth is recovered. At 2Mbps, sufficient bandwidth is available, and the quality score skyrockets to 100. It is important to remember that a score of 100 does not indicate identical frames. Rather, a score of 100 translates to excellent quality where the difference between the reference and distorted frame is negligible.

    For each 30 second video stream, the aggregate VMAF scores are as follows:

    • 250kbps → VMAF score: 63.15, fair
    • 1Mbps → VMAF score: 84.84, good
    • 2Mbps → VMAF score: 89.76, very good

    These scores are the arithmetic mean of the individual frame scores of the testing session.

    Quality score distribution for generated video streams at 250kbps, 1Mbps, and 2Mbps.

    Overall video quality is higher for videos streamed at a higher bitrate, which is expected.

    The Future of Fresno

    The Fresno project is still in progress as we continue to build upon its compatibility with our testing environments. We plan to integrate Fresno into our media server, so we can evaluate the quality of transcoding streams and end-to-end data from live cameras. Further extensions of Fresno, such as analyzing a specified region of interest, are also in the future plans. Additionally, we plan to iterate upon our current WebRTC frame synchronization method and implement a less invasive method, such as optical character recognition.

    In the meantime, we can celebrate the fact that we can now objectively quantify Airtime’s video quality in real-time on OS X. Fresno will allow our testers to numerically score videos under different network constraints. This quantitative analysis will allow us to further optimize our encoder and understand how it handles situations of limited bandwidth.

    With Fresno in Airtime’s codebase, we can now rest, knowing that we can begin to objectively, and sufficiently, answer the question, “How good is Airtime’s video quality?”


    [1] Z. Wang, A. Bovik, H. Sheikh and E. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity”, IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004. Available: 10.1109/tip.2003.819861.

  • Software Development at Airtime

    In our case, tickets start in an Open state and are assigned by the engineering manager to a developer. That developer moves that ticket to In Development at which point it belongs to that developer.

    When that developer has implemented the ticket, they move it to Branch Testing and the ticket is now owned by QA. QA can either fail the Branch Testing and send the ticket back to the developer or indicate Branch Testing Passed at which point the developer owns merging the feature branch.

    Once merged, QA once again owns performing integration testing and moving the ticket from On Staging to Ready for Release.

    Note that in the diagram above Ready for Release transitions to Resolved once a new version of the app has been released or the service deployed.
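
    The ticket flow described above can be written down as a small state machine. This is an illustrative sketch, not Airtime's Jira configuration: it assumes a failed Branch Testing returns the ticket to In Development and that merging moves it to On Staging. The state names come from the text; the `move` helper is hypothetical.

```python
# Allowed ticket transitions, read off the workflow described above.
TRANSITIONS = {
    "Open": {"In Development"},
    "In Development": {"Branch Testing"},
    "Branch Testing": {"In Development", "Branch Testing Passed"},
    "Branch Testing Passed": {"On Staging"},
    "On Staging": {"Ready for Release"},
    "Ready for Release": {"Resolved"},
    "Resolved": set(),
}

# Who owns the ticket in each state, where the text says so.
OWNERS = {
    "Open": "engineering manager",
    "In Development": "developer",
    "Branch Testing": "QA",
    "Branch Testing Passed": "developer",
    "On Staging": "QA",
}


def move(state, new_state):
    """Return new_state if the workflow allows the transition, else raise ValueError."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot move ticket from {state!r} to {new_state!r}")
    return new_state


state = "Open"
for step in ["In Development", "Branch Testing", "Branch Testing Passed",
             "On Staging", "Ready for Release", "Resolved"]:
    state = move(state, step)
print(state)  # Resolved
```

    Keeping the allowed transitions in one table makes it easy to see, for any state, who owns the ticket and where it can go next.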


    As soon as an organization grows to more than a few people, communication and interruption become a real problem and slow down your organization’s progress.

    The secret to combatting this is to make sure tickets are in the right state and up to date. This allows any team member to asynchronously make the right choice as to what to do without needing to wait on or interrupt anyone else.

    Minimal Complexity

    The easier it is for a developer to have a mental model of your development process, the better, since it will feel like a tool rather than a constant puzzle to be solved. Even if you are able to automate much of the bookkeeping process it is good to understand what is happening and why.

    Save your focus and concentration for developing software that matters, not for trying to remember how your software development process works.

    The worst situation to be in is having a manual complex system since that will surely burn out your team.

    Similar to our software development process goals above we also have specific goals we wanted to achieve with how we used Git.

    1. Stable Development Branch. An up-to-date, stable branch as a starting point for features so that each developer is not distracted by bugs and build issues that aren’t theirs.
    2. Feature branches. Engineers can share in-progress work with the product team and QA, and branches are merged only once everyone is satisfied with the changes.
    3. Release branches. These let us know what changes went into a release and give us an easy starting point for hotfix releases with targeted changes.

    These goals could be achieved with the much-referenced “A Successful Git Branching Model” (diagram above) but for us it violated our high level goal of minimal complexity.

    Instead what we opted for is a simpler model in which we have a develop branch that has the latest features where other developers can start branches for their feature work from, and where periodic release branches can be branched from. The two models are largely similar except we do not use master and our hotfixes are done directly on the release branches.

    Airtime’s Simplified Git Branching Model

    Previously I stated that you should focus on what matters. This also means knowing when to automate, so you can spend the saved time building product instead of, for example, manually distributing a test build.

    It can be fun to write scripts and tools but if they are only used infrequently or only by a small subset of people of your organization then it is probably a net loss.

    Use your head or a convenient table.

    Here’s a few DevOps projects we thought were worth it at Airtime.

    Automating Builds

    At Airtime anytime code is pushed to a feature branch, develop, or a release branch on GitHub a build is kicked off.

    If you have more than one developer on a team you should always automate the building of branches that are part of your developer flow.

    Here’s a couple of the reasons:

    Minimizing downtime. Per our branching model above, every developer starts a new feature from the develop branch. If develop does not build then no developers can start a new feature without fixing it first. This means multiple developers may spend time fixing the same build failure. Takeaway: Always build develop every time it is updated.

    Checking all targets and tests. In our experience developers rarely check that all targets/configurations build and tests pass when pushing code to a feature branch. And frankly it’s probably better that they don’t since rather than eat up the CPU cycles of their development machine they can let a build server check.

    The best thing about automating your builds is that once you set it up, it is easy to add it everywhere. Given how cheap build services are these days and the benefit it provides, there is little reason not to do it.

    Distributing Builds

    We also thought it made sense to make it as easy as possible to distribute builds. This means that it would not only be easy for developers but for anyone to send themselves a build they needed.

    Specifically, we wanted to make it easy for:

    1. Product managers to see in-progress work
    2. QA to get a build for a feature branch they are testing
    3. The broader Airtime team to get an early preview of what is about to be released
    4. Developers to upload builds to the App Store and Play Store (mobile teams only)

    For the first three goals, we built a system where commenting qabuild or stagebuild on a pull request triggers a build on Jenkins (needed for keyword detection). Jenkins then starts a CircleCI job, which uploads the build to Fabric and distributes it to the QA team or to everyone, respectively.
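
    The keyword detection step might look something like the sketch below. The two keywords and their audiences come from the description above; the function name, the whole-word matching, and the case-insensitive comparison are assumptions, and the real Jenkins job is not shown.

```python
# Keyword -> audience, as described above.
BUILD_KEYWORDS = {
    "qabuild": "QA team",
    "stagebuild": "everyone",
}


def distribution_for_comment(comment):
    """Return the distribution audience for a pull request comment, or None."""
    words = comment.lower().split()
    for keyword, audience in BUILD_KEYWORDS.items():
        if keyword in words:
            return audience
    return None


print(distribution_for_comment("qabuild"))
print(distribution_for_comment("Looks good, stagebuild please"))
print(distribution_for_comment("Nice work!"))
```

    Matching on whole words keeps an ordinary comment that merely contains the letters of a keyword inside a longer word from triggering a build.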

    We also build develop once a day if there have been any changes that day and distribute the build to the entire company so everyone can see the latest and greatest changes. This also gives people confidence that a bug they see is probably still a bug and should be reported.

    The fourth goal is currently only relevant for the iOS team. Any time a build is pushed to a release branch it is automatically uploaded to the App Store. We do not release automatically but it simplifies the process of getting release builds uploaded.

    Integrating Development Tools

    At Airtime we decided to integrate our development tools as much as possible to reduce time wasted checking multiple UIs and keeping them in sync. This has become much easier over the years so there is little reason not to do it.

    There are some suites of tools such as GitLab that come integrated out of the box. Chances are though that your team is using at least some tools that require some level of integration.

    Here are a few examples of what we’ve done:

    Slack and Everything.

    At this point most organizations that run Slack have integrated it with their development flow and there are plenty of articles about it. We try to use it as our information center for everything.

    We use Slack integrations for notification of build failures on CircleCI and Jenkins, backend service deployment status from AWS, creating Jira tickets in the flow of a discussion, service interruption alerts from 3rd party services we rely on, notifications on ticket flow in Jira, and the list goes on.


    In order to automate the bookkeeping tasks that should be done in Jira when working on a ticket we created a tool called Atflow (short for Airtime Flow) based on hubflow and go-jira.

    Atflow allows us to start development work on a ticket from the command line and the tool takes care of naming the branch and updating the ticket state in JIRA and making sure it is in the current sprint.

    Starting feature development in atflow

    By ensuring that you are naming your branches and pull requests correctly you get the little magic in Jira like this on every ticket.

    Product Development

    In the interest of brevity and staying on topic I did not go into much detail about roadmapping, feedback, and task breakdown.

    Suffice to say that we have product managers that take input from a variety of sources, create a feature roadmap, collaborate with engineers and designers on the best way to break up the roadmap into high level feature epics which engineers can then break down into individual tickets.

    In the future we will have an article from the product team on how we do product development.

    Process Variations Between Teams

    Note that building an app like Airtime requires multiple engineering teams and each team has slight variations in the process necessitated by the challenges they face, but in large part the process is the same across teams.

    A few examples of major differences I did not mention above:

    • Our backend team relies on test driven development (TDD) rather than manual QA
    • Our media team has a combination of automated and manual QA test suites for validation of new media server and client library releases.

    Manual QA Versus Automation

    For app testing we rely on manual QA as opposed to automated UI testing since our product spec changes frequently as we adapt the product to what users want. We felt it was actually more effort to maintain the tests until we got to a certain level of product market fit.

    Good luck developing and improving your own software development process for your organization.

    If you like what you see and what we are working on then come join us!

    Check out our open roles!

    Originally published on July 2, 2019.

  • An Introduction to the Airtime Media Architecture

    In previous posts, we’ve talked about some of the techniques our application and devops teams have used to put together a flexible and scalable back-end to support the Airtime application. For a change of pace, we’d like to dive into the more specialized software that makes the group video chat features of the application work. We’ll talk about some of the challenges we’ve faced and the architecture we’ve built to address them, and follow-up articles will dig into some of the more interesting aspects of the system in greater detail.

    The Challenge of Real Time Video

    Video on the Internet is pervasive. By some measures, video accounts for a staggering 70% of Internet traffic in North America during peak hours. People watch hundreds of millions of hours of video on YouTube every day. With all that video flying around, it might be tempting to think that it’s a solved problem: when you can catch an NFL game live on Amazon Prime Video or instantly stream a live feed to your friends on Facebook, how hard can video chat be? As it turns out, it’s actually really hard.

    Live video on the Internet is typically distributed through Content Delivery Networks (CDNs), large networks of computers located around the world that take video data from its point of origin and pass it along until it reaches a server close to the viewer. This approach is highly effective when it comes to delivering video to thousands or millions of viewers, because it allows the associated video to fan out from its source to a much larger number of geographically distributed servers, and no single server is required to bear the brunt of the entire audience. However, each server the stream passes through within the CDN adds delay, so viewers see what’s happening sometime after it actually happened, in some cases by as much as tens of seconds.

    These live video streams must also accommodate a wide range of network conditions, from barely adequate and highly variable 2.5G mobile networks to rock-solid 100 Mbps home fiber connections. As anyone who has ever waited impatiently for a stream to start while staring at a spinner knows, changes in network conditions can be papered over via buffering, which allows a viewer to start downloading a stream before it starts playing, hopefully giving the network enough of a head start that the viewing experience won’t be interrupted if a hiccup occurs, and some part of the stream can’t be retrieved right away. Of course, buffering also adds delay — in fact, that’s really all it is — so the viewer’s experience is further removed from the action that’s taking place in real time.

    Thanks to these techniques, it’s relatively easy to distribute one-way video streams via CDNs, but at the cost of significant delay. This is fine if all you want to do is sit back and watch a YouTube stream, but intolerable if you’re trying to have a two-way conversation. Large delays lead to awkward pauses and cross-talk, as neither side knows when the other is done talking. This is the crux of the difference between streaming video and real-time video: whereas streaming applications can easily tolerate large delays, effective real-time communication requires latencies in the neighborhood of tens of milliseconds, a thousand-fold difference. Consequently, the server infrastructure, management tools, and network protocols for real-time applications are radically different from those used for simple streaming.
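
    To make the gap concrete, here is a back-of-the-envelope sketch of how per-hop delay and buffering add up along a delivery path. Every number in it is an illustrative assumption chosen to match the orders of magnitude mentioned above (tens of seconds versus tens of milliseconds), not a measurement of any real CDN or of Airtime's network.

```python
def end_to_end_delay_ms(hop_delays_ms, buffer_ms):
    """Total delay seen by a viewer: each hop adds its own delay, and the
    player's buffer then adds its full duration on top."""
    return sum(hop_delays_ms) + buffer_ms


# Made-up streaming path: three CDN tiers plus a 20 s player buffer.
streaming_ms = end_to_end_delay_ms([2000, 4000, 4000], buffer_ms=20_000)

# Made-up real-time path: a single media server hop and a small jitter buffer.
realtime_ms = end_to_end_delay_ms([10], buffer_ms=20)

print(f"streaming: {streaming_ms} ms, real-time: {realtime_ms} ms")
print(f"ratio: {streaming_ms // realtime_ms}x")
```

    The point of the sketch is only that both terms have to shrink by orders of magnitude at once: fewer, faster hops and a much smaller buffer.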

    A Multiplicity of Devices

    Further complicating matters is the tremendous variety of devices people carry around every day. You might have the latest and greatest flagship phone, but odds are that not all of your friends and family do. Encoding and decoding video make heavy use of your device’s processing resources, so the differences in horsepower between high-end and bargain-basement phones can have real effects in terms of the quality of video they can support.

    These differences can easily be mitigated for simple two-person calls: the callers’ devices can exchange information regarding their capabilities and mutually decide on a configuration that works well for each of them. The situation is much more complicated for multi-person video chat, however. In scenarios in which two or more powerful devices are participating in a session with one that’s significantly less capable, falling back to the capabilities of the weak device needlessly shortchanges the users with more powerful phones.

    Making this work well for everyone requires a solution that allows powerful devices to send and receive high-quality streams, while simultaneously ensuring that less powerful devices aren’t overwhelmed with high-resolution video they can’t handle. Additionally, this must be done at a reasonable cost, which precludes using back-end resources to create bespoke versions of every stream for every participant in a conversation.
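
    One general way to meet both requirements, sketched below, is for each publisher to offer a small fixed set of encodings and for the server to forward to each subscriber the best one that subscriber can handle, instead of producing a custom encoding per participant. This is an illustration of the idea only. The text above does not state that Airtime's media server works this way, and the `Layer` type, the `pick_layer` function, and the resolution and bitrate figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Layer:
    """One encoding of a publisher's video (hypothetical model)."""
    height: int        # vertical resolution in pixels
    bitrate_kbps: int  # approximate bandwidth needed to receive it


def pick_layer(layers: Sequence[Layer], max_height: int, budget_kbps: int) -> Optional[Layer]:
    """Return the highest-resolution layer the subscriber can both decode
    and afford, or None if even the smallest layer is too much."""
    usable = [
        layer for layer in layers
        if layer.height <= max_height and layer.bitrate_kbps <= budget_kbps
    ]
    return max(usable, key=lambda layer: layer.height, default=None)


# The publisher offers three encodings once; each subscriber gets its own pick.
layers = [Layer(180, 150), Layer(360, 500), Layer(720, 1500)]
flagship = pick_layer(layers, max_height=1080, budget_kbps=4000)
budget_phone = pick_layer(layers, max_height=360, budget_kbps=600)
print(flagship, budget_phone)
```

    Because the choice is made per subscriber, a weak device in the call no longer forces everyone down to its level, and the server's work stays limited to selecting and forwarding.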

    Toward a Real-Time Delivery Network

    At Airtime, we’ve spent the last few years working to solve these problems, with the goal of creating a real-time delivery network that can deliver high-quality, real-time video around the world and seamlessly accommodate both the broad range of devices our users own and the diverse networks they’re connected to.

    In doing so, we’ve also had to address the challenge of delivering these services in a way that is scalable and cost-effective. Processing video is computationally expensive, so we’re forced to eschew the popular web frameworks typically used to run large-scale sites, instead relying on an old standby: C++, which also happens to be the implementation language of the WebRTC framework that underlies our service. This approach gives us native performance and easier integration with the framework code. By prioritizing efficiency and minimizing external dependencies, we’ve put together a system that will support us as we grow to millions of users and beyond without breaking the bank.

    To make all this work, we’ve built a number of independent components, including:

    • A high-performance media server that distributes streams and continuously tailors them to the needs of users and their devices. Written in C++ to maximize scalability, this service is built on top of the open-source WebRTC project and ensures that we can deliver real-time audio and video to users regardless of the type of devices they are using and the conditions of the networks they’re connected to.
    • Native client frameworks for iOS and Android that allow our apps to publish and subscribe to video using our infrastructure, built on a customized and tuned port of the WebRTC stack.
    • A corresponding JavaScript framework that allows web apps to fully participate in Airtime calls using browsers’ built-in WebRTC implementations.
    • A stream management service that enables clients to publish, discover, and subscribe to new streams.
    • Globally distributed systems to support the discovery and allocation of media servers, working in concert with real-time monitoring of audio and video performance.
    • Automated systems to bring up additional capacity as needed and to manage the deployment of new software releases.
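
    As a toy illustration of the publish, discover, and subscribe operations attributed to the stream management service in the list above, the following in-memory sketch shows one possible shape for such an interface. The class and method names are hypothetical and do not reflect Airtime's actual API; a real service would be networked and considerably more involved.

```python
class StreamRegistry:
    """In-memory stand-in for a stream management service (hypothetical)."""

    def __init__(self):
        self._publishers = {}   # stream_id -> publisher_id
        self._subscribers = {}  # stream_id -> set of subscriber ids

    def publish(self, stream_id, publisher_id):
        """Register a new stream so that other clients can find it."""
        if stream_id in self._publishers:
            raise ValueError(f"stream {stream_id!r} is already published")
        self._publishers[stream_id] = publisher_id
        self._subscribers[stream_id] = set()

    def discover(self):
        """List the ids of all currently published streams."""
        return sorted(self._publishers)

    def subscribe(self, stream_id, subscriber_id):
        """Attach a subscriber to an existing stream."""
        if stream_id not in self._publishers:
            raise KeyError(f"unknown stream {stream_id!r}")
        self._subscribers[stream_id].add(subscriber_id)

    def subscribers(self, stream_id):
        """Return the subscriber ids for a stream, sorted for stable output."""
        return sorted(self._subscribers.get(stream_id, set()))


registry = StreamRegistry()
registry.publish("alice-camera", "alice")
registry.subscribe("alice-camera", "bob")
print(registry.discover(), registry.subscribers("alice-camera"))
```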

    From 50,000 feet, the system looks a little like this:

    As you can imagine, there’s more to it than we can cover in this post, so we’ll dig into parts of the system in more detail down the road a bit. Stay tuned!

    In the meantime, if this sounds interesting to you, check out our open engineering roles.

    Originally published on March 8, 2019.

  • Airtime + vLine

    We are excited to announce that vLine has joined the Airtime team. When we first met Sean Parker it was clear he had a vision that resonated with ours: that the proliferation of smartphones and increasing network bandwidth and coverage provide the foundation for a world where real-time video will give rise to new ways of communication and experiences.

    Since joining Airtime, we have been working on building out a globally distributed WebRTC platform optimized for mobile devices and networks that powers the Airtime application. Building scalable real-time multi-party video chat is quite challenging. Unlike simple streaming applications that can tolerate latency on the order of seconds and leverage existing content delivery networks, our network has to provide latency on the order of milliseconds so that people anywhere in the world can hold intelligible conversations. We have made much progress towards this goal, but we have more to do and exciting ideas on how to further improve. If you are interested in complex technical challenges and want to help us push the boundaries of real-time communication, we are hiring.

    We thank you for using vLine and being our customers. We value your privacy and will be deleting all vLine customer data permanently. We hope to continue creating compelling products and experiences for you as part of Airtime.

    Please check out the Airtime app. Comments or questions? Send us a note. We’d love to hear from you.

    Originally published on June 28, 2016.