Reliable, high-quality audio and video are key to an outstanding video call experience. At Airtime, we strive to give users the best experience possible, so it is crucial that we verify a client’s real-time video quality adapts as expected when impeding network conditions are encountered. Conditions covered by our testing include limited bandwidth, packet loss, packet latency, and combinations of these; we refer to these as network constraint tests. Network constraint testing can be performed manually using a router, but this is time consuming because each test requires a unique setup. We greatly increased testing productivity by creating APIs and cloud infrastructure that automate the majority of our existing manual network constraint tests. In summary, the following were accomplished to get this project up and running:
- Researched several constraint tools and selected Linux Traffic Control (TC) for deeper evaluation on usability and reliability
- Modified automation infrastructure in AWS to integrate TC
- Developed APIs to allow TC to be remotely configured from within tests
- Wrote automated tests that utilized the new infrastructure and APIs
Potential Constraint Tools for Automation
In the past, several network constraint tools had been evaluated as potential candidates for creating constraints in automation. Some of these tools displayed unwanted behaviors during testing, such as long round-trip times when a simple bandwidth constraint was applied. A successful tool is one that applies only the network constraints needed for tests, without introducing negative side effects that would invalidate automated test results.
Network Link Conditioner (NLC)
NLC runs on client devices (iPhone and macOS) and we use it as part of our manual test procedures. However, our automated test environments run on Linux virtual servers, where NLC is not available. We might have considered NLC further had no other promising solution been found at the time.
pfSense
pfSense is an open source firewall that can be installed on a physical computer or virtual machine and uses DummyNet as its traffic shaping tool. (Traffic shaping is a bandwidth management technique used in computer networking.) When pfSense was previously evaluated, unexpected network latency was discovered, and this option was rejected as a result.
Soekris
Soekris is a physical router that can be configured to apply constraints and is used in manual testing. However, Soekris was challenging to work with: configurations weren’t reliable, which led to regular factory resets and, in turn, unreliable test results. Other concerns were that it was not easily programmable, had no API, and is no longer available for sale. And since Soekris is a physical device, it cannot be used in a cloud context.
Linux Traffic Control (TC)
TC is a Linux command that produces constraints by modifying the queuing settings of a network interface. TC has a variety of command line options and add-ons like netEm (network emulator) that support different rate limiting and queuing algorithms for producing the desired constraints.
We saw TC as a good candidate because it can apply the different constraints we need to test, is simple and quick to install, and can be programmatically configured for automation. One downside of working with TC is the scarcity of up-to-date documentation, examples, and discussion of best practices for the different configuration options and combinations. This required us to investigate and experiment further to determine what actually worked for us.
Evaluation of TC as a Reliable Tool
We evaluate every constraint tool prior to adding it to our testing toolset to ensure the validity of our test results. Similarly, we verified that TC is able to reliably and consistently apply the necessary constraints without introducing unwanted side effects that would tamper with automated test results.
In order to judge a tool objectively, we have existing performance tests that evaluate a tool’s reliability when applying different constraints. We made assumptions about the implementation of a typical rate shaper algorithm; one recognized implementation uses a queue and the token bucket algorithm, as shown in Fig. 1 below.
The token bucket algorithm generates the configured limited rate r by filling the bucket with tokens at a rate of r tokens per second. Queued packets are transmitted when tokens are available in the bucket; otherwise they remain enqueued until sufficient new tokens have been generated. When the queue is saturated, the token fill rate r becomes the transmission rate, i.e. the limited rate. The bucket depth limits the size of a traffic burst, which is the amount of traffic that can pass through immediately. Packets still travel at the maximum link rate R when transmitted. Over time, the transmission rate of packets (R) and the idle periods between those transmissions average out to the limited rate r.
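The averaging behavior described above can be illustrated with a minimal simulation. This is an explanatory sketch, not TC's actual implementation: it assumes an always-saturated queue and ignores the (fast) transmission time at link rate R, so only the token fill rate governs throughput.

```python
# Minimal token-bucket rate-shaper simulation (illustrative sketch only).
# Tokens accumulate at rate r bits/second; a packet is transmitted when
# enough tokens are in the bucket, otherwise it waits for new tokens.

def simulate_token_bucket(rate_bps, bucket_bits, packet_bits, n_packets):
    """Return the time (seconds) needed to drain n_packets, assuming the
    queue is always saturated (a packet is always waiting to be sent)."""
    tokens = bucket_bits          # bucket starts full: allows an initial burst
    t = 0.0
    sent = 0
    while sent < n_packets:
        if tokens >= packet_bits:
            tokens -= packet_bits  # transmit immediately (link rate assumed fast)
            sent += 1
        else:
            # wait until enough tokens have been generated at rate r
            t += (packet_bits - tokens) / rate_bps
            tokens = packet_bits
    return t

# With a 1 Mbit/s fill rate, sending 1000 packets of 10,000 bits each takes
# about 10 seconds once the initial burst allowance is spent, so the averaged
# throughput converges to r.
elapsed = simulate_token_bucket(1_000_000, 20_000, 10_000, 1000)
print(1000 * 10_000 / elapsed)  # averaged bits/second, close to r
```

The bucket depth (20,000 bits here) only affects the initial burst; over a long run the average rate is pinned to r, which is exactly the property the performance tests rely on.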
A bandwidth constraint is fully enforced when a queue becomes saturated (the input rate is much higher than the maximum bitrate), which leads to packets being dropped (tail-dropping behavior). This is also known as conditional packet loss. When the test packets sent are a constant size, the depth of the queue stays within a predictable range, which allows expected results to be determined for bandwidth constraint performance tests. Unconditional packet loss can be simulated by dropping packets either periodically or randomly. Ping is used as a traffic generator and, in flood mode, can send packets at a high rate; monitoring each sent packet’s status shows whether packets are being dropped at the configured loss rate and whether the dropping is periodic or random. Unconditional delay is the duration of time packets are held before being forwarded. We can determine that a delay has been applied by using ping to see that packet round-trip times have increased by the configured amount.
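The difference between periodic and random unconditional loss can be sketched as follows. This is an illustrative model of the two drop patterns a ping flood would reveal, not the drop logic of any particular tool; the 10% loss rate and seed are arbitrary example values.

```python
import random

# Two ways to realize the same configured loss rate (10% here):
# periodically dropping every Nth packet, or dropping each packet
# independently with probability equal to the loss rate.

def periodic_loss(n, loss):
    period = round(1 / loss)                  # drop every Nth packet
    return [i % period == 0 for i in range(n)]

def random_loss(n, loss, seed=1):
    rng = random.Random(seed)                 # seeded for reproducibility
    return [rng.random() < loss for _ in range(n)]

drops_periodic = periodic_loss(10_000, 0.10)
drops_random = random_loss(10_000, 0.10)

print(sum(drops_periodic) / len(drops_periodic))  # exactly the configured rate
print(sum(drops_random) / len(drops_random))      # close to the configured rate
```

Inspecting which sequence numbers went missing in a ping flood distinguishes the two patterns: evenly spaced gaps indicate periodic dropping, scattered gaps indicate random dropping.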
Table 1 details some of the performance tests and expected results used to evaluate TC for different constraint behaviors.
For the expected results in Table 1, we were willing to tolerate packet loss rates and received bandwidth values that were within 5% of the configured values. To ensure that the network simulator’s queues were not introducing traffic bursts, we accepted additional variations in delay of no more than the time required to send a small number of packets.
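The 5% acceptance criterion can be expressed as a simple check. The helper name is hypothetical; it just restates the tolerance rule above.

```python
# Hypothetical helper reflecting the acceptance criterion: a measured loss
# rate or bandwidth value passes if it is within 5% of the configured value.

def within_tolerance(measured, configured, tolerance=0.05):
    return abs(measured - configured) <= tolerance * configured

print(within_tolerance(0.97e6, 1.0e6))  # True: 3% under a 1 Mbit/s target
print(within_tolerance(1.20e6, 1.0e6))  # False: 20% over the target
```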
Setting Up a Local Physical Environment to Evaluate TC
To ensure a successful evaluation of TC, we started with a clean physical test environment. External network interference needed to be mitigated so that no additional influence would be introduced into the evaluation test results. This was achieved with the private subnet setup shown in Fig. 2 below. Test traffic packets were routed through network interfaces on the Linux router box, and no other traffic was passing through those interfaces. TC configures those same network interfaces to introduce constraints.
Tools Used for Performance Testing
iPerf3 was run on both the client and server sides, with the client generating test traffic and the server collecting statistics. Ping was used to check packet round-trip time (RTT).
Adjustments Made for Performance Tests
iPerf3 does not include packet headers such as UDP or IPv4 when calculating bitrates, whereas the network simulator does. Thus the packet length and bitrate sent by iPerf3 on the client side were adjusted to account for a 28 byte overhead (8 bytes from the UDP header and 20 bytes from the IPv4 header). Packets arriving at TC included a 14 byte Ethernet header, which meant the constraint configured using TC also needed to be adjusted to account for this additional overhead. We also experimented with changing the queue size on the network interface to investigate its effect on achieving a more accurate constraint; performance tests were re-run with queue sizes of 50, 100, and the default of 1000 packets.
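The overhead adjustments work out as follows. The 1372-byte payload is an illustrative example value, and the scaling of the TC rate is a simplified calculation rather than TC's exact internal accounting.

```python
# Worked example of the header-overhead adjustments described above.
# iPerf3 reports application-layer (payload) bitrate, the shaper model
# counts IP packets, and TC on the router additionally sees Ethernet framing.

UDP_HEADER = 8    # bytes
IPV4_HEADER = 20  # bytes
ETH_HEADER = 14   # bytes

payload = 1372                                   # example iPerf3 packet length
ip_packet = payload + UDP_HEADER + IPV4_HEADER   # 28 bytes of L3/L4 overhead
on_wire = ip_packet + ETH_HEADER                 # what TC sees on the interface

# To enforce a 1 Mbit/s constraint at the IP layer, the rate configured in TC
# is scaled up by the Ethernet overhead ratio (simplified illustration):
target_ip_rate = 1_000_000
tc_rate = target_ip_rate * on_wire / ip_packet

print(ip_packet, on_wire, round(tc_rate))  # 1400 1414 1010000
```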
Further Evaluation of TC In a Cloud Environment
After testing in the local environment was complete and the performance results were promising, we recreated a model of the local setup in AWS and repeated the evaluation process. We wanted to confirm that TC worked the same way and that the cloud environment didn’t introduce unexpected problems. We also suspected there could be an additional VLAN (virtual LAN) header in the cloud, but none was observed.
TC Configurations Chosen for Automation
TC supports a variety of rate limiting algorithms and setup combinations. Different features and configurations were researched, attempted and considered. One of the simplest rate limiting configurations available in TC is the Token Bucket Filter (TBF). TBF was used along with netEm (network emulator) to create the final set of configurations used in our automation environments. TBF limits the rate while netEm is an additional software tool that adds packet loss and delay to TC.
We chose this configuration due to its reliability and simplicity in the automation environment compared to the other configurations and features we tried. Table 2 shows successful performance results for these configurations.
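The kind of commands involved can be sketched as below, modeled on the commonly documented recipe of a netEm root qdisc with a TBF child. The interface name and all parameter values are illustrative placeholders, not our production configuration.

```python
# Sketch of TC commands a helper might emit to combine netEm (delay/loss)
# with TBF (rate limiting). Values are placeholders for illustration.

def build_tc_commands(dev, rate="1mbit", delay="100ms", loss="1%",
                      limit_pkts=100):
    return [
        # root qdisc: netEm adds delay and loss; 'limit' is the queue size
        # in packets
        f"tc qdisc add dev {dev} root handle 1: netem delay {delay} "
        f"loss {loss} limit {limit_pkts}",
        # child qdisc: TBF enforces the bandwidth limit
        f"tc qdisc add dev {dev} parent 1:1 handle 10: tbf rate {rate} "
        f"burst 32kbit latency 400ms",
    ]

for cmd in build_tc_commands("eth1"):
    print(cmd)
```

A real setup would run these commands (with root privileges) on the constraining host; generating them from parameters is what makes the configuration programmable for automation.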
Intermediate Challenges with TC and Additional Experiments
Experimentation was also done with TC’s HTB (Hierarchy Token Bucket) option to see if different rate limits could be configured for different IP addresses on a single network interface. TC’s filtering option was used with HTB to direct packets destined for different IP addresses to the correct queuing node. We found the constraints under HTB to be less reliable, as they didn’t always apply as expected. Occasionally we would also see a delay before the constraint took effect (iPerf3 server results often showed 0% packets received for an initial period of time). We thought this could have been due to the queue size, so we experimented with that as well. However, since the setup was more complex and did not work well in combination with netEm for adding packet loss and delay, we decided to forgo HTB. A tradeoff was thus made to modify the infrastructure of the automation environment so that TBF and netEm could be used instead. This choice also simplified the design of the APIs needed to control and access TC.
Adjusting the queue size within TC also required experimentation to verify how it affected the received bitrate. By default, TBF uses a queue size of 1000 packets, but this can be changed either through the TC command options or by configuring the network interface with networking commands. We observed that TC configurations that worked in the testing environment produced slightly different results when tested with a real mobile client that measured the results using an internally built tool. We therefore re-ran the evaluation criteria tests with queue sizes of 50 and 100 packets and, based on those results, settled on a queue size of 100 packets.
Getting it Working
Integration into the Automation Environment
The structure of the automation environment was influenced by the type of TC configuration used and how our existing testing tools functioned and communicated back and forth with the media servers. Instead of relying on TC IP filtering to constrain the correct packets for specific IP addresses, we opted to have the packets of a constrained stream be routed differently by using an IP packet marking script called cgexec.
Fig. 3 shows how the automated cloud infrastructure is set up (VPC stands for virtual private cloud). The automation host sets up and invokes the automation tests. TC was introduced into our automation environment by adding an AWS instance between our automation host and media server. When the automation tests are invoked, either a publisher stream or a subscriber stream can be constrained, but not both at once. In Fig. 3, constrained publisher packets are routed through the eth2 network interface to the media server using cgexec, while subscribing clients route through eth1; similarly, constrained subscriber packets route through the eth1 network interface. Packets traveling to a specific IP address are marked using cgexec so that they are routed through the TC virtual instance and undergo a constraint. Packets that are not marked travel directly to the media server unconstrained.
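One common way to realize this marking-and-routing scheme is via a net_cls cgroup, an iptables mark, and a policy routing rule, with cgexec launching the process inside the cgroup. The sketch below is a hedged illustration of that general pattern only: the cgroup name, classid, mark, table number, and next-hop address are all hypothetical, and our actual scripts differ.

```python
# Illustrative sketch (hypothetical names and values) of marking a process's
# packets and routing them via the TC instance. Not our actual scripts.

MARK = 1
CLASSID = "0x00100001"  # net_cls class id assigned to the constrained cgroup

setup_cmds = [
    # create a cgroup whose traffic can be identified by its classid
    "cgcreate -g net_cls:constrained",
    f"echo {CLASSID} > /sys/fs/cgroup/net_cls/constrained/net_cls.classid",
    # mark packets originating from that cgroup...
    f"iptables -t mangle -A OUTPUT -m cgroup --cgroup {CLASSID} "
    f"-j MARK --set-mark {MARK}",
    # ...and route marked packets through the TC instance (placeholder next hop)
    f"ip rule add fwmark {MARK} table 100",
    "ip route add default via 10.0.2.1 table 100",
]

# a test process launched inside the cgroup has its packets constrained;
# processes launched normally bypass the TC instance entirely
launch_cmd = "cgexec -g net_cls:constrained ./publisher_client"
print(launch_cmd)
```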
A high-level overview of an automated test is as follows. When testing with a constrained publisher, the mock client first starts a process that creates a publisher stream. Traffic from that process is routed through the TC instance, where the constraint is applied, and then reaches the media server. In this case, the test subscriber client must be unconstrained due to how the automation infrastructure is implemented, and the subscriber client receives stream data back from the media server.
APIs needed to be created in order to set up and invoke TC programmatically. We chose to modularize the implementation by organizing the API into two separate repositories: a high-level API that is more general and abstracts away the TC details, and a low-level API that handles the specifics of setting up TC configurations automatically. This also provides flexibility if the TC commands need to be modified in the future.
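The layering can be sketched as follows. All class and method names here are hypothetical, and the low-level layer is shown returning the command it would run rather than executing it, purely for illustration.

```python
# Illustrative sketch (hypothetical names) of the two-layer API split:
# the high-level layer speaks in test vocabulary; only the low-level
# layer knows about TC command syntax.

class LowLevelTc:
    """Low-level layer: turns constraint parameters into TC commands."""

    def apply(self, dev, loss=None, delay=None):
        opts = []
        if delay:
            opts.append(f"delay {delay}")
        if loss:
            opts.append(f"loss {loss}")
        cmd = f"tc qdisc replace dev {dev} root netem {' '.join(opts)}"
        # a real implementation would execute this on the TC instance
        return cmd


class ConstraintApi:
    """High-level layer: tests call this and never see TC details."""

    def __init__(self, backend, dev="eth1"):
        self.backend = backend
        self.dev = dev

    def constrain_uplink(self, loss=None, delay=None):
        return self.backend.apply(self.dev, loss=loss, delay=delay)


api = ConstraintApi(LowLevelTc())
print(api.constrain_uplink(loss="5%", delay="50ms"))
```

Because tests depend only on the high-level layer, a change to the underlying TC command syntax is contained in the low-level repository.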
These APIs allowed us to integrate static and dynamic uplink and downlink tests into our existing automation. Static constraint tests apply a single constraint for the entire test. Dynamic constraint tests change the constraint multiple times over the course of the test, collecting video quality statistics after each constraint.
Automation of network constraint tests has introduced numerous benefits. We’ve increased the efficiency and regularity of testing how poor network conditions affect real-time video quality. The analysis of test results is more standardized and objective compared to manual analysis, which can be prone to subjectivity. The tests run as part of our regular nightly tests to catch potential regressions and problems sooner, and with more frequent monitoring we can respond to discovered issues faster. Manual network testing time has been reduced by more than 50%, freeing testers to focus on creative testing tasks. Previously, running the tests manually would take several hours; now they run in several minutes.
During the time the automation tests have been active, network adaptation behavior has varied, producing a broad range of numeric results. In the future, we could average multiple test runs or use other measures of central tendency.