Deploying BBR v2 on geo-distributed cloud servers
Google’s BBR congestion control algorithm offers significant improvements in the queueing and throughput performance of connections compared with the state-of-the-art TCP CUBIC. For example, TCP BBR flows show a 2 to 20 times improvement in throughput over TCP CUBIC. However, deployment on geo-distributed cloud servers has revealed a problem: a consistent and sometimes severe bandwidth disparity amongst competing flows with different Round-Trip Times (RTTs).
In fact, unlike the traditional TCP variants (Reno/CUBIC), which are biased against flows with longer RTTs (i.e., the further away the server, the slower the relative connection speed), BBR favours flows with longer RTTs. Given that Next Generation Networks (e.g., 5G, GPON) and commercial cloud computing strategies all aim to reduce the RTT, this tendency to favour flows with longer RTTs runs contrary to conventional wisdom.
Figure 1 – BBR operates slightly to the right of Kleinrock’s optimal operating point, which can potentially cause throughput unfairness. Short-RTT flows are also cwnd-bounded, further exacerbating the problem for large absolute differences in RTT
The new version of BBR (BBR v2) offers improved fairness with TCP Reno/CUBIC, reduced loss rates for bottleneck buffer sizes less than 1.5× the bandwidth-delay product (BDP), and increased throughput on Wi-Fi paths. Notably absent, however, is any mention of two key BBR v2 parameters that can effectively offset the throughput unfairness problem across a wide range of RTTs whilst maintaining high utilisation.
A key control parameter for BBR is the so-called congestion window (cwnd), which (much like traditional TCP) is the maximum number of unacknowledged packets that can be in flight. Indeed, the throughput of a TCP flow is approximately equal to the ratio of the maximum cwnd to the sum of the RTT and the queuing delay. To a first approximation, BBR’s cwnd is given by the expression
cwnd = cwnd_gain × bottleneck_bandwidth × min_rtt
According to the specifications, BBR spends most of its time (about 98 percent) in the ProbeBW state, during which cwnd_gain = 2, so that all BBR flows are bounded to 2× the bandwidth-delay product. This means that not only do flows with long RTTs have more packets in flight, but flows with short RTTs also encounter the cwnd bound first and therefore cannot effectively probe for more bandwidth.
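As a rough illustration of the cwnd bound above (a simplified sketch in Python, not the kernel code; `bbr_cwnd_cap` is a hypothetical helper), consider two flows sharing the same bottleneck:

```python
# Simplified sketch of BBR's cwnd cap: cwnd = cwnd_gain * btlbw * min_rtt.
# Units: bandwidth in packets/sec, RTT in seconds.

CWND_GAIN = 2.0  # gain used in the ProbeBW state (2 x BDP bound)

def bbr_cwnd_cap(bottleneck_bw_pps: float, min_rtt_s: float) -> float:
    """Maximum packets in flight for a flow with the given min_rtt."""
    return CWND_GAIN * bottleneck_bw_pps * min_rtt_s

# Two flows sharing a 100 Mbps bottleneck (~8333 packets/s at 1500 bytes):
bw_pps = 100e6 / (1500 * 8)
short_cap = bbr_cwnd_cap(bw_pps, 0.010)  # 10 ms flow
long_cap = bbr_cwnd_cap(bw_pps, 0.050)   # 50 ms flow
# The 10 ms flow's cap is 5x smaller, so it hits its cwnd bound first
# and cannot probe for a larger share of the bandwidth.
print(round(short_cap), round(long_cap))
```

The short-RTT flow is capped at one fifth of the long-RTT flow’s in-flight packets, which is the mechanism behind the bias described above.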
BBR’s throughput unfairness problem manifests with slight differences in RTT and becomes particularly salient as the RTT difference increases. Experiments on testbeds show that whilst the bias against short-RTT flows is alleviated on a low-bandwidth bottleneck, long-RTT flows can easily seize more than 90% of the available bandwidth for a 5× ratio in the RTT.
In response to this problem, two solutions have been proposed in recent publications:
BBQ – limiting the amount of time spent probing for bandwidth when a queue starts to build
DA-BBR – limiting the bandwidth-delay product (and hence cwnd) of long RTT flows relative to short RTT flows
All things being equal, both solutions offer significant improvements in throughput fairness based on the ratio of the RTTs, but not on the absolute difference between them.
As suggested by both the BBQ and DA-BBR implementations, the panacea for a wide variety of RTTs and high utilisation is twofold:
Control the pacing rate of packets to reduce queues at the bottleneck (and hence the overestimation of the round-trip time by short-RTT flows)
Ensure that the cwnd of short-RTT flows allows them to probe for bandwidth relative to long-RTT flows (i.e., overcome the cwnd bound for short-RTT flows)
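The intuition behind the first point can be captured with a toy model: the bottleneck queue grows when the aggregate pacing rate exceeds the bottleneck bandwidth, and drains when the senders pace below it (`queue_change_bps` is a hypothetical helper for illustration, not part of any BBR implementation):

```python
# Toy model of bottleneck queue dynamics: the queue builds when the sum of
# the senders' pacing rates exceeds the bottleneck bandwidth, and drains
# when they collectively pace below it.

def queue_change_bps(pacing_rates_bps, bottleneck_bw_bps):
    """Net rate (bits/s) at which the bottleneck queue grows (+) or drains (-)."""
    return sum(pacing_rates_bps) - bottleneck_bw_bps

# Two flows each pacing 10% below their 50 Mbps share of a 100 Mbps link:
print(queue_change_bps([45e6, 45e6], 100e6))  # negative => queue drains
```

Pacing slightly below the estimated bandwidth therefore keeps the queue short, which in turn keeps the RTT samples of short-RTT flows honest.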
Fortunately, both of these controls can be readily “tuned” within BBR v2 in accordance with the original BBR specifications. In the Linux net/ipv4 implementation, the parameter bbr_pacing_margin_percent and the function bbr_tso_segs_generic can both be adapted to provide the necessary leverage.
By reducing bbr_pacing_margin_percent to ensure pacing at 5–10 percent below the bottleneck bandwidth (updated from 10–15 percent for a three-tier pacing rate; see the Google BBR development forum), throughput fairness can be observed for RTT ratios as large as 5× (e.g., 10ms vs 50ms). However, large absolute differences in RTT (e.g., 50ms vs 250ms) require a solution that sets cwnd in inverse proportion to the RTT. Indeed, the function bbr_tso_segs_generic can be adapted to produce TSO segment budgets that are a linear function of RTT with a negative gradient.
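The effect of the pacing margin can be sketched as follows (an illustrative sketch only; `paced_rate` is a hypothetical helper that mirrors the role of bbr_pacing_margin_percent, not the kernel code):

```python
def paced_rate(bottleneck_bw_bps: float, margin_percent: float) -> float:
    """Pace slightly below the estimated bottleneck bandwidth so that queues
    drain rather than build (the role of bbr_pacing_margin_percent)."""
    return bottleneck_bw_bps * (100 - margin_percent) / 100

# Pacing 10% below a 50 Mbps bottleneck estimate:
print(paced_rate(50e6, 10))  # -> 45000000.0
```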
In the SureLink-XG C++ implementation (which is closely aligned with the Linux net/ipv4 implementation), the range selected for the TSO burst size based on RTT is 10ms to 400ms. The maximum TSO burst at 10ms is 64Kbytes/1500bytes (roughly 43 segments) and the minimum at 400ms is 2 segments, in accordance with the BBR specification. More details of the implementation and analysis are provided in the Google BBR development forum.
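A minimal sketch of such a linear, decreasing TSO budget, using the endpoints described above (`tso_segs_for_rtt` is a hypothetical helper for illustration, not the actual bbr_tso_segs_generic code):

```python
# Illustrative sketch: TSO segment budget as a linear function of RTT with a
# negative gradient, interpolated between the endpoints described in the text.

MAX_SEGS = (64 * 1024) // 1500   # ~43 segments at the 10 ms end
MIN_SEGS = 2                     # floor at the 400 ms end
RTT_LO_MS, RTT_HI_MS = 10, 400

def tso_segs_for_rtt(rtt_ms: float) -> int:
    """Linearly interpolate the TSO segment budget between the endpoints,
    clamping RTTs outside the [10 ms, 400 ms] range."""
    rtt_ms = min(max(rtt_ms, RTT_LO_MS), RTT_HI_MS)
    frac = (rtt_ms - RTT_LO_MS) / (RTT_HI_MS - RTT_LO_MS)
    return round(MAX_SEGS + frac * (MIN_SEGS - MAX_SEGS))

print(tso_segs_for_rtt(10), tso_segs_for_rtt(50), tso_segs_for_rtt(400))  # 43 39 2
```

Because the budget shrinks as RTT grows, long-RTT flows send smaller bursts, effectively bounding their cwnd in inverse proportion to RTT.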
Based on the modifications described above, TCP BBR v2 behaves like the traditional TCP versions (e.g., Reno/CUBIC) when the absolute difference in the RTTs of the competing flows is low (~50ms). This means that BBR v2 favours short-RTT flows, but long-RTT flows will still get a significant share (~50%) of the bottleneck bandwidth.
On the other hand, when the absolute difference in the RTTs of competing flows is large (~200ms) and the pacing rate of competing flows is high (> 50Mbps), BBR v2 will be biased in favour of long-RTT flows, but short-RTT flows will still get a significant share (~50%) of the bottleneck bandwidth.
Figure 2 – BBR’s throughput fairness is a function of both the ratio of the RTTs and the absolute difference in RTTs (updated; see the Google BBR development forum)
Unlike the traditional TCP variants, TCP BBR does not appear to rely on the network (Active Queue Management, e.g., Sally Floyd’s Random Early Detection) to ensure throughput fairness. BBR is thus aligned with the architectural principle of the Internet: a simple network with agile edges.
The modifications proposed in this blog have not been vetted against competing traditional TCP variants and may need to be further adjusted for a heterogeneous environment. In addition, the throughput fairness problem is noticeable at RTT ratios greater than 5×, and flow starvation may still occur at even higher ratios.
However, the modifications suggested not only fit with the current BBR specifications and implementations, but they also offer the requisite throughput fairness improvements for LAN and long-haul connections with RTTs that range between 5ms and 300ms.
And all the roads we have to walk are winding, and all the lights that lead us there are blinding.