Prague Congestion Control

Prague Congestion Control Nokia Bell Labs

Antwerp Belgium koen.de_schepper@nokia.com https://www.bell-labs.com/usr/koen.de_schepper

Nokia Bell Labs

Antwerp Belgium olivier.tilmans@nokia-bell-labs.com

Independent

UK ietf@bobbriscoe.net http://bobbriscoe.net/

Apple Inc

Cupertino United States vidhi_goel@apple.com

IRTF Internet Congestion Control Research Group (ICCRG) Internet-Draft I-D This specification defines the Prague congestion control scheme, which is derived from DCTCP and adapted for Internet traffic by implementing the Prague L4S requirements. Over paths with L4S support at the bottleneck, it adapts the DCTCP mechanisms to achieve consistently low latency and full throughput. It is defined independently of any particular transport protocol or operating system, but notes are added that highlight issues specific to certain transports and OSs. It is mainly based on experience with the reference Linux implementation of TCP Prague and the Apple implementation over QUIC, but it includes experience from other implementations where available. The implementation does not satisfy all the Prague requirements (yet) and the IETF might decide that certain requirements need to be relaxed as an outcome of the process of trying to satisfy them all. Future plans that have typically only been implemented as proof-of-concept code are outlined in a separate section.

Introduction This document defines the Prague Congestion Control Algorithm (CCA), which is intended to maintain consistently low queuing delay over network paths that offer Low Latency, Low Loss, and Scalable throughput (L4S) support at the bottleneck . Where the bottleneck does not support L4S, a Prague CCA is intended to fall back to behaving like a conventional 'Classic' congestion control. L4S support in the network involves Active Queue Management (AQM) that applies 'Immediate Explicit Congestion Notification (ECN)' at a very shallow target queueing delay (of the order of a millisecond) or even a virtual queue with no queuing delay, for example . 'Immediate ECN' means that the network applies ECN marking based on the instantaneous queue, without any smoothing or filtering, The Prague CCA takes on the job of smoothing and filtering the congestion signals from the network. The Prague CCA is a particular instance of a scalable congestion control, which is defined in . Scalable congestion control is the part of the L4S architecture that does the actual work of maintaining low queuing delay and ensuring that the delay and throughput properties scale with flow rate. The L4S architecture places the host congestion control in the context of the other parts of the system. In particular the different types of L4S AQM in the network and the codepoints in the IP-ECN field that convey to the network that the host supports the L4S form of ECN. The architecture document also covers other issues such as: incremental deployment; protection of low latency queues against accidental or malicious disruption; and the relationship of L4S to other low latency technologies. The specification of the L4S ECN Protocol sets down the requirements that a Prague CCA has to follow (called the Prague L4S Requirements - see for a summary). This specification defines the Prague CCA independent of any particular transport protocol or operating system, but notes are added that highlight issues specific to certain transports and OSs. These primarily draw on experience with i) the reference implementation of Prague on Linux over TCP, and ii) the Apple implementation of Prague over QUIC. Nonetheless, wherever possible, experience from implementers on other platforms is included. Links to implementations of the Prague CCA and other scalable congestion controls can be found via the L4S landing page , which also links to numerous other L4S-related resources. A (slightly dated) paper on the specific implementation of the Prague CCA in Linux over TCP is also available , and the open-source code for Linux is at .

Motivation: Low Queuing Delay /and/ Full Throughput A Prague CCA is capable of keeping queuing delay consistently low while fully utilizing available capacity. In contrast, Classic congestion controls need to induce a reasonably large queue (approaching a bandwidth-delay product) in order to fully utilize capacity. Therefore, prior to scalable CCAs like DCTCP and Prague, it was believed that very low delay was only possible by limiting throughput and isolating the low delay traffic from capacity-seeking traffic. A Prague CCA uses additive increase multiplicative decrease (AIMD), in which it increases its window until an ECN mark (or loss) is detected, then yields in a continual sawtooth pattern. The key to keeping queuing delay low without under-utilizing capacity is to keep the sawteeth tiny. This is achieved by ensuring that the reduction in rate that starts each sawtooth is proportionate to the prevailing level of congestion. In contrast, a classical CCA merely responds to the existence of congestion, not its extent. So each classical reduction has to be large enough to cope with the worst case. For example the average duration of a Prague CCA sawtooth is of the order of a round trip, whereas a classic congestion control sawtooths over hundreds of round trips, implying multiple seconds. For instance, at 120 Mb/s and with a maximum RTT of 30 ms at the peak of each sawtooth, CUBIC takes 4.3 s to recover from each sawtooth reduction. At this rate, CUBIC is still fully in its Reno-friendly mode. If flow rate scales by 8 times to 960 Mb/s, it enters true CUBIC mode, with a recovery time of 12.2 s. From then on, each further scaling by 8 times doubles CUBIC's recovery time (because the cube root of 8 is 2), e.g., at 7.68 Gb/s, the recovery time is 24.3 s. Keeping the sawtooth amplitude down keeps queue variation down and utilization up. Keeping the duration of the sawteeth down ensures control remains tight. The definition of a scalable CCA is that the duration between congestion marks does not increase as flow rate scales, all other factors being equal. This is important, because it means that the sawteeth will always stay tiny. So queue delay will remain very low, and control will remain very tight. The tip of each sawtooth occurs when the bottleneck link emits a congestion signal. Therefore such small sawteeth are only feasible when ECN is used for the congestion signals. If loss were used after each brief recovery time, the loss level would be prohibitively high. This is why L4S-ECN has to depart from the requirement of Classic ECN that an ECN mark is equivalent to a loss. The Prague CCA is derived from Data Center TCP (DCTCP ). DCTCP is confined to controlled environments like data centres precisely because it uses such small sawteeth, which induce such a high level of congestion marking. For a CCA using Classic ECN, this would be interpreted as equivalent to the same, very high, loss level. The Classic CCA would then continually drive its own rate down in the face of such an apparently high level of congestion. This is why coexistence with existing traffic is important for the Prague CCA. It has to be able to detect whether it is sharing a Classic bottleneck with Classic traffic, and if so fall back to behaving in a Classic way. If the bottleneck does not support ECN at all, that is easy - a Prague CCA just responds in the Classic way to loss (see ). But if it is sharing a Classic ECN bottleneck with Classic ECN traffic, this is more difficult to detect (see ). Because the Prague CCA removes most of the queue, it also has to address RTT-dependence. Otherwise, at low base RTTs, its flow rate would become far more RTT-dependent than Classic CCAs.

Document Purpose There is not 'One True Prague CCA'. L4S is intended to enable development of any scalable CCA that meets the L4S Prague requirements . This document attempts to describe a design that transfers a byte stream. It is generalized across different transports and OS platforms.

Maturity Status (To be Removed Before Publication) The field of congestion control is always a work in progress. However, there are areas of Prague implementations that are still just placeholders while separate research code is evaluated. And in other implementations of the Prague CCA, other areas are incomplete. In the Linux reference implementation of TCP Prague, interim code is used in the incomplete areas, which are:

Flow start and restart (standard slow start is used, even though it often exits early in L4S environments were ECN marking tends to be frequent);
Faster than additive increase (standard Reno additive increase is used, which makes the flow particularly sluggish if it has dropped out of slow start early).

The body of this document describes the Prague CCA as implemented. Any non-default options or any planned improvements are separated out into on "Variants and Future Work". As each of the above areas is addressed, it will will be removed from this section and its description in the body of the document will be updated. Once all areas are complete, this section will be removed. Prague will then still be a work in progress, but only on a similar footing as all other congestion controls. No implementation satisfies all the Prague requirements (yet), and the IETF might decide that certain requirements need to be relaxed as an outcome of the process of trying to satisfy them all.

Terminology The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 when, and only when, they appear in all capitals, as shown here. Definitions of terms:

Classic Congestion Control:: A congestion control behaviour that can co-exist with standard TCP Reno without causing significantly negative impact on its flow rate . With Classic congestion controls, as flow rate scales, the number of round trips between congestion signals (losses or ECN marks) rises with the flow rate. So it takes longer and longer to recover after each congestion event. Therefore control of queuing and utilization becomes very slack, and the slightest disturbance prevents a high rate from being attained .
Scalable Congestion Control:: A CCA where the average time from one congestion signal to the next (the recovery time) remains invariant as the flow rate scales, all other factors being equal. This maintains the same degree of control over queueing and utilization whatever the flow rate, as well as ensuring that high throughput is robust to disturbances. For instance, DCTCP averages 2 congestion signals per round-trip whatever the flow rate. For the public Internet a Scalable CCA has to comply with the requirements in Section 4 of (aka. the 'Prague L4S requirements').
Response function:: The relationship in steady state between the window (cwnd) of a congestion control and the congestion signalling probability, p, . A widely used family of response functions has the form cwnd = K/p^B, where K and B are constants. In an approximation of the response function of the standard Reno CCA, B=1/2. For a scalable congestion control B=1, so its response function takes the form cwnd = K/p. The number of congestion signals per round is p*cwnd, which equates to the constant, K, for a scalable CCA. Hence the definition of a scalable CCA above.
Reno-friendly:: The subset of Classic traffic that excludes unresponsive traffic and excludes experimental congestion controls intended to coexist with Reno but without always being strictly friendly to it (as allowed by ). Reno-friendly is used in place of 'TCP-friendly', given that the TCP protocol is used with many different congestion control behaviours.
Classic ECN:: The original Explicit Congestion Notification (ECN) protocol , which requires ECN signals to be treated the same as drops, both when generated in the network and when responded to by the sender. The names used for the four codepoints of the 2-bit IP-ECN field are as defined in : Not ECT, ECT(0), ECT(1) and CE, where ECT stands for ECN-Capable Transport and CE stands for Congestion Experienced. A packet marked with the CE codepoint is termed 'ECN-marked' or sometimes just 'marked' where the context makes ECN obvious.
CC:: Congestion Control
CCA:: Congestion Control Algorithm
ACK:: an ACKnowledgement, or to ACKnowledge
EWMA:: Exponentially Weighted Moving Average
RTT:: Round Trip Time

Definitions of Parameters and Variables:

MTU_BITS:: Maximum transmission unit [b]
cwnd:: Congestion window [B]
ssthresh:: Slow start threshold [B]
inflight:: The amount of data that the sender has sent but not yet received ACKs for [B]
p:: Steady-state probability of drop or marking []
alpha:: EWMA of the ECN marking fraction []
acked_sacked:: the amount of new data acknowledged by an ACK [B]
ece_delta:: the amount of newly acknowledged data that was ECN-marked [B]
ai_per_rtt:: additive increase to apply per RTT [B]
srtt:: Smoothed round trip time [s]
MAX_BURST_DELAY:: Maximum allowed bottleneck queuing delay due to segmentation offload bursts [s] (default 0.25 ms for the public Internet)

Prague Congestion Control

The Prague L4S Requirements The beneficial properties of L4S traffic (low queuing delay, etc.) depend on all L4S sources satisfying a set of conditions called the Prague L4S Requirements. The name is after an ad hoc meeting of about thirty people co-located with the IETF in Prague in July 2015, the day after the first public demonstration of L4S. The meeting agreed on a list of modifications to DCTCP to focus activity on a variant that would be safe to use over the public Internet. It was suggested that this could be called TCP Prague to distinguish it from DCTCP. This list was adopted by the IETF, and has continued to evolve (see section 4 of ). The requirements are no longer TCP-specific, applying irrespective of wire-protocol (TCP, QUIC, RTP, SCTP, etc). This unusual start to the life of the project led to the unusual development process of a reference implementation that had to resolve a number of ambitious requirements, already known to be in tension . DCTCP already implements a scalable congestion control. So most of the changes to make it usable over the Internet seemed trivial, some 'merely' involving adoption of other parallel developments like Accurate ECN TCP feedback or RACK . Others have been more challenging (e.g. RTT-independence). And others that seemed trivial became challenging given the complex set of bugs and behaviours that characterize today's Internet and a modern end system stack such as Linux. The more critical implementation challenges are highlighted in the following sections, in the hope we can prevent mistakes being repeated (see for instance , ). There was also a set of five intertwined 'bugs' - all masking each other, but causing unpredictable or poor performance as different code modifications or environments unmasked them. A comprehensive write-up of the experiments that had to be designed to isolate these bugs is available separately . During the development process, we have unearthed fundamental aspects of the implementation and indeed the design of DCTCP and Prague that have still not caught up with the paradigm shift from existence to extent-based congestion response. Some have been implemented by default, e.g. not suppressing additive increase for a round trip after a congestion event (). Others have been implemented but not fully evaluated, e.g. removing the 1-2 unnecessary round trips of lag in feedback processing () and yet others are still future plans, e.g. further RTT-independence () and exploiting combined congestion metrics in more cases (). The requirements are categorized into those that would impact other flows if not handled properly and performance optimizations that are important but optional from the IETF's point of view, because they only affect the flow itself. The list below maps the order of the requirements in to the order in this document (which is by functional categories and code status):

Mandatory or Advisory Requirements:

L4S-ECN packet identification: use of ECT(1) ()
Accurate ECN feedback ()
Reno-friendly response to a loss ()
Detection of a classic ECN AQM ()
Reduced RTT dependence ()
Scaling down to a fractional window (recommended but no longer mandatory, see )
Detecting loss in units of time ()
Minimizing bursts (

Optional performance optimizations:

ECN-capable control packets ()
Faster flow start ()
Faster than additive increase ()
Segmentation offload ()

Packet Identification On the public Internet, a sender using a Prague CCA MUST set the ECT(1) codepoint on all the packets it sends, in order to identify itself as an L4S-capable congestion control (Req 4.1. ). This applies whatever the transport protocol, whether TCP, QUIC, RTP, etc. In the case of TCP, unlike an RFC 3168 TCP ECN transport, a sender can set all packets as ECN-capable, including TCP control packets and retransmissions , . A Prague CCA SHOULD optionally be configurable to use the ECT(0) codepoint in private networks, such as data centres, which might be necessary for backward compatibility with DCTCP deployments where ECT(1) might already have another usage. Linux implementation note:

TCP Prague in Linux kernel:: The Linux kernel was updated to allow the ECT(1) flag to be set from within a CCA module. A Prague CCA then has full control over the ECN code point it uses at any one time. In this way it asserts the use of ECT(1) (or optionally ECT(0)) and non-ECT when required.

Detecting and Measuring Congestion

Accurate ECN Feedback When feedback of ECN markings was added to TCP , it was decided not to report any more than one mark per RTT. L4S-capable congestion controls need to know the extent, not just the existence of congestion (Req 4.2. ). Recently defined transports (DCCP, QUIC, etc) typically already satisfy this requirement. So they are dealt with separately below, while TCP and derivatives such as SCTP are covered first.

Accurate ECN Feedback with TCP & Derivatives The TCP wire protocol is being updated to allow more accurate feedback (AccECN ). Therefore, in the case where a sender uses a Prague CCA over TCP, whether as client or server:

it MUST itself support AccECN;
to support AccECN it also has to check that its peer supports AccECN during the handshake.

If the peer does not support accurate ECN feedback, the sender MUST fall back to a Reno-friendly CCA behaviour for the rest of the connection. The non-Prague TCP sender MUST then no longer set ECT(1) on the packets it sends. Note that the peer only needs to support AccECN; there is no need to find out whether the peer is using an L4S-capable congestion control. Note that a sending TCP client that uses a Prague CCA can set ECT(1) on the SYN prior to checking whether the other peer supports AccECN (as long as it follows the procedure in if it subsequently discovers the peer does not support AccECN). Linux implementation note:

TCP Prague in Linux kernel:: The Linux kernel update to support AccECN for TCP is independent of the CCA module in use. So the kernel tries to negotiate AccECN feedback whichever congestion control module is selected. The Prague CCA module does an additional check to verify that the kernel actually does support AccECN, based on which Prague will either decide to proceed using a scalable CCA or fall back to a Classic CCA (Reno in the current Linux implementation). A system-wide option is available to enable or disable AccECN negotiation. However, TCP flows using the Prague CCA module depend on AccECN; so they always ignore this system-wide sysctl and enable AccECN negotiation anyway.

Accurate ECN Feedback with Other Modern Transports Transport protocols specified recently, .e.g. DCCP , QUIC , are unambiguously suitable for Prague CCAs, at least according to the protocol specs, because they were designed from the start with accurate ECN feedback. However, in practice, ECN feedback was omitted from the early versions of some QUIC implementations. At the start of a connection, a QUIC sender that wishes to use ECN checks incoming feedback for whether ECN correctly traversed the forward path. To such a QUIC sender, a peer that has not implemented ECN feedback is indistinguishable from a network path that bleaches the ECN field. So, in either case, the sender will disable the use of ECN. Without ECN, Prague essentially turns into a Classic CCA. For instance, the Apple implementation of Prague for QUIC responds to each loss as CUBIC would. So without ECN, its behaviour becomes indistinguishable from CUBIC. This is because Prague's response to each loss is Reno-friendly (as required in ). In the case of RTP/RTCP, ECN feedback was added in , which is sufficient for a Prague CCA. However, it is preferable to use the most recent improvements to ECN feedback in , as used in the implementation of the L4S variant of SCReAM .

Moving Average of ECN Feedback A Prague CCA currently maintains a moving average of ECN feedback in a similar way to DCTCP. This section is provided mainly because performance has proved to be sensitive to implementation precision in this area. So first, some background is necessary. A Prague CCA triggers update of its moving average once per rtt_virt (see ). To maintain its moving average, it measures the fraction, frac, of ACKed packets {ToDo: add Future Work on using ACKed bytes instead, if decided appropriate} that carried ECN feedback over the previous rtt_virt. It then updates an exponentially weighted moving average (EWMA) of this fraction, called alpha, using the following algorithm:

alpha += g * (frac - alpha);

where g (default 1/16) is the gain of the EWMA. The moving average, alpha, is initialized to 1 at the first sign of ECN feedback, which ensures the maximum congestion response to the first appearance of congestion at a bottleneck supporting ECN. Linux implementation notes:

Rounding problems in DCTCP:

Alpha is a fraction between 0 and 1, and it needs to be represented with high resolution because the larger the bandwidth-delay product (BDP) of a flow, the smaller the value that alpha converges to (in steady state alpha = 2/cwnd). In principle, Linux DCTCP maintains the moving average 'alpha' using the same formula as Prague CCA uses (as above). Linux DCTCP represents alpha with a 10-bit integer (with resolution 1/1024). However, up to kernel release 3.19, Linux used integer arithmetic that could not reduce alpha below 15/1024. Then it was patched so that any value below 16/1024 was rounded down to zero . For a flow with a higher BDP than 128 segments, this means that, alpha flip-flops. Once it has flopped down to zero DCTCP becomes unresponsive until it has built sufficient queue to flip up to 16/1024. For larger BDPs, this causes DCTCP to induce larger sawteeth, which loses the low-queuing-delay and high-utilization intent of the algorithm.

Upscaled alpha in Prague CC:

To resolve the above problem the implementation of TCP Prague in Linux maintains upscaled_alpha = alpha/g instead of alpha:

upscaled_alpha += frac - g * upscaled_alpha;

This technique is the same as Linux uses for the retransmission timer variables, srtt and mdev. The Linux Prague CCA also uses 20 bits for alpha.

The above per-rtt_virt update to the moving average is currently used in the Linux Prague CCA. However, another approach is being investigated because these per-RTT updates introduce 1--2 rounds of delay into the congestion response on top of the inherent round of feedback delay (see in the section on variants and future work).

Scaling Loss Detection with Flow Rate After an ACK leaves a gap in the sequence space, a Prague CCA is meant to deem that a loss has occurred using 'time-based units' (Req 4.3. ). This is in contrast to the traditional approach that counts a hard-coded number of duplicate ACKs, e.g. the 3 Dup-ACKs specified in . The reason is that counting packets rather than time unnecessarily tightens the time within which parallelized links have to keep packets in sequence as flow rate scales over the years. To satsify this requirement, a Prague CCA SHOULD wait until a certain fraction of the RTT has elapsed before it deems that the gap is due to packet loss. The reference implementation of TCP Prague in Linux uses RACK to address this requirement. An approach similar to TCP RACK is also used in QUIC. At the start of a connection, RACK counts 3 DupACKs to detect loss because the initial smoothed RTT estimate can be inaccurate. This would depend indirectly on time as long as the initial window (IW) is paced over a round trip (see ). For instance, if the initial window of 10 segments was paced evenly across the initial RTT then, in the next round, an implementation that deems there has been a loss after (say) 1/4 of an RTT can count 1/4 of 10 = 3 DupACKs (rounded up). In subsequent rounds, if the TCP connection detects reordering, RACK shifts to using a fraction of the RTT for loss detection.

Congestion Response Algorithm In congestion avoidance phase, a Prague CCA uses a similar additive increase multiplicative decrease (AIMD) algorithm to DCTCP, but with the following differences:

Loss behaviour A Prague CCA MUST use a Reno-friendly congestion response (such as that of CUBIC or Reno ) on detection of a loss (Req 2 in section 4.3. of ). DCTCP falls back to Reno for the round trip after a loss, and the Linux reference implementation of TCP Prague currently inherits this behaviour. On detection of loss, the Apple implementation of Prague uses CUBIC's behaviour instead of Reno's, for both the reduction after the loss and the subsequent growth of cwnd until the next congestion event. If a Prague CCA has already reduced the congestion window due to ECN feedback less than a round trip before it detects a loss, it MAY reduce the congestion window by a smaller amount due to the loss, as long as the reductions due to ECN and the loss, when combined, result in a reduction that is at least as large as the implementation usually makes in response to loss (e.g. 50% to emulate Reno or 30% to emulate CUBIC). See for discussion of future work on congestion control using a combination of delay, ECN and loss. Linux implementation note:

DCTCP bug prior to v5.1:: A Prague CCA cannot rely on inheriting the fall-back-on-loss behaviour of the DCTCP code in the Linux kernel prior to v5.1, due to a previous bug in the fast retransmit code (but not in the retransmission timeout code) .

Multiplicative Decrease on ECN Feedback A Prague CCA currently responds to ECN feedback in a similar way to DCTCP. This section is provided mainly because performance has proved to be sensitive to implementation details in this area. So the following recap of the congestion response is needed first. As explained in , like DCTCP, the Linux Prague CCA clocks its moving average of ECN-marking, alpha, once per rtt_virt throughout a connection. Nonetheless, it only triggers a multiplicative decrease to its congestion window when it actually receives an ACK carrying ECN feedback. Then it suppresses any further decreases for one round trip, even if it receives further ECN feedback. This is termed Congestion Window Reduced or CWR state. A Prague CCA (also like DCTCP) ensures that the average recovery time remains invariant as flow rate scales (Req 4.3 of ) by making the multiplicative decrease depend on the prevailing value of alpha as follows:

ssthresh = (1 - alpha/2) * cwnd;

Linux implementation notes:

Upscaled alpha:: With reference to the earlier discussion of integer arithmetic precision (), alpha = g * upscaled_alpha.
Carry of fractional cwnd remainder:: Typically the absolute reduction in the window is only a small number of segments. So, if the Prague CCA implementation counts the window in integer segments (as in the Linux reference code), delay can be made significantly less jumpy by tracking a fractional value, cwnd_carry, alongside the integer window and carrying over any fractional remainder to the next reduction. Also, integer rounding bias ought to be removed from the multiplicative decrease calculation. Specifically:

In dynamic scenarios, as flows find a new operating point, alpha will have often tailed away to near-nothing before the onset of congestion. Then DCTCP's tiny reduction followed by no further response for a round is precisely the wrong way for a CCA to respond. A solution to this problem is being evaluated as part of the work already mentioned to improve Prague's responsiveness (see in the section on variants and future work).

Additive Increase and ECN Feedback Unlike DCTCP, a Prague CCA does not suppress additive increase for one round trip after a congestion window reduction (i.e., while in CWR state). Instead, a Prague CCA applies additive increase irrespective of its CWR state, but only for bytes that have been ACK'd without ECN feedback. Specifically, on each ACK, where:

acked_sacked is the number of new bytes acknowledged by the ACK;
ece_delta is the number of newly acknowledge ECN-marked bytes;
ai_per_rtt is a scaling factor that is typically 1 SMSS except for small RTTs (see )

Superficially, the classical suppression of additive increase for the round after a decrease seems to make sense. However, DCTCP and Prague are designed to induce an average of 2 congestion marks per RTT in steady state, which leaves very little space for any increase between the end of one round of CWR and the next mark. In tests, when a test version of a Prague CCA is configured to completely suppress additive increase during CWR (like Reno and DCTCP), its sawteeth become more irregular, which is its way of making some decreases large enough to open up enough space for an increase. This irregularity tends to reduce link utilization. Therefore, the Prague CCA has been modified to continue additive increase irrespective of CWR state. Nonetheless, rather than continue additive increase regardless of congestion, it is safer to only increase on those ACKs that do not feed back congestion. This approach reduces additive increase as the marking probability increases, which tends to keep the marking level unsaturated (below 100%) (see Section 3.1 of ). Under stable conditions, Prague's congestion window then becomes proportional to (1-p)/p, rather than 1/p. See also 'Faster than Additive Increase' ()

Reduced RTT-Dependence The window-based AIMD described so far was inherited from Reno via DCTCP. When many long-running Reno flows share a link, their relative packet rates become roughly inversely proportional to RTT. Then a flow with very small RTT will dominate any flows with larger RTTs. Queuing delay sets a lower limit to the smallest possible RTT. So, prior to the extremely low queuing delay of L4S, extreme cases of RTT dependence had never been apparent. Now that L4S has removed most of the queuing delay, we have to address the root-cause of RTT-dependence, which a Prague CCA is required to do, at least when the RTT is small (see the 'Reduced RTT bias' aspect of Req 4.3. ). Here, a small RTT is defined as below "typical RTTs expected in the intended deployment scenario". The Linux reference Prague CCA reduces RTT bias by using a virtual RTT (rtt_virt) rather than the actual smoothed RTT (srtt) for all three of: i) the period of additive window increase; ii) the EWMA update period; and iii) the duration of CWR state after a decrease. rtt_virt is calculated as a function of the actual smoothed RTT, chosen so that, when the srtt is high, the virtual RTT is essentially the same; but for lower actual RTTs, the virtual RTT is increasingly larger than the actual RTT. Example functions that implementations might use for the virtual RTT are:

rtt_virt = max(srtt, RTT_VIRT_MIN);
rtt_virt = srtt + AdditionalRTT;

where the parameters RTT_VIRT_MIN or AdditionalRTT would be set for a particular deployment environment. The Linux implementation of Prague uses the first example and for the public Internet sets RTT_VIRT_MIN=25ms. Thus, Linux Prague defines rtt_virt = max(srtt, 25ms), which addresses the Prague requirement for Reduced RTT-Dependence when the RTT is smaller than typical public Internet RTTs. As the actual window (cwnd) is still sent within 1 actual RTT, we also need to use a (conceptual) virtual window, cwnd_virt. For instance, if rtt_virt = 25 ms then, when the actual RTT is 5 ms, there are rtt_virt/srtt = 5 times more packets in cwnd_virt, than in the actual window, cwnd, because cwnd_virt spans 5 actual round trips. We define M as the ratio rtt_virt/srtt. In Reno or DCTCP, additive increase is implemented by dividing the desired increase of 1 segment per round over the cwnd packets in the round. This requires an increase of 1/cwnd per packet. In the Linux implementation of TCP Prague, the aim is to increase the reference window by 1 segment over a virtual RTT. However, in practice the increase is applied to the actual window, cwnd, which is M times smaller than cwnd_virt. So cwnd has to be increased by only 1/M segments over rtt_virt. But again, in practice, the increase is applied over an actual window of packets spanning an actual RTT, which is also M times smaller than the virtual RTT. So the desired increase in cwnd is only 1/M^2 segments over an actual round trip containing cwnd packets. Therefore, the increase in cwnd per packet has to be (1/M^2) * (1/cwnd). Unless a flow lasts long enough for rates to converge, aiming for equal rates will not be relevant. So, in the Linux implementation of Prague, the Reduced RTT-Dependence algorithm only comes into effect after D rounds, where D is configurable (current default 500). Continuing the previous example, if actual srtt=5 ms and rtt_virt = 25 ms, Prague would use the regular RTT-dependent algorithm for the first 500*5ms = 2.5s. Then it would start to converge to more equal rates using its Reduced RTT-Dependence algorithm. If the actual RTT were higher (e.g. 20ms), it would stay in the regular RTT-dependent mode for longer (500 rounds = 10s), but this would be mitigated by the actual RTT it uses at the start being closer to the virtual RTT that it eventually uses (20ms and 25ms resp.). This approach prevents reduced RTT-dependence from making the flow less responsive at start-up and ensures that its early throughput share is based on its actual RTT. The benefit is that short flows (mice) give themselves priority over longer flows (elephants), and shorter RTTs will still converge faster than longer RTTs. Nonetheless, the throughput still converges to equal rates after D rounds. It is planned to reset the algorithm to the regular RTT-dependent behaviour after an idle, not just at flow start, as discussed under Future Work in . also discusses extending the reduction in RTT-dependence to longer RTTs than RTT_VIRT_MIN (i.e. longer than 25ms). The current Linux Prague implementation does not support this.

Flow Start or Restart Currently the Linux reference implementation of TCP Prague uses the standard Linux slow start code. Slow start is exited once a single mark is detected. When other flows are actively filling the link, regular marks are expected, causing slow start of new flows to end prematurely. This is clearly not ideal, so other approaches are being worked on (see ). However, slow start has been left as the default until a properly matured solution is completed.

Packet Sending

Packet Pacing A Prague CCA MUST pace the packets it sends to avoid the queuing delay and under-utilization that would otherwise be caused by bursts of packets that can occur, for example, when restarting from idle or a jump in the acknowledgement number opens up cwnd. Prague does this in a similar way to the base Linux TCP stack, by spacing out the window of packets evenly over the round trip time, using the following calculation of the pacing rate [b/s]:

pacing_rate = MTU_BITS * max(cwnd, inflight) / srtt;

During slow start, as in the base Linux TCP stack, Prague factors up pacing_rate by 2, so that it paces out packets twice as fast as they are acknowledged. This keeps up with the doubling of cwnd, but still prevents bursts in response to any larger transient jumps in cwnd. During congestion avoidance, the Linux TCP Prague implementation does not factor up pacing_rate at all. This contrasts with the base Linux TCP stack, which currently factors up pacing_rate by a ratio parameter set to 1.2. The developers of the base Linux stack confirmed that this factor of 1.2 was only introduced in case it improved performance, but there were no scenarios where it was known to be needed. In testing of Prague, this factor was found to cause queue delay spikes whenever cwnd jumped more than usual. And throughput was no worse without it. So it was removed from the Linux TCP Prague CCA. A Prague CCA can use alternatives to the traditional slow-start algorithm that use different pacing (see ).

Segmentation Offload In the absence of hardware pacing, it becomes increasingly difficult for a machine to scale to higher flow rates unless it is allowed to send packets in larger bursts, for instance using segmentation offload. Happily, as flow rate scales up, proportionately more packets can be allowed in a burst for the same amount of queuing delay at the bottleneck. Therefore, a Prague CCA sends packets in a burst as long as it will not induce more than MAX_BURST_DELAY of queuing at the bottleneck. From this constant and the current pacing_rate, it calculates how many MTU-sized packets to allow in a burst:

max_burst = pacing_rate * MAX_BURST_DELAY / MTU_BITS

The current default in the Linux TCP Prague for MAX_BURST_DELAY is 250us which supports marking thresholds starting from about 500us without underutilization. This approach is similar to that in the Linux TCP stack, except there MAX_BURST_DELAY is 1ms.

Variants and Future Work

Getting up to Speed Faster Appendix A.2. of outlines the performance optimizations needed when transplanting DCTCP from a DC environment to a wide area network. The following subsections address two of those points: faster flow startup and faster than additive increase. Then covers the flip side, in which established flows have to yield faster to make room, otherwise queuing will result.

Flow Start (or Restart) For faster flow start, two approaches are currently being investigated in parallel:

Modified Slow Start:

The traditional exponential slow start can be modified both at the start and the end, with the aim of reducing the risk of queuing due to bursts and overshoot:

Pacing IW:: A Prague CCA can use an initial window of 10 (IW10 ), but pacing of this Initial Window is recommended to try to avoid the pulse of queuing that could otherwise occur. Pacing IW10 also spreads the ACKs over the round trip so that subsequent rounds consist of ten subsets of packets (with 2, 4, 8 etc. per round in each subset), rather than a single set with 20, 40, 80 etc. in each round. With IW paced, if a queue builds during a round (e.g. due to other unexpected traffic arriving) it can drain in the gap before the next subset, rather than the whole set backing up into a much larger queue. As smoothed RTT is unknown or inaccurate at the start of a flow, an implementation can pace IW over a fraction of the initial smoothed RTT (perhaps also clamped between hard-coded sanity limits). The implementation could also initialize SRTT with a value it had previously cached per destination (as long as it is sufficiently fresh). The safety factor could depend on whether a cached value was available and how recent it was. In the Linux reference implementation of TCP Prague, IW pacing can be optionally enabled, but it is off by default, because it is yet to be fully evaluated. It currently paces IW over half the initial smoothed round trip time (SRTT) measured during the handshake. SRTT is halved because the RTT often reduces after the initial handshake. For example: i) some CDNs move the flow to a closer server after establishment; ii) the initial RTT from a server can include the time to wake a sleeping handset radio; iii) some uplink technologies take a link-level round trip to request a scheduling slot. It is also planned to exploit any cached knowledge of the path RTT to improve the initial estimate, for instance using the Linux per-destination cache. It is also planned to allow the application to give an RTT hint (by setting sk_max_pacing_rate in Linux) if the developer has reason to believe that the application has a better estimate.
Exiting slow start more gracefully:: In the wide area Internet (in contrast to data centres), bottleneck access links tend to have much less capacity than the line rate of the sender. With a shallow immediate ECN threshold at this bottleneck, the slightest burst can tend to induce an ECN mark, which traditionally causes slow start to exit. A more gradual exit is being investigated for a Prague CCA using the extent of marking, not just the existence of a single mark. This will be more consistent with the extent-based marking that scalable congestion controls use during congestion avoidance. Delay measurements (similar to Hystart++ ) can also be used to complement the ECN signals.

Paced Chirping:

In this approach, the aim is to both increase more rapidly than exponential slow-start and to greatly reduce any overshoot. It is primarily a delay-based approach, but the aim is also to exploit ECN signals when present (while not forgetting loss either). Therefore Paced Chirping is generally usable for any congestion control - not solely for a Prague CCA and L4S. Instead of only aiming to detect capacity overshoot at the end of flow-start, brief trains of rapidly decreasing inter-packet spacing called chirps are used to test many rates with as few packets and as little load as possible. A full description is beyond the scope of this document. introduces the concepts and the code as well as citing the main papers on Paced Chirping. Paced chirping works well over continuous links such as Ethernet and DSL. But better averaging and noise filtering are necessary over discontinuous link technologies such as WiFi, LTE cellular radio, passive optical networks (PON) and data over cable (DOCSIS). This is the current focus of this work. The current Linux implementation of TCP Prague does not include Paced Chirping, but research code is available separately in Linux and ns3. it is accessible via the L4S landing page .

Faster than Additive Increase A Prague CCA has a startup phase and congestion avoidance phase like traditional CCAs. In steady-state during congestion avoidance, like all scalable congestion controls, it induces frequent ECN marks, with the same average recovery time between ECN marks, no matter how much the flow rate scales. If available capacity suddenly increases, e.g. other flow(s) depart or the link capacity increases, these regular ECN marks will stop. Therefore after a few rounds of silence (no ECN marks) in congestion avoidance phase, a Prague CCA can assume that available capacity has increased, and switch to using the techniques from its startup phase () to rapidly find the new, faster operating point. Then it can shift back into its congestion avoidance behaviour. That is the theory. But, as explained in , the startup techniques, specifically paced chirping, are still being developed for discontinuous link types. Once the startup behaviour is available, the Linux implementation of a Prague CCA will also have a faster than additive increase behaviour. S.3.2.3 of ) gives a brief preview of the performance of this approach over an Ethernet link type in ns3.

Remove Lag in Congestion Response To keep queuing delay low, new flows can only push in fast if established flows yield fast. It has recently been realized that the design of a Prague EWMA and congestion response introduces 1-2 rounds of lag (on top of the inherent round of feedback delay due to the speed of light). These lags were inherited from the design of DCTCP (see and ), where a couple of extra hundred microseconds was less noticeable. But congestion control in the wide area Internet cannot afford up to 2 rounds trips of extra lag. To be clear, lag means delay before any response at all starts. That is qualititatively different from the smoothing gain of an EWMA, which /reduces/ the response by the gain factor (1/16 by default) in case a change in congestion does not persist. Smoothing gain can always be increased. But 1-2 rounds of lag means that, when a new flow tries to push in, the sender of an established flow will not respond /at all/ for 1-2 rounds after it first receives congestion feedback. A Prague CCA spends the first round trip of this lag gathering feedback to measure frac before it is input into the EWMA algorithm (see ). Then there is up to one further round of delay because the implementations of DCTCP and Prague did not fully adopt the paradigm shift to extent-based marking - the timing of the decrease is still based on Reno. Both Reno and DCTCP/Prague respond immediately on the first sign of congestion. Reno's response is large, so it waits a round in CWR state to allow the response to take effect. DCTCP's response is tiny (extent-based), but then it still waits a round in CWR state. So it does next-to-nothing for a round. New EWMA and response algorithms to remove these 1-2 extra rounds of lag are described in . They have been implemented in Linux and an iterative process of evaluation and redesign is in progress. The EWMA is updated per-ACK, but it still changes as if it is clocked per round trip. The congestion response is still triggered by the first indication of ECN feedback, but it proceeds over the subsequent round trip so that it can take into account further incoming feedback as the EWMA evolves. The reduction is applied per-ACK but sized to result as if it had been a single response per round trip.

Combining Congestion Metrics Ultimately, it would be preferable to take an integrated approach and use a combination of ECN, loss and delay metrics to drive congestion control. For instance, using a downward trend in ECN marking and/or delay as a heuristic to temper the response to loss. Such ideas are not in the immediate plans for the Linux TCP Prague, but some more specific ideas are highlighted in the following subsections.

ECN with Loss If the bottleneck is ECN-capable, a loss due to congestion is very likely to have been preceded by a period of ECN marking. When the current Linux TCP Prague CCA detects a loss, like DCTCP, it halves cwnd, even if it has already reduced cwnd in the same round trip due to ECN marking. This double reduction can end up factoring down cwnd to as little as 1/4 in one round trip. This is not necessarily detrimental (experimentation will tell), but, if necessary, the response to loss can be factored down, so that the combination of both responses is the same as the reduction that would have occurred due to loss. Specifically, on a loss while in CWR state following an ECN reduction, for an implementation that uses Reno response, it would be possible to use a decrease factor of 1/(2-alpha), which would compound with the previous decrease factor of (1-alpha/2) to result in a factor of: (1 - alpha/2) / (2-alpha)) = 1/2. In integer arithmetic, this division would be possible but relatively expensive. A less expensive alternative would be a decrease factor of (2+alpha)/4, which approximates to a compounded decrease factor of 1/2 for typical low values of alpha, even up to 30%. The compound decrease factor is never greater than 1/2 and in the worst case, if alpha were 100%, it is 3/8. If an implementation uses a CUBIC response on loss after an ECN reduction in the same round trip, in response to both ECN and loss it would not be appropriate to always aim for a combined reduction to 70%. This is because Prague's response to ECN alone can reduce cwnd to as little as 50%, so aiming for 70% would perversely require cwnd to increase on a loss in the same round as ECN marking. Experimentation is needed, but an initial proposal would be a multiplicative decrease factor of (2+alpha)/3. This would never result in an increase on loss. It would result in a combined reduction factor of about 2/3 (i.e. almost 70%) if alpha was low, rising to a combined reduction factor of 1/2 as alpha tends towards 100%.

ECN with Delay described the plans to shift between using ECN when close to the operating point and using delay by injecting paced chirps to find a new operating after the ECN signal goes silent for a few rounds. Paced chirping shifts more slowly to the new operating point the more noise there is in the delay measurements. Work is ongoing on treating any ECN marking as a complementary metric. The resulting less noisy combined metric should then allow the controller to shift more rapidly to each new operating point. An alternative would be to combine ECN with the BBR approach, which induces a much less noisy delay signal by using less frequent but more pronounced delay spikes. The approach currently being taken is to adapt the chirp length to the degree of noise, so the chirps only become longer and/or more pronounced when necessary, for instance when faced with a discontinuous link technology such as WiFi. With multiple chirps per round, the noise can still be filtered out by averaging over them all, rather than trying to remove noise from each spike. This keeps the 'self-harm' to the minimum necessary, and ensures that capacity is always being sampled, which removes the risk of going stale.

Fall-Back on Classic ECN The implementation of the TCP Prague CCA in Linux includes an algorithm to detect a Classic ECN AQM and fall back to Reno as a result, as required by the 'Coexistence with Classic ECN' aspect of the Prague Req 4.3. . The algorithm currently used (v2) is relatively simple, but rather than describe it here, full rationale, pseudocode and explanation can be found in the technical report about it . This also includes a selection of the evaluation results and a link to visualizations of the full results online. The current algorithm nearly always detects a Classic ECN AQM, and in the majority of the wide range of scenarios tested it is good at detecting an L4S AQM. However, it wrongly identifies an L4S AQM as Classic in a significant minority of cases when the link rate is low, or the RTT is high. The report gives ideas on how to improve detection in these scenarios, but in the mean time the algorithm has been disabled by default. Recently, the report cited above has been updated to include new ideas on other ways to distinguish Classic from L4S AQMs. The interested reader can access it themselves, so this living document will not be further summarized here.

Further Reduced RTT-Dependence The algorithm to reduce RTT dependence is only relevant for long-running flows. So in the current TCP Prague implementation it remains disabled for a certain number of round trips after the start of a flow, as explained in . Instead, it would be possible to make rtt_virt gradually move from the actual RTT to the target virtual RTT, or perhaps depend on other parameters of the flow. Nonetheless, just switching in the algorithm after a number of rounds works well enough. It is planned to also disable the algorithm for a similar duration if a flow becomes idle then restarts, but this is yet to be evaluated. Prague Req 4.3. in ) only requires reduced RTT bias "in the range between the minimum likely RTT and typical RTTs expected in the intended deployment scenario". Nonetheless, in future it would be preferable to be able to reduce the RTT bias for high RTT flows as well. If a step AQM is used, the congestion episodes of flows with different RTTs tend to synchronize, which exacerbates RTT bias. To prevent this two candidate approaches will need to be investigated: i) It might be sufficient to deprecate step AQMs for L4S (they are not the preferred recommendation in ); or ii) the virtual RTT approach of might be usable for higher than typical RTTs as well as lower. In this latter case, (srtt/rtt_virt)^2 segments would need to be added to the window per actual RTT. The current TCP Prague implementation does not support this faster AI for RTTs longer than RTT_VIRT_MIN (25ms), due to the expected (but unverified) impact on latency overshoot and responsiveness.

Scaling Down to Fractional Windows A modification to v5.0 of the Linux TCP stack that scales down to sub-packet windows is available for research purposes via the L4S landing page . The L4S Prague Requirements in section 4.3 of recommend but no longer mandate scaling down to sub-packet windows. This is because becoming unresponsive at a minimum window is a tradeoff between protecting against other unresponsive flows and the extra queue you induce by becoming unresponsive yourself. So this code is not maintained as part of the Linux implementation of TCP Prague. Firstly, the stack has to be modifed to maintain a fractional congestion window. Then, because the ACK clock cannot work below 1 packet per RTT, the code sets the time to send each packet, then readjusts the timing as each ACK arrives (otherwise any queuing accumulates a burst in subsequent rounds). Also, additive increase of one segment does not scale below a 1-segment window. So instead of a constant additive increase, the code uses a logarithmically scaled additive increase that slowly adapts the additive increase constant to the slow start threshold. Despite these quite radical changes, the diff is surprisingly small. The design and implementation is explained in , which also includes evaluation results.

IANA Considerations This specification contains no IANA considerations.

Security Considerations on scaling down to fractional windows discusses the tradeoff in becoming unresponsive at a minium window, which causes a queue to build (harm to self and to others) but protects oneself against other unresponsive flows (whether malicious or accidental). This draft inherits the security considerations discussed in and in the L4S architecture . In particular, the self-interest incentive to be responsive and minimize queuing delay, and protections against those interested in disrupting the low queuing delay of others.

References Normative References Key words for use in RFCs to Indicate Requirement Levels In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. The Addition of Explicit Congestion Notification (ECN) to IP This memo specifies the incorporation of ECN (Explicit Congestion Notification) to TCP and IP, including ECN's use of two bits in the IP header. [STANDARDS-TRACK] Relaxing Restrictions on Explicit Congestion Notification (ECN) Experimentation This memo updates RFC 3168, which specifies Explicit Congestion Notification (ECN) as an alternative to packet drops for indicating network congestion to endpoints. It relaxes restrictions in RFC 3168 that hinder experimentation towards benefits beyond just removal of loss. This memo summarizes the anticipated areas of experimentation and updates RFC 3168 to enable experimentation in these areas. An Experimental RFC in the IETF document stream is required to take advantage of any of these enabling updates. In addition, this memo makes related updates to the ECN specifications for RTP in RFC 6679 and for the Datagram Congestion Control Protocol (DCCP) in RFCs 4341, 4342, and 5622. This memo also records the conclusion of the ECN nonce experiment in RFC 3540 and provides the rationale for reclassification of RFC 3540 from Experimental to Historic; this reclassification enables new experimental use of the ECT(1) codepoint. The Explicit Congestion Notification (ECN) Protocol for Low Latency, Low Loss, and Scalable Throughput (L4S) This specification defines the protocol to be used for a new network service called Low Latency, Low Loss, and Scalable throughput (L4S). L4S uses an Explicit Congestion Notification (ECN) scheme at the IP layer that is similar to the original (or 'Classic') ECN approach, except as specified within. L4S uses 'Scalable' congestion control, which induces much more frequent control signals from the network, and it responds to them with much more fine-grained adjustments so that very low (typically sub-millisecond on average) and consistently low queuing delay becomes possible for L4S traffic without compromising link utilization. Thus, even capacity-seeking (TCP-like) traffic can have high bandwidth and very low delay at the same time, even during periods of high traffic load. The L4S identifier defined in this document distinguishes L4S from 'Classic' (e.g., TCP-Reno-friendly) traffic. Then, network bottlenecks can be incrementally modified to distinguish and isolate existing traffic that still follows the Classic behaviour, to prevent it from degrading the low queuing delay and low loss of L4S traffic. This Experimental specification defines the rules that L4S transports and network elements need to follow, with the intention that L4S flows neither harm each other's performance nor that of Classic traffic. It also suggests open questions to be investigated during experimentation. Examples of new Active Queue Management (AQM) marking algorithms and new transports (whether TCP-like or real time) are specified separately. More Accurate Explicit Congestion Notification (ECN) Feedback in TCP Independent Ericsson NetApp Explicit Congestion Notification (ECN) is a mechanism where network nodes can mark IP packets instead of dropping them to indicate incipient congestion to the endpoints. Receivers with an ECN-capable transport protocol feed back this information to the sender. ECN was originally specified for TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT). Recent new TCP mechanisms like Congestion Exposure (ConEx), Data Center TCP (DCTCP) or Low Latency, Low Loss, and Scalable Throughput (L4S) need more accurate ECN feedback information whenever more than one marking is received in one RTT. This document updates the original ECN specification in RFC 3168 to specify a scheme that provides more than one feedback signal per RTT in the TCP header. Given TCP header space is scarce, it allocates a reserved header bit previously assigned to the ECN-Nonce. It also overloads the two existing ECN flags in the TCP header. The resulting extra space is exploited to feed back the IP-ECN field received during the 3-way handshake as well. Supplementary feedback information can optionally be provided in two new TCP option alternatives, which are never used on the TCP SYN. The document also specifies the treatment of this updated TCP wire protocol by middleboxes. Informative References HighSpeed TCP for Large Congestion Windows The proposals in this document are experimental. While they may be deployed in the current Internet, they do not represent a consensus that this is the best method for high-speed congestion control. In particular, we note that alternative experimental proposals are likely to be forthcoming, and it is not well understood how the proposals in this document will interact with such alternative proposals. This document proposes HighSpeed TCP, a modification to TCP's congestion control mechanism for use with TCP connections with large congestion windows. The congestion control mechanisms of the current Standard TCP constrains the congestion windows that can be achieved by TCP in realistic environments. For example, for a Standard TCP connection with 1500-byte packets and a 100 ms round-trip time, achieving a steady-state throughput of 10 Gbps would require an average congestion window of 83,333 segments, and a packet drop rate of at most one congestion event every 5,000,000,000 packets (or equivalently, at most one congestion event every 1 2/3 hours). This is widely acknowledged as an unrealistic constraint. To address his limitation of TCP, this document proposes HighSpeed TCP, and solicits experimentation and feedback from the wider community. Datagram Congestion Control Protocol (DCCP) The Datagram Congestion Control Protocol (DCCP) is a transport protocol that provides bidirectional unicast connections of congestion-controlled unreliable datagrams. DCCP is suitable for applications that transfer fairly large amounts of data and that can benefit from control over the tradeoff between timeliness and reliability. [STANDARDS-TRACK] Specifying New Congestion Control Algorithms The IETF's standard congestion control schemes have been widely shown to be inadequate for various environments (e.g., high-speed networks). Recent research has yielded many alternate congestion control schemes that significantly differ from the IETF's congestion control principles. Using these new congestion control schemes in the global Internet has possible ramifications to both the traffic using the new congestion control and to traffic using the currently standardized congestion control. Therefore, the IETF must proceed with caution when dealing with alternate congestion control proposals. The goal of this document is to provide guidance for considering alternate congestion control algorithms within the IETF. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements. TCP Congestion Control This document defines TCP's four intertwined congestion control algorithms: slow start, congestion avoidance, fast retransmit, and fast recovery. In addition, the document specifies how TCP should begin transmission after a relatively long idle period, as well as discussing various acknowledgment generation methods. This document obsoletes RFC 2581. [STANDARDS-TRACK] Explicit Congestion Notification (ECN) for RTP over UDP This memo specifies how Explicit Congestion Notification (ECN) can be used with the Real-time Transport Protocol (RTP) running over UDP, using the RTP Control Protocol (RTCP) as a feedback mechanism. It defines a new RTCP Extended Report (XR) block for periodic ECN feedback, a new RTCP transport feedback message for timely reporting of congestion events, and a Session Traversal Utilities for NAT (STUN) extension used in the optional initialisation method using Interactive Connectivity Establishment (ICE). Signalling and procedures for negotiation of capabilities and initialisation methods are also defined. [STANDARDS-TRACK] Increasing TCP's Initial Window This document proposes an experiment to increase the permitted TCP initial window (IW) from between 2 and 4 segments, as specified in RFC 3390, to 10 segments with a fallback to the existing recommendation when performance issues are detected. It discusses the motivation behind the increase, the advantages and disadvantages of the higher initial window, and presents results from several large-scale experiments showing that the higher initial window improves the overall performance of many web services without resulting in a congestion collapse. The document closes with a discussion of usage and deployment for further experimental purposes recommended by the IETF TCP Maintenance and Minor Extensions (TCPM) working group. Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings. Data Center TCP (DCTCP): TCP Congestion Control for Data Centers This Informational RFC describes Data Center TCP (DCTCP): a TCP congestion control scheme for data-center traffic. DCTCP extends the Explicit Congestion Notification (ECN) processing to estimate the fraction of bytes that encounter congestion rather than simply detecting that some congestion has occurred. DCTCP then scales the TCP congestion window based on this estimate. This method achieves high-burst tolerance, low latency, and high throughput with shallow- buffered switches. This memo also discusses deployment issues related to the coexistence of DCTCP and conventional TCP, discusses the lack of a negotiating mechanism between sender and receiver, and presents some possible mitigations. This memo documents DCTCP as currently implemented by several major operating systems. DCTCP, as described in this specification, is applicable to deployments in controlled environments like data centers, but it must not be deployed over the public Internet without additional measures. Self-Clocked Rate Adaptation for Multimedia This memo describes a rate adaptation algorithm for conversational media services such as interactive video. The solution conforms to the packet conservation principle and uses a hybrid loss-and-delay- based congestion control algorithm. The algorithm is evaluated over both simulated Internet bottleneck scenarios as well as in a Long Term Evolution (LTE) system simulator and is shown to achieve both low latency and high video throughput in these scenarios. RTP Control Protocol (RTCP) Feedback for Congestion Control An effective RTP congestion control algorithm requires more fine-grained feedback on packet loss, timing, and Explicit Congestion Notification (ECN) marks than is provided by the standard RTP Control Protocol (RTCP) Sender Report (SR) and Receiver Report (RR) packets. This document describes an RTCP feedback message intended to enable congestion control for interactive real-time traffic using RTP. The feedback message is designed for use with a sender-based congestion control algorithm, in which the receiver of an RTP flow sends back to the sender RTCP feedback packets containing the information the sender needs to perform congestion control. The RACK-TLP Loss Detection Algorithm for TCP This document presents the RACK-TLP loss detection algorithm for TCP. RACK-TLP uses per-segment transmit timestamps and selective acknowledgments (SACKs) and has two parts. Recent Acknowledgment (RACK) starts fast recovery quickly using time-based inferences derived from acknowledgment (ACK) feedback, and Tail Loss Probe (TLP) leverages RACK and sends a probe packet to trigger ACK feedback to avoid retransmission timeout (RTO) events. Compared to the widely used duplicate acknowledgment (DupAck) threshold approach, RACK-TLP detects losses more efficiently when there are application-limited flights of data, lost retransmissions, or data packet reordering events. It is intended to be an alternative to the DupAck threshold approach. QUIC: A UDP-Based Multiplexed and Secure Transport This document defines the core of the QUIC transport protocol. QUIC provides applications with flow-controlled streams for structured communication, low-latency connection establishment, and network path migration. QUIC includes security measures that ensure confidentiality, integrity, and availability in a range of deployment circumstances. Accompanying documents describe the integration of TLS for key negotiation, loss detection, and an exemplary congestion control algorithm. Stream Control Transmission Protocol This document describes the Stream Control Transmission Protocol (SCTP) and obsoletes RFC 4960. It incorporates the specification of the chunk flags registry from RFC 6096 and the specification of the I bit of DATA chunks from RFC 7053. Therefore, RFCs 6096 and 7053 are also obsoleted by this document. In addition, RFCs 4460 and 8540, which describe errata for SCTP, are obsoleted by this document. SCTP was originally designed to transport Public Switched Telephone Network (PSTN) signaling messages over IP networks. It is also suited to be used for other applications, for example, WebRTC. SCTP is a reliable transport protocol operating on top of a connectionless packet network, such as IP. It offers the following services to its users: The design of SCTP includes appropriate congestion avoidance behavior and resistance to flooding and masquerade attacks. Low Latency, Low Loss, and Scalable Throughput (L4S) Internet Service: Architecture This document describes the L4S architecture, which enables Internet applications to achieve low queuing latency, low congestion loss, and scalable throughput control. L4S is based on the insight that the root cause of queuing delay is in the capacity-seeking congestion controllers of senders, not in the queue itself. With the L4S architecture, all Internet applications could (but do not have to) transition away from congestion control algorithms that cause substantial queuing delay and instead adopt a new class of congestion controls that can seek capacity with very little queuing. These are aided by a modified form of Explicit Congestion Notification (ECN) from the network. With this new architecture, applications can have both low latency and high throughput. The architecture primarily concerns incremental deployment. It defines mechanisms that allow the new class of L4S congestion controls to coexist with 'Classic' congestion controls in a shared network. The aim is for L4S latency and throughput to be usually much better (and rarely worse) while typically not impacting Classic performance. Dual-Queue Coupled Active Queue Management (AQM) for Low Latency, Low Loss, and Scalable Throughput (L4S) This specification defines a framework for coupling the Active Queue Management (AQM) algorithms in two queues intended for flows with different responses to congestion. This provides a way for the Internet to transition from the scaling problems of standard TCP-Reno-friendly ('Classic') congestion controls to the family of 'Scalable' congestion controls. These are designed for consistently very low queuing latency, very low congestion loss, and scaling of per-flow throughput by using Explicit Congestion Notification (ECN) in a modified way. Until the Coupled Dual Queue (DualQ), these Scalable L4S congestion controls could only be deployed where a clean-slate environment could be arranged, such as in private data centres. This specification first explains how a Coupled DualQ works. It then gives the normative requirements that are necessary for it to work well. All this is independent of which two AQMs are used, but pseudocode examples of specific AQMs are given in appendices. HyStart++: Modified Slow Start for TCP This document describes HyStart++, a simple modification to the slow start phase of congestion control algorithms. Slow start can overshoot the ideal send rate in many cases, causing high packet loss and poor performance. HyStart++ uses increase in round-trip delay as a heuristic to find an exit point before possible overshoot. It also adds a mitigation to prevent jitter from causing premature slow start exit. CUBIC for Fast and Long-Distance Networks CUBIC is a standard TCP congestion control algorithm that uses a cubic function instead of a linear congestion window increase function to improve scalability and stability over fast and long-distance networks. CUBIC has been adopted as the default TCP congestion control algorithm by the Linux, Windows, and Apple stacks. This document updates the specification of CUBIC to include algorithmic improvements based on these implementations and recent academic work. Based on the extensive deployment experience with CUBIC, this document also moves the specification to the Standards Track and obsoletes RFC 8312. This document also updates RFC 5681, to allow for CUBIC's occasionally more aggressive sending behavior. ECN++: Adding Explicit Congestion Notification (ECN) to TCP Control Packets Universidad Carlos III de Madrid Independent This document specifies an experimental modification to ECN when used with TCP. It allows the use of ECN in the IP header of the following TCP packets: SYNs, SYN/ACKs, pure ACKs, Window probes, FINs, RSTs and retransmissions. This specification obsoletes RFC5562, which described a different way to use ECN on SYN/ACKs alone. Implementing the `TCP Prague' Requirements for Low Latency Low Loss Scalable Throughput (L4S) Independent Nokia Bell Labs Simula Research Lab Simula Research Lab Nokia Bell Labs ETH Zurich Simula Research Lab Improving DCTCP/Prague Congestion Control Responsiveness Independent tcp: allow dctcp alpha to drop to zero tcp: Ensure DCTCP reacts to losses Resolving Tensions between Congestion Control Scaling Requirements L4S: Ultra-Low Queuing Delay for All Linux kernel tree with L4S patches Paced Chirping - Rethinking TCP start-up University of Oslo Independent TCP Prague Fall-back on Detection of a Classic ECN AQM Independent Simula and Uni Oslo Extending TCP for Low Round Trip Delay Simula and Uni Oslo Disentangling Flaws in Linux DCTCP

Comments and Contributions Solicited Comments and questions are encouraged and very welcome. They can be addressed to the IRTF Internet Congestion Control Research Group's mailing list <iccrg@irtf.org>, and/or to the authors via <draft-briscoe-iccrg-congestion-control@ietf.org>. Contributions of design ideas and/or code are also encouraged and welcome. During the drafting process, the intention is to gather experience into this document from a wider set of Prague congestion control implementations.

Acknowledgements Thanks to Neal Cardwell for review comments. The following contributed implementations and evaluations that validated and helped to improve this specification:

Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> of Nokia Bell Labs, Belgium, prepared and maintains the Linux implementation of TCP Prague.
Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> of Nokia Bell Labs, Belgium, contributed to the Linux implementation of TCP Prague.
Joakim Misund <joakim.misund@gmail.com> of Uni Oslo, Norway, wrote the Linux paced chirping code.
Asad Sajjad Ahmed <me@asadsa.com>, Independent, Norway, wrote the Linux code that maintains a sub-packet window.
Vidhi Goel <vidhi_goel@apple.com> of Apple Inc, Cupertino, wrote and maintains the Apple implmentation of QUIC Prague.

Bob Briscoe's contribution was part-funded by the Comcast Innovation Fund and part-funded by Apple Inc. The views expressed here are solely those of the authors.