Re: [tcpm] CUBIC rfc8312bis / WGLC Issue 2

Markku Kojo <kojo@cs.helsinki.fi> Fri, 29 July 2022 01:31 UTC

Date: Fri, 29 Jul 2022 04:30:48 +0300
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Michael Welzl <michawe@ifi.uio.no>
cc: Yoshifumi Nishida <nsd.ietf@gmail.com>, "tcpm@ietf.org Extensions" <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/lZ5hoUZYuHtE-_BpC6m8EliOXrY>
Subject: Re: [tcpm] CUBIC rfc8312bis / WGLC Issue 2

Hi Michael, all,

On Wed, 20 Jul 2022, Michael Welzl wrote:

> Hi all,
> That strong assumption about cwnd is in CA, but I believe Markku’s concern is about the back-off right after
> SS.

Exactly. A loss signalled in slow start means that the current 
cwnd (and flightsize (*)) is (nearly) double the available bandwidth 
because signalling the loss takes an RTT during which the cwnd (nearly) 
doubles.

(*) assuming the flow is not application limited during the last RTT.
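
To make that concrete, a rough back-of-the-envelope sketch (plain
Python; the capacity figure is hypothetical and I assume the standard
one-segment-per-ACK slow start of RFC 5681):

    # Why a loss signalled in slow start implies cwnd ~ 2x capacity:
    # the loss takes one RTT to be signalled, and during that RTT each
    # returning ACK still grows cwnd by one segment.
    capacity = 100          # available capacity in packets per RTT (assumed)
    cwnd = 100              # cwnd has just reached the capacity
    cwnd += capacity        # ~capacity ACKs arrive while the loss signal is in transit
    print(cwnd / capacity)  # -> 2.0: cwnd at loss detection is ~2x capacity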

> Now, Markku’s calculations (about cwnd overshoot) are all correct if everything is paced (which is why the
> bbrv2 strategy mentioned by neal sounds just right!), but without pacing, the point of first loss depends on
> the burst size (of all flows) vs the bottleneck queue length, and as a result, cubic (or reno, for that
> matter) may exit SS much earlier than what Markku describes, I think. Hence somehow being bursty “helps” a
> little here.

I believe you mean when the network paces the packets, i.e., when the 
packet delivery rate is more or less fixed such that Acks arrive nicely 
paced?

If the sender does the pacing, that should help only during the initial 
RTT, since once the bottleneck is fully utilized, packets are 
automatically paced (if the network delivers packets nicely at a fixed 
rate).

Sure, I think the exit may come either earlier or later! A large enough 
back-to-back burst with a shallow bottleneck queue may well have a 
notable effect and make the SS exit occur somewhat earlier.

However, if under deterministic network conditions a sender injects 100% 
more packets during the last RTT of SS than the available bandwidth 
allows, that results in 40% overload (undelivered packets) during the 
RTT following the SS (with Beta=0.7). Now, having zero undelivered 
packets with Beta=0.7 in slow start would require that the SS exit 
occurs before the sender has injected more than 43% beyond the available 
bandwidth. Otherwise, it will inject at least some undelivered packets. 
Sending any undelivered packets is inadvisable.
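
Spelling the arithmetic out (a small sketch under the same deterministic 
assumptions; not measurement data):

    # Post-decrease overload as a function of the decrease factor Beta,
    # assuming cwnd at loss detection = overshoot * available capacity.
    def overload(beta, overshoot=2.0):
        """Fraction of post-decrease packets that exceed capacity."""
        return max(0.0, beta * overshoot - 1.0)

    print(round(overload(0.5), 2))  # -> 0.0 : cut lands exactly on capacity
    print(round(overload(0.7), 2))  # -> 0.4 : 40% undelivered packets
    # Zero overload with Beta=0.7 needs overshoot <= 1/0.7, i.e., the SS
    # exit must come before ~43% beyond the available bandwidth:
    print(round(1 / 0.7, 2))        # -> 1.43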

It would be nice to see measurement data from various networks 
showing how much the SS exit point varies in reality.

Also, we should remember that there certainly are network paths that 
represent the deterministic/paced behaviour and CC algos must work 
appropriately in a wide range of environments.

On the other hand, most (if not all) AQMs are very sluggish in reacting 
to the rapid ramp-up in slow start. They often react only after the 
sender has increased its cwnd well beyond the average dropping (marking) 
point, i.e., the actual cwnd at the time the congestion is signalled is 
often (much) more than double the (average) saturation point. And most 
of these excess packets are queued because there are only a few losses 
(or just marks) until the physical queue is exhausted. CUBIC specifies 
Beta = 0.7 also when in slow start and ECN is enabled, while ABE 
(RFC 8511) does not specify a larger Beta in slow start, only in CA. I 
couldn't access the papers ABE cites, but I believe a larger Beta in 
slow start resulted in a prolonged delay peak after slow start, 
indicating exactly the same overload that results in undelivered packets 
in a tail-drop queue? With Beta=0.7 during SS and an AQM at the 
bottleneck, we are likely to see longer delay spikes due to slow-start 
overshoot?

Maybe Michael has some insights into the results/reasons behind the ABE 
decision? Nevertheless, what is the justification for CUBIC to use 
Beta=0.7 also in slow start with ECN enabled, while ABE does not?
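
In other words, the decrease rule I am arguing for is simply the 
following (a sketch of the proposal in this thread, not text from 
rfc8312bis):

    # Use Jacobson's factor 0.5 when the lost/marked packet was sent in
    # slow start, and CUBIC's 0.7 only for congestion avoidance.
    def md_factor(in_slow_start: bool) -> float:
        return 0.5 if in_slow_start else 0.7

    def on_congestion_event(cwnd: float, in_slow_start: bool) -> float:
        return cwnd * md_factor(in_slow_start)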

> Not sure what recommendation to take from this….. but I think it’s why the current 0.7 choice often works
> reasonably well in practice.

Do we possibly have any measurement data to back up the claim that the 
"0.7 choice often works reasonably well in practice"?

Cheers,

/Markku

> Cheers,
> Michael
> 
>       On 20 Jul 2022, at 08:39, Yoshifumi Nishida <nsd.ietf@gmail.com> wrote:
>
>       Hi Markku, folks,
> 
> In my understanding, compared to Reno, Cubic makes a strong assumption that the last cwnd which caused
> packet loss is more or less close to the available bandwidth.
> If this assumption is correct, it can utilize bandwidth efficiently. 
> However, if the assumption deviates from the actual value, it may create more packet losses than Reno.
> As a result, Cubic may suffer poor performance in this case.
> I don't believe it leads to congestion collapse, although b=0.7 may result in slower convergence than
> Reno. But I think it's part of the design choice.
> I agree the 40% overshooting case is an unfortunate one, but I am not sure we should adjust the entire
> design to this kind of case. At least for now.
> 
> Thanks,
> --
> Yoshi
> 
> On Tue, Jul 19, 2022 at 5:16 PM Markku Kojo <kojo@cs.helsinki.fi> wrote:
>       Hi Yoshi, all,
>
>       On Tue, 19 Jul 2022, Yoshifumi Nishida wrote:
>
>       > Hi folks, 
>       > I think I understand this issue, but I'm personally not sure how bad this is.
>       > Because this looks like a rather pathological case to me; also, I don't think this can
>       > cause congestion collapse, as this is still a multiplicative decrease.
>
>       It is multiplicative decrease from a FlightSize (cwnd) that is double the
>       available network capacity. If the decrease factor is 0.5, we end up
>       sending at exactly the full rate the network allows, i.e., there is no
>       unused capacity. Decrease factor 0.7 means that the flow is
>       effectively unresponsive for 40% of the packets it injects after the cwnd
>       decrease.
>
>       If you consider a congested scenario where new flows start up
>       continuously (e.g., a large number of Web users sharing a heavily
>       congested bottleneck router), it resembles a situation where the flows do
>       not appropriately react to congestion but keep on sending up to 40% of
>       undelivered packets. Congestion collapse does not necessarily mean full
>       (100%) collapse but several different degrees of congestion collapse are
>       possible (pls. see the description of undelivered packets in RFC 2914).
>       In this case we may see up to 40% congestion collapse because these
>       undelivered packets eat up useful capacity from other users.
>
>       The question to answer is: what sense does it make and what is the
>       justification for a flow to inject that many packets into the network
>       unnecessarily knowing that they will get dropped (or cause drops for
>       other flows) with a tail-drop bottleneck router?
>
>       Apologies for the strong words but for me this would be insane design.
>
>       > It seems to me that this is a kind of shooting in the foot, a suboptimal case.
>       > However, there are some advantages in the current logic.
>
>       Could you possibly elaborate?
>
>       > I'm not very sure if we should sacrifice better results to address some rare cases.
>       > I think we will need more analysis of the pros and cons for this.
>
>       I don't think this could be considered a rare (or corner) case as this
>       occurs potentially every time a flow starts and every time a flow
>       encounters RTO with sudden congestion (the latter is very bad because
>       in front of heavy congestion it is extremely important that every flow
>       reacts appropriately).
>
>       Thanks,
>
>       /Markku
>
>       > Thanks,
>       > --
>       > Yoshi
>       >
>       > On Wed, Jul 13, 2022 at 7:17 AM Neal Cardwell <ncardwell@google.com> wrote:
>       >       Hi Markku and TCPMers,
>       >
>       > My understanding of Markku's concern here is that in slow start the cwnd can continue
>       > to grow in response to ACKs after the lost packet was sent, so that the cwnd is often
>       > twice the level of in-flight data at which the loss happened, by the time the loss is
>       > detected. So the cwnd ends up at 2 * 0.7 = 1.4x the level at which losses happened,
>       > which causes an unnecessary follow-on round with losses, in order to again cut the
>       > cwnd, this time to 1.4 * 0.7 = 0.98x of the level that causes losses, which is likely
>       > to finally fit in the network path.
>       >
>       > However, there are two technical issues with this concern, as expressed in the
>       > proposed draft text in this thread:
>       >
>       > (1) The analysis for slow-start is not correct for the very common case where the
>       > flow is application-limited in slow-start, in which case the cwnd would not grow at
>       > all between the packet loss and the time the loss is detected. So the text is
>       > needlessly strict in this case.
>       >
>       > (2) For CUBIC the problematic dynamic (of cwnd growth between loss and loss detection
>       > exceeding the multiplicative decrease) can also occur outside of slow-start, in
>       > congestion avoidance. The CUBIC cwnd growth in congestion avoidance can be up to 1.5x
>       > per round trip. So after a packet loss the cwnd could grow by 1.5x before loss
>       > detection and then be cut in response to loss by 0.7, causing the ultimate cwnd to be
>       > 1.5 * 0.7 = 1.05x the volume of in-flight data at the time of the packet loss. This
>       > would likely cause an unnecessary follow-on round of packet loss due to failing to
>       > cut cwnd below the level that caused loss. So the problem is actually wider than
>       > slow-start.
>       >
>       > AFAICT a complete/general fix for this issue is best solved by recording the volume
>       > of inflight data at the point of each packet transmission, and then using that metric
>       > as the baseline for the multiplicative decrease when packet loss is detected, rather
>       > than using the current cwnd as the baseline. This is the approach that BBRv2 uses.
>       > Perhaps there are other, simpler approaches as well.
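
(For concreteness, the per-transmission baseline Neal describes might 
look roughly like the sketch below; the names are mine and hypothetical, 
not actual BBRv2 code. The same baseline would also cover the 
congestion-avoidance case above, where cwnd can grow 1.5x between the 
loss and its detection.)

    from typing import Dict, Optional

    # Remember the volume of in-flight data at each transmission and cut
    # from that recorded value, not from the current cwnd, on loss.
    inflight_at_send: Dict[int, int] = {}   # packet seq -> inflight when sent

    def on_send(seq: int, inflight_now: int) -> None:
        inflight_at_send[seq] = inflight_now

    def on_loss_detected(seq: int, beta: float = 0.7) -> Optional[float]:
        baseline = inflight_at_send.pop(seq, None)
        # New cwnd from the send-time baseline (None if seq is unknown):
        return None if baseline is None else baseline * beta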
>       >
>       > I also agree with Vidhi's concern that a change to the multiplicative decrease
>       > changes the algorithm substantially. To ensure that the draft/RFC is not recommending
>       > something that has unforeseen significant negative consequences, we shouldn't make
>       > such a significant change to the text until we get experience w/ the new variation.
>       >
>       > best regards,
>       > neal
>       >
>       >
>       > On Tue, Jul 12, 2022 at 6:08 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>       >       Hi Markku,
>       >
>       >       I emailed about this to other co-authors and we think that this change is completely
>       >       untested for Cubic and we think that this could be considered for a future
>       >       version of Cubic, not the current rfc8312bis.
>       >       To change Beta from 0.7 to 0.5 during slow-start, we would at least need some
>       >       experience either from lab testing or deployment, since all current deployments
>       >       of Cubic for both TCP and QUIC use 0.7 as Beta during slow start. Since a lot
>       >       of implementations currently use hystart(++) along with Cubic, we don't see any
>       >       high risk of an overaggressive sending rate, and that is what the current
>       >       rfc8312bis suggests as well. In fact, changing Beta from 0.7 to 0.5 can still
>       >       be aggressive without using hystart.
>       >
>       >       Thanks,
>       >       Vidhi
>       >
>       >       > On Jul 11, 2022, at 5:55 PM, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>       >       >
>       >       > Hi all,
>       >       >
>       >       > below please find proposed text to solve the Issue 2 a). I will propose text
>       >       > to solve 2 b) once we have come to a conclusion with 2 a). For description
>       >       > and arguments for issues 2 a) and 2 b), please see the original issue
>       >       > descriptions below.
>       >       >
>       >       > Sec 4.6. Multiplicative Decrease
>       >       >
>       >       > Old:
>       >       >   The parameter Beta_cubic SHOULD be set to 0.7, which is different
>       >       >   from the multiplicative decrease factor used in [RFC5681] (and
>       >       >   [RFC6675]) during fast recovery.
>       >       >
>       >       >
>       >       > New:
>       >       >   If the sender is not in slow start when the congestion event is
>       >       >   detected, the parameter Beta_cubic SHOULD be set to 0.7, which
>       >       >   is different from the multiplicative decrease factor used in
>       >       >   [RFC5681] (and [RFC6675]).
>       >       >   This change is justified in the Reno-friendly region during
>       >       >   congestion avoidance because a CUBIC sender compensates for its
>       >       >   multiplicative decrease factor being higher than that of Reno
>       >       >   by applying a lower additive increase factor during congestion
>       >       >   avoidance.
>       >       >
>       >       >   However, if the sender is in slow start when the congestion event
>       >       >   is detected, the parameter Beta_cubic MUST be set to 0.5 [Jacob88].
>       >       >   This results in the sender continuing to transmit data at the
>       >       >   maximum rate that the slow start determined to be available for
>       >       >   the flow. Using Beta_cubic with a value larger than 0.5 when the
>       >       >   congestion event is detected in slow start would result in an
>       >       >   overaggressive send rate where the sender injects excess packets
>       >       >   into the network and each such packet is guaranteed to be dropped,
>       >       >   or to force a packet from a competing flow to be dropped, at a
>       >       >   tail-drop bottleneck router. Furthermore, injecting such
>       >       >   undelivered packets creates a danger of congestion collapse (of
>       >       >   some degree) "by delivering packets through the network that are
>       >       >   dropped before reaching their ultimate destination." [RFC 2914]
>       >       >
>       >       >
>       >       >   [Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.
>       >       >
>       >       > Thanks,
>       >       >
>       >       > /Markku
>       >       >
>       >       > On Tue, 14 Jun 2022, Markku Kojo wrote:
>       >       >
>       >       >> Hi all,
>       >       >>
>       >       >> this thread starts the discussion on the issue 2: CUBIC is specified to use
>       >       >> an incorrect multiplicative-decrease factor for a congestion event that
>       >       >> occurs when operating in slow start. And applying HyStart++ does not remove
>       >       >> the problem, it only mitigates it in some percentage of cases.
>       >       >>
>       >       >> I think it is useful to discuss this in two phases: 2 a) and 2 b) below.
>       >       >> For anyone commenting/arguing on the part 2 b), it is important to first
>       >       >> acknowledge whether (s)he thinks the original design and logic by Van
>       >       >> Jacobson is correct. If not, one should explain why Van's design logic is
>       >       >> incorrect.
>       >       >>
>       >       >> Issue 2 a)
>       >       >> ----------
>       >       >>
>       >       >> To begin with, let's put aside a potential use of HyStart++ (also assume a
>       >       >> tail-drop router unless otherwise mentioned).
>       >       >>
>       >       >> The use of an MD factor larger than 0.5 is against the theory and original
>       >       >> design by Van Jacobson as explained in the congavoid paper [Jacob88]. Any MD
>       >       >> factor value larger than 0.5 will result in sending extra packets during
>       >       >> Fast Recovery following the congestion event (drop). All extra packets will
>       >       >> get dropped at a tail-drop bottleneck (if the flow is alone).
>       >       >>
>       >       >> Note that at the time when the drop becomes signalled at the TCP sender,
>       >       >> the size of the cwnd is double the available network capacity that slow
>       >       >> start determined for the flow. That is, using MD=0.5 is already as
>       >       >> aggressive as possible, leaving no slack. Therefore, if MD=0.7 is used, the
>       >       >> TCP sender enters fast recovery with a cwnd that is 40% larger than the
>       >       >> determined network capacity, and all excess packets are guaranteed to get
>       >       >> dropped or, even worse, likely to force packets of competing flows to be
>       >       >> unfairly dropped.
>       >       >>
>       >       >> Moreover, if NewReno loss recovery is in use, a CUBIC sender will
>       >       >> operate overaggressively for a very long time. For example, if the
>       >       >> available network capacity for the flow is 100 packets, cwnd will have
>       >       >> value 200 when the congestion is signalled and the CUBIC sender enters
>       >       >> fast recovery with cwnd=140 and injects 40 excess packets for each of
>       >       >> the subsequent 100 RTTs it stays in fast recovery, forcing 4000 packets
>       >       >> to be inevitably and totally unnecessarily dropped.
>       >       >>
>       >       >> Even worse, this behaviour of sending 'undelivered packets' is against
>       >       >> the congestion control principles as it creates a danger of congestion
>       >       >> collapse (of some degree) "by delivering packets through the network
>       >       >> that are dropped before reaching their ultimate destination." [RFC 2914]
>       >       >>
>       >       >> Such undelivered packets unnecessarily eat capacity from other flows
>       >       >> sharing the path before the bottleneck.
>       >       >>
>       >       >> RFC 2914 emphasises:
>       >       >>
>       >       >> "This is probably the largest unresolved danger with respect to
>       >       >> congestion collapse in the Internet today."
>       >       >>
>       >       >> It is very easy to envision a realistic network setup where this creates a
>       >       >> degree of congestion collapse in which a notable portion of useful network
>       >       >> capacity is wasted due to the undelivered packets.
>       >       >>
>       >       >>
>       >       >> [Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.
>       >       >>
>       >       >>
>       >       >> Issue 2 b)
>       >       >> ----------
>       >       >>
>       >       >> The CUBIC draft suggests that HyStart++ should be used *everywhere* instead
>       >       >> of the traditional Slow Start (see section 4.10).
>       >       >>
>       >       >> Although the draft does not say it, seemingly the authors suggest using
>       >       >> HyStart++ instead of traditional Slow Start in order to avoid the problem of
>       >       >> over-aggressive behaviour discussed above. This, however, has several
>       >       >> issues.
>       >       >>
>       >       >> First, it is directly in conflict with the HyStart++ specification, which
>       >       >> says that HyStart++ should be used only for the initial Slow Start. However,
>       >       >> the overaggressive behaviour after slow start is also a potential problem
>       >       >> with slow start during an RTO recovery; in case of sudden congestion that
>       >       >> reduces the available capacity for a flow down to a fraction of the
>       >       >> currently available capacity, it is very likely that an RTO occurs. In such
>       >       >> a case the RTO recovery in slow start inevitably overshoots, and it is
>       >       >> crucial for all flows not to be overaggressive.
>       >       >>
>       >       >> Second, the experimental results for initial slow start in the HyStart++
>       >       >> draft suggest that while HyStart++ achieves good results, it is unable to
>       >       >> exit slow start early and avoid overshoot in a significant percentage of
>       >       >> cases.
>       >       >>
>       >       >> Given the above issues, the CUBIC draft must require that an MD of 0.5 is
>       >       >> used when the congestion event occurs while the sender is (still) in slow
>       >       >> start. The use of an MD other than 0.5 was an obvious stumble in the
>       >       >> original CUBIC, and the original CUBIC authors have already acknowledged
>       >       >> this. It also seems obvious that instead of correcting the actual problem
>       >       >> (the use of an MD other than 0.5), HyStart and HyStart++ have been proposed
>       >       >> to address the design mistake. While HyStart++ is a useful method also when
>       >       >> used with MD=0.5, when used alone it only mitigates the impact of the actual
>       >       >> problem rather than solving it.
>       >       >>
>       >       >> What should be done for the cases where HyStart++ exits slow start but is
>       >       >> not able to avoid (some level of) overshoot and dropped packets is IMO an
>       >       >> open issue. Resolving it requires additional experiments, and it should be
>       >       >> resolved separately when we have more data. For now, when we do not have
>       >       >> enough data and understanding of the behaviour, we should IMO follow the
>       >       >> general IETF guideline "be conservative in what you send" and specify that
>       >       >> MD = 0.5 should be used for a congestion event that occurs for a packet
>       >       >> sent in slow start.
>       >       >>
>       >       >> Thanks,
>       >       >>
>       >       >> /Markku
>       >       >>
>       >       >