Re: [tcpm] CUBIC rfc8312bis / WGLC Issue 2

Markku Kojo <kojo@cs.helsinki.fi> Fri, 15 July 2022 00:42 UTC

Date: Fri, 15 Jul 2022 03:41:55 +0300
From: Markku Kojo <kojo@cs.helsinki.fi>
To: Neal Cardwell <ncardwell=40google.com@dmarc.ietf.org>
cc: Vidhi Goel <vidhi_goel@apple.com>, Yoshifumi Nishida <nsd.ietf@gmail.com>, "tcpm@ietf.org Extensions" <tcpm@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tcpm/iLM5t4KJnIy5YvgzdS06w9XTOqg>

Hi Neal,

Thank you for your comments and constructive proposal. Please see below.

On Wed, 13 Jul 2022, Neal Cardwell wrote:

> Hi Markku and TCPMers,
> 
> My understanding of Markku's concern here is that in slow start the cwnd can continue to grow in response to
> ACKs after the lost packet was sent, so that the cwnd is often twice the level of in-flight data at which the
> loss happened, by the time the loss is detected. So the cwnd ends up at 2 * 0.7 = 1.4x the level at which
> losses happened, which causes an unnecessary follow-on round with losses, in order to again cut the cwnd, this
> time to 1.4 * 0.7 = 0.98x of the level that causes losses, which is likely to finally fit in the network path.

That quite correctly captures my concern.

Note also that when SACK is not enabled and a TCP sender employs NewReno 
loss recovery during the fast recovery phase, it injects packets with this 
1.4x cwnd for roughly as many RTTs as there were MSS-sized segments 
in flight at the time the loss happened. E.g., if cwnd was 100 MSS at the 
time the loss happened, the sender enters fast recovery with cwnd=140 MSS 
and injects 40 excess packets on each of the ~100 recovery RTTs, i.e., 
40 x 100 = 4000 undelivered packets during a successful fast recovery 
(one with no RTO). That is a notably long period of time and a 
significant number of undelivered packets.
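To make the arithmetic above concrete, here is a small illustrative sketch (not from any implementation; the capacity and beta values are just the ones assumed in this discussion):

```python
# Overshoot when a loss in slow start is detected one RTT after it
# happened (cwnd has doubled past the loss point by then) and cwnd is
# then cut by the multiplicative-decrease factor beta.

def recovery_overshoot(capacity, beta):
    """Return (cwnd entering recovery, excess packets per RTT,
    total excess over a NewReno recovery of ~capacity RTTs)."""
    cwnd_at_detection = 2 * capacity          # slow start doubled past the loss point
    cwnd_in_recovery = beta * cwnd_at_detection
    excess_per_rtt = cwnd_in_recovery - capacity
    recovery_rtts = capacity                  # roughly one RTT per segment in flight
    return cwnd_in_recovery, excess_per_rtt, excess_per_rtt * recovery_rtts

print(recovery_overshoot(100, 0.7))   # cwnd ~140, ~40 excess/RTT, ~4000 total
print(recovery_overshoot(100, 0.5))   # cwnd ~100, no excess packets
```

With beta=0.5 the sender re-enters transmission exactly at the capacity slow start found; with beta=0.7 every one of the ~4000 excess packets is undelivered.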

> However, there are two technical issues with this concern, as expressed in the proposed draft text in this
> thread:
> 
> (1) The analysis for slow-start is not correct for the very common case where the flow is application-limited
> in slow-start, in which case the cwnd would not grow at all between the packet loss and the time the loss is
> detected. So the text is needlessly strict in this case.

Sure. Thanks for pointing this out. This should be addressed in one way or 
another, but at the same time the requirement must remain correct for 
flows that are not application-limited.

> (2) For CUBIC the problematic dynamic (of cwnd growth between loss and loss detection exceeding the
> multiplicative decrease) can also occur outside of slow-start, in congestion avoidance. The CUBIC cwnd growth
> in congestion avoidance can be up to 1.5x per round trip. So after a packet loss the cwnd could grow by 1.5x
> before loss detection and then be cut in response to loss by 0.7, causing the ultimate cwnd to be 1.5 * 0.7 =
> 1.05x the volume of in-flight data at the time of the packet loss. This would likely cause an unnecessary
> follow-on round of packet loss due to failing to cut cwnd below the level that caused loss. So the problem is
> actually wider than slow-start.

Yes, the problem in CA is milder but may essentially result in injecting 
undelivered packets in a similar way. To my understanding this would 
often be the case when a flow probes for more capacity in the convex 
region. A typical CA cycle that ends somewhere around the plateau should 
not be injecting packets at a rate as high as 1.5x. Anyway, the problem 
exists and should IMO also be addressed.
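The congestion-avoidance variant of the same arithmetic can be sketched the same way (the 1.5x growth figure is Neal's upper bound for the convex region, not a typical cycle):

```python
# cwnd after the multiplicative decrease, relative to the in-flight
# level that actually caused the loss, when cwnd kept growing by
# `growth` in the round trip between the loss and its detection.

def ca_cwnd_after_cut(cwnd_at_loss, growth=1.5, beta=0.7):
    return cwnd_at_loss * growth * beta

print(ca_cwnd_after_cut(100))            # ~105: still above the loss level
print(ca_cwnd_after_cut(100, beta=0.5))  # ~75: safely below it
```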

> AFAICT a complete/general fix for this issue is best solved by recording the volume of inflight data at the
> point of each packet transmission, and then using that metric as the baseline for the multiplicative decrease
> when packet loss is detected, rather than using the current cwnd as the baseline. This is the approach that
> BBRv2 uses. Perhaps there are other, simpler approaches as well.

This sounds like a more accurate and better way than the original approach 
of using the current cwnd as the baseline (which, of course, is simpler as 
it does not require keeping per-packet state), and it is likely to work 
out nicely with application-limited slow start as well. In addition, it 
would also seem to work fine with HyStart++ when HyStart++ is not able to 
avoid all losses.
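A rough sketch of the approach Neal describes (this is only my reading of it, not BBRv2's actual code; all names here are made up for illustration):

```python
# Record the volume of in-flight data at each packet transmission and
# base the multiplicative decrease on the value recorded for the lost
# packet, rather than on whatever cwnd has grown to since then.

class Sender:
    def __init__(self, cwnd):
        self.cwnd = cwnd
        self.bytes_in_flight = 0
        self.inflight_at_send = {}   # packet number -> in-flight volume at send time

    def on_send(self, pkt_num, size):
        self.bytes_in_flight += size
        self.inflight_at_send[pkt_num] = self.bytes_in_flight

    def on_loss_detected(self, lost_pkt_num, beta=0.7):
        # Fall back to the current cwnd if no record exists for the packet.
        baseline = self.inflight_at_send.get(lost_pkt_num, self.cwnd)
        self.cwnd = beta * baseline
        return self.cwnd

# Loss happened at 100 units in flight; cwnd has since grown to 200.
s = Sender(cwnd=200)
s.on_send(1, 100)
print(s.on_loss_detected(1))   # ~70, not 0.7 * 200 = 140
```

This naturally handles the application-limited case too: if cwnd did not grow after the loss, the recorded in-flight volume and the current cwnd give essentially the same baseline.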

> I also agree with Vidhi's concern, that a change to the multiplicative decrease changes the algorithm
> substantially. To ensure that the draft/RFC is not recommending something that has unforeseen significant
> negative consequences, we shouldn't make such a significant change to the text until we get experience w/ the
> new variation.

I fully understand that people maintaining an implementation in their 
stack, possibly running various important services, need to test and 
check any change.

However, we are writing a standards document to give requirements and 
advice to any potential implementer, not documenting what current 
implementations do. Changing the current text for Beta from 0.7 to 0.5 for 
a congestion event during slow start would result in exactly the same 
behaviour (until the point where the subsequent loss recovery ends, or 
with ECN up to the beginning of the subsequent CA cycle) that we have had 
tons of experience with in TCP for over three decades. The behaviour is 
very well understood, extensively tested and reported, and it is what our 
current draft standard for congestion control specifies. So what might be 
the unforeseen significant negative consequences that the extensive 
studies and wide deployment experience of over three decades have not 
brought up?

Changing or deviating from the current draft-standard behaviour (to Beta 
= 0.7) would require extensive evidence that it is safe. We have 
none. Instead, it is extremely obvious that using Beta = 0.7 will result 
in unwanted behaviour that has been clearly flagged as dangerous in RFC 
2914.

Thanks,

/Markku

> best regards,
> neal
> 
> 
> On Tue, Jul 12, 2022 at 6:08 PM Vidhi Goel <vidhi_goel=40apple.com@dmarc.ietf.org> wrote:
>       Hi Markku,
>
>       I emailed about this to the other co-authors and we think that this change is completely
>       untested for Cubic and could be considered for a future version of Cubic, not the current
>       rfc8312bis.
>       To change Beta from 0.7 to 0.5 during slow-start, we would at least need some experience either
>       from lab testing or deployment since all current deployments of Cubic for both TCP and QUIC use
>       0.7 as Beta during slow start. Since a lot of implementations currently use hystart(++) along with
>       Cubic, we don’t see any high risk of overaggressive sending rate and that is what the current
>       rfc8312bis suggests as well. In fact, changing Beta from 0.7 to 0.5 can still be aggressive
>       without using hystart.
>
>       Thanks,
>       Vidhi
>
>       > On Jul 11, 2022, at 5:55 PM, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
>       >
>       > Hi all,
>       >
>       > below please find proposed text to solve the Issue 2 a). I will propose text to solve 2 b) once
>       we have come to conclusion with 2 a). For description and arguments for issues 2 a) and 2 b),
>       please see the original issue descriptions below.
>       >
>       > Sec 4.6. Multiplicative Decrease
>       >
>       > Old:
>       >   The parameter Beta_cubic SHOULD be set to 0.7, which is different
>       >   from the multiplicative decrease factor used in [RFC5681] (and
>       >   [RFC6675]) during fast recovery.
>       >
>       >
>       > New:
>       >   If the sender is not in slow start when the congestion event is
>       >   detected, the parameter Beta_cubic SHOULD be set to 0.7, which
>       >   is different from the multiplicative decrease factor used in
>       >   [RFC5681] (and [RFC6675]).
>       >   This change is justified in the Reno-friendly region during
>       >   congestion avoidance because a CUBIC sender compensates for its
>       >   higher multiplicative decrease factor than that of Reno by
>       >   applying a lower additive increase factor during congestion
>       >   avoidance.
>       >
>       >   However, if the sender is in slow start when the congestion event is
>       >   detected, the parameter Beta_cubic MUST be set to 0.5 [Jacob88].
>       >   This results in the sender continuing to transmit data at the maximum
>       >   rate that the slow start determined to be available for the flow.
>       >   Using Beta_cubic with a value larger than 0.5 when the congestion
>       >   event is detected in slow start would result in an overaggressive
>       >   send rate where the sender injects excess packets into the network
>       >   and each such packet is guaranteed to be dropped, or to force a
>       >   packet from a competing flow to be dropped, at a tail-drop
>       >   bottleneck router.
>       >   Furthermore, injecting such undelivered packets creates a danger of
>       >   congestion collapse (of some degree) "by delivering packets through
>       >   the network that are dropped before reaching their ultimate
>       >   destination." [RFC 2914]
>       >
>       >
>       >   [Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.
>       >
>       > Thanks,
>       >
>       > /Markku
>       >
>       > On Tue, 14 Jun 2022, Markku Kojo wrote:
>       >
>       >> Hi all,
>       >>
>       >> this thread starts the discussion on Issue 2: CUBIC is specified to use an incorrect
>       multiplicative-decrease factor for a congestion event that occurs when operating in slow start.
>       And applying HyStart++ does not remove the problem; it only mitigates it in some percentage of
>       cases.
>       >>
>       >> I think it is useful to discuss this in two phases: 2 a) and 2 b) below.
>       >> For anyone commenting/arguing on the part 2 b), it is important to first
>       >> acknowledge whether (s)he thinks the original design and logic by Van Jacobson is correct. If
>       not, one should explain why Van's design logic is incorrect.
>       >>
>       >> Issue 2 a)
>       >> ----------
>       >>
>       >> To begin with, let's put aside a potential use of HyStart++ (also assume a tail-drop router
>       unless otherwise mentioned).
>       >>
>       >> The use of an MD factor larger than 0.5 is against the theory and original design by Van
>       Jacobson as explained in the congavoid paper [Jacob88]. Any MD factor value larger than 0.5 will
>       result in sending extra packets during Fast Recovery following the congestion event (drop). All
>       extra packets will become dropped at a tail-drop bottleneck (for a lone flow).
>       >>
>       >> Note that at the time when the drop becomes signalled at the TCP sender, the size of the cwnd
>       is double the available network capacity that slow start determined for the flow. That is, using
>       MD=0.5 is already as aggressive as possible, leaving no slack. Therefore, if MD=0.7 is used, the
>       TCP sender enters fast recovery with a cwnd that is 40% larger than the determined network
>       capacity and all excess packets are guaranteed to become dropped, or even worse, the excess
>       packets are likely to force packets of any competing flows to be unfairly dropped.
>       >>
>       >> Moreover, if NewReno loss recovery is in use, a CUBIC sender will
>       >> operate overaggressively for a very long time. For example, if the
>       >> available network capacity for the flow is 100 packets, cwnd will have
>       >> value 200 when the congestion is signalled and the CUBIC sender enters
>       >> fast recovery with cwnd=140 and injects 40 excess packets on each of
>       >> the subsequent 100 RTTs it stays in fast recovery, forcing 4000 packets to be inevitably
>       and totally unnecessarily dropped.
>       >>
>       >> Even worse, this behaviour of sending 'undelivered packets' is against
>       >> the congestion control principles as it creates a danger of congestion
>       >> collapse (of some degree) "by delivering packets through the network
>       >> that are dropped before reaching their ultimate destination." [RFC 2914]
>       >>
>       >> Such undelivered packets unnecessarily eat capacity from other flows
>       >> sharing the path before the bottleneck.
>       >>
>       >> RFC 2914 emphasises:
>       >>
>       >> "This is probably the largest unresolved danger with respect to
>       >> congestion collapse in the Internet today."
>       >>
>       >> It is very easy to envision a realistic network setup where this creates a degree of congestion
>       collapse where a notable portion of useful network capacity is wasted due to the undelivered
>       packets.
>       >>
>       >>
>       >> [Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.
>       >>
>       >>
>       >> Issue 2 b)
>       >> ----------
>       >>
>       >> The CUBIC draft suggests that HyStart++ should be used *everywhere* instead of the traditional
>       Slow Start (see section 4.10).
>       >>
>       >> Although the draft does not say it, seemingly the authors suggest using HyStart++ instead of
>       traditional Slow Start in order to avoid the problem of over-aggressive behaviour discussed above.
>       This, however, has several issues.
>       >>
>       >> First, it is directly in conflict with the HyStart++ specification, which says that HyStart++
>       should be used only for the initial Slow Start. However, the overaggressive behaviour when exiting
>       slow start is also a potential problem with slow start during an RTO recovery; in case of sudden
>       congestion that reduces the available capacity for a flow down to a fraction of the current
>       capacity, it is very likely that an RTO occurs. In such a case the RTO recovery in slow start
>       inevitably overshoots and it is crucial for all flows not to be overaggressive.
>       >>
>       >> Second, the experimental results for the initial slow start in the HyStart++ draft suggest
>       that while HyStart++ achieves good results, it is unable to exit slow start early and avoid
>       overshoot in a significant percentage of cases.
>       >>
>       >> Given the above issues, the CUBIC draft must require that an MD of 0.5 is used when the
>       congestion event occurs while the sender is (still) in slow start. Not using MD=0.5 is an obvious
>       stumble in the original CUBIC, and the original CUBIC authors have already acknowledged this. It
>       seems also obvious that instead of correcting the actual problem (use of an MD other than 0.5),
>       HyStart and HyStart++ have been proposed to address the design mistake. While HyStart++ is a
>       useful method also when used with MD=0.5, when used alone it only mitigates the impact of the
>       actual problem rather than solving it.
>       >>
>       >> What should be done for the cases where HyStart++ exits slow start but
>       >> is not able to avoid (some level of) overshoot and dropped packets is IMO an open issue.
>       Resolving it requires additional experiments and it should be resolved separately when we have
>       more data. For now, while we do not have enough data and understanding of the behaviour, we
>       should IMO follow the general IETF guideline "be conservative in what you send" and specify that
>       MD = 0.5 should be used for a congestion event that occurs for a packet sent in slow start.
>       >>
>       >> Thanks,
>       >>
>       >> /Markku
>       >>
>       >
>       > _______________________________________________
>       > tcpm mailing list
>       > tcpm@ietf.org
>       > https://www.ietf.org/mailman/listinfo/tcpm
>