Re: [tcpm] Proceeding CUBIC draft - thoughts and late follow-up

At this point, we have a higher-level decision to make. Cubic controls a rather large share of Internet traffic. It may well be more aggressive than Reno in some scenarios, and it may very well be that this extra aggressiveness is one of the reasons for Cubic's adoption. So we have a dilemma: publish the doc as is, knowing that it reflects the reality of deployments; or insist on a set of changes to create a new version of Cubic that would be fully compatible with Reno.

My personal opinion is that Reno is by now obsolete, and that we just should not bother. And then, there is also the quasi-certainty that a "tamer" new version of Cubic would just be ignored by implementations.

-- Christian Huitema 

> On Jul 4, 2022, at 9:34 AM, Markku Kojo <kojo=40cs.helsinki.fi@dmarc.ietf.org> wrote:
> 
> Hi,
> 
>> On Wed, 22 Jun 2022, Lars Eggert wrote:
>> 
>> Hi,
>> 
>>> On 2022-6-21, at 3:06, Markku Kojo <kojo@cs.helsinki.fi> wrote:
>>> To my understanding we have quite a bit of QUIC traffic, for which RFC 9002 has just been published, and it follows Reno CC quite closely with some exceptions.
>> 
>> see Vidhi's message on the differences between Reno and RFC9002.
> 
> See my replies to Vidhi.
> 
>> Also, my understanding is that the most widely deployed QUIC stacks in production actually use CUBIC or BBR v1 or v2 and not RFC9002.
> 
> That makes it even more important to specify CUBIC in an RFC such that it does not have issues or that any known issues are documented and clearly pointed to being problematic/questionable so that they get attention and are likely to be corrected, instead of hiding the issues.
> 
>>> We also have some SCTP traffic that follows Reno CC very closely
>> The SCTP used for WebRTC in production (Webex, Zoom, etc.) is AFAIK not using Reno (or CUBIC, or RMCAT).
> 
> If that is the case then the algos that are used should get thoroughly evaluated and documented in the RFC series.
> 
>>> and numerous proprietary UDP-based protocols that RFC 8085 requires to follow the congestion control algos as described in RFC 2914 and RFC 5681. So, are you saying RFC 2914, RFC 8085 and RFC 9002 are just academic exercises?
>> 
>> What the IETF requires in RFCs and what sees deployment are two different things. These RFCs are meant to give implementors who may not be aware of the intricacies of CC some background and a solid foundation to implement.
> 
> To my understanding these RFCs document pretty much the consensus within the CC community on what is seen as best for the Internet, as well as advice for all implementers to follow.
> 
>>> Moreover, my answer to why we see so little Reno CC traffic is very simple: people deployed CUBIC that is more aggressive than Reno CC, so it is an inherent outcome that hardly anyone is willing to run Reno CC when others are running a more aggressive CC algo that leaves little room for competing Reno CC.
>> 
>> CUBIC might be more aggressive than Reno, but it is not problematically so. And its slight increase in aggressiveness - w/o any apparent major issues - results in better application performance, which is why it is seeing deployment.
> 
> Please see below the items w.r.t the Issue 2 and explain why it is not apparent and why injecting 40% more packets than what is the slow-start determined available capacity is "not problematically so", even though RFC 2914 states that such behaviour is
> "probably the largest unresolved danger with respect to congestion
>  collapse in the Internet today."
> 
> Such an increase in aggressiveness cannot result in better application performance but is highly unwanted and causes damage also for CUBIC-only traffic.
> 
>>> First, if the CUBIC draft is published as it currently is that would give an IETF stamp and 'official' start for "a spiral of increasingly
>>> aggressive TCP implementations" that RFC 2914 appropriately warns about.
>> 
>> RFC2914 was written at a time when the IETF had practically no participation from the engineers that implemented and shipped CC algorithms for the major stacks, and the need for proper CC was a lot less well and widely understood than it is now.
> 
> If I recall correctly, a number of people implementing TCP and CC algos for various stacks have been involved, even though almost all algo proposals have traditionally arisen from the CC research community.
> RFC 2914 provides important advice for "engineers" as well. CC is a very challenging art, particularly when one needs to take into account the wide range of environments and contention scenarios. A wider IETF/ICCRG evaluation of any CC algo is very important precisely because it is hard to get things correct at once; we need eyes from a broader set of CC experts to challenge and review any CC proposal. AFAIK all TCP CC algos submitted to the IETF have benefited from the IETF evaluation, which resulted in correction of many aspects that escaped the original design.
> 
>> We are in a much different situation now, where hyperscaler and other massively deployed services pay extremely close attention to how well their content pipeline operates, and their engineers are participating in this group and the broader IETF.
> 
> Sure, no doubt engineers pay extremely close attention to how their services work. It is inherent that one primarily keeps an eye on one's own traffic to see that it operates well, and is often busy ensuring it. There is nothing to blame them for in doing so; I would do the same if working for a company developing a stack for its business. But to ensure that the Internet is for all, it is important that nobody's traffic encounters systematic negative impact due to non-conformant traffic. The whole idea of congestion control is to avoid congestion collapse and aim at equal treatment for all traffic in the Internet, even for those who are not using CUBIC but some other TCP-compatible traffic, and even if they are less than 1% of all traffic. If some co-existing, non-conformant traffic affects their traffic, it may affect close to 100% of their important everyday traffic that is not from the major services. And the IETF must take care to provide advice that covers the traffic of all Internet users, not just the traffic of major service providers.
> 
>> There is an increasing desire to optimize CC, BBR being maybe the latest example, but at the same time there is also a huge awareness of the risks of being too aggressive, maybe more so now than at any time in the past. I don't think there is a risk of a CC spiral of death.
> 
> Zero measurements in more than a decade does not support this claim.
> 
>>> The little I had time to follow L4S discussions in tsvwg people already insisted to compare L4S performance to CUBIC instead of Reno CC.
>> 
>> Of course they would - CUBIC is what runs on the Internet. If you want to compare yourself to the current status quo, that is your baseline.
> 
> Sure, I understand this very well, but you did not get the point that the spiral has already been initiated.
> 
>>> Second, by recognizing CUBIC as a standard as it is currently written would ensure that all issues that have been raised would get ignored and forgotten forever.
>> 
>> I don't see this risk at all. One motivation for publishing a bis version of RFC 8312 was to document the bug fixes that have occurred in deployments since RFC 8312 was published. Publishing the bis will not stop us from publishing future improvements.
> 
> I'd hope this is the case. However, given that "engineers" have had more than a decade to do the necessary measurements ensuring CUBIC is safe for other traffic, not a single measurement has been published or articulated. So, there seems to be nothing factual that speaks for the beliefs you stated above. Instead, publishing CUBIC as PS without correcting or appropriately documenting the problems easily takes away any incentive for doing the necessary measurements for which we have current consensus in the IETF. If no measurements have been carried out so far, what is the incentive to run them in the future?
> 
> No doubt any corrections to CUBIC will surely be seen if the CUBIC traffic itself encounters some problems.
> 
> 
>>> As I have tried to say, I do not care too much what the status of CUBIC would be when it gets published, as long as we do not hide the obvious issues it has and we have a clear plan to ensure that all issues that have not been resolved by the time of publishing it will have a clear path and incentive to get fixed.
>> 
>> I'd like to point out that I see nobody else in the WG claiming that CUBIC has "obvious issues" or is a "flawed design". It's not perfect, but nothing ever is.
> 
> There are six separate remaining issues raised. I claim your above statement that nobody else is agreeing that there are issues is incorrect. Let's look at the Issue 2:
> 
> 1) If cwnd is reduced by less than half when the sender is in slow start,
>   it inevitably results in injecting excess packets into the network
>   with a tail-drop bottleneck router and that these packets are
>   guaranteed to be dropped and is thereby in conflict with the design
>   logic presented in Van Jacobson's original paper [Jacob88] that
>   specifies slow start and explains how MD factor must be selected for
>   a congestion event encountered in slow start. Note that using MD-factor
>   of 0.5 as explained in [Jacob88] guarantees that the bottleneck is
>   fully utilized after the cwnd reduction, so there is no gain being
>   more aggressive as it only results in injecting "undelivered packets"
>   that are known to be high risk for congestion collapse of some degree.
> 
>   AFAIK, nobody has ever claimed that Van's design logic is incorrect.
>   Instead, it is well understood among CC researchers and widely
>   accepted to be correct.
> 
>   Please explain what is wrong in Van's design logic for slow start?
> 
>   [Jacob88] V. Jacobson, Congestion avoidance and control, SIGCOMM '88.
> 
> 2) Two of the co-authors of the CUBIC draft who are also co-authors of
>   the original CUBIC design have agreed in the github discussions
>   that using MD-factor of 0.7 when in slow start is an issue.
> 
> 3) In the github discussions Neal indicated that he has seen traces
>   that show problematic behaviour with MD=0.7.
> 
> 4) In my understanding and based on my experience in the IETF work
>   for over two decades, if somebody raises an issue with a proposed
>   technology it is supposed to be taken as being the fact unless
>   somebody explains or shows why it is not. I expect people who
>   agree with the problem description stay quiet and those who
>   disagree will argue why there is no problem. Now even two
>   co-authors agree and others say nothing.
> 
>   Also a common practise has been that at least the authors respond
>   with technical arguments to any issues raised. That brings up
>   possibly two opposite views into the discussions that other people
>   in the wg may much easier contribute to and provide their view and
>   thereby help wg to come up with a working solution based on technical
>   arguments.
> 
> How do the items 1-4 above support your claim that nobody else is agreeing with the Issue 2 and that the CUBIC specification as currently written is correct for Issue 2?
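
A rough illustration of the arithmetic behind item 1 and the "40% more packets" figure quoted earlier in this message. This is only a minimal sketch under simplifying assumptions that are not from the thread: cwnd exactly doubles every RTT in slow start, the first loss is detected in the round right after the last loss-free window, and the function names are hypothetical.

```python
def cwnd_after_slow_start_loss(last_good_cwnd, beta):
    """cwnd after a multiplicative decrease by `beta` at the first slow-start loss."""
    # Assumption: slow start has doubled the window since the last loss-free round.
    cwnd_at_loss = 2 * last_good_cwnd
    return beta * cwnd_at_loss


def excess_over_last_good(beta):
    """Fraction by which the reduced cwnd exceeds the last loss-free window."""
    return cwnd_after_slow_start_loss(1.0, beta) - 1.0


if __name__ == "__main__":
    for beta in (0.5, 0.7):
        excess = excess_over_last_good(beta)
        print(f"beta={beta}: excess over last loss-free window = {excess:.0%}")
```

Under these assumptions, an MD factor of 0.5 returns the sender to the last window that was delivered without loss, while 0.7 resumes at roughly 1.4 times that window.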
> 
>> CUBIC has been running the majority of the Internet traffic for the last decade, and the Internet seems to be doing OK.
> 
> The Internet may seem to operate OK, but that does not tell us anything because you don't see the effect on the other traffic without measuring it separately.
> As explained several times and a number of people seconding, we cannot know the impact without measurements. If you think we can, please explain how.
> 
>> We'll publish additional improvements to CUBIC when they are proposed, tested and have WG consensus.
>> 
>>> IMO that can be best achieved by publishing it as Experimental and documenting all unresolved issues in the draft.
>>> That approach would involve the incentive for all proponents to do whatever is needed (measurements, algo fixes/tuning) to solve the remaining issues and get it to stds track.
>> 
>> Please propose a short paragraph of text that outlines these "unresolved issues", which we might then see if the WG has consensus for adding it to the draft?
> 
> I have done it already for Issue 2 but it was ignored with no comment. I'll propose text for each of the issues (or at least some of them).
> 
>>> But let me ask a different question: what is gained and how does the community benefit from a std that is based on flawed design that does not behave as intended?
>> 
>> So even if CUBIC was a "flawed design that does not behave as intended", it seems in practice to perform pretty well without major issues, seems to deliver QoE improvements to the applications that run above it, and is ubiquitously deployed on the Internet.
>> 
>> Not publishing it on the standards track sends a pretty strong message to the implementer community that the IETF community is completely out of touch with deployed realities. This risks us not being taken seriously.
> 
> IMO it is much bigger risk to standardize technology that is incorrect or that has not been shown to be correct or is otherwise not properly validated. Particularly, when reasonable doubt has been raised and nobody has argued against any of the issues raised.
> 
>>> Congestion control specifications are considered as having significant operational impact on the Internet similar to security mechanisms. Would you in IESG support publication of a security mechanism that is shown to not operate as intended?
>> 
>> Why do you believe CUBIC does not "operate as intended"?
> 
> Issue 1:
> 
> Because the design goal of the CUBIC CC algo is correct and therefore the operation has been divided into two regions.
> 
> In the TCP-friendly region, CUBIC has been designed to have equal performance (aggressiveness) as Reno CC because in this region Reno CC has no problems in fully utilizing the available bandwidth. That is exactly the correct goal per RFC 2914 and RFC 5033. Unfortunately, it was left unnoticed that the AIMD-model borrowed from the manuscript paper [FHP00] was unvalidated and turned out to be incorrect (due to incorrect assumptions). So the outcome is that the operation within this region is not as intended and the measurements that we have reveal this problem.
> 
> The region that uses the CUBIC increase function is intended to be used when Reno CC is not able to utilize the available network capacity, i.e., in high BDP environments. This is the major contribution of CUBIC for better performance, and the operation of CUBIC in this region is pretty fine (there are some smaller doubts for some operating conditions but they are not subject to discussion here) and not subject to the remaining issues.
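
For readers who want to see the two regions side by side, here is a minimal sketch of how the choice between them can be expressed, using the window formulas and constants (C = 0.4, beta_cubic = 0.7) as given in RFC 8312. The function names are hypothetical, windows are in segments, times are in seconds, and this is not taken from any particular stack or from the bis draft.

```python
C = 0.4            # CUBIC scaling constant (RFC 8312)
BETA_CUBIC = 0.7   # CUBIC multiplicative decrease factor (RFC 8312)


def cubic_k(w_max):
    """Time for the cubic function to grow back to w_max after a reduction."""
    return (w_max * (1 - BETA_CUBIC) / C) ** (1 / 3)


def w_cubic(t, w_max):
    """Cubic window at t seconds since the last congestion event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


def w_est(t, w_max, rtt):
    """Estimated window of an AIMD flow tuned to match Reno's average rate."""
    alpha = 3 * (1 - BETA_CUBIC) / (1 + BETA_CUBIC)
    return w_max * BETA_CUBIC + alpha * (t / rtt)


def region(t, w_max, rtt):
    """Name the region CUBIC would operate in at time t."""
    if w_cubic(t, w_max) < w_est(t, w_max, rtt):
        return "tcp-friendly"
    return "cubic"


if __name__ == "__main__":
    # Small window, short RTT versus large window, long RTT.
    print(region(0.5, w_max=10, rtt=0.01))
    print(region(20.0, w_max=10000, rtt=0.1))
```

The AIMD estimate `w_est` is the part Issue 1 is about: the sketch only shows where that model enters the algorithm and makes no claim about whether it matches Reno in practice.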
> 
> Issue 2:
> 
> That using an MD-factor of 0.7 in slow start is a wrong decision was left unnoticed in the original CUBIC design, and two of the co-authors have admitted this.
> 
> There are other issues as well and I have tried to explain each of the issues very carefully. Please reply separately for each issue with technical arguments telling what is wrong in my problem description or why no problem exists. I'm happy to clarify if I have not been clear enough for some of them.
> 
>> What matters is whether a security or congestion control mechanism is fit for purpose and without major failure cases. I believe that is the case for CUBIC.
>> 
>>> Could we now finally focus on solving each of the remaining issues and discussing the way forward separately with each of them? Issue 3 a) has pretty much been solved already (thanks Neal), some text tweaking may still be needed.
>> 
>> As editors of a WG document, we'll incorporate changes as they gain WG consensus. There was a proposal (and support) to address one of your suggestions, and we merged Neal's PR. If and when that happens for other suggestions, we'll follow suit.
> 
> Please reply with technical arguments for each of the remaining issues so that we are able to have WG opinion and consensus for each of them separately. The issues are different and each requires a different resolution.
> 
> Thanks,
> 
> /Markku
> 
>> Thanks,
>> Lars
> 