Re: [ippm] [tsvwg] [iccrg] New Internet Draft: Congestion Signaling (CSIG)

Sebastian Moeller <moeller0@gmx.de> Thu, 22 February 2024 08:36 UTC

From: Sebastian Moeller <moeller0@gmx.de>
In-Reply-To: <743365DB-5B00-4494-970E-95B6C347A85F@broadcom.com>
Date: Thu, 22 Feb 2024 09:35:55 +0100
Cc: Tom Herbert <tom@herbertland.com>, Christian Huitema <huitema@huitema.net>, tsvwg <tsvwg@ietf.org>, IETF IPPM WG <ippm@ietf.org>, Nandita Dukkipati <nanditad@google.com>, Naoshad Mehta <naoshad@google.com>, Abhiram Ravi <abhiramr@google.com>
Content-Transfer-Encoding: quoted-printable
Message-Id: <58467649-ADA1-4556-8B70-C4393D393E0F@gmx.de>
References: <DF64A214-06A0-4313-BEF5-B54566823DAC@gmx.de> <743365DB-5B00-4494-970E-95B6C347A85F@broadcom.com>
To: Jai Kumar <jai.kumar@broadcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/ULnZdD1OpKfZ94Zuc-oN0PsYJYI>
Subject: Re: [ippm] [tsvwg] [iccrg] New Internet Draft: Congestion Signaling (CSIG)

Hi Jai,


> On 21. Feb 2024, at 17:06, Jai Kumar <jai.kumar@broadcom.com> wrote:
> 
> Hi Sebastian,
> 
> We will create a summary record of this discussion and post it.
> 
> I do want to answer your questions but will keep it short here. 
> 1. We want single consistent, compact lightweight encap that works for v4, v6 and newer evolving formats.

[SM] Which other L3 formats do you see evolving?

> 2. However much we want industry to migrate to v6, v4 usage is pervasive and is here to stay. We must have a consistent solution for v4.

[SM] There is a proposal for extension headers for IPv4 that would solve that issue. After all, the environments you mention, like DCs, are well controlled, so requiring special IP or TCP options is not a show stopper... (as seen with CSIG itself, which requires active buy-in by network nodes and end points).

> 3. Parsing and processing latency impact on  switch latency is extremely important for inference.

[SM] That might be, but it is not an actionable argument. How much delay budget is acceptable? That is the question you need to answer.

> 4. Minimizing Protocol overhead is extremely important.

[SM] Please prioritise your requirements, if everything is extremely important, essentially nothing is.

> Training workload is bandwidth intensive and requires expensive accelerators and network bandwidth.

[SM] But also high fidelity congestion/load information, so this clearly is worth some investment.

> There is IEEE work going on reducing the link layer overhead for such applications as well.

[SM] How much reduction is planned, and how much gain is this going to give over jumbo frames?
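
To put rough numbers on the jumbo-frame comparison, here is a minimal Python sketch. It assumes standard Ethernet framing costs (8 bytes preamble/SFD, 14 bytes header, 4 bytes FCS, 12 bytes inter-frame gap); the 8-byte signal size and the 1500/9000 byte payload sizes are illustrative assumptions, not values taken from the draft or from the IEEE work.

```python
# Fixed per-frame Ethernet cost on the wire, in bytes:
# preamble + SFD (8) + MAC header (14) + FCS (4) + inter-frame gap (12).
ETHERNET_WIRE_OVERHEAD = 8 + 14 + 4 + 12


def wire_efficiency(payload: int, extra: int = 0) -> float:
    """Fraction of wire time carrying payload when `extra` bytes are added per frame."""
    return payload / (payload + ETHERNET_WIRE_OVERHEAD + extra)


def signal_cost(payload: int, extra: int) -> float:
    """Efficiency lost by adding `extra` bytes of congestion signal to each frame."""
    return wire_efficiency(payload) - wire_efficiency(payload, extra)


if __name__ == "__main__":
    for payload in (1500, 9000):  # standard vs. jumbo payload (assumed sizes)
        print(payload,
              f"efficiency={wire_efficiency(payload):.4f}",
              f"cost_of_8_bytes={signal_cost(payload, 8):.5f}")
```

Under these assumptions, going from 1500 to 9000 byte payloads gains about 2 percentage points of wire efficiency, while an 8-byte per-packet signal costs about 0.5 percentage points at 1500 bytes and less at 9000.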

> 5. Traffic engineered domains have tunnels (will see if this information can be shared by operators) and tunnels in tunnels.

[SM] Yes, but tunnelling comes with consequences...

> Though copying of data may seem innocuous, it is different from tunnel encap information that comes from a single rewrite table. It's expensive in hw to copy data from the incoming packet, shove it in payload or tunnel headers, and trigger various checksum updates.

[SM] Well, you can always include a field in your L2 headers giving the exact offset to the L3 record, so even in tunnelled situations you can write to the single record if you consider moving data during en- and decapsulation prohibitively expensive...
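
A minimal Python sketch of that idea, under loudly stated assumptions: the two-byte big-endian offset field, its position in the frame, and the 16-bit record are all hypothetical and only illustrate "follow an offset, then rewrite one record in place". Nothing here is taken from the CSIG draft.

```python
import struct


def update_record(frame: bytearray, offset_field_at: int, new_value: int) -> None:
    """Rewrite a 16-bit record located via a (hypothetical) offset field.

    frame[offset_field_at : offset_field_at + 2] holds a big-endian offset,
    counted from the start of the frame, to the single L3 record.
    """
    (record_at,) = struct.unpack_from("!H", frame, offset_field_at)
    if record_at + 2 > len(frame):
        raise ValueError("record offset points past the end of the frame")
    # One in-place write; no bytes are moved on encapsulation or decapsulation.
    struct.pack_into("!H", frame, record_at, new_value)


if __name__ == "__main__":
    frame = bytearray(64)                  # dummy frame, all zeros
    struct.pack_into("!H", frame, 14, 50)  # hypothetical offset field at byte 14 -> record at byte 50
    update_record(frame, 14, 0x1234)
    print(frame[50:52].hex())
```

The bounds check stands in for the case where the record lies beyond what the pipeline can reach; a real device would have to define what happens then.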

> 6. There is a viable path for internet deployment and this is captured in the draft.

[SM] Could you please summarize this, with a focus on how this is supposed to traverse internet routers, which IMHO mostly deal with IP payloads?

> It is better than asking the switch pipelines to support hop by hop options in data path that were designed for slow path.

[SM] Sorry, what counts as the fast path and what as the slow path is an engineering decision, not a natural law... and hence a function of economics.

> We will capture this as well.
> 
> I (on behalf of other authors as well) again Thank You all for a lively discussion and feedback. 
> 
> 
> Best,
> -Jai
> 
>> On Feb 20, 2024, at 11:26 PM, Sebastian Moeller <moeller0@gmx.de> wrote:
>> 
>> Hi Jai,
>> 
>> 
>>> On 21. Feb 2024, at 06:33, Jai Kumar <jai.kumar@broadcom.com> wrote:
>>> 
>>> Hi Tom,
>>> 
>>> Tunneling is pretty common in DC and DCI.
>> 
>> [SM] Well, then collect congestion information in L2 in non-IP tunnelled section and copy the information into the L3 congestion record on decapsulation (and from the L3 record to the L2 on encapsulation), this is hardly rocket science but simply the cost of tunnelling... or reconsider the decisions that lead to massive tunnelling...
>> 
>> However I would guess that in each DC environment there will likely be only a small set of different tunnels in play, so one might as well poke directly through to the L3 record...
>> 
>> 
>>> 
>>> Your proposal doesn’t work as it forces to create yet another encap for IPV4
>> 
>> [SM] While your proposal requires a new L2 ethertype record... Both require changes; the question is where. And respectfully, nobody forces DCs to use IPv4. I would assume that this is a simple trade-off between IPv4 and IPv6; it could well be that better congestion information available in IPv6 tilts the balance towards IPv6 here... (my understanding is that the IETF desires IPv6 to become more and more prevalent)...
>> 
>>> which I mentioned is the predominant traffic type and for so many other reasons iterated in this thread.
>> 
>> [SM] Well, please lay out how your L2 approach will ever be usable over the internet end to end. If the argument is 'it will not', I would argue that this draft is not in the IETF's best interest. Nothing against starting this in the DC environment, but IMHO there needs to be a viable path for internet-wide deployment.
>> 
>> Regards
>>   Sebastian
>> 
>>> 
>>> I will let this thread rest here from my side and let others contribute as needed.
>>> 
>>> Thank you for your inputs.
>>> 
>>> Best,
>>> -Jai
>>> 
>>>>> On Feb 20, 2024, at 8:52 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> 
>>>> On Tue, Feb 20, 2024, 8:06 PM Jai Kumar <jai.kumar@broadcom.com> wrote:
>>>>> 
>>>>> Hi Tom,
>>>>> 
>>>>> There are multiple problems and I think best is if we have a side meeting if you are interested in learning about them.
>>>>> 
>>>>> Here is a one problem definition during tunneling as observed in one of the actual deployment.
>>>>> 
>>>>> 1. Assuming IPv6 deployment. Say a packet enters a tunneled domain. Tunnel encap is also the Init node for the IOAM domain, so metadata is inserted in the Hop by Hop option.
>>>>> 2. As the packet traverses the transit nodes, the option size keeps on growing.
>>>>> 
>>>>> 3. Tunnel decap includes stripping of the tunnel header and any associated hop by hop options metadata, parsing and doing a lookup on the inner payload to forward the packet. Note modern chips do a single lookup involving tunnel and payload to resolve the FEC.
>>>>> 4. If the size of the hop by hop options grows such that the inner payload is shifted to a 2nd cell boundary, i.e. where the switch pipeline does not see the bytes of the packet, there is no way for the pipeline to resolve the forwarding.
>>>>> 
>>>>> This problem is from the forwarding point of view where essentially forwarding is broken when packet exits the tunnel domain.
>>>>> 
>>>>> There are similar problems from the pipeline editor point of view even when the size doesn’t grow, i.e. in what you describe as just copying a few bytes to the tunnel header and vice versa.
>>>> 
>>>> Jai,
>>>> 
>>>> Actually, the size of a Hop-by-Hop Options header is constant once
>>>> it's created by the host. The problem you're citing can happen anytime
>>>> the size of the header chain that a router wants to process exceeds
>>>> the size of its parsing buffer. This can happen with options (at
>>>> various levels) in the packet, big routing headers, or it could happen
>>>> because of too many encapsulations. Unfortunately, there was never any
>>>> guidance on how many bytes of header a router is expected to process;
>>>> so this is why draft-ietf-6man-eh-limits sets a requirement that
>>>> routers should be able to process at least 128 bytes of protocol
>>>> headers (Ethernet through the first eight bytes of the transport layer
>>>> header for ECMP).
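
As a rough illustration of the parsing-buffer point above, this Python sketch sums a header chain and compares it against the 128-byte figure quoted from draft-ietf-6man-eh-limits. The per-header sizes are the standard fixed sizes; the two example chains (a plain IPv6 packet with an 8-byte Hop-by-Hop option, and the same packet inside an IPv6 VXLAN tunnel) are assumptions chosen for illustration.

```python
# Standard fixed header sizes in bytes.
ETH, IPV6, HBH_8, UDP, VXLAN = 14, 40, 8, 8, 8
TRANSPORT_PREFIX = 8  # first eight bytes of the transport header, as used for ECMP


def chain_length(headers: list[int]) -> int:
    """Bytes from the start of Ethernet through the transport prefix."""
    return sum(headers) + TRANSPORT_PREFIX


def within_limit(headers: list[int], limit: int = 128) -> bool:
    """True if the chain fits in the minimum depth a router should be able to parse."""
    return chain_length(headers) <= limit


if __name__ == "__main__":
    plain = [ETH, IPV6, HBH_8]
    tunneled = [ETH, IPV6, HBH_8, UDP, VXLAN] + plain  # assumed outer stack + inner packet
    for name, chain in (("plain", plain), ("tunneled", tunneled)):
        print(name, chain_length(chain), within_limit(chain))
```

Exceeding the limit does not mean a given router fails, only that processing at that depth is no longer something the quoted requirement guarantees.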
>>>> 
>>>>> 
>>>>> 
>>>>> As I mentioned I will be more than happy to share these deployment experiences if you are interested.
>>>> 
>>>> You still seem to be focused on problems related to tunneling which is
>>>> not the common case in DC. What about non-tunneling cases? What is
>>>> issue with IOAM when packets are not being tunneled? (again, note in
>>>> that case the HBH options does not grow). If you compare the 62 byte
>>>> format I proposed for CSIG to L2 format in the draft, do you see any
>>>> issues in hardware implementation that would make the L2 solution
>>>> substantially faster?
>>>> 
>>>> Tom
>>>> 
>>>>> 
>>>>> Best,
>>>>> -Jai
>>>>> 
>>>>> 
>>>>>>> On Feb 20, 2024, at 5:39 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>>> 
>>>>>> On Tue, Feb 20, 2024 at 3:59 PM Jai Kumar <jai.kumar@broadcom.com> wrote:
>>>>>>> 
>>>>>>> Tom,
>>>>>>> 
>>>>>>> AI/ML and HPC clusters are evolving for multi-tenancy and will take some time but there are other use cases for DC traffic engineering that are captured in the draft. In such cases packets may undergo an unspecified number of tunneled domains and as I mentioned IOAM doesn't work well when it traverses the hierarchy of tunnels besides you have got a proposal only for IPv6.
>>>>>> 
>>>>>> Hi Jai,
>>>>>> 
>>>>>> Sorry, I don't see why tunneling is an issue. If I want to tunnel a
>>>>>> packet containing IOAM across the Internet, it's really simple-- just
>>>>>> use some variant of IPIP tunneling. But if I want to tunnel CSIG
>>>>>> across the Internet, it seems that I need to tunnel layer 2 (that is
>>>>>> we need a fourteen byte Ethernet header in something like VXLAN)--
>>>>>> it's not as simple as IPIP tunneling. If this idea is that we need to
>>>>>> expose the CSIG information in the outer headers, then I still don't
>>>>>> see much issue. Just push the outer header stack and copy fields for
>>>>>> CSIG (works the same regardless of whether the CSIG information is in L2 or L3).
>>>>>> 
>>>>>> As for just my proposal only being for IPv6, I've already pointed out
>>>>>> draft-herbert-ipv4-eh and there are a number of encapsulation methods
>>>>>> in RFC9378 for IOAM that could be adapted to carry CSIG. Also, just to
>>>>>> be fair :-), you have got a proposal only for TCP and PonyExpress over
>>>>>> Ethernet and assumes that switches are configured properly and
>>>>>> supports what appears to be a non-standard use of VLAN tags.
>>>>>> 
>>>>>>> 
>>>>>>> I have nothing against IOAM if there is adoption and customer demand. There are multitudes of IETF RFC drafts that are just drafts. All I am saying is that I have spent time and energy supporting IOAM in HW, and no customer is asking for it. You may not like it but that's the fact. Do some digging around and find out the number of switches supporting it.
>>>>>> 
>>>>>> I am less interested in marketing numbers and more interested in the
>>>>>> technical details as to why IOAM is not implementable in hardware and
>>>>>> whether those issues are inherent to any L3 protocol or just IOAM.
>>>>>> I've already demonstrated a format that avoids the problem of "they
>>>>>> all commonly stack up multiple per-switch telemetry data per-hop in
>>>>>> the path of a packet" and as I mentioned I don't see the issue with
>>>>>> tunneling. I am still interested if there is some other problem...
>>>>>> 
>>>>>> Tom
>>>>>>> 
>>>>>>> Best,
>>>>>>> -Jai
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Feb 20, 2024 at 3:24 PM Tom Herbert <tom@herbertland.com> wrote:
>>>>>>>> 
>>>>>>>> On Tue, Feb 20, 2024 at 2:23 PM Jai Kumar <jai.kumar@broadcom.com> wrote:
>>>>>>>>> 
>>>>>>>>> Tom,
>>>>>>>>> 
>>>>>>>>> Please see inline ...
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 20, 2024 at 1:54 PM Tom Herbert <tom@herbertland.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 20, 2024 at 1:02 PM Jai Kumar <jai.kumar@broadcom.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Thank you for a lively discussion.
>>>>>>>>>>> 
>>>>>>>>>>> I just want to add a few more points in favor of using CSIG in L2.
>>>>>>>>>>> 
>>>>>>>>>>> As we are trending towards high performance ethernet fabric for HPC and AI/ML clusters, there are innovations happening in both link layer and layer 3 headers. Amongst many some of the key requirements are
>>>>>>>>>>> - minimal and constant overhead (in CSIG it is a fixed 16-bit TPID, whereas for IOAM with IPv4 packets, if I recall, it is a GRE header encapsulation, and for v6 it is a Hop by Hop option)
>>>>>>>>>> 
>>>>>>>>>> Please see the solution I proposed (again not IOAM). This could be
>>>>>>>>>> done effectively as a fixed size header consisting of 14 byte Ethernet
>>>>>>>>>> header, 40 byte IPv6 header, and 8 byte HBH Options containing a CSIG
>>>>>>>>>> option. As I mentioned, hardware can identify these headers from
>>>>>>>>>> fields at fixed offsets, and otherwise assume that all the fields that
>>>>>>>>>> need to be processed are at fixed offsets in the packet.
>>>>>>>>>> 
>>>>>>>>> [Jai] Proposal is good only for IPv6. What about IPv4 and/or evolving compressed IP headers?
>>>>>>>>> 
>>>>>>>>>>> - independence from layer 3 transport. So that it can work in legacy IP networks or in optimized networks where there MAY be some form of compressed headers present.
>>>>>>>>>> 
>>>>>>>>>> By legacy, I assume you mean IPv4? There are some approaches that
>>>>>>>>>> could be applied (draft-herbert-ipv4-eh or maybe a lightweight
>>>>>>>>>> encapsulation protocol).
>>>>>>>>> 
>>>>>>>>> [Jai] We have a proposal. Why not look into what we have proposed? It works for any kind of Layer 3; IPv4, IPv6 and/or new compressed IP headers and even for MPLS. I do not see any good reason to invent yet another lightweight encap when we have CSIG as an L2 header. Also note that today's DC traffic is almost 80% IPv4. Having just a solution for IPv6 itself is a non-starter.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Also, while independence for L3 transport might be a goal, as I
>>>>>>>>>> already pointed out there are at most only two L3 protocols we need to
>>>>>>>>>> worry about: IPv6 and IPv4. If the information is L2 then we have to
>>>>>>>>>> map that into the different L2 technologies-- so IMO, avoiding
>>>>>>>>>> dependence on L2 would be a better goal to constrain complexity of the
>>>>>>>>>> solution. This is also true for the upper layer >=L4. Right now the
>>>>>>>>>> draft is dependent on L4 or higher to reflect the signal. If we put
>>>>>>>>>> this information in L3, then we again only have to worry how to do
>>>>>>>>>> this for the two IP protocols.
>>>>>>>>>> 
>>>>>>>>>>> - low latency hw implementation. AI/ML clusters are latency sensitive for inference and demand <.20us switch latency.
>>>>>>>>>> 
>>>>>>>>>> Please look at the format I proposed. I don't offhand see anything
>>>>>>>>>> that would make processing any slower than putting the information in
>>>>>>>>>> L2. Again, everything needed could be at fixed offsets and mapping the
>>>>>>>>>> packet to the format for processing is a matter of comparing a few
>>>>>>>>>> fields in the parsing buffer (and most of those fields are probably
>>>>>>>>>> already checked for normal forwarding anyway).
>>>>>>>>>> 
>>>>>>>>> [Jai] We can have this discussion offline, but putting CSIG information in the L3 header is very complex for HW updates. Also, when a CSIG packet enters or exits a tunneled domain, switches do not have the ability to parse infinitely deep into the packet and typically operate at a cell boundary of 196 Bytes to 256 Bytes. You either have to bubble up the CSIG header from L3 to the outer tunnel header for encap or copy it down to the payload header when doing tunnel decap. This is a very expensive operation to do in HW and MAY not be possible if the CSIG tag gets pushed beyond a single cell boundary.
>>>>>>>> 
>>>>>>>> But earlier you said that "Though we talk about vxlan tunnels in AI/ML
>>>>>>>> or HPC clusters but that is still a couple of years away"? Do you see
>>>>>>>> any issues in hardware for forwarding the packet format I proposed
>>>>>>>> (just plain switching)?
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> - decoupling of congestion signaling from transport. One such example is the Ultra Ethernet Transport proposed in UEC. Reflection for source based congestion control is in purview of transport as and when needed.
>>>>>>>>>> 
>>>>>>>>>> But reflecting the information doesn't decouple congestion signaling
>>>>>>>>>> from transport. This is the reason why the reflected signal should be
>>>>>>>>>> in L3.
>>>>>>>>>> 
>>>>>>>>> [Jai] It does. There MAY not be a need for reflection and intersection with transport if the congestion control algorithm is receiver based.
>>>>>>>>> 
>>>>>>>>>>> - fixed offset and constant header variance for v4 or v6 frames for both low latency and future proofing for any innovation happening in these layers. Note that these clusters are not L2 bridged domains and are mostly L3 routed domains.
>>>>>>>>>> 
>>>>>>>>>> This is where I don't understand how an L2 solution can work
>>>>>>>>>> end-to-end. If clusters are L3 routed domains then how does
>>>>>>>>>> information in an L2 header get propagated through an L3 switch?
>>>>>>>>>> 
>>>>>>>>> [Jai] Now you are trying to understand the concept. I am glad. CSIG tag is an opaque tag that is not used in forwarding but is transparently forwarded. Think of it as a vlan tag swap with itself (if there are no updates to CSIG signals) or swap to a new vlan tag (if there is an updated signal). Bottom line is that this tag is not used for forwarding.
>>>>>>>>> Just to clarify, the concept of opaque tags is widely used in DC services, and most of the switch vendors support it as either a 4 Byte or an 8 Byte opaque tag.
>>>>>>>> 
>>>>>>>> Is there a standard on this?
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> Though we talk about vxlan tunnels in AI/ML or HPC clusters but that is still a couple of years away. So the argument of variable layer 2 headers is a weak argument.
>>>>>>>>>> 
>>>>>>>>>> But in that argument you're assuming the use of the protocol is just
>>>>>>>>>> in high performance AI/ML or HPC clusters. We get much better utility
>>>>>>>>>> out of developing protocols with a variety of use cases. For instance,
>>>>>>>>>> I think CSIG will be very high value in 6G networks which will have
>>>>>>>>>> different characteristics and probably heterogeneous L2 in the path.
>>>>>>>>> 
>>>>>>>>> [Jai] We need to start from somewhere.
>>>>>>>> 
>>>>>>>> You seem to be assuming that no one has ever conceived of the idea of
>>>>>>>> signaling congestion with more bits. That's incorrect. There have been
>>>>>>>> a number of proposals in IETF for network to host signaling even
>>>>>>>> beyond IOAM.
>>>>>>>> 
>>>>>>>>> And with the explosion in AI/ML use cases, this is the burning topic that we are trying to solve.
>>>>>>>> 
>>>>>>>> To be more specific, this is the burning topic that _you_ are trying
>>>>>>>> to solve; that's not the same thing as saying this is the burning
>>>>>>>> topic that all of IETF is trying to solve. Clearly, this solution is
>>>>>>>> targeted towards high end data centers, and doesn't  address all the
>>>>>>>> interesting use cases of AI/ML (especially if it's bound to L2).
>>>>>>>> 
>>>>>>>>> Why solve something which no one cares about or there is no market or money for it (like IOAM, sorry couldn't stop taking a jab).
>>>>>>>> 
>>>>>>>> I have no idea why you'd think addressing congestion in the network is
>>>>>>>> something no one cares about. As for IOAM, while you may not like it,
>>>>>>>> I will point out that it is a published RFC with several Standards Track
>>>>>>>> documents, whereas CSIG is currently just an I-D with Experimental
>>>>>>>> status. If you want to get any traction with CSIG in IETF, I think
>>>>>>>> you'll want a much stronger rationalization why existing IETF
>>>>>>>> protocols cannot be adapted for the purposes of CSIG.
>>>>>>>> 
>>>>>>>> Tom
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> -Jai
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I think the draft makes it clear that applicability of the proposal is DC focussed.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Sure, a lot of great protocols start with a narrow use case :-)
>>>>>>>>>> 
>>>>>>>>>>> Tom,
>>>>>>>>>>> Also, there was an earlier discussion and presentation about the HW complexity and merits/demerits of encapsulating a v4 packet in an IOAM GRE header and more. It should be present in the archives.
>>>>>>>>>> 
>>>>>>>>>> Still haven't found it. Maybe I'm looking in the wrong archives?
>>>>>>>>>> 
>>>>>>>>>> Tom
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> -Jai
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 20, 2024 at 12:35 PM Sebastian Moeller <moeller0@gmx.de> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi Tom,
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 20. Feb 2024, at 21:26, Tom Herbert <tom@herbertland.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 20, 2024 at 12:09 PM Christian Huitema <huitema@huitema.net> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 2/20/2024 9:55 AM, Sebastian Moeller wrote:
>>>>>>>>>>>>>>>> That's more of a statement of security and not feasibility. There's simply no security in the Internet, so we cannot trust or validate that anonymous intermediate nodes are going to write correct information. Any plain text in a packet on the Internet is subject to inspection and modification if the data isn't authenticated, and in the worst case this could be a DoS vector by writing bad information.
>>>>>>>>>>>>>>> [SM3] Indeed, but e.g. for TCP you would need to know a lot about the most recent packet to be able to play games, no? So either you are on path and already can drop/duplicate packets at will or you are off path but still need a recent enough veridical packet to be able to cause mischief, no? (I might be insufficiently creative in attack vectors)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am analyzing congestion control information using the framework of
>>>>>>>>>>>>>> "honest signals". In human communication, "honest signals" are those
>>>>>>>>>>>>>> that cannot be easily faked by the communicator. For example, smiling is
>>>>>>>>>>>>>> not really an honest signal, because it is easy to fake; blushing, on the
>>>>>>>>>>>>>> other hand, is hard to fake.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> When it comes to Internet wide congestion control, we have pretty much
>>>>>>>>>>>>>> the same issue. Networks may want to fool the application for a variety
>>>>>>>>>>>>>> of reasons, and may start faking congestion signals. Some of these
>>>>>>>>>>>>>> signals are hard to fake. End to end data rate for example: slowing a
>>>>>>>>>>>>>> specific stream of packets is hard to fake; measuring the end to end
>>>>>>>>>>>>>> data rate is a pretty good indication of the state of the network. End
>>>>>>>>>>>>>> to end RTT is also a rather honest signal: yes, routers could put some
>>>>>>>>>>>>>> specific packets in a slow queue, but that requires resources.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Packet losses almost belong in that category. They are not hard to fake,
>>>>>>>>>>>>>> routers could play favorites and selectively drop packets with a certain
>>>>>>>>>>>>>> profile. But dropping too many packets affects the "quality rating" of a
>>>>>>>>>>>>>> provider, so there is some pressure to not fake it. That pressure is
>>>>>>>>>>>>>> probably one of the reasons behind bufferbloat. The main problem with
>>>>>>>>>>>>>> packet loss as a signal is that losses may have other causes than
>>>>>>>>>>>>>> congestion.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ECN is not really an honest signal. Setting a bit in a packet header does
>>>>>>>>>>>>>> not require a lot of effort, so routers could do that to play
>>>>>>>>>>>>>> favorites. In fact, past bugs in some networks caused almost all packets
>>>>>>>>>>>>>> to be marked as CE. Using ECN is very nice when you can trust it, but
>>>>>>>>>>>>>> end nodes should probably do that cautiously, detecting for example a
>>>>>>>>>>>>>> sudden rise in ECN marks rather than reacting to an average value.
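
One way to read "detect a sudden rise rather than react to an average" is to compare the CE fraction over a short recent window against a slowly moving baseline. The Python sketch below does that; the class name, window size, smoothing factor, ratio and floor are all assumptions picked for illustration, not parameters from any RFC or implementation.

```python
from collections import deque


class CeRiseDetector:
    """Flags a sudden rise in CE marks relative to a slow long-run baseline."""

    def __init__(self, window: int = 100, alpha: float = 0.001,
                 factor: float = 3.0, floor: float = 0.05) -> None:
        self.recent = deque(maxlen=window)  # 1 for CE-marked packets, 0 otherwise
        self.baseline = 0.0                 # slow EWMA of the marking fraction
        self.alpha = alpha
        self.factor = factor
        self.floor = floor

    def observe(self, ce_marked: bool) -> bool:
        """Record one packet; return True if marking has risen sharply."""
        sample = 1.0 if ce_marked else 0.0
        self.recent.append(sample)
        recent_fraction = sum(self.recent) / len(self.recent)
        window_full = len(self.recent) == self.recent.maxlen
        # Compare against the baseline *before* folding in the new sample.
        rising = window_full and recent_fraction > max(
            self.floor, self.factor * self.baseline)
        self.baseline += self.alpha * (sample - self.baseline)
        return rising


if __name__ == "__main__":
    detector = CeRiseDetector()
    quiet = [detector.observe(False) for _ in range(1000)]
    burst = [detector.observe(True) for _ in range(100)]
    print(any(quiet), any(burst))
```

A path with steady, persistent marking would trigger this detector during warm-up, before the baseline has converged; a real sender would need to handle that start-up phase separately.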
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ECN is just one bit. There is always a temptation to do a better ECN
>>>>>>>>>>>>>> with many more bits. For example, CE directs a sender to slow down. It
>>>>>>>>>>>>>> would be nice to have a corresponding "All clear" signal telling the
>>>>>>>>>>>>>> senders that they can speed up. L4S attempts to do that by modulating
>>>>>>>>>>>>>> the CE bit, so that a low frequency kinda indicates "all clear", while a
>>>>>>>>>>>>>> high frequency says "slow down", and give some indication of how much.
>>>>>>>>>>>>>> Suddenly, one bit becomes several bits, just spread over many packets.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The idea of adding more bits in packet headers is not exactly new -- see
>>>>>>>>>>>>>> for example Quick-Start for TCP and IP by Sally Floyd et al., RFC 4782, January
>>>>>>>>>>>>>> 2007. The problem is that the more bits you add, the more you exacerbate
>>>>>>>>>>>>>> issues of trust, and also risks of bugs. "Many more bits" may work in a
>>>>>>>>>>>>>> controlled environment, but I really do not see that working on the
>>>>>>>>>>>>>> whole Internet.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Christian,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Do you know what the state of ECN deployment over the Internet is?
>>>>>>>>>>>> 
>>>>>>>>>>>> [SM5] Nobody knows exactly, but quite a lot of Linux servers use the Linux defaults and will use ECN if the client negotiates it. In my qdisc statistics I routinely see not only drops, but also CE marks logged (from my AQM).... so believe it or not, ECN over the internet mostly works...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> It seems to me that if someone is sending to an arbitrary host over the
>>>>>>>>>>>>> Internet they're already pretty much accepting "best effort" service:
>>>>>>>>>>>> 
>>>>>>>>>>>> [SM5] Not sure about the US, but that is all ISPs offer to end users over here... but ECN works well even over a best effort internet access link in my personal experience.
>>>>>>>>>>>> 
>>>>>>>>>>>>> long latencies and with potentially high variance, such that getting
>>>>>>>>>>>>> fine-grained congestion information from intermediate routers, even
>>>>>>>>>>>>> if it's honest, probably doesn't add much to the information that we
>>>>>>>>>>>>> can derive from packet loss or measuring RTT with no additional
>>>>>>>>>>>>> mechanisms or implementation.
>>>>>>>>>>>> 
>>>>>>>>>>>> [SM5] How can you come to that conclusion without ever trying?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> The situation is very different in a limited domain which could
>>>>>>>>>>>>> include large service provider networks.
>>>>>>>>>>>> 
>>>>>>>>>>>> [SM5] Indeed, papers discussing 'better congestion signalling' often come out of those environments. But IMHO that is not because these methods only help in those environments, but because this is where people are willing to spend the money to test and implement potential solutions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sebastian
>>>>>>>>>>>> 
>>>>>>>>>>>>> In that case more information
>>>>>>>>>>>>> is good, it's easier to provide security so we can trust the
>>>>>>>>>>>>> information, and we're not restricted to just one or two bits of
>>>>>>>>>>>>> information to carry the information in a packet. This is also where I
>>>>>>>>>>>>> see host-to-network signaling being useful-- this allows applications
>>>>>>>>>>>>> to request QoS for their packets
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Tom
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- Christian Huitema
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> This electronic communication and the information and any files transmitted with it, or attached to it, are confidential and are intended solely for the use of the individual or entity to whom it is addressed and may contain information that is confidential, legally privileged, protected by privacy laws, or otherwise restricted from disclosure to anyone else. If you are not the intended recipient or the person responsible for delivering the e-mail to the intended recipient, you are hereby notified that any use, copying, distributing, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. If you received this e-mail in error, please return the e-mail to the sender, delete it from your computer, and destroy any printed copy of it.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
> 