Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

Toerless Eckert <tte@cs.fau.de> Tue, 25 October 2022 13:30 UTC

Date: Tue, 25 Oct 2022 15:30:03 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Dino Farinacci <farinacci@gmail.com>
Cc: Yisong Liu <liuyisong@chinamobile.com>, msr6@ietf.org, pim@ietf.org, BIER WG <bier@ietf.org>, mboned@ietf.org, Stig Venaas <stig@venaas.com>, hooman.bidgoli@nokia.com
Message-ID: <Y1fk24n/Fc229HCb@faui48e.informatik.uni-erlangen.de>
References: <011701d8e361$88780710$99681530$@chinamobile.com> <D0BA8841-BA90-4DF5-AAE5-A0113D4F17C7@gmail.com> <02fc01d8e537$6037c7e0$20a757a0$@chinamobile.com> <1A893DF5-816E-4D09-AAC6-065BBD1BD409@gmail.com> <Y1X2kvbLv0qXtD8z@faui48e.informatik.uni-erlangen.de> <DDD735E2-0930-4CB8-8992-E3E74C715D16@gmail.com> <Y1a8+EK9qA2kKDBF@faui48e.informatik.uni-erlangen.de> <03B2B681-FE16-4961-8932-1F3F29932837@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <03B2B681-FE16-4961-8932-1F3F29932837@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/mboned/sJZLHMkGOGgkHJEjGU4lIVhB9Fw>
Subject: Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

On Mon, Oct 24, 2022 at 07:31:22PM -0700, Dino Farinacci wrote:
> Toerless, a packet without source routing gives more packet space to the user than one that has source routing. Everything else is secondary and not relevant to this basic point.

And an IPv4 packet header also gives more space to the user than an IPv6 header.

> Solve the problem, whatever you think it is, with a control-plane.

You did not technically refute the network bandwidth saving example of the larger
header vs. fewer "duplicate packets" that i made in the prior message.
You did not provide any technical evidence against the operational cost reduction and
all the other benefits of source routing.  You are arguing on principle.

If your argument is that we should not only not build MSR but also not BIER, that is well
noted. The question really is how we should consider such foundational opposition to
technology choices in the IETF WG forming decision process.

My principle: When i need to send traffic to 1000 or 10,000 receivers, and the
solution that makes the overall system simplest, most scalable and most reliable to
build comes at the cost of sending 2x instead of 1x the payload, then that is still a
500x or 5000x bandwidth saving over unicast, and that's great.

Cheers
    Toerless

> Dino
> 
> > On Oct 24, 2022, at 9:27 AM, Toerless Eckert <tte@cs.fau.de> wrote:
> > 
> > On Sun, Oct 23, 2022 at 07:54:09PM -0700, Dino Farinacci wrote:
> >>> Think of a DC with 10,000 nodes and considering stateless
> >>> multicast source routing with 10,000 addressable destinations. 1280 min
> >>> MTU for ethernet is not a concern in such a network. Even if it
> >> 
> >> It is a concern if no user data is in the packet.
> > 
> > Rephrasing: Requiring a larger MTU than 1500 to support such an option is not
> > a limiting factor for this type of controlled network with next-generation hardware
> > (do you remember when we had to muck around with 64 kbyte jumbo packet support for
> > HEPnet networks in the 1990s and even in the early 2000s? Aka: networks with
> > larger MTU have been around forever).
> > 
> >>> costs 50% or more overhead on e.g. a 1280 byte packet payload, cutting
> >> 
> >> It's going to cost 1250 bytes for 10,000 destinations.
> > 
> > Right. And i provided the calculation showing that the overall amount of traffic even
> > with such a large header would be lower than replicating the packet and sending it
> > multiple times with a smaller header to reach all destinations.
> > 
> > BIER itself, for example, is spec'ed to support up to 4096 bits, based on worst
> > case/largest deployment case considerations as of 10 years ago. That might not
> > be required for the current SP-WAN deployments, but obviously, when we started BIER,
> > we also looked into broader use-case candidates than what's currently the BIER
> > deployment focus. Thinking of another factor 2 as a possible maximum is not a big
> > stretch of the imagination.
> > 
> >>> the addressing down to 5,000 and sending two packets across the network
> >>> would incur more bandwidth on it.
> >> 
> >> Sorry, your argument makes no sense.
> > 
> > Payload: 1280 bytes. Total number of egress routers: 10,000.
> > Sending one packet with 10,000 bit bitstring:      1250+1280 =  2530 bytes
> > Sending two packets with 5,000 bit bitstring:   2*(1280+625) =  3810 bytes
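
The arithmetic above can be reproduced with a short sketch. It is only an illustration: the function names are hypothetical, and, like the figures above, it counts just the payload and a flat bitstring with one bit per egress router, ignoring IPv6 and any other header bytes.

```python
import math

PAYLOAD_BYTES = 1280  # payload size used in the example above


def bitstring_bytes(bits: int) -> int:
    """Bytes needed for a flat bitstring with one bit per egress router."""
    return math.ceil(bits / 8)


def total_bytes(egress_routers: int, packets: int) -> int:
    """Bytes sent when the receiver set is split evenly across `packets` packets.

    Each packet carries the full payload plus a bitstring that covers only
    its share of the egress routers (assumption: an even split).
    """
    bits_per_packet = math.ceil(egress_routers / packets)
    return packets * (PAYLOAD_BYTES + bitstring_bytes(bits_per_packet))


if __name__ == "__main__":
    print("one packet: ", total_bytes(10_000, 1), "bytes")
    print("two packets:", total_bytes(10_000, 2), "bytes")
```

Under these assumptions the single packet costs 2530 bytes and the two-packet split costs 3810 bytes, because the payload is sent twice while the bitstring total stays at 1250 bytes.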
> > 
> >>> Splitting multicast across multiple packets also brings up the unfairness
> >>> concern of differential latency, and synchronization is then decided by the
> >>> "last receiver" with the highest propagation latency.
> >> 
> >> But if you don't have to split it up across two packets, it is better for the user.
> > 
> > You are making my point. Of course i know that you do not want to, because you are
> > arguing for stateful multicast solutions at scale.
> > 
> >> You CANNOT argue this point. You might say wasting data packet bandwidth to eliminate control state is a good tradeoff, but it clearly is not. And you won't be able to convince anyone of this point.
> > 
> > Customers are accepting the overhead of source routing headers over managing forwarding
> > state in networks. This applies both to unicast (SR vs. RSVP-TE) and multicast (BIER/MSR6).
> > Customers have eliminated stateful solutions such as RSVP-TE in favor of that. They
> > have replaced stateful multicast with ingress replication to avoid state in the core.
> > 
> > These customers are looking at the overall traffic savings. The reference is unicast, not
> > stateful multicast. If i take a unicast solution sending to 1000 to 10000 receivers it
> > requires 1000x...10000x of the payload size. If i give them a stateless multicast solution
> > that requires 1x...3x of the payload size because of the header, that IS preferable over
> > introducing a whole new stateful service into the network.
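
A minimal sketch of this comparison follows. It is not taken from any MSR6 or BIER document: the function names are hypothetical, and the model counts only bytes leaving the sender, expressed in units of one payload.

```python
def unicast_cost(receivers: int) -> float:
    """Unicast/ingress replication: one full payload copy per receiver."""
    return float(receivers)


def savings_factor(receivers: int, multicast_cost: float) -> float:
    """Ratio of unicast cost to stateless-multicast cost, in payload units.

    multicast_cost is payload plus source-routing header, expressed as a
    multiple of the payload size (1.0 to 3.0 in the text above).
    """
    if multicast_cost <= 0:
        raise ValueError("multicast_cost must be positive")
    return unicast_cost(receivers) / multicast_cost


if __name__ == "__main__":
    for receivers in (1000, 10000):
        for cost in (1.0, 2.0, 3.0):
            print(receivers, cost, round(savings_factor(receivers, cost), 1))
```

With a header that doubles the packet size (cost 2.0), the factor is 500 for 1000 receivers and 5000 for 10,000 receivers.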
> > 
> > Larger MTUs btw. have been common in many controlled networks forever. Remember all the
> > requirements for HEPnet in the 1990s with networks using up to 64k IP MTU? Nowadays
> > all those DC networks with RoCE also use larger MTU.
> > 
> > Of course, thinking of a header size of 1k is a stretch today, so most of my colleagues
> > also think that 1/10th of this is a reasonable limit, but i think that's just too much
> > grounded in backward fears and not comparing the actual use-case benefits, especially
> > simplicity, predictability, minimum latency/jitter. Those are going to be the criteria
> > of interest going forward.
> > 
> >> If you want people to take msr6 seriously, you have to make good obvious tradeoffs.
> > 
> > But let's make sure we do not assume that tradeoffs for an SP-WAN based on today's
> > router hardware limit the ability to make the best tradeoffs in a DC 5 years down the road.
> > 
> > [ I think we've seen this in the opposite direction with IPv6, where i think everybody
> > was happy to see min MTU raised to 1280 over IPv4, and the Internet was happy to get the
> > 64 bit routing address space (tongue in cheek ;-), except that "everybody" didn't
> > include all those IoT and other controlled networks, that since then had to almost start
> > an IETF area of their own to come up with all those workarounds to make IPv6 better fit
> > those networks (header compression, fragmentation, routing etc.). ]
> > 
> > I just don't want this to
> > happen for MSR6, but i want MSR6 to be scalable across a wider range of networks, especially
> > at the higher end in its core design. That's why i am happily being provocative here with
> > the source-routing size to have us think further. Especially when the IETF is mostly looking
> > (because of participation) at the SP-WAN market in the west, which alas is not
> > moving much, and ignoring that in countries like China there are still a lot more scale
> > requirements to solve. WAN and DC.
> > 
> >>> And some key applications in DCs may actually want to send lowest-latency traffic
> >>> to thousands of receivers.  Consider parallel compute application worker
> >>> management, like those customers have used since the early 2000s in DC.
> >>> Those packets may today need to go to thousands of parallel instances and
> >>> for fastest synchronization they should arrive at all of them at the same,
> >>> fastest time.
> >> 
> >> We can talk about low latency solutions once you give up the need to put so much state in a data packet. That is a different topic, and your data plane bloat won't solve either.
> > 
> > I respectfully disagree. Being able to send a packet to multiple destinations with an
> > equal latency in the usec range WITHOUT prior establishment of multicast state is
> > one core benefit of stateless multicast, and the data overhead of that is just a quantifiable
> > cost that can easily be judged vs. the alternative (stateful) based on use-case.
> > 
> >> Note if a packet is delivered on a state based delivery tree, with no source-route, and you have a joiner downstream on the distribution tree that joins "at the same time as the packet is traveling down the tree", that new receiver can get the packet.
> > 
> > A high dynamic rate of join/leaves is a myth that we succumbed to when
> > designing multicast RMT solutions in the IETF. In fact it is the stateless
> > multicast with source routing that is the key enabler of scalable/rate-adaptive
> > multicast transport solutions.  See draft-ietf-bier-multicast-http-response
> > (we'll have to rewrite this).
> > 
> >> With a source-route, any existing packets won't get delivered to that receiver. So you will have high join latency and missed opportunities to deliver packets already in flight to the new receiver.
> > 
> > This observation does not apply to applications where the sender knows best
> > who needs to get what. Which ultimately is the case in almost all multicast
> > applications. DASH/ABR over multicast (see above) or distributed coordination via
> > multicast.
> > 
> > Even without adaptive video, with just the most boring MPEG IPTV, the receiver driven joins
> > were a complete pain: Channel zapping where you really don't want to receive
> > duplicate traffic even for short periods (limited receiver link BW), but you also
> > don't want to join multicast as soon as possible, but get unicast until the next GOP.
> > This switchover is done sooooo much easier with source-routing.
> > 
> > Even when it's not the case, join propagation times are in the order of msec
> > per hop, vs. typically much shorter packet forwarding times host-to-host
> > in networks up to metro size.
> > 
> >>> Of course, one would certainly like a stateless source routing header
> >>> design that does not require to carry the 10,000 bit receiver information
> >> 
> >> "like"? How about it's a strong requirement for obvious reasons.
> > 
> > I should have said "unnecessary 10,000 bits". Aka: In the same way that
> > customers were happy to start SRv6 with 8*16=128 byte SRH steering data,
> > they also would like to see CRH. I was thinking of the same for bitstrings:
> > If all i have is a 10,000 bit "flat" bitstring, customers can make the easy calculation
> > how this is e.g.: 2x..3x of the payload data in the DC, so they'll go with it. But
> > when it's clear that we could compress the header significantly when the set
> > of receivers is more sparse, then they would of course want that (RBS).
> > 
> >>> if the addressed set is actually smaller (such as 200). And there are proposals for
> >>> that (dynamic source-route-header size based on size of receiver set).
> >>> See e.g: draft-eckert-msr6-rbs.
> >> 
> >> So you will have multiple solutions for group size? That is a bad tradeoff too. And what happens when you go from 200 receivers to 201, is there a major shift to a different solution?
> > 
> > [ There are working groups that had to claim the use-case was simple and clear and a
> > single solution easily feasible to get chartered. And then they evolved into completely
> > different use-cases and a wide range of alternative solution component pieces, but
> > never the original use-case ;-) ]
> > 
> > What we have with MSR6 are different candidate proposals, and one of the core parts of the
> > work of the proposed WG is to figure out what the best compromise is for the first WG and
> > industry adopted mechanisms.
> > 
> > Obviously, like Mr. Feynman, i would like a unified source-routing-header
> > theory of the universe. To this end we want, for example, to investigate
> > if the RBS option could also have favourable performance metrics over
> > destination-only source-routing, aka: like BIER. MSR6 docs call this the "BE" mode (Best Effort).
> > Even though RBS at its core is just a much better evolution for tree engineering over
> > BIER-TE.
> > 
> > If this fully unified option is insufficient, we most likely would arrive at one option for
> > BE and one for tree engineering (TE). Aka: maybe RBS for TE and flat-bitstring for
> > BE, and beside all the IPv6 forwarding plane specifics, we might just increase the
> > maximum supported header size for both based on hardware capability in future routers
> > (aka: today we have 512 byte examinable packet header in routers, then it likely could be
> > at least 1024). But that's personal conjecture.
> > 
> >> That sounds far worse than what we experienced switching from shared-tree to source-tree.
> > 
> > I may have worked with more customers having pain with that across more different
> > HW accelerated platforms than you did, so my mileage may vary. But i have a hard time
> > thinking these two technology aspects could be compared fairly in the same ballpark.
> > 
> > IMHO, packet header size is an easily quantified cost vs. benefit evaluation for the
> > customers.  Operations of RPT/SPT switchover is an ugly technology detail
> > nobody should need to understand, but if you're operating a PIM-SM network you MUST
> > understand all its bloody intricacies. And then there are the customers threatening you
> > with a lot of money and explaining to you that their definition of IP datagram forwarding is:
> > no loss, no jitter, no reorder. And you know how you can build multicast to do it, and
> > it's actually a nice way to justify the high cost of your product. Just RPT/SPT is not the way.
> > (but source-routing is one such option ;-).
> > 
> >>> Wrt to the receiver tracking: Remember that in end-to-end applications
> >>> only the sender may need to be involved in calculating the receivers,
> >>> (No network control plane harmed!).
> >> 
> >> Then you have the same problem with head-end replication at the source, as we do today with CDNs. You are just moving the problem and not solving the problem where a source just sends packets to a group address.
> > 
> > The source is the one sending the source-routed multicast packet.
> > 
> >> Today, multicast sending from a source CANNOT get more efficient. You just send one UDP packet to a multicast group. That is pretty simple in my mind and can't get any simpler. So anything you change will add overhead and of course a non-starter.
> > 
> > We did outsource the source discovery from network layer to application
> > land when we introduced SSM, because the network was doing a shitty job at
> > scaling for that (RPT/SPT just being one part of the problem).
> > 
> > We're moving membership management out of the
> > network with source-routing. Whether it's done locally in the source application
> > or with the help of a controller, it is so much easier to NOT do it in the
> > network where routers always have a constrained control plane, and not do it
> > distributed in 20 hops along a path, but only in source or controller, both
> > of which can much more easily have an arbitrary amount of CPU compared to
> > network devices - and a source especially only needs to bother about its own
> > traffic.
> > 
> > Let's also remind the rest of the audience here that, as i hope is fair to
> > say, this critique of yours is not specific to MSR6 but also applies to BIER.
> > 
> >>> Those host nodes typically can do a shi.load of compute in high-speed
> >>> CPU cache when they are DC servers.  In the mentioned parallel compute
> >>> worker management, this would for example be a dynamic subset
> >>> calculation of 10,000 parallel workers based on ongoing performance telemetry.
> >>> I bet those existing large-scale distributed compute apps already have to
> >>> spend orders of magnitude more compute than converting any such subset into
> >>> a bitstring.
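
To illustrate how cheap that last conversion step is, here is a minimal sketch of turning a chosen subset of workers into a flat bitstring. The function name and the bit numbering (worker 0 in the least significant bit, big-endian byte order) are assumptions made for this example and do not reflect the BIER or MSR6 wire format.

```python
from typing import Iterable


def subset_to_bitstring(worker_ids: Iterable[int], total_workers: int) -> bytes:
    """Encode a subset of workers as a flat bitstring, one bit per worker.

    Hypothetical layout: worker i sets bit i of an integer that is then
    serialized big-endian into ceil(total_workers / 8) bytes.
    """
    bits = 0
    for wid in worker_ids:
        if not 0 <= wid < total_workers:
            raise ValueError(f"worker id {wid} out of range")
        bits |= 1 << wid
    return bits.to_bytes((total_workers + 7) // 8, "big")


if __name__ == "__main__":
    # Select every 50th worker out of 10,000 (an arbitrary example subset).
    selected = range(0, 10_000, 50)
    header = subset_to_bitstring(selected, 10_000)
    print(len(header), "bytes for", len(selected), "selected workers")
```

The work is one shift and one OR per selected worker, and for 10,000 addressable workers the result is always 1250 bytes regardless of how many are selected.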
> >> 
> >> They can do better and faster by not doing this.
> > 
> > First of all, this is not true for native source-routing apps like i mentioned above,
> > where the application sender can now perfectly manage things like sending exactly
> > only what's needed during channel zapping of a receiver (receiver in old TV channel
> > source-routing header, followed by unicast cached reference frames of new TV channel,
> > followed by receiver in new TV channel source-routing header). Likewise all the
> > similar application examples for other applications in DC.
> > 
> > Btw: It's one of the big pains that we never so far got into writing down more elaborately
> > all those application benefits of source-routing. Alas, this is because no application
> > developer can really buy routers with BIER today to experiment with.
> > 
> > Secondly: Even when we have "good-old-ip-multicast" applications with receiver
> > joins, the overall solution complexity network+application goes down by moving
> > to SSM, and it goes down even further when we do the receiver tracking in
> > the SSM sender.
> > 
> > Cheers
> >    Toerless
> > 
> >> Dino
> >> 
> > 
> > -- 
> > ---
> > tte@cs.fau.de
> 

-- 
---
tte@cs.fau.de