Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

Toerless Eckert <tte@cs.fau.de> Mon, 24 October 2022 16:27 UTC

Date: Mon, 24 Oct 2022 18:27:36 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Dino Farinacci <farinacci@gmail.com>
Cc: Yisong Liu <liuyisong@chinamobile.com>, msr6@ietf.org, pim@ietf.org, BIER WG <bier@ietf.org>, mboned@ietf.org, Stig Venaas <stig@venaas.com>, hooman.bidgoli@nokia.com
Archived-At: <https://mailarchive.ietf.org/arch/msg/mboned/NDzJnd7YYH6FlT3ijjITR9FQ2Wo>

On Sun, Oct 23, 2022 at 07:54:09PM -0700, Dino Farinacci wrote:
> > Think of a DC with 10,000 nodes and considering stateless
> > multicast source routing with 10,000 addressable destinations. 1280 min
> > MTU for ethernet is not a concern in such a network. Even if it
> 
> It is a concern if no user data is in the packet.

Rephrasing: Requiring a larger MTU than 1500 to support such an option is not
a limiting factor for this type of controlled network with next-generation hardware
(do you remember when we had to muck around with 64 kbyte jumbo packet support for
 HEPnet networks in the 1990s and even in the early 2000s? Aka: networks with
 larger MTU have been around forever).

> > costs 50% or more overhead on e.g. a 1280 byte packet payload, cutting
> 
> It's going to cost 1250 bytes for 10,000 destinations.

Right. And i provided the calculation of how the overall amount of traffic even with
such a large header would be lower than replicating the packet and sending it
multiple times with a smaller header to reach all destinations.

BIER itself, for example, is spec'ed to support up to 4096 bits, based on
worst-case/largest-deployment considerations as of 10 years ago. That might not
be required for the current SP-WAN deployments, but obviously, when we started BIER,
we also looked into broader use-case candidates than what's currently the BIER
deployment focus. Thinking of another factor of 2 as a possible maximum is not a big
stretch of the imagination.

> > the addressing down to 5,000 and sending two packets across the network
> > would be more bandwidth incurred on it.
> 
> Sorry, your argument makes no sense.

Payload: 1280 bytes. Total number of egress routers: 10,000.
Sending one packet with 10,000 bit bitstring:      1280+1250 =  2530 bytes
Sending two packets with 5,000 bit bitstring:   2*(1280+625) =  3810 bytes
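
The arithmetic above can be reproduced with a small sketch. The function names are illustrative only and do not come from any draft, and it counts just the payload and a flat bitstring with one bit per egress router, ignoring all other header bytes:

```python
import math

def bitstring_bytes(num_bits: int) -> int:
    """Bytes needed for a flat bitstring with one bit per egress router."""
    return math.ceil(num_bits / 8)

def total_bytes(payload: int, egress_routers: int, num_packets: int) -> int:
    """Total bytes sent when the receiver set is split evenly over num_packets
    copies of the same payload. Assumption: only payload and bitstring are
    counted; IPv6 and any other header overhead is ignored."""
    bits_per_packet = math.ceil(egress_routers / num_packets)
    return num_packets * (payload + bitstring_bytes(bits_per_packet))

one_packet = total_bytes(payload=1280, egress_routers=10_000, num_packets=1)
two_packets = total_bytes(payload=1280, egress_routers=10_000, num_packets=2)
print(one_packet, two_packets)  # 2530 3810
```

Under these assumptions, each extra split repeats the full payload while only shrinking the bitstring, which is why the single large header comes out ahead.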

> > Splitting multicast across multiple packets also brings up the unfairness
> > concern of differential latency and the synchronization deciding "last-receiver"
> > highest latency propagation latency.
> 
> But if you don't have to split it up across two packets, it is better for the user.

You are making my point. Of course i know how you do not want to, because you are
arguing for stateful multicast solutions at scale.

> You CANNOT argue this point. You might say wasting data packet bandwidth to eliminate control state is a good tradeoff, but it clearly is not. And you won't be able to convince this point to anyone.

Customers are accepting the overhead of source routing headers over managing forwarding
state in networks. This applies both to unicast (SR vs. RSVP-TE) and multicast (BIER/MSR6).
Customers have eliminated stateful solutions such as RSVP-TE in favor of that. They
have replaced stateful multicast with ingress replication to avoid state in the core.

These customers are looking at the overall traffic savings. The reference is unicast, not
stateful multicast. If i take a unicast solution sending to 1000 to 10000 receivers, it
requires 1000x...10000x of the payload size. If i give them a stateless multicast solution
that requires 1x...3x of the payload size because of the header, that IS preferable over
introducing a whole new stateful service into the network.
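
As a rough sketch of that comparison, the following computes what the source itself has to send in both cases. The 40 byte figure is the fixed IPv6 header; the function names and the choice to ignore extension-header framing are assumptions made for illustration, not part of BIER or MSR6:

```python
IPV6_HEADER_BYTES = 40  # fixed IPv6 header; extension-header framing ignored (assumption)

def ingress_replication_bytes(payload: int, receivers: int) -> int:
    """Bytes the source sends when it unicasts one copy per receiver."""
    return receivers * (IPV6_HEADER_BYTES + payload)

def stateless_multicast_bytes(payload: int, addressable: int) -> int:
    """Bytes the source sends for one packet carrying a flat bitstring
    with one bit per addressable destination (hypothetical encoding)."""
    bitstring = (addressable + 7) // 8
    return IPV6_HEADER_BYTES + bitstring + payload

payload = 1280
for n in (1_000, 10_000):
    unicast = ingress_replication_bytes(payload, n)
    multicast = stateless_multicast_bytes(payload, n)
    print(f"{n} receivers: unicast {unicast / payload:.0f}x payload, "
          f"bitstring multicast {multicast / payload:.2f}x payload")
```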

Larger MTUs btw. have been common in many controlled networks forever. Remember all the
requirements for HEPnet in the 1990s with networks using up to 64k IP MTU? Nowadays
all those DC networks with RoCE also use larger MTUs.

Of course, thinking of a header size of 1k is a stretch today, so most of my colleagues
also think that 1/10th of this is a reasonable limit, but i think that's just too much
grounded in backward-looking fears and not comparing the actual use-case benefits, especially
simplicity, predictability, minimum latency/jitter. Those are going to be the criteria
of interest going forward.

> If you want people to take msr6 seriously, you have to make good obvious tradeoffs.

But let's make sure we do not assume that the tradeoffs for an SP-WAN based on today's
router hardware limit the ability to make the best tradeoffs in a DC 5 years down the road.

[ I think we've seen this in the opposite direction with IPv6, where i think everybody
was happy to see min MTU raised to 1280 over IPv4, and the Internet was happy to get the
64 bit routing address space (tongue in cheek ;-), except that "everybody" didn't
include all those IoT and other controlled networks, that since then had to almost start
an IETF area of their own to come up with all those workarounds to make IPv6 better fit
those networks (header compression, fragmentation, routing etc.). ]

I just don't want this to happen for MSR6; i want MSR6 to be scalable across a wider
range of networks, especially at the higher end, in its core design. That's why i am
happily being provocative here with the source-routing size to have us think further.
Especially when the IETF is (because of participation) mostly looking at the SP-WAN
market in the west, which alas is not moving much, and ignoring that in countries like
China there are still a lot more scale requirements to solve, in both WAN and DC.

> > And some key applications in DCs may actually want to send lowest-latency traffic
> > to thousands of receivers.  Consider parallel compute application worker
> > management, like those customers have used since the early 2000s in DC.
> > Those packets may today need to go to thousands of parallel instances and
> > for fastest synchronization they should arrive at all of them at the same,
> > fastest time.
> 
> We can talk about low latency solutions once you give up the need to put so much state in a data packet. That is a different topic, and your data plane bloat won't solve either.

I respectfully disagree. Being able to send a packet to multiple destinations with an
equal latency in the usec range WITHOUT prior establishment of multicast state is
one core benefit of stateless multicast, and the data overhead of that is just a quantifiable
cost that can easily be judged vs. the alternative (stateful) based on use-case.

> Note if a packet is delivered on a state based delivery tree, with no source-route, and you have a joiner downstream on the distribution tree that joins "at the same time as the packet is traveling down the tree", that new receiver can get the packet.

A high dynamic rate of joins/leaves is a myth that we succumbed to when
designing multicast RMT solutions in the IETF. In fact it is stateless
multicast with source routing that is the key enabler of scalable/rate-adaptive
multicast transport solutions.  See draft-ietf-bier-multicast-http-response
(we'll have to rewrite this).

> With a source-route, any existing packets won't get delivered to that receiver. So you will have high join latency and missed opportunities to deliver packets already in flight to the new receiver.

This observation does not apply to applications where the sender knows best
who needs to get what, which ultimately is the case in almost all multicast
applications: DASH/ABR over multicast (see above) or distributed coordination via
multicast.

Even without adaptive video, with the most boring MPEG IPTV, the receiver-driven joins
were a complete pain: Channel zapping, where you really don't want to receive
duplicate traffic even for short periods (limited receiver link BW), but you also
don't want to join the multicast as soon as possible, but rather get unicast until the
next GOP. This switchover is done sooooo much easier with source-routing.

Even when it's not the case, join propagation times are in the order of msec
per hop, vs. typically much shorter packet forwarding times host-to-host
in networks up to metro size.

> > Of course, one would certainly like a stateless source routing header
> > design that does not require to carry the 10,000 bit receiver information
> 
> "like"? How about it's a strong requirement for obvious reasons.

I should have said "unnecessary 10,000 bits". Aka: In the same way that
customers were happy to start SRv6 with 8*16=128 byte SRH steering data,
they also would like to see CRH. I was thinking of the same for bitstrings:
If all i have is a 10,000 bit "flat" bitstring, customers can make the easy calculation
how this is e.g.: 2x..3x of the payload data in the DC, so they'll go with it. But
when it's clear that we could compress the header significantly when the number
of receivers is more sparse, then they would of course want that (RBS).
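
To illustrate why a sparse receiver set invites a smaller header, here is a minimal sketch comparing the flat bitstring with a naive list of receiver indices. The index-list encoding is purely hypothetical and chosen only for the arithmetic; it is NOT the RBS encoding from draft-eckert-msr6-rbs:

```python
def flat_bits(addressable: int) -> int:
    """Flat bitstring: one bit per addressable destination, however few are set."""
    return addressable

def index_list_bits(addressable: int, receivers: int) -> int:
    """Hypothetical sparse encoding (NOT RBS): one fixed-width index per receiver."""
    bits_per_index = max(1, (addressable - 1).bit_length())
    return receivers * bits_per_index

def break_even_receivers(addressable: int) -> int:
    """Largest receiver count for which the index list is no larger than the flat bitstring."""
    return addressable // max(1, (addressable - 1).bit_length())

print(index_list_bits(10_000, 200) // 8, "bytes vs", flat_bits(10_000) // 8, "bytes")
print("break-even at", break_even_receivers(10_000), "receivers")
```

Even this crude encoding is several times smaller than the flat bitstring for 200 receivers out of 10,000, and only loses once the set becomes dense.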

> > if the addressed set is actually smaller (such as 200). And there are proposals for
> > that (dynamic source-route-header size based on size of receiver set).
> > See e.g: draft-eckert-msr6-rbs.
> 
> So you will have multiple solutions for group size? That is a bad tradeoff too. And what happens when you go from 200 receivers to 201, is there a major shift to a different solution?

[ There are working groups that had to claim the use-case was simple and clear and a
single solution easily feasible to get chartered. And then they evolved into completely
different use-cases and a wide range of alternative solution component pieces, but
never the original use-case ;-) ]

What we have with MSR6 are different candidate proposals, and one of the core parts of the
work of the proposed WG is to figure out what the best compromise is for the first WG- and
industry-adopted mechanisms.

Obviously, like Mr. Feynman, i would like a unified source-routing-header
theory of the universe. To this end we want, for example, to investigate
whether the RBS option could also have favourable performance metrics over
destination-only source-routing (aka: like BIER; MSR6 docs call this the "BE" mode,
Best Effort), even though RBS at its core is just a much better evolution for tree
engineering over BIER-TE.

If this fully unified option is insufficient, we most likely would arrive at one option for
BE and one for tree engineering (TE). Aka: maybe RBS for TE and flat-bitstring for
BE, and besides all the IPv6 forwarding plane specifics, we might just increase the
maximum supported header size for both based on hardware capability in future routers
(aka: today we have 512 byte examinable packet header in routers, then it likely could be
 at least 1024). But that's personal conjecture.

> That sounds far worse than what we experienced switching from shared-tree to source-tree.

I may have worked with more customers having pain with that across more different
HW accelerated platforms than you did, so my mileage may vary. But i have a hard time
thinking these two technology aspects could be compared fairly in the same ballpark.

IMHO, packet header size is an easily quantified cost vs. benefit evaluation for the
customers.  Operation of RPT/SPT switchover is an ugly technology detail
nobody should need to understand, but if you're operating a PIM-SM network you MUST
understand all its bloody intricacies. And then there are the customers threatening you
with a lot of money and explaining to you that their definition of IP datagram forwarding is:
no loss, no jitter, no reorder. And you know how you can build multicast to do it, and
it's actually a nice way to justify the high cost of your product. Just RPT/SPT is not the way
(but source-routing is one such option ;-).

> > Wrt to the receiver tracking: Remember that in end-to-end applications
> > only the sender may need to be involved in calculating the receivers,
> > (No network control plane harmed!).
> 
> Then you have the same problem with head-end replication at the source, as we do today with CDNs. You are just moving the problem and not solving the problem where a source just sends packets to a group address.

The source is the one sending the source-routed multicast packet.

> Today, multicast sending from a source CANNOT get more efficient. You just send one UDP packet to a multicast group. That is pretty simple in my mind and can't get any simpler. So anything you change will add overhead and of course a non-starter.

We did outsource the source discovery from the network layer to application
land when we introduced SSM, because the network was doing a shitty job at
scaling for that (RPT/SPT just being one part of the problem).

We're moving membership management out of the
network with source-routing. Whether it's done locally in the source application
or with the help of a controller, it is so much easier to NOT do it in the
network where routers always have a constrained control plane, and not do it
distributed in 20 hops along a path, but only in the source or controller, both
of which can much more easily have an arbitrary amount of CPU compared to
network devices - and a source especially only needs to bother about its own
traffic.
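
A minimal sketch of what sender-side membership management could look like: the source keeps a set of receiver IDs and turns it into a flat bitstring per packet. The function name, the 1-based IDs and the bit ordering (ID 1 is the least significant bit of the last byte) are assumptions for illustration, not the MSR6 wire format:

```python
from typing import Iterable

def members_to_bitstring(member_ids: Iterable[int], bitstring_len: int) -> bytes:
    """Build a flat bitstring with one bit per receiver ID (IDs are 1-based).

    Assumed layout: ID 1 maps to the least significant bit of the last byte.
    """
    bits = 0
    for rid in member_ids:
        if not 1 <= rid <= bitstring_len:
            raise ValueError(f"receiver id {rid} outside 1..{bitstring_len}")
        bits |= 1 << (rid - 1)
    return bits.to_bytes((bitstring_len + 7) // 8, "big")

# The source (or a controller) owns the membership; no router state is touched.
members = {1, 9}
members.add(16)      # a receiver asks the source to be added
members.discard(9)   # another one leaves
header = members_to_bitstring(members, bitstring_len=16)
print(header.hex())  # 8001
```

Building the bitstring is one pass over the member set, which is cheap compared to the per-hop join/prune processing it replaces.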

Let's also remind the rest of the audience here that, i hope it is fair to
say, this critique of yours is not specific to MSR6 but also applies to BIER.

> > Those host nodes typically can do a shi.load of compute in high-speed
> > CPU cache when they are DC servers.  In the mentioned parallel compute
> > worker management, this would for example be a dynamic subset
> > calculation of 10,000 parallel workers based on ongoing performance telemetry.
> > I bet those existing large-scale distributed compute apps already have to
> > spend orders of magnitude more compute than converting any such subset into
> > a bitstring.
> 
> They can do better and faster by not doing this.

First of all, this is not true for native source-routing apps like i mentioned above,
where the application sender can now perfectly manage things like sending exactly
only what's needed during channel zapping of a receiver (receiver in old TV channel
source-routing header, followed by unicast cached reference frames of new TV channel,
followed by receiver in new TV channel source-routing header). Likewise all the
similar application examples for other applications in DC.

Btw: It's one of the big pains that we never so far got into writing down more elaborately
all those application benefits of source-routing. Alas, this is because no application
developer can really buy routers with BIER today to experiment with.

Secondly: Even when we have "good-old-ip-multicast" applications with receiver
joins, the overall solution complexity network+application goes down by moving
to SSM, and it even further moves down when we do the receiver tracking in
the SSM sender. 

Cheers
    Toerless

> Dino
> 

-- 
---
tte@cs.fau.de