Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

Toerless Eckert <tte@cs.fau.de> Mon, 24 October 2022 02:21 UTC

Date: Mon, 24 Oct 2022 04:21:06 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Dino Farinacci <farinacci@gmail.com>
Cc: Yisong Liu <liuyisong@chinamobile.com>, msr6@ietf.org, pim@ietf.org, BIER WG <bier@ietf.org>, mboned@ietf.org, Stig Venaas <stig@venaas.com>, hooman.bidgoli@nokia.com
Message-ID: <Y1X2kvbLv0qXtD8z@faui48e.informatik.uni-erlangen.de>
References: <011701d8e361$88780710$99681530$@chinamobile.com> <D0BA8841-BA90-4DF5-AAE5-A0113D4F17C7@gmail.com> <02fc01d8e537$6037c7e0$20a757a0$@chinamobile.com> <1A893DF5-816E-4D09-AAC6-065BBD1BD409@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <1A893DF5-816E-4D09-AAC6-065BBD1BD409@gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/mboned/ecHky877tFXTYUByfcInumV9hso>
Subject: Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

Dino, *:

Think of a DC with 10,000 nodes, and consider stateless multicast
source routing with 10,000 addressable destinations. The 1280-byte
minimum MTU is not a concern in such a network. Even if the header costs
50% or more overhead on, e.g., a 1280-byte packet payload, cutting the
addressing down to 5,000 destinations per packet and sending two packets
across the network would consume more bandwidth.
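
To put rough numbers on that comparison, here is a minimal sketch. It
assumes one header bit per addressable destination, ignores every other
header field, and looks at a link that both packets have to cross; the
function name and parameters are purely illustrative.

```python
def wire_bytes(num_destinations, payload_bytes, packets):
    """Bytes on a shared link when the destination bits are split
    evenly across `packets` copies of the same payload."""
    bits_per_packet = -(-num_destinations // packets)  # ceiling division
    header_bytes = -(-bits_per_packet // 8)            # round up to whole bytes
    return packets * (header_bytes + payload_bytes)

one_packet = wire_bytes(10_000, 1280, packets=1)   # 1250-byte header, sent once
two_packets = wire_bytes(10_000, 1280, packets=2)  # 625-byte header, sent twice

print(one_packet, two_packets)  # 2530 3810
```

On links where only one half of the destinations is downstream, just one
of the two packets would be present, so this is the cost on the shared
part of the tree only.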

Splitting a multicast across multiple packets also raises the unfairness
concern of differential latency, and the synchronization concern that the
"last receiver" determines the highest propagation latency.

And some key applications in DCs may actually want to send lowest-latency traffic
to thousands of receivers. Consider worker management for parallel compute
applications, like those customers have run in DCs since the early 2000s.
Those packets may today need to go to thousands of parallel instances, and
for fastest synchronization they should arrive at all of them at the same,
earliest possible time.

Or think of high-volume multicast apps distributing content, where
flow completion time is the key factor. Even if such an app multicasts to just 100
receivers out of 10,000: if you cannot predict the subset at deployment
time, you could end up having to send between 4 and 40 times the traffic
to different receiver subsets, because 10,000/256 (bitstring size) is about 40,
and in the worst case there is at least one of the 100 receivers in each
Set Identifier (bitstring).
That's a whole new layer of traffic management pain for DCs and
for the application owner: performance could vary from best to up to 8 times
(40/5) slower. And yes: maybe I could MacGyver a solution with SI, the entropy
field, and the ECMP available on the DC routers, where I ECMP load-split the
traffic for different SIs, but that would be highly complex and difficult to generalize.
And it still wouldn't reduce the overall fabric load, which changes with the receiver set.
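
A small sketch of how the packet count depends on where the receivers
happen to sit. It assumes receivers are numbered 0..9999 and that the
Set Identifier is simply the receiver number divided by the bitstring
length; that mapping and all names below are assumptions for
illustration, not taken from any specification.

```python
import random

BITSTRING_LEN = 256      # bits per bitstring, as in the text
NUM_RECEIVERS = 10_000   # addressable destinations

def set_identifiers(receivers, bitstring_len=BITSTRING_LEN):
    """Distinct SIs touched by a receiver set (one packet copy per SI)."""
    return {r // bitstring_len for r in receivers}

total_sis = -(-NUM_RECEIVERS // BITSTRING_LEN)  # ceiling division

# Worst case: 100 receivers spread so that every SI holds at least one.
worst = [(i % total_sis) * BITSTRING_LEN + i // total_sis for i in range(100)]

# A subset that could not be planned for at deployment time.
unplanned = random.Random(0).sample(range(NUM_RECEIVERS), 100)

print("SIs available:      ", total_sis)
print("copies, worst case: ", len(set_identifiers(worst)))
print("copies, unplanned:  ", len(set_identifiers(unplanned)))
```

The sender has to emit one copy of the payload per SI touched, so the
number printed for each receiver set is the traffic multiplier it causes.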

Of course, one would certainly like a stateless source routing header
design that does not need to carry the full 10,000 bits of receiver information
if the addressed set is actually smaller (such as 200). And there are proposals for
that (dynamic source-route header size based on the size of the receiver set).
See, e.g., draft-eckert-msr6-rbs.

With respect to receiver tracking: remember that in end-to-end applications,
only the sender may need to be involved in calculating the receivers.
(No network control plane harmed!)

Those host nodes typically can do a shi.load of compute in high-speed
CPU cache when they are DC servers.  In the mentioned parallel compute
worker management, this would for example be a dynamic subset
calculation of 10,000 parallel workers based on ongoing performance telemetry.
I bet those existing large-scale distributed compute apps already have to
spend orders of magnitude more compute than converting any such subset into
a bitstring.
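
For a sense of how little work that conversion is, a minimal sketch,
assuming 0-based receiver indices and an arbitrary bit-to-receiver
layout (both are assumptions here; real encodings define their own bit
numbering). The function name is hypothetical.

```python
def to_bitstring(receivers, num_bits):
    """Pack a set of 0-based receiver indices into a big-endian byte string."""
    bits = 0
    for r in receivers:
        if not 0 <= r < num_bits:
            raise ValueError(f"receiver index {r} out of range")
        bits |= 1 << r                       # set the bit for this receiver
    return bits.to_bytes((num_bits + 7) // 8, "big")

# A 200-worker subset out of 10,000, e.g. picked from telemetry.
header = to_bitstring(range(0, 10_000, 50), 10_000)
print(len(header))  # 1250
```

That is one OR operation per selected receiver, which is the scale of
work being compared against the rest of the application's compute.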

Cheers
   Toerless

On Fri, Oct 21, 2022 at 09:18:07AM -0700, Dino Farinacci wrote:
> > 1. From the view of existing MSR6 TE document, whether 10000 receivers encoded in a packet is acceptable or not depends 
> 
> There is no reason that it is acceptable. There is not enough MTU for user data. It's absurd to even consider it.
> 
> > on the topology and the encoding method. In my understanding, in a total stateless solution, at least 10000 bit should be used for encoding such a multicast tree in the packet as the header expanse.
> 
> 10000/8 = 1250 bytes
> 
> This is a non-starter of epic proportions.
> 
> > 2. From the view of MSR6, we think this is valid scenario for the large network with large multicast trees and we can discuss more efficient solutions based on source-routing.
> 
> Right, and the existing solutions that have been proven, granted with a given control-plane complexity, are multi-router distribution trees where branches of the tree aggregate receiver state so per-receiver state need not be stored anywhere in the network.
> 
> Dino
> 

-- 
---
tte@cs.fau.de