Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

Dino Farinacci <farinacci@gmail.com> Mon, 24 October 2022 02:54 UTC

From: Dino Farinacci <farinacci@gmail.com>
In-Reply-To: <Y1X2kvbLv0qXtD8z@faui48e.informatik.uni-erlangen.de>
Date: Sun, 23 Oct 2022 19:54:09 -0700
Cc: Yisong Liu <liuyisong@chinamobile.com>, msr6@ietf.org, pim@ietf.org, BIER WG <bier@ietf.org>, mboned@ietf.org, Stig Venaas <stig@venaas.com>, hooman.bidgoli@nokia.com
Message-Id: <DDD735E2-0930-4CB8-8992-E3E74C715D16@gmail.com>
References: <011701d8e361$88780710$99681530$@chinamobile.com> <D0BA8841-BA90-4DF5-AAE5-A0113D4F17C7@gmail.com> <02fc01d8e537$6037c7e0$20a757a0$@chinamobile.com> <1A893DF5-816E-4D09-AAC6-065BBD1BD409@gmail.com> <Y1X2kvbLv0qXtD8z@faui48e.informatik.uni-erlangen.de>
To: Toerless Eckert <tte@cs.fau.de>
Archived-At: <https://mailarchive.ietf.org/arch/msg/mboned/J5f-fTXTtqL-6TwX-41CGXjo9hA>
Subject: Re: [MBONED] [Msr6] MSR6 BOF 3rd Issue Category: More details are requested about the large scale use cases, including issue 8-11

> Think of a DC with 10,000 nodes and consider stateless
> multicast source routing with 10,000 addressable destinations. The 1280 min
> MTU for Ethernet is not a concern in such a network. Even if it

It is a concern if no user data is in the packet.

> costs 50% or more overhead on e.g. a 1280 byte packet payload, cutting

It's going to cost 1250 bytes for 10,000 destinations.

> the addressing down to 5,000 and sending two packets across the network
> would incur more bandwidth on it.

Sorry, your argument makes no sense.
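
For concreteness, the arithmetic behind the 1250-byte figure (a rough sketch assuming a flat one-bit-per-destination bitstring and the 1280-byte IPv6 minimum MTU, not any specific header format):

    # Back-of-the-envelope overhead for a flat per-destination bitstring
    # (assumption: one bit per addressable node, no compression, no sets).
    DESTINATIONS = 10_000
    MIN_IPV6_MTU = 1280                  # bytes, the IPv6 minimum link MTU

    bitstring_bytes = DESTINATIONS // 8  # 1250 bytes just for the bitmap
    room_left = MIN_IPV6_MTU - bitstring_bytes
    print(bitstring_bytes, room_left)    # 1250, 30: essentially no room for user
                                         # data once the IPv6 and extension
                                         # headers are counted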

> Splitting multicast across multiple packets also brings up the unfairness
> concern of differential latency, with synchronization decided by the "last-receiver"
> (highest) propagation latency.

But if you don't have to split it up across two packets, it is better for the user. You CANNOT argue this point. You might say wasting data-packet bandwidth to eliminate control state is a good tradeoff, but it clearly is not. And you won't be able to convince anyone of this point.

If you want people to take msr6 seriously, you have to make good obvious tradeoffs.

> And some key applications in DCs may actually want to send lowest-latency traffic
> to thousands of receivers.  Consider parallel compute application worker
> management, like those customers have used since the early 2000s in DCs.
> Those packets may today need to go to thousands of parallel instances and
> for fastest synchronization they should arrive at all of them at the same,
> fastest time.

We can talk about low-latency solutions once you give up the need to put so much state in a data packet. That is a different topic, and your data-plane bloat won't solve it either.

> Or think of high-volume multicast apps distributing content, where
> flow completion time is the key factor. Even if that multicasts just to 100
> receivers out of 10,000: if you cannot predict the subset at deployment
> time, then you could end up having to send between 4 and 40 times the traffic
> to different receiver subsets, because 10,000/256 (bitstring size) = 40,
> and there is at least one of the 100 receivers in each Set Identifier (bitstring).
> That's a whole new layer of traffic-management pain for DCs and
> for the application owner: performance could vary from best case to up to 8 times
> (40/5) slower. And yes: maybe I could MacGyver a solution with the SI and entropy
> fields and the available ECMP on the DC routers, where I ECMP load-split the
> traffic for different SIs, but that would be highly complex and difficult to generalize.
> And it still wouldn't reduce the overall fabric load that changes by receiver set.
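
For concreteness, a rough check of those numbers (my assumptions, not from the thread: 10,000 receivers split evenly across 40 Set Identifiers of 250 addresses each, and a 100-receiver subset chosen uniformly at random):

    # Expected number of Set Identifiers, and hence packet copies, touched
    # by a random 100-receiver subset (hypothetical even split into 40 SIs).
    N, PER_SI, K = 10_000, 250, 100
    NUM_SI = N // PER_SI                       # 40

    # P(a given SI contains none of the K receivers), hypergeometric
    p_empty = 1.0
    for i in range(K):
        p_empty *= (N - PER_SI - i) / (N - i)

    expected_si = NUM_SI * (1 - p_empty)
    print(f"~{expected_si:.1f} of {NUM_SI} SIs hit")   # ~36.9 of 40, close to
                                                       # the 40x worst case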

Note that if a packet is delivered on a state-based delivery tree, with no source route, and a receiver downstream on the distribution tree joins while the packet is traveling down the tree, that new receiver can still get the packet. With a source route, packets already in flight won't get delivered to that receiver. So you will have higher join latency and missed opportunities to deliver in-flight packets to the new receiver.
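
A toy way to see that difference (a purely illustrative model, not any particular protocol; all names are made up):

    # With tree state, the receiver set is looked up when the packet is forwarded;
    # with a source-encoded receiver list, the set is frozen at send time.
    def deliver_via_tree_state(packet, current_members):
        return set(current_members())            # late joiners are included

    def deliver_via_source_route(packet):
        return set(packet["encoded_receivers"])  # late joiners are missed

    members = {"r1", "r2"}
    pkt = {"encoded_receivers": frozenset(members)}  # encoded at the head end

    members.add("r3")  # r3 joins while pkt is still in flight

    print(deliver_via_tree_state(pkt, lambda: members))  # {'r1', 'r2', 'r3'}
    print(deliver_via_source_route(pkt))                 # {'r1', 'r2'}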

> Of course, one would certainly like a stateless source routing header
> design that does not require to carry the 10,000 bit receiver information

"like"? How about it's a strong requirement for obvious reasons.

> if the addressed set is actually smaller (such as 200). And there are proposals for
> that (dynamic source-route-header size based on size of receiver set).
> See, e.g., draft-eckert-msr6-rbs.

So you will have multiple solutions depending on group size? That is a bad tradeoff too. And what happens when you go from 200 receivers to 201? Is there a major shift to a different solution?

That sounds far worse than what we experienced switching from shared-tree to source-tree.

> Wrt the receiver tracking: remember that in end-to-end applications
> only the sender may need to be involved in calculating the receivers
> (no network control plane harmed!).

Then you have the same head-end replication problem at the source that we have today with CDNs. You are just moving the problem, not solving it, compared to the model where a source just sends packets to a group address.

Today, multicast sending from a source CANNOT get more efficient. You just send one UDP packet to a multicast group. That is pretty simple in my mind and can't get any simpler. So anything you change will add overhead and is of course a non-starter.
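
For comparison, this is everything a native multicast sender has to do today (a minimal sketch; the group address, port, and hop limit are illustrative, not from the thread):

    # One sendto() to a group address: no receiver list, no per-receiver state.
    import socket

    sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_MULTICAST_HOPS, 16)
    sock.sendto(b"payload", ("ff3e::1234", 5001, 0, 0))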

> Those host nodes typically can do a shi.load of compute in high-speed
> CPU cache when they are DC servers.  In the mentioned parallel compute
> worker management, this would for example be a dynamic subset
> calculation of 10,000 parallel workers based on ongoing performance telemetry.
> I bet those existing large-scale distributed compute apps already have to
> spend orders of magnitude more compute than it takes to convert any such
> subset into a bitstring.
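
(For reference, the conversion being described is indeed cheap; a sketch assuming a flat 10,000-entry receiver ID space, with made-up IDs:)

    # Encode a receiver subset as a flat bitstring.
    def to_bitstring(receivers, size=10_000):
        bits = bytearray((size + 7) // 8)      # 1250 bytes for 10,000 IDs
        for r in receivers:
            bits[r // 8] |= 1 << (r % 8)
        return bytes(bits)

    header = to_bitstring({17, 4095, 9999})
    print(len(header))                         # 1250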

They can do better and faster by not doing this.

Dino