OT and trimmed (was Re: Q on the congestion awareness of routing protocols)

Curtis Villamizar <curtis@ietf.occnc.com> Tue, 06 December 2022 12:25 UTC

To: "Bless, Roland (TM)" <roland.bless@kit.edu>
cc: Curtis Villamizar <curtis@ietf.occnc.com>, bier@ietf.org, tsv-area@ietf.org, pim@ietf.org, routing-discussion@ietf.org
Reply-To: Curtis Villamizar <curtis@ietf.occnc.com>
From: Curtis Villamizar <curtis@ietf.occnc.com>
Subject: OT and trimmed (was Re: Q on the congestion awareness of routing protocols)
In-reply-to: Your message of "Mon, 05 Dec 2022 11:10:16 +0100." <1fb6e5d2-a0c0-6abe-1a5c-9d1d24575177@kit.edu>
Date: Tue, 06 Dec 2022 00:59:37 -0500
Message-Id: <20221206122459.4D447C14CE4B@ietfa.amsl.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsv-area/hLRYAVnqk5cDlQTi3Rhy8fP-XKs>

In message <1fb6e5d2-a0c0-6abe-1a5c-9d1d24575177@kit.edu>
"Bless, Roland (TM)" writes:
 
> Hi Curtis,
>  
> On 04.12.22 at 12:06 Curtis Villamizar wrote:
> > Is it my imagination or did this conversation start out about PIM-SM
> > and wander off into everything but PIM-SM?
>  
> I think that Toerless's question was about congestion control for
> routing protocols in general and multicast routing in particular.
> Thanks for pointing out that specific bottleneck on updating
> the forwarding plane (Toerless also pointed at that) and the
> work-conserving property of BGP.

Not assigning blame for thread wander.  Toerless was commenting on
PIM-SM with the open question of whether other protocols (mostly
focused on multicast, I think) had similar congestion issues.

> > Anyway ... regarding BGP and TCP (or SRM).
>  
> [... trimmed ...]
>  
> > This conversation did not start out being about delay based routing.
> > We all know that any use of a varying route metric is bad without
> > some mechanism to ensure stability (hopefully we all know ..).
>  
> I guess it was Stuart Bryant who mentioned shortest time routing,
> and therefore I briefly mentioned it here...

Yeah Stewart.  Shame on you.  [Now we are assigning blame.]  :-)

> >> I actually do not have the operational experience that others
> >> may have, but my guess is that in practice CPU congestion
> >> occurs more often than link congestion (solely caused by
> >> control plane packets). While I believe that TCP congestion control
> >> may potentially help to fix short-term congestion situations,
> >> it is not a solution for persistent link congestion – I think that
> >> such a system may not be able to function correctly.
> > 
> > The bottleneck in core routers is actually installing the routes onto
> > the forwarding cards.  There are multiple cards to talk to, with less
> > processing power in some cases.  Congestion control to the forwarding
> > control plane goes back to the BSD route socket days of the early to
> > mid 1990s, but installing on separate smart forwarding cards is now
> > more often done in application space.
> > 
> >> So there are typically dampening mechanisms in place to aggregate
> >> routing information or to wait before announcing certain
> >> route updates.
> >> When using TCP, the CPU congestion problem would cause
> >> flow control to kick in and automatically throttle
> >> the sender to the receiver's processing speed.
> >> However, if the generation rate of routing messages is
> >> permanently too high, the system will not be stable.
> > 
> > Last sentence is not true.  The work-conserving aspect of BGP
> > implementations (any that have survived and are used) means you have
> > one most recent route for any destination and are throwing away old
> > routes before ever installing them.  Routes get installed in order,
> > least recently received first, and a route that changes before
> > getting installed goes to the back of the queue.  Similar work
> > conserving can be done at Adj-In and is useful if BGP Adj-In is
> > delegated to forwarding cards with reliability built into TCP state.
> > Multiple routes arrive with each TCP packet, particularly with large
> > MTU, so acks get delayed a bit to allow for failure of the forwarding
> > card, a bit like database 2-phase commit.  The routing table install
> > queue is finite and each BGP Adj-In queue is finite.  Both are
> > bounded by the number of prefixes in global routing, so to protect
> > from getting flooded with bad prefixes it helps to have some sanity
> > check to avoid disaggregation or bogus prefixes.  Any bottleneck in
> > route table install, IGP SPF, or forwarding install does not slow
> > down BGP Adj-In processing or TCP transfer of information.
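
The work-conserving install queue described in the quoted paragraph
can be sketched roughly as follows.  This is only an illustration
under stated assumptions, not code from any BGP implementation: the
class name InstallQueue, its methods, and the plain dict standing in
for the forwarding table are all hypothetical, and withdrawals, path
selection, and the Adj-In side are left out.

```python
from collections import OrderedDict


class InstallQueue:
    """Hypothetical work-conserving queue: one pending route per prefix."""

    def __init__(self):
        # Maps prefix -> most recent next hop.  Insertion order doubles
        # as install order (oldest pending entry first).
        self._pending = OrderedDict()

    def update(self, prefix, next_hop):
        # A route that changes before it is installed replaces the stale
        # pending entry and moves to the back of the queue, so the stale
        # route is thrown away without ever being installed.
        self._pending.pop(prefix, None)
        self._pending[prefix] = next_hop

    def install_next(self, fib):
        # Install the least recently received pending route, if any.
        if not self._pending:
            return None
        prefix, next_hop = self._pending.popitem(last=False)
        fib[prefix] = next_hop
        return prefix

    def __len__(self):
        return len(self._pending)


if __name__ == "__main__":
    queue = InstallQueue()
    fib = {}  # stand-in for the forwarding table (an assumption)

    # Updates arrive much faster than they can be installed.
    for i in range(1000):
        queue.update("10.0.0.0/8", f"next-hop-{i}")
        queue.update("192.0.2.0/24", f"next-hop-{i}")

    print(len(queue))  # pending entries, bounded by the number of prefixes
    while queue.install_next(fib) is not None:
        pass
    print(fib)  # only the most recent route per prefix was installed
```

In this sketch, 2000 updates for two prefixes never leave more than
two entries pending, which is the sense in which the queue is bounded
by the number of prefixes rather than by the update rate.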
>  
> The precondition for my statement was CPU congestion
> (in your case: BGP Adj-In processing is too slow).
> If I understand correctly you say that there is a decoupling between
> routing message processing and FIB entry installation and
> your assumption is that neither BGP Adj-In processing
> nor TCP processing are a bottleneck. However, I didn't
> fully understand the part where "BGP Adj-In is delegated to
> forwarding cards with reliability built into TCP state"
> and "acks get delayed a bit to allow for failure of the forwarding
> card". This sounds like an approach with tight coupling
> between FIB entry installation and TCP processing.
> So in case of failure (what kind of failure? malfunction of the
> whole card or just failure of the installation of entries?)
> the ACK is not sent?
> Then the sender retransmits and the installation is tried again?

The Adj-In on forwarding cards was done by Avici after they did their
"Non-Stop Routing", which was trademarked and patented (problems with
the patent imnsho).  The trademark was withdrawn.  NSR was failover if
a route processor card failed, and it required, among other things, a
kernel hack to back up the TCP state on any secondary route processor.
When Adj-In was moved to forwarding cards the same was done there, so
that if either the link failed and packets arrived at another card or
the card itself failed, another card could take over and the other end
would never know the difference.  [Quite off topic.]

I'm not sure how much of the NSR capability spilled over to other
router vendors in the following two decades.  NSR is definitely off
topic, though an interesting one.

> Regards,
>   Roland
>  
> > Regarding "it helps to remember" from the slides from Ross and John, it
> > helps to remember BGP's work-conserving nature, which was discussed
> > extensively in the mid 1990s but still gets overlooked now and then.
> > 
> > Now back to PIM-SM ... maybe.

Or maybe not.

Curtis