Re: [pim] Q on the congestion awareness of routing protocols

"Bless, Roland (TM)" <roland.bless@kit.edu> Mon, 05 December 2022 10:10 UTC

Return-Path: <roland.bless@kit.edu>
X-Original-To: pim@ietfa.amsl.com
Delivered-To: pim@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id BB088C1522BB; Mon, 5 Dec 2022 02:10:28 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.897
X-Spam-Level:
X-Spam-Status: No, score=-6.897 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, NICE_REPLY_A=-0.001, RCVD_IN_DNSWL_HI=-5, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id S6-ym4kGZROV; Mon, 5 Dec 2022 02:10:25 -0800 (PST)
Received: from iramx1.ira.uni-karlsruhe.de (iramx1.ira.uni-karlsruhe.de [IPv6:2a00:1398:2::10:80]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 2A3ACC1522B8; Mon, 5 Dec 2022 02:10:22 -0800 (PST)
Received: from i72vorta.tm.uni-karlsruhe.de ([141.3.71.26] helo=i72vorta.tm.kit.edu) by iramx1.ira.uni-karlsruhe.de with esmtpsa port 25 iface 141.3.10.8 id 1p28QS-0001J7-PY; Mon, 05 Dec 2022 11:10:16 +0100
Received: from [IPV6:::1] (ip6-localhost [IPv6:::1]) by i72vorta.tm.kit.edu (Postfix) with ESMTPS id A3F1DD0016F; Mon, 5 Dec 2022 11:10:16 +0100 (CET)
Message-ID: <1fb6e5d2-a0c0-6abe-1a5c-9d1d24575177@kit.edu>
Date: Mon, 05 Dec 2022 11:10:16 +0100
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0 Thunderbird/102.4.2
Content-Language: de-DE
To: Curtis Villamizar <curtis@orleans.occnc.com>
Cc: Toerless Eckert <tte@cs.fau.de>, routing-discussion@ietf.org, tsv-area@ietf.org, pim@ietf.org, bier@ietf.org
References: <1p1sbb-005Vuo-MZ-20221204171644Z@scc-mailin-cs-01.scc.kit.edu>
From: "Bless, Roland (TM)" <roland.bless@kit.edu>
Organization: Institute of Telematics, Karlsruhe Institute of Technology (KIT)
In-Reply-To: <1p1sbb-005Vuo-MZ-20221204171644Z@scc-mailin-cs-01.scc.kit.edu>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
X-ATIS-AV: ClamAV (iramx1.ira.uni-karlsruhe.de)
X-ATIS-Checksum: v3zoCAcc32ckk
X-ATIS-Timestamp: iramx1.ira.uni-karlsruhe.de esmtpsa 1670235016.830818152
Archived-At: <https://mailarchive.ietf.org/arch/msg/pim/hmRQ6W8rf5E8raZTKgMu6RcYHCI>
Subject: Re: [pim] Q on the congestion awareness of routing protocols
X-BeenThere: pim@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Protocol Independent Multicast <pim.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/pim>, <mailto:pim-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/pim/>
List-Post: <mailto:pim@ietf.org>
List-Help: <mailto:pim-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/pim>, <mailto:pim-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 05 Dec 2022 10:10:28 -0000

Hi Curtis,

On 04.12.22 at 12:06 Curtis Villamizar wrote:
> Is it my imagination or did this conversation start out about PIM-SM
> and wander off into everything but PIM-SM?

I think that Toerless's question was about congestion control for
routing protocols in general and multicast routing in particular.
Thanks for pointing out the specific bottleneck of updating
the forwarding plane (Toerless also pointed that out) and the
work-conserving property of BGP.

> Anyway ... regarding BGP and TCP (or SRM).
> 
> In message <3cbaf92c-3dc5-01d5-570d-a5ee90f138e0@kit.edu>
> "Bless, Roland (TM)" writes:
>>
>> Hi Toerless,
>>   
>> with respect to routing in general, there is a trade-off between the
>> timeliness of the routing information and rate limiting. Let's assume
>> you use a TCP-like congestion control for routing protocols: in case
>> your sender is throttled heavily, the delay of routing information
>> propagation is probably too high and the routing information lags
>> seriously behind, thus being outdated when received.
> 
> "Getting behind" happens all the time and its fine.  The bottleneck is
> not TCP or routing protocol processing but installation of routes into
> the forwarding hardware.  All major implementations of BGP will simply
> overwrite any routes in route processor memory that have not yet been
> installed in all forwarding cards and mark them as needing to be
> updated in all forwarding cards.  This eliminates the only problem that
> can back up BGP route install, and that is persistent route flap.  The
> stable routes get installed in a timely manner though not as fast as
> when other routes are flapping.  In the worst case any given
> persistently flapping route may have inconsistent routing within an
> AS.  Using either form of MPLS (LDP or TE) will avoid route loops or
> blackholes within an AS and LDP is often used just for that (except
> that with flapping when leaving the AS packets can be routed into a
> blackhole or loop in another AS).

Yes, "getting behind" is a usual case and when it is
happening temporarily it is no problem. However, I think it will
become critical if the backlog is either too large or when it happens
persistently so that there is constant backlog

>> I think one has to distinguish where the bottleneck is:
>> 1) link bandwidth (link congestion)
>> 2) routing message processing (CPU congestion)
> 
> I'm not sure how 1 would ever happen since route flap does not
> contribute much TCP traffic on the links ISPs use (mostly in the
> noise).  BGP typically uses DSCP marking so congested links don't
> impact control traffic.  On very slow links (HF radio, very low BW
> satellite, low bandwidth consumer such as DSL) a default route or major
> route aggregation is used in rare cases where redundancy is available
> and some balancing is desired.  Dumb deployments notwithstanding.

It was also my impression that 1) is usually not the issue.
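(Regarding the DSCP marking you mention: for readers who have not seen
it in practice, this usually just means setting the CS6 code point on
the routing protocol's socket, roughly along the following lines; a
minimal illustrative sketch, not taken from any particular
implementation:)

    import socket

    # DSCP CS6 (decimal 48) is conventionally used for network control
    # traffic; shifted left by 2 bits it occupies the upper six bits of
    # the (former) TOS byte, i.e. 0xC0.
    DSCP_CS6_TOS = 48 << 2

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_CS6_TOS)

    # Routers can then map CS6-marked packets into a priority/control
    # queue, so a congested data plane does not starve the BGP session.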

> The problem is also rarely #2 in higher-end routers but rather
> installing in forwarding cards.  Either way, because BGP
> implementations are work conserving (competent implementations that
> get used, there might be others), this is not a problem.
> 
>> So in case of 1), routing messages will be dropped, which may
>> lead to retransmissions in case the routing protocol needs
>> reliable message delivery and also cause sending rate reduction
>> in case there is congestion control in place.
>> In case 2) queues build up inside the router, also causing
>> serious delay and the potential problem of obsolete
>> routing information. So the whole routing system could become
>> unstable in extreme cases.
>>   
>> A number of typical pitfalls are summarized in this presentation:
>> https://datatracker.ietf.org/meeting/102/materials/slides-102-rtgarea-those-who-do-not-learn-history-are-doomed-to-repeat-it-00
> 
> Nice nostalgic slides from Ross and John.  "Simple protocols have
> complex behaviors when assembled into large systems" (slide 25) is a
> big part of this.  I've had a hand in a few large systems
> and in BGP implementations (and MPLS).
> 
> Happy retirement Ross.  I'm retired too.
> 
>> This presentation also mentions the widely known practice
>> of prioritizing routing control messages over data plane traffic
>> so that the routing control traffic is not adversely affected
>> by congestion in the data plane. Moreover, it also mentions
>> the oscillation effects that happened with delay-based routing
>> as well as the OSPF Flooding issue in denser topologies.
> 
> This conversation did not start out being about delay-based routing.
> We all know that any use of a varying route metric is bad without
> some mechanism to ensure stability (hopefully we all know ..).

I guess it was Stewart Bryant who mentioned shortest-time routing,
and therefore I briefly mentioned it here...

>> I actually do not have the operational experience that others
>> may have, but my guess is that, in practice, CPU congestion
>> occurs more often than link congestion (solely caused by
>> control plane packets). While I believe that TCP congestion control
>> may potentially help to fix short-term congestion situations,
>> it is not a solution for persistent link congestion – I think that
>> such a system may not be able to function correctly.
> 
> The bottleneck in core routers is actually installing the routes onto
> the forwarding cards.  Multiple cards to talk to with less processing
> power in some cases.  Congestion control to the forwarding control
> plane goes back to the route socket BSD days in the early to mid
> 1990s, but installing on separate smart forwarding cards is now more
> often done in application space.
> 
>> So there are typically dampening mechanisms in place to aggregate
>> routing information or to wait before announcing certain
>> route updates.
>> When using TCP, the CPU congestion problem would cause
>> flow control to kick in and automatically throttle
>> the sender to the receiver's processing speed.
>> However, if the generation rate of routing messages is
>> permanently too high, the system will not be stable.
> 
> Last sentence is not true.  The work-conserving aspect of BGP
> implementations (any that have survived and are used) means you have
> one most recent route for any destination and are throwing away old
> routes before ever installing them.  Routes get installed in order of
> the least recently received route, and a route that changes before
> getting installed goes to the back of the queue.  Similar work
> conservation can be done at Adj-In and is useful if BGP Adj-In is
> delegated to forwarding cards with reliability built into TCP state.
> Multiple routes arrive with each TCP packet, particularly with large
> MTU, so acks get delayed a bit to allow for failure of the forwarding
> card, a bit like database 2-phase commit.  The routing table install
> queue is finite and each BGP Adj-In queue is finite.  Both are bounded
> by the number of prefixes in global routing, so to protect against
> getting flooded with bad prefixes it helps to have some sanity check
> to avoid disaggregation or bogus prefixes.  Any bottleneck in route
> table install, IGP SPF, or forwarding install does not slow down BGP
> Adj-In processing or TCP transfer of information.

The precondition for my statement was CPU congestion
(in your case: BGP Adj-In processing being too slow).
If I understand correctly, you say that there is a decoupling between
routing message processing and FIB entry installation, and
your assumption is that neither BGP Adj-In processing
nor TCP processing is a bottleneck. However, I didn't
fully understand the part where "BGP Adj-In is delegated to
forwarding cards with reliability built into TCP state"
and "acks get delayed a bit to allow for failure of the forwarding
card". This sounds like an approach with tight coupling
between FIB entry installation and TCP processing.
So in case of failure (what kind of failure? malfunction of the
whole card or just failure of the installation of entries?)
the ACK is not sent?
Then the sender retransmits and the installation is tried again?
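To check whether I got the work-conserving behavior right, here is how
I picture the pending-install queue, as a rough, purely illustrative
sketch (the names are made up and not taken from any router
implementation):

    from collections import OrderedDict

    class PendingInstallQueue:
        """Work-conserving queue between route selection and FIB install.

        Only the most recent route per prefix is kept: a prefix that
        changes again before being installed is overwritten and moved to
        the back of the queue, so persistent flaps never build up a
        backlog beyond one entry per prefix."""

        def __init__(self):
            self.pending = OrderedDict()   # prefix -> latest best route

        def enqueue(self, prefix, route):
            if prefix in self.pending:
                # Drop the stale route that was never installed.
                del self.pending[prefix]
            # (Re)insert at the back, i.e. least-recently-updated last.
            self.pending[prefix] = route

        def install_next(self, fib):
            # Drain from the front: the prefix whose pending update is
            # oldest gets installed first; 'fib' is a stand-in for the
            # interface towards the forwarding cards.
            if self.pending:
                prefix, route = self.pending.popitem(last=False)
                fib.install(prefix, route)

If that picture is roughly right, the backlog towards the forwarding
cards is bounded by the number of prefixes rather than by the update
rate, which would explain why a slow FIB install does not back up BGP
Adj-In processing or the TCP session.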

Regards,
  Roland

> Regarding "it helps to remember" from the slides from Ross and John it
> helps to remember BGP work conserving nature that was discussed
> extensively in the mid 1990s but still gets overlooked now and then.
> 
> Now back to PIM-SM ... maybe.
> 
> Curtis
> 
> 
>> On 02.12.22 at 18:03 Toerless Eckert wrote:
>>> Dear routing-discussion / TSV folks
>>> (sorry for escalating this, but it really bugs me - Cc'ing PIM/BIER)
>>>
>>> What are, these days, the expectations for, let's say, a full Internet
>>> Standard for a routing protocol in terms of congestion-safe behavior? And
>>> what are the congestion control expectations for new routing protocol RFCs,
>>> even if just Proposed Standard?
>>>
>>> I am asking because I think that our core IP multicast routing protocol
>>> fails miserably in this regard, and quite frankly I do not understand how
>>> PIM-SM (RFC7761) could have become a full Internet Standard given that it
>>> has zilch discussion of congestion or loss handling.
>>>
>>> [ Especially in comparison with a protocol like RFC7450, where TSV did raise concerns
>>>     about multicast data plane congestion awareness, and it was held up for years, and
>>>     GregS, as the chair of the WG responsible for RFC7450, even had to help
>>>     co-author RFC8085 to cut through the congestion control concern-cord. But likely
>>>     all for the better! ]
>>>
>>> To quickly summarize the issue with PIM-SM for those who do not know it:
>>>
>>>                    /- R2 -------- R6 -\
>>>        Rcvrs ... R1                    R7 ... Senders
>>>                    \- R3 -- R4 -- R5 -/
>>>
>>>           CE ... PE .. P    P     P    PE  CE ...
>>>
>>> R1 has, let's say, 100,000 multicast/PIM (S,G) states with sources behind R7, so
>>> it has to maintain 100,000 so-called PIM (S,G) joins across the path R2, R6, R7.
>>> Let's say an (S,G) join for IPv6 is roughly 38 bytes, so maybe 35 (S,G)
>>> per 1500-byte packet, i.e. 2857 packets of 1500 bytes to carry all 100,000 (S,G).
>>>
>>> Assume link R6/R7 fails, IGP reconverges, R1 recognizes that it needs to
>>> change path, so it sends 2857 PIM-SM packets with prunes to R2 and 2857 PIM-SM
>>> packets with joins to R3.
>>>
>>> Assume R1 is a PE, R2 and R3 are P routers in an SP, and R2/R3 actually connect
>>> to, let's say, 100 routers like R1. Now R2 and R3 get 100 x 2857 1500-byte packets.
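(Just to make these numbers concrete, a quick back-of-the-envelope
check using the figures assumed above:)

    import math

    SG_STATES     = 100_000   # (S,G) states per receiver PE (R1)
    JOINS_PER_PKT = 35        # (S,G) entries per 1500-byte join/prune packet
    PKT_BYTES     = 1_500
    PE_COUNT      = 100       # receiver PEs attached to R2/R3

    pkts_per_pe  = math.ceil(SG_STATES / JOINS_PER_PKT)  # ~2858, the ~2857 above
    bytes_per_pe = pkts_per_pe * PKT_BYTES               # ~4.3 MB per reroute/refresh
    pkts_at_p    = pkts_per_pe * PE_COUNT                # ~286,000 packets at R2/R3
    mbps_at_p    = pkts_at_p * PKT_BYTES * 8 / 60 / 1e6  # ~57 Mbit/s if refreshed
                                                          # every 60 seconds
    print(pkts_per_pe, bytes_per_pe, pkts_at_p, mbps_at_p)

So each reroute or periodic refresh amounts to several megabytes of
back-to-back control traffic per PE, and hundreds of thousands of
join/prune packets arriving at each of the P routers R2/R3.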
>>>
>>> And there is nothing in the PIM-SM spec that talks about how to throttle this
>>> heap of PIM-SM packets. Typically, routers would just send them back-to-back.
>>> And those packets repeat every 60 seconds given how PIM-SM is datagram / periodic
>>> soft-state.  In fact, if you try to scale this in production networks, you will
>>> most likely break a lot more than just IP multicast on those routers, because PIM
>>> will not only compete badly for control-plane CPU time, but even more so for control-plane
>>> to hardware-forwarding time when updating the 100,000 (S,G) hardware forwarding entries.
>>>
>>> Correct me if I am wrong, but didn't the same type of issue in ISIS/OSPF in
>>> DCs, because of so many parallel paths and hence duplication of LSAs, recently
>>> lead to the creation of multiple IETF working groups in RTG to solve these
>>> issues?
>>>
>>> In IP multicast, we were well aware of these issues and they were a core
>>> reason not to build a PIM-based MPLS multicast protocol, but to use the TCP-based LDP
>>> to specify mLDP (RFC6388). Same thing when various BGP multicast work was
>>> done as an alternative to PIM for SPs (BGP also being TCP-based).
>>>
>>> We even fixed this problem in PIM by specifying RFC6559 (PIM over TCP),
>>> but instead of making that mechanism mandatory and the only option
>>> for PIM when moving PIM up the IETF standards ladder to RFC7761, that
>>> RFC has seemingly been forgotten by the IP multicast community,
>>> because most IP multicast deployments are small enough that these issues
>>> do not occur.
>>>
>>> So, why do I escalate this issue now?
>>>
>>> We have a great new multicast architecture called BIER that eliminates
>>> all these PIM multicast state issues from the P routers of such large
>>> service provider networks by being stateless. But it still leaves the
>>> need for overlay signaling, such as PIM operating between the
>>> PEs, e.g., in the above picture between the hundreds if not thousands
>>> of receiver PEs R1' and sender PEs R7'. In which case you would have
>>> PIM directly between those R1'/R7' across multihop paths, leading
>>> to even more congestion considerations. And in support of such BIER networks,
>>> there is a draft, draft-hb-pim-light, proposed to the PIM WG to optimize PIM explicitly
>>> for this type of deployment. And when I said at PIM@IETF115 that such a draft IMHO
>>> should only be allowed to proceed when it is written to say it MUST
>>> be based on PIM over TCP (RFC6559), all other people responding
>>> on the thread said at best it could be a MAY. Aka: congestion control optional.
>>>
>>> Am I a congestion control extremist? I really only want to have
>>> scalable, reliable multicast RFCs, especially when they aspire to and
>>> reach full Internet Standard and are meant to support our next-gen IP multicast
>>> architectures (BIER). I do fully understand that there is a lot
>>> of cost pressure on vendor development, and having procrastinated
>>> on implementing, proliferating and deploying PIM over TCP so far (almost a decade!)
>>> does make this a less attractive choice short term. And the whole purpose
>>> of the PIM Light draft of course is to reduce the amount of development needed
>>> by making PIM more "light" (which is a good thing). But when it
>>> carries forward the problems of PIM to another generation of networks
>>> (using BIER) that was specifically built to scale better, then one
>>> should IMHO really become worried. At least I do. But I also struggled to
>>> implement datagram PIM processing for 100,000 states in a prior life
>>> and then pushed for PIM over TCP...
>>>
>>> Thanks!
>>>       Toerless
>>>