Re: [Bier] Q on the congestion awareness of routing protocols

Toerless Eckert <tte@cs.fau.de> Fri, 09 December 2022 13:39 UTC

Date: Fri, 09 Dec 2022 14:39:30 +0100
From: Toerless Eckert <tte@cs.fau.de>
To: Matt Mathis <mattmathis@google.com>
Cc: Jon Crowcroft <jon.crowcroft@cl.cam.ac.uk>, routing-discussion@ietf.org, tsv-area@ietf.org, pim@ietf.org, bier@ietf.org
Subject: Re: [Bier] Q on the congestion awareness of routing protocols

Thanks, Mathis

This is actually a good point, and I also had to live through terrible abuse
of BGP/TCP to determine link liveness. Besides the details you mention,
BFD is of course a very popular add-on to solve the link-liveness problem.

Is TCP with all these add-ons the best possible signaling protocol?
Certainly not. But it is in such common use in networks because of
BGP that it would be silly not to go at least up to this level for
another protocol (PIM) that has pretty much the same congestion/burst
signaling issues.

Cheers
    Toerless

On Mon, Dec 05, 2022 at 05:59:27PM -0800, Matt Mathis wrote:
> One of the early BGP insights was that TCP liveness is not a good metric of
> link liveness.  I had already moved on to TCP performance (so I don't
> recall many of the details in the routing space), but at some point BGP
> became very fragile, with a large number of seemingly random route flaps.
> It was eventually traced to the fact that state-of-the-art TCP in routers
> did not include SACK, and nobody noticed that bidirectional data breaks
> Reno fast retransmit, because data on the other side of the connection
> invalidates the dup ACK status needed to trigger fast retransmits.
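
To make the mechanism concrete: per RFC 5681, an incoming segment only counts
as a "duplicate ACK" if, among other things, it carries no data, so a peer
that is sending its own data never feeds the duplicate-ACK counter. A minimal
sketch of that test (illustrative Python, not any particular stack's code):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        ack: int          # acknowledgment number
        payload_len: int  # bytes of data carried by this segment
        syn: bool = False
        fin: bool = False
        window: int = 65535

    def is_duplicate_ack(seg, snd_una, bytes_in_flight, last_window):
        # RFC 5681 conditions (a)-(e) for counting an incoming ACK as a duplicate
        return (bytes_in_flight > 0                # (a) sender has data outstanding
                and seg.payload_len == 0           # (b) the ACK carries no data
                and not seg.syn and not seg.fin    # (c) SYN and FIN bits are off
                and seg.ack == snd_una             # (d) it acks nothing new
                and seg.window == last_window)     # (e) advertised window unchanged

    # A data-bearing segment that acks the same old snd_una is NOT counted, so
    # with bidirectional data the dupack counter never reaches the 3 duplicates
    # Reno needs to trigger fast retransmit.
    seg = Segment(ack=1000, payload_len=1200)
    print(is_duplicate_ack(seg, snd_una=1000, bytes_in_flight=5000, last_window=65535))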
> 
> By this time I had completely moved away from routing, but I believe that
> several fixes were deployed:
> - QoS support for "High priority signalling" - I suspect that this is
> well supported on single hops, even if it is not supported in the wide
> area at all.
> - Deploy a different protocol with calibrated outage behavior to test for
> link liveness
> - TCP closes do not flush routes if the link still tests live - BGP will
> run "open loop" for long enough to restart TCP.
> 
> I was present at IETF 14, where people hotly debated the protocol layer
> inversion of using TCP to implement a routing protocol. I came into the
> room as a sceptic, and left to become a co-author of RFC 1164, the first
> BGP Applicability statement.
> 
> Thanks,
> --MM--
> The best way to predict the future is to create it.  - Alan Kay
> 
> We must not tolerate intolerance;
>        however our response must be carefully measured:
>             too strong would be hypocritical and risks spiraling out of
> control;
>             too weak risks being mistaken for tacit approval.
> 
> 
> On Mon, Dec 5, 2022 at 1:27 PM Toerless Eckert <tte@cs.fau.de> wrote:
> 
> > Matt
> >
> > I guess I should have made the subject even longer to avoid confusion with
> > the topic you are describing. My concern is solely about the messaging of
> > the protocol itself not being resilient against congestion because, for bad,
> > historic reasons, it just uses non-congestion-proof datagram messages. And
> > IMHO that is completely non-compliant with what I think should be IETF TSV
> > congestion awareness requirements. Except that I am not sure if we actually
> > formalized these requirements for non-UDP-based protocols, such as, in the
> > case I am worried about, PIM.
> >
> > Cheers
> >     Toerless
> >
> > P.S. Now, having said this, I suffer from ADD as much as the next guy, so
> > I fully appreciate the problem you mention, especially because I think it
> > could come up in the context of dynamic re-routing based on power/energy
> > factors: it is easy to imagine feedback loops similar to the old ones from
> > 30 years ago that you mention. Only that this time it is not instability
> > due to the routing impacting the load of links with respect to congestion
> > of the links, but because the traffic paths created by routing would impact
> > the power factors of the path - more load, more energy, exhaustion
> > ("congestion equivalent") of the "green" energy for the path... More
> > convoluted, same fundamental issue. I think I mentioned this at the TVR BoF
> > @ IETF 116.
> >
> > On Fri, Dec 02, 2022 at 10:20:53AM -0800, Matt Mathis wrote:
> > > There were some very early failure examples with David Mills's HELLO
> > > protocol (the precursor to NTP). Basically, load-sensitive routing is
> > > prone to route sloshing, where too many routes change once congestion is
> > > detected, causing pervasive route flapping.
> > >
> > > AFAICR, people concluded that it was not feasible for a distributed
> > > algorithm to solve this problem, so HELLO was abandoned (except for
> > > extracting NTP as a stand-alone protocol).
> > >
> > > That work was more than 30 years ago, so it might pay to revisit it and
> > > see if something has changed. However, the conclusion matches my
> > > intuition.
> > >
> > > Thanks,
> > > --MM--
> > > The best way to predict the future is to create it.  - Alan Kay
> > >
> > > We must not tolerate intolerance;
> > >        however our response must be carefully measured:
> > >             too strong would be hypocritical and risks spiraling out of
> > > control;
> > >             too weak risks being mistaken for tacit approval.
> > >
> > >
> > > On Fri, Dec 2, 2022 at 9:57 AM Jon Crowcroft <jon.crowcroft@cl.cam.ac.uk>
> > > wrote:
> > >
> > > > Gonna say, ironically, one early use of multicast was a proposal to use
> > > > SRM instead of a mesh of TCP connections for iBGP... so some people do
> > > > think about scaling control-plane traffic in the presence of congestion,
> > > > sometimes :-)
> > > >
> > > > On Fri, 2 Dec 2022, 17:03 Toerless Eckert, <tte@cs.fau.de> wrote:
> > > >
> > > >> Dear routing-discussion / TSV folks
> > > >> (sorry for escalating this, but it really bugs me - Cc'ing PIM/BIER)
> > > >>
> > > >> What are, these days, the expectations for, let's say, a full Internet
> > > >> Standard routing protocol in terms of congestion-safe behavior? And what
> > > >> are the congestion control expectations for new routing protocol RFCs,
> > > >> even if just Proposed Standard?
> > > >>
> > > >> I am asking because I think that our core IP multicast routing protocol
> > > >> fails miserably on this end, and quite frankly I do not understand how
> > > >> PIM-SM (RFC7761) could have become a full Internet Standard given that it
> > > >> has zilch discussion about congestion or loss handling.
> > > >>
> > > >> [ Especially when, in comparison, a protocol like RFC7450, where TSV did
> > > >>   raise concerns about multicast data-plane congestion awareness, was held
> > > >>   up for years, and GregS, as the chair of the WG responsible for RFC7450,
> > > >>   even had to help co-author RFC8085 to cut through the congestion control
> > > >>   concern-cord. But likely all for the better! ]
> > > >>
> > > >> To quickly summarize the issue with PIM-SM for those who do not know it:
> > > >>
> > > >>                  /- R2 -------- R6 -\
> > > >>      Rcvrs ... R1                    R7 ... Senders
> > > >>                  \- R3 -- R4 -- R5 -/
> > > >>
> > > >>         CE ... PE .. P    P     P    PE  CE ...
> > > >>
> > > >> R1 has, let's say, 100,000 multicast/PIM (S,G) states with sources behind
> > > >> R7, so it has to maintain 100,000 so-called PIM (S,G) joins across the path
> > > >> R2, R6, R7. Let's say an IPv6 (S,G) join is roughly 38 bytes, so maybe 35
> > > >> (S,G) fit per 1500-byte packet, i.e. about 2857 packets of 1500 bytes to
> > > >> carry all 100,000 (S,G).
> > > >>
> > > >> Assume link R6/R7 fails, the IGP reconverges, and R1 recognizes that it
> > > >> needs to change path, so it sends 2857 PIM-SM packets with prunes to R2
> > > >> and 2857 PIM-SM packets with joins to R3.
> > > >>
> > > >> Assume R1 is a PE, R2 and R3 are P routers in an SP, and R2/R3 actually
> > > >> connect to, let's say, 100 routers like R1. Now R2 and R3 each get
> > > >> 100 x 2857 1500-byte packets.
> > > >>
> > > >> And there is nothing in the PIM-SM spec that talks about how to throttle
> > > >> this heap of PIM-SM packets. Typically, routers would just send them
> > > >> back-to-back. And those packets repeat every 60 seconds, given how PIM-SM
> > > >> is datagram-based / periodic soft-state. In fact, if you try to scale this
> > > >> in production networks, you will most likely fail a lot more than just IP
> > > >> multicast in those routers, because PIM will not only compete badly for
> > > >> control-plane CPU time, but even more so for control-plane to
> > > >> hardware-forwarding time when updating the 100,000 (S,G) hardware
> > > >> forwarding entries.
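
As a rough sanity check of those numbers, a minimal sketch (Python), using
only the assumptions above: ~35 IPv6 (S,G) entries per 1500-byte join/prune
packet, 100,000 states per receiver PE, 100 such PEs per P router, and the
60-second refresh:

    SG_STATES     = 100_000  # (S,G) joins each receiver PE (R1) maintains upstream
    JOINS_PER_PKT = 35       # ~35 IPv6 (S,G) entries per 1500-byte Join/Prune packet
    PKT_BYTES     = 1500
    PES_PER_P     = 100      # receiver PEs whose joins converge on one P router (R2/R3)
    REFRESH_SEC   = 60       # PIM-SM periodic soft-state refresh interval

    pkts_per_pe = -(-SG_STATES // JOINS_PER_PKT)  # ceiling division -> 2858 packets
    burst_pkts  = PES_PER_P * pkts_per_pe         # burst hitting R2/R3 on a path change
    burst_bytes = burst_pkts * PKT_BYTES          # ~429 MB, typically sent back-to-back
    refresh_bps = burst_bytes * 8 / REFRESH_SEC   # and it all repeats every 60 seconds

    print(f"{pkts_per_pe} packets per PE, {burst_pkts} packets "
          f"({burst_bytes / 1e6:.0f} MB) per P router, "
          f"~{refresh_bps / 1e6:.0f} Mbit/s of steady-state PIM refresh load")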
> > > >>
> > > >> Correct me if I am wrong, but did the same type of issue in ISIS/OSPF in
> > > >> DCs, because of so many parallel paths and hence duplication of LSAs,
> > > >> recently lead to the creation of multiple IETF working groups in RTG to
> > > >> solve these issues?
> > > >>
> > > >> In IP multicast, we were well aware of these issues, and they were a core
> > > >> reason not to build a PIM-based MPLS multicast protocol, but to use the
> > > >> TCP-based LDP to specify mLDP (RFC6388). Same thing when various BGP
> > > >> multicast work was done as an alternative to PIM for SPs (BGP also being
> > > >> TCP-based).
> > > >>
> > > >> We did even fix this problem in PIM by specifying RFC6559 (PIM over TCP),
> > > >> but instead of making that mechanism mandatory and the only option for PIM
> > > >> when moving PIM up the IETF standards ladder to RFC7761, that RFC has
> > > >> seemingly fallen into oblivion in the IP multicast community, because most
> > > >> IP multicast deployments are small enough that these issues do not occur.
> > > >>
> > > >> So, why do I escalate this issue now?
> > > >>
> > > >> We have a great new multicast architecture called BIER that eliminates
> > > >> all these PIM multicast state issues from the P routers of such large
> > > >> service provider networks by being stateless. But it still leaves the
> > > >> need for overlay signaling, such as with PIM operating between the PEs -
> > > >> in the above picture, the hundreds if not thousands of receiver PEs R1'
> > > >> and sender PEs R7'. In that case you would have PIM directly between
> > > >> those R1'/R7' across multihop paths, leading to even more congestion
> > > >> considerations. And in support of such BIER networks, there is a draft,
> > > >> draft-hb-pim-light, proposed to the PIM WG to optimize PIM explicitly for
> > > >> this type of deployment. And when I said at PIM@IETF115 that such a draft
> > > >> IMHO should only be allowed to proceed when it is written to say it MUST
> > > >> be based on PIM over TCP (RFC6559), all other people responding on the
> > > >> thread said at best it could be a MAY. Aka: congestion control optional.
> > > >>
> > > >> Am I a congestion control extremist? I really only want to have scalable,
> > > >> reliable multicast RFCs, especially when they aspire to and reach full
> > > >> IETF Standard and are meant to support our next-gen IP multicast
> > > >> architectures (BIER). I do fully understand that there is a lot of cost
> > > >> pressure on vendor development, and having procrastinated on implementing,
> > > >> proliferating and deploying PIM over TCP so far (almost a decade!) does
> > > >> make this a less attractive choice short term. And the whole purpose of
> > > >> the PIM light draft of course is to reduce the amount of development
> > > >> needed by making PIM more "light" (which is a good thing). But when it
> > > >> carries forward the problems of PIM to another generation of networks
> > > >> (using BIER) that was especially built to scale better, then one should
> > > >> IMHO really become worried. At least I do. But I also struggled to
> > > >> implement datagram PIM processing for 100,000 states in a prior life and
> > > >> then pushed for PIM over TCP...
> > > >>
> > > >> Thanks!
> > > >>     Toerless
> > > >>
> > > >> _______________________________________________
> > > >> routing-discussion mailing list
> > > >> routing-discussion@ietf.org
> > > >> https://www.ietf.org/mailman/listinfo/routing-discussion
> > > >>
> > > >
> >
> > --
> > ---
> > tte@cs.fau.de
> >

-- 
---
tte@cs.fau.de