Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0

On Wed, Dec 16, 2020 at 04:21:53PM +0000, Jakob Heitz (jheitz) wrote:
> No. It's not closed with a NOTIFICATION. The send queue is frozen. No
> data, not even a NOTIFICATION is going to get to rtr-A. The only thing
> that will get there is a TCP RST and/or a new TCP SYN.

While true, the sending system doing the reset will treat it as if it
had sent a NOTIFICATION and will therefore not do GR. Once the new session
is established, rtr-A will flush its table because rtr-B will not set the
F flags. Considering that rtr-A is probably not handling any route updates,
it is more important that rtr-B routes traffic away from rtr-A, and this
happens.
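
As a rough illustration of that F flag check (a minimal Python sketch, not
code from any bgpd; the names GracefulRestartCap and keep_stale_routes are
made up here), assuming the RFC 4724 rule that stale routes for an address
family are kept only if the new OPEN carries the GR capability with the
F bit set for that family:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

AfiSafi = Tuple[int, int]  # e.g. (1, 1) for IPv4 unicast


@dataclass
class GracefulRestartCap:
    """Hypothetical, simplified view of a GR capability from an OPEN."""
    restart_state: bool = False  # the "R" flag
    # Per address family "Forwarding State" (F) flag.
    forwarding_state: Dict[AfiSafi, bool] = field(default_factory=dict)


def keep_stale_routes(cap: Optional[GracefulRestartCap],
                      family: AfiSafi) -> bool:
    """Return True if stale routes for `family` may be retained.

    Assumption: no GR capability, family not listed, or F flag clear
    all mean "flush the stale routes for this family now".
    """
    if cap is None:
        return False
    return cap.forwarding_state.get(family, False)
```

With neither the R nor the F flag set in the new OPEN, keep_stale_routes()
returns False for every family, which is the "withdraw everything" outcome
described above.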

> Regards,
> Jakob.
> 
> 
> > On Dec 16, 2020, at 1:36 AM, Claudio Jeker <cjeker@diehard.n-r-g.com> wrote:
> > 
> > On Tue, Dec 15, 2020 at 11:39:52PM +0000, Jakob Heitz (jheitz) wrote:
> >> If you tell the socket to shutdown and then close, it will attempt to
> >> send everything in the queue with the FIN at the end.
> >> Then wait for the FIN ACK and all manner of nonsense to bore rtr-B to tears.
> >> So, to get on with it, send the RST.
> >> 
> >> Next question is what to do if GR is in effect.
> >> rtr-A will dutifully retain all the routes from rtr-B and Job's beloved WITHDRAW
> >> will still not happen.
> >> The new session will come up (maybe), rtr-B will send all its routes again and
> >> (if it doesn't get stuck again) will send its EOR. Only now can Job breathe easy.
> >> 
> >> Might we need a new bit in the GR capability in the OPEN message?
> >> "WITHDRAW ALL MY ROUTES NOW"
> > 
> > GR should not be an issue since the connection is closed with a
> > NOTIFICATION. At least the system detecting the stuck session will flush
> > and WITHDRAW all routes. In the next OPEN message this system will neither
> > set the R flag nor the F flag and so the stuck system will WITHDRAW
> > all routes as well.
> > 
> > The per AF "Forwarding State" bit already acts as a withdraw all my routes
> > now indicator.
> > 
> > Cheers,
> > -- 
> > :wq Claudio
> > 
> >> Regards,
> >> Jakob.
> >> 
> >> From: Idr <idr-bounces@ietf.org> On Behalf Of Robert Raszuk
> >> Sent: Tuesday, December 15, 2020 3:19 PM
> >> To: Job Snijders <job@sobornost.net>
> >> Cc: idr@ietf.org
> >> Subject: Re: [Idr] TCP & BGP: Some don't send terminate BGP when holdtimer expired, because TCP recv window is 0
> >> 
> >> Hi Job,
> >> 
> >> Putting all other concerns aside, I have a few questions ...
> >> 
> >> 1. Is it BGP that should trigger the session RST or FIN, or is it TCP?
> >> 
> >> 2. If it is BGP (TCP would not be aware of HOLD_SEND), how exactly do we know that the peer's window has been 0 for the HOLD_SEND TIME?
> >> 
> >> 3. Which TCP socket option will return an error to BGP saying that the window for a given peer was 0 for the duration of X sec? I presume that even if it jumped above 0 for 100 ms, the timer would be reset, indicating the peer is still alive?
> >> 
> >> From your bgpd example, you are not checking anything other than BGP's ability to write to the out queue. So is the suggestion now to forget all about the TCP layer? Simply: if I cannot write anything to a peer for over X sec, RST the session?
> >> 
> >> Hi John,
> >> 
> >> I think the suggestion is to add a second HOLD_SEND TIME different from normal HOLD TIME.
> >> 
> >> Also, there could be lots of different types of peers, so unless HOLD_SEND were, say, 5 x HOLD, putting all peers under the same time value may be suboptimal.
> >> 
> >> Thx,
> >> R.
> >> 
> >> 
> >> On Tue, Dec 15, 2020 at 10:54 PM Job Snijders <job@sobornost.net> wrote:
> >>> On Tue, Dec 15, 2020 at 09:57:47PM +0100, Christoph Loibl wrote:
> >>> Thanks for answering my question in more detail. Maybe I was unclear
> >>> (but reading your email I think we are talking about the same).
> >>>> On 15.12.2020, at 21:00, John Scudder <jgs@juniper.net> wrote:
> >>>> 
> >>>> I think you are talking about this scenario. I’ll copy the example
> >>>> from Rob’s message cited above:
> >>>> 
> >>>>  rtr-A                   rtr-B
> >>>>  (congested c-p)         (uncongested c-p)
> >>>>  send window: >0         send window: 0
> >>>>  recv window: 0          recv window: >0
> >>>> 
> >>>> In this case we expect:
> >>>> a) rtr-B does not send any BGP packet (KEEPALIVE/UPDATE/NOTIFICATION)
> >>>> to rtr-A in normal operating circumstances.
> >>>> b) rtr-A does not expect any KEEPALIVE/UPDATE packets from rtr-B. The
> >>>> session remains established even if no packet is received in the
> >>>> holdtime.
> >>>> c) rtr-A continues to send KEEPALIVE packets to rtr-B.
> >>> 
> >>> The part I have trouble understanding is b). It is clear that rtr-A
> >>> will not receive any packets from rtr-B because rtr-B cannot send them
> >>> (send window: 0). But does "rtr-A does not expect any KEEPALIVE/UPDATE
> >>> packets from rtr-B" mean that rtr-A has essentially suspended its
> >>> hold-timer until it is ready to receive new messages and opens up its
> >>> recv window? If yes, why? I would expect timers to run independently
> >>> of the transport protocol.
> >> 
> >> Yeah, I'd expect that too. We've seen congested BGP implementations
> >> continue to send KEEPALIVEs but not accept (or send!) other BGP
> >> messages. And rtr-B's attempts at KEEPALIVE just get TCP ACKed with
> >> zero window.
> >> 
> >> I'd argue that in the above scenario rtr-A is simply broken and rtr-B
> >> MUST proceed to close down the session towards rtr-A; rtr-B must clean
> >> up and generate WITHDRAWs for any routes pointing to rtr-A. By doing the
> >> clean-up rtr-B does both itself and rtr-A a favor. If the issue was
> >> transient, rtr-A and rtr-B will re-establish a few minutes later
> >> (IdleHoldTimer, right?) and things will normalize.
> >> 
> >> Arguably and measurably, rtr-A is operating its Loc-RIB (forwarding)
> >> based on stale routing information (assuming rtr-A is working at all!):
> >> rtr-A has not received any WITHDRAWs, UPDATEs (or somewhat less
> >> importantly KEEPALIVEs) from rtr-B.
> >> 
> >> Rtr-B is fully aware of this stale situation, because rtr-B was not able
> >> to write these BGP messages to the network: the messages are still in
> >> OutQ. Rtr-A didn't accept any KEEPALIVE (or UPDATE/WITHDRAW) from
> >> rtr-B.
> >> 
> >> How to solve this? Claudio Jeker took a look at what it would take in
> >> OpenBGPD and came up with the (tiny!) following patch, should be
> >> readable to most: https://marc.info/?l=openbsd-tech&m=160796802508185&w=2
> >> 
> >> Ben Cox helped me create a 'EBGP peer from hell': a publicly accessible
> >> EBGP multihop instance which can reliably produce the undesirable
> >> TCP/BGP behavior we're discussing here. This 'peer from hell' will do
> >> the OPEN exchange but then manipulates the TCP recvwindow towards zero.
> >> 
> >> All BGP implementations tested so far (5 famous ones) appear vulnerable
> >> because they continue to consider the BGP session healthy & stable
> >> (meanwhile OutQ keeps growing endlessly and zero BGP messages go across
> >> the wire).
> >> 
> >> One network operator (with thousands of EBGP sessions in the DFZ)
> >> reported to me the above stalled-TCP scenario is *not* a common case on
> >> the Internet. On a normal day, a network operator will see no (zero)
> >> sessions stuck this way, which leads me to believe 'recvwind=0' ...
> >> *for the duration of the hold timer* is a very strong indicator of a
> >> really broken situation which implementations should attempt to
> >> resolve automatically.
> >> 
> >> I believe BGP implementations are not helping any known deployment
> >> scenarios by *not* disconnecting a stuck peer. On the other hand, we
> >> now know about various operational examples where honoring recvwind=0
> >> for (hours, days) longer than $holdtimer led to global scale problems.
> >> 
> >> As the 'not-at-all progressing OutQ' situation seems somewhat rare in
> >> the wild (yet continues to happen from time to time) I think it is worth
> >> discussing & documenting how implementers can attempt to avoid this
> >> state from happening. It might help make the Internet 1% more robust.
> >> 
> >> BGP implementers (or operators wanting to test their equipment) feel
> >> free to contact me off-list if you'd like to set up an EBGP multihop
> >> session towards the 'peer from hell' testbed. Testing potential
> >> solutions this way is quite easy, the behavior can be triggered within a
> >> few seconds.
> >> 
> >> Kind regards,
> >> 
> >> Job
> >> 
> >> ps. At this moment we have (1) an attempt at problem description, (2) a
> >> demonstration BGP-4 implementation of a 'problem causer', and (3) a
> >> different BGP-4 implementation with a 'solution'. This enables IDR to
> >> test interoperability & (potentially revised) protocol compliance,
> >> hopefully moving the problem a bit from theoretical to practical
> >> reality? :)
> > 
> >> _______________________________________________
> >> Idr mailing list
> >> Idr@ietf.org
> >> https://www.ietf.org/mailman/listinfo/idr
> > 
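
For the stuck OutQ case discussed in the quoted thread, the idea of a
send-side hold timer can be sketched roughly as follows. This is a minimal
Python sketch under assumptions, not the OpenBGPD patch linked above:
SendHoldTimer, on_write and abort_connection are hypothetical names, and
the rule "any written byte, or an empty queue, counts as progress" is my
assumption. The abort helper uses the common SO_LINGER-with-zero-timeout
idiom, which on typical TCP stacks makes close() send a RST instead of
a FIN.

```python
import socket
import struct
import time


class SendHoldTimer:
    """Tracks how long the output queue has made no progress."""

    def __init__(self, hold_time, now=time.monotonic):
        self.hold_time = hold_time      # seconds, e.g. the BGP hold time
        self._now = now                 # injectable clock for testing
        self._last_progress = now()

    def on_write(self, bytes_written, queued_bytes):
        """Call after each write attempt on the session socket."""
        # Assumption: writing anything, or having nothing left to
        # write, means the peer is not stuck.
        if bytes_written > 0 or queued_bytes == 0:
            self._last_progress = self._now()

    def expired(self):
        """True once nothing could be written for hold_time seconds."""
        return self._now() - self._last_progress >= self.hold_time


def abort_connection(sock):
    """Close with linger on and a zero timeout (abortive close).

    Assumes a POSIX-style struct linger made of two ints. Queued data
    is discarded instead of being flushed ahead of a FIN.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()
```

A session loop would call on_write() after every write attempt and, once
expired() returns True, tear down the session locally and call
abort_connection() rather than wait for the frozen queue to drain.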

-- 
:wq Claudio