Re: [Idr] draft-spaghetti-idr-bgp-sendholdtimer

One thing I am also worried about with this proposal is that the data
plane may be working just fine (imagine a stub ASN that advertised a
prefix and received a default) while the peer signals a zero window for
any of the reasons Jeff nicely enumerated.

So what we are discussing is breaking the data plane just because the
control plane has been unable to send keepalives for 15 min (or worse, the
recommended 4 min).

Even for dual-connected stub sites this usually means a switchover to a
different exit interface and a different NAT, breaking all end users'
connections.

Of course none of this applies when you are a transit router with many
peers.

And I am not even touching on architectures that separate the control
plane from the data plane here.

So two questions:

* Should we perhaps test the data plane before declaring the peer failed
and before we reset the session? (I understand that the paramount
motivation here is BGP consistency, but this is one of those cases where
one size may not fit all.)

* Should we first withdraw the received routes from our peers before
resetting the session? At least the data plane will then have a chance to
converge to a different set of links with no sudden packet drops.
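
To make the two questions concrete, here is a rough sketch of the decision
I have in mind when the send hold timer fires. Everything in it is an
assumption for illustration: the function name, the action labels and the
idea of a "consistency_first" policy knob are mine, not the draft's, and I
read "withdraw" as withdrawing the routes learned over the stalled session.

```python
from typing import List


def actions_on_send_hold_expiry(data_plane_ok: bool,
                                consistency_first: bool) -> List[str]:
    """Return the ordered actions to take when the send hold timer expires.

    data_plane_ok:     result of some forwarding check towards the peer
                       (hypothetical input, e.g. a liveness probe).
    consistency_first: hypothetical policy knob; True means BGP consistency
                       wins and the session is reset regardless.
    """
    if data_plane_ok and not consistency_first:
        # Question 1: forwarding still works, so do not break it.
        return ["raise_alarm", "keep_session"]
    # Question 2: move traffic away first, then tear the session down.
    return [
        "withdraw_routes_learned_from_peer",
        "wait_for_convergence",
        "reset_session",
    ]
```

A stub site would presumably run with consistency_first=False, a transit
router with many peers with consistency_first=True.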

Thx,
R.

On Fri, Apr 23, 2021 at 11:01 PM Jeffrey Haas <jhaas@pfrc.org> wrote:

> On Tue, Apr 20, 2021 at 02:47:35PM +0100, Ben Cox wrote:
> > We submitted our first draft today (
> > https://datatracker.ietf.org/doc/draft-spaghetti-idr-bgp-sendholdtimer/
> > ) and we are looking for feedback knowing that it is not complete but
> > is likely in a state for some discussion.
> >
> > Since this draft tweaks the BGP FSM, we would like to make sure that
> > it's done correctly and so we are soliciting help from this working
> > group.
>
> As someone who spent a lot of time working with the editors of what became
> RFC 4271, you're going to find there's a lot of corner cases in the FSM to
> deal with.  :-)  We can work on the text once it's clearer what we should
> do.
>
> A number of stream-of-thought comments about the topic covered by the
> draft:
>
> I don't object quite as much as Jakob does about the issues of claiming
> implementations that are exhibiting a stuck session as contributing to
> stale routes and blackholes in the Internet.  A simple truth of BGP as a
> hop-by-hop protocol is that if you can't distribute your updates, the
> downstreams are operating on an incorrect version of reality.
>
> That said, the problem statement isn't really crisp enough that I'd want
> to make such a large claim.
>
> TCP applications will exhibit zero-windowing naturally over the course of a
> session's lifetime.  This typically will occur when there's some level of
> backpressure on the application due to I/O issues or even CPU.  While these
> situations aren't desirable since they signal congestion in the
> application, congestion SHALL happen.[1]
>
> Most of the headaches I'd discuss for the problem this draft attempts to
> solve revolve around the consequences of this statement in section 2 of the
> draft:
>
> :   Generally BGP implementation have no visibility into lower-layer
> :   subsystems such as TCP or the peer's current Receive Window.
> :   Therefor this document banks on BGP implementations being able to
> :   detect an inability to push more data to the remote peer, at which
> :   point the SendHoldTimer starts.
>
> It is quite true that it is difficult to get a sense of what's going on
> when you have simple socket (for example) backpressure.  What I believe an
> application really wants to know for some sort of send-hold-timer is the
> following:
>
> - A session's TCP window has reached zero availability at a given sequence
>   number.
> - It stays at that sequence number with zero availability for "too long".
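
The two conditions above can be written down as a small tracker. This is
only a sketch under assumptions: the class and method names are made up,
and the (sequence number, window) samples would have to come from some
platform-specific source that a plain POSIX socket does not offer.

```python
from typing import Optional


class ZeroWindowTracker:
    """Flags a peer whose window stays at zero at one sequence number."""

    def __init__(self, limit_seconds: float) -> None:
        self.limit = limit_seconds            # what counts as "too long"
        self.stuck_seq: Optional[int] = None  # sequence number of the stall
        self.stuck_since: Optional[float] = None

    def observe(self, now: float, acked_seq: int, window: int) -> bool:
        """Feed one sample; return True once the stall lasted too long."""
        if window > 0:
            # The window opened again, so there is no stall to time.
            self.stuck_seq = None
            self.stuck_since = None
            return False
        if acked_seq != self.stuck_seq:
            # Zero window at a new sequence number: restart the clock.
            self.stuck_seq = acked_seq
            self.stuck_since = now
            return False
        # Still zero at the same sequence number: compare against the limit.
        return now - self.stuck_since >= self.limit
```

Note that a session which drains a few bytes and then closes the window
again keeps restarting the clock here, which may or may not be what one
wants.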
>
> Typical POSIX socket APIs won't give you this kind of certitude.  The
> condition, as written in the draft, effectively manifests as a write() to
> the socket that fails.  But how does it fail?
> - Is it because there is no space in the socket to accept data?
> - Is it because there are no local resources (e.g. mbufs) to push the next
>   bit of data?
> - Is it because a kernel timer hasn't triggered a bit of TCP behavior to
>   try to move some data?
> - Is it because we had blocked and we're waiting for select/kevent to fire
>   and let us know that there's space - and it simply hasn't bothered even
>   though there's space in the window?
> - Is it because a low-watermark feature is preventing the socket from being
>   ready even though the window has advanced?
> - Is it because the session is draining VERY.... SLOWLY... and the kernel
>   won't wake us up even though the window is no longer zero - just not big
>   enough? (A version of the prior issue.)
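
For illustration, this is roughly all an application gets to see through a
non-blocking socket: the send either makes some progress or reports "would
block", with no hint as to which of the reasons above applies. The sketch
below hangs a send hold timer off that signal. The class is hypothetical,
and stopping the timer on any progress at all is my assumption, not
something the draft states.

```python
import socket
from typing import Optional


class SendHoldTimer:
    """Times how long a non-blocking socket has refused to take any data."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold = hold_seconds
        self.blocked_since: Optional[float] = None

    def try_send(self, sock: socket.socket, data: bytes, now: float) -> int:
        """Attempt one send; return the number of bytes accepted."""
        try:
            sent = sock.send(data)
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK: the kernel took nothing and gives no reason.
            if self.blocked_since is None:
                self.blocked_since = now  # timer starts at the first refusal
            return 0
        # Any progress, even a partial write, stops the timer (assumption).
        self.blocked_since = None
        return sent

    def expired(self, now: float) -> bool:
        """True when the socket has been refusing data for the hold time."""
        return (self.blocked_since is not None
                and now - self.blocked_since >= self.hold)
```

With this shape a session that drains very slowly never expires the timer,
which is exactly the ambiguity in the last bullet.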
>
> Some things that can cause zero windowing may be worth keeping in mind to
> decide how bad they can be and what an implementation may want to do about
> it:
>
> - Perhaps it's doing a live-upgrade and needs to wedge the incoming session
>   to do so? (ISSU, e.g.)
> - It might be configured with incoming session prioritization, and your
>   socket isn't important enough by policy.  (Corollary: Not all routes are
>   created equal.)
> - It may be behind in servicing the socket. For example, too much work,
>   perhaps due to a reconfiguration event that will resolve at some point.
> - A firewall may be interfering with TCP and the receiver is actually fine.
>
> Finally, I want to highlight something rather critical:
> Just because you reset your session, in some of these pathologies the
> router you're resetting your session with may not withdraw your routes
> downstream fast enough.
>
> This is the stale situation highlighted in the draft.  The distinction is
> that sometimes thrashing makes it worse.
>
> If you're going to disconnect from such a peer, you had better have enough
> routing diversity to survive disconnecting from it.  You had also better be
> prepared to keep the session down for some time.
>
> Keeping the session down is not addressed in the draft.
>
> -- Jeff
>
> [1] When I give presentations discussing BGP, I tend to note that BGP is
> for the most part a simple protocol with a lot of optional features... and
> the hard part is making it scale.
>
