Re: bgp4-17 Cease subcode

Susan Hares <skh@nexthop.com> Thu, 17 January 2002 13:52 UTC

Message-Id: <5.0.0.25.0.20020117083423.0252ef28@mail.nexthop.com>
X-Sender: skh@mail.nexthop.com
X-Mailer: QUALCOMM Windows Eudora Version 5.0
Date: Thu, 17 Jan 2002 08:51:35 -0500
To: Alex Zinin <azinin@nexsi.com>, randy Bush <randy@psg.com>
From: Susan Hares <skh@nexthop.com>
Subject: Re: bgp4-17 Cease subcode
Cc: idr@merit.edu
In-Reply-To: <114185201786.20020116111047@nexsi.com>
References: <5.0.0.25.0.20020116090028.039d2fa8@mail.nexthop.com> <20020115140711.GA23937@opentransit.net> <20020114123700.C7761@nexthop.com> <200201141750.g0EHo3634958@merlot.juniper.net> <20020115140711.GA23937@opentransit.net> <5.0.0.25.0.20020116090028.039d2fa8@mail.nexthop.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format="flowed"
X-NextHop-MailScanner: Found to be clean
Sender: owner-idr@merit.edu
Precedence: bulk

Alex and Randy:

Let's go back to first principles on the FSM.

1st)  The original text from draft-12 is below.

    This explanation introduces a "flag" that has implications
    for the full state machine.  The definition of a state
    machine that Alex proposes is one in which all inputs
    and all actions are determined by the state machine,
    in clear text.

2nd) Is the bug real?

    The text was fixing a "persistent BGP flapping"
    bug.  According to the text, there is a state
    within a state, with fairly vague descriptions.

The second question is for Randy, since he wears two hats (Operations
and temporary Routing AD): was this a real problem?  Has it gone away?

1) At least one routing vendor [Cisco] doesn't implement it,
    and I know this vendor's routers are used for BGP peering
    sessions in the network.

2) What was the operational concern this text implies?

If there is no operational issue and no operational usage,
we should return to the original FSM text and take out the
comments on "hold down."

So, Randy and all other operators - Is the problem it describes real?
Does anyone need it?   Let's answer that question first.




Sue Hares

----------------------


           If a BGP speaker detects an error, it shuts down the connection
          and changes its state to Idle. Getting out of the Idle state
          requires generation of the Start event.  If such an event is
          generated automatically, then persistent BGP errors may result
          in persistent flapping of the speaker.  To avoid such a
          condition it is recommended that Start events should not be
          generated immediately for a peer that was previously
          transitioned to Idle due to an error. For a peer that was
          previously transitioned to Idle due to an error, the time
          between consecutive generation of Start events, if such events
          are generated automatically, shall exponentially increase. The
          value of the initial timer shall be 60 seconds. The time shall
          be doubled for each consecutive retry.

          Any other event received in the Idle state is ignored.
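The backoff rule in the quoted draft text can be sketched as below. This is a minimal Python illustration, not code from any implementation: the function name start_delay and the 1-based retry numbering are assumptions, and because the quoted text gives no ceiling on the delay, the sketch applies none.

```python
# Hypothetical helper: seconds to wait before the n-th consecutive
# automatic Start event for a peer that went to Idle due to an error.
INITIAL_DELAY_S = 60  # "The value of the initial timer shall be 60 seconds."


def start_delay(retry: int) -> int:
    """Return the delay before automatic retry number `retry` (1-based).

    The delay starts at 60 s and doubles for each consecutive retry.
    No upper bound is applied, since the quoted text specifies none.
    """
    if retry < 1:
        raise ValueError("retry numbering starts at 1")
    return INITIAL_DELAY_S * 2 ** (retry - 1)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, start_delay(n))
```

Retries 1 through 5 give 60, 120, 240, 480 and 960 seconds.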



At 11:10 AM 1/16/2002 -0800, you wrote:

>Sue,
>
>  Introduction of the IdleHold does change the FSM,
>  and I thought we wanted the spec to reflect the current
>  running code as much as possible.
>
>  I agree with Russ and Ishi---the new state does not
>  seem to be necessary; instead, it could be as simple
>  as holding the session in Idle and giving a clue on
>  how to make the delay exponential. I don't think
>  there's an interoperability issue if people decide
>  to keep the session in Idle using different internal
>  mechanisms.
>
>  Regards,
>
>--
>Alex Zinin
>
>Wednesday, January 16, 2002, 6:04:36 AM, Susan Hares wrote:
>
> > Kunihiro:
>
> > We are not changing the FSM, so I would be surprised if the change
> > was anything but modest. Parts of a specification treated as "no big
> > deal" usually get interpreted differently.  Interoperable code means
> > you tie down the details.
>
> > The comment was on the clarity of the specification.
>
> > If you have a specific comment on the text of the state machine,
> > please propose your concerns as a revision to the text.
>
> > Sue
>
> > PS -- I just love last call on a draft...  It's when
> >        everyone finally reads a new section  ;-)...
>
>
> > At 05:39 PM 1/15/2002 -0800, Kunihiro Ishiguro wrote:
> >> >> > Is the exponential backoff in the FSM in current implementations?
> >> >>
> >> >> I guess we are going to find this out as part of the implementation
> >> >> report. And if it is not in (at least two) current implementations,
> >> >> we'll take it out of the text.
> >> >
> >> >I implemented the Cease subcode in zebra-0.92a but not exponential backoff.
> >> >I checked that the new subcode does not cause trouble for other BGP
> >> >stacks (checked Cisco and Juniper).
> >>
> >>First of all, sorry for talking about a specific implementation.  In
> >>Zebra implementation we've implemented Cease subcode.  The code is in
> >>CVS repository.  I'll prepare a release version for that.
> >>
> >>We have also implemented exponential backoff.  It is done without
> >>introducing a new FSM state.  It is not a big deal; we just check a
> >>flag in a few functions.
>
>
> >>Exponential backoff is a good feature.  I can't understand why it
> >>requires a change to the FSM.
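Kunihiro's description (backoff implemented by checking a flag in a few functions, with no new FSM state) and Alex's suggestion of simply holding the session in Idle can be illustrated with the Python sketch below. It is not Zebra's code: the Peer class, its field names, the direct Connect-to-Established shortcut, and the choice to reset the backoff once a session reaches Established are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

INITIAL_DELAY_S = 60  # initial hold time from the quoted draft text


@dataclass
class Peer:
    """Hypothetical per-peer record holding only what the sketch needs.

    There is no IdleHold state: a peer that failed simply stays in Idle,
    and `hold_until` acts as the flag that suppresses automatic Start.
    """
    state: str = "Idle"
    consecutive_errors: int = 0         # errors since the last good session
    hold_until: Optional[float] = None  # earliest time an automatic Start is honored

    def on_error(self, now: float) -> None:
        # An error closes the connection and returns the peer to Idle;
        # the hold time doubles with each consecutive error.
        self.state = "Idle"
        self.consecutive_errors += 1
        self.hold_until = now + INITIAL_DELAY_S * 2 ** (self.consecutive_errors - 1)

    def on_established(self) -> None:
        # Assumption: reaching Established clears the backoff. The quoted
        # text does not say when the doubling should be reset.
        # OpenSent/OpenConfirm are omitted from this sketch.
        self.state = "Established"
        self.consecutive_errors = 0
        self.hold_until = None

    def automatic_start(self, now: float) -> bool:
        """Handle an automatically generated Start event; True if acted on."""
        if self.state != "Idle":
            return False  # Start is only acted on in Idle in this sketch
        if self.hold_until is not None and now < self.hold_until:
            return False  # held in Idle: a flag check, not a new FSM state
        self.state = "Connect"  # Idle + Start -> Connect
        return True
```

For example, after an error at t=10 s, an automatic Start at t=30 s is ignored and one at t=70 s moves the peer to Connect; a second consecutive error at t=100 s then holds the peer in Idle until t=220 s.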