Graceful Restart

Krishna Rao <ospf_query@REDIFFMAIL.COM> Fri, 04 April 2003 09:22 UTC

Message-ID: <20030404092216.6000.qmail@webmail29.rediffmail.com>
Date: Fri, 04 Apr 2003 09:22:16 -0000
Reply-To: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
Sender: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
From: Krishna Rao <ospf_query@REDIFFMAIL.COM>
Subject: Graceful Restart
To: OSPF@DISCUSS.MICROSOFT.COM
Precedence: list

Hi,

draft-ietf-ospf-hitless-restart-07.txt:
2.  Operation of restarting router:

Point 2) If the restarting router determines that it was Designated
          Router on a given segment immediately prior to the restart,
          it elects itself as Designated Router again
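
A minimal sketch of the rule quoted above, in Python. It assumes the
restarting router learns its former role from the Designated Router
field of the Hellos it receives from neighbors on the segment. That
detection method, the function name, and the fallback to the first
advertised DR are illustrative assumptions, not text from the draft.

```python
from typing import List, Optional


def dr_after_restart(my_router_id: str,
                     dr_in_neighbor_hellos: List[str]) -> Optional[str]:
    """Return the DR this router assumes on one segment while restarting.

    dr_in_neighbor_hellos holds the Designated Router field of each Hello
    received on the segment (hypothetical input, one entry per neighbor).
    """
    if my_router_id in dr_in_neighbor_hellos:
        # Neighbors still list us as DR: resume the role, no new election.
        return my_router_id
    # Assumption: otherwise keep the DR the neighbors already advertise.
    return dr_in_neighbor_hellos[0] if dr_in_neighbor_hellos else None


# Neighbors still list 10.0.0.1 as DR, so the restarting router keeps the role.
print(dr_after_restart("10.0.0.1", ["10.0.0.1", "10.0.0.1"]))
```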

Q1) Why does this need to be done while restarting,
    and how does it help?

Q2) What if the normal DR/BDR election takes place "on exiting
    graceful restart"?

thanks,
krishna

On Fri, 04 Apr 2003 Erblichs wrote :
>Padma, :)
>
>         Comments inline..
>
>         Mitchell Erblich
>         -------------
>
>Padma Pillay-Esnault wrote:
> >
> > Erblichs wrote:
> >
> > > Padma Pillay_Esnault,
> > >         I agree with your intention!
> > >         However, in MY OPINION.
> > >         1) Decrease the flooding interval to maybe
> > >            50 minutes for LSAs without the DNA bit being
> > >            set, and so request that this architectural
> > >            constant become a configurable parameter.
> > >
> >
> > There was a passage in the draft that mentioned this explicitly,
> > as well as its limitations. Our Chair asked me to remove it. You
> > can bring it up to Acee. ;-)
>
>                 I could guess what his issue was. That is another
>                 proposal for decreasing refresh and flooding, and
>                 he felt that the scope of your document didn't need
>                 to specify every possible way to decrease it.
>
> >
> > >         2) Create a new capability that allows a minimal
> > >            OOB request/ack for a single LSA over an adj
> > >            with awareness of the LSA instance.
> > >            - This would take care of a loss of synchronization
> > >              due to checksum removal or some other loss.
> > >            - Be able to have the request directed to the LSA
> > >              originator when the below #3 header is identified.
> > >
> >
> > Not scalable in my opinion for a problem which is too rare and
> > autosolved by a periodic force refresh.
>
>                 I agree that a periodic forced refresh is simpler, but
>                 using update pkts is a very high-overhead method. The
>                 main problem with periodic refresh without a request
>                 mechanism is that on avg you have to wait 1/2 of the
>                 refresh interval. This "periodic force refresh" could
>                 be set to "days, weeks, months, years" and that would
>                 effectively remove the asynchronous flooding
>                 requirement from OSPF!!!!!
>
>                 If you have a request mechanism, that avg wait time is
>                 decreased to 2x the delay of the link plus the time to
>                 process the request. Flooding LSA headers is really
>                 lightweight vs update pkts in a minimally changing or
>                 Stable environment. However, if the environment
>                 changes, a rush of LSA reqs could significantly
>                 increase the amount of traffic. Yes, I thought about
>                 this... But I was working with a Stable env
>                 assumption.
>
>                 My #2 and #3 complement each other. The #2 can get
>                 instance information supplied by #3. Implementing #2
>                 without #3 or #3 without #2 wouldn't make sense.
> >
> > >         3) Suggest a new low-overhead flooding capability
> > >            using LSA headers (a la IS-IS) that is defaulted to
> > >            re-xmit every 30 minutes but can be set higher.
> > >
> >
> > Problem is that the LSAs will still die within 1hr. This draft
> > gives more flexibility.
> >
>
>                 Yes, I am not removing the 1 hour timeframe. However,
>                 this SUGGESTION should significantly decrease the
>                 number of bytes being flooded, which significantly
>                 decreases the amount of overhead in the flooding
>                 process for a STABLE TOPOLOGY. Yes, my suggestion
>                 still would require each router to process the header
>                 packets and make LSA comparisons.
>
>                 With the new header information, my #2 suggestion
>                 would then be able to take the instance and do an
>                 "Initial DBD Synchronization"-like comparison for a
>                 single LSA and follow up with a req / ack
>                 communication.
>
>
>
>
> > >         4) Define a method for backward compatibility.
> > >
> >
> > How is this draft not backward compatible ?
>
>                 Noooo, your draft is okay wrt this item. What I was
>                 suggesting would need a section for backward
>                 compatibility.
>
> >
> > >         Mitchell Erblich
> > >         Sr Software Engineer
> > >         ========================
> > > Padma Pillay-Esnault wrote:
> > >
> > >> See below
> > >> Erblichs wrote:
> > >>
> > >> > Let's try inlining this time..
> > >> > Mitchell Erblich
> > >> > Sr Software Engineer
> > >> > ----------
> > >> > Padma Pillay-Esnault wrote:
> > >> >
> > >> >>  Mitchell Erblich,
> > >> >>
> > >> >> > Padma Pillay-Esnault,
> > >> >> >         Let me re-phrase the first question.
> > >> >> >         1) One of the most common things that a
> > >> >> >         stable OSPF router does is the periodic
> > >> >> >         verification of the checksums of
> > >> >> >         LSAs within its database.
> > >> >> >         Upon checksum verification failure and
> > >> >> >         removal, (router just removes the bad
> > >> >> >         LSA and marks the memory location as
> > >> >> >         suspect) then non-asynchronous flooding
> > >> >> >         will not resubmit the removed LSA....
> > >> >> >         Note: With the draft that deals with
> > >> >> >         the later re-Initial Database Synchronization
> > >> >> >         work, we can get back the LSA if the
> > >> >> >         adj hasn't also removed it. But not
> > >> >> >         yet..
> > >> >> >
> > >> >>  What is this work ??
> > >> >>
> > >> >         OSPF Out-of-band LSDB resynchronization
> > >> >         by Alex Zinin and Abhay Roy
> > >> >         Cisco, Feb 2001
> > >> >         And truly, I don't know what is happening with it???
> > >> >
> > >> >> >         So, you then break the assumption that
> > >> >> >         all routers within an area must have the
> > >> >> >         same LSAs in their database?????
> > >> >> >
> > >> >>  No. The LSA contents are the same.
> > >> >>
> > >> >> >         You just forced us to
> > >> >> >         take down the interface
> > >> >> >         if we have a DNA LSA on a non-DC configured
> > >> >> >         interface and we fail a checksum? Right???
> > >> >> >         (Actually, identify the adj that submitted
> > >> >> >          us the LSA, and remove him from the next
> > >> >> >          hello, then...redo the Init DB Sync)
> > >> >> >
> > >> >>  I usually don't force anyone ;-)
> > >> >>  ???- Are you referring to the work above ?
> > >> >>
> > >> >         Yes...
> > >> >
> > >> I think this draft is for an issue orthogonal to mine and let's
> > >> not mix them.
> > >>
> > >> >> >         *** Yes, this problem was not reported to
> > >> >> >         me early on when dealing with demand circuits!
> > >> >> >         I don't know why other people don't see
> > >> >> >         this. Maybe people have perfect memory or
> > >> >> >         they just don't run the checksumming algorithm.
> > >> >> >         The current solution is to restart the
> > >> >> >         DC interface where the LSDB originated to
> > >> >> >         restart the Initial Database Synchronization.
> > >> >> >         Else, we wait for the asynchronous re-submit of
> > >> >> >         the LSA. It depends on the time that we last saw
> > >> >> >         the LSA instance.
> > >> >> >         But now you are making it apply to all
> > >> >> >         interfaces / adjs ...
> > >> >> >
> > >> >>  Let's assume that what you say happens in a regular topology:
> > >> >>  an LSA gets corrupted and discarded, then the SPF should run
> > >> >>  and this route and others including it in their SPF
> > >> >>  calculation should be discarded. Hence we will lose routes for
> > >> >>  at most 30 minutes. I have not heard anyone complain about
> > >> >>  such a problem in normal operation. I believe that it is very
> > >> >>  rare if it indeed happens.
> > >> >>
> > >> >         Maybe implementations are skipping the periodic checksum.
> > >> >         The 30 minutes is assuming that asynchronous flooding
> > >> >         will occur, started at the originator. Your draft changes
> > >> >         from 30 minutes to whenever the originator refloods the
> > >> >         LSA, if ever. So, a route can now be lost for an
> > >> >         indeterminate amount of time, if you don't do anything.
> > >> >         I don't think that is good.
> > >> >         BTW, if our current dynamic SPF wait interval (see Cisco's
> > >> >         SPF Throttling Paper on their web site), which
> > >> >         approximates this paper (max of 600,000 ms), is less than
> > >> >         the expected time of reflush of the removed LSA, we then
> > >> >         wait for the reflood.
> > >> >
> > >> This draft does not introduce any new issue relative to rfc 1793.
> > >> Moreover, Section 5 mentions the forced periodic flooding. The
> > >> parameter can be configured to whatever they want and hence they
> > >> still have a flooding of LSAs.
> > >>
> > >> >> >         2) Oops, section 5 does specify interfaces or
> > >> >> >            globally.
> > >> >> >         3) Let me state this one.. Reachability in the
> > >> >> >            Demand Circuit is 1 hr, etc. Shouldn't your
> > >> >> >            document state that reachability is based
> > >> >> >            on Dead Router Interval? You reference a
> > >> >> >            doc and take some things from it and leave
> > >> >> >            this type of item "explicitly unstated".
> > >> >> >
> > >> >>  Remember that this is *not* DC. We are not setting DC bit
> > >> >>  in the hellos or in the DBD packets. So it is Dead Interval.
> > >> >>
> > >> >                 Yes, but shouldn't you possibly want to state
> > >> >                 this in the RFC?
> > >> >
> > >> It can be stated.
> > >>
> > >> >> >         4) Are you stating reflooding by the LSA originator?
> > >> >> >            What happens if the originator doesn't see
> > >> >> >            the change? The originator will not reflood
> > >> >> >            the LSA...
> > >> >> >
> > >> >>  If the originator doesn't see a change that should cause a
> > >> >>  change in contents in its LSA then there is a bug ;-)
> > >> >>
> > >> >> >         5) The non-reachability rule for Demand-Circuit
> > >> >> >            specifies 1 hr, but for normal routers specifies
> > >> >> >            dead-router interval? Which one should we
> > >> >> >            follow? Shouldn't it explicitly be stated in
> > >> >> >            the document?
> > >> >> >
> > >> >>  See response 3.
> > >> >>
> > >> >> >            Remember, I stated an orderly shutdown. Why
> > >> >> >            don't we reflood with MAX-AGE, to remove
> > >> >> >            unnecessary LSAs? This is analogous to RIP's
> > >> >> >            poison reverse something...
> > >> >> >
> > >> >>  This is an implementation decision. When you lose your
> > >> >>  adjacency, the LSA from the originating router will not be
> > >> >>  used in the spf anyway. Flooding MAXAGE LSA can be very
> > >> >>  expensive while the lingering LSAs will not be used in the
> > >> >>  SPF anyway. So, some believe it is useless.
> > >> >>
> > >> >         Well Moy does the former in his book and I agree with
> > >> >         his decision. See the shutdown function.
> > >> >
> > >> Let me put it like this .. my draft is about eliminating
> > >> unnecessary protocol traffic .. and the maxaging on going out
> > >> is unnecessary traffic.
> > >> Padma
> > >>
> > >> >>  Padma
> > >> >>
> > >> >> >         Mitchell Erblich
> > >> >> >         Sr Software Engineer
> > >> >> >         ---------------------------
> > >> >> > Padma Pillay-Esnault wrote:
> > >> >> >
> > >> >> >>   Mitchell,
> > >> >> >>   First let me state that this feature can only operate if
> > >> >> >>   all the routers have an understanding on how to handle DNA
> > >> >> >>   LSA per rfc 1793.
> > >> >> >>
> > >> >> >> > Group,
> > >> >> >> >         Sorry, just a few items to ponder :)
> > >> >> >> >
> > >> >> >         1)
> > >> >> >
> > >> >> >> >         This document does not discuss the implications of
> > >> >> >> >         an LSA being corrupted (LSA checksum failed) and then
> > >> >> >> >         being discarded.
> > >> >> >> >         Should we not do checksums on DNA LSAs?
> > >> >> >> >         Can the router that is removing the failed checksum
> > >> >> >> >         LSA flood an earlier instance of the LSA? This is in
> > >> >> >> >         the hope that the LSA will be reflooded by the
> > >> >> >> >         originator. Should it be stated here?
> > >> >> >> >
> > >> >> >>   I am not sure I understand your question.
> > >> >> >>   How does handling this case differ between a router
> > >> >> >>   implementing this draft and having a router implementing
> > >> >> >>   DC somewhere in your topology and emitting DNA LSAs?
> > >> >> >>
> > >> >> >         2)
> > >> >> >
> > >> >> >> >         Should it be stated that this knob is per interface?
> > >> >> >> >
> > >> >> >>   This is stated in section 5.
> > >> >> >>
> > >> >> >> >         Should it be stated what happens if the knob is first
> > >> >> >> >         set to have normal flooding, then reduced flooding or
> > >> >> >> >         vice versa, OR more simply stated the requirement to
> > >> >> >> >         reflush the LSAs that are having their DNA bit
> > >> >> >> >         changed?
> > >> >> >> >
> > >> >> >>   Do you mean to state that turning off the feature should
> > >> >> >>   cause all the LSAs to be reflooded with the DNA bit clear?
> > >> >> >>   This seemed obvious to me. But anyway, on the first refresh
> > >> >> >>   of the LSA on the originating router this will cause the
> > >> >> >>   LSA to be flooded with the new sequence number and DNA bit
> > >> >> >>   clear.
> > >> >> >>
> > >> >> >> >         What is the implication of changing the knob twice
> > >> >> >> >         within the standard 5 secs of re-originating the same
> > >> >> >> >         LSA?
> > >> >> >> >
> > >> >> >>   That would be an implementation choice how to keep track of
> > >> >> >>   this.
> > >> >> >>
> > >> >> >         3)
> > >> >> >
> > >> >> >> >         Should a DNA LSA be removed upon loss of an adj
> > >> >> >> >         due to the dead router interval expiring?
> > >> >> >> >
> > >> >> >>   This is part of rfc 1793 that when reachability to the
> > >> >> >>   router originating an LSA with DNA bit set is lost that
> > >> >> >>   LSA should be maxaged.
> > >> >> >>
> > >> >> >         4)
> > >> >> >
> > >> >> >> >         **What is the implication if, during the initial
> > >> >> >> >         flooding of the DNA LSA, a temporary routing loop or a
> > >> >> >> >         black hole existed? This could even be due to router
> > >> >> >> >         misconfiguration.
> > >> >> >> >         Now, after this temporary condition has cleared, will
> > >> >> >> >         some routers never get the LSA?  In the past we could
> > >> >> >> >         be assured that we would eventually see the LSA....
> > >> >> >> >
> > >> >> >>   Any change in topology/configuration will result in LSA(s)
> > >> >> >>   being flooded so this will clear up.
> > >> >> >>
> > >> >> >         5)
> > >> >> >
> > >> >> >> >         And lastly, should this document identify the
> > >> >> >> >         procedure for removing its originated LSA before an
> > >> >> >> >         orderly shutdown?
> > >> >> >> >
> > >> >> >>   This is covered by the non-reachability rule.
> > >> >> >>
> > >> >> >> >         Mitchell Erblich
> > >> >> >> >         Sr Software Engineer
> > >> >> >> >         -----------------------------------
> > >> >> >> >
> > >> >> >>   Padma
> > >> >> >>
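
The quoted thread keeps returning to the periodic verification of LSA
checksums and the removal of an LSA that fails it. As a rough
illustration only, the Python sketch below audits a database using the
Fletcher checksum that OSPF applies to an LSA with the LS age field
excluded. The function names, the dictionary used as an LSDB, and the
synthetic 36-byte test LSA are assumptions made for this example; what
a router does with a failed LSA afterwards is the open question in the
thread and is not modelled.

```python
AGE_LEN = 2           # LS age field, not covered by the checksum
CHECKSUM_OFFSET = 16  # position of the LS checksum field in the LSA header


def _fletcher_sums(data: bytes):
    """Return the two running Fletcher sums (mod 255) over data."""
    c0 = c1 = 0
    for b in data:
        c0 = (c0 + b) % 255
        c1 = (c1 + c0) % 255
    return c0, c1


def lsa_checksum_ok(lsa: bytes) -> bool:
    """A stored LSA is intact when both sums over the covered bytes are 0."""
    return _fletcher_sums(lsa[AGE_LEN:]) == (0, 0)


def set_lsa_checksum(lsa: bytes) -> bytes:
    """Fill in the checksum field so that lsa_checksum_ok() holds."""
    buf = bytearray(lsa)
    buf[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    body = bytes(buf[AGE_LEN:])
    c0, c1 = _fletcher_sums(body)
    pos = CHECKSUM_OFFSET - AGE_LEN          # checksum position within body
    x = ((len(body) - pos - 1) * c0 - c1) % 255
    if x == 0:
        x = 255
    y = 510 - c0 - x
    if y > 255:
        y -= 255
    buf[CHECKSUM_OFFSET], buf[CHECKSUM_OFFSET + 1] = x, y
    return bytes(buf)


def audit(lsdb: dict) -> list:
    """Return the keys of all LSAs whose checksum no longer verifies."""
    return [key for key, lsa in lsdb.items() if not lsa_checksum_ok(lsa)]


good = set_lsa_checksum(bytes(range(36)))   # synthetic LSA, not a real one
corrupt = bytearray(good)
corrupt[20] ^= 0x01                         # simulate a single-bit memory error
print(audit({"lsa-a": good, "lsa-b": bytes(corrupt)}))
```

An LSA flagged by such an audit is the one that, per the thread, would
be discarded and would then have to be re-learned from a neighbor or
from the originator.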

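The trade-off argued in the quoted thread can be put in numbers: with
only a periodic refresh, a lost LSA is recovered after half the refresh
interval on average, while a request mechanism needs about twice the
link delay plus the time to process the request. The Python sketch
below simply evaluates both expressions. The 50-minute interval comes
from the thread; the 10 ms link delay and 5 ms processing time are
assumed values chosen only for illustration.

```python
def avg_wait_periodic(refresh_interval_s: float) -> float:
    """Average wait when the loss happens at a random point in the interval."""
    return refresh_interval_s / 2.0


def avg_wait_request(link_delay_s: float, processing_s: float) -> float:
    """Request out, reply back, plus the time to process the request."""
    return 2.0 * link_delay_s + processing_s


periodic = avg_wait_periodic(50 * 60)       # 50-minute refresh interval
request = avg_wait_request(0.010, 0.005)    # assumed 10 ms link, 5 ms processing

print(f"periodic refresh only: {periodic:.0f} s on average")
print(f"explicit request:      {request * 1000:.0f} ms on average")
```

With these assumed figures the gap is several orders of magnitude,
which is the point being made for a request mechanism; the counterpoint
in the thread, a burst of requests when the topology changes, is not
captured by this calculation.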