Re: [OSPF] Comments / questions on RFC3623 and Update GR part 1 : BMA env

Hi Mitchell,
You raise some valid questions.
First:
In your scenario the restarting router X is the DR, the env is NBMA, and
no BDR is present.
A DR-other that senses a topology change has no means of communicating it
to the other DR-others.
The question is whether we need to communicate it to the other DR-others
at all. If the topology change is communicated to X, shouldn't that be
sufficient? X will exit GR prematurely and should flush its previously
sent grace-LSAs, which tells the other DR-others that something is wrong,
though not specifically that a topology change has occurred.
Before we go any further, note another interesting point: the only way the
DR-other (i.e. the one that has detected the topo-change) communicates is
by sending an LSA (network/router) which, when examined, tells X that the
DR-other has stopped its support prematurely. In essence this is done by a
2-way check on the LSA contents. The check works well if the NSM state
between X and the DR-other is below Full. But what if the DR-other finds,
after the adjacency has reached Full, that it has to exit GR? Then there is
no way it can communicate this to the restarting router. The update draft
takes care of this catch: the DR-other should send some form of
notification to X (a 1-way hello, an LLS block, etc.) that it no longer
supports GR. In brief, explicit signalling.
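
To make the two exit triggers concrete, here is a minimal sketch of the
decision X faces. It is only an illustration: the RouterLSA class, the
function name and the explicit_withdrawals set are made up for this mail,
and the LSA is a toy model that lists neighbour router IDs directly rather
than real link records.

```python
from dataclasses import dataclass, field


@dataclass
class RouterLSA:
    """Toy router-LSA: advertising router plus the router IDs it lists links to."""
    adv_router: str
    neighbors: set = field(default_factory=set)


def x_should_exit_gr(helper_lsas, x_id, explicit_withdrawals=frozenset()):
    """Return True if restarting router X should give up graceful restart.

    helper_lsas: router-LSAs X has received from its helpers.
    explicit_withdrawals: hypothetical set of helper IDs that signalled
    (1-way hello, LLS block, ...) that they no longer support GR.
    """
    # 2-way check: a helper whose LSA no longer lists X has stopped helping.
    dropped = {lsa.adv_router for lsa in helper_lsas if x_id not in lsa.neighbors}
    return bool(dropped) or bool(explicit_withdrawals)
```

The second argument path is the one that matters for the post-Full case:
the helper's LSA may still list X, so only the explicit signal can tell X
that help has been withdrawn.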

Second: your scenario in which X is the DR, an extremely important router,
and the only path to a set of routes and end systems within the area.
Suppose a topology change causes a route element to be removed. If we want
a liberal approach in GR (because X is a critical router and we want it to
stay up and running), the detected topology change should not cause any
action on the helpers. The result, of course, is that routes will be
black-holed in the system: X will still be forwarding packets towards the
routes of the removed LSA.
On the other hand, if you wish to take action, make sure that the helper
exits GR prematurely if it has ANY info about a topo-change. The update
makes this tighter, in the sense of using SPF to validate whether X has any
dependency on the router-LSA that has changed. Broadly you get one of two
results: the routes are gone, or the routes have changed to something
better. If the routes are gone, you must think of exiting GR. If the
routes have changed to something better, you must still think of exiting GR
(remember that the routes on the fwd plane of X are unchanged); IMO this
*may* cause routing loops. True, all the SPF checks and dependency
tracking mean additional processing.
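
A rough sketch of that helper-side check, assuming the helper already has
the SPF results before and after the change as simple tables of
prefix -> (next hop, cost). The function names and the table layout are
assumptions made for illustration, not anything taken from the draft.

```python
def classify_routes_via_x(old_routes, new_routes, x_id):
    """Split the routes that used X as next hop into 'gone' and 'changed'."""
    gone, changed = [], []
    for prefix, (next_hop, cost) in old_routes.items():
        if next_hop != x_id:
            continue  # this route never depended on X
        if prefix not in new_routes:
            gone.append(prefix)
        elif new_routes[prefix] != (next_hop, cost):
            changed.append(prefix)
    return gone, changed


def helper_should_exit(old_routes, new_routes, x_id, strict=False):
    """Decide whether the helper leaves GR after a detected topology change.

    strict=True models 'exit on ANY topo-change'. Otherwise exit only if a
    route through X has gone or has changed, since X's forwarding plane
    still holds the old route in both cases.
    """
    if strict:
        return True
    gone, changed = classify_routes_via_x(old_routes, new_routes, x_id)
    return bool(gone or changed)
```

With strict=False a change that touches no route through X leaves the
helper in GR; that is where the extra SPF processing buys something.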

Third: in your scenario the LS refresh timer has been raised beyond 30 mins
to, say, 45 mins, the env is NBMA, and X, our GR router, is the DR. There
is no way things can work correctly. If we keep things as they are, the
system will exit GR when the LSA reaches MaxAge. You could solve this in
two ways:
1. Do nothing for MaxAge LSAs as long as you are in helper mode; don't
treat them as bad news.
2. As per your suggestion, start with a fresh copy of the LSA every time
you enter GR helper mode. But note that grace-LSAs are link-local, so at
best you will get new copies from the immediate routers and not from
routers a step beyond them, so their LSAs could still reach MaxAge before
the grace period expires.
--- did I miss something here?
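
The timer arithmetic behind this can be sketched in a few lines. The
function names are made up for illustration; the only values assumed are
MaxAge = 60 min and the refresh intervals and 30 min grace period already
discussed in this thread.

```python
MAX_AGE = 3600  # seconds; OSPF MaxAge is 1 hour


def ages_out_before_grace_expiry(age_at_restart, grace_period, max_age=MAX_AGE):
    """True if an LSA that is not refreshed hits MaxAge before the grace period ends."""
    return age_at_restart + grace_period > max_age


def grace_headroom(refresh_interval, max_age=MAX_AGE):
    """Longest grace period that is safe in the worst case, i.e. when the LSA
    was just about to be refreshed (age == refresh_interval) at restart."""
    return max(0, max_age - refresh_interval)
```

With a 30 min refresh the headroom is exactly the 30 min grace period, so
nothing ages out. With a 45 min refresh the headroom drops to 15 min, so a
30 min grace period can see the LSA reach MaxAge first.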

About the idea of extending the grace period to 1hr+ and then sending DNA
LSAs for them: isn't this too long a period for GR? Just a thought..
You could, however, extend your grace period by sending subsequent
grace-LSAs if X feels it will be unable to complete GR in the given time
period.

Either way, this issue definitely needs some thinking.

And about the minor item: on removing the GR router, the helper should run
SPF and update the routes in its forwarding table (this is specified in
RFC 3623).

Needless to say, the base GR needs a relook, as you too have probably
surmised by now.

At my presentation on the same topic in Montreal I saw no clear objections,
and from what I gather from the minutes of the 67th meeting there was only
one objection. With vendors relying on features like NSF, perhaps the
awareness of the need for a workable GR takes a back seat.

Best Regards,
Sujay

On 11/17/06, Erblichs <erblichs@earthlink.net> wrote:
>
> Sujay Gupta,
>
>         Let me give you some insight on part of this thinking.
>         IMO, GR has added a few holes in our converged routing
>         assumptions and that your update might be able to close
>         a couple..
>
>         There is a lot here!!!!
>
>         Summary says to think multiple
>         times before you exit prematurely based on a topology
>         change (LSDB) and assume that you need router "X" no matter
>         what.
>
>         So...
>
>                 Why don't you tell me how a LSDB change is
>                 going to be communicated iff the env is
>                 BMA and GR router "X" is the DR and no
>                 BDR exists?
>
>                 I consider this a huge hole..
>                 I don't really see a DR-other communicating
>                   LSAs to other DR-others to allow them to
>                   understand about a topology change.
>
>
>         Second, let's assume that router X is an extremely
>         important router within our area with the
>         connection to the internet, etc..  If it weren't we
>         would / COULD just take it down.
>
>         thus, we have router "X" is a DR, and is the ONLY path
>         to a set of routes and end-systems within this area..
>
>         Then lets conservatively look that a topology change is
>         a removal of a single router LSA that has no effect
>         after a SPF calculation.
>
>         If router "X" has successfully entered its grace
>         period / GR interval, aren't we ...
>
>            .... black holing the routers, routes, end-systems
>            that reside on the other end of router "X"?
>
>            .... since router "X" cannot probe to verify that
>           a topology change has occurred. It is in non-stop
>           forwarding mode.. It will still forward pkts to
>           the removed LSA routes until...
>
>         so what...
>           The pkts that are forwarded by router "X" to
>           say non-reach Z will cause minimal harm.. They will be
>           dropped and most likely a ICMP message will
>           be sent on the reverse path... The ICMP
>           message might be interpreted by X, but if it
>           is just forwarded,
>
>           So,,, what we have here is the inability to allow
>           router "X" to state that its routes are of a level
>           of importance that most topology changes should
>           be balanced against a delay in making the change.
>
>                 New LSAs or changed LSAs that don't show up as
>                 primary LSAs after the SPF calculation COULD
>                 in theory IMO, also be ignored. The originator
>                 could in theory check out whether the LSA is
>                 needed to be communicated.
>
>                (This says just because a minor topology change
>                 has occured do we need to go thru all the
>                 work and possibly isolate router's "X"'s
>                 dependencies).
>
>
>           If we feel that the new / changed / or removed LSA
>          should be communicated ...
>            router X still isn't going to be in convergence
>            until it resyncs its LSDB
>
>            If the change of LSA results in a better route, is
>            it worth isolating router's X's entities???...
>
>
>
>       Third, most large OSPF router vendors support the ability
>       to set the LSA refresh time on non-DNA LSAs. If we are in
>       a normal environment that the admin does this to decrease
>       the number of LSA refreshes versus the drastic steps to
>       using DNA LSAs. Then assuming having a 45 min LSA refresh
>       would not be unreasonable.
>
>         Going with this logic we have a few simple steps to
>         take.
>                 One is that all HELPERS that originate LSAs,
>                 re-originate their LSAs upon first seeing a grace
>                 LSA in addition to router "X" doing the same.
>
>                 This allows us to take almost the full hour of the
>                 grace-time.
>
>                 If we could communicate a wished grace-time in
>                 excess of 1 hour, we could in theory send out
>                 DNA LSAs which would allow a static environment
>                 to support grace times in excess of 1 hour.
>                 :: Part of this logic follows the demand circuit
>                    support.
>
>
>         Part of this is can, should we figure out a way to stop
>         router "X" in his grace-period/interval from forwarding pkts
>         based on obsolete OSPF control information.?????
>
>         The other part says, why don't we just run in a degraded mode
>         if we -realize that router "X" is so important that we would
>         just isolate ourselves or so many routes/routers that we
>         REALLY need to keep router "X" alive as long as feasibly
>         possible.
>
>         Third minor item is I didn't see any suggestion of removing
>         entities from HELPERS forwarding table based on removing the
>         GR router because of a LSDB change.
>
>         Mitchell Erblich
>         ------------------
>
>
> Hi Mitchell,
> I have some comments inline;
>
> On 11/14/06, Erblichs < erblichs@earthlink.net> wrote:
>
>      Group,
>
>              First, I have a few questions to the base and the update.
>              These comments / questions pertain ONLY to BMA envs
>              but are also relevant to other envs.
>
>              FYI) IMO, helpers are defined by two major requirements
>                 a) the ability to receive a grace-LSA before a hello
>                    "new hello" and not restart the adj
>                 b) the ability of a helper "Y" to (re)send a router-LSA,
>                    etc upon router "X" exiting graceful restart that are
>                    in its retransmit-list.
>
>
>              If this "helper" split functionality is supported, then only
>              a limited number of routers within an area need to be in
>              Full-helper mode.
>
>              Within BMA envs a DR-other above helper "b" functionality
>              is not really necessary for retransmit. We use the DR/BDR
>              for retransmit support.
>
>
>              IFFFFFFF, a GR router is the DR, who is in its grace-period,
>              and a DR-other stops sending hellos, if there are no
>              alternate paths thru another router for any forwarding
>              data, then what benefit is it to prematurely end helper
>              mode? Basically, most algs can determine ahead of time
>              whether the exiting router has contributed a primary
>              link/path. If it hasn't then it has no consequence. Yes,
>              data pkts that have an intermediate dest will be forwarded
>              1 or two extra hops, before they are dropped.
>
>              (Update: 2.0 Action on route calculation)
>              If a topology change occurs and the GR restarting router
>              is identified as the next hop, by definition it is in
>              non-stop forwarding mode and should be able to forward
>              packets. If this is identified as a topology change and the
>              ONLY route is thru this GR router, shouldn't packets be
>              forwarded thru it anyway?
>              Not doing so blackholes those routes, IMO.
>
>
>
> Suj>> Here topology change also includes routes which are no longer valid.
> If the LSA seeks a route to be removed and that route involves the GR
> router, the helper exits, thus black-holing and circumventing the GR
> router as a router under plain restart. Note that previously (RFC 3623)
> the helper was exiting for all reasons, which has been curbed now, so as
> you have pointed out the system will exit only if the LSA affects a route
> which is ONLY through the restarting router, which makes sense.
>
>         Yes, and no.. Router X, once he has entered GR, is not in the mode
>         to accept topology changes.. He will forward no matter what,
>         based on his old information. Thus, any routing inconsistencies
>         done by him until his GR time period ends CAN result in him
>         routing pkts improperly. This is basic IP forwarding.
>
>         The originator of the change will be the one who can make the
>         correct IP forwarding decisions when receiving or forwarding
>         IP pkts based on the LSA change. However, because router X is no
>         longer in sync, should he inform others not to send or allow
>         receipt of data pkts from router X?
>
>         At the end of the grace-period or when router X re-enables
>         his OSPF control code, he can be resync'ed. Until then, yes
>         we SHOULD have some level of AI that determines the benefit
>         of keeping router X in GR.
>
>
>         Second, most environments have most of the
>         LSA changes not result in any primary path alterations..
>
>              Thus,
>              1) Helpers should be able to support a, without b.
>                 If this is the case, router "X", if it is a DR, should
>                 allow all the DR-others to support a, while only the BDR
>                 need be a "Full-Helper".
>
>
>
> Suj>> I like your analysis in breaking up the helper job into partial
> and full.
> IMO "Helper" is a behaviour of a router w.r.t. the GR operation, whereas
> whether the helper is a "Full-helper" or a "Partial" one depends upon
> the result of the election. In brief, a "Helper" is always expected to
> perform the "Full-helper" features as you describe them; if it does
> less, I guess it is plain lucky! (Also note that Helper support may be a
> feature switched on/off by the admin, not a partial/full helper.)
>
>
>
>              2) If router "X" is a DR-other only one of a "DR or BDR"
>                 need to be a Full-helper. All LSAs can be placed on the
>                 retransmit list for router "X" by the DR or BDR.
>
>              3) Topology change. A LSA aging out of the LSDB, will not
>                 necessarily remove a element in the forwarding table?
>                 If this can be detected, should it still force the exit
>                 of a helper wrt graceful restart?
>
>
> Suj>> As per the base OSPF operation, a MaxAge LSA in the LSDB should
> trigger the router to flood the LSA (-->" which in turn should instigate
> the originator of the LSA to resend a fresh copy; if that fails, it ends
> in removal of the corresponding route everywhere! ").
> Thus, if the case happens as you have described, a GR with "strict LSA
> checking enabled" (i.e. catch all topology changes) should force the
> helper to exit, as it has to flood this LSA to the GR router. Here the
> update draft would tell you to exit only if the LSA somehow affects a
> route through the GR router; else let peace prevail, no need to exit.
>
>
>              4) If router "X" is a DR and has entered graceful restart
>                 with no BDR present, should the DR-other routers allow
>                 the LSAs to become not refreshed and aged out because
>                 the grace period has not expired?
>                 How can "one" or more "HELPERS" force a new DR election?
>
>
> Suj>> IMO the DR-other allows normal aging to continue. If the LSA ages
> out before the grace period expires, the helpers would exit GR (see
> above), again applying the same logic.
>
> A new election would disturb all previous adjacencies, do we need it?
>
>
>
>              5) Even with a 30 min max for the grace period, some routers
>                 may only retransmit once every 45 mins or so. How can we
>                 guarantee that those retransmitted LSAs are not aged-out
>                 if router "X" is the DR and is down for a longer time?
>
>                  Shouldn't receiving the first grace-LSAs from a
>                  restarting router initiate a re-send of the LSAs?
>
>
> Suj>>  With LSRefreshTime at 30 min, MaxAge double that (60 min), and the
> max grace period the same as LSRefreshTime (30 min), I don't see how we
> reach a stage where the problem will come. You might want to consider
> some border cases; I guess there the previous arguments (see above) hold
> good.
>
>
>
>              Mitchell Erblich
>
>
>
> Best Regards,
> Sujay
>
>      _______________________________________________
>      OSPF mailing list
>      OSPF@ietf.org
>      https://www1.ietf.org/mailman/listinfo/ospf
>