Re: Updated: draft-zinin-microloop-analysis-01.txt

At 16:43 31/05/2005 -0400, Alia Atlas wrote:
>Stewart & Alex,
>
>At 02:02 PM 5/27/2005, Stewart Bryant wrote:
>>      Primary neighbor
>>           Neighbor N of router S is considered S's primary neighbor for
>>           destination D, if N provides the shortest path to D according
>>           to the SPF calculation.
>>
>>SB> Need to say something formal like selecting N such that
>>SB> Dopt(N.D) is minimised.
>
>AA> There's always the possibility that a potential primary neighbor
>AA> isn't selected.  Consider the case where a router only selects up
>AA> to 4 equal-cost paths, and there are 5 or more.  The definition
>AA> should handle this case as well.
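
To make the two points above concrete, here is a minimal sketch, not taken
from the draft. It treats every neighbor N that minimises cost(S,N) plus
N's own shortest distance to D as a *potential* primary neighbor, and
then applies an equal-cost path limit to get the *selected* ones. The
graph layout (dict of dicts), the function names and the max_paths
parameter are all assumptions made for illustration.

```python
import heapq

def shortest_distances(graph, source):
    """Dijkstra: shortest-path cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def primary_neighbors(graph, s, d, max_paths=None):
    """Return (potential, selected) primary neighbors of s for destination d.

    potential: every neighbor n minimising cost(s, n) + dist(n, d).
    selected:  the subset actually installed when the router caps the
               number of equal-cost paths (hypothetical max_paths knob).
    """
    inf = float("inf")
    via = {}
    for n, link_cost in graph[s].items():
        via[n] = link_cost + shortest_distances(graph, n).get(d, inf)
    best = min(via.values(), default=inf)
    if best == inf:
        return [], []  # d is unreachable from s
    potential = sorted(n for n, c in via.items() if c == best)
    selected = potential if max_paths is None else potential[:max_paths]
    return potential, selected
```

With five equal-cost neighbors and max_paths=4, one potential primary
neighbor is left unselected, which is the case Alia wants the definition
to cover.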
>
>>  2.2 Next hop safety condition
>>
>>    We start the analysis with the following observation:
>>
>>      When router X learns about a topology change and starts using
>>      neighbor Y as its new primary neighbor for a given destination, a
>>      microloop between X and Y can only form if the topology before
>>      failure or topology after failure are such that Y uses X as its
>>      primary neighbor for the same destination.
>>
>>SB> I don't think that this is quite right. You say that X uses
>>SB> Y as its new next hop, AND Y uses X as its new next hop. That
>>SB> would be a failure of the IGP, which is out of scope.
>
>AA> Perhaps a better way to phrase it is:
>
>AA> "... a microloop between X and Y can only form if the topology
>AA> before the failure is such that Y used X as its primary neighbor
>AA> for the same destination."
>
>AA> And then need to clarify the opposite case as well - where the
>AA> roles of X and Y are reversed.  I think you were trying to - but
>AA> I agree that it isn't clear.
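
A small sketch of Alia's rephrased condition, assuming the primary
neighbors for one destination have already been computed on the old and
new topologies. The maps old_primary and new_primary (router -> set of
primary neighbors) and the function name are hypothetical. Iterating
over every router as X also covers the case where the roles of X and Y
are reversed.

```python
def microloop_candidates(old_primary, new_primary):
    """List pairs (x, y) where a two-node microloop could form.

    x starts using y as a NEW primary neighbor while y, if it has not
    yet converged, still forwards to x as in the old topology.
    """
    pairs = []
    for x, new_nhs in new_primary.items():
        # Only neighbors that x was not already using count as "new".
        for y in sorted(new_nhs - old_primary.get(x, set())):
            if x in old_primary.get(y, set()):
                pairs.append((x, y))
    return pairs
```

For example, if A used B before the failure and B's new primary neighbor
is A, the pair (B, A) is reported: B may switch before A does.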
>
>>    Routers SHOULD use the symmetric-link safety condition by default,
>>    MAY attempt to dynamically determine the method that needs to be
>>    applied based on the topological information from the routing
>>
>>SB> I think that we need to discuss which algorithm should
>>SB> be the default. Given that many networks that are thought
>>SB> to be symmetric turn out to be asymmetric, it's not clear
>>SB> which we should choose and why.
>
>AA> How many of the symmetric networks that actually turn out to be
>AA> asymmetric have multi-hop loops in them?  Couldn't this be
>AA> something that was flagged by a MIB - to indicate that the
>AA> "symmetric" network isn't really.  Surely this is something that
>AA> the network operators would want to know so that it can be
>AA> corrected??

Yes, I wonder how much of a problem this really is. Given that the 
algorithm doesn't prevent all loops anyway, a small increase in the 
number of loops caused by incorrectly handled asymmetric-cost cases doesn't 
seem to be much of a price to pay, especially since using the stronger 
condition to handle them correctly will reduce the overall coverage. 
That is, the total number of loops may get WORSE by using the 
asymmetric-cost fixing algorithm.

I know.... something for me to simulate :-)

>AA> Another related question is how does PLSN work with max-cost
>AA> links?  How should it work?  Is it acceptable to use a max-cost
>AA> link to reach a safe neighbor that isn't a potential primary
>AA> neighbor on either the old or new topology?  That seems
>AA> potentially bad to me, since it could cause additional traffic
>AA> loss, depending on why the link was set to max-cost.

I think a max-cost link should be treated as unreachable, since that is 
probably why it was set to max cost.
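
As a rough sketch of what "treated as unreachable" could mean in
practice: drop max-cost links from the graph before looking for safe
neighbors. The sentinel 0xFFFF is an assumption (the 16-bit maximum link
metric used in OSPF router LSAs); the right value depends on the
protocol, and the function name is hypothetical.

```python
MAX_COST = 0xFFFF  # assumed sentinel; protocol-specific in practice

def prune_max_cost_links(graph, max_cost=MAX_COST):
    """Return a copy of graph without links whose cost is >= max_cost.

    graph maps router -> {neighbor: link cost}. A router whose links
    are all max-cost keeps an (empty) entry, so it is still known but
    cannot be used as a next hop or as a safe neighbor.
    """
    return {
        u: {v: c for v, c in nbrs.items() if c < max_cost}
        for u, nbrs in graph.items()
    }
```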

>>------------------------
>>
>>3.3 IP Fast Reroute Considerations
>>
>>    If the router implements [IPFRR] and performs local failure repair,
>>    procedures described in this document still need to be applied in
>>    order to prevent micro-loops while reconverging on the new topology.
>>
>>SB> This is stricter than it should be. Say we implement basic [IPFRR]
>>SB> AND some other enhanced mechanism. We may wish to use some other
>>SB> mechanism in place of this.
>
>AA> I think that the intention should be to say that PLSN is useful
>AA> to avoid micro-loops during re-convergence and this benefit is not
>AA> provided simply by using basic [IPFRR] or another repair
>AA> mechanism.  Both a repair mechanism and a convergence control
>AA> mechanism are desirable.  I do think it would be useful to specify
>AA> the risks/undesirability of using PLSN without a repair mechanism
>AA> when the topology change includes failures.

Yes.

>AA> I do agree with Stewart that the phrasing should consider the
>AA> possibility of future techniques being introduced.

Agreed.

>>    Another difference is when the router could not repair the failure,
>>    the new primary next-hops do not satisfy the safety condition, and
>>    there's no other neighbor that does, i.e. a type-C situation. Unlike
>>    other routers in the network, the router directly connected to the
>>    network does not have the old next-hop any more, and cannot continue
>>    using it. In this situation, the router MUST revert to the regular
>>    convergence procedures, and update the route with the new next-hops
>>    with no additional delay.
>>
>>SB> We need to think about this some more. When we have an imperfect
>>SB> repair we need to consider the "greater good" and that might
>>SB> be to control the convergence of the rest of the network.
>
>AA> Given that no other router in the network is aware that the
>AA> router (S) doesn't have an alternate, I'm not sure what better
>AA> option can exist.  I think that the convergence of the rest of the
>AA> network is being controlled.  The micro-loops related to S are not
>AA> being handled.  The worst-case that I see is that S uses a
>AA> neighbor N where that neighbor is type-B and is using S as its
>AA> safe neighbor.  I do agree that we need to think about this more.
>
>>3.4 Architectural Constants
>>
>>    The following architectural constants have been used in the
>>    description of the algorithm above:
>>
>>      DELAY_SPF
>>           The delay between the moment the router receives a topology
>>SB> s/a/the first/
>>           update after a period of stability and the moment it starts
>>           its routing table recalculation.  This delay is necessary to
>>           collect multiple updates originated by different routers that
>>           relate to the same topological event.
>>
>>SB> We might want to more formally state the start/inhibit criteria
>
>AA> I agree.
>
>>      DELAY_TYPEB and DELAY_TYPEC
>>           Periods of time used by the router to delay installation of
>>           new primary next-hops after a topology change when the router
>>           has (type-B) or has not (type-C) a safe neighbor to temporarily
>>           divert the traffic to in the meantime.
>>
>>    While correctness and effectiveness of the algorithm described here
>>    does not depend on the actual values assigned to the architectural
>>    constants, it does depend on the relationship between them, and the
>>    assumption that all routers in the same network use the same values.
>>
>>    To satisfy these constraints, and yet allow these delays to be
>>    decreased as implementations continue to improve towards faster
>>    convergence, this document defines the architectural constants as
>>    configurable, specifies the required relationship between the values,
>>    and the default values that should be used by the implementations.
>>
>>SB> I wonder if we need to signal these, for example in the LSP/LSA
>>SB> I am concerned that there is little chance that all routers
>>SB> in the network will be correctly configured. The trouble is
>>SB> that if there is a mis-config it will be very hard to detect.
>
>AA> What would be done by the routers with this additional
>AA> information?  Why isn't this a management problem?  These values
>AA> could be in (yet another) MIB & then the values of the routers
>AA> could be compared.  I don't like the idea of adding signaling to
>AA> check for inconsistency - when all the router could do on
>AA> detecting this would be a log or, I guess, disabling the
>AA> functionality in the case of mismatches.

I think I agree with Alia here. While at first sight it seems that there 
might be something you could do with advertising these things in the 
protocol, life can get very complicated when you start considering what 
happens when various routers and/or regions of the network come and go. 
There is a very real danger that an "automated" dynamic synchronization 
scheme would result in more errors than a manual static one.

Simply using an advertisement in the protocol to give a warning that some 
static misconfiguration has been made (as I think Stewart was suggesting) 
is more workable, but seems like a poor use of the protocol, especially 
since (as Alia points out) the information should be available for a 
management application to check anyway.
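
A sketch of the kind of check a management application could run,
assuming it has already collected the three constants from every router
(for instance from a MIB). The data layout, the function name and the
choice of the most common value as the reference are assumptions; all
the check does is report routers that deviate.

```python
from collections import Counter

CONSTANTS = ("DELAY_SPF", "DELAY_TYPEB", "DELAY_TYPEC")

def find_mismatches(configs):
    """Report routers whose constants differ from the most common value.

    configs maps router name -> {constant name: configured value}.
    Returns {constant name: {router: deviating value}}; constants that
    are consistent across all routers are omitted. A missing constant
    is reported as None.
    """
    report = {}
    for name in CONSTANTS:
        values = {router: cfg.get(name) for router, cfg in configs.items()}
        counts = Counter(values.values())
        if len(counts) <= 1:
            continue  # every router agrees
        majority, _ = counts.most_common(1)[0]
        report[name] = {r: v for r, v in values.items() if v != majority}
    return report
```

An empty result means the constants are consistent; anything else is a
list of routers for the operator to look at, with no protocol changes
needed.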

Mike

>Alia
>
>
>_______________________________________________
>Rtgwg mailing list
>Rtgwg@ietf.org
>https://www1.ietf.org/mailman/listinfo/rtgwg
