Re: thoughts on draft-bryant-shand-ipfrr-notvia-addresses-00.txt

Alia Atlas <aatlas@avici.com> Wed, 27 April 2005 17:56 UTC

Date: Wed, 27 Apr 2005 13:55:25 -0400
To: Stewart Bryant <stbryant@cisco.com>
From: Alia Atlas <aatlas@avici.com>
Cc: rtgwg@ietf.org, mike shand <mshand@cisco.com>
Subject: Re: thoughts on draft-bryant-shand-ipfrr-notvia-addresses-00.txt

At 10:51 AM 4/27/2005, Stewart Bryant wrote:
>First of all I largely agree with Mike's email,
>but then that's not going to surprise anyone :)

Simply shocking :-)

>Alia Atlas wrote:
>
>>Mike,
>>At 10:35 AM 4/26/2005, mike shand wrote:
>>
>>>At 15:07 25/03/2005 -0500, Alia Atlas wrote:
>>>
>>>>Second is the list of downsides with the approach.  The main concern is 
>>>>that the mechanism becomes too complex such that the trade-off between 
>>>>its complexity and the full coverage is not desirable.
>>>>1.      This requires a large number of additional IP addresses in the 
>>>>IGP.  The same number of additional FECs is required to support LDP.
>>>
>>>
>>>Yes, it does. In the simplest case of link and node protection, and 
>>>ignoring LANs, it requires 2 addresses per protected link. It is expected 
>>>that these would come out of a "private" address space, and hence 
>>>wouldn't consume real addresses. Indeed for security reasons it is 
>>>preferable that they are private addresses.
>>>
>>>I don't think this number is "too many". The question is how does this 
>>>number increase when we add LANs and SRLGs.
>>
>>It would be useful to hear some additional opinions on the impact of 
>>adding a large number of addresses.  The other question is where the 
>>boundary lies at which it becomes a serious concern.
>
>Also to understand whether the issue is the number of addresses per se
>or the inflation of the routing protocol message size.

Sure.  We need to understand clearly the potential issues causing the 
constraints.
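
To make the message-size question concrete, here's a rough 
back-of-the-envelope sketch; the two-addresses-per-link figure is from 
the discussion above, but the per-address TLV overhead is purely an 
assumed number for illustration:

    # Rough estimate of IGP flooding growth from notvia addresses.
    # Assumes 2 notvia addresses per protected point-to-point link
    # (per the discussion above) and a hypothetical ~12 bytes of TLV
    # overhead per advertised address.
    def extra_igp_bytes(num_links, addrs_per_link=2, bytes_per_addr=12):
        return num_links * addrs_per_link * bytes_per_addr

    for links in (100, 1000, 10000):
        print(f"{links:6d} links -> ~{extra_igp_bytes(links):7d} extra bytes flooded")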

>>>>2.      Explicit tunnels are needed, which means that targeted LDP 
>>>>sessions are necessary to have this support LDP traffic.
>>>
>>>Yes. In the case of node protection we could also use Naiming's scheme 
>>>of next-next hop LDP advertisement.
>>
>>True - but I'd want to think about the implications in terms of 
>>additional communication & periods of instability/inaccuracy of 
>>knowledge.    It also doesn't handle the multi-homed prefix case for the 
>>case when the path isn't via the next-next-hop.
>
>OK. I think that we need to work on some state transition description
>to make sure that all the bases are covered, and that we have
>a common view of the states.
>
>The complexity of MHP is really the complexity of MHP per se, rather
>than the complexity of NV.
>
>We have four options:
>
>1) Restrict the reach of the repair to max two hops and
>maybe use Naiming's LDP extension.

yes - maybe it's good enough?  It'd be good to get the multi-homed prefixes 
considered in the network topology & simulations so we can see.

>2) Tunnel the packet (using NV, PQ or whatever) and learn the label
>at the far end.

In the general worst case, this doesn't scale.  How realistic that worst 
case is remains less clear.

>3) Tunnel the packet and then strip all labels and do an IP lookup.

Not really an option - assuming we want to have things like pseudo-wires 
running under the LDP :-)

>4) Figure out some other method of delivering the packet n hops away
>in the base topology (such as n-hop u-turn).

Possible, but it does get uglier to compute.

>Each of these approaches seems to have its issues, and it's
>a question of picking the least unpalatable.

Agree

>>>>  This is a particular concern for multi-homed prefixes; I'll describe 
>>>> my concerns on this later.
>>>
>>>
>>>Yes. This is a concern for LDP. I don't like the idea of targeted LDP 
>>>sessions. Two possibilities come to mind
>>>
>>>a) each node with an attached MHP distributes an additional label for 
>>>that prefix which has the semantics that when you pop that label you 
>>>MUST forward the underlying IP packet "directly".
>>>
>>>b) an alternative which doesn't require additional labels, but DOES 
>>>require a new "well known" label with the above semantics.
>>>
>>>Neither are very attractive, but perhaps more attractive than the 
>>>directed LDP sessions.
>>
>>Both of these presume the ability to route based on the nested addresses 
>>of the packet.  In general, I don't think that this is a valid 
>>assumption.  Consider, for instance, the case of a BGP-free core.
>>Traffic is directed towards an ASBR in a different area (that is 
>>multi-homed to the one being considered).  In that case, the ABR may not 
>>have the BGP routes to be able to correctly forward the packet based on 
>>its IP address.  There are also a number of scenarios where what is 
>>underneath the top LDP label is another MPLS label & not routable at all.
>
>In which case we either have to:
>
>a) Run the directed LDP session

See above for my thoughts on the options.

>b) Give up

nope - too stubborn :-)

>c) Think of something else.
>
>A "something else" might be domain-wide labels, but I remember the last
>time that was proposed in the MPLS WG :)

sure, but we're seeing upstream label distribution being proposed again too :-)
It is true that anything else seems likely to require some change/addition 
to the MPLS semantics - and that would be more challenging to consider.

>Are there any other "something else"s that are better?

>>>>3.      Substantial IGP changes are required to handle the additional 
>>>>Notvia addresses.
>>>
>>>
>>>Substantial is perhaps a bit strong. We need to advertise the not-via 
>>>address and its association. For IS-IS it's pretty straightforward. OSPF, 
>>>by its very nature, may be a little more tricky.
>>
>>More substantial than a few bits :-)  The main issue here is just the 
>>interop and migration concerns.
>
>I don't understand. The IGP will flood the TLVs we have in mind.
>Non-NV routers will be excluded from the base topology. Could you expand?

Oh, this is just the more general worry of getting interoperability without 
a very clear specification.  I've nothing specific here yet - just a 
background concern that it be thought about.

I do agree that excluding those routers that don't advertise the capability 
takes care of a large section of the issues.
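
For what it's worth, the exclusion itself could be as simple as the 
following sketch; the capability flag and the topology representation are 
illustrative, not from the draft:

    # Sketch: drop routers that did not advertise the NV capability
    # before computing the topology.  'topology' maps router ->
    # {neighbor: cost}; 'nv_capable' is the set of routers whose
    # (hypothetical) capability advertisement was seen.
    def prune_non_nv(topology, nv_capable):
        pruned = {}
        for router, neighbors in topology.items():
            if router not in nv_capable:
                continue                      # exclude the router entirely
            pruned[router] = {n: c for n, c in neighbors.items()
                              if n in nv_capable}
        return pruned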

>>>>b.      It is desirable to have some dampening on the withdrawal of 
>>>>Notvia addresses to minimize thrashing.
>>>
>>>The allocation of notvia addresses to links certainly shouldn't be 
>>>changed as a result of not "needing" the notvia address when the object 
>>>with which it is associated goes away. It should also get back the same 
>>>notvia address when it comes back. But I don't think there are any 
>>>particular issues associated with them disappearing and reappearing in 
>>>the LSPs.
>>>
>>>Do you have any specific issues in mind?
>>
>>Only keeping the notvia addresses around until after the network has 
>>converged...  If the notvia address is withdrawn with the link that's 
>>failed, then traffic may still be using that alternate.
>
>Since we are going to use controlled rather than uncontrolled
>convergence we can include managing the NV entries in the FIB.
>You are right to point out that we have not described how to do this.

What are you defining as controlled convergence?  Is this waiting until 
the network has otherwise converged & then handling the notvia 
addresses?  That'll work - but it's a bit different from the general 
controlled convergence for the primary topology.
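
Just to pin down the ordering I mean, a sketch (the FIB object and the 
stability signal are hand-waved; none of these names come from the draft):

    import threading

    # Sketch of delaying repair-topology changes: converge the primary
    # FIB first, then update the notvia entries only once the network
    # is otherwise stable (modelled here as an Event that a hold-down
    # timer would set).
    def apply_convergence(fib, primary_routes, notvia_routes,
                          primary_stable: threading.Event):
        fib.update(primary_routes)    # normal controlled convergence first
        primary_stable.wait()         # e.g. hold-down timer expiry
        fib.update(notvia_routes)     # only now touch the repair state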

>>>>2.      Insufficiently diverse topology:  It is possible that a network 
>>>>topology cannot provide an alternate that suffices for link, node and 
>>>>SRLG protection.  It isn't clear to me how to compute a 
>>>>"best-available" alternate using this approach.  For instance, if one 
>>>>can get link protection, but not node protection, how would that be 
>>>>determined, computed and assigned?  This becomes much more of a concern 
>>>>for SRLG protection & for topologies where failures have already 
>>>>occurred and the network has converged for those & needs protection in 
>>>>the event of an additional failure.
>>>
>>>
>>>Clearly it is always possible to create a topology which contains single 
>>>points of failure and is inherently irreparable. This is part of the 
>>>tradeoff we need to address when thinking about SRLGs, since taking a 
>>>simple but pessimistic approach to SRLG can result in this sort of 
>>>failure. This seems to be a property of the problem rather than any 
>>>particular solution.
>>
>>Let me try to explain this a bit better.  Say there's a topology that, 
>>for a particular next-hop & next-next-hop, can only provide an alternate 
>>that gives link and node protection but not SRLG protection.  Now, how 
>>does the notvia address method compute an alternate?  If the method is 
>>pruning the topology of the relevant link, node & SRLGs, no alternate 
>>will be found.  However, it was possible to compute & use an alternate 
>>that gives the link & node protection.
>
>I need to think about this.
>
>>The similar case can easily occur with link & node protection.  Say S has 
>>two parallel links to E; if the first fails, S could use the other to get 
>>link protection  - but there is no node-protecting alternate.  How does S 
>>determine this?  What is the fall-back strategy in the case that no 
>>"full-protection" alternate is available?
>
>In this case
>
>S fails E, and computes the NV paths to its neighbors.
>If any or all of these are unreachable it uses a link
>repair to E_!S to reach them as described in Section 4.2
>of the draft. If E_!S does not exist, as in the case above,
>S then looks to see if the parallel link exists.
>
>Of course, in the absence of SRLGs, this topology contains
>a SPOF (single point of failure) for node protection and will
>always be expected to have limited repair coverage.

Thanks.  This helps clarify the behavior when the link is pt-to-pt.

For the broadcast link case, I can see greater difficulties - because it 
looks like a local SRLG to a large extent.  Any thoughts there?
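
To make the fall-back question concrete, here's the kind of progressive 
relaxation I have in mind, as a sketch only (this just illustrates the 
"best-available" idea; it isn't how the draft computes notvia paths):

    import heapq

    def spf(topo, root, dest, bad_nodes=frozenset(), bad_links=frozenset()):
        """Tiny Dijkstra over topo = {node: {neighbor: cost}}.  Links in
        bad_links are frozensets like frozenset(('S', 'E'))."""
        dist, prev, heap = {root: 0}, {}, [(0, root)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dest:
                path = [u]
                while u in prev:
                    u = prev[u]
                    path.append(u)
                return path[::-1]
            for v, cost in topo.get(u, {}).items():
                if v in bad_nodes or frozenset((u, v)) in bad_links:
                    continue
                if d + cost < dist.get(v, float('inf')):
                    dist[v], prev[v] = d + cost, u
                    heapq.heappush(heap, (d + cost, v))
        return None

    def best_available(topo, root, dest, link, node, srlg_links):
        # Strictest prune set first, then relax toward link-only protection.
        for bad_nodes, bad_links in [({node}, {link} | srlg_links),
                                     ({node}, {link}),
                                     (set(), {link})]:
            path = spf(topo, root, dest, bad_nodes, bad_links)
            if path:
                return path
        return None    # truly irreparable from this root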

>You are correct that this all needs describing in detail.

That's what revisions are for :-)

>>>>b.      An example of a concern with the BFD diagnosis is that all 
>>>>interfaces on a node that has failed are not certain to fail exactly 
>>>>simultaneously or even within a sub-50ms bounded window.  It is 
>>>>entirely possible that BFD sessions are terminated on different 
>>>>line-cards, that detect the router failure at slightly different times 
>>>>and stop forwarding traffic, therefore, at slightly different times.
>>>
>>>
>>>Yes. There is the possibility of misdiagnosis in this case if the second 
>>>failure occurs too long after the first. I suppose this then looks like 
>>>two separate failures. Clearly an unreliable diagnosis is probably worse 
>>>than no diagnosis at all. We need to get some handle on how realistic or 
>>>not this scenario is.
>>
>>Well, I think it is exceedingly realistic :-)
>>For a non-power related failure, routers with separate forwarding & 
>>control planes may take varying amounts of time for the line-cards to all 
>>realize that the route controller is down.
>
>Well maybe for power-failures as well :)
>
>The pathology of this sort of failure is highly implementation dependent.
>Say BFD was running on the LC, but the switch fabric was down.
>You could end up with the neighbors thinking that the router was still
>up, but it was non-functional. Eventually routing would notice the
>absence of routing hellos, unless of course, these had also been
>delegated :) Perhaps we need to run BFD to the neighbor's neighbors
>on the direct path?

This is a large part of what I see to be the problem with thinking of BFD 
as being a mechanism for detecting router failure.  Perhaps there are those 
with more BFD experience who can point out how it could work?

To my mind, either the BFD session runs on the line-card, in which case 
one can imagine a scalable implementation that sends packets out at least 
every 5-10ms, so that a failure can be detected within 20ms & then 
repaired in the remaining 30ms, or the BFD session doesn't run on the 
line-card, in which case reliably generating packets every 5-10ms per 
interface seems a bit of a stretch.
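
The arithmetic I'm doing in my head, written down (the x3 detect 
multiplier is the usual BFD convention, but the exact budget split is my 
assumption):

    # Sketch of the 50ms budget with BFD on the line-card.
    tx_interval_ms = 7    # assumed transmit interval (the 5-10ms range)
    detect_mult = 3       # missed packets before declaring failure
    detection_ms = tx_interval_ms * detect_mult    # ~21ms to detect
    repair_ms = 50 - detection_ms                  # ~29ms left to repair
    print(f"detect in ~{detection_ms}ms, leaving ~{repair_ms}ms to repair")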

It seems to me that to get the desired speed of failure detection, BFD 
would need to be done on the line-card & preferably in hardware.  This 
makes BFD good for detecting link failures, but not so much for detecting 
router failures.

As for the line-card being up, but the switch fabric down, I'd hope that 
the router internals would have a way of detecting that & bringing the 
line-card down when appropriate.  That's internal implementation anyhow.

BFD to neighbor's neighbor would at least validate the forwarding path all 
the way to that neighbor's neighbor - but it gives a larger number of BFD 
sessions & there is also the difficulty of interpreting the failure back to 
the affected routes.  For instance, what about the case where the 
downstream neighbor may (or may not) ECMP to two of its neighbors?  If the 
BFD session to one of those goes down, does S do repair?

>The problem is that we rapidly get on a complexity spiral that
>becomes intractable.

This is why I really do not like the idea of trying to do failure diagnosis 
via BFD.  I don't think it is realistic to believe that router failure can 
be identified as such by BFD within 20-30ms.

>We clearly need to write down a set of project scoping rules for the
>types of failure that we will and will not deal with.

I don't think the issue here is the type of failure that is being handled - 
that is determined by what the alternate avoids.  The question here is 
the ability to diagnose failure types on the fly.

>>>>   It also needs to be thought through what issues might exist if the 
>>>> topologies used for the SPF vary slightly for each router that is on 
>>>> the broadcast link, since each will, as described, not prune itself 
>>>> out when doing the computation; of course, there could be an approach 
>>>> where the same topology can be used everywhere.
>>>
>>>
>>>I'm not really sure what you mean here.
>>
>>Let me try and explain it a bit.  Perhaps I'm missing something.   In the 
>>case where a notvia topology results in pruning the router doing the 
>>computation, what forms the root of the SPT?   Say routers A, B and C are 
>>all connected to a broadcast link X and want to compute a notvia X 
>>address as described in (c) by pruning the pseudo-node related to X as 
>>well as A, B, and C.   Now, router A prunes the pseudo-node, A, B and C 
>>from the topology; what does A use as the root?  IF A only prunes the 
>>pseudo-node, B and C to compute notvia X, B only prunes the pseudo-node, 
>>A, and C, and C only prunes the pseudo-node, A and B,  and all other 
>>routers prune the pseudo-node, A, B, and C, can there be any issues with 
>>a consistently computed & non-looping path for notvia X?
>>I think it may not be an issue - b/c once the traffic leaves A, B or C, 
>>it will never return - but it at least needs some thought, since this is 
>>a bit different from what's traditionally been done.
>
>Agreed, we need to write down the algorithm and subject it to review.

The more I think about it, the more comfortable I am - but as we've agreed, 
it needs clear description & thought about the differences & their 
implications.
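
For the record, the prune-set asymmetry I was describing, as a sketch 
(X stands for the pseudo-node; the names are from my example above):

    # Sketch of the per-router prune sets for notvia-X on a broadcast
    # link X with attached routers A, B and C (X is the pseudo-node).
    attached = {'A', 'B', 'C'}

    def prune_set(router):
        """What 'router' removes before computing its SPT for notvia X."""
        if router in attached:
            # An attached router stays in as the SPT root and prunes
            # only the pseudo-node plus the *other* attached routers.
            return {'X'} | (attached - {router})
        return {'X'} | attached    # everyone else prunes A, B and C too

    for r in ('A', 'B', 'D'):
        print(r, 'prunes', sorted(prune_set(r)))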

>>>>  Of course, multi-homed prefixes may be much more infrequent for LDP 
>>>> than for IP; for example, there is no reason to advertise a separate 
>>>> FEC for the subnet of a link.  However, multi-homed prefixes are a 
>>>> concern for LDP for at least the inter-area, AS External, and BGP routes.
>>>>iii.    If traffic is encapsulated to a node's regular address, because 
>>>>that traffic is destined to a prefix advertised by the node, how does 
>>>>the receiving node know to remove the encapsulation and forward the 
>>>>packet inside  all in the fast path?  Is this a just a question of 
>>>>different handling based on the header type inside the outer 
>>>>encapsulation (for GRE)?
>>>
>>>
>>>Yes.
>>
>>OK.  The traffic wouldn't be directed up to the control plane because it 
>>was GRE encapsulated??
>
>GRE always pops the header at the tunnel endpoint. That is how it
>works!

OK.  Thanks.  I'm (obviously) not that familiar with GRE.

>>And had a special header type for this purpose?
>>Certainly I can see something like this working with an LDP LSP, b/c the 
>>label would just get it to that router & then be popped & the packet 
>>forwarded based on what's underneath.
>
>Perhaps an MPLS label of some sort the way we thought of doing directed
>forwarding and the way that Mark Townsley proposed doing IP VPN?

Could you explain more?  I'm not sure what you're picturing the label being 
used for.

>>>>iv.     Perhaps these issues could be handled by determining a 
>>>>next-next-hop that avoids the failure to reach an appropriate 
>>>>advertiser.   Of course, this is a different set/type of computation.
>>>
>>>
>>>Could you explain that suggestion please?
>>
>>Well, if there is a neighbor's neighbor whose path to the multi-homed 
>>prefix doesn't go through the failure & this can be determined, then the 
>>traffic could be tunneled to that neighbor's neighbor & then normally 
>>forwarded from there.
>
>Yes, but you only get two hop reachability. Perhaps you do this,
>and then do directed LDP for the remaining (perhaps 2%) of cases.
>The problem I have with this is the added complexity.

Yes, it would require additional computation & anytime you have two ways to 
do something, it gets more complicated.   On the other hand, a 
(potentially) full mesh of LDP sessions isn't simple either!

If the frequency of multi-homed prefixes that can't be protected this way 
is small - and the network design to gain protection isn't too complicated 
to understand, then it'd help.  For, say, the inter-area case, a typical 
cross-hatch connection probably works fine.
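
The computation I'm picturing for picking such a neighbor's neighbor, as a 
sketch (it assumes the shortest paths toward the advertiser are already 
known, e.g. from reverse SPFs; none of this is in the draft):

    # Sketch: find a neighbor's neighbor whose path to the prefix's
    # advertiser avoids the failed component.  'paths' is assumed to
    # map (router, advertiser) -> list of nodes on the shortest path.
    def usable_nnhop(nnhops, advertiser, failed, paths):
        for z in nnhops:
            path = paths.get((z, advertiser))
            if path and failed not in path:
                return z       # tunnel to z; it forwards normally
        return None            # no luck; fall back to another mechanism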

>>>>7.      There is a definite need to describe the convergence case 
>>>>better.   That is, how the transition from using the alternate to the 
>>>>converged network happens, such that the alternate remains functional.
>>>>a.      For instance, if the node E fails, then the Notvia address E_!S 
>>>>will no longer be advertised.  If S was getting link protection 
>>>>(because that was all that was possible, for instance) by tunneling 
>>>>traffic to E_!S, it is important that this traffic be properly 
>>>>discarded when E's addresses go away.   This implies that there needs 
>>>>to be a default blackhole for Notvia addresses.
>>>
>>>
>>>I don't quite understand your concern here. If E goes away and S is 
>>>sending to E_!S, then the neighbors of E will drop the packets because 
>>>we don't repair a notvia address.
>>
>>I'm thinking of this as the case where the more specific prefix goes
>>away.  Without a 
>>specific blackhole for the group of prefixes, why wouldn't the packets 
>>take that instead?  I.e., if the notvia address is 10.1.1.1 and there's a 
>>default route for 10.1/16  (or for 0.0.0.0/0), then the packet would pick 
>>up the latter when the notvia address is removed.
>
>Yes, you are quite right. We need an NV black hole.

I think this is just a detail to be written down then.
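
A toy longest-prefix-match example of the leak, just to record it (the 
notvia range and table entries are of course made up):

    import ipaddress

    def lpm(table, dest):
        """Toy longest-prefix match; table maps prefix -> next hop."""
        addr, best = ipaddress.ip_address(dest), None
        for prefix in table:
            net = ipaddress.ip_network(prefix)
            if addr in net and (best is None or
                    net.prefixlen > ipaddress.ip_network(best).prefixlen):
                best = prefix
        return table.get(best)

    table = {'10.1.0.0/16': 'normal-next-hop'}   # notvia /32 withdrawn
    print(lpm(table, '10.1.1.1'))                # leaks via the /16!
    table['10.1.1.0/24'] = 'discard'             # NV black hole (assumed range)
    print(lpm(table, '10.1.1.1'))                # now dropped safely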

>>>>c.      It is possible to get a micro-forwarding loop affecting a 
>>>>Notvia address as a result of a less severe failure than 
>>>>anticipated.  For instance, consider the following topology.
>>>>              [D]
>>>>               |
>>>>           1   |
>>>>      [E]-----[F]-\
>>>>       |       |   \ 10
>>>>     1 |R    1 |R   \
>>>>       |   5   |     \
>>>>      [S]-----[H]----[I]
>>>>                  2
>>>>
>>>>      Link S->E and Link H->F are in SRLG R
>>>>
>>>>When node E fails, if I converges before H, there will be a loop 
>>>>affecting the Notvia address being used to reach F without going 
>>>>through any of Link S->E, E or SRLG R.
>>>
>>>
>>>We discussed this privately, and I still don't see how loops could 
>>>arise even if the notvia FIB were recomputed before normal convergence 
>>>is complete. But I think it is better to delay the notvia FIB changes anyway.
>>
>>Just for clarity (hopefully), before the failure, H computes the path 
>>for F_!E, F's address that is notvia E, to go via I and then to 
>>F.  After the failure of E, if H installs the changed notvia address F_!E, 
>>the path is directly to F, b/c node E no longer has SRLG R associated 
>>with any of E's up links.
>
>I think that the core issue here is the case of a failure during the
>reconvergence of the repair topology. Is that in scope?

??  No, I think this is the issue of whether the repair topology 
reconverges at the same time as the primary topology or it is delayed.  If 
it is at the same time, the problem above happens.  I think we're all 
agreed that delaying the repair topology convergence until the primary 
topology convergence is complete is the way to go.
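
To record the loop concretely, here are the next-hops for F_!E read off 
the diagram above (costs as drawn; this traces my reading of the failure 
sequence, so treat it as a sketch):

    # Before E fails, the F_!E computation prunes E plus the SRLG R
    # members S-E and H-F:
    #   H reaches F_!E via I:  H -> I -> F  (2 + 10 = 12)
    #   I reaches F_!E direct: I -> F       (10)
    # After E fails and its links are withdrawn, SRLG R no longer
    # constrains the computation, so recomputing F_!E gives:
    #   H -> F  (1)   and   I -> H -> F  (2 + 1 = 3)
    # If I installs its new route while H still holds its old one:
    fib = {'I': 'H',    # I's new next hop for F_!E
           'H': 'I'}    # H's old next hop for F_!E
    node, hops = 'I', []
    while node != 'F' and len(hops) < 6:
        hops.append(node)
        node = fib[node]
    print(' -> '.join(hops + [node]))   # I -> H -> I -> H ...: micro-loop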

>>>>d.      How do exceptions work?  Particularly in regards to an IP-in-IP 
>>>>encapsulation such as GRE, it doesn't seem like MTU exceeded cases can 
>>>>be handled cleanly - either by use of DF or by doing IP fragmentation 
>>>>and then the reassembly at the end of the tunnel.  This seems like a 
>>>>problem for all ICMP packets; how could a source understand the header 
>>>>inside for a TTL expired, for instance.
>>>
>>>
>>>I'll leave this for Stewart (tunnel) Bryant!
>>
>>For LDP, there are mechanisms (layer violations though they are) to 
>>handle exceptions generating ICMP packets.
>
>The interesting question is who needs to know of the MTU problem?
>
>If you tell the host, then by the time it adjusts its MTU the
>network will likely have reconverged anyway.

It depends on how long the network is taking to get off the 
alternates.  That could be 10 seconds or so - so there's a chance the host 
will be able to adjust its MTU usefully.

>However, if you tell the repairing router (which is what will happen
>with a tunneled packet), it can alarm and let the network
>administrators know that there is a problem with the IPFRR config.

But why does the ICMP packet need to travel back to the repairing 
router?  Why couldn't the router which sees an MTU-exceeded message 
addressed to a notvia address let the network administrators know?

>For this to work, the MTU at the edges needs to be lower than the
>MTU in the core.

Sure.  My concern isn't so much for the MTU exceeded case as for the TTL 
expired case.  That's a common network debugging tool - and we'd want it to 
continue to work during repair.

>>>>e.      For IP-in-IP tunnels, another concern is flow diversity.  The 
>>>>IP source and destination addresses are used to determine a flow; this 
>>>>flow identification may then be used for a variety of purposes, 
>>>>including ECMP.  By putting all the traffic to a variety of 
>>>>destinations inside the same header, the ability to take advantage of 
>>>>flow diversity appears to have disappeared.   This could possibly be 
>>>>solved by putting the original source address into the encapsulating 
>>>>header?  Are there other approaches?
>>>
>>>
>>>and this.
>>
>>Again, for an LDP tunnel, many routers can look under the label and 
>>consider the IP packet inside for flow identification.
>
>I was going to say:
>
>Given that basic cuts in before NV, I think that the only case where
>this is a problem is when you have a router with max ECMP = say 2 which
>selects two from more than two, and the next hop on one of them fails.
>This is surely a corner case?
>
>Then Mike pointed out that we had said that we would use ECMP in the
>draft, and yes there is a problem. Again we need to think about the
>implications, because it's not clear what we should do.

If the paths for notvia addresses can be ECMP, then it becomes a 
concern.  Now, maybe it isn't necessary for those to be ECMP - or maybe it 
isn't that likely.  But maybe it could be solved by putting the real source 
IP address into the encapsulating packet header?
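
A toy illustration of the hash collapse and of copying the inner source 
into the outer header (the hash function and addresses are arbitrary):

    import zlib

    def ecmp_bucket(src, dst, n_paths=2):
        """Toy ECMP hash on the outer (src, dst) pair."""
        return zlib.crc32(f"{src}->{dst}".encode()) % n_paths

    flows = [(f"10.0.0.{i}", f"192.0.2.{i}") for i in range(1, 9)]
    notvia = "10.1.1.1"
    # One encapsulating source: every flow hashes to the same bucket.
    one_src = {ecmp_bucket("10.9.9.9", notvia) for _ in flows}
    # Copy each original source into the outer header: the flows are
    # very likely to spread across the buckets again.
    copied = {ecmp_bucket(src, notvia) for src, _ in flows}
    print(len(one_src), "bucket used vs", len(copied), "buckets used")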

Alia


