RE: discussion on fast notification work

Gábor Sándor Enyedi <gabor.sandor.enyedi@ericsson.com> Fri, 08 July 2011 10:35 UTC

From: Gábor Sándor Enyedi <gabor.sandor.enyedi@ericsson.com>
To: "curtis@occnc.com" <curtis@occnc.com>, Anton Smirnov <asmirnov@cisco.com>
Date: Fri, 08 Jul 2011 12:35:03 +0200
Subject: RE: discussion on fast notification work
Cc: "rtgwg@ietf.org" <rtgwg@ietf.org>

Hi Curtis,

At this point I feel that there has been a terrible mix-up, and the drafts have been confused with each other. Let me try to clarify the situation.

Draft 1. - draft-kini-ospf-fast-notification-01:
This draft tries to speed up LSA/LSP propagation. Its motivation is easy to challenge, since propagation is currently not the bottleneck of convergence time. This may change later, but for now you are right: this draft may not attract much attention.

Draft 2. - draft-csaszar-ipfrr-fn-00:
This is a full-fledged protection technique for IP (and LDP) networks, exactly like e.g. LFA, and it tries to address the same problems. I know that several people think protection MUST be local, but that is not true. The difference between protection and restoration is that restoration does all the work AFTER the failure, while protection is proactive, and only some minor reaction is needed when the failure occurs (e.g. switching to the precomputed detours). Therefore we were thinking: "OK, everybody wants to do local protection for IP, but why? They want to avoid the latency of propagating an information packet, but that is not an issue for 99% of the areas! We have 50ms, and in the very worst case 10-15ms is enough to get through most of the areas... So a technique that is not local, but is still protection, may fit the requirements." So we decided to try some non-local PROTECTION for IP, and see what we get for the price of being, say, 10ms slower than LFA, NotVia, or any other regular IPFRR. What we got: we can protect against 100% of single link, single node and single SRLG failures; traffic moves to the post-failure shortest paths immediately (so no second convergence is needed once the new topology has been learned); and we need neither extra protection addresses nor tunneling.
Recall that we are talking about a protection mechanism, not some OSPF/IS-IS hack. This draft describes a mechanism where the alternative routes are pre-downloaded and the reaction to a failure happens without involving the control plane. This means that we need time for:

1. detecting the failure - T1
2. propagating the notification - T2
3. doing the failover - T3

The third step is now needed at all the routers, but it is done in parallel, so the network is fully recovered once the last node is reconfigured, T1+T2+T3 after the failure. And with "local" protection? We would need T1+T3. We are saying that in most networks we can afford to spend that extra T2 to get the advantages listed above.
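
The comparison of T1+T3 against T1+T2+T3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of either draft: the class and function names are made up for the example, the 200us origination, 180us per-hop and 15ms FIB update figures are the upper bounds from the prototype numbers quoted further down in this thread, the 20 hops is the area diameter from the same message, and the 10ms detection time and the zero link delay are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTimes:
    """Recovery-time components in milliseconds (names are illustrative)."""
    detect_ms: float    # T1: detecting the failure
    notify_ms: float    # T2: propagating the notification
    failover_ms: float  # T3: switching to the pre-downloaded routes

    def local_ms(self) -> float:
        # Local protection: only the detecting node reacts, so T1 + T3.
        return self.detect_ms + self.failover_ms

    def remote_ms(self) -> float:
        # Notification-based protection: all nodes fail over in parallel,
        # so the last one finishes T1 + T2 + T3 after the failure.
        return self.detect_ms + self.notify_ms + self.failover_ms


def notification_ms(hops: int, originate_us: float = 200.0,
                    per_hop_us: float = 180.0,
                    link_delay_ms: float = 0.0) -> float:
    """T2 for a notification that travels `hops` hops.

    The defaults are the prototype figures quoted in this thread.
    link_delay_ms (total propagation delay on the path) is an assumed
    parameter and defaults to zero.
    """
    return (originate_us + hops * per_hop_us) / 1000.0 + link_delay_ms


if __name__ == "__main__":
    # Assumed inputs: T1 = 10 ms is a placeholder, T3 = 15 ms is the naive
    # serial FIB update bound from the thread, 20 hops is the diameter.
    t = RecoveryTimes(detect_ms=10.0,
                      notify_ms=notification_ms(hops=20),
                      failover_ms=15.0)
    print(f"local : {t.local_ms():.1f} ms")
    print(f"remote: {t.remote_ms():.1f} ms")
    print(f"extra : {t.remote_ms() - t.local_ms():.1f} ms, budget 50 ms")
```

With these assumed inputs the extra T2 is 200us + 20 x 180us = 3.8ms of processing time, plus whatever link propagation delay the path adds.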

Please do not mix up the two drafts!
BR,

Gabor

P.S.: We are currently revising draft-csaszar-ipfrr-fn-00 to address its most important problems (e.g. authentication). If you can wait until Monday, it may be worth reading the new version.



-----Original Message-----
From: rtgwg-bounces@ietf.org [mailto:rtgwg-bounces@ietf.org] On Behalf Of Curtis Villamizar
Sent: Friday, July 08, 2011 1:44 AM
To: Anton Smirnov
Cc: rtgwg@ietf.org
Subject: Re: discussion on fast notification work 


Anton,

A key point made below is:

  Protection at the PLR is always faster than inter-node protection.

That said, if someone wants fast protection with full coverage, run MPLS with RSVP-TE and enable FRR and join the rest of the world that has already figured that out.  Then there is complete (single failure) protection at every PLR.  If multiple failures occur, then precomputed protection is impractical, notification schemes don't work, and flooding needs to work well and SPF and/or CSPF and installation of new FIB needs to be fast.

Of course, those that think the idea of running MPLS is way too scary or think that reinventing the wheel is great fun can continue this conversation as long as they like.  After all if people didn't think MPLS was scary we wouldn't even have IPFRR in the first place.

Curtis


In message <4E159E72.1000400@cisco.com>
Anton Smirnov writes:

    Hi András,

 > 2. near instantaneous update of the FIB

    I am no specialist in FIB implementations, but it would appear to me that implementations and their requirements vary so much that the very intention of improving them all is misguided and bound to fail.


 > 1. near instantaneous notification of failures to neighbour and remote nodes

    Here is my vision of the problem:
    My logic says that good inter-router notification cannot be made as fast as good intra-router API notification. So all good local repair techniques are intrinsically superior to [even good] inter-router notification approach. Superior first of all in speed of restoration but obviously things like deployment ease add attractiveness.
    That is, the niche for a remote notification technique is squeezed; it can be applied as an aid to local repair techniques in those cases where the network topology provides redundancy but local repair techniques can't use it. Since more elaborate local repair techniques are being developed which expand their coverage, the niche for a remote notification technique is contracting to the point where people don't want to bother with it (or even care to criticize it :-) )

    I am guessing that the authors of the proposal don't agree with this part: "My logic says that good inter-router notification cannot be made as fast as good intra-router API."
    May I suggest that the authors work on this perception? Otherwise I am afraid there will again be total misunderstanding and disinterest.

Anton


On 07/07/2011 12:52 PM, András Császár wrote:
> Dear All,
>
> As a recap, the basic idea was to explore how one could approximate
> 1. near instantaneous notification of failures to neighbour and remote nodes
> 2. near instantaneous update of the FIB
>
> 1 is approximated by a completely dataplane-based fast notification (FN) framework.
> 2 is approximated by pre-calculating and pre-downloading backup routes for RELEVANT failures and doing the FIB update from within the linecard.
>
> Since last IETF, based on the comments we received, we have been working on (and prototyping) a method where FNs are propagated on the shortest path and each hop performs SHA256 authentication in the dataplane before forwarding the packet.
>
> Important highlights proving feasibility:
>
> - In a 1000-node area with a diameter of 20 hops and 500k external routes, the backup FIB even in a very bad case is not bigger than 30MB with very diverse ECMP (10 ECMP alternatives for each destination). The download of this backup FIB size should be no problem.
>
> - A naive serial FIB update procedure after a failure in the above network takes less than 15ms within a dataplane card (assuming 5MT/sec memory performance and 1 memory controller). But there may be more intelligent approaches, such as a lazy (on-demand) FIB update.
>
> - In reality, our calculations show that typically only nodes between
> 1 and 3 hops away need to prepare for a failure, i.e. only failures
> 1-2-3 hops away are RELEVANT (the above calculation assumes that each
> node needs to prepare, for each destination, for all failures within
> the 20-hop diameter)
>
> - Very important: the FN packet always proceeds AHEAD OF normal data 
> packets, so re-routed data packets typically find nodes on their way 
> which have finished or almost finished reconfiguring. (In this way 
> long links do not cause problems as both FN and normal data packets 
> are delayed the same.)
>
> - Pre-calculation complexity is in the same order of magnitude as with Not-Via, and it's done "offline"
>
>
> Conclusions of our naïve implementation are the following:
>
> - The solution can be implemented on a current platform, and we don't 
> seem to use any operation that would make it less useful on other 
> platforms including e.g. EZChip NP-4
>
> - An FN packet can be originated in less than 200us (micro-sec) after
> failure detection
>
> - An FN packet can be forwarded at each hop in ca. 180us (this already 
> includes SHA256 verification and duplicate check!)
>
>
> András
>
>
>> -----Original Message-----
>> From: rtgwg-bounces@ietf.org [mailto:rtgwg-bounces@ietf.org] On 
>> Behalf Of Alia Atlas
>> Sent: 6 July 2011 22:57
>> To: rtgwg@ietf.org
>> Subject: discussion on fast notification work
>>
>> The last 2 IETFs, we have had discussions about the idea of fast 
>> notification, as described in draft-lu-fast-notification-framework, 
>> draft-lu-fn-transport-00, and draft-csaszar-ipfrr-fn-00.
>>
>> Since then, I have not seen substantial discussion or interest on the 
>> mailing list.  If you are interested in this work, have questions 
>> about it, or would like to see RTGWG continue to discuss it, please 
>> send email to this mailing list.  I'd like to see this conversation 
>> happening here before IETF.
>>
>> Thanks,
>> Alia
>> _______________________________________________
>> rtgwg mailing list
>> rtgwg@ietf.org
>> https://www.ietf.org/mailman/listinfo/rtgwg
>>