RE: Questions on RSVP-TE Graceful Restart and the new Extensions

"Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com> Fri, 02 November 2007 04:18 UTC

From: "Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com>
To: Dan Li <danli@huawei.com>, Adrian Farrel <adrian@olddog.co.uk>, "Ccamp (E-mail)" <ccamp@ops.ietf.org>

Hi Dan,

I have come up with the following text for the first problem:

+--------------------------+
Additional Restarting Node procedures

According to the current graceful restart procedure [RFC3473], after a node
restarts its control plane, it needs its upstream node to send a PATH message
with a recovery label to synchronize its RSVP state. If the restarted control
plane becomes operational quickly relative to the hello interval, the upstream
node may not detect the restart of the downstream node and may therefore send a
PATH message without a recovery label, causing errors and unwanted connection
deletion.

To resolve the aforementioned problem, the following procedures are proposed.
They are meant to work together with the recovery procedures documented in
[RFC3473]. Here, it is assumed that the restarting node and the neighboring
node(s) support the Hello extension as documented in [RFC3209] and the recovery
procedures documented in [RFC3473].

After a node restarts its control plane, it should ignore and silently drop
all RSVP-TE messages, except Hello messages, that it receives from any neighbor
with which no Hello session has been established.

The restarting node should follow [RFC3209] to establish Hello sessions with
its neighbors after its control plane becomes operational.

The restarting node resumes processing of RSVP-TE messages sent from each
neighbor with which a Hello session has been established.
+---------------------------+
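
The proposed procedure can be sketched as a small message filter. This is only an illustration under stated assumptions: the class `RestartingNode`, its method names, and the message-type strings are hypothetical and do not come from any RFC or implementation, and a Hello session is treated as established once the node has been told so explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class RestartingNode:
    """Toy model of a node whose control plane has just restarted."""

    # Neighbors with which a Hello session has been (re-)established.
    hello_up: set = field(default_factory=set)
    # Messages accepted for processing, in arrival order.
    processed: list = field(default_factory=list)

    def hello_established(self, neighbor: str) -> None:
        """Record that the Hello session with `neighbor` is up."""
        self.hello_up.add(neighbor)

    def receive(self, neighbor: str, msg_type: str) -> bool:
        """Return True if the message is processed, False if silently dropped."""
        if msg_type == "Hello":
            # Hellos are always processed so that the session can come up.
            self.processed.append((neighbor, msg_type))
            return True
        if neighbor not in self.hello_up:
            # No Hello session yet: ignore and silently drop.
            return False
        self.processed.append((neighbor, msg_type))
        return True


n2 = RestartingNode()
dropped = n2.receive("N1", "SRefresh")   # arrives before Hellos are up
n2.receive("N1", "Hello")
n2.hello_established("N1")
accepted = n2.receive("N1", "Path")      # now processed normally
```

In this model the early SRefresh is dropped rather than NACKed, so N1 is not prompted to send a PATH message without a recovery label before it has learned of the restart.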

Could you please incorporate this into the GR-Description I-D?

Regarding the second issue, I think the real problem starts when, after the
communication failure, nodes C, D and E all restart.

             A---B-x...x-C---D---E-x...x-F---G

In the example above, nodes C and E are isolated from their neighboring nodes B
and F. In this scenario the LSPs may get deleted from D after the recovery timer
expires. The reason is that, since both C and E have restarted, neither node
holds any RSVP-TE state from which the state in node D could be recovered.
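
The reasoning can be checked with a toy reachability model. It is a sketch, not a description of any implementation: the function `recoverable` is hypothetical, and it assumes that a restarted node can relearn LSP state from either adjacent node on the path, provided the link between them is up and that neighbor still holds (or has itself recovered) the state. Recovery timers are not modeled.

```python
def recoverable(path, restarted, down_links):
    """Return the restarted nodes that can get their state back from a neighbor."""
    # Nodes that did not restart still hold their RSVP-TE state.
    has_state = {n for n in path if n not in restarted}
    changed = True
    while changed:
        changed = False
        for a, b in zip(path, path[1:]):
            if (a, b) in down_links:
                continue  # no control-plane communication across this link
            for src, dst in ((a, b), (b, a)):
                if src in has_state and dst not in has_state:
                    has_state.add(dst)  # dst resynchronizes from src
                    changed = True
    return has_state & set(restarted)


path = list("ABCDEFG")
failed = {("B", "C"), ("E", "F")}

all_three = recoverable(path, {"C", "D", "E"}, failed)
only_d = recoverable(path, {"D"}, failed)
```

With C, D and E all restarted and both links down, `all_three` is empty: no node in the isolated segment has state to offer. If only D had restarted, `only_d` contains D, since C or E could still resynchronize it.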

Thanks,
Snigdho

-----Original Message-----
From: Dan Li [mailto:danli@huawei.com]
Sent: Monday, October 08, 2007 1:14 AM
To: Adrian Farrel; Bardalai, Snigdho; Ccamp (E-mail)
Subject: Re: Questions on RSVP-TE Graceful Restart and the new Extensions


Hi,

Please see my comments below.

Regards,

Dan

----- Original Message ----- 
From: "Adrian Farrel" <adrian@olddog.co.uk>
To: "Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com>; "Ccamp (E-mail)" <ccamp@ops.ietf.org>
Sent: Saturday, October 06, 2007 7:12 PM
Subject: Re: Questions on RSVP-TE Graceful Restart and the new Extensions


> Hi Snigdho,
> 
> Always good to have reports of use of the more advanced functions.
> 
> Some thoughts...
> 
> We can do some work to fix up the SRefresh behavior, but I don't think it 
> actually helps, because if Summary Refresh is not being used, exactly the 
> same scenario can arise with a Path Refresh. In other words, fixing the 
> SRefresh would not fix the problem.

[dan] Agree. Changing the behavior of SRefresh message would not help.

> 
> In some circumstances, the window in the scenario you have drawn is very 
> small. The SRefresh *and* the Path Refresh must be sent by N1 between N2 
> completing restart and N2's first Hello arriving at N1. One could say that 
> N2 should ignore all received messages until it has Hellos up and running. 
> That would guarantee that N1 knew about the restart.

[dan] The time window size depends on how quickly the restarted node sends 
out the Hello message after the communication comes back. I would like to see 
N2 ignore all received messages until it receives the ACK of the Hello message 
from N1.
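
The window can be made concrete with the figures from the original question in this thread (hello interval 10 seconds, failure multiple 3, restart within 10 seconds). The sketch below is an assumption-laden simplification: the function name is hypothetical, and Hello failure is modeled simply as the neighbor being silent for at least interval times multiple.

```python
def hello_failure_detected(hello_interval_s: float,
                           failure_multiple: int,
                           restart_duration_s: float) -> bool:
    """True if the neighbor is silent long enough for the Hello timeout to fire."""
    detection_time_s = hello_interval_s * failure_multiple
    return restart_duration_s >= detection_time_s


# Figures from the original question: 10 s interval, multiple 3, 10 s restart.
missed = not hello_failure_detected(10, 3, 10)
```

Here the detection time is 30 seconds, so a 10-second restart goes unnoticed by the timeout and `missed` is true. N1 then learns of the restart only when it sees a new source instance in a Hello from N2.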
 
> 
> That simply requires that when startup completes N2 must:
> - either
>     - immediately send a new Hello to each neighbor
>   or
>      - respond to any first received message from a
>        neighbor with which it does not have an active
>        Hello exchange by sending a Hello
> - ignore all subsequent RSVP messages except Hellos
>   from neighbors with which it does not have an active
>   Hello exchange
>

[dan] Agree with Adrian.
 
> We should understand why N2 sends PathErr. "Resources in use" could either 
> mean that *all* resources are already in use, or that the limited resources 
> required on the Path message (i.e. Upstream Label or Label Set) are already 
> in use.
> 
> In all cases, the error code should indicate that the requested new resource 
> allocation cannot be satisfied (N2 thinks this is a new LSP).
> 
> In the case of no resource being available, the PathErr MUST [3473] contain 
> "Routing problem/MPLS label allocation failure".
> In the case of failure because the Label Set cannot be satisfied, the 
> PathErr MUST [3473] carry "Routing problem/Label Set". In the case of 
> failure because of Upstream Label N2 MUST [3473] send a PathErr with 
> "Routing problem/Unacceptable label value" and MAY include an Acceptable 
> Label Set object.
> 
> So, I would suggest that N1 may be over-reacting to the PathErr from N2. 
> Presumably N1 expects that the LSP is up and running - it already has Resv 
> state and data is probably flowing. So any of these three error codes (all 
> of which represent LSP setup errors) should cause N1 to suspect something 
> slightly strange is happening. Before getting too agitated and taking the 
> dramatic step of tearing the LSP, it should just check that everything else 
> is functioning correctly, and part of that process would be to send a Hello 
> if one has not been sent/received for a considerable period of time.
> 
> In fact, your situation arises either because the Hello period is set far too 
> large (i.e. larger than the neighbor's whole restart cycle) or because N1 is 
> not considering any difference between normal and hello-degraded states. The 
> former is a configuration error. The latter should allow the PathErr to be 
> treated with more care than a simple PathTear.

[dan] Usually the traffic carried by the data plane is required not to be
touched when the control plane fails. So N1 should be very careful about taking
any action to tear down the LSP.
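
A cautious N1 along these lines could be sketched as follows. Everything here is illustrative: `Action`, `on_path_err`, and the boolean inputs are hypothetical names, the error values are the three strings quoted above rather than numeric codes, and the sketch covers only the Hello check, not the rest of PathErr handling.

```python
from enum import Enum


class Action(Enum):
    SEND_HELLO_AND_DEFER = "send a Hello and defer any teardown"
    HANDLE_NORMALLY = "continue with normal PathErr handling"


# The three "Routing problem" error values discussed in the thread.
# All of them indicate an LSP setup failure.
SETUP_ERRORS = {
    "MPLS label allocation failure",
    "Label Set",
    "Unacceptable label value",
}


def on_path_err(error_value: str,
                lsp_has_resv_state: bool,
                hello_recent: bool) -> Action:
    """Decide how N1 reacts to a PathErr received for an LSP."""
    suspicious = error_value in SETUP_ERRORS and lsp_has_resv_state
    if suspicious and not hello_recent:
        # A setup error on an LSP believed to be up suggests the neighbor
        # may have restarted; probe with a Hello before tearing anything down.
        return Action.SEND_HELLO_AND_DEFER
    return Action.HANDLE_NORMALLY


decision = on_path_err("Label Set", lsp_has_resv_state=True, hello_recent=False)
```

The point of the design is that a setup error for an LSP that already has Resv state is treated as a hint to re-check the neighbor, not as grounds for an immediate PathTear.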

> 
> With regard to your second question. Yes, we believe that all cases of 
> multiple failures are handled by the restart procedures without refresh 
> failure causing state to be deleted. However, the text for this was removed 
> from the graceful restart draft before publication (actually, long ago) as 
> second order failures really clogged up the draft. Instead, we have 
> draft-ietf-ccamp-gr-description-00.txt.
> 
> I suggest:
> 
> 1. The situation you describe in your first scenario (with and without 
> SRefresh) should be included in the GR-Description I-D.
> 
> 2. You should check the GR-Description I-D to see whether it answers your 
> questions about multiple failures.
> 
> In both cases, I am sure the authors would welcome suggested text and 
> pointers to what they could change.

[dan] Yes, you're welcome to send us the proposed text for the GR-Description I-D.

> 
> Cheers,
> Adrian
> ----- Original Message ----- 
> From: "Bardalai, Snigdho" <Snigdho.Bardalai@us.fujitsu.com>
> To: "Ccamp (E-mail)" <ccamp@ops.ietf.org>
> Sent: Friday, October 05, 2007 5:18 PM
> Subject: Questions on RSVP-TE Graceful Restart and the new Extensions
> 
> 
> Hi,
> 
> I have a couple of questions on RSVP-TE Graceful Restart and the new 
> extensions being proposed in draft-ietf-ccamp-rsvp-restart-ext-09.
> 
> Did anybody come across any issues when the hello interval duration times 
> the failure multiple (typically 3) is too large compared to the neighboring 
> node restart duration? For example, if the RSVP-TE hello interval is 10 
> seconds, the multiple is 3 and the neighboring node restarts within 10 
> seconds, then it is possible that the RSVP-TE hello will never detect a hello 
> failure.
> 
> RFC3473 does describe detection of a node restart in this case based on a 
> new source instance in the hello message, but we have come across an issue 
> with NACKs being generated for an SRefresh message in this scenario.
> 
> Please look at the sequence diagram below:
> 
>   N1                                N2
>   |                                 |
>   |                                 X (Restart start)
>   |  HELLO                          |
>   |-------------------------------->|
>   |                                 |
>   |  SRefresh                       |
>   |-------------------------------->|
>   |                                 |
>   |  HELLO                          |
>   |-------------------------------->|
>   |                                 |
>   |                                 X (Restart complete)
>   |  SRefresh                       |
>   |-------------------------------->|
>   |  NACK                           |
>   |<--------------------------------|
>   |  Path (without recovery label)  |
>   |-------------------------------->|
>   |                                 X (resource allocation failed because
>   |                                 |  the resources are in use)
>   |  PathErr                        |
>   |<--------------------------------|
>   |  PathTear                       |
>   |-------------------------------->|
>   X (CON deletion)                  X (XCON deletion)
>   |                                 |
> 
> The issue is that, because N1 did not detect a hello failure, it continues 
> sending SRefreshes, which may get NACKed by N2 once restart completes because 
> there is no Path state corresponding to the SRefresh message. This NACK causes 
> a Path refresh message to be generated, but it carries no RECOVERY_LABEL 
> because N1 has not yet detected that N2 has restarted, since hello exchanges 
> have not yet started. PLEASE NOTE: This is based on an actual implementation 
> and a real test.
> 
> What is the solution to this issue? I don't see either N1 or N2 doing 
> anything that is not compliant with the current RFCs. Or is there 
> something I have missed?
> 
> The other issue I wanted to understand is with respect to the graceful 
> restart extension. Will the RecoveryPath message handle issues when 
> communication fails and a node restarts? There may be issues when some 
> nodes in the LSP path get isolated from both upstream and downstream ends.
> 
> Example,
> 
>              A---B-x...x-C---D---E-x...x-F---G
> 
> Nodes C, D and E are isolated. If this condition persists and nodes C, D and 
> E restart, will the LSP get deleted after the recovery timer expires in 
> node D? Can this be prevented?
> 
> Would appreciate your response.
> 
> Regards,
> Snigdho
> 
> 
> 
>