Re: [IPsec] criteria for failure detection

Yoav Nir <ynir@checkpoint.com> Mon, 22 March 2010 20:04 UTC

From: Yoav Nir <ynir@checkpoint.com>
To: "Hu, Jun (Jun)" <jun.hu@alcatel-lucent.com>
Date: Mon, 22 Mar 2010 22:04:41 +0200
Cc: "ipsec@ietf.org" <ipsec@ietf.org>
Subject: Re: [IPsec] criteria for failure detection

Failure detection by either QCD or SIR takes 1-2 roundtrips; whether this is subsecond or not depends on the round-trip time.

In case of a non-synchronized hot-standby gateway, those 2 roundtrips begin the moment that the standby gateway becomes active, so the call probably doesn't get dropped.

In case of a single gateway, these two roundtrips start when the gateway is back online.
- with a manual tunnel deletion, this can be immediate.
- with a restart of IPsec services, this can be 10-15 seconds (for a particular implementation)
- with a reboot of the device, this is 0.5-2 minutes.
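
The arithmetic behind these cases can be sketched as below. This is only an illustration: the function name, the 100 ms round-trip time, and the choice of the upper end of each range are assumptions, not taken from any implementation.

```python
def detection_delay(recovery_s: float, rtt_s: float, roundtrips: int = 2) -> float:
    """Seconds from failure until the peer has detected it.

    recovery_s: time until the gateway is back online (assumed value)
    rtt_s:      round-trip time between the peers (assumed value)
    roundtrips: roundtrips needed by the detection exchange (1-2 for QCD or SIR)
    """
    return recovery_s + roundtrips * rtt_s


RTT = 0.1  # assumed 100 ms round-trip time

# Upper ends of the ranges quoted above, in seconds.
scenarios = {
    "manual tunnel deletion": 0.0,
    "restart of IPsec services": 15.0,
    "reboot of the device": 120.0,
}

for name, recovery in scenarios.items():
    print(f"{name}: {detection_delay(recovery, RTT):.1f} s")
```

In every case except the manual deletion, the recovery time dominates and the two roundtrips are negligible by comparison.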

The work item is about shortening the time that the actual detection takes, from the several minutes specified in RFC 4306 to something more manageable, like two roundtrips.

Without an HA implementation, this is not about keeping the current calls, but about being able to establish new calls as soon as possible.

Yoav

On Mar 22, 2010, at 12:13 PM, Hu, Jun (Jun) wrote:

> Hi,
> First of all, I am not sure whether this fits into the existing "supported scenarios" criteria or is a new one. Failure detection time is critical to some services that run over an IPsec tunnel: services like VoIP can only tolerate sub-second (or 1~2 seconds max) transport failure; otherwise the call will be dropped. However, it seems to me that the currently proposed solutions all depend on reception of "INVALID_SPI" from the failed node AFTER reboot, which usually takes much longer than 1~2 seconds. This will result in the interruption of those services.
> 
> Of course, a good HA implementation may solve this issue; however, a fast failure detection mechanism can also help the host switch to a backup tunnel (or other route) ASAP, before the service gets interrupted.
> 
> ---------------
> Hu Jun