Re: [v6ops] SLAAC renum: Problem Statement & Operational workarounds

Ole Troan <otroan@employees.org> Wed, 30 October 2019 11:01 UTC

From: Ole Troan <otroan@employees.org>
In-Reply-To: <FACE45EC-27FC-437A-A5BF-D800DF089B50@fugue.com>
Date: Wed, 30 Oct 2019 12:01:45 +0100
Cc: Philip Homburg <pch-v6ops-9@u-1.phicoh.com>, v6ops@ietf.org
Message-Id: <837E9523-14FC-4F6C-88FC-DCC316265299@employees.org>
References: <m1iPlMZ-0000J5C@stereo.hq.phicoh.net> <FACE45EC-27FC-437A-A5BF-D800DF089B50@fugue.com>
To: Ted Lemon <mellon@fugue.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/61TVjhoJ_D9mWJZNuiPb-1X7P9k>
Subject: Re: [v6ops] SLAAC renum: Problem Statement & Operational workarounds

>>>> For IPv4 + NAT, if you flash renumber the upstream address then existing
>>>> connections will be stuck.
>>> Not exactly. They do fairly quickly get sent TCP RSTs in most cases.
>> 
>> That's not my experience. I hardly ever see a CPE generate a RST when a flow
>> doesn't exist.
> 
> To be clear, the issue is that in order for a TCP RST to be sent, the TCP message has to be delivered to the endpoint in the first place. The CPE isn’t going to randomly send TCP RSTs to flows that don’t terminate on it. That is, it is not going to watch TCP traffic through it, notice that the source address on a particular message is not on the prefix it is currently advertising, and then construct a TCP RST to break that connection.
> 
> So in practice, if a renumbering has happened in this way, what is going to happen is that the TCP connection is going to hang for ninety seconds until it times out.

What a CPE does do is send an ICMPv6 Destination Unreachable (code 5, Source address failed ingress/egress policy), since the source address would fail the unicast RPF check. See RFC 7084, L-14.
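
To make the check concrete, here is a minimal sketch of the decision a CPE could make for a packet arriving from the LAN. The function name and the prefix list are hypothetical, and a real router does this in the forwarding path rather than in Python; only the ICMPv6 type and code values are taken from the ICMPv6 specification.

```python
import ipaddress

ICMPV6_DEST_UNREACH = 1      # ICMPv6 type 1: Destination Unreachable
CODE_SRC_FAILED_POLICY = 5   # code 5: source address failed ingress/egress policy


def classify_lan_packet(src, valid_prefixes):
    """Return ("forward", None) if src is inside a currently valid LAN prefix,
    otherwise ("drop", (icmp_type, icmp_code)) for the error to send back.

    Hypothetical helper for illustration only.
    """
    src_addr = ipaddress.IPv6Address(src)
    for prefix in valid_prefixes:
        if src_addr in ipaddress.IPv6Network(prefix):
            return ("forward", None)
    # Source is from a prefix the CPE no longer considers valid.
    return ("drop", (ICMPV6_DEST_UNREACH, CODE_SRC_FAILED_POLICY))
```

A host that still uses an address from the old prefix after a renumbering would get the ("drop", (1, 5)) result here, which is the signal it can act on instead of waiting for a TCP timeout.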

>> This is exactly what happens today in many cases. Of course, where developers
>> get annoyed by this behaviour, they implement shorter timeouts at the
>> application level. I.e., a typical web browser doesn't wait for the host to
>> report an error on a TCP connection.
> 
> This should not be happening often enough that application developers are adding shorter timeouts to their code. A CPE reboot is an exceptional event, and a deliberate renumbering by the ISP ought to be as well. If they are doing that, I suspect it’s independent of the renumbering problem.
> 
> It would not surprise me if there are CPEs that behave badly when the ISP renumbers, and ISPs that behave badly when a prefix that is still valid is, from their perspective, no longer on offer. We ought to characterize that problem and propose solutions to it. By “characterize,” I mean look at what actual CPEs do, not theorize, although I think theorizing isn’t bad either.


You can generalise the renumbering problem to a multi-prefix multi-homing problem. Whenever a host has multiple addresses, the application must continuously evaluate all possible paths and react to path failures. An invalidated prefix is just a reachability problem close to the host, and theoretically no different from a path failure multiple hops away.
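
As a rough illustration of what "evaluate all possible paths" means at connection setup, the sketch below tries each candidate source address in turn and moves on when one fails. The function name is hypothetical, the `dial` parameter is only there so the logic can be exercised without a network, and the sketch covers connection setup only; it does not handle a path that fails mid-connection.

```python
import socket


def connect_via_any_source(dest, source_addrs, timeout=2.0,
                           dial=socket.create_connection):
    """Try each local source address until one yields a connection.

    dest is a (host, port) tuple. Returns (source_address, socket).
    Hypothetical helper; a real application would likely race attempts
    rather than try them one after another.
    """
    errors = []
    for src in source_addrs:
        try:
            # Port 0 lets the OS pick the local port for this source address.
            sock = dial(dest, timeout=timeout, source_address=(src, 0))
            return src, sock
        except OSError as exc:
            # A stale prefix and a failure several hops away look the same here.
            errors.append((src, exc))
    raise OSError(f"no usable source address for {dest}: {errors}")
```

With a short timeout, an address from an invalidated prefix costs the application a couple of seconds rather than a full TCP timeout.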

In the case where a requesting router loses a prefix, it can of course signal that to the hosts, as proposed in Fernando's draft. But that doesn't solve the general problem, and I'm skeptical of using addressing as a reachability signal. E.g. if the CPE's upstream link flaps, is that enough to trigger deprecation of the prefix? Of course it shouldn't.
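
One way to read that distinction, as a minimal sketch with a hypothetical function name: the decision to deprecate is driven by comparing what the CPE advertises on the LAN with what it currently holds by delegation, and the link state is not an input at all.

```python
def prefixes_to_deprecate(advertised, delegated):
    """Return advertised prefixes that are no longer delegated to the CPE.

    Both arguments are iterables of prefix strings. Link up/down events
    are deliberately not part of the decision. Hypothetical helper.
    """
    return set(advertised) - set(delegated)
```

After a link flap where the same prefix is still delegated, the result is empty and nothing is deprecated; only an actual change in the delegation produces a prefix to deprecate.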

Ole