Re: [v6ops] Opsdir last call review of draft-ietf-v6ops-slaac-renum-03

Fernando Gont <fgont@si6networks.com> Thu, 10 September 2020 09:15 UTC

To: Jürgen Schönwälder <j.schoenwaelder@jacobs-university.de>, ops-dir@ietf.org
Cc: draft-ietf-v6ops-slaac-renum.all@ietf.org, last-call@ietf.org, v6ops@ietf.org
References: <159968910157.15345.3077847299653382902@ietfa.amsl.com>
From: Fernando Gont <fgont@si6networks.com>
Message-ID: <03acb49d-9c05-521a-9bf8-40da16c5f7a7@si6networks.com>
Date: Thu, 10 Sep 2020 06:00:02 -0300
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.9.1
MIME-Version: 1.0
In-Reply-To: <159968910157.15345.3077847299653382902@ietfa.amsl.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/8b6CMUEJHYQVHGyjOxnEwnZwnAY>

Hi, Jürgen,

Thanks a lot for your comments! In-line....

On 9/9/20 19:05, Jürgen Schönwälder via Datatracker wrote:
[....]
> 
> Perhaps indicate a bit earlier what unacceptably long means, i.e. we
> are talking about days and weeks.

This is a bit subjective. If I'm sitting at my computer doing e.g. 
video-conferencing (i.e., anything interactive), probably anything over 
a few minutes would be unacceptable. In a more general case, what's 
acceptable is a function of how often the problem happens and whether 
there's any ongoing interactive usage -- and that's still subjective.


> The scenarios described read a bit
> like somewhat rare events and hence it is useful for the reader to
> have an idea what unacceptably long means in such events.

I'm wondering if adding something like:
" Any definition of what is considered 'acceptable' here would be 
subjective, and would probably also depend on how often these 
flash-renumbering events occur, whether the affected hosts are employing 
any interactive applications, and other parameters. However, one rough 
estimate would be that hosts should be able to deal with 
flash-renumbering events with a similar timeliness with which they can 
deal with failing default routers."

would help?


> (BTW, I find
> the scenario not described at the beginning where a router announces
> SLAAC lifetimes that are not synchronized with obtained prefix
> lifetimes operationally the more tricky problem since this can lead to
> regular failures.)

Fair enough. How about adding this to the bulleted-list:

" o A router (e.g. Customer Edge router) may advertise autoconfiguration 
prefixes corresponding to prefixes learned via DHCPv6-PD with constant 
PIO lifetimes that are not synchronized with the DHCPv6-PD lease time 
(as required in Section 6.3 of [RFC8415]). While this behavior violates 
the aforementioned requirement from [RFC8415], it is not an unusual 
behavior, particularly when e.g. DHCPv6-PD is implemented in a different 
software module than the SLAAC router component.".

?
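
To make the mismatch in the proposed bullet concrete, here is a minimal sketch of how a router could cap the advertised PIO lifetimes at the time remaining on the DHCPv6-PD lease, rather than advertising constant values. The helper name `pio_lifetimes` and its parameters are hypothetical and not taken from the draft or from any implementation; the defaults are the RFC 4861 values (30 days valid, 7 days preferred).

```python
# Hypothetical helper: not from the draft or any real router implementation.
RFC4861_VALID = 2592000      # AdvValidLifetime default: 30 days, in seconds
RFC4861_PREFERRED = 604800   # AdvPreferredLifetime default: 7 days, in seconds


def pio_lifetimes(pd_valid_remaining, pd_preferred_remaining,
                  default_valid=RFC4861_VALID,
                  default_preferred=RFC4861_PREFERRED):
    """Return (valid, preferred) PIO lifetimes, in seconds, that never
    exceed the time remaining on the delegated prefix."""
    # Never advertise more than what is left on the lease (and never < 0).
    valid = min(default_valid, max(0, pd_valid_remaining))
    # Preferred lifetime must not exceed the valid lifetime either.
    preferred = min(default_preferred, max(0, pd_preferred_remaining), valid)
    return valid, preferred


# One hour left on the lease, 30 minutes of it preferred:
print(pio_lifetimes(3600, 1800))   # (3600, 1800)
```

A router that skips this step and always sends the constants would keep advertising 30 days even when the lease is about to expire, which is the behavior the bullet describes.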



> Section 2.2 seems to confuse soft-state (this is what a learned IPv6
> prefix is for me) with certain protocol timers. There are many places
> where protocols use soft-state and implementations use timers to purge
> or refresh soft-state. That timers generally do not go off in normal
> conditions is not really correct in this context, DHCP leases are
> renewed when their lifetime expires, a normal operation. 

Normally, you renew the lease before the lease expires.


> IP address
> mappings to Ethernet addresses expire when their lifetime timer goes
> off. 

This one is not necessarily the best example ;-) (while RFC1122 
requires that, IIRC in many implementations the entry is refreshed when 
referenced, and it only expires when not referenced/refreshed frequently 
enough).

But I do see where you are going and I realize that the text is a bit 
sloppy in this respect. How about tweaking the text as follows:

---- cut here ----
    Many protocols, from different layers, normally employ timers for
    fault isolation/recovery.  The general logic is as follows:

    o  A timer is set with a value such that, under normal conditions,
       the timer does *not* go off.

    o  Whenever a fault condition arises, the timer goes off, and the
       protocol can perform fault recovery.

    For example, when implementing reliability mechanisms, a timer is
    normally set when a packet is transmitted and, unless a response is
    received before the timer goes off, a fault recovery action (such as
    packet re-transmission) is triggered.
---- cut here ----

?
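
The fault-recovery logic in the proposed text can be sketched as follows. This is only an illustration: `send` and `wait_for_reply` are hypothetical callables supplied by the caller, and doubling the timeout after each expiry is an assumption borrowed from common practice, not something the text above requires.

```python
def send_with_retransmit(send, wait_for_reply, timeout=1.0, max_tries=3):
    """Transmit, arm a timer, and retransmit each time the timer goes off.

    send:            hypothetical callable that transmits the packet.
    wait_for_reply:  hypothetical callable taking a timeout in seconds and
                     returning True if a response arrived in time.
    Returns the attempt number that succeeded, or None on giving up.
    """
    for attempt in range(1, max_tries + 1):
        send()
        if wait_for_reply(timeout):
            # Normal conditions: the response beat the timer.
            return attempt
        # The timer went off: treat it as a fault and back off (assumption).
        timeout *= 2
    return None
```

Under normal conditions the function returns on the first attempt and the timer never fires; the retransmission path only runs when something has gone wrong.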

One might also look at this same issue as the timer implying a sensible 
period of time where information should be refreshed, as you correctly 
point out, though.

(I guess the only difference is that when looking at this from the 
soft-state angle, you're mostly considering the case where information 
changes, whereas when looking at this from the fault-recovery pov, 
you're mostly thinking about failures, rather than updates).


> Switches purge forwarding state regularly when forwarding entries
> expire. Cached DNS name to IP resolutions expire. The only problem
> here seems to be that a lifetime of 7 days / 30 days is a bit
> ridiculous.

Agreed.


> Is anyone shipping the RFC 4861 defaults? 

Yes, unfortunately. Some implementations override the RFC4861 defaults. 
Still, RFC4861 defaults are extremely common and widespread.



> The few
> implementations I have seen do use a bit more reasonable defaults.  I
> think this section should be rewritten to replace the "timer going off
> is associated with a failure" text with a discussion of soft-state in
> other protocols. (Section 2.2 is why I ticked 'has issues'.)

As a second alternative to what I've suggested above:

---- cut here ----
    Many protocols, from different layers, normally employ timers for a
    variety of purposes, such as in fault isolation/recovery mechanisms,
    and in the maintenance of data structures that contain bindings of
    some sort (e.g., the IPv6 Neighbor Cache [RFC4861]).

    In the case of fault recovery/isolation, the general logic is as
    follows:

    o  A timer is set with a value such that, under normal conditions,
       the timer does *not* go off.

    o  Whenever a fault condition arises, the timer goes off, and the
       protocol can perform fault recovery.

    For example, when implementing reliability mechanisms, a timer is
    normally set when a packet is transmitted and, unless a response is
    received before the timer goes off, a fault recovery action (such as
    packet re-transmission) is triggered.

    On the other hand, when maintaining bindings in data structures,
    timers are usually selected in a way that any bindings that become
    stale are updated in a timely manner.
---- cut here ----


?
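
For the second use of timers in the text above (maintaining bindings), a minimal sketch of a cache whose entries expire after a fixed lifetime may help. The class `BindingCache` and its methods are hypothetical names chosen for illustration; time is passed in explicitly (in seconds) so that the behavior is easy to follow.

```python
class BindingCache:
    """Toy soft-state table: each binding is dropped once its lifetime ends."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # seconds a binding stays usable
        self._entries = {}         # key -> (value, expiry time)

    def learn(self, key, value, now):
        # Learning (or re-learning) a binding restarts its timer.
        self._entries[key] = (value, now + self.lifetime)

    def lookup(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires = entry
        if now >= expires:
            # The timer went off: purge the stale binding.
            del self._entries[key]
            return None
        return value
```

With a lifetime of 30 seconds, a binding that is never refreshed disappears half a minute after it was learned. With a lifetime of 2592000 seconds (30 days), the same stale binding would still be returned a day later, which is the timeliness concern raised in the review.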



> Isn't a part of the solution (other than moving to less ridiculous
> default) that SLAAC hosts experiencing connectivity problems should
> try to validate the prefix that they have learned (and if the
> validation fails move to a newly learned prefix)?

Yes, indeed. That's what we are pursuing in draft-ietf-6man-slaac-renum 
(see Section 4 of this document, draft-ietf-v6ops-slaac-renum-03).

draft-ietf-v6ops-slaac-renum-03 contains the problem statement and 
*operational* mitigations only.


> Involving the hosts
> in a resolution of the problem may be more robust than expecting that
> something in the network takes care of invalidating stale soft-state.

I agree 100%. That is and has been, indeed, the motivation for pursuing 
draft-ietf-6man-slaac-renum.

Thanks!

Regards,
-- 
Fernando Gont
SI6 Networks
e-mail: fgont@si6networks.com
PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492