Re: Adoption Call for "Improving the Robustness of Stateless Address Autoconfiguration (SLAAC) to Flash Renumbering Events"

Fernando Gont <fgont@si6networks.com> Tue, 30 June 2020 17:37 UTC

To: Lorenzo Colitti <lorenzo=40google.com@dmarc.ietf.org>, Fernando Gont <fernando@gont.com.ar>
Cc: IPv6 List <ipv6@ietf.org>, Bob Hinden <bob.hinden@gmail.com>
From: Fernando Gont <fgont@si6networks.com>
Message-ID: <87f9c636-9f35-f291-79f5-190d8a22dbdc@si6networks.com>
Date: Tue, 30 Jun 2020 14:34:43 -0300
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipv6/X6DcnSYYWGveZHKwzCZmsRQJP-g>

On 29/6/20 04:49, Lorenzo Colitti wrote:
> On Mon, Jun 29, 2020 at 2:43 PM Fernando Gont <fernando@gont.com.ar 
> <mailto:fernando@gont.com.ar>> wrote:
> 
>     reports that 37% of responding ISPs do dynamic prefixes. That seems
>     pretty widespread to me. (not to mention the other possible scenarios).
> 
> 
> Please read the substance of the email I linked earlier. For example: 
> for the problem to occur it is not sufficient that there be flash 
> renumbering. The problem only occurs if there is a flash renumbering AND 
> a crash with loss of state AND layer 2 remains up. I stand by my 
> assertion that that is a rare case.

Use of CPE+switch, or CPE+range extender, is quite common. Not sure why
you'd deem such scenarios "rare". Also, there is no requirement,
IIRC/AFAICT, that a "link down" event cause previous information to be
discarded. For instance, there is an example about TCP retransmissions
in Stevens' TCPv1 book where the author unplugs one of the endpoints,
shows TCP's exponential backoff, then replugs the endpoint, and
everything goes back to normal.

FWIW, it might be worth adding a note about this to this doc, e.g. "Upon
link-down events, SLAAC implementations are expected to verify the
freshness of the network configuration information employed before the
link-down event".



> 
>      > 1. It complicates SLAAC in several ways. It requires hosts to keep
>      > track of a lot more state. It associates PIOs with a particular
>      > router not just for the purpose of routing but also for the purpose
>      > of lifetime processing. It seems to special-case ULA prefixes,
>      > treating them differently from non-ULA prefixes, and even tying them
>      > together ("Only RAs that advertise Global Unicast prefixes may
>      > deprecate Global Unicast Addresses (GUAs), while only RAs that
>      > advertise Unique Local prefixes may deprecate Unique Local Addresses
>      > (ULAs)").
> 
>     The mitigation in Section 4.5 requires only one additional variable per
>     advertised prefix: LTA_LA (a timestamp of when the prefix was last
>     advertised). Is that the "a lot more state" you are referring to?
> 
> It *is* a lot more state compared to what implementations keep now.
> 
> Right now, there's only the two lifetimes.. In theory there's also the 
> router that advertised the prefix, but I believe most popular 
> implementations don't actually store that (it's only required for rule 
> 5.5, and AFAIK only Windows implements rule 5.5).

Most current IPv6 implementations are broken when it comes to
multi-prefix scenarios. In order to fix that, they need to implement
RFC8028. And in the context of RFC8028, requiring this extra timestamp
is a no-brainer.

In any case, this document is a recommendation. "SHOULD" if you wish.
You are free not to implement it if, for some reason, you don't want to.



>      > 2. it attempts to detect network changes using heuristics which I
>      > think will be brittle in the field, in particular, in the presence of
>      > packet loss. We must bear in mind that many handheld devices
>      > intentionally drop significant percentages of multicast traffic
>      > (upwards of 50%), when on Wi-Fi networks because not listening to
>      > multicast traffic at every beacon interval provides very substantial
>      > battery savings.
> 
>     Could you please elaborate on why you think this would make
>     implementations brittle?
> 
>     If such devices can successfully employ SLAAC, there's no reason
>     why the proposed mitigation would make them more brittle. Simply pick
>     LTA_DEPRECATE and LTA_INVALID values that suit you.
> 
> And how do I determine "what suits me"? Can an implementation pick the 
> same value and have it work well on all networks? 

Sure. Set LTA_INVALID to the advertised Router Lifetime (or 2*Router 
Lifetime, if you want to be overly conservative). This will be enough to 
do the associated garbage collection earlier than the current "1 month".

For un-preferring addresses, you do not need to pick any timer value.
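
As a rough illustration of the timer choice above, here is a minimal
Python sketch. It assumes a hypothetical per-prefix record holding only
the time the prefix was last advertised (the LTA_LA timestamp mentioned
earlier in the thread). The names PrefixState, lta_invalid and
is_garbage are made up for this example and do not come from the draft.

```python
from dataclasses import dataclass


@dataclass
class PrefixState:
    """Hypothetical per-prefix state: the prefix plus one extra timestamp."""
    prefix: str
    lta_la: float  # time (in seconds) at which the prefix was last advertised


def lta_invalid(router_lifetime: float, conservative: bool = False) -> float:
    """LTA_INVALID derived from the advertised Router Lifetime.

    Uses 2 * Router Lifetime when the host wants to be overly conservative.
    """
    return 2 * router_lifetime if conservative else router_lifetime


def is_garbage(state: PrefixState, now: float, router_lifetime: float,
               conservative: bool = False) -> bool:
    """True once the prefix has gone unadvertised for longer than LTA_INVALID."""
    return now - state.lta_la > lta_invalid(router_lifetime, conservative)
```

With a Router Lifetime of 1800 seconds, a prefix last advertised at t=0
would be collected at t=2000 under the plain setting, but kept until
t=3600 under the conservative one.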



> It seems to me that it 
> can't, because there are lots of variables that cannot be determined 
> without accumulating state on previous network behaviour such as packet 
> loss and RA frequency. It seems pretty clear that 5 seconds is not great 
> in most scenarios because if RAs are sent infrequently, then a single 
> lost RA will cause the device to conclude that some address is no longer 
> preferred (which by the way isn't really very useful because as long as 
> there is an active TCP connection on a deprecated address, that 
> connection will remain stuck for potentially tens of seconds or even 
> minutes; but new connections will instead use some other prefix, or even 
> use IPv4).

The 5-second value is meant to address the *theoretical* case where PIOs 
are split among multiple RAs. In the real world, they are not. So if you 
receive one RA that does not include the previous PIO, but includes a 
new one, you might as well unprefer the stale address without even 
waiting for the five seconds.
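
The inference described above can be sketched as follows. This is only
an illustration under stated assumptions: the host is assumed to track,
per router, the set of prefixes that router has advertised, and the
function name stale_prefixes and its arguments are hypothetical rather
than taken from the draft.

```python
from typing import Dict, Set


def stale_prefixes(known: Dict[str, Set[str]], router: str,
                   ra_prefixes: Set[str]) -> Set[str]:
    """Return prefixes that look stale after receiving one RA from `router`.

    A previously learned prefix is treated as stale only if this RA omits it
    AND the RA carries at least one prefix not seen from this router before.
    """
    previous = known.get(router, set())
    missing = previous - ra_prefixes   # advertised before, absent now
    new = ra_prefixes - previous       # advertised now, never seen before
    return missing if new else set()
```

An RA that merely omits a prefix without advertising a new one yields an
empty set here, so a single RA lacking PIOs does not by itself cause any
address to be unpreferred.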


[.....]
>      > 3. It only considers PIOs. But SLAAC can convey many parameters that
>      > are specific to the given network or given router. The most obvious
>      > example would be if a router advertises, say, a PIO of 2001:db8::/64
>      > and RDNSS servers of 2001:db8::cafe and 2001:db8::beef. (This is, for
>      > example, what Android does when acting as a router for hotspot
>      > purposes.) Even if the host correctly deprecates the PIOs, the host
>      > will still have a broken DNS configuration. Fixing this would require
>      > complicating the already brittle and complex heuristics in this
>      > document, and will require tying together options like RDNSS and PIO
>      > that are currently not tied together in any way. But there are many
>      > other options that would need to be treated in this way in order to
>      > solve the problem with this approach. For example, the PREF64 option
>      > is potentially dependent on the network attachment. How would the
>      > heuristics need to change for that option?
> 
>     1) The point of the WG adopting a document is for the WG to work on it.
>     It is not necessarily an indication that the document in question is
>     already complete.
> 
> 
> Yup. But I don't think the approach taken by this document is a 
> promising one. I think it adds too much complexity compared to the 
> advantages that it brings, and most importantly, it places a burden on 
> future design work in this area as well.

The advantage that it brings is that hosts may now make more appropriate
use of available addresses, as well as unprefer and discard stale
addresses that would otherwise lie around forever. I do think that's a
significant advantage: robustness.

Nobody requires that any future design (e.g. other option
specifications) do something similar. In fact, implementations are even
free not to implement this specification if they have reasons not to
do so.



>     2) When it comes to the specific example you've cited, I'd say:
>          * Quite normally, you have multiple configured RDNSS servers, for
>     redundancy purposes. So you presumably already have code to use a
>     different RDNSS if the current one doesn't work. So, in that light, the
>     existing code will take care of it.
>          * That said, it would be sensible to set and cap the RDNSS
>     lifetimes a la Section 4.1.2, and, similarly, set the lifetime as a
>     function of the Router Lifetime. This will help with the associated
>     garbage collection. -- i.e., one might want to incorporate this into
>     the document.
>          * If one wanted to further improve/fine tune this with the same
>     logic as in Section 4.5, the idea would be simple: if the same router
>     advertises a new RDNSS, but not the existing ones, simply reduce the old
>     RDNSS lifetimes. However, as per the previous bullets, hosts are already
>     expected to deal with a list of RDNSS, and use the ones that work.
> 
> Sure, we can fix that problem with more complexity. Like I said, if we 
> apply the approach taken in this document for PIOs to DNS, then we need 
> more rules, and more logic, and more dependencies between options. My 
> main concern with this approach is that we have to deal with this 
> complexity for all current options, and likely future options as well.

Address information is the most critical piece: that's why this document 
focuses on that. If the wg wanted to do something similar for other 
options, that's the wg's call -- but that's not what we're currently 
proposing.



>      > 4. A consequence of #3 above is that any *new* option we define also
>      >  needs to update the heuristics, and needs rules on when and how to
>      > invalidate it, potentially by being tied to other options that are
>      > already considered by the heuristics.
> 
>     They need not. If nodes can gracefully deal with stale information
>     provided by such options, there's no need to invalidate them, and hence
>     no need for heuristics. OTOH, if hosts are not able to deal gracefully
>     with stale information provided by such options, and you don't devise a
>     mechanism to take care of such old information, then you have a broken
>     protocol.
> 
> But that's exactly my point. Who decides the answer to the "if" in your 
> sentence above? And who writes the documents that inform implementations 
> of how to deal gracefully with stale information? The WG when working on 
> those future options, right? So we're adding more work to the WG 
> whenever we define new options.

SLAAC does a bad job at dealing with stale information -- well, 
actually, it doesn't do the job at all.

So, if you are writing a spec, you should consider that there are going
to be scenarios where information becomes stale, and will be lying
around for ages. Then, you might:

1) Ignore the issue,

2) Claim that this should not or will not happen -- knowing that it does 
and it will (so this is basically #1 above), or,

3) Specify something that is robust


Being lazy is nice, but usually doesn't solve problems.



>     1) Currently, some specs have "default" values, and at times there are
>     BCPs that have "recommended" values -- such as the default "Router
>     Lifetime" specified in RFC4861, and the "recommended" values in
>     RFC7772.
>     As someone said at the last RIPE IPv6 meeting, default values
>     essentially turn out to be "these values any sane person would
>     override to something else". So, for the values in Section 4.1.1, I'd
>     rather have a Std Track document that specifies sensible default
>     values, rather than having a Std Track document that specifies
>     inappropriate default values, and an operational document that somehow
>     overrides the default values with something sensible.
> 
> Setting defaults seems much more appropriate for an operational document 
> than a standards track - particularly because a standards track document 
> refers to all implementations, whereas operational documents can change 
> the defaults based on the scenario that is being deployed. 

I believe it's fine to have an operational document if a specific 
deployment scenario warrants overriding the default values.

But the default values have to be sane in the first place. I don't think 
that's the case with the current PIO PL/VL default values.


> But if there 
> is consensus in this WG to change the defaults of the existing 
> standards, then that's fine.
> 
>     2) In that light, this document contains what we think are required
>     tweaks to the standards to improve the reaction of slaac to renumbering
>     events.
> 
> 
> Right, but apart from the tweaks, the document also contains pretty 
> fundamental changes to how SLAAC works.. This is the work that I don't 
> think we should take on.

So, in the hope of making progress, how about this: please let us
know which part(s) you object to (I assume Section 4.5?).
You don't need to shoot down the entire document just because there's
one part of it that you disagree with.

In fact, "adoption" means, indeed, that the wg will work on the 
document, and that the document will reflect the WG's view. It doesn't 
even mean that what's in Section 4.5 (or even a variation of it) will 
end up being published.

If I understand correctly, the only part of the document that you object
to is Section 4.5. Is that correct? If so, we already agree on a
lot. So why not progress the parts we all agree with, and
discuss/debate the parts we don't?



>     3) I would expect that the decision to adopt this document does not
>     necessarily imply that the document is published "as is", but rather
>     that we use this document as a starting point. As part of such work, we
>     (wg) might decide to change some things, drop some of the proposed
>     mitigations, or split the document into smaller pieces.
> 
> Yup. Like I said, most of the document consists of simple tweaks that in 
> many cases are already allowed by existing standards. That definitely 
> seems publishable.

Then, let's progress those. And try to resolve or find alternatives for 
the others (Section 4.5, if I understand correctly).



> Another much simpler approach that could be taken to solve this problem 
> is to recommend that if a host receives an RA where previous prefix(es) 
> - or more in general, previous options - have disappeared, then it 
> should attempt to re-check that information's validity in some way 
> (e.g., by attempting off-link connectivity).

That is indeed an option. FWIW, we have tried to focus, specifically, on 
detecting stale information passively (refrain from generating extra 
traffic), and limiting ourselves to inferring when information becomes 
stale.

The problem with e.g. checking information validity as in "attempt 
off-link connectivity" is that, conceptually speaking, you are checking 
much more than whether the information is stale. (e.g. if off-link 
connectivity fails, that's not necessarily an indication that the 
information is stale).

So, in that light, one might want to check information validity by
sending a unicast RS to the router as a final check to see if the
information has indeed vanished/become stale.
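
For illustration only, here is a minimal Python sketch of such a final
check. It assumes the probe a host sends is a Router Solicitation (the
message a host uses to elicit an RA, ICMPv6 type 133 per RFC 4861) and
that the prefixes from the solicited RA have already been parsed into a
set. The function names are hypothetical, and sending and receiving are
left out.

```python
import struct
from typing import Set

ICMPV6_ROUTER_SOLICITATION = 133  # ICMPv6 type, RFC 4861 Section 4.1


def build_router_solicitation() -> bytes:
    """Build a bare Router Solicitation message with no options.

    Layout: Type (1 byte), Code (1 byte), Checksum (2 bytes), Reserved (4 bytes).
    The checksum is left as zero here; it must be filled in before the
    message goes on the wire, which this sketch does not do.
    """
    return struct.pack("!BBHI", ICMPV6_ROUTER_SOLICITATION, 0, 0, 0)


def confirmed_stale(suspect_prefix: str, solicited_ra_prefixes: Set[str]) -> bool:
    """True if the RA received in response still omits the suspect prefix."""
    return suspect_prefix not in solicited_ra_prefixes
```

The point of the design is that the probe asks the router only about its
own advertisements, so a failure elsewhere on the path does not get
mistaken for stale configuration.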

In any case, discussing this sort of thing, and agreeing on something,
is indeed the point of wg adoption: having the wg work on the document,
and output something that reflects wg consensus.

Thanks,
-- 
Fernando Gont
SI6 Networks
e-mail: fgont@si6networks.com
PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492