Stability and Resilience (was Re: [v6ops] A common...)

Lee Howard <lee@asgard.org> Fri, 22 February 2019 16:36 UTC

Subject: Stability and Resilience (was Re: [v6ops] A common...)
To: ipv6@ietf.org, IPv6 Ops WG <v6ops@ietf.org>
References: <6D78F4B2-A30D-4562-AC21-E4D3DE019D90@consulintel.es> <B6E2EC33-EEAF-40D0-AFCC-BDAFA9134ACD@consulintel.es> <20190220113603.GK71606@Space.Net> <28fbc2c305c640c9afb3704050f6e8d7@boeing.com> <20190220213107.GS71606@Space.Net> <019c552eb1624d348641d6930829fd1f@boeing.com> <CAKD1Yr0HBG+rhyFWg9zh0t3mW486Mjx9umjn+CRqAZg4z9r0dg@mail.gmail.com> <20190221073530.GT71606@Space.Net> <CAO42Z2wmB2W52b4MZ2h9sW5E9cQKm-HRjyf--q8C26jezS7LXQ@mail.gmail.com> <a73818d31db7422b99a524bc431b00ed@boeing.com> <CAO42Z2z9-48Gbb_Exf+oWUqDO=axSLpZBtqeDcxkAoFq5OziGw@mail.gmail.com>
From: Lee Howard <lee@asgard.org>
Message-ID: <0629af5e-5e1b-7e01-5bf4-b288a2d36809@asgard.org>
Date: Fri, 22 Feb 2019 11:36:19 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
MIME-Version: 1.0
In-Reply-To: <CAO42Z2z9-48Gbb_Exf+oWUqDO=axSLpZBtqeDcxkAoFq5OziGw@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------303D1DAE42CA8ED37CD5213C"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipv6/-GRD3vutEKlpao61odl7nq5KE5k>

I think I have heard the following suggestions in this conversation. I 
hope that taken all together, rather than as individual spot solutions, 
they can be a consensus recommendation.


ISPs should, as much as possible, reissue the same prefix to customers. 
Some things ISPs can do to increase the chances of this:

 1. Share lease information between redundant DHCPv6 servers. Most ISPs
    probably have redundant servers, since this is critical provisioning
    infrastructure. It may be difficult to sync information between
    servers for millions of leases over tens of milliseconds of latency;
    see RFC 6853, "DHCPv6 Redundancy Deployment Considerations." Maybe
    DHCP vendors can report.

 2. Aggregate above the provider edge device, so that grooming customers
    between Provider Edge boxes (PEs) doesn't force a renumbering. It's
    been a few years since I worked on CMTSs, but when I did they did
    not support MP-BGP well (if at all), so routes had to be aggregated
    on the PE, or leaked into the IGP, which is bad for convergence time.
    Maybe PE vendors can report.

 3. Set DHCPv6 lease timers very low prior to grooming events. A short
    interval during the maintenance window will increase load on the
    DHCPv6 server until timers have been returned to normal values.

 4. In the case of a PE reboot, use DHCPv6 Bulk Leasequery to rebuild
    the routing table. I think all of the necessary information is in
    those responses. Again, last time I was working on CMTSs, this
    feature was not supported. Maybe PE vendors can report.
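
To put a rough number on the load increase mentioned in item 3, here is a
back-of-the-envelope sketch. It assumes renewals are spread evenly over the
renewal interval (T1) and that the lowered timer has already reached all
clients. The lease count and timer values are made up for illustration, not
taken from any real deployment.

```python
def renew_rate(leases: int, t1_seconds: float) -> float:
    """Average Renew messages per second, assuming renewals are evenly
    spread across the T1 interval (a simplification)."""
    if t1_seconds <= 0:
        raise ValueError("T1 must be positive")
    return leases / t1_seconds


# Hypothetical figures: one million leases, T1 of 12 hours in normal
# operation versus 5 minutes during the maintenance window.
normal = renew_rate(1_000_000, 12 * 3600)
window = renew_rate(1_000_000, 5 * 60)

print(f"normal: {normal:.1f}/s, maintenance: {window:.1f}/s, "
      f"increase: x{window / normal:.0f}")
```

The increase is just the ratio of the two timers, so how low to go is a
trade-off between how quickly clients pick up a change and how much headroom
the DHCPv6 servers have.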


Networks should, as much as possible, be resilient to prefix changes. 
Some things networks can do to improve resilience:

 1. Write a learned prefix to non-volatile memory and issue a DHCPv6
    Renew for that prefix on reboot.

 2. Use dynamic DNS and shorter TTLs.

 3. Implement something like NETCONF to distribute prefix information to
    policy devices like firewalls or SD-WAN controllers. I think a
    separate document describing this application of NETCONF would make
    sense.
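
As an illustration of item 1 in this list, here is a minimal sketch of the
"remember the prefix across reboots" part. The file format and the function
names save_prefix and load_prefix are hypothetical, and the DHCPv6 exchange
itself is out of scope: the sketch only stores the delegated prefix durably
and reads it back at boot so that the client can ask for it again.

```python
import ipaddress
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_prefix(path: Path, prefix: str) -> None:
    """Persist the delegated prefix, replacing the old file atomically."""
    net = ipaddress.IPv6Network(prefix)  # raises ValueError if malformed
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump({"prefix": str(net)}, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches stable storage
    os.replace(tmp, path)  # atomic rename: never leaves a half-written file


def load_prefix(path: Path) -> Optional[ipaddress.IPv6Network]:
    """Return the stored prefix, or None if absent or unreadable."""
    try:
        return ipaddress.IPv6Network(json.loads(path.read_text())["prefix"])
    except (OSError, ValueError, KeyError, TypeError):
        return None
```

On boot, the CE would call load_prefix() and, if it returns a prefix, request
that prefix from the DHCPv6 server; if it returns None, the CE falls back to
a normal solicitation.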


In the case of failures, it cannot be assumed that sessions will stay 
active. We try to build in redundancy and resilience where we can, but 
where there's a single point of failure (such as a CE or PE), and it fails 
(such as an unplanned reboot), we should not expect sessions to survive.

Is this a reasonable summary?

Lee