RE: Comments on draft-yourtchenko-colitti-nd-reduce-multicast

Andrew Yourtchenko <ayourtch@cisco.com> Fri, 21 February 2014 20:38 UTC

Date: Fri, 21 Feb 2014 21:36:46 +0100
From: Andrew Yourtchenko <ayourtch@cisco.com>
To: "Hemant Singh (shemant)" <shemant@cisco.com>
Subject: RE: Comments on draft-yourtchenko-colitti-nd-reduce-multicast
In-Reply-To: <75B6FA9F576969419E42BECB86CB1B89115F9BAE@xmb-rcd-x06.cisco.com>
Message-ID: <alpine.OSX.2.00.1402212129450.52880@ayourtch-mac>
References: <5305AF13.5060201@acm.org> <75B6FA9F576969419E42BECB86CB1B89115F99A9@xmb-rcd-x06.cisco.com> <alpine.OSX.2.00.1402211620560.49053@ayourtch-mac> <75B6FA9F576969419E42BECB86CB1B89115F9BAE@xmb-rcd-x06.cisco.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ipv6/N4dozyIzaOBzzm95qaWDZ30p1os
Cc: Erik Nordmark <nordmark@acm.org>, IETF IPv6 <ipv6@ietf.org>

Hemant,

On Fri, 21 Feb 2014, Hemant Singh (shemant) wrote:

>> If you're not convinced, then you can skip this item for your 
>>deployment. I've seen it once, and there were multiple contributing 
>>factors, so I don't have a good data to convince you. Equally I don't 
>>have a spare lab with >9000 hosts of various operating systems, moving 
>>around. I left it because it does seem like a conservatively prudent 
>>thing to do.
>
> It's not me.  Each time anyone has contacted the IETF mailers such as 
>the 6man or the v6ops WG for an avalanche problem with IPv6 ND or DHCPv6, 
>folks in the WG's have replied that both ND and the DHCPv6 specifications 
>have randomization and at least ND has an additional jitter that avoids 
>avalanches to cause any major problem in the ipv6 network.  I have 
>personally tested routers with over 40,000 IPv6 clients with a router

Were they 802.11 clients? The wireless medium does have different properties.
Did you induce any "flash failure" events? Anyway, see below.

>reload and did not see avalanche issues with ND nor DHCPv6.   Sorry, 
>there was a typo in my email that said "not" when I should have said 
>"lot".    I am interested to know why one client in a 100K client 
>deployment has lot of IPv6 entries in the client Neighbor Cache?  How 
>much is a lot?  10, 100, 1000 and how are these entries populated in the 
>client?

I myself did not like this portion of my reply, since it offered no 
quantifiable data (operating on the basis of a belief is not good engineering).

I happened to have a couple of hours of offline time, during which I tried 
to sketch the scenario and see how far I could get without building a lab.

It's a *very rough sketch*. If you think parts of it can be made better, 
tell me.

The initial assumptions for this thought experiment are as follows:

1) We have 10000 clients in a single /64.

2) There are multiple APs that bridge the traffic from the wired network onto
the wireless medium, with the client count limited to 100 per AP.

3) There is a 20x speed difference between unicast and multicast
transmission: the effective multicast data rate is assumed to be
1 Mbps, the effective unicast data rate 20 Mbps.

4) The APs are assumed to be "naive", i.e. they do not perform any snooping
or multicast-to-unicast conversion, but at the same time they are able to 
bridge the unicast traffic without flooding it between multiple access 
points. I.e. we assume a model with a single router (or an FHRP 
pair) and a set of 100 APs bridging the traffic.

(Corollary from the above: the effective unicast capacity is 100 * 20 Mbps, 
whereas the effective multicast capacity is 1 * 1 Mbps, so the difference in 
throughput is 2000-fold).
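(If it helps to sanity-check the arithmetic, here is a quick back-of-envelope
sketch in Python; the numbers are just the assumptions above, nothing
measured:)

    # Capacity sketch under assumptions 1-4 above.
    NUM_HOSTS      = 10000        # clients in the single /64
    HOSTS_PER_AP   = 100
    NUM_APS        = NUM_HOSTS // HOSTS_PER_AP   # 100 "naive" APs
    UNICAST_MBPS   = 20.0         # effective unicast rate, per AP
    MULTICAST_MBPS = 1.0          # effective multicast rate, shared by all APs

    unicast_capacity   = NUM_APS * UNICAST_MBPS  # 2000 Mbps total
    multicast_capacity = 1 * MULTICAST_MBPS      # 1 Mbps total
    print(unicast_capacity / multicast_capacity) # 2000x difference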

Let's first consider the steady state. Suppose each host downloads
a file at 0.1 Mbps. Within each AP we therefore have 50% capacity 
utilization (0.1 * 100 = 10 Mbps out of 20 Mbps of capacity).

It's easy to see this comfortably accommodates all the hosts. Obviously
the unicast NUD within this traffic is fairly minimal, so I don't think it's 
even worth counting how much it is.
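(Same style of sanity check for the steady state:)

    # Steady-state load per AP.
    PER_HOST_MBPS = 0.1
    HOSTS_PER_AP  = 100
    UNICAST_MBPS  = 20.0
    per_ap_load   = PER_HOST_MBPS * HOSTS_PER_AP   # 10 Mbps
    print(per_ap_load / UNICAST_MBPS)              # 0.5 -> 50% utilization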

Now, let's look at a potential failure.

At the time of a NUD probe, it's enough to lose the 3 unicast probes,
spaced 1 second apart, for the entry to be deleted. The default reachable
time is 30 seconds, jittered by a random factor between 0.5 and 1.5.

So, all that is needed to achieve a mass NUD failure is a roughly 30-second
outage during a period when all the hosts are sending traffic.
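(For the timing, a rough sketch using the RFC 4861 default constants; the
exact window depends on the jitter, and I am ignoring any DELAY-state
subtleties beyond the default 5-second timer:)

    # Rough NUD failure window, RFC 4861 defaults.
    BASE_REACHABLE_TIME = 30.0   # seconds
    MIN_RANDOM_FACTOR   = 0.5
    MAX_RANDOM_FACTOR   = 1.5
    DELAY_FIRST_PROBE   = 5.0    # seconds in DELAY state before probing
    MAX_UNICAST_SOLICIT = 3
    RETRANS_TIMER       = 1.0    # seconds between unicast probes

    probe_phase = DELAY_FIRST_PROBE + MAX_UNICAST_SOLICIT * RETRANS_TIMER  # 8 s
    best  = BASE_REACHABLE_TIME * MIN_RANDOM_FACTOR + probe_phase   # ~23 s
    worst = BASE_REACHABLE_TIME * MAX_RANDOM_FACTOR + probe_phase   # ~53 s
    print(best, worst)   # an outage in this range kills the entry on a busy host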

A reboot of most networking gear takes several times longer 
than this.

Therefore, a crash of the default gateway during peak hour is
one guaranteed trigger for this to happen.

So now we have a situation of 10000 hosts which have deleted their 
default gateway from the neighbor table and are sending multicast neighbor 
solicitations for it.

Assume the NS is 64 bytes; 10000 hosts each sending one such packet per
second means 64 bytes * 8 bits * 10000 hosts = 5,120,000 bits/sec, or
about 5 Mbps.
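(In code form, with the same one-NS-per-host-per-second assumption:)

    # Multicast NS load from the orphaned hosts.
    NS_BYTES  = 64
    NUM_HOSTS = 10000
    load_bps  = NS_BYTES * 8 * NUM_HOSTS   # 5,120,000 bits/s
    print(load_bps / 1e6)                  # ~5.12 Mbps against ~1 Mbps of multicast capacity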

Note that this is only the shared bandwidth downstream back to the hosts; 
the airtime spent by the upstream traffic is 20x cheaper, so I am 
generously discarding it.

Since the APs cannot send the traffic at this rate, we will obviously
need to drop some of it. Note that if the clients succeed with the ND to the 
default gateway, they will start streaming data again, so the effective 
multicast throughput will drop to 0.5 Mbps as soon as a noticeable 
portion of the clients recovers.

Let's assume the best case of 20% of the clients managing to recover 
within the first second.

As we approach full recovery, fewer clients will be able to get their 
multicast NS sent, because the airtime is being taken up by the payload 
traffic. Anyway, let's discard that and assume that every second 20% of 
the clients recover.

This means that the recovery of the full set of clients in these 
conditions will take an *absolute* minimum of 5 seconds.
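(The 5 seconds follows directly from the optimistic assumption; a trivial
loop just to make the best case explicit:)

    # Best-case recovery: 20% of the whole population recovers each second.
    NUM_HOSTS     = 10000
    RECOVERY_RATE = 0.20            # fraction of the total per second (optimistic)
    orphans, t = NUM_HOSTS, 0
    while orphans > 0:
        orphans -= int(NUM_HOSTS * RECOVERY_RATE)
        t += 1
    print(t)                        # 5 seconds, the absolute minimum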

Did I prove myself wrong? It seems so. We can see that with some of 
the relaxed assumptions I made, the hosts do seem to recover.

But let's add to this a little bit of mDNS and other multicast-loving 
protocols, which tend to generate a fair chunk of traffic when they detect 
that the network was "restored".

This can shrink the available capacity several-fold. Add to this 
that a host does not necessarily stream at 100 kbps but might use 
higher data rates, and I think we can consider the available
multicast capacity at startup to be 1/10th of what it is in theory.

This is where things may become interesting: 1/10th of the 
capacity means that it will take not 5, but 50 seconds for all the 
hosts to recover.
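(Same arithmetic with the degraded channel:)

    # 1/10th of the optimistic recovery rate -> 10x longer to drain the pool.
    NUM_HOSTS       = 10000
    RECOVER_PER_SEC = int(10000 * 0.20 * 0.1)   # 200 hosts/s instead of 2000
    print(NUM_HOSTS / RECOVER_PER_SEC)          # 50.0 seconds, still a best case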

This means that the hosts which recovered in the very first second will 
already be sending NUD traffic while the network is still under stress. If 
these packets are lost, those hosts may fall back into the pool of 
"orphans" sending multicast, because they delete their 
neighbor entry.
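(To illustrate the feedback I have in mind, a deliberately crude simulation
sketch; the 10x capacity reduction, the 30-second re-probe interval and the
30% re-failure probability are made-up knobs, not measurements:)

    import random
    random.seed(1)

    NUM_HOSTS       = 10000
    RECOVER_PER_SEC = 200     # 20%/s throttled 10x by the congested multicast channel
    REPROBE_DELAY   = 30      # seconds until a recovered host runs NUD again
    REFAIL_PROB     = 0.3     # chance that the NUD probe fails while still congested

    orphans = NUM_HOSTS
    pending = {}              # second -> number of hosts due to re-probe at that second
    t = 0
    while (orphans > 0 or pending) and t < 600:
        t += 1
        recovered_now = min(RECOVER_PER_SEC, orphans)
        orphans -= recovered_now
        if recovered_now:
            pending[t + REPROBE_DELAY] = pending.get(t + REPROBE_DELAY, 0) + recovered_now
        # hosts due to re-probe now: some fail and fall back into the orphan pool,
        # the rest are treated as stable from then on
        refails = sum(random.random() < REFAIL_PROB for _ in range(pending.pop(t, 0)))
        orphans += refails
    print(t)    # time until the pool finally drains under these made-up knobs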

There are some other things to consider:

- I deliberately kept the scenario here to *one* ND entry.
Assume your hosts are talking with 3-4 other hosts besides the default 
gateway: this increases the load proportionally and makes the dangerous 
state easier to reach.

- Another factor I am omitting: such a storm of ND toward the default 
gateway might trigger rate-limiting of control-plane packets on the gateway. 
With some of the limits being as low as 1000 pps, this could push the 
recovery time to the order of minutes even without the wireless multicast 
being the bottleneck, while still leaving a lot of multicast NS in the air 
during the slow recovery (rough arithmetic below).
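(Rough arithmetic for that, with the 1000 pps figure above as the only
input:)

    # Control-plane rate limiting on the gateway side.
    NUM_HOSTS   = 10000
    COPP_PPS    = 1000        # example punt limit for NS reaching the gateway CPU
    NS_PER_HOST = 1           # each orphan retries roughly once per second

    offered  = NUM_HOSTS * NS_PER_HOST      # 10000 pps offered to the gateway
    answered = COPP_PPS                     # only 1000 pps can be answered
    print(answered / offered)               # ~10% of solicitations answered per second;
                                            # with retries and wireless losses on top,
                                            # recovery stretches toward minutes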

This is about as precise a construct as I can build.

Is it perfect? No, by no means. It also assumes that in the case of 
the default gateway the wireless performance will be the limiting factor - I 
think it won't be - so it's more of an appropriate scenario for 
p2p communications on the network. The default-gateway-only case will be 
inherently much more stable, I think - because the multicast on the wired 
side is fast, the drops on the wireless side will not matter.

It's probably worth changing the text to "There is potential for 
failing NUD to *contribute* to a longer recovery and possibly the creation of 
a locked-up situation in the case of a flash failure - but the exact 
quantification of the impact in such an environment is a topic for further 
study".

And then maybe I could dump the above thought experiment into a separate 
draft, to see if folks could contribute to the experiment (or maybe 
someone could run it) - and reference it from this item?

It seems like an interesting area to dig into a bit more - creating a 
suitable model and playing with the parameters to see where it breaks 
seems like a useful exercise to understand how many hosts there can be in 
a single /64 on WiFi with a "naive" set of access points.

Thoughts ?

--a