Re: Comments on draft-yourtchenko-colitti-nd-reduce-multicast

Erik Nordmark <nordmark@acm.org> Tue, 25 February 2014 13:39 UTC

To: Andrew Yourtchenko <ayourtch@cisco.com>, "Hemant Singh (shemant)" <shemant@cisco.com>
Cc: IETF IPv6 <ipv6@ietf.org>

On 2/21/14 12:36 PM, Andrew Yourtchenko wrote:
>
> I myself did not like this portion of my reply, since it had no
> quantifiable data (operating on the basis of belief is not good
> engineering).
>
> I happened to have a couple of hours of offline time, during which I
> tried to sketch the scenario and get as far as I could without
> building a lab.
>
> It's a *very rough sketch*. If you think parts of it can be made
> better, tell me.
Andrew,
I like this sketch. A few comments below.
>
> The initial assumptions for this thought experiment are as follows:
>
> 1) We have 10000 clients in a single /64.
>
> 2) There are multiple APs that bridge the traffic from wired onto
> wireless medium, with the client count limited to 100 per AP.
>
> 3) There is a 20x speed difference between unicast and multicast
> transmission: the effective multicast rate is assumed to be 1 Mbps,
> the effective unicast rate 20 Mbps.
>
> 4) The APs are assumed to be "naive", i.e. they do not perform any
> snooping or multicast->unicast conversion, but they are able to
> bridge the unicast traffic without flooding it between multiple
> access points. I.e. we assume a model with a single router (or an
> FHRP pair) and a set of 100 APs bridging the traffic.
>
> (Corollary from the above: the effective unicast capacity is
> 100*20 Mbps = 2000 Mbps, whereas the effective multicast capacity is
> 1*1 Mbps, so the difference in throughput is 2000-fold.)
>
> Let's first consider the steady state. Suppose each host downloads
> a file at 0.1 Mbps. Within each AP we therefore have 50% capacity
> utilization (0.1*100 = 10 Mbps out of the 20 Mbps capacity).
>
> It's easy to see this comfortably accommodates all the hosts.
> Obviously the unicast NUD traffic in this state is fairly minimal,
> so I don't think it's even worth counting.
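(As a quick sanity check on the arithmetic, a minimal Python sketch of
the capacity model - all the numbers are the stated assumptions above,
nothing measured:)

# Capacity model from the thought experiment (assumed figures only).
N_HOSTS = 10_000                 # clients in the /64
HOSTS_PER_AP = 100
N_APS = N_HOSTS // HOSTS_PER_AP  # -> 100 APs
UNICAST_MBPS_PER_AP = 20.0       # effective unicast rate per AP
MULTICAST_MBPS = 1.0             # multicast floods every AP at 1 Mbps

unicast_capacity = N_APS * UNICAST_MBPS_PER_AP   # 2000 Mbps aggregate
ratio = unicast_capacity / MULTICAST_MBPS        # 2000x difference

PER_HOST_MBPS = 0.1
per_ap_load = HOSTS_PER_AP * PER_HOST_MBPS       # 10 Mbps per AP
utilization = per_ap_load / UNICAST_MBPS_PER_AP  # 0.5 -> 50%

print(f"aggregate unicast {unicast_capacity:.0f} Mbps, "
      f"{ratio:.0f}x multicast; per-AP utilization {utilization:.0%}")
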
>
> Now, let's look at a potential failure.
>
> At the time of the NUD probe, it's enough to lose the 3 retries,
> spaced 1 second apart. The default reachable time is 30 seconds,
> randomized by a jitter factor between 0.5 and 1.5.
>
> So, all that is needed to achieve a mass NUD failure is a ~30 second
> outage during a period when all the hosts are sending traffic.
>
> A reboot of most networking gear takes several times longer than
> this.
>
> Therefore, a crash of the default gateway during the peak hour is a
> guaranteed trigger for this to happen.
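For concreteness, the RFC 4861 default timers give the following window
from the last reachability confirmation to the entry being deleted (a
sketch of the defaults, assuming steady traffic and a total outage; the
5-second term is DELAY_FIRST_PROBE_TIME, spent in the DELAY state
before the first unicast probe):

# NUD failure timing under RFC 4861 defaults (sketch, not a trace).
BASE_REACHABLE = 30.0    # BaseReachableTime, seconds
MIN_F, MAX_F = 0.5, 1.5  # ReachableTime jitter factors
DELAY_FIRST_PROBE = 5.0  # DELAY state before the first unicast probe
RETRANS = 1.0            # RetransTimer between probes
PROBES = 3               # MAX_UNICAST_SOLICIT

earliest = BASE_REACHABLE * MIN_F + DELAY_FIRST_PROBE + PROBES * RETRANS
latest = BASE_REACHABLE * MAX_F + DELAY_FIRST_PROBE + PROBES * RETRANS
print(f"entry deleted {earliest:.0f}..{latest:.0f} s "
      f"after the last confirmation")   # -> 23..53 s

So a ~30 second outage starting right after a host's last confirmation
is indeed enough to push it through the whole sequence.
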
>
> So now we have a situation of 10000 hosts which have deleted their
> default gateway from the neighbor table and are sending multicast
> neighbor solicitations for it.
>
> Assume the NS is 64 bytes; 10000 hosts each sending such a packet
> per second means 64 bytes * 8 bits * 10000 hosts = 5,120,000 bits/s,
> or about 5 Mbps.
Just to make it clear, each round of NS retransmissions is 5 Mbits and
the retransmission interval is 1 second. Hence with RFC 4861 we get 5
Mbps for 3 seconds. However, the hosts aren't that likely to all
discover the NUD failure in the same 3-second window. With
ReachableTime=30 seconds it depends on when the last successful NUD
probe or higher-level reachability advice came in. Hmm - with TCP the
reachability advice might come with every left window edge
advancement, i.e. they would synchronize quite closely.
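(A toy Monte-Carlo of that synchronization question - assuming, per the
above, that every host's ReachableTime timer effectively restarts at
the moment the outage begins, so only the RFC 4861 jitter spreads the
failures:)

# How much does ReachableTime jitter alone desynchronize the NS storm?
import random
from collections import Counter

random.seed(42)
N = 10_000
NS_BITS = 64 * 8        # one 64-byte multicast NS

load = Counter()
for _ in range(N):
    # failure time: jittered ReachableTime + DELAY (5 s) + 3 probes
    t_fail = 30.0 * random.uniform(0.5, 1.5) + 5.0 + 3 * 1.0
    for k in range(3):  # MAX_MULTICAST_SOLICIT retransmissions
        load[int(t_fail + k)] += NS_BITS

peak = max(load.values())
print(f"peak multicast NS load ~{peak / 1e6:.2f} Mbps "
      f"(vs 5.12 Mbps if fully synchronized)")

In this toy the jitter spreads the storm over a ~30 second window and
the peak drops to roughly 0.5 Mbps, so how closely the hosts
synchronize really is the crux.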

>
> Note that this is only the shared bandwidth downstream back to the
> hosts - the upstream traffic goes at the 20x faster unicast rate, so
> I am gratuitously discarding it.
>
> Since the APs cannot send traffic at this rate, we will obviously
> need to drop some of it. Note that if the clients succeed with the
> ND to the default gateway, they will start streaming data again, so
> the effective multicast load will drop to 0.5 Mbps as soon as a
> noticeable portion of the clients recovers.
While the multicast NS will see downstream drops, what matters for the
resolution is that the router receives it and responds. So why
wouldn't that part work?
The NA from the router will be unicast, so if that unicast isn't
dropped due to the multicasts then resolution should happen quickly. I
don't know whether unicast is prioritized higher than multicast or
not.

(We'll see downstream overload due to the multicasts, but that might 
just last for 1 second if the NS/NA to/from the router gets through.)

>
> Let's assume the best case of 20% of the clients managing to recover 
> within the first second.
>
> As we approach full recovery, fewer clients will be able to get
> their multicast NS sent, because the airtime is being taken by the
> payload traffic. Anyway, let's discard that and assume that every
> second 20% of the clients recover.
>
> This means that the recovery of the full set of clients under these
> conditions will take an *absolute* minimum of 5 seconds.
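(The 5-second figure, written as a loop so the recovery rate can be
varied - 'recover_frac' is just the assumed 20% per second from above,
and the 0.02 case anticipates the 1/10th-capacity scenario further
down:)

# Minimal recovery model: a fixed fraction of the host population
# manages to re-resolve the gateway each second.
def recovery_seconds(n_hosts: int = 10_000,
                     recover_frac: float = 0.20) -> int:
    per_second = int(n_hosts * recover_frac)
    remaining, t = n_hosts, 0
    while remaining > 0:
        remaining -= min(per_second, remaining)
        t += 1
    return t

print(recovery_seconds())                    # -> 5 s at 20%/s
print(recovery_seconds(recover_frac=0.02))   # -> 50 s at 1/10th of that
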
>
> Did I prove myself wrong? It seems so. We can see that with the
> relaxed assumptions I took, the hosts do seem to recover.
>
> But let's add to this a little bit of mDNS and other
> multicast-loving protocols, which tend to generate a fair chunk of
> traffic when they detect that the network was "restored".
>
> This can shrink the available capacity several-fold. Add to this
> that a host does not necessarily stream at 100 kbps but might use
> higher data rates, and I think we can consider the available
> multicast capacity at startup to be 1/10th of what it is in theory.
Yes, in general TCP will fill the pipe, so at each AP all of the
downstream bandwidth will be used.

But if the streaming comes via the default router (which crashed or
was unreachable for a few seconds), then the stream would also slow
down due to TCP, in which case there is less of a multicast overload
issue. There might be cases where this TCP slowdown doesn't happen,
though.

   Erik

>
> This is where things may become interesting - 1/10th of the capacity
> means that it will take not 5 but 50 seconds for all the hosts to
> recover.
>
> This means that the hosts which recovered in the very first second
> will already be sending NUD traffic while the network is still under
> stress. If those packets are lost, these hosts may fall back into
> the pool of "orphans" sending multicast, because they delete their
> neighbor entry.
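(That feedback loop is easy to caricature. In the toy model below,
'relapse' is a made-up knob - the per-second probability that an
already-recovered host loses its NUD probes and rejoins the orphans -
just to show where the pool stops shrinking:)

# Orphan pool with recovery and relapse (pure toy, assumed rates).
def orphans_after(seconds: int, recover: float = 0.2,
                  relapse: float = 0.0, n: float = 10_000) -> float:
    orphans = n
    for _ in range(seconds):
        healed = orphans * recover       # orphans that re-resolve
        sick = (n - orphans) * relapse   # recovered hosts failing NUD
        orphans += sick - healed
    return orphans

for p in (0.0, 0.1, 0.2):
    print(f"relapse={p:.1f}: {orphans_after(60, relapse=p):,.0f} "
          f"orphans after 60 s")
# once the relapse inflow matches the recovery outflow, the orphan
# pool stops shrinking - the "locked" situation.
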
>
> There are some other things to consider:
>
> - I deliberately kept the scenario to *one* ND entry. Assume the
> hosts are talking with 3-4 other hosts besides the default gateway:
> this increases the load proportionally and makes the dangerous state
> easier to reach.
>
> - Another factor that I am omitting: such a storm of ND toward the
> default gateway might trigger rate-limiting of control plane packets
> on the gateway. With some of the limits being as low as 1000 pps,
> this might give a recovery time on the order of minutes even without
> the wireless multicast being the bottleneck, while still putting a
> lot of multicast NS in the air during the slow recovery.
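(A quick sketch of that bottleneck - 1000 pps is the punt limit assumed
above; the retry behaviour is simplified to rounds of 3 NS followed by
an assumed backoff:)

# Router control-plane punt limit as the bottleneck (toy model).
PUNT_PPS = 1_000   # NS the control plane will answer per second
N_HOSTS = 10_000
ROUND = 3.0        # a host sends 3 multicast NS over ~3 s, then gives up
BACKOFF = 7.0      # assumed pause before the host tries again

unresolved, t = N_HOSTS, 0.0
while unresolved > 0:
    answered = min(PUNT_PPS * ROUND, unresolved)  # NAs sent this round
    unresolved -= int(answered)
    t += ROUND + (BACKOFF if unresolved else 0.0)
print(f"~{t:.0f} s for all hosts to resolve at {PUNT_PPS} pps")
# and this assumes the punt budget is dedicated to NS; shared with
# everything else that is punted, minutes are plausible.
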
>
> This is about as precise a construct as I can build.
>
> Is it perfect? No, by no means. It also assumes that in the case of
> the default gateway the wireless performance will be the limiting
> factor - I think it won't be - so it's more of an appropriate
> scenario for p2p communications on the network. The
> default-gateway-only case will be inherently much more stable, I
> think, because the multicast on the wired side is fast, so the drops
> on the wireless side will not matter.
>
> It's probably worth changing the text to: "There is potential for
> failing NUD to *contribute* to a longer recovery and the possible
> creation of a locked situation in the case of a flash failure - but
> the exact quantification of the impact in such an environment is a
> topic for further study".
>
> And then maybe I could put the above thought experiment into a
> separate draft, to see if folks could contribute to the experiment /
> maybe someone could run it - and reference it in this item?
>
> It seems like an interesting area to dig into a bit more - creating
> a suitable model and playing with the parameters to see where it
> breaks seems like a useful exercise for understanding how many hosts
> there can be in a single /64 on WiFi with a "naive" set of access
> points.
>
> Thoughts ?
>
> --a
>