Re: Comments on draft-yourtchenko-colitti-nd-reduce-multicast

Andrew Yourtchenko <ayourtch@cisco.com> Fri, 28 February 2014 15:27 UTC

Date: Fri, 28 Feb 2014 16:26:35 +0100
From: Andrew Yourtchenko <ayourtch@cisco.com>
To: Erik Nordmark <nordmark@acm.org>
Subject: Re: Comments on draft-yourtchenko-colitti-nd-reduce-multicast
In-Reply-To: <530C9CFD.2000409@acm.org>
Message-ID: <alpine.OSX.2.00.1402272036140.42594@ayourtch-mac>
References: <5305AF13.5060201@acm.org> <75B6FA9F576969419E42BECB86CB1B89115F99A9@xmb-rcd-x06.cisco.com> <alpine.OSX.2.00.1402211620560.49053@ayourtch-mac> <75B6FA9F576969419E42BECB86CB1B89115F9BAE@xmb-rcd-x06.cisco.com> <alpine.OSX.2.00.1402212129450.52880@ayourtch-mac> <530C9CFD.2000409@acm.org>
Archived-At: http://mailarchive.ietf.org/arch/msg/ipv6/vDMfc5XQFgY0D3yYf2o38nH9xHA
Cc: IETF IPv6 <ipv6@ietf.org>

Erik,

On Tue, 25 Feb 2014, Erik Nordmark wrote:

> On 2/21/14 12:36 PM, Andrew Yourtchenko wrote:
>> 
>> I myself did not like that portion of my reply, since it contained no 
>> quantifiable data (operating on the basis of a belief is not good 
>> engineering).
>> 
>> I happened to have a couple of hours of offline time, during which I tried 
>> to sketch the scenario and see how far I could get without building a lab.
>> 
>> It's a *very rough sketch*. If you see parts of it that could be made 
>> better, tell me.
> Andrew,
> I like this sketch. A few comments below.
>> 
>> The initial assumptions for this thought experiment are as follows:
>> 
>> 1) We have 10000 clients in a single /64.
>> 
>> 2) There are multiple APs that bridge the traffic from wired onto
>> wireless medium, with the client count limited to 100 per AP.
>> 
>> 3) There is a 20x speed difference between unicast transmission and
>> multicast transmission: the effective multicast speed is assumed to be
>> 1 Mbps, the effective unicast speed 20 Mbps.
>> 
>> 4) The APs are assumed to be "naive", i.e. they perform no snooping
>> and no multicast->unicast conversion, but at the same time they are able to 
>> bridge unicast traffic without flooding it between access points. I.e. we 
>> assume a model with a single router (or an FHRP pair) and a set of 100 APs 
>> bridging the traffic.
>> 
>> (Corollary from the above: the effective aggregate unicast capacity is 
>> 100 * 20 Mbps, whereas the effective multicast capacity is 1 * 1 Mbps, 
>> therefore the difference in throughput is 2000-fold).
>> 
>> Let's first consider the steady state. Suppose each host downloads
>> a file at 0.1 Mbps. Within each AP we therefore have 50% capacity 
>> utilization (0.1 * 100 = 10 Mbps out of the 20 Mbps capacity).
>> 
>> It's easy to see this comfortably accommodates all the hosts. Obviously
>> the unicast NUD portion of this traffic is fairly minimal, so I don't think 
>> it's even worth counting.
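
To make the arithmetic above easy to poke at, here it is as a few lines
of Python - purely the numbers assumed in the sketch, nothing measured:

    # Back-of-the-envelope capacity numbers for the scenario above.
    NUM_HOSTS = 10000       # clients in the /64
    HOSTS_PER_AP = 100      # per-AP client limit
    UNICAST_MBPS = 20.0     # effective unicast rate, per AP
    MULTICAST_MBPS = 1.0    # effective multicast rate, shared by all APs
    PER_HOST_MBPS = 0.1     # steady-state download per host

    num_aps = NUM_HOSTS // HOSTS_PER_AP                         # 100 APs
    aggregate_unicast = num_aps * UNICAST_MBPS                  # 2000 Mbps
    ratio = aggregate_unicast / MULTICAST_MBPS                  # 2000-fold
    utilization = HOSTS_PER_AP * PER_HOST_MBPS / UNICAST_MBPS   # 0.5

    print(num_aps, aggregate_unicast, ratio, utilization)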
>> 
>> Now, let's look at a potential failure.
>> 
>> At the time of the NUD probe, it's enough to lose the 3 retransmissions,
>> spaced 1 second apart. The default ReachableTime is 30 seconds,
>> randomized by a factor between 0.5 and 1.5.
>> 
>> So, all that is needed to achieve a mass NUD failure is a ~30 second
>> outage during a period when all the hosts are sending traffic.
>> 
>> A reboot of most networking gear takes several times longer than this.
>> 
>> Therefore, a crash of the default gateway during the peak hour is
>> one guaranteed trigger for this to happen.
>> 
>> So, now we have a situation where 10000 hosts have deleted their default 
>> gateway from the neighbor table, and are sending multicast neighbor 
>> solicitations for it.
>> 
>> Assume the NS is 64 bytes; 10000 hosts each sending such a packet per 
>> second means 64 bytes * 8 bits * 10000 hosts = 5,120,000 bits/sec - or 
>> ~5 Mbps.

> Just to make it clear, each round of NS retransmissions is ~5 Mbits and the 
> retransmission interval is 1 second. Hence with 4861 we get 5 Mbps for 3 
> seconds. However, the hosts aren't that likely to all discover the NUD 
> failure in the same 3-second window. With ReachableTime=30 seconds it 
> depends on when the last successful NUD probe or higher-level reachability 
> advice came in. Hmm - with TCP the reachability advice might come with every 
> advancement of the left window edge, i.e. they would synchronize quite 
> closely.

Yes, it's a bit of a fringe assumption here that all the hosts are active 
and then lose connectivity at once, with somewhat uniform implementations of 
TCP (or of things like name resolution). In reality I think they will be 
spread out more.
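
To put a rough number on that spread, here is a tiny simulation sketch -
all assumptions, no measurements: RFC 4861 defaults, ReachableTime drawn
between 0.5 and 1.5 times 30 seconds, the failing probe landing at a
random point in that window (I ignore the 5-second DELAY state, which
only shifts things), 3 lost unicast probes followed by 3 multicast NS of
64 bytes each:

    import random
    from collections import Counter
    random.seed(1)

    NUM_HOSTS = 10000
    NS_BITS = 64 * 8       # one 64-byte neighbor solicitation
    BASE_REACHABLE = 30.0  # RFC 4861 default BaseReachableTime, seconds
    PROBES = 3             # MAX_UNICAST_SOLICIT == MAX_MULTICAST_SOLICIT
    RETRANS = 1.0          # RetransTimer, seconds

    bits_per_second = Counter()
    for _ in range(NUM_HOSTS):
        reachable = BASE_REACHABLE * random.uniform(0.5, 1.5)
        # Outage at t=0: the first (failing) unicast probe lands somewhere
        # in the host's randomized ReachableTime window...
        first_probe = random.uniform(0.0, reachable)
        # ...and after 3 lost unicast probes the entry is deleted and the
        # host falls back to multicast NS - count those transmissions:
        for i in range(PROBES):
            t = first_probe + PROBES * RETRANS + i * RETRANS
            bits_per_second[int(t)] += NS_BITS

    print("peak: %.2f Mbps" % (max(bits_per_second.values()) / 1e6))

With these knobs the per-second peak comes out at a small fraction of
the synchronized 5 Mbps worst case; the TCP-driven synchronization you
describe would push it back towards that worst case.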

>
>> 
>> Note that this is only the shared bandwidth downstream back to the hosts - 
>> the upstream traffic is sent 20x faster, so I am gratuitously discarding 
>> its airtime.
>> 
>> Since the APs cannot send traffic at this rate, we will obviously
>> need to drop some of it. Note that if the clients succeed with ND to the 
>> default gateway, they will start streaming data again, so the effective 
>> multicast throughput will drop to 0.5 Mbps as soon as a noticeable 
>> portion of the clients recovers.
> While the multicast NS will see downstream drops, what matters for the 
> resolution is that the router receives it and responds. So why wouldn't 
> that part work?
> The NA from the router will be unicast, so if that unicast isn't dropped due 
> to the multicasts then it should resolve quickly. I don't know whether 
> unicast is prioritized higher than multicast or not.
>
> (We'll see downstream overload due to the multicasts, but that might just 
> last for 1 second if the NS/NA to/from the router gets through.)

Yes - since the router is on the wired side, the bandwidth degradation 
due to the "dummy" multicast retransmission downstream is the biggest 
problem. The NS goes upstream over the air at the unicast rate, then onto 
the wired side plus out again as multicast over the air, and the reply 
from the default gateway on the wired side comes back as unicast.
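
To see why that downstream re-flood dominates, a quick airtime
calculation with the same assumed rates (one NS again taken as 64
bytes):

    NS_BITS = 64 * 8
    UNICAST_BPS = 20e6   # host -> AP: sent at the unicast rate
    MULTICAST_BPS = 1e6  # AP -> hosts: re-flooded at the multicast rate

    air_up = NS_BITS / UNICAST_BPS      # ~26 microseconds per NS upstream
    air_down = NS_BITS / MULTICAST_BPS  # 512 microseconds per NS downstream

    # 10000 NS per second, each re-flooded on every AP's channel:
    print(10000 * air_down)  # 5.12 "seconds of airtime" per second

So the upstream copies and the wired side are noise, while the
downstream re-flood alone is ~5x oversubscribed - the same 5 Mbps vs
1 Mbps figure as above.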

>
>> 
>> Let's assume the best case of 20% of the clients managing to recover within 
>> the first second.
>> 
>> As we approach full recovery, fewer clients will be able to get their 
>> multicast NS sent, because the airtime is being taken by the payload 
>> traffic. Anyway, let's discard that and assume that every second 20% of 
>> the clients will recover.
>> 
>> This means that the recovery of the full set of clients under these 
>> conditions will take an *absolute* minimum of 5 seconds.
>> 
>> Did I prove myself wrong? It seems so. We can see that with some of the 
>> relaxed assumptions I took, the hosts do seem to recover.
>> 
>> But let's add to this a little bit of mDNS and other multicast-loving 
>> protocols, which tend to generate a fair chunk of traffic when they detect 
>> that the network was "restored".
>> 
>> This can shrink the available capacity several-fold. Add to this that a 
>> host does not necessarily stream at 100 kbps but might use higher data 
>> rates, and I think we can consider the available multicast capacity at 
>> startup to be 1/10th of what it is in theory.
> Yes, in general TCP will fill the pipe, thus at each AP all of the 
> downstream bandwidth will be used.
>
> But if the streaming comes via the default router (which crashed or was 
> unreachable for a few seconds), then the stream would also slow down due to 
> TCP, in which case there is less of a multicast overload issue. There might 
> be cases where this TCP slowdown doesn't happen though.

Yes, indeed. Also, given the high utilisation of the medium at that point, 
I wonder whether cross-channel interference will play a bigger role and 
therefore affect the results.

This seems like an extremely tricky beast to model with any reasonable 
accuracy!
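
That said, even crude knob-turning shows the shape of the problem. Below
is a discrete-time sketch of the recovery as assumed above; the
refail_prob knob is entirely my invention, meant to stand in for the
"orphan pool" re-failure and the gateway control-plane rate-limiting
described further down in the quoted text, and none of it is validated
against a real network:

    def recover(num_hosts=10000, recover_per_sec=2000, capacity_derate=1.0,
                refail_prob=0.0, max_seconds=300):
        """Seconds until all hosts recover, or None if we never get there.

        recover_per_sec - hosts whose multicast NS succeeds each second
                          (2000 = the 20%-of-all-hosts/sec assumed above)
        capacity_derate - multicast capacity left once mDNS & co pile on
                          (0.1 = the 1/10th case above)
        refail_prob     - per-second chance that a recovered host loses its
                          NUD probes under stress and rejoins the orphans
        """
        orphans, recovered = num_hosts, 0
        for t in range(1, max_seconds + 1):
            newly_ok = min(orphans, int(recover_per_sec * capacity_derate))
            refailed = int(recovered * refail_prob)
            orphans += refailed - newly_ok
            recovered += newly_ok - refailed
            if orphans == 0:
                return t
        return None  # the "locked" situation

    print(recover())                                       # -> 5
    print(recover(capacity_derate=0.1))                    # -> 50
    print(recover(capacity_derate=0.1, refail_prob=0.05))  # -> None (locked)

Once the re-failure inflow matches the recovery outflow, the orphan pool
never drains - which is exactly the locked situation sketched below.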

--a



>
>  Erik
>
>> 
>> This is where things may become interesting - 1/10th of the capacity 
>> means that it will take not 5, but 50 seconds for all the hosts to recover.
>> 
>> This means that the hosts which recovered in the very first second will 
>> already be sending NUD traffic while the network is still under stress. If 
>> these packets are lost, those hosts may fall back into the pool of "orphans" 
>> sending the multicast, because they delete their neighbor entry.
>> 
>> There are some other things to consider:
>> 
>> - I deliberately kept the scenario here to *one* ND entry.
>> Assume your hosts are talking with 3-4 other hosts besides the default 
>> gateway: this increases the load proportionally and makes the dangerous 
>> state easier to reach.
>> 
>> - Another factor that I am omitting: such a storm of ND towards the 
>> default gateway might trigger rate-limiting of control plane packets on 
>> the gateway. With some of the limits being as low as 1000 pps, this might 
>> give a recovery time on the order of minutes even without the wireless 
>> multicast being the bottleneck, while still resulting in a lot of multicast 
>> NS in the air during the slow recovery.
>> 
>> This is about as precise a construct as I can build.
>> 
>> Is it perfect? No, by no means. It also assumes that in the case of 
>> the default gateway the wireless performance will be the limiting factor - I 
>> think it won't be - so it's more of an appropriate scenario for the case of 
>> p2p communications on the network. The default-gateway-only case will be 
>> inherently much more stable, I think - because the multicast on the wired 
>> side is fast, so the drops on the wireless side will not matter.
>> 
>> It's probably worth changing the text to: "There is potential for the 
>> failing NUD to *contribute* to a longer recovery and possibly to the 
>> creation of a locked situation in the case of a flash failure - but the 
>> exact quantification of the impact in such an environment is a topic for 
>> further study".
>> 
>> And then maybe I could dump the above thought experiment into a separate 
>> draft, to see if folks could contribute to the experiment / maybe 
>> someone could run it - and reference it in this item?
>> 
>> It seems like an interesting area to dig into a bit more - creating a 
>> suitable model and playing with the parameters to see where it breaks seems 
>> like a useful exercise to understand how many hosts there can be in a 
>> single /64 on WiFi with a "naive" set of access points.
>> 
>> Thoughts ?
>> 
>> --a
>> 
>