Re: [armd] Ralph Droms' No Objection on draft-ietf-armd-problem-statement-03: (with COMMENT)

Thomas Narten <narten@us.ibm.com> Thu, 30 August 2012 00:42 UTC

To: "Ralph Droms" <rdroms.ietf@gmail.com>
In-reply-to: <20120829182602.22800.41833.idtracker@ietfa.amsl.com>
References: <20120829182602.22800.41833.idtracker@ietfa.amsl.com>
Comments: In-reply-to "Ralph Droms" <rdroms.ietf@gmail.com> message dated "Wed, 29 Aug 2012 11:26:02 -0700."
Date: Wed, 29 Aug 2012 20:42:23 -0400
From: Thomas Narten <narten@us.ibm.com>
Cc: The IESG <iesg@ietf.org>, armd@ietf.org
Subject: Re: [armd] Ralph Droms' No Objection on draft-ietf-armd-problem-statement-03: (with COMMENT)

Hi Ralph.

"Ralph Droms" <rdroms.ietf@gmail.com> writes:

> Ralph Droms has entered the following ballot position for
> draft-ietf-armd-problem-statement-03: No Objection

> When responding, please keep the subject line intact and reply to all
> email addresses included in the To and CC lines. (Feel free to cut this
> introductory paragraph, however.)


> Please refer to http://www.ietf.org/iesg/statement/discuss-criteria.html
> for more information about IESG DISCUSS and COMMENT positions.


> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------

> 1. In section 7.1, does a high volume of ARP traffic have more impact
> on routers than on hosts or VMs?  If so, why?

I think the answer is in some cases yes.

At one level, the amount of ARP traffic a router receives is the same
as for a host. But (to cut to the chase) there are a number of reasons
why the problem can be worse for routers:

1) router architectures in practice can result in hosts being able to
handle a higher rate of ARP requests. One can argue that routers
should just fix their implementations, but that doesn't change the
fact that in some deployments/implementations there are issues.

2) Routers sometimes have way more networks hanging off of them than
hosts do. E.g., a router might have 100 interfaces (to 100 different
networks - each generating ARP traffic the router would need to
process), whereas hosts would be on only one network and hence see a
lot less traffic. Hence, a router might see 100x more ARP traffic than
one host.

3) Routers are the targets of a lot of communication. So a lot of ARP
traffic is aimed at them. (Forwarding data traffic is fast/easy and
done by the ASIC, ARP processing is slow, done in the software
processor). I'm guessing a bit here, but I suspect that if you looked
at a typical network, the average rate of ARP queries directed at
nodes is likely higher for routers than hosts.
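
As a back-of-the-envelope sketch of point 2 (the per-network ARP rate
below is a made-up illustrative figure, not a measurement, and the
function name is hypothetical; it also assumes every attached network
generates the same rate):

```python
def arp_rate_seen(interfaces: int, per_network_rate: float) -> float:
    """ARP packets/second a node must process in software.

    Assumption: each attached network contributes the same broadcast
    ARP rate, and the node processes every ARP packet it receives.
    """
    return interfaces * per_network_rate


# Hypothetical rate of 50 ARP packets/second per network.
host_load = arp_rate_seen(interfaces=1, per_network_rate=50.0)
router_load = arp_rate_seen(interfaces=100, per_network_rate=50.0)

# The ratio depends only on the interface count, not on the assumed rate.
print(router_load / host_load)
```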

One other detail (that the document doesn't get into) is that more
recent implementations of Windows borrowed NUD from IPv6 and
retrofitted it into IPv4. Thus, they generate unicast ARP queries
frequently to revalidate entries associated with neighbors, just like
IPv6 does.  This has noticeably increased the ARP traffic routers have
to process (on networks with more recent versions of Windows).

> 2. In section 7.1, does the total volume of ARP traffic ever become
> great enough to have a measurable impact on available traffic
> capacity?

What I'm told is that the CPU on routers can saturate or come close to
saturating, meaning that they become unable to process all the ARP
traffic and other essential routing functions as well. At this point,
you start having major problems (e.g., the router isn't responding to
other stuff it is supposed to in a timely manner).

> 3. Does this sentence from section 7.2 imply that IPv6 stacks that
> exhibit the described behavior are compliant with RFC 4861?

>    Consequently, some
>    implementations will send out "probe" ND queries to validate in-use
>    ND entries as frequently as every 35 seconds [RFC4861].

The above is the correct behavior as called for in RFC 4861. While the
time may seem short, it's intended to ensure that recovery takes place
(should the router you are using go down) before TCP connections time
out.
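
For reference, the 35-second figure falls out of the protocol constants
in RFC 4861, Section 10. A minimal sketch, assuming the entry is in
continuous use (so traffic is sent as soon as the entry goes stale) and
that no upper-layer reachability confirmation arrives; the function name
is illustrative only:

```python
# Protocol constants from RFC 4861, Section 10 (in seconds).
REACHABLE_TIME = 30.0
DELAY_FIRST_PROBE_TIME = 5.0
MIN_RANDOM_FACTOR = 0.5
MAX_RANDOM_FACTOR = 1.5


def seconds_until_probe(random_factor: float = 1.0) -> float:
    """Time from reachability confirmation to the first unicast probe.

    Simplifying assumption: the entry is used the moment it leaves
    REACHABLE, so the DELAY timer starts immediately.
    """
    if not MIN_RANDOM_FACTOR <= random_factor <= MAX_RANDOM_FACTOR:
        raise ValueError("random_factor outside the range RFC 4861 allows")
    return REACHABLE_TIME * random_factor + DELAY_FIRST_PROBE_TIME


print(seconds_until_probe())     # nominal case: 30 + 5
print(seconds_until_probe(0.5))  # shortest randomized reachable time
print(seconds_until_probe(1.5))  # longest randomized reachable time
```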

> 4. I suggest dropping the sentence about the impact of VMs in section
> 7.3.  Any growth in the datacenter that increases the number of
> addresses used in an L2 domain, whether it be the physical span of the
> L2 domain or the use of VMs, will have the impact described in section
> 7.3.  The impact of growth will also have an impact on the scenarios
> in section 7.1 and 7.2.  The specific impact of VMs is also mentioned
> earlier in the document.

But it is well documented that virtualization (using VMs)
exacerbates the problem. So I think saying so here is useful
(even if redundant).

> 5. Are the three problems described in sections 7.1-3 really the only
> address resolution problems in large datacenters?

Well, they are the ones I know of and that the WG called out... Do you
think there are others?

> How do the three problems interact with each other (as mentioned at
> the end of section 7.3), when the ARP and ND problems seem to be
> related to CPU usage and the MAC table issue seems to be a memory
> problem.

The problem is just that the more processing has to be done, the fewer
cycles there are to go around. And in some deployments there aren't
quite enough cycles, so anything that adds to the load is potentially
problematic...

> 6. It was a little surprising to me that section 5 describes multicast
> ND for address resolution, but section 7.2 only cites the unicast use
> of ND for NUD as a problem.

The problems with ND and ARP are not so much about the
bandwidth/network usage per se. It's really more about routers needing
to process such packets. That's where things start breaking down (in
some deployments). There aren't enough cycles in the router's service
processor to do the work... So whether the packets received are
multicast vs. unicast isn't the issue (for received packets).

That section wasn't really trying to focus on multicast
vs. unicast. Maybe that didn't come out as clearly as it could. I.e.,
the first paragraph really should say that in terms of processing of
ND traffic on a router, the costs/issues are much the same as when
handling an ARP packet.

How about I change the first paragraph as follows:

old:

   Though IPv6's Neighbor Discovery behaves much like ARP there are
   several notable differences which result in a different set of
   potential issues.  From an L2 perspective there is the simple
   difference between sending to a multicast versus broadcast address
   which results in ND queries only being processed by the nodes for
   which they are intended.

new:

   Though IPv6's Neighbor Discovery behaves much like ARP there are
   several notable differences which result in a different set of
   potential issues.  From an L2 perspective, an important difference
   is that ND address resolution requests are sent via multicast,
   which results in ND queries only being processed by the nodes for
   which they are intended. This reduces the total number of ND
   packets that an implementation will receive compared with
   broadcast ARPs.
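
To illustrate why multicast narrows the set of receivers: an ND address
resolution request goes to the target's solicited-node multicast
address, which RFC 4291 forms from the low-order 24 bits of the target
address. A small sketch (the function name is mine, and the address is
just a documentation-prefix example):

```python
import ipaddress


def solicited_node_multicast(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for an IPv6 address.

    Per RFC 4291: prefix ff02::1:ff00:0/104 plus the low 24 bits
    of the unicast address.
    """
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


# Only nodes whose addresses share these low 24 bits join this group,
# whereas a broadcast ARP reaches every node on the link.
print(solicited_node_multicast("2001:db8::1:800:200e:8c6c"))
```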

Thomas