Re: [armd] RtgDir review: draft-ietf-armd-problem-statement-03

Thomas Narten <> Mon, 27 August 2012 21:25 UTC

To: "Bhatia, Manav (Manav)" <>
Comments: In-reply-to "Bhatia, Manav (Manav)" <> message dated "Wed, 15 Aug 2012 18:09:38 +0530."
Date: Mon, 27 Aug 2012 17:24:48 -0400
From: Thomas Narten <>
List-Id: "Discussion of issues associated with large amount of virtual machines being introduced in data centers and virtual hosts introduced by Cloud Computing." <>

Hi Manav.

Thanks for the review comments.

"Bhatia, Manav (Manav)" <> writes:

> Summary: I have some concerns about this document that I think
>  should be resolved before publication.

> Major Issues:

> 1. In Sec 5 why is there a "may" in the following statement?

> "From an L2 perspective, sending to a multicast vs. broadcast
>  address *may* result in the packet being delivered to all nodes,
>  but most (if not all) nodes will filter out the (unwanted) query
>  via filters installed in the NIC -- hosts will never see such
>  packets. "

This is poorly worded. How about I replace the paragraph with the
following:

	Broadly speaking, from the perspective of address resolution,
        IPv6's Neighbor Discovery (ND) behaves much like ARP, with a
        few notable differences. First, ARP uses broadcast, whereas ND
        uses multicast. Specifically, when querying for a target IP
        address, ND maps the target address into an IPv6 Solicited
        Node multicast address. Using multicast rather than broadcast
        has the benefit that the multicast frames do not necessarily
        need to be sent to all parts of the network, i.e., only to
        segments where listeners for the Solicited Node multicast
        address reside. In the case where multicast frames are
        delivered to all parts of the network, sending to a multicast
        address still has the advantage that most (if not all) nodes
        will filter out the (unwanted) multicast query via filters
        installed in the NIC rather than burdening host software with
        the need to process such packets. Thus, whereas all nodes must
        process every ARP query, ND queries are processed only by the
        nodes for which they are intended. In cases where multicast
        filtering can't effectively be implemented in the NIC (e.g.,
        as on hypervisors supporting virtualization), filtering would
        need to be done in software (e.g., in the hypervisor's
        vSwitch).
> "may" seems to indicate that there are scenarios when a multicast
>  from an L2 perspective will not be delivered to all nodes.


> I am unable to envisage a scenario when this can happen? All BUM
>  (broadcast, unlearnt unicast and multicast) traffic in vanilla L2
>  and VPLS (Virtual Private Lan Service) is delivered to *all*
>  nodes. There are exceptions in H-VPLS or if MMRP is enabled but I
>  suspect if the authors had this in their mind when they wrote the
>  above text.

Hopefully the proposed text answers the above questions.

> 2. Sec 7.1 begins with the following text:

> "One pain point with large L2 broadcast domains is that the routers
>  connected to the L2 domain need to process "a lot of" ARP traffic."

> I am not sure if this is correct with how an L2 broadcast domain has
>  been defined in Sec 2. I would wager that a bigger pain point for a
>  large L2 broadcast domain would be handling unknown unicast traffic
>  that needs to get flooded, instead of dealing with the "ARP"
>  traffic. I am aware of very very large L2 broadcast domains that
>  have no ARP/ND scaling problems. Would it then make more sense to
>  replace the L2 broadcast domain with an ARP/ND domain? If Yes, then
>  ARP/ND domain too needs to be defined in Sec 2.

The issue (as has been discussed in ARMD) is specifically the ARP
processing load (and not unknown unicast traffic). In typical
implementations, ARP processing is done by a service processor with
limited capacity. The cited problem is that the amount of ARP traffic
places a significant load on that processor.

This is explained in the next paragraph. How about I add the following
sentence to the 2nd paragraph:

     In some deployments, limitations on the rate of ARP processing
     have been cited as being a problem.

Does that work?     

> 3. Sec 7.1 seems to suggest that Gratuitous ARPs pre-populate ARP
>  caches on the neighboring devices. Without an explicit description
>  of what a neighboring device is, I would presume that this also
>  includes edge/core routers. In that case this statement is not
>  entirely correct as I am aware of routers that will by default not
>  pre-populate their ARP caches on receiving Gratuitous ARPs.

Right. The spec says "don't do this". But I believe it was asserted
that some implementations do this. That said, I'm not aware of any
such implementations. I would be willing to remove this sentence in
the absence of known implementations of this.

> 4. Sec 7.2 must also discuss the scaling impact of how the neighbor
>  cache is maintained in IPv6 - especially the impact of moving the
>  neighbor state from REACHABLE to STALE. Once the "IPv6 ARP" gets
>  resolved the neighbor entry moves from the REACHABLE to STALE after
>  around 30secs. The neighbor entry remains in this state till a
>  packet needs to be forwarded to this neighbor. The first time a
>  node sends a packet to a neighbor whose entry is STALE, the sender
>  changes the state to DELAY and sets a timer to expire in around 5
>  seconds. Most routers initiate moving the state from STALE to DELAY
>  by punting a copy of the data packet to CPU so that the sender can
>  reinitiate the Neighbor discovery process. This patently can be
>  quite CPU and buffer intensive if the neighbor cache size is huge.

This could be. But the WG did not report such specific details in
terms of actual problems reported from deployments.

Care to say more about what these "most implementations" are and how
common they are? And are they the *only* way to implement this
feature, or have other vendors chosen different implementations
without this limitation?

That said, I could add the following to the document:

	Routers implementing NUD (for neighboring destinations) will
	need to process neighbor cache state changes such as
	transitioning entries from REACHABLE to STALE. How this
	capability is implemented may impact the scalability of ND
	on a router. For example, one possible implementation is to
	have the forwarding operation detect when an ND entry is
	referenced that needs to transition from REACHABLE to STALE,
	by signalling an event that would need to be processed by the
	service processor. Such an implementation could increase the
	load on the service processor much in the same way that a high
	rate of ARP requests has led to problems on some routers.
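To make the state handling in that text concrete, here is a minimal
sketch of the REACHABLE -> STALE -> DELAY -> PROBE path for one cache
entry. It is an illustration of RFC 4861's state machine, not any
particular vendor's implementation, and the timer values are
illustrative (real ones come from Router Advertisements):

```python
import time

REACHABLE_TIME = 30.0    # seconds an entry stays REACHABLE after confirmation
DELAY_FIRST_PROBE = 5.0  # RFC 4861 DELAY_FIRST_PROBE_TIME

class NeighborEntry:
    def __init__(self, lladdr, now=None):
        self.lladdr = lladdr
        self.state = "REACHABLE"
        self.confirmed = now if now is not None else time.monotonic()
        self.delay_expiry = None

    def on_forward(self, now):
        """Called from the forwarding path when a packet uses this entry.
        This per-packet check is the event that, as noted above, can load
        a router's service processor when the neighbor cache is large."""
        if self.state == "REACHABLE" and now - self.confirmed > REACHABLE_TIME:
            self.state = "STALE"
        if self.state == "STALE":
            # First use of a STALE entry: move to DELAY and arm the timer.
            self.state = "DELAY"
            self.delay_expiry = now + DELAY_FIRST_PROBE
        return self.lladdr

    def on_timer(self, now):
        """DELAY timer expiry: begin unicast Neighbor Solicitation probes."""
        if self.state == "DELAY" and self.delay_expiry is not None \
                and now >= self.delay_expiry:
            self.state = "PROBE"
```

Every STALE->DELAY transition originates in the forwarding path, which is
exactly where a hardware/service-processor split can become a bottleneck.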

> Minor Issues:

> 1. Sec 2 - Terminology should define Address Resolution as this
>  seems to be the core issue that the draft is discussing.

> Address Resolution: Address resolution is the process through which
>  a node determines the link-layer address of a neighbor given only
>  its IP address.  In IPv6, address resolution is performed as part
>  of Neighbor Discovery [RFC4861], Section 7.2.

How about:

	  <t hangText="Address Resolution:">
	    the process of determining the link-layer address
	    corresponding to a given IP address. In IPv4, address
	    resolution is performed by ARP <xref target="RFC0826"/>; in
	    IPv6, it is provided by Neighbor Discovery
	    (ND) <xref target="RFC4861"/>.</t>

> 2. In Sec 7.1 you mention that routers need to drop all transit
>  traffic when there is no response received for an ARP/ND
>  request. You should mention that in addition to this, routers also
>  need to send an ICMP host unreachable error packet back to the
>  sender. ICMP error packets are generated in the control card
>  CPU. So, if the CPU has to generate a high number of such ICMP
>  errors then this can load the CPU. The whole process can be quite
>  CPU as well as buffer intensive. The CPU/buffer overload is usually
>  mitigated by rate limiting the number of ICMP errors generated.


   "and may send an ICMP destination unreachable message as well."

> 3. In Sec 7.1 you mention that the entire ARP/ND process can be
>  quite CPU intensive since transit data traffic needs to be queued
>  while the address resolution is underway. You could mention that
>  this is mitigated by offloading the queuing part to the line card
>  CPUs so that the CPU on the control card is not inundated with such
>  packets. This obviously would only work on distributed systems that
>  have separate CPUs on the line cards and the main card.

There are many things one could say about ARP implementations. But
that is not the purpose of this document. It is really about outlining
the problems... So I think the above is getting too detailed.

> 4. Sec 7.1 should mention that this could be used as a DoS attack
>  wherein the attacker sends a high volume of packets for which ARPs
>  need to be resolved. This could result in genuine packets that need
>  to resolve ARPs getting dropped as there is only a finite rate at
>  which packets are sent to CPU for ARP resolution. Again this is
>  both CPU and buffer intensive.

Again, I don't think this document needs to cover all aspects of ND.

> 5. Sec 7.2 discusses issues with address resolution mechanism in
>  IPv6. I think its useful for this draft to discuss the fact that
>  unlike IPv4, IPv6 has subnets that are /64. This number is quite
>  large and will perhaps cover trillions of IP addresses, most of
>  which would be unassigned. Thus simplistic IPv6 ND implementations
>  can be vulnerable to attacks which inundates the CPU with huge
>  requests to perform address resolution for a large number of IPv6
>  addresses, most of which are unassigned. As a result of this
>  genuine IPv6 devices will not be able to join the network. You
>  might want to refer to RFC 6583 for more details.
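The sparseness argument behind that RFC 6583 concern is easy to
quantify (the host count below is an illustrative assumption):

```python
# An IPv6 /64 holds 2**64 addresses, so even a very large subnet leaves
# essentially every address unassigned -- a scanner probing the subnet
# forces address resolution almost entirely for addresses that will
# never answer, each of which can consume CPU and packet buffers.
subnet_size = 2 ** 64          # addresses in one /64
hosts = 100_000                # assumed (generously large) subnet population
assigned_fraction = hosts / subnet_size
print(f"/64 size: {subnet_size:,} addresses")
print(f"fraction assigned: {assigned_fraction:.1e}")
```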


> 6. The last paragraph of Sec 7.3 says the following:

> "Finally, IPv6 and IPv4 are often run simultaneously and in parallel
>  on the same network, i.e., in dual-stack mode.  In such
>  environments, the IPv4 and IPv6 issues enumerated above compound
>  each other."

> While I understand the sentiment behind the above statement, I fail
>  to see how this is related to the MAC problem being described in
>  Sec 7.3. The MAC scaling is a function of the total number of
>  unique MACs that the system has to learn and is orthogonal to the
>  presence of IPv4 or IPv6. I read this statement to mean that
>  something extra happens in the dual stack mode which exacerbates
>  the MAC problem even further. This I believe is patently not the
>  case.

That paragraph was intended to cover all of 7.1 and 7.2, and not be in
7.3. I'll move it.

> 7. Sec 11 - Security Considerations should at the very least give
>  pointers to references on issues related to ARP security
>  vulnerabilities. I don't see IPv6 ND mentioned at all. Since ND
>  relies on ICMPv6 and does not run directly over layer 2, there
>  could possibly be security concerns specific to ND in the data
>  center environments that don't apply to ARP. This document ought to
>  discuss those so that ARMD (or some other WG) can look at solutions
>  addressing those concerns.

Actually, I disagree somewhat. This document doesn't need to get into
all the security issues of ARP and/or ND. For one thing, they did not
come up as "problems" in ARMD. :-) I will put in pointers to the ND
security considerations section. How about I add the following:

    Security considerations for Neighbor Discovery are discussed in
    <xref target="RFC4861"></xref> and <xref target="RFC6583"></xref>.

> 8. Should it be mentioned in the document somewhere (sec 11?) that
>  data center administrators can configure ACLs to filter packets
>  addressed to unallocated IPv6 addresses? Folks can consider the
>  valid IPv6 address ranges and filter out packets that use the
>  unallocated addresses. Doing this will avoid unnecessary ARP
>  resolution for invalid IPv6 addresses. The list of the IPv6
>  addresses that are legitimate and should be permitted is small and
>  maintainable because of IPv6's address
>  hierarchy.
>  gives a list of large address blocks that have been allocated by
>  IANA.

IMO no. This goes beyond the scope of this document.

> Cheers, Manav

Thanks again for your detailed review!!