Re: [armd] review of draft-ietf-armd-problem-statement-02

Anoop Ghanwani <anoop@alumni.duke.edu> Sun, 03 June 2012 03:32 UTC

From: Anoop Ghanwani <anoop@alumni.duke.edu>
To: Thomas Narten <narten@us.ibm.com>
Cc: armd@ietf.org
Subject: Re: [armd] review of draft-ietf-armd-problem-statement-02

Hi Thomas,

Please see inline.

On Fri, May 25, 2012 at 12:43 PM, Thomas Narten <narten@us.ibm.com> wrote:

> Anoop Ghanwani <ghanwani@gmail.com> writes:
>
>> Section 4.4.1
>> ============
>
>> For consistency with the following 2 sections
>> change title from Layer 3 to L3.
>
>> "This topology is ideal for scenarios where servers
>>   attached to a particular access switch generally run applications
>>   that are are confined to using a single subnet."
>> I'm not sure I agree with this.  There are many issues
>> surrounding this, including the capabilities of the devices
>> in the network, the use of multicast, and the preferences
>> of the network administrator.
>
> I agree that "is ideal" is too strong. How about if I say instead:
>
>    This topology has benefits in scenarios ...
>
> Would that address your concerns?

The main problem that I have with this is that it
seems to suggest that, if an application can be
confined to some small number of machines,
then using L2 helps.  I'm not sure that is the case,
except perhaps when multicast is in use.

I would just refer to it as a commonly used
network design.

>
>> "Even though
>>    layer 2 traffic are still partitioned by VLANs, the fact that all
>>    VLANs are enabled on all ports can lead to broadcast traffic on all
>>    VLANs to traverse all links and ports, which is same effect as one
>>    big Layer 2 domain. "
>> I disagree with this because all VLANs would only
>> need to be provisioned on the aggregation-facing ports.
>> The disadvantages here are that a lot more broadcast
>> traffic hits the aggregation layer; that, when we need to
>> cross VLAN boundaries, the traffic must go all the way to
>> the aggregation switch even though the source and
>> destination may be on the same access switch; and that
>> larger ARP tables are required at the aggregation switches.
>
> I struggled with this for a long while. Is this text any better?:
>
>     <t> When the L3 domain only extends to aggregation switches,
>         hosts in any of the IP subnets configured on the aggregation
>         switches can be reachable via L2 through any access switches
>         if access switches enable all the VLANs.  This topology
>         allows a greater level of flexibility as servers attached to
>         any access switch can be reloaded with applications that have
>         been provisioned with IP addresses from multiple prefixes as
>         needed.  Further, in such an environment, VMs can migrate
>         between racks without IP address changes.  The drawback of
>         this design however is that multiple VLANs have to be enabled
>         on all access switches and all access-facing ports on
>         aggregation switches. Even though L2 traffic is still
>         partitioned by VLANs, the fact that all VLANs are enabled on
>         all ports can lead to broadcast traffic on all VLANs to
>         traverse all links and ports, which is same effect as one big
>         L2 domain on the access-facing side of the aggregation
>         switch.  In addition, internal traffic itself might have to
>         cross different L2 boundaries resulting in significant ARP/ND
>         load at the aggregation switches.  This design provides a
>         good tradeoff between flexibility and L2 domain size.  A
>         moderate sized data center might utilize this approach to
>         provide high availability services at a single location.
>         </t>

Yes, this is better.  I've made some edits which I think
help improve clarity, but you can make the call on which
text to use.

<t> When the L3 domain only extends to aggregation switches,
    hosts in any of the IP subnets configured on the aggregation
    switches can be reachable via L2 through any of the access
    switches that have the corresponding VLAN enabled.  This
    topology allows a greater level of flexibility as servers
    attached to any access switch can be reloaded with
    applications that have been provisioned with IP addresses
    from multiple prefixes as needed.  Further, in such an
    environment, VMs can migrate between racks without IP
    address changes.  The drawback of this design, however, is
    that VLANs may have to be enabled on all access switches
    that have members of the subnet and on the corresponding
    access-facing ports on aggregation switches.  Even though L2
    traffic is still partitioned by VLANs, the fact that VLANs
    are enabled on all ports can lead to broadcast traffic on
    those VLANs traversing all links and ports, which has the
    same effect as one big L2 domain on the access-facing side
    of the aggregation switch.  In addition, traffic within an
    access switch itself might have to cross different L2
    boundaries, resulting in significant ARP/ND load at the
    aggregation switches.  This design provides a good tradeoff
    between flexibility and L2 domain size.  A moderate sized
    data center might utilize this approach to provide high
    availability services at a single location.
    </t>
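
As an aside, here is a rough back-of-envelope sketch in Python, with purely
hypothetical switch and port counts (not numbers from the draft), of how much
wider the broadcast reach gets when a VLAN is trunked to every access switch
rather than pruned to only the switches that actually have members of the
subnet:

# Hypothetical numbers, for illustration only.
NUM_ACCESS_SWITCHES = 100      # access switches below one aggregation pair
PORTS_PER_ACCESS_SWITCH = 48   # server-facing ports per access switch
SWITCHES_WITH_MEMBERS = 5      # switches that actually host the subnet/VLAN

def broadcast_reach(switches_carrying_vlan):
    """Server-facing ports that see one ARP broadcast (uplinks ignored)."""
    return switches_carrying_vlan * PORTS_PER_ACCESS_SWITCH

print("VLAN trunked everywhere:", broadcast_reach(NUM_ACCESS_SWITCHES), "ports")
print("VLAN pruned to members: ", broadcast_reach(SWITCHES_WITH_MEMBERS), "ports")

The point being that pruning keeps the broadcast reach proportional to where
the subnet actually lives, while trunking everywhere behaves like one big L2
domain on the access-facing side.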


>
>> "However, the
>>    Overlay Edge switches/routers which perform the network address
>>    encapsulation/decapsulation must ultimately perform a L2 address
>>    resolution and could still potentially face scaling issues at that
>>    point."
>> It's not the overlay edge switches that have the scaling
>> problem; it's the volume of broadcasts that needs to be
>> sent across the core, and that is not helped simply by
>> using an L3 overlay.
>
> I also struggled quite a bit with this comment. Is the following an
> improvement?:
>
>    <t> A potential problem that arises in a large data center is when
>        a large number of hosts communicate with their peers in
>        different subnets, all these hosts send (and receive) data
>        packets to their respective L2/L3 boundary nodes as the
>        traffic flows are generally bi-directional.  This has the
>        potential to further highlight any scaling problems.  These
>        L2/L3 boundary nodes have to process ARP/ND requests sent from
>        originating subnets and resolve physical (MAC) addresses in
>        the target subnets for what are generally bi-directional
>        flows.  Therefore, for maximum flexibility in managing the
>        data center workload, it is often desirable to use overlays to
>        place related groups of hosts in the same topological subnet
>        to avoid the L2/L3 boundary translation.  The use of overlays
>        in the data center network can be a useful design mechanism to
>        help manage a potential bottleneck at the L2 / L3 boundary by
>        redefining where that boundary exists.  </t>

L3 overlays are being used to get around the
problem of needing to physically localize a given L2 domain.
However, this by itself does not address the ARP/ND
scaling issue because, if an L2 domain extends across
the aggregation/core layers, the ARP/ND messages
would need to be sent across the aggregation/core devices.
Thus the volume of broadcast traffic at the aggregation/core
layers is not reduced by the use of L3 overlays.  That's
the gist of what needs to be captured here.
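
To make that concrete, a minimal sketch with made-up numbers: the broadcast
ARP/ND rate that the aggregation/core devices carry scales with the number of
hosts in the stretched L2 domain, and an overlay only changes how those frames
are encapsulated, not how many there are.

# Made-up numbers for illustration; the overlay encapsulates the broadcasts,
# it does not remove them.
HOSTS_IN_STRETCHED_L2_DOMAIN = 10000
QUERIES_PER_HOST_PER_SEC = 0.5   # assumed steady-state ARP/ND query rate

core_rate = HOSTS_IN_STRETCHED_L2_DOMAIN * QUERIES_PER_HOST_PER_SEC
print("ARP/ND queries crossing the aggregation/core per second:",
      int(core_rate), "(same with or without an L3 overlay)")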

>
>> Section 6
>> ==========
>
>> "Thus, whereas all
>>    nodes must process every ARP query, ND queries are processed only by
>>    the nodes to which they are intended."
>> When virtualization is in use, the NIC is often operated
>> in promiscuous mode, which means that the packet would
>> be delivered to the hypervisor/vswitch and the filtering
>> would have to be done there (usually implemented in software),
>> making the problem almost as bad as with ARP.
>
> Revised text:
>
>    Thus, whereas all nodes must process every ARP query, ND queries
>    are processed only by the nodes to which they are intended. In
>    cases where multicast filtering can't effectively be implemented
>    in the NIC (e.g., as on hypervisors supporting virtualization),
>    filtering would need to be done in software (e.g., in the
>    hypervisor's vSwitch).

This is good.
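
One small addition that might help readers see why ND solicitations reach only
the intended node when the NIC can filter: a short Python sketch (the target
address below is hypothetical) that derives the solicited-node multicast group
and the multicast MAC a NIC would be programmed to accept.

import ipaddress

def solicited_node_group(addr):
    """RFC 4291 solicited-node group: ff02::1:ff00:0/104 plus the low 24 bits."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

def multicast_mac(group):
    """RFC 2464 Ethernet mapping: 33:33 plus the low 32 bits of the group."""
    low32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
    return "33:33:" + ":".join("%02x" % ((low32 >> s) & 0xFF) for s in (24, 16, 8, 0))

target = "2001:db8::abcd:1234"    # hypothetical address being resolved
group = solicited_node_group(target)
print(group)                      # ff02::1:ffcd:1234
print(multicast_mac(group))       # 33:33:ff:cd:12:34

A NIC doing multicast filtering only passes up frames for group MACs it has
been asked to accept; with the NIC in promiscuous mode that filter is off,
which is why the vSwitch ends up doing the equivalent work in software.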

Thanks,
Anoop