Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Benoit Claise <bclaise@cisco.com> Fri, 28 March 2014 11:24 UTC

Message-ID: <53355BD8.2030106@cisco.com>
Date: Fri, 28 Mar 2014 12:24:08 +0100
From: Benoit Claise <bclaise@cisco.com>
To: Anoop Ghanwani <anoop@alumni.duke.edu>
References: <CA+-tSzxDpD2V7Q15Jjgzz2A+d5Gn_92YQ-1_Zvx2AP=s5AWpxA@mail.gmail.com> <531AF602.5070400@cisco.com>
In-Reply-To: <531AF602.5070400@cisco.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/opsawg/uVvv0POCamKZ4RGbWXFrw-N-AnQ
Cc: "opsawg@ietf.org" <opsawg@ietf.org>
Subject: Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Hi Anoop, Ramki,

A gentle reminder.

Regards, Benoit
> Hi Anoop,
>
> Please post a new draft version, and I'll review the diffs.
> Some more answers in-line.
>
> Regards, Benoit
>>
>> Hi Benoit,
>>
>> Thanks for the detailed and careful review.  Comments inline.
>>
>> Anoop
>>
>> ====
>>
>>
>> On Tue, Feb 18, 2014 at 7:55 AM, Benoit Claise <bclaise@cisco.com> wrote:
>>
>>     Dear authors,
>>
>>     Here is my AD review of draft-ietf-opsawg-large-flow-load-balancing
>>
>>     - Section 1:
>>         Networks extensively use link aggregation groups (LAG) [802.1AX] and
>>         equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
>>         scaling. For the problems addressed by this document, network traffic
>>         can be predominantly categorized into two traffic types: long-lived
>>         large flows and other flows.
>>
>>         ...
>>
>>         This draft describes mechanisms for optimal LAG/ECMP component link
>>         utilization while using hash-based techniques. The mechanisms
>>         comprise the following steps -- recognizing _large flows_ in a router;
>>         and assigning the large flows to specific LAG/ECMP component links or
>>         redistributing the small flows when a component link on the router is
>>         congested.
>>
>>         It is useful to keep in mind that in typical use cases for this
>>         mechanism the _large flows_ are those that consume a significant amount
>>         of bandwidth on a link, e.g. greater than 5% of link bandwidth.  The
>>         number of such flows would necessarily be fairly small, e.g. on the
>>         order of 10's or 100's per LAG/ECMP.  In other words, the number of
>>         _large flows_ is NOT expected to be on the order of millions of flows.
>>         Examples of such large flows would be IPsec tunnels in service
>>         provider backbone networks or storage backup traffic in data center
>>         networks.
>>
>>     3 instances of "large flows": do you mean "long-lived large flows"?
>>     If not, why do you make a distinction between long-lived large
>>     flows and other flows in the first paragraph?
>>     I eventually understood the source of confusion when I read the
>>     terminology section:
>>         Large flow(s): long-lived large flow(s)
>>
>>     Either use capitalized term in the Intro section (actually
>>     throughout the doc.) so that we understand that the term is
>>     defined somewhere, or make it clear in the intro that large
>>     flow(s) = long-lived large flow(s)
>>
>>
>> Yes, they are all referring to long-lived large flows.  We will 
>> change the early part of Section 1 to clarify that long-lived large 
>> flows are, thereafter in the document, referred to as large flows.
> Or replace large flow by long-lived flow where it makes sense in the draft.
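As a minimal sketch of the terminology under discussion, a flow can be treated as a "large flow" when its rate, averaged over a long enough observation interval, exceeds a fraction of the link bandwidth. The 5% figure comes from the quoted Section 1 text; the `Flow` type, the flow names, and the `is_large_flow` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    flow_id: str      # hypothetical identifier, e.g. a 5-tuple rendered as text
    rate_bps: float   # average rate over a long observation interval (assumed)


def is_large_flow(flow: Flow, link_bps: float, threshold: float = 0.05) -> bool:
    """Return True if the flow uses more than `threshold` of the link bandwidth."""
    return flow.rate_bps > threshold * link_bps


flows = [
    Flow("ipsec-tunnel-a", 1.2e9),
    Flow("web-session-b", 3.0e6),
    Flow("backup-c", 0.6e9),
]
# On a 10 Gb/s link the 5% threshold is 500 Mb/s.
large = [f.flow_id for f in flows if is_large_flow(f, link_bps=10e9)]
```

With these made-up rates, only the tunnel and the backup flow cross the threshold, which matches the expectation that large flows number in the tens or hundreds rather than millions.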
>>
>>     -
>>
>>         This document presents improved load distribution techniques based on
>>         the large flow awareness.
>>
>>     Improved compared to?
>>
>>
>> Improved compared to static hash-based distribution techniques that 
>> do not account for the bandwidth of the flows.  Will reword as follows:
>>
>> "This document presents mechanisms for improving the load 
>> distribution problem resulting from stateless hashing as seen in the 
>> above example."
> ok
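To illustrate why stateless hashing can leave component links unevenly loaded, here is a small sketch. It assumes a CRC32 of a textual flow key as the hash, which is a stand-in chosen for determinism and not something the draft specifies; `hash_link` and `link_loads` are hypothetical names.

```python
import zlib
from typing import Dict, List


def hash_link(flow_key: str, n_links: int) -> int:
    """Pick a component link from the flow key alone; bandwidth is ignored."""
    return zlib.crc32(flow_key.encode("utf-8")) % n_links


def link_loads(rates_bps: Dict[str, float], n_links: int) -> List[float]:
    """Sum the offered load per component link under hash-based placement."""
    loads = [0.0] * n_links
    for key, rate in rates_bps.items():
        loads[hash_link(key, n_links)] += rate
    return loads
```

Because the link choice depends only on the key, two large flows that happen to hash to the same link stay there regardless of how much bandwidth they consume.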
>>
>>     -
>>     In several places, starting with the title and abstract, you
>>     speak about mechanisms (plural).
>>     However, looking at section 4.2, it seems that you propose a
>>     single mechanism? Or maybe you consider 4.1, 4.2, 4.3 as
>>     different mechanisms?
>>
>>
>> The title of 4.2 is perhaps misleading and should just be 
>> "Operational Overview."
> ok
>>  Otherwise the rest of the draft discusses several mechanisms 
>> (multiple choices for large flow identification, and multiple choices 
>> for rebalancing).
>>
>>     -
>>
>>     Step 3) On receiving the alert about the congested component link,
>>         the operator, through a central management entity, finds the large
>>         flows mapped to that component link and the LAG/ECMP group to which
>>         the component link belongs.
>>
>>         Step 4) The operator can choose to rebalance the large flows on
>>         lightly loaded component links of the LAG/ECMP group or redistribute
>>         the small flows on the congested link to other component links of the
>>         group. The operator, through a central management entity, can choose
>>         one of the following actions:
>>
>>            1) Indicate specific large flows to rebalance;
>>
>>            2) Have the router decide the best large flows to rebalance;
>>
>>            3) Have the router redistribute all the small flows on the
>>         congested link to other component links in the group.
>>
>>     "Indicate specific large flows to rebalance", "through a central
>>     management entity", what you describe is basically traffic
>>     engineering.
>>     On the other hand, for 2) and 3), why do you need a central
>>     management entity?
>>
>>
>> The assumption was that the router is controlled by a central 
>> management entity for the purpose of this function, but that is 
>> clearly not a requirement.  The text will be modified to mention that 
>> a central management entity may be used (i.e. not required).
> Ok.
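Option 2 in Step 4 (the router decides which large flow to rebalance) could look roughly like the sketch below. The draft states that placement algorithms are out of scope, so the greedy choice here (move the biggest large flow on the congested link to the least-loaded other link) is purely an assumption for illustration, as are the function and parameter names. Only large-flow load is tracked; small flows are assumed to stay hash-distributed.

```python
from typing import Dict, Optional, Tuple


def rebalance_one(
    placement: Dict[str, int],
    rates_bps: Dict[str, float],
    n_links: int,
    congested: int,
) -> Optional[Tuple[str, int]]:
    """Move the biggest large flow off `congested`; return (flow, new_link) or None."""
    # Load contributed by large flows on each component link.
    loads = [0.0] * n_links
    for flow, link in placement.items():
        loads[link] += rates_bps[flow]

    candidates = [f for f, link in placement.items() if link == congested]
    others = [l for l in range(n_links) if l != congested]
    if not candidates or not others:
        return None

    flow = max(candidates, key=lambda f: rates_bps[f])
    target = min(others, key=lambda l: loads[l])

    # Skip the move if it would only shift the hot spot to the target link.
    if loads[target] + rates_bps[flow] >= loads[congested]:
        return None

    placement[flow] = target
    return flow, target
```

The same function could run on the router or in a central management entity; nothing in it depends on where the decision is made.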
>>
>>     -
>>
>>         A number of routers support sampling techniques such as sFlow [sFlow-
>>         v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
>>         For the purpose of large flow identification, sampling must be
>>         enabled on all of the egress ports in the router where such
>>         measurements are desired.
>>
>>     I don't understand the second sentence.
>>     One way to read this is:  sampling must be _enabled_ on all of
>>     the egress ports where such measurements are desired.
>>         Ok, this is an obvious statement. If the measurements are
>>     desired, enable them
>>
>>
>> Yes,
> ok please clarify the text.
>>
>>     Or maybe you want to say: _sampling_ must be enabled on all of
>>     the egress ports where such measurements are desired.
>>         This is a false statement: if you have the choice between
>>     sampling and non sampling, use non sampling measurements.
>>     Or maybe you want to say: sampling must be enabled on _all_ of
>>     the egress ports where such measurements are desired.
>>         This is a false statement: if I have ECMP on 2 links, and
>>     only one of them can't do non sampling, then we should not force
>>         sampling on both links.
>>     You see, I'm confused.
>>
>>     You miss a couple of key messages:
>>     - if unsampled measurements are available, use those.
>>     - egress means where LAG/ECMP are enabled (this is important for
>>     the paragraph starting with "If egress sampling is not available,
>>     ingress sampling can suffice since the central management entity
>>     use")
>>
>>
>> We were not intending to discuss a mix of sampling and non-sampling 
>> interfaces in the same router, but this is a reasonable point and it 
>> will be clarified (i.e. we will state that it's possible to mix 
>> sampled and non sampled interfaces as long as the function of large 
>> flow detection/identification can be performed).
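For interfaces that only offer sampled measurements, per-flow rates have to be estimated from the samples. The sketch below assumes simple 1-in-N packet sampling and scales the sampled bytes by N; the `(flow_key, packet_bytes)` input shape and the function name are hypothetical and do not correspond to any particular sFlow, PSAMP, or NetFlow record layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def estimate_rates_bps(
    samples: Iterable[Tuple[str, int]],
    sampling_n: int,
    interval_s: float,
) -> Dict[str, float]:
    """Estimate per-flow bit rates from 1-in-N sampled packets seen in one interval."""
    sampled_bytes: Dict[str, int] = defaultdict(int)
    for flow_key, packet_bytes in samples:
        sampled_bytes[flow_key] += packet_bytes
    # Each sampled packet stands in for roughly `sampling_n` packets on the wire.
    return {
        key: total * sampling_n * 8 / interval_s
        for key, total in sampled_bytes.items()
    }
```

Unsampled counters, where available, avoid the estimation error altogether, which is the reviewer's point about preferring them.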
>>
>>
>>     -
>>
>>         If egress sampling is not available, ingress sampling can suffice
>>         since the central management entity used by the sampling technique
>>         typically has multi-node visibility and can use the samples from an
>>         immediately downstream node to make measurements for egress traffic
>>         at the local node.
>>
>>     It's not clear if "ingress" means the ingress interface of the
>>     router itself, or the ingress interface of the downstream router.
>>     A drawing is required.
>>     Both options are possible:
>>     1. ingress interfaces on the router where LAG/ECMP is initiated
>>             flow monitoring must be enabled on all ingress interfaces
>>             flow monitoring must have a way to know the egress interfaces
>>     2. ingress interfaces of the downstream router
>>             only works for LAG or ECMP single hop
>>             ingress interfaces = all components from LAG/ECMP
>>             (multiple ifIndex, typically)
>>
>>
>> What we meant here was that ingress sampling would have to be enabled 
>> on the downstream device (hence the central management entity must 
>> come into play to identify large flows).
> I still believe that a drawing would clarify things.
>>
>>
>>     this entire section 4.3.3 needs some improvements
>>
>>
>>     -
>>     On one side, you wrote "Specific algorithms for placement of
>>     large flows are out of scope of this document.". On the other
>>     side, "The following parameters are required for the configuration of
>>     _this_ feature". It seems contradictory.
>>     It's unclear why you need the following parameter:
>>
>>           .  Imbalance threshold: the difference between the utilization of
>>              the least utilized and most utilized component links.  Expressed
>>              as a percentage of link speed.
>>
>>     Also, does ECMP/LAG always require equivalent link speed for their components?
>>
>> The imbalance threshold is a measure of how much imbalance one is 
>> willing to tolerate before taking the hit of potential packet 
>> reordering in some flows.  Will clarify.
>>
>> Thanks for catching the issue with link speed.  While in most cases 
>> speeds are consistent, there may be the case of composite links which 
>> combine links of different speeds (actually permitted by 802.1AX), so 
>> we will provide a generalized formula for the imbalance threshold 
>> which takes into account the individual speeds of each of the 
>> component links.
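The thread does not give the generalized formula, so the following is only one plausible reading: express each component link's utilization as a percentage of its own speed and take the gap between the most and least utilized links. The function names and the example numbers are assumptions.

```python
from typing import Sequence


def imbalance_pct(loads_bps: Sequence[float], speeds_bps: Sequence[float]) -> float:
    """Gap, in percentage points, between the most and least utilized links."""
    if not loads_bps or len(loads_bps) != len(speeds_bps):
        raise ValueError("need one load and one speed per component link")
    # Normalizing by each link's own speed handles mixed-speed groups.
    utilizations = [100.0 * load / speed for load, speed in zip(loads_bps, speeds_bps)]
    return max(utilizations) - min(utilizations)


def needs_rebalance(
    loads_bps: Sequence[float],
    speeds_bps: Sequence[float],
    threshold_pct: float,
) -> bool:
    """True when the imbalance exceeds what the operator is willing to tolerate."""
    return imbalance_pct(loads_bps, speeds_bps) > threshold_pct
```

For example, 8 Gb/s on a 10 Gb/s link and 20 Gb/s on a 100 Gb/s link give utilizations of 80% and 20%, so the imbalance is 60 percentage points even though the faster link carries more traffic in absolute terms.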
>>
>>       
>>     -
>>     5.2. System Configuration and Identification Parameters
>>
>>           .  IP address: The IP address of a specific router that the
>>              feature is being configured on, or that the large flow placement
>>              is being applied to.
>>
>>           .  LAG ID: Identifies the LAG. The LAG ID may be required when
>>              configuring this feature (to apply a specific set of large flow
>>              identification parameters to the LAG) and will be required when
>>              specifying flow placement to achieve the desired rebalancing.
>>
>>           .  Component Link ID: Identifies the component link within a LAG.
>>              This is required when specifying flow placement to achieve the
>>              desired rebalancing.
>>
>>     Nothing regarding ECMP?
>>
>>
>> Initially we were more focused on getting this done for LAG, but then 
>> we completely overlooked ECMP.  5.2, 5.3, and 5.4 would probably 
>> benefit from a bit of clean-up as follows:
>>
>> Add the following to 5.2:
>>
>> ECMP group: Identifies a particular ECMP group.
>>
>> ECMP nexthop: Identifies a particular nexthop within an ECMP group.
>>
>> Add the following line to the end of section 5.3.
>>
>> When using ECMP, the nexthop within an ECMP group is used to identify 
>> the component link for placing the large flow.
>>
>> Add the following to the end of Section 5.4.
>>
>> When using ECMP, the ECMP group and the corresponding Nexthops along
>> with the percentage of traffic to be assigned to each Nexthop are
>> required.  Finally it is also possible that an ECMP Nexthop itself
>> comprises a LAG in which case both the Nexthop and the LAG Component
>> ID would need to be specified, and the weights of both the Nexthops
>> within the ECMP Group and the Component Links within the LAG would
>> need to be adjusted.
> Ok.
>
> Regards, Benoit
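The identification parameters proposed above (ECMP group, ECMP nexthop, LAG ID, Component Link ID, and per-member traffic percentages) can be pictured as a nested structure. This is a sketch only: the class names, field names, and the rule that weights sum to 100 are assumptions made for illustration, not definitions from the draft.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Lag:
    lag_id: str
    component_weights_pct: Dict[str, float]  # Component Link ID -> percent of traffic


@dataclass
class Nexthop:
    address: str                # identifies the nexthop within the ECMP group
    weight_pct: float           # percent of the group's traffic
    lag: Optional[Lag] = None   # set when the nexthop is itself reached over a LAG


@dataclass
class EcmpGroup:
    group_id: str
    nexthops: List[Nexthop]


def weights_valid(group: EcmpGroup, tol: float = 1e-6) -> bool:
    """Check that weights sum to 100% at the ECMP level and inside each LAG."""
    if abs(sum(n.weight_pct for n in group.nexthops) - 100.0) > tol:
        return False
    for nexthop in group.nexthops:
        if nexthop.lag is not None:
            total = sum(nexthop.lag.component_weights_pct.values())
            if abs(total - 100.0) > tol:
                return False
    return True
```

The nesting mirrors the last case in the proposed text, where adjusting traffic means touching both the nexthop weights within the ECMP group and the component link weights within the LAG.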
>>
>>
>>     -
>>
>>         For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
>>         can be used.
>>
>>     Well, only for ethernet.
>>
>> Will clarify that.
>>
>>
>>
>>
>>     EDITORIAL:
>>     -
>>     figure 2
>>
>>     OLD:
>>
>>                        +-----------+ ->     +-----------+
>>                        |           | ->     |           |
>>                        |           | ===>   |           |
>>                        |        (1)|--------|(1)        |
>>                        |           | ->     |           |
>>                        |           | ->     |           |
>>                        |  (R1)    | ->     |   (R2)  |
>>
>>     NEW:
>>
>>                        +-----------+ ->     +-----------+
>>                        |           | ->     |           |
>>                        |           | ===>   |           |
>>                        |        (1)|--------|(1)        |
>>                        |           | ->     |           |
>>                        |           | ->     |           |
>>                        |  (R1)     | ->     |   (R2)    |
>>
>>
>>
>> Will fix.
>>
>>
>>     -  The indentation in section 2 is not correct
>>
>>
>> Will fix.
>>
>>
>>     - "For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc.,"
>>     You need to expand and provide references.
>>
>> Will provide references.  What do you mean by expand -- just expand the 
>> acronyms (already in the acronym section) or something else?
>>
>>     - a PBR rule
>>     Expand.
>>
>> OK
>>
>>     -
>>     OLD:
>>
>>                        +-----------+ ->     +-----------+
>>                        |           | ->     |           |
>>                        |           | ===>   |           |
>>                        |        (1)|--------|(1)        |
>>                        |           |        |           |
>>                        |           | ===>   |           |
>>                        |           | ->     |           |
>>                        |           | ->     |           |
>>                        |  (R1)    | ->     |   (R2)  |
>>                        |        (2)|--------|(2)        |
>>
>>     NEW:
>>                        +-----------+ ->     +-----------+
>>                        |           | ->     |           |
>>                        |           | ===>   |           |
>>                        |        (1)|--------|(1)        |
>>                        |           |        |           |
>>                        |           | ===>   |           |
>>                        |           | ->     |           |
>>                        |           | ->     |           |
>>                        |  (R1)     | ->     |   (R2)    |
>>                        |        (2)|--------|(2)        |
>>
>> Will fix.
>>
>>     -
>>     OLD:
>>     The IPFIX information model [RFC 7011]
>>     NEW:
>>     The IPFIX information model [RFC 7012]
>>
>> Will fix.
>>
>>     Regards, Benoit
>>
>>
>>
>>
>>
>