Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Benoit Claise <bclaise@cisco.com> Sat, 08 March 2014 11:29 UTC

Message-ID: <531AF602.5070400@cisco.com>
Date: Sat, 08 Mar 2014 10:50:42 +0000
From: Benoit Claise <bclaise@cisco.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: Anoop Ghanwani <anoop@alumni.duke.edu>
References: <CA+-tSzxDpD2V7Q15Jjgzz2A+d5Gn_92YQ-1_Zvx2AP=s5AWpxA@mail.gmail.com>
In-Reply-To: <CA+-tSzxDpD2V7Q15Jjgzz2A+d5Gn_92YQ-1_Zvx2AP=s5AWpxA@mail.gmail.com>
Content-Type: multipart/alternative; boundary="------------070802060801080208020703"
Archived-At: http://mailarchive.ietf.org/arch/msg/opsawg/g8dDL5Wt8lkYU0DgeLl5GHw8qK8
Cc: "opsawg@ietf.org" <opsawg@ietf.org>
Subject: Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Hi Anoop,

Please post a new draft version, and I'll review the diffs.
Some more answers in-line.

Regards, Benoit
>
> Hi Benoit,
>
> Thanks for the detailed and careful review.  Comments inline.
>
> Anoop
>
> ====
>
>
> On Tue, Feb 18, 2014 at 7:55 AM, Benoit Claise <bclaise@cisco.com> wrote:
>
>     Dear authors,
>
>     Here is my AD review of draft-ietf-opsawg-large-flow-load-balancing
>
>     - Section 1:
>         Networks extensively use link aggregation groups (LAG) [802.1AX] and
>         equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
>         scaling. For the problems addressed by this document, network traffic
>         can be predominantly categorized into two traffic types: long-lived
>         large flows and other flows.
>
>         ...
>
>         This draft describes mechanisms for optimal LAG/ECMP component link
>         utilization while using hash-based techniques. The mechanisms
>         comprise the following steps -- recognizing _large flows_ in a router;
>         and assigning the large flows to specific LAG/ECMP component links or
>         redistributing the small flows when a component link on the router is
>         congested.
>
>         It is useful to keep in mind that in typical use cases for this
>         mechanism the _large flows_ are those that consume a significant amount
>         of bandwidth on a link, e.g. greater than 5% of link bandwidth.  The
>         number of such flows would necessarily be fairly small, e.g. on the
>         order of 10's or 100's per LAG/ECMP.  In other words, the number of
>         _large flows_ is NOT expected to be on the order of millions of flows.
>         Examples of such large flows would be IPsec tunnels in service
>         provider backbone networks or storage backup traffic in data center
>         networks.
>
>     3 instances of "large flows": do you mean "long-lived large flows"?
>     If not, why do you make a distinction between long-lived large
>     flows and other flows in the first paragraph?
>     I eventually understood the source of confusion when I read the
>     terminology section:
>         Large flow(s): long-lived large flow(s)
>
>     Either use capitalized term in the Intro section (actually
>     throughout the doc.) so that we understand that the term is
>     defined somewhere, or make it clear in the intro that large
>     flow(s) = long-lived large flow(s)
>
>
> Yes, they are all referring to long-lived large flows.  We will change 
> the early part of Section 1 to clarify that long-lived large flows 
> are, thereafter in the document, referred to as large flows.
Or replace large flow by long-lived flow where it makes sense in the draft.
>
>     -
>
>         This document presents improved load distribution techniques based on
>         the large flow awareness.
>
>     Improved compared to?
>
>
> Improved compared to static hash-based distribution techniques that do 
> not account for the bandwidth of the flows.  Will reword as follows:
>
> "This document presents mechanisms for improving the load distribution 
> problem resulting from stateless hashing as seen in the above example."
ok
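
For illustration only (this is not text from the draft): a minimal Python sketch of why stateless hashing can load component links unevenly. The hash picks a link from the flow key alone, so a flow's bandwidth never enters the decision. The CRC32 hash, the flow keys, and the rates are assumptions made for the example; real routers hash on packet header fields with implementation-specific functions.

```python
import zlib


def hash_link(flow_key: str, num_links: int) -> int:
    """Stateless placement: choose a component link from a hash of the flow key only."""
    # CRC32 is an assumed stand-in for a router's hash function.
    return zlib.crc32(flow_key.encode()) % num_links


def link_loads(flows: dict[str, float], num_links: int) -> list[float]:
    """Sum the offered rate (Mb/s) that the hash places on each component link."""
    loads = [0.0] * num_links
    for key, rate in flows.items():
        # The rate is added after the link is chosen; it never influences the choice.
        loads[hash_link(key, num_links)] += rate
    return loads
```

Because the rate plays no part in hash_link(), nothing stops two multi-Gb/s flows from landing on the same component link while another link carries only small flows.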
>
>     -
>     In several places, starting with the title and abstract, you speak
>     about mechanisms (plural).
>     However, looking at section 4.2, it seems that you propose a
>     single mechanism? Or maybe you consider 4.1, 4.2, 4.3 as different
>     mechanisms?
>
>
> The title of 4.2 is perhaps misleading and should just be "Operational 
> Overview."
ok
>  Otherwise the rest of the draft discusses several mechanisms 
> (multiple choices for large flow identification, and multiple choices 
> for rebalancing).
>
>     -
>
>     Step 3) On receiving the alert about the congested component link,
>         the operator, through a central management entity, finds the large
>         flows mapped to that component link and the LAG/ECMP group to which
>         the component link belongs.
>
>         Step 4) The operator can choose to rebalance the large flows on
>         lightly loaded component links of the LAG/ECMP group or redistribute
>         the small flows on the congested link to other component links of the
>         group. The operator, through a central management entity, can choose
>         one of the following actions:
>
>            1) Indicate specific large flows to rebalance;
>
>            2) Have the router decide the best large flows to rebalance;
>
>            3) Have the router redistribute all the small flows on the
>         congested link to other component links in the group.
>
>     "Indicate specific large flows to rebalance", "through a central
>     management entity", what you describe is basically traffic
>     engineering.
>     On the other hand, for 2) and 3), why do you need a central
>     management entity?
>
>
> The assumption was that the router is controlled by a central 
> management entity for the purpose of this function, but that is 
> clearly not a requirement.  The text will be modified to mention that 
> a central management entity may be used (i.e. not required).
Ok.
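
To make action 2) of Step 4 concrete (the router picks the large flow to move), here is a hedged Python sketch. The draft puts specific placement algorithms out of scope, so the greedy rule below is an arbitrary assumption for illustration, as are the function and parameter names. It runs locally on the placement table and needs no central management entity.

```python
def rebalance_one(placement: dict[str, int], rates: dict[str, float],
                  num_links: int, congested: int) -> dict[str, int]:
    """Move the largest flow on the congested link to the least-loaded other link.

    placement maps flow ID -> component link index; rates maps flow ID -> Mb/s.
    Returns a new placement and leaves the input unchanged.
    """
    loads = [0.0] * num_links
    for flow, link in placement.items():
        loads[link] += rates[flow]

    on_congested = [f for f, link in placement.items() if link == congested]
    if not on_congested or num_links < 2:
        return dict(placement)

    victim = max(on_congested, key=lambda f: rates[f])
    target = min((l for l in range(num_links) if l != congested),
                 key=lambda l: loads[l])

    new_placement = dict(placement)
    # Move only if the target ends up less loaded than the congested link was,
    # so the move does not simply shift the congestion elsewhere.
    if loads[target] + rates[victim] < loads[congested]:
        new_placement[victim] = target
    return new_placement
```

Moving a flow can reorder its packets, which is why a real implementation would likely limit how often such moves happen.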
>
>     -
>
>         A number of routers support sampling techniques such as sFlow [sFlow-
>         v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
>         For the purpose of large flow identification, sampling must be
>         enabled on all of the egress ports in the router where such
>         measurements are desired.
>
>     I don't understand the second sentence.
>     One way to read this is:  sampling must be _enabled_ on all of the
>     egress ports where such measurements are desired.
>         Ok, this is an obvious statement. If the measurements are
>     desired, enable them
>
>
> Yes,
ok please clarify the text.
>
>     Or maybe you want to say: _sampling_ must be enabled on all of the
>     egress ports where such measurements are desired.
>         This is a false statement: if you have the choice between
>     sampling and non sampling, use non sampling measurements.
>     Or maybe you want to say: sampling must be enabled on _all_ of the
>     egress ports where such measurements are desired.
>         This is a false statement: if I have ECMP on 2 links, and only
>     one of them can't do non sampling, then we should not force
>         sampling on both links.
>     You see, I'm confused.
>
>     You miss a couple of key messages:
>     - if unsampled measurements are available, use those.
>     - egress means where LAG/ECMP are enabled (this is important for
>     the paragraph starting with "If egress sampling is not available,
>     ingress sampling can suffice since the central management entity use")
>
>
> We were not intending to discuss a mix of sampling and non-sampling
> interfaces in the same router, but this is a reasonable point and it 
> will be clarified (i.e. we will state that it's possible to mix 
> sampled and non sampled interfaces as long as the function of large 
> flow detection/identification can be performed).
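
As a rough illustration of how sampled and unsampled interfaces can feed the same large flow test, here is a Python sketch. It is an assumption-laden example, not the draft's method: it uses simple 1-in-N scaling of sampled byte counts, treats an unsampled interface as N = 1, and borrows the 5% of link bandwidth figure that the draft gives as an example in Section 1. All names are hypothetical.

```python
def estimate_rate_bps(sampled_bytes: int, sampling_rate: int, interval_s: float) -> float:
    """Scale bytes observed with 1-in-N packet sampling up to an estimated bit rate.

    sampling_rate is N; use N = 1 for an interface that reports unsampled counts.
    """
    return sampled_bytes * sampling_rate * 8 / interval_s


def is_large_flow(rate_bps: float, link_speed_bps: float, threshold: float = 0.05) -> bool:
    """Flag a flow whose estimated rate exceeds a fraction of the link speed."""
    # 0.05 mirrors the draft's "e.g. greater than 5% of link bandwidth".
    return rate_bps > threshold * link_speed_bps
```

For example, 1,000,000 sampled bytes at 1-in-1000 sampling over 10 seconds scales to an estimated 800 Mb/s, which exceeds 5% of a 10 Gb/s link. The estimate carries sampling error, which is one reason to prefer unsampled measurements where they are available.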
>
>
>     -
>
>         If egress sampling is not available, ingress sampling can suffice
>         since the central management entity used by the sampling technique
>         typically has multi-node visibility and can use the samples from an
>         immediately downstream node to make measurements for egress traffic
>         at the local node.
>
>     It's not clear if "ingress" means the ingress interface of the
>     router itself, or the ingress interface of the downstream router.
>     A drawing is required.
>     Both options are possible:
>         1. ingress interfaces on the router where LAG/ECMP is initiated
>             flow monitoring must be enabled on all ingress interfaces
>             flow monitoring must have a way to know the egress interfaces
>     2. ingress interfaces of the downstream router
>             only work for LAG or ECMP single hop
>             ingress interfaces = all components from LAG/ECMP
>     (multiple ifIndex, typically)
>
>
> What we meant here was that ingress sampling would have to be enabled
> on the downstream device (hence the central management entity must 
> come into play to identify large flows).
I still believe that a drawing would clarify things.
>
>
>     this entire section 4.3.3 needs some improvements
>
>
>     -
>     On one side, you wrote "Specific algorithms for placement of large
>     flows are out of scope of this document.". On the other side, "The
>     following parameters are required the configuration of _this_
>     feature". It seems contradictory.
>     It's unclear why you need the following parameter:
>
>           .  Imbalance threshold: the difference between the utilization of
>              the least utilized and most utilized component links.  Expressed
>              as a percentage of link speed.
>
>     Also, does ECMP/LAG always require equivalent link speed for their components?
>
> The imbalance threshold is a measure of how much imbalance one is 
> willing to tolerate before taking the hit of potential packet 
> reordering in some flows.  Will clarify.
>
> Thanks for catching the issue with link speed.  While in most cases 
> speeds are consistent, there may be the case of composite links which 
> combine links of different speeds (actually permitted by 802.1AX), so 
> we will provide a generalized formula for the imbalance threshold 
> which takes into account the individual speeds of each of the 
> component links.
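
The generalized formula itself does not appear in this thread. As one possible reading, offered purely as an assumption, the sketch below measures each component link's utilization against its own speed and takes the spread between the most and least utilized links. Function and parameter names are hypothetical.

```python
def imbalance_pct(loads_bps: list[float], speeds_bps: list[float]) -> float:
    """Spread between the most and least utilized component links, in percent.

    Each link's utilization is its load divided by its own speed, so links
    of different speeds can be compared on the same scale.
    """
    utils = [load / speed for load, speed in zip(loads_bps, speeds_bps)]
    return (max(utils) - min(utils)) * 100.0


def needs_rebalancing(loads_bps: list[float], speeds_bps: list[float],
                      threshold_pct: float) -> bool:
    """True when the imbalance exceeds the tolerated threshold."""
    return imbalance_pct(loads_bps, speeds_bps) > threshold_pct
```

With a 10 Gb/s link carrying 8 Gb/s and a 1 Gb/s link carrying 0.5 Gb/s, the utilizations are 80% and 50%, an imbalance of about 30 percentage points. A higher threshold tolerates more imbalance in exchange for fewer flow moves and therefore less potential packet reordering.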
>
>       
>     -
>     5.2. System Configuration and Identification Parameters
>
>           .  IP address: The IP address of a specific router that the
>              feature is being configured on, or that the large flow placement
>              is being applied to.
>
>           .  LAG ID: Identifies the LAG. The LAG ID may be required when
>              configuring this feature (to apply a specific set of large flow
>              identification parameters to the LAG) and will be required when
>              specifying flow placement to achieve the desired rebalancing.
>
>           .  Component Link ID: Identifies the component link within a LAG.
>              This is required when specifying flow placement to achieve the
>              desired rebalancing.
>
>     Nothing regarding ECMP?
>
>
> Initially we were more focused on getting this done for LAG, but then 
> we completely overlooked ECMP.  5.2, 5.3, and 5.4 would probably 
> benefit from a bit of clean-up as follows:
>
> Add the following to 5.2:
>
> ECMP group: Identifies a particular ECMP group.
>
> ECMP nexthop: Identifies a particular nexthop within an ECMP group.
>
> Add the following line to the end of section 5.3.
>
> When using ECMP, the nexthop within an ECMP group is used to identify 
> the component link for placing the large flow.
>
> Add the following to the end of Section 5.4.
>
> When using ECMP, the ECMP group and the corresponding Nexthops, along 
> with the percentage of traffic to be assigned to each Nexthop, are 
> required.  Finally, it is also possible that an ECMP Nexthop itself 
> comprises a LAG, in which case both the Nexthop and the LAG Component 
> ID would need to be specified, and the weights of both the Nexthops 
> within the ECMP Group and the Component Links within the LAG would 
> need to be adjusted.
Ok.
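
A small Python sketch of how the proposed parameters could nest when an ECMP nexthop is itself a LAG. The class and field names are hypothetical, and expressing weights as percentages that sum to 100 at each level is an assumption drawn from the "percentage of traffic" wording above, not a data model from the draft.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Lag:
    lag_id: str
    # Component link ID -> percentage of the LAG's traffic.
    component_weights: dict[str, float]


@dataclass
class EcmpGroup:
    group_id: str
    # Nexthop ID -> percentage of the group's traffic.
    nexthop_weights: dict[str, float]
    # Nexthops that are themselves LAGs (nexthop ID -> Lag).
    nexthop_lags: dict[str, Lag] = field(default_factory=dict)


def weights_valid(weights: dict[str, float]) -> bool:
    """Check that the percentages at one level sum to 100."""
    return abs(sum(weights.values()) - 100.0) < 1e-6


def effective_share(group: EcmpGroup, nexthop: str,
                    component: Optional[str] = None) -> float:
    """Percentage of the group's traffic reaching a nexthop, or one LAG
    component link behind that nexthop when the nexthop is a LAG."""
    share = group.nexthop_weights[nexthop]
    lag = group.nexthop_lags.get(nexthop)
    if lag is not None and component is not None:
        share = share * lag.component_weights[component] / 100.0
    return share
```

The nesting shows why both sets of weights need adjusting together: a component link's share of the group's traffic is the product of its nexthop's weight and its own weight within the LAG.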

Regards, Benoit
>
>
>     -
>
>         For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
>         can be used.
>
>     Well, only for ethernet.
>
> Will clarify that.
>
>
>
>
>     EDITORIAL:
>     -
>     figure 2
>
>     OLD:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)    | ->     |   (R2)  |
>
>     NEW:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)     | ->     |   (R2)    |
>
>
>
> Will fix.
>
>
>     -  The indentation in section 2 is not correct
>
>
> Will fix.
>
>
>     - "For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc.,"
>     You need to expand and provide references.
>
> Will provide references.  What do you mean by expand -- just expand the 
> acronyms (already in the acronym section) or something else?
>
>     - a PBR rule
>     Expand.
>
> OK
>
>     -
>     OLD:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           |        |           |
>                        |           | ===>   |           |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)    | ->     |   (R2)  |
>                        |        (2)|--------|(2)        |
>
>     NEW:
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           |        |           |
>                        |           | ===>   |           |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)     | ->     |   (R2)    |
>                        |        (2)|--------|(2)        |
>
> Will fix.
>
>     -
>     OLD:
>     The IPFIX information model [RFC 7011]
>     NEW:
>     The IPFIX information model [RFC 7012]
>
> Will fix.
>
>     Regards, Benoit