Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Benoit Claise <bclaise@cisco.com> Sat, 08 March 2014 11:29 UTC

Message-ID: <531AF602.5070400@cisco.com>
Date: Sat, 08 Mar 2014 10:50:42 +0000
From: Benoit Claise <bclaise@cisco.com>
To: Anoop Ghanwani <anoop@alumni.duke.edu>
References: <CA+-tSzxDpD2V7Q15Jjgzz2A+d5Gn_92YQ-1_Zvx2AP=s5AWpxA@mail.gmail.com>
In-Reply-To: <CA+-tSzxDpD2V7Q15Jjgzz2A+d5Gn_92YQ-1_Zvx2AP=s5AWpxA@mail.gmail.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/opsawg/g8dDL5Wt8lkYU0DgeLl5GHw8qK8
Cc: "opsawg@ietf.org" <opsawg@ietf.org>
Subject: Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)
Precedence: list

Hi Anoop,

Please post a new draft version, and I'll review the diffs.
Some more answers in-line.

Regards, Benoit
>
> Hi Benoit,
>
> Thanks for the detailed and careful review.  Comments inline.
>
> Anoop
>
> ====
>
>
> On Tue, Feb 18, 2014 at 7:55 AM, Benoit Claise <bclaise@cisco.com> wrote:
>
>     Dear authors,
>
>     Here is my AD review of draft-ietf-opsawg-large-flow-load-balancing
>
>     - Section 1:
>         Networks extensively use link aggregation groups (LAG) [802.1AX] and
>         equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
>         scaling. For the problems addressed by this document, network traffic
>         can be predominantly categorized into two traffic types: long-lived
>         large flows and other flows.
>
>         ...
>
>         This draft describes mechanisms for optimal LAG/ECMP component link
>         utilization while using hash-based techniques. The mechanisms
>         comprise the following steps -- recognizing _large flows_ in a router;
>         and assigning the large flows to specific LAG/ECMP component links or
>         redistributing the small flows when a component link on the router is
>         congested.
>
>         It is useful to keep in mind that in typical use cases for this
>         mechanism the _large flows_ are those that consume a significant amount
>         of bandwidth on a link, e.g. greater than 5% of link bandwidth.  The
>         number of such flows would necessarily be fairly small, e.g. on the
>         order of 10's or 100's per LAG/ECMP.  In other words, the number of
>         _large flows_ is NOT expected to be on the order of millions of flows.
>         Examples of such large flows would be IPsec tunnels in service
>         provider backbone networks or storage backup traffic in data center
>         networks.
>
>     3 instances of "large flows": do you mean "long-lived large flows"?
>     If not, why do you make a distinction between long-lived large
>     flows and other flows in the first paragraph?
>     I eventually understood the source of confusion when I read the
>     terminology section:
>         Large flow(s): long-lived large flow(s)
>
>     Either use capitalized term in the Intro section (actually
>     throughout the doc.) so that we understand that the term is
>     defined somewhere, or make it clear in the intro that large
>     flow(s) = long-lived large flow(s)
>
>
> Yes, they are all referring to long-lived large flows.  We will change 
> the early part of Section 1 to clarify that long-lived large flows 
> are, thereafter in the document, referred to as large flows.
Or replace "large flow" with "long-lived flow" where it makes sense in the draft.
>
>     -
>
>         This document presents improved load distribution techniques based on
>         the large flow awareness.
>
>     Improved compared to?
>
>
> Improved compared to static hash-based distribution techniques that do 
> not account for the bandwidth of the flows.  Will reword as follows:
>
> "This document presents mechanisms for improving the load distribution 
> problem resulting from stateless hashing as seen in the above example."
ok
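
For illustration only, here is a minimal sketch of the stateless-hashing problem discussed above. It is not taken from the draft: the CRC32 hash, the flow tuples, and the rates are all assumptions chosen for the example. The point it shows is that the link choice depends only on the flow key, so the resulting per-link load ignores how much bandwidth each flow carries.

```python
import zlib

def pick_link(flow, num_links):
    """Stateless hash-based choice: depends only on the flow key, never on its rate."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_links

def link_loads(flows_mbps, num_links):
    """Sum the offered rate (Mb/s) that hashing places on each component link."""
    loads = [0.0] * num_links
    for flow, rate in flows_mbps.items():
        loads[pick_link(flow, num_links)] += rate
    return loads

# Hypothetical flows: (src, dst, protocol, sport, dport) -> rate in Mb/s
flows = {
    ("10.0.0.1", "10.0.1.1", 50, 0, 0): 4000.0,     # large flow, e.g. an IPsec tunnel
    ("10.0.0.2", "10.0.1.2", 50, 0, 0): 3500.0,     # large flow
    ("10.0.0.3", "10.0.1.3", 6, 40000, 80): 2.0,    # small flow
    ("10.0.0.4", "10.0.1.4", 6, 40001, 443): 1.5,   # small flow
}

# Nothing prevents both large flows from landing on the same component link.
print(link_loads(flows, num_links=2))
```
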
>
>     -
>     In several places, starting with the title and abstract, you speak
>     about mechanisms (plural).
>     However, looking at section 4.2, it seems that you propose a
>     single mechanism? Or maybe you consider 4.1, 4.2, 4.3 as different
>     mechanisms?
>
>
> The title of 4.2 is perhaps misleading and should just be "Operational 
> Overview."
ok
>  Otherwise the rest of the draft discusses several mechanisms 
> (multiple choices for large flow identification, and multiple choices 
> for rebalancing).
>
>     -
>
>     Step 3) On receiving the alert about the congested component link,
>         the operator, through a central management entity, finds the large
>         flows mapped to that component link and the LAG/ECMP group to which
>         the component link belongs.
>
>         Step 4) The operator can choose to rebalance the large flows on
>         lightly loaded component links of the LAG/ECMP group or redistribute
>         the small flows on the congested link to other component links of the
>         group. The operator, through a central management entity, can choose
>         one of the following actions:
>
>            1) Indicate specific large flows to rebalance;
>
>            2) Have the router decide the best large flows to rebalance;
>
>            3) Have the router redistribute all the small flows on the
>         congested link to other component links in the group.
>
>     "Indicate specific large flows to rebalance", "through a central
>     management entity", what you describe is basically traffic
>     engineering.
>     On the other hand, for 2) and 3), why do you need a central
>     management entity?
>
>
> The assumption was that the router is controlled by a central 
> management entity for the purpose of this function, but that is 
> clearly not a requirement.  The text will be modified to mention that 
> a central management entity may be used (i.e. not required).
Ok.
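
To make option 2) in Step 4 concrete (the router decides which large flow to rebalance), here is one possible sketch. The draft explicitly leaves placement algorithms out of scope, so the greedy rule below, the function name, and the data layout are assumptions for illustration, not the authors' method.

```python
def rebalance_one(link_flows, link_speed):
    """Move one large flow from the most loaded link to the least loaded one.

    link_flows: {link_id: {flow_id: rate}}; rates and link_speed share one unit.
    Returns (flow_id, src_link, dst_link), or None if no move helps.
    """
    load = {link: sum(flows.values()) for link, flows in link_flows.items()}
    src = max(load, key=load.get)
    dst = min(load, key=load.get)
    # Try the largest flows first. Accept a move only if the destination
    # stays within its speed and ends up less loaded than the source was.
    for flow, rate in sorted(link_flows[src].items(), key=lambda kv: -kv[1]):
        new_dst_load = load[dst] + rate
        if new_dst_load <= link_speed and new_dst_load < load[src]:
            link_flows[dst][flow] = link_flows[src].pop(flow)
            return flow, src, dst
    return None

# Hypothetical 2-link LAG, rates in Gb/s on 10 Gb/s component links.
links = {1: {"f1": 6.0, "f2": 3.0}, 2: {"f3": 1.0}}
print(rebalance_one(links, link_speed=10.0))
```

A central management entity could run the same logic off-box and push the result as option 1); the sketch does not depend on where it executes.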
>
>     -
>
>         A number of routers support sampling techniques such as sFlow [sFlow-
>         v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
>         For the purpose of large flow identification, sampling must be
>         enabled on all of the egress ports in the router where such
>         measurements are desired.
>
>     I don't understand the second sentence.
>     One way to read this is:  sampling must be _enabled_ on all of the
>     egress ports where such measurements are desired.
>         Ok, this is an obvious statement. If the measurements are
>     desired, enable them
>
>
> Yes,
OK, please clarify the text.
>
>     Or maybe you want to say: _sampling_ must be enabled on all of the
>     egress ports where such measurements are desired.
>         This is a false statement: if you have the choice between
>     sampled and unsampled measurements, use the unsampled ones.
>     Or maybe you want to say: sampling must be enabled on _all_ of the
>     egress ports where such measurements are desired.
>         This is a false statement: if I have ECMP on 2 links, and only
>     one of them cannot do unsampled measurements, then we should not
>         force sampling on both links.
>     You see, I'm confused.
>
>     You miss a couple of key messages:
>     - if unsampled measurements are available, use those.
>     - egress means where LAG/ECMP are enabled (this is important for
>     the paragraph starting with "If egress sampling is not available,
>     ingress sampling can suffice since the central management entity use")
>
>
> We were not intending to discuss a mix of sampling and non-sampling 
> interfaces in the same router, but this is a reasonable point and it 
> will be clarified (i.e. we will state that it's possible to mix 
> sampled and non-sampled interfaces as long as the function of large 
> flow detection/identification can be performed).
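
As a rough sketch of how sampled and unsampled interfaces could feed the same large flow identification: scale the sampled byte count by the sampling rate, and treat an unsampled interface as a sampling rate of 1. The function names and sample values are assumptions; the 5% default is the example figure from Section 1 of the draft.

```python
def estimate_rate_bps(sampled_bytes, sampling_rate, interval_s):
    """Scale bytes seen in 1-in-N packet samples up to an estimated bit rate.

    Use sampling_rate=1 for an interface that reports unsampled counts.
    """
    return sampled_bytes * sampling_rate * 8 / interval_s

def large_flows(samples, sampling_rate, interval_s, link_speed_bps, fraction=0.05):
    """Return {flow: estimated_bps} for flows above `fraction` of the link speed."""
    threshold = fraction * link_speed_bps
    result = {}
    for flow, sampled_bytes in samples.items():
        rate = estimate_rate_bps(sampled_bytes, sampling_rate, interval_s)
        if rate > threshold:
            result[flow] = rate
    return result

# Hypothetical 10 s window of 1-in-1000 samples on a 10 Gb/s egress link.
samples = {"flowA": 1_500_000, "flowB": 3_000}
print(large_flows(samples, sampling_rate=1000, interval_s=10,
                  link_speed_bps=10_000_000_000))
```

With these assumed numbers flowA is estimated at 1.2 Gb/s, above the 0.5 Gb/s threshold, while flowB is not. The estimate carries sampling error, which is one reason to prefer unsampled measurements where they are available.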
>
>
>     -
>
>         If egress sampling is not available, ingress sampling can suffice
>         since the central management entity used by the sampling technique
>         typically has multi-node visibility and can use the samples from an
>         immediately downstream node to make measurements for egress traffic
>         at the local node.
>
>     It's not clear if "ingress" means the ingress interface of the
>     router itself, or the ingress interface of the downstream router.
>     A drawing is required.
>     Both options are possible:
>     1. ingress interfaces on the router where LAG/ECMP is initiated
>             flow monitoring must be enabled on all ingress interfaces
>             flow monitoring must have a way to know the egress interfaces
>     2. ingress interfaces of the downstream router
>             only works for LAG or ECMP single hop
>             ingress interfaces = all components from LAG/ECMP
>             (multiple ifIndex, typically)
>
>
> What we meant here was that ingress sampling would have to be enabled 
> on the downstream device (hence the central management entity must 
> come into play to identify large flows).
I still believe that a drawing would clarify things.
>
>
>     this entire section 4.3.3 needs some improvements
>
>
>     -
>     On one side, you wrote "Specific algorithms for placement of large
>     flows are out of scope of this document.". On the other side, "The
>     following parameters are required for the configuration of _this_
>     feature". It seems contradictory.
>     It's unclear why you need the following parameter:
>
>           .  Imbalance threshold: the difference between the utilization of
>              the least utilized and most utilized component links.  Expressed
>              as a percentage of link speed.
>
>     Also, does ECMP/LAG always require equivalent link speed for their components?
>
> The imbalance threshold is a measure of how much imbalance one is 
> willing to tolerate before taking the hit of potential packet 
> reordering in some flows.  Will clarify.
>
> Thanks for catching the issue with link speed.  While in most cases 
> speeds are consistent, there may be the case of composite links which 
> combine links of different speeds (actually permitted by 802.1AX), so 
> we will provide a generalized formula for the imbalance threshold 
> which takes into account the individual speeds of each of the 
> component links.
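
The generalized formula is still to be written by the authors, so the following is only one plausible reading, stated as an assumption: normalize each component link's rate by its own speed and compare utilizations rather than absolute rates. Names and numbers are hypothetical.

```python
def imbalance_pct(links):
    """links: {link_id: (rate_bps, speed_bps)}.

    Returns the gap between the most and least utilized component links,
    in percentage points, with each link normalized by its own speed.
    """
    utilization = [100.0 * rate / speed for rate, speed in links.values()]
    return max(utilization) - min(utilization)

def needs_rebalance(links, threshold_pct):
    """True when the imbalance exceeds what the operator is willing to tolerate."""
    return imbalance_pct(links) > threshold_pct

# Hypothetical composite link: one 10 Gb/s member and one 1 Gb/s member.
links = {1: (8e9, 10e9), 2: (0.3e9, 1e9)}
print(imbalance_pct(links), needs_rebalance(links, threshold_pct=20.0))
```
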
>
>       
>     -
>     5.2. System Configuration and Identification Parameters
>
>           .  IP address: The IP address of a specific router that the
>              feature is being configured on, or that the large flow placement
>              is being applied to.
>
>           .  LAG ID: Identifies the LAG. The LAG ID may be required when
>              configuring this feature (to apply a specific set of large flow
>              identification parameters to the LAG) and will be required when
>              specifying flow placement to achieve the desired rebalancing.
>
>           .  Component Link ID: Identifies the component link within a LAG.
>              This is required when specifying flow placement to achieve the
>              desired rebalancing.
>
>     Nothing regarding ECMP?
>
>
> Initially we were more focused on getting this done for LAG, but then 
> we completely overlooked ECMP.  5.2, 5.3, and 5.4 would probably 
> benefit from a bit of clean-up as follows:
>
> Add the following to 5.2:
>
> ECMP group: Identifies a particular ECMP group.
>
> ECMP nexthop: Identifies a particular nexthop within an ECMP group.
>
> Add the following line to the end of section 5.3.
>
> When using ECMP, the nexthop within an ECMP group is used to identify 
> the component link for placing the large flow.
>
> Add the following to the end of Section 5.4.
>
> When using ECMP, the ECMP group and the corresponding Nexthops along 
> with the percentage of traffic to be assigned to each Nexthop are 
> required.  Finally it is also possible that an ECMP Nexthop itself 
> comprises a LAG in which case both the Nexthop and the LAG Component 
> ID would need to be specified, and the weights of both the Nexthops 
> within the ECMP Group and the Component Links within the LAG would 
> need to be adjusted.
Ok.
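
To illustrate the nested case (an ECMP Nexthop that is itself a LAG), here is a small sketch of how the two levels of weights combine into an effective share per component link. The data layout and identifiers are assumptions made for the example; the draft does not define them.

```python
def effective_shares(ecmp_group):
    """ecmp_group: {nexthop: (nexthop_weight, {component_link: link_weight})}.

    Returns {(nexthop, component_link): fraction of the ECMP group's traffic}.
    A Nexthop that is a plain link is modeled as a LAG with one component.
    """
    total_nexthop_weight = sum(weight for weight, _ in ecmp_group.values())
    shares = {}
    for nexthop, (nexthop_weight, lag) in ecmp_group.items():
        total_link_weight = sum(lag.values())
        for link, link_weight in lag.items():
            shares[(nexthop, link)] = (
                (nexthop_weight / total_nexthop_weight)
                * (link_weight / total_link_weight)
            )
    return shares

# Hypothetical group: nh1 is a 2-member LAG, nh2 is a single link.
group = {"nh1": (60, {"c1": 50, "c2": 50}), "nh2": (40, {"c1": 100})}
for key, share in effective_shares(group).items():
    print(key, round(share, 3))
```
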

Regards, Benoit
>
>
>     -
>
>         For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
>         can be used.
>
>     Well, only for ethernet.
>
> Will clarify that.
>
>
>
>
>     EDITORIAL:
>     -
>     figure 2
>
>     OLD:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)    | ->     |   (R2)  |
>
>     NEW:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)     | ->     |   (R2)    |
>
>
>
> Will fix.
>
>
>     -  The indentation in section 2 is not correct
>
>
> Will fix.
>
>
>     - "For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc.,"
>     You need to expand and provide references.
>
> Will provide references.  What do you mean by expand -- just expand the 
> acronyms (already in the acronym section) or something else?
>
>     - a PBR rule
>     Expand.
>
> OK
>
>     -
>     OLD:
>
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           |        |           |
>                        |           | ===>   |           |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)    | ->     |   (R2)  |
>                        |        (2)|--------|(2)        |
>
>     NEW:
>                        +-----------+ ->     +-----------+
>                        |           | ->     |           |
>                        |           | ===>   |           |
>                        |        (1)|--------|(1)        |
>                        |           |        |           |
>                        |           | ===>   |           |
>                        |           | ->     |           |
>                        |           | ->     |           |
>                        |  (R1)     | ->     |   (R2)    |
>                        |        (2)|--------|(2)        |
>
> Will fix.
>
>     -
>     OLD:
>     The IPFIX information model [RFC 7011]
>     NEW:
>     The IPFIX information model [RFC 7012]
>
> Will fix.
>
>     Regards, Benoit
>
>
>
>
>
