Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

ramki Krishnan <ramk@Brocade.com> Sun, 06 April 2014 23:42 UTC

From: ramki Krishnan <ramk@Brocade.com>
To: Benoit Claise <bclaise@cisco.com>, Anoop Ghanwani <anoop@alumni.duke.edu>
Date: Sun, 06 Apr 2014 16:42:33 -0700
Archived-At: http://mailarchive.ietf.org/arch/msg/opsawg/F5JVQ9UN-J5VxX9IDhWP8E4WLEE
Cc: "opsawg@ietf.org" <opsawg@ietf.org>
Subject: Re: [OPSAWG] AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Hi Benoit,

Thanks for your patience. We have addressed all your comments in the latest version. Additionally, we have made some editorial changes.

http://datatracker.ietf.org/doc/draft-ietf-opsawg-large-flow-load-balancing/

Thanks,
Ramki

From: Benoit Claise [mailto:bclaise@cisco.com]
Sent: Friday, March 28, 2014 4:24 AM
To: Anoop Ghanwani
Cc: ramki Krishnan; opsawg@ietf.org
Subject: Re: AD review of draft-ietf-opsawg-large-flow-load-balancing (draft response)

Hi Anoop, Ramki,

A gentle reminder.

Regards, Benoit
Hi Anoop,

Please post a new draft version, and I'll review the diffs.
Some more answers in-line.

Regards, Benoit

Hi Benoit,

Thanks for the detailed and careful review.  Comments inline.

Anoop

====

On Tue, Feb 18, 2014 at 7:55 AM, Benoit Claise <bclaise@cisco.com> wrote:

Dear authors,

Here is my AD review of draft-ietf-opsawg-large-flow-load-balancing

- Section 1:

   Networks extensively use link aggregation groups (LAG) [802.1AX] and
   equal cost multi-paths (ECMP) [RFC 2991] as techniques for capacity
   scaling. For the problems addressed by this document, network traffic
   can be predominantly categorized into two traffic types: long-lived
   large flows and other flows.

   ...

   This draft describes mechanisms for optimal LAG/ECMP component link
   utilization while using hash-based techniques. The mechanisms
   comprise the following steps -- recognizing large flows in a router;
   and assigning the large flows to specific LAG/ECMP component links or
   redistributing the small flows when a component link on the router is
   congested.

   It is useful to keep in mind that in typical use cases for this
   mechanism the large flows are those that consume a significant amount
   of bandwidth on a link, e.g. greater than 5% of link bandwidth.  The
   number of such flows would necessarily be fairly small, e.g. on the
   order of 10's or 100's per LAG/ECMP.  In other words, the number of
   large flows is NOT expected to be on the order of millions of flows.
   Examples of such large flows would be IPsec tunnels in service
   provider backbone networks or storage backup traffic in data center
   networks.

3 instances of "large flows": do you mean "long-lived large flows"?
If not, why do you make a distinction between long-lived large flows and other flows in the first paragraph?
I eventually understood the source of confusion when I read the terminology section:
    Large flow(s): long-lived large flow(s)

Either use capitalized term in the Intro section (actually throughout the doc.) so that we understand that the term is defined somewhere, or make it clear in the intro that large flow(s) = long-lived large flow(s)

Yes, they are all referring to long-lived large flows.  We will change the early part of Section 1 to clarify that long-lived large flows are, thereafter in the document, referred to as large flows.
Or replace large flow by long-lived flow where it makes sense in the draft.


-

   This document presents improved load distribution techniques based on
   the large flow awareness.
Improved compared to?

Improved compared to static hash-based distribution techniques that do not account for the bandwidth of the flows.  Will reword as follows:

"This document presents mechanisms for improving the load distribution problem resulting from stateless hashing as seen in the above example."
ok
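
To make the stateless hashing problem concrete, here is a minimal Python sketch. The function names, the use of CRC32 as the hash, and the string flow identifiers are illustrative assumptions and are not taken from the draft. Because the hash ignores bandwidth, large flows that happen to land on the same component link can overload it while other links stay lightly loaded.

```python
import zlib


def component_link(flow_id: str, num_links: int) -> int:
    """Stateless hash: map a flow to a component link, ignoring its bandwidth.

    CRC32 is only a stand-in here; real routers use their own hash functions.
    """
    return zlib.crc32(flow_id.encode()) % num_links


def link_loads(flows: dict[str, float], num_links: int) -> list[float]:
    """Sum the offered load (e.g. in Mb/s) that the hash places on each link."""
    loads = [0.0] * num_links
    for flow_id, rate in flows.items():
        loads[component_link(flow_id, num_links)] += rate
    return loads
```

Calling link_loads() on a mix of a few high-rate flows and many low-rate flows shows how uneven the per-link totals can become, since nothing in the mapping accounts for flow size.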


-
In several places, starting with the title and abstract, you speak about mechanisms (plural).
However, looking at section 4.2, it seems that you propose a single mechanism? Or maybe you consider 4.1, 4.2, 4.3 as different mechanisms?

The title of 4.2 is perhaps misleading and should just be "Operational Overview."
ok

 Otherwise the rest of the draft discusses several mechanisms (multiple choices for large flow identification, and multiple choices for rebalancing).

-

   Step 3) On receiving the alert about the congested component link,
   the operator, through a central management entity, finds the large
   flows mapped to that component link and the LAG/ECMP group to which
   the component link belongs.

   Step 4) The operator can choose to rebalance the large flows on
   lightly loaded component links of the LAG/ECMP group or redistribute
   the small flows on the congested link to other component links of the
   group. The operator, through a central management entity, can choose
   one of the following actions:

      1) Indicate specific large flows to rebalance;

      2) Have the router decide the best large flows to rebalance;

      3) Have the router redistribute all the small flows on the
         congested link to other component links in the group.


"Indicate specific large flows to rebalance", "through a central management entity", what you describe is basically traffic engineering.
On the other hand, for 2) and 3), why do you need a central management entity?

The assumption was that the router is controlled by a central management entity for the purpose of this function, but that is clearly not a requirement.  The text will be modified to mention that a central management entity may be used (i.e. not required).
Ok.
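
As an illustration of action 2) above (the router choosing which large flow to rebalance), here is a rough Python sketch. The draft states that specific placement algorithms are out of scope, so the policy below (move the single largest flow from the congested link to the least loaded other link, only if that helps) is purely an assumption for illustration, as are all names.

```python
def link_load(placement: dict, rates: dict, link) -> float:
    """Total rate of the large flows currently placed on `link`."""
    return sum(rates[f] for f, l in placement.items() if l == link)


def move_largest_flow(placement: dict, rates: dict, links: list, congested) -> dict:
    """Return a new flow -> link placement after at most one move.

    placement: large flow -> component link
    rates:     large flow -> measured rate (same unit for all flows)
    """
    new_placement = dict(placement)
    on_congested = [f for f, l in placement.items() if l == congested]
    others = [l for l in links if l != congested]
    if not on_congested or not others:
        return new_placement

    flow = max(on_congested, key=lambda f: rates[f])
    target = min(others, key=lambda l: link_load(placement, rates, l))

    # Move only if the target would end up less loaded than the congested
    # link is now; otherwise the move just shifts the hot spot.
    if link_load(placement, rates, target) + rates[flow] < link_load(placement, rates, congested):
        new_placement[flow] = target
    return new_placement
```

Nothing in this sketch needs a central management entity; the same logic could run on the router or on an external controller.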


-


   A number of routers support sampling techniques such as sFlow
   [sFlow-v5, sFlow-LAG], PSAMP [RFC 5475] and NetFlow Sampling [RFC 3954].
   For the purpose of large flow identification, sampling must be
   enabled on all of the egress ports in the router where such
   measurements are desired.
I don't understand the second sentence.
One way to read this is:  sampling must be enabled on all of the egress ports where such measurements are desired.
    Ok, this is an obvious statement. If the measurements are desired, enable them

Yes,
ok please clarify the text.


Or maybe you want to say: sampling must be enabled on all of the egress ports where such measurements are desired.
    This is a false statement: if you have the choice between sampling and non sampling, use non sampling measurements.
Or maybe you want to say: sampling must be enabled on all of the egress ports where such measurements are desired.
    This is a false statement: if I have ECMP on 2 links, and only one of them can't do non sampling, then we should not force
    sampling on both links.
You see, I'm confused.

You miss a couple of key messages:
- if unsampled measurements are available, use those.
- egress means where LAG/ECMP are enabled (this is important for the paragraph starting with "If egress sampling is not available, ingress sampling can suffice since the central management entity use")

We were not intending to discuss a mix of sampled and non-sampled interfaces in the same router, but this is a reasonable point and it will be clarified (i.e. we will state that it's possible to mix sampled and non-sampled interfaces as long as the function of large flow detection/identification can be performed).
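
For readers following the sampling discussion, here is a minimal Python sketch of how large flows might be identified from packet samples on one egress port. It assumes simple 1-in-N packet sampling and scales sampled bytes by N; the function name, the parameters, and the 5% default (borrowed from the example in Section 1) are illustrative assumptions. With sampling_rate=1 the same code covers the unsampled case.

```python
from collections import defaultdict


def identify_large_flows(samples, sampling_rate, interval_s, link_bps, fraction=0.05):
    """Estimate per-flow rates from samples and return flows above the threshold.

    samples:       iterable of (flow_key, packet_bytes) seen during the interval
    sampling_rate: N for 1-in-N packet sampling (1 means unsampled)
    interval_s:    length of the observation interval in seconds
    link_bps:      speed of the egress link in bits per second
    """
    sampled_bytes = defaultdict(int)
    for flow_key, nbytes in samples:
        sampled_bytes[flow_key] += nbytes

    large = {}
    for flow_key, nbytes in sampled_bytes.items():
        # Scale sampled bytes up by N, then convert to bits per second.
        est_bps = nbytes * sampling_rate * 8 / interval_s
        if est_bps > fraction * link_bps:
            large[flow_key] = est_bps
    return large
```

This only looks at a single interval; deciding that a flow is long-lived would additionally require it to stay above the threshold over several consecutive intervals.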


-

   If egress sampling is not available, ingress sampling can suffice
   since the central management entity used by the sampling technique
   typically has multi-node visibility and can use the samples from an
   immediately downstream node to make measurements for egress traffic
   at the local node.


It's not clear if "ingress" means the ingress interface of the router itself, or the ingress interface of the downstream router.
A drawing is required.
Both options are possible:
    1. ingress interfaces on the router where LAG/ECMP is initiated
        flow monitoring must be enabled on all ingress interfaces
        flow monitoring must have a way to know the egress interfaces
    2. ingress interfaces of the downstream router
        only work for LAG or ECMP single hop
        ingress interfaces = all components from LAG/ECMP (multiple ifIndex, typically)

What we meant here was that ingress sampling would have to be enabled on the downstream device (hence the central management entity must come into play to identify large flows).
I still believe that a drawing would clarify things.



this entire section 4.3.3 needs some improvements


-
On one side, you wrote "Specific algorithms for placement of large flows are out of scope of this document.". On the other side, "The following parameters are required the configuration of this feature". It seems contradictory.
It's unclear why you need the following parameter:

     .  Imbalance threshold: the difference between the utilization of
        the least utilized and most utilized component links.  Expressed
        as a percentage of link speed.



Also, does ECMP/LAG always require equivalent link speed for their components?
The imbalance threshold is a measure of how much imbalance one is willing to tolerate before taking the hit of potential packet reordering in some flows.  Will clarify.

Thanks for catching the issue with link speed.  While in most cases speeds are consistent, there may be the case of composite links which combine links of different speeds (actually permitted by 802.1AX), so we will provide a generalized formula for the imbalance threshold which takes into account the individual speeds of each of the component links.
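
The generalized formula itself is not given in this thread. One plausible reading, sketched below in Python as an assumption only, is to normalize each component link's load by its own speed and compare the resulting utilizations. The function name and arguments are hypothetical.

```python
def imbalance_percent(load_bps, speed_bps):
    """Difference between the most and least utilized component links.

    Each link's utilization is expressed as a percentage of that link's own
    speed, so links of different speeds can be compared.
    """
    if len(load_bps) != len(speed_bps) or not load_bps:
        raise ValueError("need one load and one speed per component link")
    utilization = [100.0 * load / speed for load, speed in zip(load_bps, speed_bps)]
    return max(utilization) - min(utilization)
```

Rebalancing would then be triggered only when this value exceeds the configured imbalance threshold, which is the trade-off against packet reordering mentioned above.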



-

5.2. System Configuration and Identification Parameters

     .  IP address: The IP address of a specific router that the
        feature is being configured on, or that the large flow placement
        is being applied to.

     .  LAG ID: Identifies the LAG. The LAG ID may be required when
        configuring this feature (to apply a specific set of large flow
        identification parameters to the LAG) and will be required when
        specifying flow placement to achieve the desired rebalancing.

     .  Component Link ID: Identifies the component link within a LAG.
        This is required when specifying flow placement to achieve the
        desired rebalancing.


Nothing regarding ECMP?

Initially we were more focused on getting this done for LAG, but then we completely overlooked ECMP.  5.2, 5.3, and 5.4 would probably benefit from a bit of clean-up as follows:

Add the following to 5.2:

ECMP group: Identifies a particular ECMP group.

ECMP nexthop: Identifies a particular nexthop within an ECMP group.

Add the following line to the end of section 5.3.

When using ECMP, the nexthop within an ECMP group is used to identify the component link for placing the large flow.

Add the following to the end of Section 5.4.

When using ECMP, the ECMP group and the corresponding Nexthops, along with the percentage of traffic to be assigned to each Nexthop, are required.  Finally, it is also possible that an ECMP Nexthop itself comprises a LAG, in which case both the Nexthop and the LAG Component ID would need to be specified, and the weights of both the Nexthops within the ECMP Group and the Component Links within the LAG would need to be adjusted.
Ok.
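
To visualize the proposed additions to 5.2 through 5.4, here is a small Python sketch of how the ECMP group, nexthop, and nested LAG weights could be represented. The class and field names are hypothetical and are not part of the draft or of any management model; the only thing taken from the text above is that an ECMP nexthop may itself be a LAG with per-component-link weights.

```python
from dataclasses import dataclass, field


@dataclass
class LagComponent:
    component_link_id: int
    weight_percent: float  # share of the LAG's traffic on this component link


@dataclass
class EcmpNexthop:
    nexthop: str           # e.g. the next-hop address
    weight_percent: float  # share of the ECMP group's traffic on this nexthop
    lag_components: list[LagComponent] = field(default_factory=list)


@dataclass
class EcmpGroup:
    group_id: str
    nexthops: list[EcmpNexthop]

    def is_consistent(self) -> bool:
        """Weights must total 100% per ECMP group and per nested LAG."""
        def sums_to_100(weights):
            return abs(sum(weights) - 100.0) < 1e-6

        if not sums_to_100([n.weight_percent for n in self.nexthops]):
            return False
        return all(
            sums_to_100([c.weight_percent for c in n.lag_components])
            for n in self.nexthops
            if n.lag_components
        )
```

A nexthop with an empty lag_components list is treated as a plain (non-LAG) nexthop.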

Regards, Benoit



-

   For high speed links, the etherStatsHighCapacityTable MIB [RFC 3273]
   can be used.



Well, only for ethernet.
Will clarify that.



EDITORIAL:
-
figure 2

OLD:

                  +-----------+ ->     +-----------+
                  |           | ->     |           |
                  |           | ===>   |           |
                  |        (1)|--------|(1)        |
                  |           | ->     |           |
                  |           | ->     |           |
                  |  (R1)    | ->     |   (R2)  |

NEW:

                  +-----------+ ->     +-----------+
                  |           | ->     |           |
                  |           | ===>   |           |
                  |        (1)|--------|(1)        |
                  |           | ->     |           |
                  |           | ->     |           |
                  |  (R1)     | ->     |   (R2)    |

Will fix.


-  The indentation in section 2 is not correct

Will fix.

- "For tunneling protocols like GRE, VXLAN, NVGRE, STT, etc.,"
You need to expand and provide references.
Will provide references.  What do you mean by expand -- just expand the acronyms (already in the acronym section) or something else?

- a PBR rule

Expand.
OK
-
OLD:

                  +-----------+ ->     +-----------+
                  |           | ->     |           |
                  |           | ===>   |           |
                  |        (1)|--------|(1)        |
                  |           |        |           |
                  |           | ===>   |           |
                  |           | ->     |           |
                  |           | ->     |           |
                  |  (R1)    | ->     |   (R2)  |
                  |        (2)|--------|(2)        |

NEW:

                  +-----------+ ->     +-----------+
                  |           | ->     |           |
                  |           | ===>   |           |
                  |        (1)|--------|(1)        |
                  |           |        |           |
                  |           | ===>   |           |
                  |           | ->     |           |
                  |           | ->     |           |
                  |  (R1)     | ->     |   (R2)    |
                  |        (2)|--------|(2)        |


Will fix.

-

OLD:

The IPFIX information model [RFC 7011]

NEW:

The IPFIX information model [RFC 7012]
Will fix.
Regards, Benoit