Re: 6MAN WG Last Call: draft-ietf-6man-flow-ecmp-00.txt

Thomas Narten <narten@us.ibm.com> Mon, 17 January 2011 16:24 UTC

Message-Id: <201101171626.p0HGQTvU002982@cichlid.raleigh.ibm.com>
To: Brian Haberman <brian@innovationslab.net>
Subject: Re: 6MAN WG Last Call: draft-ietf-6man-flow-ecmp-00.txt
In-reply-to: <4D272FAC.70104@innovationslab.net>
References: <4D272FAC.70104@innovationslab.net>
Comments: In-reply-to Brian Haberman <brian@innovationslab.net> message dated "Fri, 07 Jan 2011 10:22:20 -0500."
Date: Mon, 17 Jan 2011 11:26:29 -0500
From: Thomas Narten <narten@us.ibm.com>
Cc: Bob Hinden <bob.hinden@gmail.com>, ipv6@ietf.org
Precedence: list

Here is my review of the ECMP document. I don't think it is quite
ready for advancement yet. I do generally support this document,
though there are some clarifications I need to understand.

The review is rather lengthy, and is mostly editorial but there is
some stuff that may turn out to be substantive.

>    When several network paths between the same two nodes are known by
>    the routing system to be equally good (in terms of capacity and
>    latency), it may be desirable to share traffic among them.  Two
>    such

The document seems to focus on "equal cost" paths, but I assume that
the applicability is really when you have multiple paths that you want
to distribute traffic across. They may be equal cost, but they may not
be. For instance, if I have two links, one with double the capacity of
the other, I may want to distribute the traffic in a weighted fashion
(i.e., 1/3 vs. 2/3). It would be good to make that clear.

I.e., rather than say "equal distribution", make clear that other
types (weighted, etc.) are also covered.

Later in the document, the term "load distribution" is used, which is
I think closer to the mark. (This distinction should be clarified
throughout.)
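
To make the weighted case concrete, here is a minimal sketch in Python
of hash-based selection across paths of unequal capacity. The function
names, the use of SHA-256 as the hash, and the documentation-prefix
addresses are all assumptions chosen for illustration; none of them
come from the draft.

```python
import hashlib
from typing import Sequence


def flow_hash(src: str, dst: str) -> int:
    """Deterministic 32-bit hash of flow-invariant fields (illustrative)."""
    digest = hashlib.sha256(f"{src}|{dst}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def pick_weighted(h: int, weights: Sequence[int]) -> int:
    """Map a hash value onto a path index in proportion to integer weights."""
    slot = h % sum(weights)
    for index, weight in enumerate(weights):
        if slot < weight:
            return index
        slot -= weight
    raise AssertionError("unreachable: slot is always below sum(weights)")


# Two links, the second with double the capacity of the first.
weights = [1, 2]
counts = [0, 0]
for i in range(30000):
    h = flow_hash(f"2001:db8::{i:x}", "2001:db8:ffff::1")
    counts[pick_weighted(h, weights)] += 1
print(counts)  # roughly one third of the flows on path 0, two thirds on path 1
```

With all weights equal, this reduces to the plain modulo(N) selection,
so equal-cost sharing is just a special case of load distribution.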

>    o  Work-conserving method (no idle time when queue is non-empty).

Do we really need to use the term "work conserving?" (I had to go look
it up). I don't think it's particularly helpful to use that term in
this document.

>    One lightweight approach to ECMP or LAG is this: if there are N
>    equally good paths to choose from, then form a modulo(N) hash
>    [RFC2991] from a consistent set of fields in each packet header, and
>    use the resulting value to select a particular path.  If the hash

Make clear that "consistent set of fields" means that the contents of
those fields are invariant for packets of the same flow.
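
A small sketch of what "invariant" means in practice, assuming a
hypothetical Packet type and SHA-256 as a stand-in hash (real routers
would use something far cheaper). Only fields that stay constant for
the life of the flow feed the hash; fields such as the hop limit or the
payload length are deliberately left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    flow_label: int
    hop_limit: int    # decremented at every hop, so not flow-invariant
    payload_len: int  # differs from packet to packet within a flow


def select_path(pkt: Packet, n_paths: int) -> int:
    """Modulo-N selection over fields that are constant for the whole flow."""
    key = f"{pkt.src}|{pkt.dst}|{pkt.flow_label}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return h % n_paths


first = Packet("2001:db8::1", "2001:db8::2", 0x12345, hop_limit=64, payload_len=1200)
later = Packet("2001:db8::1", "2001:db8::2", 0x12345, hop_limit=57, payload_len=40)
print(select_path(first, 4) == select_path(later, 4))  # True: same flow, same path
```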

>    use the resulting value to select a particular path.  If the hash
>    function is chosen so that the hash values have a uniform statistical
>    distribution, this method will share traffic roughly equally between

For "hash values", I assume this means the "output of the hash" (not
the inputs). Perhaps make that more clear? (And it's probably better
to say "output" than "value" generally.)

   The question with such a method, then, is which IP header fields to
   include to identify a flow.  A minimal choice in the routing system
   is simply to use a hash of the source and destination IP addresses,
   i.e., the 2-tuple.  This is necessary and sufficient to avoid out-of-
   order delivery, and with a wide variety of sources and destinations,
   as one finds in the core of the network, sometimes sufficient to
   achieve work-conserving load sharing.
   
I'd suggest starting this paragraph as a new section, entitled
something like "Choice of IP Header Fields for Hash Input".

Also, this new section might benefit from some reorganization. Most of
the info here is good, but the flow could be better.

For example, after the above paragraph would be a good time to go into
more detail about why using just the src/dst is insufficient in
practice. The mailing list discussion talked a lot about tunneled
protocols being the real problem (and how tunneling is
increasing). Real examples of how in practice this is inadequate would
strengthen the motivation for this document. You talk about this
further down, but saying it here (all in one place) might be good.

Then move on to using 5 tuples, and again describe the limitations and
problems. I think it is critical in this part to justify why the
existing practice as used in IPv4 today is not sufficient for
IPv6. I.e., why redefining the Flow Label is *necessary* rather than
merely *nice to have*.

   However, including transport layer information as input keys to a
   hash may be a problem for IPv4 fragments [RFC2991].  In addition,

Yes, but is that relevant to IPv6?  (If so, say so, otherwise, the
point is presumably irrelevant.)

>    hash may be a problem for IPv4 fragments [RFC2991].  In addition,
>    protocol and destination port numbers in the hash will not only make
>    the hash slightly more expensive to compute, but will not
>    particularly improve the hash distribution, due to the prevalence of
>    well known port numbers and popular protocol numbers.  Ephemeral
>    ports, on the other hand, are quite well distributed [Lee10].  In the

I disagree with the latter part of the above. It seems to me that
including port numbers (absent tunneling) will certainly improve the
hash distribution. It is not a requirement that the input to the hash
be uniformly distributed. A good hash will produce a good output
distribution, even if the input is skewed. You seem to be making the
assumption that if the ports are not distributed, the outputs of the
hash won't be distributed either. I do not believe that is true
generally.

   well known port numbers and popular protocol numbers.  Ephemeral
   ports, on the other hand, are quite well distributed [Lee10].  In the

And since most flows use one well-known port and one ephemeral port,
using both ports as input would seem to give you good
properties. Right? In which case the suggestion that using ports isn't
good isn't true.
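
The point can be checked with a rough experiment. The sketch below is
an illustration under stated assumptions: SHA-256 stands in for "a good
hash", the addresses are documentation prefixes, and the source ports
are simply sequential ephemeral values, i.e., heavily structured input.
The destination port and protocol are the same for every flow.

```python
import hashlib
from collections import Counter

N_PATHS = 4


def path_for(*fields) -> int:
    """Hash the given header fields and reduce modulo the number of paths."""
    key = "|".join(str(f) for f in fields).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % N_PATHS


SRC, DST = "2001:db8::1", "2001:db8::2"
PROTO_TCP, HTTPS = 6, 443

two_tuple = Counter()
five_tuple = Counter()
for sport in range(49152, 49152 + 4000):  # sequential ephemeral ports
    two_tuple[path_for(SRC, DST)] += 1
    five_tuple[path_for(SRC, DST, PROTO_TCP, sport, HTTPS)] += 1

print(sorted(two_tuple.items()))   # every flow lands on a single path
print(sorted(five_tuple.items()))  # flows spread across all four paths
```

Even though one port is well known and the other merely increments,
the hash output is spread evenly once the ports are part of the input.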

   In the
   case of IPv6, protocol numbers are particularly inconvenient due to
   the variable placement of and variable length of next-headers.  In

How many of these next headers exist in practice today?  I.e., is this
a real problem, or a theoretical problem?

   the variable placement of and variable length of next-headers.  In
   addition, [RFC2460] recommends that all next-headers, except hop-by-
   hop options, should not be inspected by intermediate nodes in the
   network, presumably to make introduction of new next-headers more
   straightforward.

Which text in 2460 is being referred to here? I'm not sure I agree
with the last part of the sentence.

   The situation is different in tunneled scenarios.  Identifying a
   flow inside the tunnel is more complicated, particularly because
   nearly all hardware can only identify flows based on information
   contained in the outermost IP header.  Assume that traffic from
   many sources to many destinations is aggregated in a single
   IP-in-IP tunnel from

True for IP-in-IP, but what about GRE with keys?

   achieved.  If there is much tunnel traffic, this will result in a
   high probability of congestion on one of the paths between R1 and R2.

Not necessarily true. Better to just say traffic won't be distributed
as intended. That may or may not result in congestion.

   Also, for IPv6, the total number of bits in the 5-tuple is quite
   large (296), as well as inconvenient to extract due to the next-
   header placement.  This may be challenging for some hardware
   implementations, raising the potential that network equipment vendors
   might sacrifice the length of the fields extracted from an IPv6

I disagree with some of the above. First, 256 bits come just from the
source/dest addresses. Another 40 bits is noise relative to this in
terms of "number of bits". I.e., saying a 5-tuple is too many bits,
but just the src/dst addresses (possibly with the Flow Label) is OK
doesn't make sense to me.  The argument about accessing the fields
seems more compelling and should be made separately.
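
For reference, the arithmetic behind these numbers, written out as a
short Python sketch. The field widths are the standard ones (128-bit
IPv6 addresses, 8-bit Next Header, 16-bit ports, 20-bit flow label);
the variable names are just for illustration.

```python
# Field widths in bits: IPv6 addresses, Next Header, and TCP/UDP ports.
FIVE_TUPLE_BITS = {
    "source address": 128,
    "destination address": 128,
    "protocol (next header)": 8,
    "source port": 16,
    "destination port": 16,
}
FLOW_LABEL_BITS = 20

address_bits = (FIVE_TUPLE_BITS["source address"]
                + FIVE_TUPLE_BITS["destination address"])
five_tuple_bits = sum(FIVE_TUPLE_BITS.values())
three_tuple_bits = address_bits + FLOW_LABEL_BITS

print(address_bits)                    # 256
print(five_tuple_bits)                 # 296, the figure quoted in the draft
print(five_tuple_bits - address_bits)  # 40
print(three_tuple_bits)                # 276
```

So the {src, dst, flow label} triple saves only 20 bits of hash input
relative to the 5-tuple, which is why the bit count alone is not a
convincing argument.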

   header.  The question therefore arises whether the 20-bit flow label
   in IPv6 packets would be suitable for use as input to an ECMP or LAG
   hash algorithm.  If it could be used in place of the port numbers and
   protocol number in the 5-tuple, the hash calculation would be
   simplified.

I think the stronger argument here is that it is more likely that a
sender could set the Flow Label to a useful value here in cases like
tunneling, where the intermediate routers in practice would not have
access to the port numbers in the inner header. The good thing is
that TEPs can do this today without requiring any changes in existing
specs.

   would it do any harm to the distribution of the hash values.  If the
   community at some stage agrees to set pseudo-random flow labels in
   the majority of traffic flows, this would add to the value of the
   hash.

I'm confused. Isn't the latter exactly what this document is
recommending???

   source to the destination(s)."  The RFC should perhaps have made
   clear that a router that has participated in flow state establishment
   can rely on properties of the resulting flow label values without
   further signaling.  If a router knows these properties, rule 2 is

Right. This is my understanding of what the words should have meant.

   In the tunneling situation sketched above, routers R1 and R2 can rely
   on the flow labels set by TEP A and TEP B being assigned by a known
   method.  This allows a safe ECMP or LAG method to be based on the
   flow label without breaching [RFC3697].

Technically, this is only true for those flows that the TEP has
set. It is not true for all flows generally, and presumably that is
what R1 and R2 would be assuming...   

Also, I do not believe that including the Flow Label bits as an input
key to ECMP violates 3697 at all. What text does this violate?

   At the time of this writing, the IETF is discussing a possible
   revision of the rules of RFC 3697 [I-D.ietf-6man-flow-update].  If
   adopted, that revision would be fully compatible with the present
   document and would obviate much of the above discussion.

This wording seems odd, since the ID cited does not carry any weight
and does not update 3697.
   
   o  The flow label in the outer packet SHOULD be set by the sending
      TEP to a pseudo-random 20-bit value in accordance with [RFC3697].

Pseudo-random is not right here.

This is a bit of a divergence, but I do not believe it is a
requirement that whoever sets the Flow Label set it to a
pseudo-random value. Please clarify *why* this needs to be a
SHOULD. Who benefits? There was email discussion about this. Routers
can't assume the values will be pseudo-random, otherwise they set
themselves up for DoS attacks.  So why is this necessary?

      as determined by the IP header fields of the inner
      packet.
      *  Note that this rule is a SHOULD rather than a MUST, to permit
         individual implementers to take an alternative approach if they
         wish to do so.  Such an alternative MUST conform to [RFC3697].

Please clarify what type of an exception you are trying to allow
for. I need to understand the thinking before agreeing to allow for
this. :-)

   o  The sending TEP MUST classify all packets into flows, once it has
      determined that they should enter a given tunnel, and then write
      the relevant flow label into the outer IPv6 header.  A user flow
      could be identified by the ingress TEP most simply by its
      {destination, source} address pair (coarse) or by its 5-tuple
      {dest addr, source addr, protocol, dest port, source port}
      (fine).

Isn't it ironic that you don't include the Flow Label here? I.e., if
the user sets the Flow Label, you don't use its value? That doesn't
seem right! :-)

      This is an implementation detail in the sending TEP.
      *  It might be possible to make this classifier stateless, by
         using a suitable 20 bit hash of the inner IP header's 2-tuple
         or 5-tuple as the pseudo-random flow label value.

Doesn't this go against the 3697 requirement not to reuse a value?

The issue I have with this is that it does mean some base assumptions
about Flow Labels no longer hold in general. This may or may not be a
problem, but we should speak to this point directly. I.e., this
specification is proposing some possibly non-backwards-compatible
changes to the Flow Label semantics.
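
To show what such a stateless classifier might look like, here is a
hedged Python sketch. The function name, the choice of SHA-256, and the
decision to avoid a zero label are my own assumptions for illustration
and are not taken from the draft.

```python
import hashlib

FLOW_LABEL_MASK = 0xFFFFF  # the flow label field is 20 bits wide


def outer_flow_label(src: str, dst: str, proto: int, sport: int, dport: int) -> int:
    """Derive a 20-bit label from the inner 5-tuple without keeping state."""
    key = f"{src}|{dst}|{proto}|{sport}|{dport}".encode()
    label = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") & FLOW_LABEL_MASK
    # Assumption: avoid 0, which RFC 3697 uses to mean "not part of any flow".
    return label or 1


a = outer_flow_label("2001:db8::1", "2001:db8::2", 6, 49152, 443)
b = outer_flow_label("2001:db8::1", "2001:db8::2", 6, 49152, 443)
print(a == b)  # True: same inner flow, same label, no per-flow table needed
```

Because the label is a pure function of the inner header, a flow that
restarts with the same 5-tuple gets the same label again, and two
distinct inner flows can map to the same 20-bit value. That is exactly
where the reuse question comes from.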

   o  At intermediate router(s) that perform load distribution of
      tunneled packets whose source address is a TEP, the hash algorithm
      used to determine the outgoing component-link in an ECMP and/or
      LAG toward the next-hop MUST minimally include the triple {dest
      addr, source addr, flow label} to meet the [RFC3697] rules.

The wording here is awkward. It seems to suggest that you only expect
routers to do this for tunneled packets? How will they know which
packets those are? That sounds inefficient and likely to require
configuration... In practice, won't they just do this for all packets?
If so, the wording above seems convoluted... or trying to make
something legal if implementations do X, knowing full well they won't
in practice.

Also, which 3697 rules are being referred to above? How is it a
violation of an existing standard to do ECMP on just the src/dst
fields?

      *  Intermediate router(s) MAY also include {protocol, dest port,
         source port} as input keys to the ECMP and/or LAG hash
         algorithms, to provide sufficient entropy in cases where the
         flow-label is currently set to zero.

Seems like this should be a SHOULD. It never hurts to include these
fields, and in many cases helps. To keep things simple, shouldn't the
default be just to use the 6 tuple (when available) as input to the
hash?
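
A sketch of that default, in Python, under the assumption that the
router can tell whether it managed to locate the transport header. The
helper names and the SHA-256 stand-in hash are hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def hash_input(src: str, dst: str, flow_label: int,
               transport: Optional[Tuple[int, int, int]] = None) -> bytes:
    """Build the hash key: always the 3-tuple, plus the transport fields
    (protocol, source port, destination port) when the router could parse them."""
    fields = [src, dst, flow_label]
    if transport is not None:
        fields.extend(transport)
    return "|".join(str(f) for f in fields).encode()


def select_path(key: bytes, n_paths: int) -> int:
    """Reduce a hash of the key modulo the number of available paths."""
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_paths


# Flows with a zero flow label between the same pair of hosts.
flows = [(6, 49152 + i, 443) for i in range(1000)]
without_ports = {select_path(hash_input("2001:db8::1", "2001:db8::2", 0), 4)
                 for _ in flows}
with_ports = {select_path(hash_input("2001:db8::1", "2001:db8::2", 0, t), 4)
              for t in flows}
print(len(without_ports), len(with_ports))  # one path used vs. all four
```

When the flow label is zero, the 3-tuple degenerates to the 2-tuple and
every flow between the two hosts shares one path; adding the transport
fields when they are available restores the spread.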

4.  Security Considerations

   The flow label is not protected in any way and can be forged by an
   on-path attacker.  Off-path attackers are unlikely to guess a valid
   flow label if a pseudo-random value is used.  In either case, the
   worst an attacker could do against ECMP or LAG is to attempt to
   selectively overload a particular path.  For further discussion, see
   [RFC3697].

This section seems weak, though I need to think a bit about what it
should say.
