Re: 6MAN WG Last Call: draft-ietf-6man-flow-ecmp-00.txt

Brian E Carpenter <brian.e.carpenter@gmail.com> Mon, 17 January 2011 22:13 UTC

Message-ID: <4D34BF77.5040503@gmail.com>
Date: Tue, 18 Jan 2011 11:15:19 +1300
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
Organization: University of Auckland
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: Thomas Narten <narten@us.ibm.com>
Subject: Re: 6MAN WG Last Call: draft-ietf-6man-flow-ecmp-00.txt
References: <4D272FAC.70104@innovationslab.net> <201101171626.p0HGQTvU002982@cichlid.raleigh.ibm.com>
In-Reply-To: <201101171626.p0HGQTvU002982@cichlid.raleigh.ibm.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: ipv6@ietf.org, Brian Haberman <brian@innovationslab.net>, Bob Hinden <bob.hinden@gmail.com>
Precedence: list

Thomas,

Thanks for the thorough review. These are my comments; Shane may have
others.

On 2011-01-18 05:26, Thomas Narten wrote:
> Here is my review of the ECMP document. I don't think it is quite
> ready for advancement yet. I do generally support this document,
> though there are some clarifications I need to understand.
> 
> The review is rather lengthy, and is mostly editorial but there is
> some stuff that may turn out to be substantive.
> 
>>    When several network paths between the same two nodes are known by
>>    the routing system to be equally good (in terms of capacity and
>>    latency), it may be desirable to share traffic among them.  Two
>>    such
> 
> The document seems to focus on "equal cost" paths, but I assume that
> the applicability is really when you have multiple paths that you want
> to distribute traffic across. They may be equal cost, but they may not
> be. For instance, if I have two links, one with double the capacity of
> the other, I may want to distribute the traffic in a weighted fashion
> (i.e., 1/3 vs. 2/3). Would be good to make that clear.
> 
> I.e., rather than say "equal distribution", make clear that other
> types (weighted, etc.) are also covered.

Yes, true. I don't think this changes the main argument, but it
should be mentioned.
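
As a rough illustration of the weighted case (two links, one with twice the capacity of the other), here is a minimal sketch. The name `pick_weighted_path`, the integer weights, and the use of SHA-256 are assumptions made for the example; neither the draft nor this thread specifies a hash function.

```python
import hashlib

def pick_weighted_path(flow_key: bytes, weights: list[int]) -> int:
    """Map a flow key to a path index in proportion to integer weights."""
    total = sum(weights)
    # Hash the invariant flow key to a bucket in [0, total).
    digest = hashlib.sha256(flow_key).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the cumulative weights to find the path that owns this bucket.
    for index, weight in enumerate(weights):
        if bucket < weight:
            return index
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below total")

# Weights [1, 2] aim for roughly 1/3 of flows on path 0 and 2/3 on path 1.
path = pick_weighted_path(b"example-flow", [1, 2])
```

With equal weights this reduces to the plain modulo(N) selection, so the equal-cost case is just a special case of the weighted one.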

> 
> Later in the document, the term "load distribution" is used, which is
> I think closer to the mark. (This distinction should be clarified
> throughout.)
> 
>>    o  Work-conserving method (no idle time when queue is non-empty).
> 
> Do we really need to use the term "work conserving?" (I had to go look
> it up). I don't think it's particularly helpful to use that term in
> this document.

Well, maybe people don't care about it these days, but it seemed to be
a big concern when we were designing diffserv. It's a term of art,
which is perhaps distracting.

> 
>>    One lightweight approach to ECMP or LAG is this: if there are N
>>    equally good paths to choose from, then form a modulo(N) hash
>>    [RFC2991] from a consistent set of fields in each packet header, and
>>    use the resulting value to select a particular path.  If the hash
> 
> make clear the "consistent set of fields" means that the contents of
> those fields are invariant for packets of the same flow.

Yes
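
To make "invariant for packets of the same flow" concrete, here is a hedged sketch of the modulo(N) selection using only the 2-tuple. The function name `select_path` and the choice of SHA-256 are illustrative assumptions, not anything the draft mandates.

```python
import hashlib
import ipaddress

def select_path(src: str, dst: str, n_paths: int) -> int:
    """Pick one of n_paths from the 2-tuple, which never changes within a flow."""
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    digest = hashlib.sha256(key).digest()
    # Reduce the hash output modulo N to get a path index.
    return int.from_bytes(digest[:4], "big") % n_paths

# Every packet of this flow carries the same addresses, so it always
# lands on the same path and cannot be reordered by the load sharing.
first = select_path("2001:db8::1", "2001:db8::2", 4)
again = select_path("2001:db8::1", "2001:db8::2", 4)
```

Fields that vary within a flow (hop limit, payload length) would break this property, which is why they must stay out of the key.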

> 
>>    use the resulting value to select a particular path.  If the hash
>>    function is chosen so that the hash values have a uniform statistical
>>    distribution, this method will share traffic roughly equally between
> 
> For "hash values", I assume this means the "output of the hash" (not
> the inputs). Perhaps make that more clear? (And it's probably better
> to say "output" than "value" generally.)

OK. You may want to update the Wikipedia article, which says
"The values returned by a hash function are called hash values,
hash codes, hash sums, checksums or simply hashes."

> 
>    The question with such a method, then, is which IP header fields to
>    include to identify a flow.  A minimal choice in the routing system
>    is simply to use a hash of the source and destination IP addresses,
>    i.e., the 2-tuple.  This is necessary and sufficient to avoid out-of-
>    order delivery, and with a wide variety of sources and destinations,
>    as one finds in the core of the network, sometimes sufficient to
>    achieve work-conserving load sharing.
>    
> I'd suggest starting this paragraph as a new section, entitled
> something like "Choice of IP Header Fields for Hash Input".

OK

> 
> Also, this new section might benefit from some reorganization. Most of
> the info here is good, but the flow could be better.
> 
> For example, after the above paragraph would be a good time to go into
> more detail about why using just the src/dst is insufficient in
> practice. The mailing list discussion talked a lot about tunneled
> protocols being the real problem (and how tunneling is
> increasing). Real examples of how in practice this is inadequate would
> strengthen the motivation for this document. You talk about this
> further down, but saying it here (all in one place) might be good.

OK

> 
> Then move on to using 5 tuples, and again describe the limitations and
> problems. I think it is critical in this part to justify why the
> existing practice as used in IPv4 today is not sufficient for
> IPv6. I.e., why redefining the Flow Label is *necessary* rather than
> just *nice to have*.

My understanding is that existing practice is not sufficient for
IPv4 either, so using (not redefining) the flow label is a definite
plus point for IPv6. We can make this explicit.

> 
>    However, including transport layer information as input keys to a
>    hash may be a problem for IPv4 fragments [RFC2991].  In addition,
> 
> Yes, but is that relevant to IPv6?  (If so, say so, otherwise, the
> point is presumably irrelevant.)

As far as I understand things, the unfragmentable part of an IPv6
packet does not include the transport header, so the same problem
exists for IPv6. I think it should just say "IP fragments".

> 
>>    hash may be a problem for IPv4 fragments [RFC2991].  In addition,
>>    protocol and destination port numbers in the hash will not only make
>>    the hash slightly more expensive to compute, but will not
>>    particularly improve the hash distribution, due to the prevalence of
>>    well known port numbers and popular protocol numbers.  Ephemeral
>>    ports, on the other hand, are quite well distributed [Lee10].  In the
> 
> I disagree with the latter part of the above. It seems to me that
> including port numbers (absent tunneling) will certainly improve the
> hash distribution. It is not a requirement that the input to the hash
> be uniformly distributed. A good hash will produce a good output
> distribution, even if the input is skewed. You seem to be making the
> assumption that if the ports are not distributed, the outputs of the
> hash won't be distributed either. I do not believe that is true
> generally.

We may be overcooking the argument.

> 
>    well known port numbers and popular protocol numbers.  Ephemeral
>    ports, on the other hand, are quite well distributed [Lee10].  In the
> 
> And since most flows use one well-known port and one ephemeral port,
> using both ports as input would seem to give you good
> properties. Right? In which case the suggestion that using ports isn't
> good isn't true.

Yes, the well-known port may be chosen from a very small set in
practice so provides little entropy, but the ephemeral port is
pretty good.
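
A small experiment along these lines, offered as a sketch only: every flow uses the same well-known destination port, and only the ephemeral source port varies. The helper name `bucket`, the port range 49152-65535, and SHA-256 are assumptions for illustration.

```python
import hashlib
from collections import Counter

def bucket(src_port: int, dst_port: int, n_paths: int) -> int:
    """Hash a (source port, destination port) pair onto one of n_paths."""
    key = src_port.to_bytes(2, "big") + dst_port.to_bytes(2, "big")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Skewed input: the destination port is always 443, so all of the
# variation comes from the ephemeral source port.
counts = Counter(bucket(sport, 443, 4) for sport in range(49152, 65536))
```

With a well-mixing hash, the four counts should come out close to one another even though half of the input never changes. A weaker hash (say, a plain XOR of the two ports) could behave differently, which is the case where input distribution starts to matter.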

> 
>    In the
>    case of IPv6, protocol numbers are particularly inconvenient due to
>    the variable placement of and variable length of next-headers.  In
> 
> How many of these next headers exist in practice today?  I.e., is this
> a real problem, or a theoretical problem?

I guess this is the same argument as for draft-ietf-6man-exthdr,
but in any case the logic that extracts the 5-tuple has to allow
for all legal options, even if extension headers are rare. Should
be rephrased to say that.
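
For illustration, a minimal sketch of what "allowing for all legal options" involves: walking the next-header chain until a non-extension header is reached. The function name `find_upper_layer` is hypothetical, and the sketch deliberately ignores ESP and non-first fragments, where the transport header is not available at all.

```python
# IPv6 extension header types defined in RFC 2460 (plus AH).
HOP_BY_HOP, ROUTING, FRAGMENT, AUTH, DEST_OPTS = 0, 43, 44, 51, 60
EXTENSION_HEADERS = (HOP_BY_HOP, ROUTING, FRAGMENT, AUTH, DEST_OPTS)

def find_upper_layer(first_next_header: int, payload: bytes) -> tuple[int, int]:
    """Return (protocol, offset) of the first non-extension header.

    first_next_header is the Next Header field of the fixed IPv6 header;
    payload is everything after the 40-octet fixed header.
    """
    next_header, offset = first_next_header, 0
    while next_header in EXTENSION_HEADERS:
        if offset + 8 > len(payload):
            raise ValueError("truncated extension header")
        current = next_header
        next_header = payload[offset]
        if current == FRAGMENT:
            offset += 8  # the Fragment header has a fixed size
        elif current == AUTH:
            offset += (payload[offset + 1] + 2) * 4  # AH length is in 4-octet units
        else:
            offset += (payload[offset + 1] + 1) * 8  # 8-octet units, excluding the first 8
    return next_header, offset
```

Even when extension headers are rare, a loop like this has to exist in the forwarding path before the port numbers can be read, whereas the flow label sits at a fixed offset in the fixed header.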

> 
>    the variable placement of and variable length of next-headers.  In
>    addition, [RFC2460] recommends that all next-headers, except hop-by-
>    hop options, should not be inspected by intermediate nodes in the
>    network, presumably to make introduction of new next-headers more
>    straightforward.
> 
> Which text in 2460 is being referred to here? I'm not sure I agree
> with the last part of the sentence.

  "With one exception, extension headers are not examined or processed
   by any node along a packet's delivery path, until the packet reaches
   the node (or each of the set of nodes, in the case of multicast)
   identified in the Destination Address field of the IPv6 header."

It's not a formal recommendation, so we should rephrase the sentence;
it doesn't add much anyway.

> 
>    The situation is different in tunneled scenarios.  Identifying a
>    flow inside the tunnel is more complicated, particularly because
>    nearly all hardware can only identify flows based on information
>    contained in the outermost IP header.  Assume that traffic from
>    many sources to many destinations is aggregated in a single
>    IP-in-IP tunnel from
> 
> true for IPinIP, but what about GRE with keys?

Maybe that's why GRE keys were invented ;-) We should be clear
about it being IPinIP.
> 
>    achieved.  If there is much tunnel traffic, this will result in a
>    high probability of congestion on one of the paths between R1 and R2.
> 
> Not necessarily true. Better to just say traffic won't be distributed
> as intended. That may or may not result in congestion.

Yes

> 
>    Also, for IPv6, the total number of bits in the 5-tuple is quite
>    large (296), as well as inconvenient to extract due to the next-
>    header placement.  This may be challenging for some hardware
>    implementations, raising the potential that network equipment vendors
>    might sacrifice the length of the fields extracted from an IPv6
> 
> I disagree with some of the above. First, 256 bits comes just from the
> source/dest addresses. another 40 bits is noise relative to this in
> terms of "number of bits". I.e., saying a 5 tuple is too many bits,
> but just the src/addr (possibly with the Flow Label) is OK doesn't
> make sense to me.  The argument about accessing the fields seems more
> compelling and should be made separately.

Yes
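
The arithmetic behind the bit counts, written out as a quick check. The dictionary layout is just for illustration; the field widths are the standard ones.

```python
# Field widths in bits for the IPv6 5-tuple.
FIELD_BITS = {
    "source addr": 128,
    "dest addr": 128,
    "protocol": 8,       # Next Header value of the transport protocol
    "source port": 16,
    "dest port": 16,
}
FLOW_LABEL_BITS = 20

five_tuple_bits = sum(FIELD_BITS.values())                      # 296
address_bits = FIELD_BITS["source addr"] + FIELD_BITS["dest addr"]  # 256
three_tuple_bits = address_bits + FLOW_LABEL_BITS               # 276
```

The difference between the 5-tuple and the {dest addr, source addr, flow label} triple is only 20 bits, which supports treating field access, not bit count, as the real argument.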

> 
>    header.  The question therefore arises whether the 20-bit flow label
>    in IPv6 packets would be suitable for use as input to an ECMP or LAG
>    hash algorithm.  If it could be used in place of the port numbers and
>    protocol number in the 5-tuple, the hash calculation would be
>    simplified.
> 
> I think the stronger argument here is that it is more likely that a
> sender could set the Flow Label to a useful value here in cases like
> tunneling, where the intermediate routers in practice would not have
> access to the port numbers in the inner header. The good thing is
> that TEPs can do this today without requiring any changes in existing
> specs.

Yes

> 
>    would it do any harm to the distribution of the hash values.  If the
>    community at some stage agrees to set pseudo-random flow labels in
>    the majority of traffic flows, this would add to the value of the
>    hash.
> 
> I'm confused. Isn't the latter exactly what this document is
> recommending???

We're recommending that the TEP does so; recommending that everybody
does so is the other draft. As you say, TEPs can do this today if they
want (that's why this draft is compatible with RFC 3697 as-is).

> 
>    source to the destination(s)."  The RFC should perhaps have made
>    clear that a router that has participated in flow state establishment
>    can rely on properties of the resulting flow label values without
>    further signaling.  If a router knows these properties, rule 2 is
> 
> Right. This is my understanding of what the words should have meant.
> 
>    In the tunneling situation sketched above, routers R1 and R2 can rely
>    on the flow labels set by TEP A and TEP B being assigned by a known
>    method.  This allows a safe ECMP or LAG method to be based on the
>    flow label without breaching [RFC3697].
> 
> Technically, this is only true for those flows that the TEP has
> set. It is not true for all flows generally, and presumably that is
> what R1 and R2 would be assuming... 

Correct, the problem *this* draft tackles is only balancing the
TEP-to-TEP traffic.
> 
> Also, I do not believe that including the Flow Label bits as an input
> key to ECMP violates 3697 at all. What text does this violate?

It doesn't, so clearly I need to rephrase slightly.

> 
>    At the time of this writing, the IETF is discussing a possible
>    revision of the rules of RFC 3697 [I-D.ietf-6man-flow-update].  If
>    adopted, that revision would be fully compatible with the present
>    document and would obviate much of the above discussion.
> 
> This wording seems odd, since the ID cited does not carry any weight
> and does not update 3697.

See my reply on the other thread.

>    
>    o  The flow label in the outer packet SHOULD be set by the sending
>       TEP to a pseudo-random 20-bit value in accordance with [RFC3697].
> 
> pseudo random is not right here.
> 
> This is a bit of a divergence, but I do not believe it is a
> requirement that someone who sets the Flow Label set it to a
> pseudo-random value. Please clarify *why* this needs to be a
> SHOULD. Who benefits? There was email discussion about this. Routers
> can't assume they will be pseudo-random, otherwise they set themselves
> up for DOS attacks.  So why is this necessary?

Our contention is that this will improve the hash distribution
compared to a deterministic method of allocating flow labels.
Given that we aren't specifying a hash algorithm, that's
a difficult assertion to prove. However, my understanding is that
if the inputs to a hash algorithm are distributed uniformly, you
can use a relatively simple algorithm to get a uniform output
distribution. Pseudo-random values help with this.

>       as determined by the IP header fields of the inner
>       packet.
>       *  Note that this rule is a SHOULD rather than a MUST, to permit
>          individual implementers to take an alternative approach if they
>          wish to do so.  Such an alternative MUST conform to [RFC3697].
> 
> Please clarify what type of an exception you are trying to allow
> for. I need to understand the thinking before agreeing to allow for
> this. :-)

Well, for example, some lazy implementer who doesn't bother with
a pseudo-random function but just does next_label++. That's
why there's a normative reference.

> 
>    o  The sending TEP MUST classify all packets into flows, once it has
>       determined that they should enter a given tunnel, and then write
>       the relevant flow label into the outer IPv6 header.  A user flow
>       could be identified by the ingress TEP most simply by its
>       {destination, source} address pair (coarse) or by its 5-tuple
>       {dest addr, source addr, protocol, dest port, source port}
>       (fine).
> 
> Isn't it ironic that you don't include the Flow Label here? I.e., if
> the user sets the Flow Label, you don't use its value? That doesn't
> seem right! :-)

Yes, in today's situation where the label is always zero. Of course,
if the source set it, it should be in the mix.

We should probably note the irony.
> 
>       This is an implementation detail in the sending TEP.
>       *  It might be possible to make this classifier stateless, by
>          using a suitable 20 bit hash of the inner IP header's 2-tuple
>          or 5-tuple as the pseudo-random flow label value.
> 
> Doesn't this go against the 3697 requirement not to reuse a value?

You raise an interesting point. To be brutal, if your use case
is statistical load balancing, you really don't care about occasional
re-use. And you really do want to be stateless for performance reasons.
> 
> The issue I have with this is that it does mean some base assumptions
> about Flow Labels no longer hold in general. This may or may not be a
> problem, but we should speak to this point directly. I.e, this
> specification is proposing some possibly non-backwards compatible
> changes to the Flow Label semantics.

There's a get-out-of-jail card in 3697:

"1.  Introduction

   A flow is a sequence of packets sent from a particular source to a
   particular unicast, anycast, or multicast destination that the source
   desires to label as a flow.  A flow could consist of all packets in a
   specific transport connection or a media stream.  However, a flow is
   not necessarily 1:1 mapped to a transport connection."

In other words, and this was entirely intentional in 3697, a flow is
whatever the source chooses to label as a flow (between a given source
and destination). So it is completely consistent for a TEP to label
two separate user flows with the same label, if that's the way its
stateless algorithm happens to operate.
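
A minimal sketch of such a stateless classifier at the sending TEP, assuming the inner packet's 5-tuple is available. The name `stateless_flow_label` and the choice of SHA-256 are hypothetical; the draft leaves the hash unspecified.

```python
import hashlib
import ipaddress

def stateless_flow_label(src: str, dst: str, proto: int,
                         sport: int, dport: int) -> int:
    """Derive a 20-bit label for the outer header from the inner 5-tuple."""
    key = (
        ipaddress.ip_address(src).packed
        + ipaddress.ip_address(dst).packed
        + bytes([proto])
        + sport.to_bytes(2, "big")
        + dport.to_bytes(2, "big")
    )
    digest = hashlib.sha256(key).digest()
    # Keep only the low 20 bits: the size of the IPv6 flow label field.
    return int.from_bytes(digest[:3], "big") & 0xFFFFF

label = stateless_flow_label("2001:db8:a::1", "2001:db8:b::1", 6, 51000, 443)
```

No per-flow state is kept, so two user flows will occasionally receive the same label. The sketch also does not special-case a result of zero, which RFC 3697 reserves for unlabeled packets; an implementation would need to decide how to handle that.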

> 
>    o  At intermediate router(s) that perform load distribution of
>       tunneled packets whose source address is a TEP, the hash algorithm
>       used to determine the outgoing component-link in an ECMP and/or
>       LAG toward the next-hop MUST minimally include the triple {dest
>       addr, source addr, flow label} to meet the [RFC3697] rules.
> 
> The wording here is awkward. It seems to suggest that you only expect
> routers to do this for tunneled packets? How will they know which
> packets those are? Sounds inefficient and requiring of
> configuration... In practice, won't they just do this for all packets?

Yes, definitely.

> If so, the wording above seems convoluted... or trying to make
> something legal if implementations do X, knowing full well they won't
> in practice.

It needs rephrasing.

> 
> Also, which 3697 rules are being referred to above? How is it a
> violation of an existing standard to do ECMP on just the src/dst
> fields?

Right, that needs rephrasing too - it isn't a question of rules.

> 
>       *  Intermediate router(s) MAY also include {protocol, dest port,
>          source port} as input keys to the ECMP and/or LAG hash
>          algorithms, to provide sufficient entropy in cases where the
>          flow-label is currently set to zero.
> 
> Seems like this should be a SHOULD. It never hurts to include these
> fields, and in many cases helps. To keep things simple, shouldn't the
> default be just to use the 6 tuple (when available) as input to the
> hash?

Well, that's debatable. I still think it's a MAY in RFC 2119 terms,
but it's a (lower case) should as a matter of common sense.
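
To illustrate the two options, here is a hedged sketch of an intermediate router's hash key: the triple {dest addr, source addr, flow label} always, with the transport fields added only when the caller supplies them. The name `ecmp_path`, the `ports` parameter, and SHA-256 are assumptions for the example.

```python
import hashlib
import ipaddress
from typing import Optional, Tuple

def ecmp_path(src: str, dst: str, flow_label: int, n_paths: int,
              ports: Optional[Tuple[int, int, int]] = None) -> int:
    """Select a path from the triple, optionally extended to the 6-tuple."""
    key = (
        ipaddress.ip_address(src).packed
        + ipaddress.ip_address(dst).packed
        + flow_label.to_bytes(3, "big")
    )
    if ports is not None:
        # Optional (protocol, source port, dest port), when they can be extracted.
        proto, sport, dport = ports
        key += bytes([proto]) + sport.to_bytes(2, "big") + dport.to_bytes(2, "big")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Tunnel traffic between two TEPs: the addresses are identical for every
# packet, so only the flow label can spread it over the four paths.
tep_a, tep_b = "2001:db8::a", "2001:db8::b"
unlabeled = {ecmp_path(tep_a, tep_b, 0, 4) for _ in range(1000)}
labeled = {ecmp_path(tep_a, tep_b, label, 4) for label in range(1, 1001)}
```

Here `unlabeled` collapses to a single path, while `labeled` should cover all of them. The same function applied to every packet, tunneled or not, avoids the need for the router to know which packets came from a TEP.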

> 
> 4.  Security Considerations
> 
>    The flow label is not protected in any way and can be forged by an
>    on-path attacker.  Off-path attackers are unlikely to guess a valid
>    flow label if a pseudo-random value is used.  In either case, the
>    worst an attacker could do against ECMP or LAG is to attempt to
>    selectively overload a particular path.  For further discussion, see
>    [RFC3697].
> 
> This section seems weak, though I need to think a bit about what it
> should say.

Ideas welcome. We couldn't think of much to say; we tend to assume
that TEPs that are likely to use this will be ISP-managed and really
not exposed to MITM risks, so off-path injection of packets
is the only obvious risk.

    Brian
