Re: [v6ops] I-D Action: draft-jaeggli-v6ops-pmtud-ecmp-problem-00.txt

joel jaeggli <joelja@bogus.com> Wed, 04 June 2014 23:41 UTC

Message-ID: <538FAE82.1000902@bogus.com>
Date: Wed, 04 Jun 2014 16:40:50 -0700
From: joel jaeggli <joelja@bogus.com>
To: Brian E Carpenter <brian.e.carpenter@gmail.com>
References: <20140602072659.7433.89475.idtracker@ietfa.amsl.com> <538E73EE.8050409@gmail.com> <538EA522.4060507@bogus.com> <538F8246.9000909@gmail.com>
In-Reply-To: <538F8246.9000909@gmail.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/v6ops/hTer6M-xhgs-5waxDUWI1W40KE4
Cc: IPv6 Operations <v6ops@ietf.org>, draft-jaeggli-v6ops-pmtud-ecmp-problem@tools.ietf.org
Subject: Re: [v6ops] I-D Action: draft-jaeggli-v6ops-pmtud-ecmp-problem-00.txt

On 6/4/14, 1:32 PM, Brian E Carpenter wrote:
> On 04/06/2014 16:48, joel jaeggli wrote:
>> On 6/3/14, 6:18 PM, Brian E Carpenter wrote:
>>> Hi,
>>>
>>>>    A problem common to the approach of distribution through hashing is
>>>>    its impact on path MTU discovery.  An ICMPv6 type 2 PTB message
>>>>    generated on the path between a client and an ECMP load balanced
>>>>    server will have the anycast address as the destination and will be
>>>>    statelessly load balanced to one of the anycast servers. 
>>> This may seem picky, but I think the reader's brain will run
>>> more smoothly if it's explicit that it's a PTB triggered by
>>> a packet *from* the server and therefore directed *to* the server.
>>> "on the path" doesn't quite contain that information.
>>
>> I think that's a reasonable observation and fairly straightforward to
>> adjust. To my mind the destination address being the anycast address
>> does communicate the direction of this PTB message.
>>
>>>>             Because of this, the results of
>>>>    the ICMPv6 ECMP hash do not match that of the corresponding TCP or
>>>>    UDP ECMP hash.
>>> Again picky, but if there are (say) 2 paths, there's a 50% chance
>>> that it gets the right path, etc. So "might not match" is probably
>>> better.
>>
>> If there are 64 hash buckets, the probability of ending up in the same
>> one is ~1.6%; if there are 255, it's ~0.4%. So "may not" is potentially a
>> near certainty. It's entirely possible to have more hash buckets than
>> next-hops in order to facilitate load balancing without rehashing when
>> adding and removing devices, in which case the distribution may vary, and
>> in those cases a packet hashed to the wrong bucket may well arrive at
>> the correct server/load-balancer.
> 
> My logic is the number of hash buckets doesn't matter - it's the number
> of paths, because if there are N paths, ~1/N of the traffic will go
> along each path regardless of the number of hash buckets.

Yeah, and I had 64 paths in one deployment, so by that logic the odds of
a PTB landing on the right path there were about 1 in 64.
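
To put numbers on the buckets-versus-paths point, here is a minimal
sketch. It assumes the PTB hash is independent of the flow's hash, both are
uniform over the buckets, and buckets are assigned to next-hops round-robin.
The function name and the round-robin assignment are illustrative
assumptions, not anything a particular router does.

```python
from fractions import Fraction


def match_probability(buckets: int, next_hops: int) -> Fraction:
    """Chance that two independent uniform hashes reach the same next-hop.

    Buckets are mapped to next-hops round-robin (an assumption made
    for illustration only).
    """
    if buckets < 1 or next_hops < 1:
        raise ValueError("buckets and next_hops must be positive")
    # Count how many buckets point at each next-hop.
    counts = [0] * next_hops
    for bucket in range(buckets):
        counts[bucket % next_hops] += 1
    # Both packets must pick buckets owned by the same next-hop.
    return sum(Fraction(c, buckets) ** 2 for c in counts)


if __name__ == "__main__":
    for buckets, hops in [(2, 2), (64, 64), (256, 4), (255, 255)]:
        p = match_probability(buckets, hops)
        print(f"{buckets} buckets, {hops} next-hops: {float(p):.4f}")
```

With one bucket per path this is just 1/N, and with many buckets spread
evenly over few paths it stays at 1/N, which is the point being made in
the quoted text above.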

>>
>>> General comment: I'm not sure what is specific to ECMP
>>> about this problem. We identified it as a general (but out of
>>> scope) problem in RFC 7098. We did include this comment there:
>>
>> So it may be out of scope for your RFC, but I'm pretty sure that doesn't
>> make the problem go away. I've been dinking around with large-scale
>> load-balancing of this flavor for some time, more than a decade in one
>> form or another. We found it to be a commercial necessity to address the
>> problem of how to make PTB work as part of deploying consumer-facing
>> internet application services. It's not unique to ICMP, e.g. my
>> interest in the fragdrop discussion is due to similar problems.
> 
> Fully agreed. It just wasn't a problem we could ameliorate using the
> flow label, so it didn't belong in 7098.

If I were hashing on the flow label (which frankly I wouldn't mind
doing) I'd still have to find it. It can probably be found fairly
easily: since, theoretically, the packet in the payload originated
with me, and I emitted no fragments and used no extension headers, the
payload in the ICMP after offset 32 should be the IP header, which is
conveniently fixed, and the upper-layer header. Doing this in silicon
depends on how much of the packet I can sluice off with the header (and
vendor implementation, of course).
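
As a software illustration of that lookup (not how silicon would do it),
here is a rough sketch. It assumes the RFC 4443 layout, where the invoking
packet starts 8 bytes into the ICMPv6 message (after type, code, checksum
and MTU), and it assumes the embedded packet has no extension headers, as
described above. The function name parse_ptb and the returned dictionary
keys are made up for this example.

```python
import ipaddress
import struct

ICMPV6_PTB = 2        # ICMPv6 type for Packet Too Big
ICMPV6_HDR_LEN = 8    # type, code, checksum, MTU
IPV6_HDR_LEN = 40     # fixed IPv6 header
TCP, UDP = 6, 17


def parse_ptb(icmp: bytes) -> dict:
    """Pull the original flow's fields out of an ICMPv6 PTB message.

    `icmp` starts at the ICMPv6 header (outer IPv6 header already removed).
    """
    if len(icmp) < ICMPV6_HDR_LEN + IPV6_HDR_LEN:
        raise ValueError("message too short to hold an embedded IPv6 header")
    msg_type, _code, _cksum, mtu = struct.unpack("!BBHI", icmp[:ICMPV6_HDR_LEN])
    if msg_type != ICMPV6_PTB:
        raise ValueError("not a Packet Too Big message")

    inner = icmp[ICMPV6_HDR_LEN:]
    ver_tc_fl, _plen, next_header, _hlim = struct.unpack("!IHBB", inner[:8])
    if ver_tc_fl >> 28 != 6:
        raise ValueError("embedded packet is not IPv6")

    flow = {
        "mtu": mtu,
        "flow_label": ver_tc_fl & 0xFFFFF,   # low 20 bits of the first word
        "src": ipaddress.IPv6Address(inner[8:24]),
        "dst": ipaddress.IPv6Address(inner[24:40]),
        "next_header": next_header,
        "sport": None,
        "dport": None,
    }
    # Ports sit at the same place for TCP and UDP; only read them when the
    # upper-layer header directly follows the fixed IPv6 header.
    if next_header in (TCP, UDP) and len(inner) >= IPV6_HDR_LEN + 4:
        flow["sport"], flow["dport"] = struct.unpack(
            "!HH", inner[IPV6_HDR_LEN:IPV6_HDR_LEN + 4]
        )
    return flow
```

Note that src in the result is the server's (anycast) address and dst is
the client, since the embedded packet is the one the server sent.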

>>
>>>  o  Note that correct handling of ICMPv6 for Path MTU Discovery
>>>     requires the layer 3/4 balancer to keep state for the client
>>>     source address, independently of either the port numbers or the
>>>     flow label.
>>
>> So there are two problems with this assertion.
>>
>> 1. State sharing is presently impossible at the scale I'm operating at,
>> so any solution that presumes it across an entire population of devices
>> isn't going to work. State sharing between pairs of devices in otherwise
>> stateless clusters (which describes the internal architecture of a
>> number of high-end firewall and load-balancer products) doesn't solve
>> this either.
> 
> Agreed. Our comment was not intended to suggest that it was practical
> to synchronise state in that way.
> 
>>
>> 2. The PTB packet doesn't carry the parts of the flow within
>> the IP and ICMP headers, so the flow that it was associated with cannot be
>> reconstructed by a pure L3/L4 device, stateful or otherwise. It can, as
>> noted in the draft, be derived by inspecting the ICMP payload for the
>> IP/TCP/UDP header of the outgoing packet, which is pretty deep packet
>> inspection (normally done by the end system). Finding the ICMP header is
>> not the same as being able to parse the payload. In any event we were
>> using rather high-end but otherwise normal router silicon, so we get
>> to do this the same as anyone else who fronts a load-balancer or server
>> tier containing ECMP next-hops with a router...
> 
> Agreed. Again, we didn't mean to imply that line-speed dissection
> of ICMP was practical; only that it seems to be needed to send
> PTBs where they need to go.
> 
>>
>>> because, indeed, the PMTU is a function of the address pair only.
>>
>> The PTB packet has the source address of the device that emitted it.
> 
> Yes, of course. That's why you have to dissect the packet.
> 
> (One fanciful idea that the authors of 7098 discussed was to
> specify that the outgoing flow label value should be reflected
> back in the ICMPv6 packet, but that didn't seem like a
> deployable idea.)

In practice, even if many of them weren't zero, 20 bits isn't enough for
the flow label by itself to positively identify the flow in my case,
notwithstanding the lack of state sharing.

If it used the destination address and the flow label, that would
actually do it, but then it's spoofing.
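
For a rough feel of why 20 bits runs out, here is a back-of-the-envelope
sketch. It assumes flow labels are drawn uniformly at random (in practice
many are zero, which only makes it worse), and the flow counts are made-up
illustrative numbers, not measurements from any deployment.

```python
FLOW_LABEL_SPACE = 2 ** 20  # 20-bit flow label: 1,048,576 possible values


def mean_flows_per_label(flows: int) -> float:
    """Average number of concurrent flows sharing each label value."""
    return flows / FLOW_LABEL_SPACE


def prob_label_unique(flows: int) -> float:
    """Probability that one given flow's label is used by no other flow."""
    if flows < 1:
        raise ValueError("need at least one flow")
    return (1 - 1 / FLOW_LABEL_SPACE) ** (flows - 1)


if __name__ == "__main__":
    # Hypothetical concurrent flow counts, for illustration only.
    for flows in (100_000, 1_000_000, 5_000_000):
        print(
            f"{flows:>9} flows: "
            f"{mean_flows_per_label(flows):.2f} per label, "
            f"P(unique) = {prob_label_unique(flows):.4f}"
        )
```

Once the number of concurrent flows passes the size of the label space,
the label alone points at several candidate flows on average.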

>>
>> In any event, our goal wasn't really to boil the ocean with respect to
>> parsing the packet or deriving the correct host, if we could get close
>> enough; it was to not break PMTUD, and therefore customers, in a service
>> that supported millions of users.
> 
> Fully understood.
> 
>    Brian
>