Re: [v6ops] I-D Action: draft-jaeggli-v6ops-pmtud-ecmp-problem-00.txt

Brian E Carpenter <brian.e.carpenter@gmail.com> Wed, 04 June 2014 20:32 UTC

Message-ID: <538F8246.9000909@gmail.com>
Date: Thu, 05 Jun 2014 08:32:06 +1200
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
Organization: University of Auckland
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: joel jaeggli <joelja@bogus.com>
References: <20140602072659.7433.89475.idtracker@ietfa.amsl.com> <538E73EE.8050409@gmail.com> <538EA522.4060507@bogus.com>
In-Reply-To: <538EA522.4060507@bogus.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Archived-At: http://mailarchive.ietf.org/arch/msg/v6ops/29Kqmcv7PB1cU-z2lYJqzJGx3Dw
Cc: IPv6 Operations <v6ops@ietf.org>, draft-jaeggli-v6ops-pmtud-ecmp-problem@tools.ietf.org
Subject: Re: [v6ops] I-D Action: draft-jaeggli-v6ops-pmtud-ecmp-problem-00.txt

On 04/06/2014 16:48, joel jaeggli wrote:
> On 6/3/14, 6:18 PM, Brian E Carpenter wrote:
>> Hi,
>>
>>>    A problem common to the approach of distribution through hashing is
>>>    its impact on path MTU discovery.  An ICMPv6 type 2 PTB message
>>>    generated on the path between a client and an ECMP load balanced
>>>    server will have the anycast address as the destination and will be
>>>    statelessly load balanced to one of the anycast servers. 
>> This may seem picky, but I think the reader's brain will run
>> more smoothly if it's explicit that it's a PTB triggered by
>> a packet *from* the server and therefore directed *to* the server.
>> "on the path" doesn't quite contain that information.
> 
> I think that's a reasonable observation and fairly straightforward to
> adjust. To my mind the destination address being the anycast address
> does communicate the direction of this PTB message.
> 
>>>             Because of this, the results of
>>>    the ICMPv6 ECMP hash do not match that of the corresponding TCP or
>>>    UDP ECMP hash.
>> Again picky, but if there are (say) 2 paths, there's a 50% chance
>> that it gets the right path, etc. So "might not match" is probably
>> better.
> 
> If there are 64 hash buckets the probability of ending up in the same
> one is 1.5%; if there are 255 it's ~0.4%. So "may not" is potentially a
> near certainty. It's entirely possible to have more hash buckets than
> next-hops in order to facilitate load balancing without rehashing when
> adding and removing devices, in which case the distribution may vary, and
> in those cases a packet hashed to the wrong bucket may well arrive on
> the correct server/load-balancer.

My logic is that the number of hash buckets doesn't matter - it's the
number of paths, because if there are N paths, ~1/N of the traffic will go
along each path regardless of the number of hash buckets.
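
To put numbers on that, here is a minimal sketch, not anything taken from
the draft. It assumes that buckets are dealt out to paths round-robin and
that the hash of the ICMPv6 packet is uniform and independent of the hash
of the original flow. The function name same_path_probability is
hypothetical.

```python
from fractions import Fraction


def same_path_probability(num_paths, num_buckets):
    """Chance that two independent, uniform bucket choices map to one path.

    Assumption: bucket b is assigned to path (b % num_paths), i.e. buckets
    are spread round-robin over the available paths.
    """
    # Count how many buckets each path owns.
    buckets_per_path = [0] * num_paths
    for bucket in range(num_buckets):
        buckets_per_path[bucket % num_paths] += 1
    # Both packets pick path p with probability (share of p) squared.
    return sum(Fraction(count, num_buckets) ** 2 for count in buckets_per_path)


if __name__ == "__main__":
    for paths, buckets in [(2, 64), (2, 255), (4, 256), (64, 64)]:
        p = same_path_probability(paths, buckets)
        print(f"{paths} paths, {buckets} buckets: {float(p):.4f}")
```

Under those assumptions the result depends on the number of paths, not on
the number of buckets: 2 paths give 1/2 with 64 buckets, and 1/64 only
appears when there are as many paths as buckets.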

> 
>> General comment: I'm not sure what is specific to ECMP
>> about this problem. We identified it as a general (but out of
>> scope) problem in RFC 7098. We did include this comment there:
> 
> So it may be out of scope for your RFC, but I'm pretty sure that doesn't
> make the problem go away. I've been dinking around with large-scale
> load-balancing of this flavor for some time, more than a decade in one
> form or another. We found it to be a commercial necessity to address the
> problem of how to make PTB work as part of deploying consumer-facing
> internet application services. It's not unique to ICMP; e.g., my
> interest in the fragdrop discussion is due to similar problems.

Fully agreed. It just wasn't a problem we could ameliorate using the
flow label, so it didn't belong in 7098.

> 
>>  o  Note that correct handling of ICMPv6 for Path MTU Discovery
>>     requires the layer 3/4 balancer to keep state for the client
>>     source address, independently of either the port numbers or the
>>     flow label.
> 
> So there are two problems with this assertion.
> 
> 1. State sharing is presently impossible at the scale I'm operating at,
> so any solution that presumes it across an entire population of devices
> isn't going to work. State sharing between pairs of devices in otherwise
> stateless clusters (which describes the internal architecture of a
> number of high-end firewall and load-balancer products) doesn't solve
> this either.

Agreed. Our comment was not intended to suggest that it was practical
to synchronise state in that way.

> 
> 2. The PTB packet doesn't have the parts of the flow within its IP and
> ICMP headers, so the flow that it was associated with cannot be
> reconstructed by a pure L3/L4 device, stateful or otherwise. It can, as
> noted in the draft, be derived by inspecting the ICMP payload for the
> IP/TCP/UDP header of the outgoing packet, which is pretty deep packet
> inspection (normally done by the end system). Finding the ICMP header is
> not the same as being able to parse the payload. In any event we were
> using rather high-end but otherwise normal router silicon, so we get
> to do this the same as anyone else who fronts a load-balancer or server
> tier containing ECMP next-hops with a router...

Agreed. Again, we didn't mean to imply that line-speed dissection
of ICMP was practical; only that it seems to be needed to send
PTBs where they need to go.
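
For concreteness, a rough sketch of what that dissection involves, in
Python purely for illustration (nobody would do this in software at line
rate). It assumes the input starts at the ICMPv6 header, that the embedded
packet is IPv6 with TCP or UDP directly after the fixed header (no
extension headers), and that the embedded packet was sent by the server, so
its source is the anycast address. The name flow_key_from_ptb is
hypothetical.

```python
import ipaddress
import struct

ICMPV6_PACKET_TOO_BIG = 2   # ICMPv6 type for Packet Too Big
TCP, UDP = 6, 17            # IPv6 Next Header values
ICMPV6_HDR_LEN = 8          # type, code, checksum, MTU
IPV6_HDR_LEN = 40           # fixed IPv6 header


def flow_key_from_ptb(icmp6):
    """Recover the flow a PTB refers to, from the client's point of view.

    Returns (client, server, client_port, server_port, mtu), or None if
    the message is not a PTB this sketch knows how to parse.
    """
    if len(icmp6) < ICMPV6_HDR_LEN + IPV6_HDR_LEN + 4:
        return None
    icmp_type, _code, _checksum, mtu = struct.unpack_from("!BBHI", icmp6, 0)
    if icmp_type != ICMPV6_PACKET_TOO_BIG:
        return None
    inner = icmp6[ICMPV6_HDR_LEN:]   # as much of the offending packet as fits
    if inner[0] >> 4 != 6:           # embedded packet must be IPv6
        return None
    if inner[6] not in (TCP, UDP):   # extension headers not handled here
        return None
    # The embedded packet went server -> client, so swap to get the key
    # that the client -> server flow would have been hashed on.
    server = ipaddress.IPv6Address(inner[8:24])
    client = ipaddress.IPv6Address(inner[24:40])
    server_port, client_port = struct.unpack_from("!HH", inner, IPV6_HDR_LEN)
    return client, server, client_port, server_port, mtu
```

The resulting tuple is what would have to be fed to the same hash used for
the client's TCP or UDP packets in order to pick the same server.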

> 
>> because, indeed, the PMTU is a function of the address pair only.
> 
> The ptb packet has the source address of the device that emitted it.

Yes, of course. That's why you have to dissect the packet.

(One fanciful idea that the authors of 7098 discussed was to
specify that the outgoing flow label value should be reflected
back in the ICMPv6 packet, but that didn't seem like a
deployable idea.)

> 
> In any event, our goal isn't really to boil the ocean with respect to
> parsing the packet or deriving the correct host, if we could get close
> enough; it was to not break PMTUD, and therefore customers, in a service
> that supported millions of users.

Fully understood.

   Brian