Re: Fun and surprises with IPv6 fragmentation

Patrick McManus <pmcmanus@mozilla.com> Sat, 03 March 2018 15:23 UTC

From: Patrick McManus <pmcmanus@mozilla.com>
Date: Sat, 03 Mar 2018 10:23:49 -0500
Subject: Re: Fun and surprises with IPv6 fragmentation
To: Ryan Hamilton <rch=40google.com@dmarc.ietf.org>
Cc: Christian Huitema <huitema@huitema.net>, "quic@ietf.org" <quic@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/t-i0fyq0u6GEtOa-obmqsD8EAE0>

On Sat, Mar 3, 2018 at 12:09 AM, Ryan Hamilton <rch=40google.com@dmarc.ietf.org> wrote:

> I'm sorry if this is a dumb question, but I understood that in IPv6
> routers could not fragment IPv6 packets, only endpoints.
>
>
I know! Fun. Honestly, the Internet is just so interesting; you wonder how
it works at all.

The pcap is in the #picoquic Slack channel (as always, anyone reading this
can ask anyone on the Slack, like me, for an invite). Search for
octopus.pcap.


> Unlike in IPv4, IPv6 routers never fragment IPv6 packets. Packets
> exceeding the size of the maximum transmission unit of the destination link
> are dropped and this condition is signaled by a Packet Too Big ICMPv6 type
> 2 message to the originating node, similarly to the IPv4 method when the
> Don't Fragment bit is set.[1]
>
> End nodes in IPv6 are expected to perform path MTU discovery to determine
> the maximum size of packets to send, and the upper-layer protocol is
> expected to limit the payload size. However, if the upper-layer protocol is
> unable to do so, the sending host may use the Fragment extension header in
> order to perform end-to-end fragmentation of IPv6 packets.
>
> https://en.wikipedia.org/wiki/IPv6_packet#Fragmentation
>
>
> How sure are you that it's a router and not the sending host that's doing
> the fragmentation?
>
>
It seems unlikely to be in the core. The receiving host does have an MTU of
1500, but it's hard to imagine a receive stack fragmenting (and then
reassembling and reordering!) things.

One of the interesting tidbits here is that it isn't just the small
fragment that moves ahead in the queue; it's both fragments of the big
packet.
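
As a rough sanity check on the fragment sizes Christian reports in the quoted message (1518-byte probe, fragments of 1496 and 78 bytes), here is a small sketch of the IPv6 fragmentation arithmetic. It assumes the 1518 bytes include the 40-byte IPv6 header, that there are no other extension headers, and that the fragmenting hop has an MTU of 1500; the function name is made up for illustration. Under those assumptions the numbers line up, and a 1440-byte UDP payload fits without fragmentation.

```python
IPV6_HEADER = 40      # fixed IPv6 header, repeated in every fragment
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header
UDP_HEADER = 8


def ipv6_fragment_sizes(packet_len, mtu):
    """Return the on-wire size of each fragment of an IPv6 packet.

    Hypothetical helper: assumes no extension headers other than the
    Fragment header that fragmentation itself adds.
    """
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no Fragment header needed
    payload = packet_len - IPV6_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    chunk = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        part = min(chunk, payload)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + part)
        payload -= part
    return sizes


# The MTU probe from the quoted message, on an assumed 1500-byte link.
probe = ipv6_fragment_sizes(1518, 1500)
print(probe)  # [1496, 78]

# A 1440-byte UDP payload (the cap Christian settled on) needs no fragmentation.
capped = ipv6_fragment_sizes(IPV6_HEADER + UDP_HEADER + 1440, 1500)
print(capped)  # [1488]
```

The 8-byte alignment is what turns a 1500-byte limit into a 1496-byte first fragment rather than a full 1500.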



> Cheers,
>
> Ryan
>
> On Fri, Mar 2, 2018 at 9:02 PM, Christian Huitema <huitema@huitema.net>
> wrote:
>
>> Yesterday, I was mentioning bugs of the interop. This morning, I woke up
>> to find an interesting message from Patrick McManus. Something is weird, he
>> said. The first data message that your server sends, with sequence number
>> N, always arrives before the final handshake message, with sequence number
>> N-1. That inversion appears to happen systematically.
>>
>> It took us the best part of a day to explore blind alleys and finally
>> understand what was happening. The exchange was over IPv6. Upon receiving a
>> connection request from Patrick’s implementation, Picoquic was sending back
>> a handshake packet. Immediately after that, Picoquic was sending its first
>> data packet, which happens to be an MTU probe. And it turns out that the
>> probe was 1518 bytes, a bit longer than what the AWS routers could accept.
>> So some router inserted an IPv6 fragmentation header and split the packet
>> in two: a large initial fragment, 1496 bytes long, and a small second
>> fragment, 78 bytes long. You could think that this is no big deal, since
>> fragments would just be reassembled at the destination, but you would be
>> wrong.
>>
>> Some routers on the path try to be helpful. They have learned from past
>> experience that short packets often carry important data, and so they try
>> to route them faster than long data packets. And here is what happens in
>> our case:
>>
>> * The server prepares and sends a Handshake packet, 590 bytes long.
>>
>> * The server then prepares the MTU probe, 1518 bytes long.
>>
>> * The MTU probe is split into fragment 1, 1496 bytes, and fragment 2,
>> 78 bytes.
>>
>> * The handshake and the long fragment are routed on the normal path,
>> but the small fragment is routed at a higher priority level.
>>
>> * The Linux driver at the destination receives the small fragment
>> first. It queues everything behind that until it receives the long
>> fragment.
>>
>> * The Linux driver passes the reassembled packet to the application,
>> which cannot do anything with it because the encryption keys can only
>> be obtained from the handshake packet.
>>
>> * The Linux driver then passes the handshake packet to the application.
>>
>> Which confirms an old opinion. When routers try to be smart and helpful,
>> they end up being dumb and harmful. Please just send the packets in the
>> order you get them!
>>
>> I tried to work around the issue by setting the "don't fragment" bit on
>> the socket, but somehow that doesn't work. So I simply programmed the
>> server to not use payloads larger than 1440 bytes. Still, I can see that
>> pattern happening in other circumstances, such as a long Connection Initial
>> message followed by a short 0-RTT packet. Isn't networking fun?
>>
>> -- Christian Huitema
>>
>>
>
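
For anyone who wants to play with the sequence Christian lists above, here is a toy model of it. This is only a sketch of the behaviour as described in this thread, not of any real router or of the Linux stack: the `route` priority rule, the 100-byte threshold, and the hold-until-reassembled receiver are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    name: str        # QUIC packet this datagram (or fragment) belongs to
    size: int        # bytes on the wire
    frag: int = 0    # 0 = unfragmented, otherwise 1-based fragment index
    nfrags: int = 1  # total fragments of the original datagram


def route(packets, small_threshold=100):
    """Toy 'helpful' router: small packets jump the queue.

    sorted() is stable, so everything else keeps its sending order.
    """
    return sorted(packets, key=lambda p: p.size >= small_threshold)


def receive(arrivals):
    """Toy receiver: once a fragment shows up, hold later packets until
    reassembly completes, deliver the reassembled datagram, then the rest.
    """
    delivered, held, pending = [], [], {}
    for p in arrivals:
        if p.nfrags > 1:
            pending.setdefault(p.name, set()).add(p.frag)
            if len(pending[p.name]) == p.nfrags:
                del pending[p.name]
                delivered.append(p.name)
                if not pending:
                    delivered.extend(held)
                    held.clear()
        elif pending:
            held.append(p.name)
        else:
            delivered.append(p.name)
    return delivered


# What the server sends, after the probe has been split in two.
sent = [
    Packet("handshake", 590),
    Packet("mtu_probe", 1496, frag=1, nfrags=2),
    Packet("mtu_probe", 78, frag=2, nfrags=2),
]

arrivals = route(sent)
order = receive(arrivals)
print([p.size for p in arrivals])  # [78, 590, 1496]
print(order)                       # ['mtu_probe', 'handshake']
```

Feeding `sent` straight into `receive` without the priority step delivers the handshake first, which is the ordering the server intended.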