Re: Fun and surprises with IPv6 fragmentation

Patrick McManus <> Sat, 03 March 2018 15:23 UTC

From: Patrick McManus <>
Date: Sat, 03 Mar 2018 10:23:49 -0500
Subject: Re: Fun and surprises with IPv6 fragmentation
To: Ryan Hamilton <>
Cc: Christian Huitema <>, "" <>
List-Id: Main mailing list of the IETF QUIC working group <>

On Sat, Mar 3, 2018 at 12:09 AM, Ryan Hamilton <> wrote:

> I'm sorry if this is a dumb question, but I understood that in IPv6
> routers could not fragment IPv6 packets, only endpoints.
I know! fun. Honestly the Internet is just so interesting - you wonder how
it works at all.

The pcap is in the #picoquic Slack channel (as always, anyone reading this
can just ask anyone on the Slack, like me, for an invite). search for

> Unlike in IPv4, IPv6 routers never fragment IPv6 packets. Packets
> exceeding the size of the maximum transmission unit of the destination link
> are dropped and this condition is signaled by a Packet too Big ICMPv6 type
> 2 message to the originating node, similarly to the IPv4 method when the
> Don't Fragment bit is set.[1]
> End nodes in IPv6 are expected to perform path MTU discovery to determine
> the maximum size of packets to send, and the upper-layer protocol is
> expected to limit the payload size. However, if the upper-layer protocol is
> unable to do so, the sending host may use the Fragment extension header in
> order to perform end-to-end fragmentation of IPv6 packets.
>
> How sure are you that it's a router and not the sending host that's doing
> the fragmentation?
It seems unlikely to be in the core. The recv host does have an MTU of 1500,
but it's hard to imagine a recv stack fragmenting (and then reassembling and
reordering!) things.

One of the interesting tidbits here is that it isn't just the small
fragment that moves ahead in the queue - it's both fragments of the big

> Cheers,
> Ryan
> On Fri, Mar 2, 2018 at 9:02 PM, Christian Huitema <>
> wrote:
>> Yesterday, I was mentioning bugs of the interop. This morning, I woke up
>> to find an interesting message from Patrick McManus. Something is weird, he
>> said. The first data message that your server sends, with sequence number
>> N, always arrives before the final handshake message, with sequence number
>> N-1. That inversion appears to happen systematically.
>> It took us the best part of a day to explore blind alleys and finally
>> understand what was happening. The exchange was over IPv6. Upon receiving a
>> connection request from Patrick’s implementation, Picoquic was sending back
>> a handshake packet. Immediately after that, Picoquic was sending its first
>> data packet, which happens to be an MTU probe. And it turns out that the
>> probe was 1518 bytes, a bit longer than what the AWS routers could accept.
>> So some router inserted an IPv6 fragmentation header and split the packet
>> in two: a large initial fragment, 1496 bytes long, and a small second
>> fragment, 78 bytes long. You could think that this is no big deal, since
>> fragments would just be reassembled at the destination, but you would be
>> wrong.
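
The sizes quoted above are what plain IPv6 fragmentation arithmetic gives for a 1500-byte MTU. A minimal sketch of that arithmetic follows. The function name `fragment_sizes` is made up for illustration, and it assumes a 1500-byte link MTU, no extension headers other than the Fragment header, and that "1518 bytes" is the full IPv6 packet (40-byte header included). The last line also assumes the 1440-byte cap mentioned later in the thread is the UDP payload size.

```python
IPV6_HEADER = 40      # fixed IPv6 header, in bytes
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header, in bytes
UDP_HEADER = 8        # UDP header, in bytes


def fragment_sizes(packet_len, mtu):
    """Return the on-the-wire size of each fragment of one IPv6 packet."""
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no Fragment header needed
    payload = packet_len - IPV6_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    chunk = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while payload > 0:
        part = min(chunk, payload)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + part)
        payload -= part
    return sizes


print(fragment_sizes(1518, 1500))  # [1496, 78]
# Assumption: 1440 is the UDP payload, so the packet is 1440 + 8 + 40 bytes.
print(fragment_sizes(1440 + UDP_HEADER + IPV6_HEADER, 1500))  # [1488]
```

Under these assumptions the 1518-byte probe splits into exactly the 1496-byte and 78-byte fragments reported, and a 1440-byte payload stays under the MTU, so it is not fragmented at all.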
>> Some routers on the path try to be helpful. They have learned from past
>> experience that short packets often carry important data, and so they try
>> to route them faster than long data packets. And here is what happens in
>> our case:
>> * The server prepares and sends a Handshake packet, 590 bytes long.
>> * The server then prepares the MTU probe, 1518 bytes long.
>> * The MTU probe is split into fragment 1, 1496 bytes, and
>>   fragment 2, 78 bytes.
>> * The handshake and the long fragment are routed on the normal
>>   path, but the small fragment is routed at a higher priority level.
>> * The Linux driver at the destination receives the small
>>   fragment first. It queues everything behind that until it receives
>>   the long fragment.
>> * The Linux driver passes the reassembled packet to the
>>   application, which cannot do anything with it because the encryption
>>   keys can only be obtained from the handshake packet.
>> * The Linux driver then passes the handshake packet to the
>>   application.
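
The sequence above can be replayed with a toy model. This is only a sketch of the behaviour as described in this thread, not of any real router or of the Linux receive path. The 100-byte priority threshold and the rule that a datagram takes its queue slot when its first fragment arrives are assumptions chosen to match the description, and the names `arrival_order` and `delivery_order` are made up for illustration.

```python
# Packets in the order the server sends them (sizes from the thread).
SENT = [
    {"datagram": "handshake", "size": 590},
    {"datagram": "mtu_probe", "size": 1496},  # fragment 1
    {"datagram": "mtu_probe", "size": 78},    # fragment 2
]


def arrival_order(sent, small_threshold=100):
    """Hypothetical 'helpful' path: small packets jump ahead of large ones."""
    # sorted() is stable, so packets within each class keep their send order.
    return sorted(sent, key=lambda p: p["size"] > small_threshold)


def delivery_order(arrivals):
    """Assumed receiver: a datagram takes its slot when its first piece arrives."""
    order = []
    for packet in arrivals:
        if packet["datagram"] not in order:
            order.append(packet["datagram"])
    return order


arrivals = arrival_order(SENT)
print([p["size"] for p in arrivals])  # [78, 590, 1496]
print(delivery_order(arrivals))       # ['mtu_probe', 'handshake']
```

With those assumptions the application sees the MTU probe before the handshake, which is the inversion of N and N-1 reported at the start of the thread.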
>> Which confirms an old opinion. When routers try to be smart and helpful,
>> they end up being dumb and harmful. Please just send the packets in the
>> order you get them!
>> I tried to work around the issue by setting the "don't fragment" bit on
>> the socket, but somehow that doesn't work. So I simply programmed the
>> server to not use payloads larger than 1440 bytes. Still, I can see that
>> pattern happening in other circumstances, such as a long Connection Initial
>> message followed by a short 0-RTT packet. Isn't networking fun?
>> -- Christian Huitema
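
For reference, one way to ask for "don't fragment" on an IPv6 UDP socket is the IPV6_DONTFRAG option from RFC 3542. The sketch below is only a guess at what was attempted, since the thread does not say which option Picoquic set. The helper name `make_dontfrag_socket` is made up for illustration, and the fallback value 62 is the Linux definition, so it is an assumption that does not carry over to other systems.

```python
import socket

# Assumption: Linux. 62 is IPV6_DONTFRAG in <linux/in6.h>; it is used only
# when the Python socket module does not export the constant itself.
IPV6_DONTFRAG = getattr(socket, "IPV6_DONTFRAG", 62)


def make_dontfrag_socket():
    """Return a UDP/IPv6 socket with IPV6_DONTFRAG set, or None if unsupported."""
    try:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
    except OSError:
        return None  # no IPv6 support on this host
    try:
        sock.setsockopt(socket.IPPROTO_IPV6, IPV6_DONTFRAG, 1)
    except OSError:
        sock.close()
        return None  # option rejected by this stack
    return sock


sock = make_dontfrag_socket()
print("IPV6_DONTFRAG set" if sock else "IPV6_DONTFRAG not available")
if sock:
    sock.close()
```

IPv6 has no on-the-wire equivalent of the IPv4 DF bit, so this option only controls what the sending stack does. It would not stop fragmentation performed anywhere else, which may be why setting it made no difference here.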