Fun and surprises with IPv6 fragmentation

Christian Huitema <> Sat, 03 March 2018 05:02 UTC

From: Christian Huitema <>
Date: Fri, 02 Mar 2018 21:02:03 -0800
Subject: Fun and surprises with IPv6 fragmentation
List-Id: Main mailing list of the IETF QUIC working group <>

Yesterday, I mentioned bugs found during the interop. This morning, I woke up
to find an interesting message from Patrick McManus. Something is weird,
he said. The first data message that your server sends, with sequence
number N, always arrives before the final handshake message, with
sequence number N-1. That inversion appears to happen systematically.

It took us the best part of a day to explore blind alleys and finally
understand what was happening. The exchange was over IPv6. Upon
receiving a connection request from Patrick’s implementation, Picoquic
was sending back a handshake packet. Immediately after that, Picoquic
was sending its first data packet, which happens to be an MTU probe. And
it turns out that the probe was 1518 bytes, a bit longer than what the
AWS routers could accept. So some router inserted an IPv6 fragmentation
header and split the packet in two: a large initial fragment, 1496 bytes
long, and a small second fragment, 78 bytes long. You could think that
this is no big deal, since fragments would just be reassembled at the
destination, but you would be wrong.
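To see where the two fragment sizes come from, here is a minimal sketch of the arithmetic. It rests on assumptions that are not stated in the message: that 1518 bytes is the total IPv6 packet length, that the link MTU is 1500 bytes, and that the packet carries no other extension headers. The function name `ipv6_fragment_sizes` is made up for illustration.

```python
IPV6_HEADER = 40       # fixed IPv6 header, repeated in every fragment
FRAGMENT_HEADER = 8    # IPv6 Fragment extension header

def ipv6_fragment_sizes(packet_len, mtu):
    """Return the on-wire size of each fragment of an IPv6 packet.

    Assumes no extension headers other than the Fragment header.
    """
    if packet_len <= mtu:
        return [packet_len]  # fits as is, no fragmentation needed
    # Everything after the fixed IPv6 header has to be split up.
    remaining = packet_len - IPV6_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + chunk)
        remaining -= chunk
    return sizes

print(ipv6_fragment_sizes(1518, 1500))  # [1496, 78]
```

Under those assumptions the 1478 bytes after the IPv6 header are cut into 1448 and 30 bytes, and each piece gets 48 bytes of headers, which matches the 1496 and 78 byte fragments observed.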

Some routers on the path try to be helpful. They have learned from past
experience that short packets often carry important data, and so they
try to route them faster than long data packets. And here is what
happens in our case:

* The server prepares and sends a Handshake packet, 590 bytes long.

* The server then prepares the MTU probe, 1518 bytes long.

* The MTU probe is split into fragment 1, 1496 bytes, and
  fragment 2, 78 bytes.

* The handshake and the long fragment are routed on the normal
  path, but the small fragment is routed at a higher priority level.

* The Linux driver at the destination receives the small
  fragment first. It queues everything behind that until it receives the
  long fragment.

* The Linux driver passes the reassembled packet to the
  application, which cannot do anything with it because the encryption
  keys can only be obtained from the handshake packet.

* The Linux driver then passes the handshake packet to the
  application.
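The sequence above can be replayed with a toy model. This is only a sketch of the behavior described in the list, not actual router or Linux code: the 128-byte priority threshold, the `Packet` type, and the functions `network_order` and `deliver` are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    name: str
    size: int
    fragment_of: Optional[str] = None  # name of the datagram this fragment belongs to

def network_order(sent, small_threshold=128):
    """Toy 'helpful' router: packets below the threshold jump the queue."""
    fast = [p for p in sent if p.size < small_threshold]
    slow = [p for p in sent if p.size >= small_threshold]
    return fast + slow

def deliver(arrivals, fragments_needed):
    """Toy receiver: while a reassembly is pending, later packets are held."""
    delivered, held, pending = [], [], {}
    for p in arrivals:
        if p.fragment_of is not None:
            pending[p.fragment_of] = pending.get(p.fragment_of, 0) + 1
            if pending[p.fragment_of] == fragments_needed[p.fragment_of]:
                delivered.append(p.fragment_of)  # reassembled datagram goes up first
                del pending[p.fragment_of]
                if not pending:
                    delivered.extend(held)       # then whatever was queued behind it
                    held = []
        elif pending:
            held.append(p.name)
        else:
            delivered.append(p.name)
    return delivered

sent = [
    Packet("handshake", 590),
    Packet("fragment 1", 1496, fragment_of="mtu_probe"),
    Packet("fragment 2", 78, fragment_of="mtu_probe"),
]
print(deliver(network_order(sent), {"mtu_probe": 2}))  # ['mtu_probe', 'handshake']
print(deliver(sent, {"mtu_probe": 2}))                 # ['handshake', 'mtu_probe']
```

With the prioritization the probe reaches the application before the handshake; with plain in-order forwarding the inversion disappears.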

Which confirms an old opinion. When routers try to be smart and helpful,
they end up being dumb and harmful. Please just send the packets in the
order you get them!

I tried to work around the issue by setting the "don't fragment" bit on
the socket, but somehow that doesn't work. So I simply programmed the
server to not use payloads larger than 1440 bytes. Still, I can see that
pattern happening in other circumstances, such as a long Connection
Initial message followed by a short 0-RTT packet. Isn't networking fun?
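A rough check of why a 1440-byte cap avoids the problem, again as a sketch under assumptions: the path MTU is taken to be 1500 bytes, the payload is carried directly over UDP and IPv6 with no extension headers, and the helper names `max_udp_payload` and `fits_without_fragmentation` are invented for this example.

```python
IPV6_HEADER = 40  # fixed IPv6 header
UDP_HEADER = 8    # UDP header

def max_udp_payload(path_mtu):
    """Largest UDP payload that fits in one unfragmented IPv6 packet."""
    return path_mtu - IPV6_HEADER - UDP_HEADER

def fits_without_fragmentation(payload_len, path_mtu):
    return payload_len <= max_udp_payload(path_mtu)

ASSUMED_PATH_MTU = 1500  # assumption, not a value given in the message

print(max_udp_payload(ASSUMED_PATH_MTU))                   # 1452
print(fits_without_fragmentation(1440, ASSUMED_PATH_MTU))  # True
# If 1518 bytes was the full IPv6 packet, its UDP payload was 1518 - 48 = 1470.
print(fits_without_fragmentation(1470, ASSUMED_PATH_MTU))  # False
```

Under these assumptions a 1440-byte payload gives a 1488-byte packet, which leaves 12 bytes of headroom below the assumed MTU.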

-- Christian Huitema