Re: Fun and surprises with IPv6 fragmentation

"Eggert, Lars" <lars@netapp.com> Sat, 03 March 2018 21:08 UTC

From: "Eggert, Lars" <lars@netapp.com>
To: Patrick McManus <pmcmanus@mozilla.com>
CC: Ryan Hamilton <rch=40google.com@dmarc.ietf.org>, "quic@ietf.org" <quic@ietf.org>, Christian Huitema <huitema@huitema.net>
Subject: Re: Fun and surprises with IPv6 fragmentation
Date: Sat, 03 Mar 2018 21:08:21 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/quic/B7elZiD3blI656jpEmRYnXVNMWM>

Actually, a few years back Linux had issues with reordering in the receive stack. One would hope those kernels are out of service by now.

--
Sent from a mobile device; please excuse typos.
+49 151 120 55791

On Mar 3, 2018, at 16:24, Patrick McManus <pmcmanus@mozilla.com> wrote:



On Sat, Mar 3, 2018 at 12:09 AM, Ryan Hamilton <rch=40google.com@dmarc.ietf.org> wrote:
I'm sorry if this is a dumb question, but I understood that in IPv6 routers could not fragment IPv6 packets, only endpoints.


I know! fun. Honestly the Internet is just so interesting - you wonder how it works at all.

The pcap is in the #picoquic Slack channel (as always, anyone reading this can ask anyone on the Slack, like me, for an invite). Search for octopus.pcap.

Unlike in IPv4, IPv6 routers never fragment IPv6 packets. Packets exceeding the size of the maximum transmission unit of the destination link are dropped and this condition is signaled by a Packet Too Big (ICMPv6 type 2) message to the originating node, similarly to the IPv4 method when the Don't Fragment bit is set.

End nodes in IPv6 are expected to perform path MTU discovery to determine the maximum size of packets to send, and the upper-layer protocol is expected to limit the payload size. However, if the upper-layer protocol is unable to do so, the sending host may use the Fragment extension header in order to perform end-to-end fragmentation of IPv6 packets.

https://en.wikipedia.org/wiki/IPv6_packet#Fragmentation

How sure are you that it's a router and not the sending host that's doing the fragmentation?


It seems unlikely to be in the core. The recv host does have an MTU of 1500, but it's hard to imagine a recv stack fragmenting (and then reassembling and reordering!) things.

One of the interesting tidbits here is that it isn't just the small fragment that moves ahead in the queue: it's both fragments of the big packet.


Cheers,

Ryan

On Fri, Mar 2, 2018 at 9:02 PM, Christian Huitema <huitema@huitema.net> wrote:
Yesterday, I was mentioning bugs of the interop. This morning, I woke up to find an interesting message from Patrick McManus. Something is weird, he said. The first data message that your server sends, with sequence number N, always arrives before the final handshake message, with sequence number N-1. That inversion appears to happen systematically.
It took us the best part of a day to explore blind alleys and finally understand what was happening. The exchange was over IPv6. Upon receiving a connection request from Patrick’s implementation, Picoquic was sending back a handshake packet. Immediately after that, Picoquic was sending its first data packet, which happens to be an MTU probe. And it turns out that the probe was 1518 bytes, a bit longer than what the AWS routers could accept. So some router inserted an IPv6 fragmentation header and split the packet in two: a large initial fragment, 1496 bytes long, and a small second fragment, 78 bytes long. You could think that this is no big deal, since fragments would just be reassembled at the destination, but you would be wrong.
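The fragment sizes quoted above can be reproduced with a little arithmetic. The sketch below is only an illustration under stated assumptions: that the 1518 bytes include the 40-byte IPv6 header, that the fragmenting hop has a 1500-byte MTU, and that no other extension headers are present. The function name `fragment_sizes` is made up for this example.

```python
IPV6_HEADER = 40      # fixed IPv6 header, in bytes
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header, in bytes


def fragment_sizes(packet_len, mtu):
    """Return the on-wire size of each fragment of one IPv6 packet.

    Assumes packet_len includes the 40-byte IPv6 header and that the
    packet carries no other extension headers.
    """
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 bytes.
    chunk = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any fragment data")
    remaining = packet_len - IPV6_HEADER  # the fragmentable part
    sizes = []
    while remaining > 0:
        data = min(chunk, remaining)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + data)
        remaining -= data
    return sizes


print(fragment_sizes(1518, 1500))  # [1496, 78]
```

Under those assumptions the first fragment carries 1448 bytes of data (the largest multiple of 8 that fits in 1500 - 48) and the second carries the remaining 30, which matches the 1496 and 78 bytes reported.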
Some routers on the path try to be helpful. They have learned from past experience that short packets often carry important data, and so they try to route them faster than long data packets. And here is what happens in our case:

* The server prepares and sends a Handshake packet, 590 bytes long.
* The server then prepares the MTU probe, 1518 bytes long.
* The MTU probe is split into fragment 1, 1496 bytes, and fragment 2, 78 bytes.
* The handshake and the long fragment are routed on the normal path, but the small fragment is routed at a higher priority level.
* The Linux driver at the destination receives the small fragment first. It queues everything behind that until it receives the long fragment.
* The Linux driver passes the reassembled packet to the application, which cannot do anything with it because the encryption keys can only be obtained from the handshake packet.
* The Linux driver then passes the handshake packet to the application.
Which confirms an old opinion. When routers try to be smart and helpful, they end up being dumb and harmful. Please just send the packets in the order you get them!
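The sequence in the list above can be sketched as a toy simulation. It models only the hypothesis described in this message, not measured router or kernel behavior: the 100-byte priority threshold, the "hold everything behind an incomplete fragment" rule, and all names (`router`, `receiver`, `wire`) are assumptions chosen for illustration.

```python
from collections import namedtuple

# frag_of is None for a whole packet, else the name of the packet
# this fragment belongs to.
Packet = namedtuple("Packet", "name size frag_of")


def router(packets, small=100):
    """Hypothetical policy: packets under `small` bytes jump the queue."""
    fast = [p for p in packets if p.size < small]
    slow = [p for p in packets if p.size >= small]
    return fast + slow


def receiver(arrivals, frags_needed):
    """Deliver packets, holding back anything that arrives while a
    fragmented packet is still waiting for its remaining fragments."""
    delivered, held, seen = [], [], {}
    pending = None
    for p in arrivals:
        if p.frag_of is not None:
            seen[p.frag_of] = seen.get(p.frag_of, 0) + 1
            if seen[p.frag_of] == frags_needed[p.frag_of]:
                delivered.append(p.frag_of)  # reassembly complete
                delivered.extend(held)       # then release the backlog
                held.clear()
                pending = None
            else:
                pending = p.frag_of
        elif pending is not None:
            held.append(p.name)
        else:
            delivered.append(p.name)
    return delivered


# What leaves the fragmenting hop, in sending order.
wire = [
    Packet("handshake", 590, None),
    Packet("probe-frag1", 1496, "mtu-probe"),
    Packet("probe-frag2", 78, "mtu-probe"),
]

print(receiver(router(wire), {"mtu-probe": 2}))
# ['mtu-probe', 'handshake']
```

With these assumptions the application sees the MTU probe before the handshake, which is the inversion reported at the start of this message.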
I tried to work around the issue by setting the "don't fragment" bit on the socket, but somehow that doesn't work. So I simply programmed the server to not use payloads larger than 1440 bytes. Still, I can see that pattern happening in other circumstances, such as a long Connection Initial message followed by a short 0-RTT packet. Isn't networking fun?
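A quick budget check shows why a 1440-byte cap avoids the problem. This is a sketch under assumptions not stated in the message: that "payload" means the UDP payload, that the path MTU is 1500 bytes, and that packets carry no IPv6 extension headers. The helper name is invented for the example.

```python
IPV6_HEADER = 40  # fixed IPv6 header, in bytes
UDP_HEADER = 8    # UDP header, in bytes


def fits_without_fragmentation(udp_payload, mtu=1500):
    """True if an IPv6/UDP packet with this payload fits in one MTU."""
    return IPV6_HEADER + UDP_HEADER + udp_payload <= mtu


print(fits_without_fragmentation(1440))  # True: 1488-byte packet
print(fits_without_fragmentation(1470))  # False: 1518-byte packet
```

Under the same assumptions, the largest payload that would still fit is 1500 - 48 = 1452 bytes, so 1440 leaves a small margin.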
-- Christian Huitema