RE: Fun and surprises with IPv6 fragmentation

Praveen Balasubramanian <pravb@microsoft.com> Sat, 03 March 2018 18:51 UTC


This is why TCP goes out of its way to avoid fragmentation. Fragments are known to take a different (and usually slower) path through the network. For UDP, if ECMP hashes on the 4-tuple, only the first fragment carries the UDP ports, so it takes one path while subsequent fragments, which can only be hashed on the 2-tuple, may take another. The right way to deal with this is to detect the MF bit and hash on the 2-tuple instead of the 4-tuple, but I doubt that many routers do that. It is surprising that a router fragmented IPv6.
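A minimal sketch of the hashing problem, under stated assumptions: the function names, the CRC32 stand-in for a router's flow hash, the path count, and the example addresses are all hypothetical and not taken from any real router. It treats a packet as a fragment when the more-fragments flag is set or the fragment offset is non-zero, since the last fragment has the flag cleared.

```python
import zlib


def ecmp_path(src, dst, sport=None, dport=None, n_paths=4):
    """Pick a path index from a flow hash (CRC32 is only a stand-in).

    Ports are None when the packet does not carry the UDP header,
    which is the case for every fragment after the first.
    """
    if sport is None or dport is None:
        key = f"{src}|{dst}"                    # 2-tuple hash
    else:
        key = f"{src}|{dst}|{sport}|{dport}"    # 4-tuple hash
    return zlib.crc32(key.encode()) % n_paths


def ecmp_path_fragment_aware(src, dst, sport, dport,
                             more_fragments, frag_offset, n_paths=4):
    """Fall back to the 2-tuple for every fragment, including the first."""
    if more_fragments or frag_offset > 0:
        sport = dport = None
    return ecmp_path(src, dst, sport, dport, n_paths)


# Hypothetical flow using documentation addresses.
SRC, DST, SPORT, DPORT = "2001:db8::1", "2001:db8::2", 4433, 50000

# Naive hashing: the first fragment sees the ports, the second does not,
# so the two indexes are computed from different keys and may differ.
naive_first = ecmp_path(SRC, DST, SPORT, DPORT)
naive_second = ecmp_path(SRC, DST)

# Fragment-aware hashing: both fragments use the same 2-tuple key.
aware_first = ecmp_path_fragment_aware(SRC, DST, SPORT, DPORT, True, 0)
aware_second = ecmp_path_fragment_aware(SRC, DST, None, None, False, 181)
print(aware_first == aware_second)
```

With the fragment-aware variant, unfragmented packets of the flow still hash on the 4-tuple, so fragments can still be separated from the rest of the flow; it only keeps the fragments of one packet together.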

From: QUIC [mailto:quic-bounces@ietf.org] On Behalf Of Patrick McManus
Sent: Saturday, March 3, 2018 7:24 AM
To: Ryan Hamilton <rch=40google.com@dmarc.ietf.org>
Cc: quic@ietf.org; huitema <huitema@huitema.net>
Subject: Re: Fun and surprises with IPv6 fragmentation



On Sat, Mar 3, 2018 at 12:09 AM, Ryan Hamilton <rch=40google.com@dmarc.ietf.org> wrote:
I'm sorry if this is a dumb question, but I understood that in IPv6 routers could not fragment IPv6 packets, only endpoints.


I know! fun. Honestly the Internet is just so interesting - you wonder how it works at all.

The pcap is in the #picoquic slack channel (as always, anyone reading this can just ask anyone on the slack, like me, for an invite). Search for octopus.pcap.

Unlike in IPv4, IPv6 routers never fragment IPv6 packets. Packets exceeding the size of the maximum transmission unit of the destination link are dropped and this condition is signaled by a Packet Too Big ICMPv6 type 2 message to the originating node, similarly to the IPv4 method when the Don't Fragment bit is set.

End nodes in IPv6 are expected to perform path MTU discovery to determine the maximum size of packets to send, and the upper-layer protocol is expected to limit the payload size. However, if the upper-layer protocol is unable to do so, the sending host may use the Fragment extension header in order to perform end-to-end fragmentation of IPv6 packets.

https://en.wikipedia.org/wiki/IPv6_packet#Fragmentation

How sure are you that it's a router and not the sending host that's doing the fragmentation?


It seems unlikely to be in the core. The recv host does have an MTU of 1500, but it's hard to imagine a recv stack fragmenting (and then reassembling and reordering!) things.

One of the interesting tidbits here is that it isn't just the small fragment that moves ahead in the queue - it's both fragments of the big packet.


Cheers,

Ryan

On Fri, Mar 2, 2018 at 9:02 PM, Christian Huitema <huitema@huitema.net> wrote:
Yesterday, I was mentioning bugs of the interop. This morning, I woke up to find an interesting message from Patrick McManus. Something is weird, he said. The first data message that your server sends, with sequence number N, always arrives before the final handshake message, with sequence number N-1. That inversion appears to happen systematically.
It took us the best part of a day to explore blind alleys and finally understand what was happening. The exchange was over IPv6. Upon receiving a connection request from Patrick’s implementation, Picoquic was sending back a handshake packet. Immediately after that, Picoquic was sending its first data packet, which happens to be an MTU probe. And it turns out that the probe was 1518 bytes, a bit longer than what the AWS routers could accept. So some router inserted an IPv6 fragmentation header and split the packet in two: a large initial fragment, 1496 bytes long, and a small second fragment, 78 bytes long. You could think that this is no big deal, since fragments would just be reassembled at the destination, but you would be wrong.
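The two fragment sizes are consistent with ordinary IPv6 fragmentation arithmetic. The sketch below assumes that 1518 bytes is the full IPv6 packet length (including the 40-byte IPv6 header), that the link MTU is 1500 bytes, and that there are no other extension headers; those assumptions and the function name are mine, not stated in the message.

```python
IPV6_HEADER = 40      # fixed IPv6 header, repeated in every fragment
FRAGMENT_HEADER = 8   # Fragment extension header added to every fragment


def fragment_sizes(packet_len, mtu):
    """Return the on-wire size of each fragment of an IPv6 packet.

    Assumes the packet has no extension headers before fragmentation.
    """
    if packet_len <= mtu:
        return [packet_len]  # fits, no fragmentation needed
    # Every fragment except the last must carry a multiple of 8 bytes.
    chunk = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any fragment payload")
    payload = packet_len - IPV6_HEADER
    sizes = []
    while payload > 0:
        take = min(chunk, payload)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + take)
        payload -= take
    return sizes


print(fragment_sizes(1518, 1500))  # [1496, 78]
```

The 1478 bytes after the IPv6 header split into 1448 bytes (the largest multiple of 8 that fits in 1500 - 48) and the remaining 30 bytes, each prefixed with 48 bytes of headers.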
Some routers on the path try to be helpful. They have learned from past experience that short packets often carry important data, and so they try to route them faster than long data packets. And here is what happens in our case:

* The server prepares and sends a Handshake packet, 590 bytes long.

* The server then prepares the MTU probe, 1518 bytes long.

* The MTU probe is split into fragment 1, 1496 bytes, and fragment 2, 78 bytes.

* The handshake and the long fragment are routed on the normal path, but the small fragment is routed at a higher priority level.

* The Linux driver at the destination receives the small fragment first. It queues everything behind that until it receives the long fragment.

* The Linux driver passes the reassembled packet to the application, which cannot do anything with it because the encryption keys can only be obtained from the handshake packet.

* The Linux driver then passes the handshake packet to the application.
Which confirms an old opinion. When routers try to be smart and helpful, they end up being dumb and harmful. Please just send the packets in the order you get them!
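The sequence in the list above can be reproduced with a toy model. This is only a sketch of the behavior as described in this message, not of the actual Linux receive path: it assumes a receiver that, once it has seen any fragment of a packet, holds later arrivals until that packet is fully reassembled. The function and packet names are hypothetical.

```python
def deliver(arrivals):
    """Return the order in which packets reach the application.

    arrivals is a list of (packet_id, frag_index, frag_count) tuples in
    the order they come off the wire; frag_count == 1 means unfragmented.
    """
    pending = {}    # packet_id -> set of fragment indexes seen so far
    held = []       # whole packets queued behind an incomplete reassembly
    delivered = []
    for packet_id, frag_index, frag_count in arrivals:
        if frag_count == 1:
            if pending:
                held.append(packet_id)       # wait behind the reassembly
            else:
                delivered.append(packet_id)
            continue
        pending.setdefault(packet_id, set()).add(frag_index)
        if len(pending[packet_id]) == frag_count:
            del pending[packet_id]
            delivered.append(packet_id)      # reassembled packet goes first
            if not pending:
                delivered.extend(held)       # then everything that waited
                held.clear()
    return delivered


# Sent as: handshake, probe fragment 1, probe fragment 2.
# Received as: small probe fragment first, then handshake, then long fragment.
on_wire = [("probe", 1, 2), ("handshake", 0, 1), ("probe", 0, 2)]
print(deliver(on_wire))  # ['probe', 'handshake']
```

If the same three items arrive in the order they were sent, the model delivers the handshake first, so the inversion comes entirely from the small fragment overtaking the handshake.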
I tried to work around the issue by setting the "don't fragment" bit on the socket, but somehow that doesn't work. So I simply programmed the server to not use payloads larger than 1440 bytes. Still, I can see that pattern happening in other circumstances, such as a long Connection Initial message followed by a short 0-RTT packet. Isn't networking fun?
-- Christian Huitema