Re: [DNSOP] I-D Action: draft-ietf-dnsop-avoid-fragmentation-00.txt

Marek Majkowski <> Thu, 09 July 2020 16:32 UTC

On Thu, Jul 9, 2020 at 10:28 AM <> wrote:
> > From: Marek Majkowski <>
> >> UDP requestors and responders SHOULD send DNS responses with
> >> IP_DONTFRAG / IPV6_DONTFRAG [RFC3542] options, which will yield
> >> either a silent timeout, or a network (ICMP) error, if the path
> >> MTU is exceeded.
> >
> > When MTU is exceeded the sender might also receive plain old EMSGSIZE
> > error on sendto(). I would love to see an example on what
> > IP_MTU_DISCOVER settings authors expect. This option is notoriously
> > hard to get right.
> Is IP_MTU_DISCOVER a Linux-only option ?

Oh, I see. Still, my point remains. As a developer I'm not exactly
sure what I should set on Linux.
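
To make the question concrete, here is a minimal sketch of one possible
setting, not something the draft specifies. It assumes Linux, where
IP_PMTUDISC_DO sets the DF bit and makes sendto() fail with EMSGSIZE when
the datagram is larger than the path MTU the kernel currently knows. The
helper names and the action strings are hypothetical; the numeric
fallbacks are the values from <linux/in.h>.

```python
import errno
import socket

# Linux values from <linux/in.h>, used only if the socket module lacks the names.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)


def make_df_socket():
    """Create a UDP/IPv4 socket that always sets DF (Linux only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    return sock


def classify_send_error(exc):
    """Map a sendto() failure to a hypothetical action for the DNS server."""
    if exc.errno == errno.EMSGSIZE:
        # The datagram exceeds the path MTU currently known to the kernel.
        return "shrink-or-truncate"
    return "other"


def send_response(sock, payload, addr):
    """Send one datagram and report what happened instead of raising."""
    try:
        sock.sendto(payload, addr)
        return "sent"
    except OSError as exc:
        return classify_send_error(exc)
```

A server using this would call send_response() for each reply and, on
"shrink-or-truncate", rebuild the answer to fit or set TC=1. Whether
IP_PMTUDISC_DO is the mode the authors intend is exactly the open question.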

> Authors don't consider discouraging Path MTU discovery.
> We refer RFC1191, RFC8201 and ietf-tsvwg-datagram-plpmtud.
> If Path MTU discovery works well, we can use large UDP datagram size.
> >> Fragmented DNS/UDP messages may be dropped without IP reassembly
> >
> > Not sure what it has to do with the draft. Are we worried about
> > request fragmentation and allowing the DNS server to drop fragmented
> > requests? Are we worried about response fragmentation?
> I would like to change the text as:
>   The DNS Requestor MAY drop Fragmented DNS/UDP responses without IP
>   reassembly. (before IP reassembly?)
>   (Texts related ICMP error may be dropped, I think.)
> > I have two problems with this proposal. First, it doesn't mention IPv4
> > vs IPv6 differences at all. In IPv4 landscape fragmentation, while a
> > security issue, is generally fine. In the IPv6 world, fragmentation is
> > disastrous - packets with extension headers are known to be dropped.
> On IPv6, "every link in the Internet have an MTU of 1280 octets or
> greater" (RFC 8200).
> Then, if Path MTU discovery works, we can use real MTU value.
> Otherwise, we can use 1280 as MTU.
> We can easily avoid IPv6 fragmentation. (Fragmented IPv6 packets are bogus.)

How do you define whether Path MTU discovery works? In my case, it works
fine. It's just that when the ICMP arrives, exactly one server of a large
fleet learns the new path MTU. It's also worth noting that a learned Path
MTU expires quickly, I think < 10 minutes. So the whole dance of request,
timeout, second request must be repeated often.
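
On Linux the lifetime of a learned path MTU can be checked rather than
guessed. The sketch below assumes the usual sysctl files under /proc and
that their content is a plain number of seconds; the function names are
hypothetical.

```python
from pathlib import Path

# Linux sysctl files holding the lifetime of cached path MTU entries, in seconds.
MTU_EXPIRES = {
    "ipv4": "/proc/sys/net/ipv4/route/mtu_expires",
    "ipv6": "/proc/sys/net/ipv6/route/mtu_expires",
}


def parse_seconds(text):
    """Parse the sysctl file content into an integer number of seconds."""
    return int(text.strip())


def read_mtu_expires(path):
    """Return the configured lifetime in seconds, or None if unavailable."""
    try:
        return parse_seconds(Path(path).read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for family, path in MTU_EXPIRES.items():
        print(family, read_mtu_expires(path))
```

Whatever the value is on a given host, every expiry means the next
oversized response is dropped again before the MTU is relearned.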

> On IPv4, it's terrible because IPv4 minimal MTU is 68, but most of
> links support 1500 octet MTU.
> > Second, this proposal assumes that path MTU detection works correctly.
> > This is surprisingly optimistic. Let's consider IPv6 - in IPv6 the
> > smaller path MTU < 1500 is very common.
> Without IPv6 over IPv4 tunnels, most of IPv6 links support 1500 octet
> MTU size.
> We can detect path MTU discovery failure, then we can use 1280 MTU on IPv6.

How do we detect path MTU discovery failure in the context of UDP and
DNS protocols?

> > Let's say a DNS auth server sent an IPv6 DNS response packet exceeding
> > path MTU. An intermediate router will drop the offending packet and
> > one of three scenarios will happen:
>   Leaf client ----Tunnel-----[Tunnel Router]-----Routers----Auth Server
>                  MTU 1280                   1500       1500
>                              drop ------------------------>
>                              ICMP PTB
> > - (A) No ICMP PTB message is sent back.
> > - (B) ICMP PTB message is sent back, but fails to be delivered.
> A and B is the same.
> The first response is simply dropped.
> Leaf client (full-service resolvers) may retry queries to other Auth
> servers by UDP, or retry queries to the Auth server by TCP or UDP/EDNS
> with small size.

Right. Does the proposal address only resolver-auth communication? Is
the draft also addressing the stub-resolver case?

It's not clear to me what a resolver should do on a timed-out query:
 - retry to the same DNS auth server again over UDP
 - retry to the same DNS auth over UDP using different UDP source port
 - retry over TCP
 - retry using another IP protocol (v4 vs v6) to auth server
 - retry over UDP with tweaked EDNS
 - retry using another DNS auth server

Or a mixture of the above?

> > - (C) ICMP PTB message is sent back and delivered correctly to the server.
> First, the first response is simply dropped.
> Auth server knows that path MTU to the Leaf client is 1280.
> Leaf client (full-service resolvers) may retry queries to other Auth
> servers by UDP, or retry queries to the Auth server by TCP or UDP/EDNS
> with small size. (The same as A and B.)
> After some time, if the leaf node send next queries to the Auth server
> with UDP and same parameter, the Auth server knows path MTU to the
> leaf.

Negative. *One of potentially a hundred* ECMP auth servers learns
the ephemeral Path MTU. There is no guarantee that the retry will use
the same source port and routing path. The retry may end up at a
completely different machine due to ECMP routing or another L4 load
balancing mechanism. Even perfectly working Path MTU discovery is
limited in scope to only one server. There may be more than one DNS
auth server sharing the same IP.
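
A toy model of the effect, assuming the router or load balancer picks a
backend by hashing the UDP 4-tuple. Real ECMP hashes differ; the SHA-256
hash, the function names and the documentation addresses are illustrative
assumptions only.

```python
import hashlib


def pick_backend(src_ip, src_port, dst_ip, dst_port, n_backends):
    """Toy stand-in for an ECMP / L4 load-balancer hash over the 4-tuple."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_backends


def backends_hit(src_ip, dst_ip, dst_port, src_ports, n_backends):
    """Set of backends reached when the resolver varies its source port."""
    return {
        pick_backend(src_ip, port, dst_ip, dst_port, n_backends)
        for port in src_ports
    }


if __name__ == "__main__":
    # One resolver retrying from 20 different source ports towards one
    # anycast address served by 100 machines.
    hit = backends_hit("2001:db8::1", "2001:db8:53::1", 53,
                       range(40000, 40020), 100)
    print(f"{len(hit)} distinct backends saw the retries")
```

The ICMP PTB has the same problem: it arrives from the router's address
rather than the resolver's, so a plain 4-tuple hash has no reason to
deliver it to the machine that sent the oversized response.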

> Then, the auth server need to compose response packets fit in
> the path MTU size, or set TC=1.
> > All three scenarios are disastrous on the practical internet. The
> > proposal assumes (A) and (B) will rarely happen, and puts the
> > responsibility on the DNS client to retry over TCP. This will cause
> > unnecessary timeouts and degrade the overall quality of the service.
> To avoid these cases, we can make new recommendations.  Authoritative
> servers and full-service resolvers SHOULD support 1500-octet path MTU
> to major parts of the Internet.
> (we need to define major parts of the Internet)

Right, so you are saying:
 - in communication between DNS auth and DNS recursive
 - in order to avoid transmitting fragments
 - no answer from auth should exceed 1500
 - or detected path MTU if such is present

This is fair. But then, if you are suggesting that any decent internet
host should have MTU 1500 anyway, you can remove the last point and the
whole path-MTU-discovery discussion.
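
The size limits implied by the list above work out as follows. This is a
minimal sketch assuming no IPv4 options and no IPv6 extension headers; the
function names are hypothetical.

```python
# Fixed header sizes in bytes, without IPv4 options or IPv6 extension headers.
IP_HEADER = {4: 20, 6: 40}
UDP_HEADER = 8


def max_dns_payload(mtu, ip_version):
    """Largest DNS message that fits in one unfragmented UDP datagram."""
    return mtu - IP_HEADER[ip_version] - UDP_HEADER


def fits(response_len, ip_version, path_mtu=None, default_mtu=1500):
    """True if the response can be sent whole; otherwise shrink it or set TC=1."""
    mtu = min(default_mtu, path_mtu) if path_mtu else default_mtu
    return response_len <= max_dns_payload(mtu, ip_version)


if __name__ == "__main__":
    print(max_dns_payload(1500, 4))  # 1472
    print(max_dns_payload(1500, 6))  # 1452
    print(max_dns_payload(1280, 6))  # 1232
```

Dropping the "detected path MTU" point amounts to always calling fits()
with path_mtu=None.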

Maybe there are two different discussions. What you are saying is:
don't be ridiculous - don't send UDP packets > 1500, which for sure
will require fragmentation. While I am saying - when you do get an ICMP
PTB back, react immediately and save the requestor a wait.

> > In this proposal all three (A), (B), and (C) scenarios will result in
> > dropped responses. DNS client needs to wait for timeout, retry over
> > UDP, wait more and eventually retry over TCP. This is bad.
> Do you mean that DNS client is stub resolvers in clients ?
> Stub resolvers may not set DO (DNSSEC OK) bit and responses can be small.
> Or full-service resolvers ? Full-service resolvers should be located
> at 1500-octet MTU world.
> > We could fix (C) by making the DNS server to capture the ICMP PTB in
> > DNS server code. The ICMP payload often has enough context for the DNS
> > server to prepare another reply. This reply of course should be sent
> > with lowered MTU.
> I think that we can know ICMP PTB result from applications by IP_MTU
> socket options on Linux.
> > On Linux it is possible to capture the ICMP PTB without privileges, by
> > setting IP_RECVERR and inspecting MSG_ERRQUEUE. In IPv4 the PTB
> > messages often have 520 bytes of payload and in IPv6 1184 bytes. This
> > is enough context to build another response, without having to wait
> > for any timeout.
> In my opinion, the captured PTB data is similar to IP_MTU/IPV6_MTU
> socket options.

With the captured ICMP, the DNS server is able to react and send something
out immediately. This will improve the service quality and save the
requestor the wait.
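
A minimal sketch of that reaction path on Linux, assuming a UDP socket with
IP_RECVERR (or IPV6_RECVERR) enabled. The struct layout follows
sock_extended_err from <linux/errqueue.h>, where ee_info carries the
reported MTU for EMSGSIZE errors. The function names are hypothetical and
rebuilding the smaller DNS reply is left out.

```python
import errno
import socket
import struct

# Linux values, used only if the socket module lacks the names.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)        # <linux/in.h>
IPV6_RECVERR = getattr(socket, "IPV6_RECVERR", 25)    # <linux/in6.h>
MSG_ERRQUEUE = getattr(socket, "MSG_ERRQUEUE", 0x2000)
MSG_DONTWAIT = getattr(socket, "MSG_DONTWAIT", 0x40)

# struct sock_extended_err: ee_errno, ee_origin, ee_type, ee_code,
# ee_pad, ee_info, ee_data
SOCK_EE = struct.Struct("=IBBBBII")


def enable_recverr(sock):
    """Ask the kernel to queue ICMP errors for this IPv4 UDP socket."""
    sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)


def mtu_from_cmsg(cmsg_data):
    """Return the MTU reported by an EMSGSIZE error, else None."""
    if len(cmsg_data) < SOCK_EE.size:
        return None
    ee_errno, _origin, _type, _code, _pad, ee_info, _data = \
        SOCK_EE.unpack_from(cmsg_data)
    return ee_info if ee_errno == errno.EMSGSIZE else None


def drain_errqueue(sock):
    """Yield (mtu, original_payload, original_dest) for each queued PTB."""
    wanted = {(socket.IPPROTO_IP, IP_RECVERR),
              (socket.IPPROTO_IPV6, IPV6_RECVERR)}
    while True:
        try:
            payload, ancdata, _flags, dest = sock.recvmsg(
                2048, 512, MSG_ERRQUEUE | MSG_DONTWAIT)
        except BlockingIOError:
            return  # error queue is empty
        for level, ctype, data in ancdata:
            if (level, ctype) in wanted:
                mtu = mtu_from_cmsg(data)
                if mtu is not None:
                    yield mtu, payload, dest
```

The payload returned from the error queue is the part of the original
response echoed back in the ICMP message, and dest is where that response
was going, so the server has the query ID and the client address it needs
to send a smaller or truncated reply right away.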