Re: [BEHAVE] PMTU Discovery and ICMPv6 filtering

Hi Ed,

> -----Original Message-----
> From: behave-bounces@ietf.org [mailto:behave-bounces@ietf.org] On Behalf Of Ed Jankiewicz
> Sent: Wednesday, February 03, 2010 9:21 AM
> To: Behave WG; softwires@ietf.org
> Subject: [BEHAVE] PMTU Discovery and ICMPv6 filtering
> 
> One of my colleagues received a long comment on Path MTU Discovery recommendations his organization
> published and is seeking advice.  I recall this has been discussed several times at IETF meetings,
> not sure which WG, so this may be redundant.  I've tried to summarize the salient points below, and
> have two broad questions on this:  Are these points already covered in RFCs (other than 4459, 4890)
> or current Internet-Drafts? If so, I would appreciate pointers.  If not already covered by current
> publications, is there interest in documenting the problem and comparing the solutions/drawbacks?
> 
> The commenter basically wrote:
> 
> IPv4 and IPv6 treat packets exceeding MTU differently - IPv4 will fragment packets that are "too big"
> but IPv6 will drop the packet and respond with ICMPv6 "too-big" error message. [The subject
> publication] recommends using the Path MTU Discovery Protocol to discover the end-to-end PMTU, which
> relies on ICMPv6 error messages. These may be blocked by various "filters" and IPsec gateways, which
> is the case in many operational networks.
> 
> However, even when ICMPv6 is not blocked, IPsec gateways (in tunnel mode) add extra headers, and
> there can be more than one tunnel header involved (routers also create tunnels). When a "too-big"
> message is sent, the router will put in its ICMPv6 message the value of the MTU on the next
> link at layer 2. The host receiving this MTU value in an ICMP message as part of the Path MTU
> Discovery Protocol has no way of knowing how many extra tunnel headers are added along the path, and
> so if it just takes the reported MTU value without allowing for these extra headers the process will
> keep on failing and will not recover. We have seen this behavior in our experiments.
> 
> This can be prevented by ensuring that the maximum packet size sent by the host is smaller than the
> layer 2 limit: smaller by an amount estimated to be sufficient to allow room for extra headers to be
> added along the path. Several ways of achieving this are possible:
> 
> (1) Set this reduced MTU value on the IPSec gateway LAN interface; the host then discovers
> this MTU through the PMTUD.
> 
> (2) Statically configure this reduced MTU value into the host and switch off PMTUD.
> 
> (3) Set a reduced MTU at the IPSec gateway WAN interface; The IPSec gateway acts as a host on this
> interface and so can do packet fragmentation.
> 
> (4) Provide the capability in the IPSec gateway to discover the MTU on its WAN interface, subtract
> the maximum header size that this gateway will add to packets presented on its LAN interface, which
> the host can then discover through the PMTUD.
> 
> Method (4) would be the best solution, but is not currently available in the IPSec gateway products.
> The next best solution is (1), which has been used [in commenter experiments]. This is not as good as
> (4) because it requires manual intervention, and an understanding of how to calculate the appropriate
> (reduced) MTU value.
> The next best solution is (2), the only disadvantage of this approach is that only one value can be
> set for all paths and so the worst case (lowest) value has to be used. In a complex network it may
> not always be obvious what the worst case path is, and so a conservative estimate may be necessary.
> Even so this could be preferable in some deployment scenarios since the path-MTU discovery protocol
> relies on the passage of ICMP messages which are sometimes blocked by firewalls and other security
> devices.
> Approach (3) is the worst solution since it will cause many IP packets to be fragmented which is
> inefficient (both because, unlike IPv4, the IPv6 header has to be extended to include the
> fragmentation offset field, and because it will result in the second fragment being very small, i.e. the
> ratio of user-data to IP header size will be poor).
> 
> It is likely that for immediate use option (1) should be used although (4) would be better if it were
> supported in the relevant products.

(1) seems like a safe option at face value, but can lead
to undesirable inefficiencies. Consider for example the
diagram below:

            L1    L2               L3
            |     |                |
        W --|--R--|--GW1<====>GW2--|--Z
            |     |   (Internet)   |
               X--|     L4, L5,
               Y--|     L6, etc.

Here, we have a tunnel between 'GW1' and 'GW2' over the
Internet to connect two networks. 'GW1' sets a reduced
MTU 'M' on its LAN interface connected to link 'L2', and
also advertises 'M' in the Router Advertisements it sends
on 'L2'. Hosts 'X' and 'Y' pick up the reduced MTU from
the RA and limit the size of the packets they send to at
most 'M' bytes. But, host 'W' connected to link 'L1' does
not see the MTU reduction, and hence will routinely send
packets larger than 'M' to any hosts beyond router 'R'
such as 'X', 'Y' and 'Z'. These packets will be dropped
and an ICMPv6 Packet Too Big (PTB) returned, after which
'W' is forced to reduce its packet size and retransmit.
The only way to
prevent this is to drive the reduced MTU 'M' deeply into
the entire network stacked up behind 'GW1', which may
contain arbitrarily many additional routers and links.
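
To make the cost concrete, here is a rough Python sketch of
the PTB round trip for hosts 'W' and 'X'. It is only a toy
model: the values 1400 and 1500, the function name, and the
choice of 'R' as the hop that enforces 'M' when forwarding
onto 'L2' are all my assumptions, and it presumes every PTB
actually reaches the sender.

```python
def pmtud_send(initial_size, path):
    """Model classic PMTUD for one flow.

    path: list of (hop_name, next_link_mtu) in forwarding order.
    Returns (final_packet_size, dropped_packets).
    Assumes every ICMPv6 Packet Too Big message reaches the sender.
    """
    size = initial_size
    dropped = 0
    while True:
        for _hop, next_link_mtu in path:
            if size > next_link_mtu:
                # This hop drops the packet and reports its next-link MTU.
                size = next_link_mtu
                dropped += 1
                break
        else:
            # No hop objected: the packet was delivered.
            return size, dropped


M = 1400       # assumed value of the reduced MTU 'M'
NATIVE = 1500  # assumed native MTU of links 'L1' and 'L2'

# Host W on L1 never sees the RA from GW1, so it starts at the native MTU.
# Assumption: the hop forwarding onto L2 is where 'M' is enforced.
path_from_w = [("W", NATIVE), ("R", M)]
# Host X on L2 learned 'M' from the RA and starts there.
path_from_x = [("X", M)]

w_result = pmtud_send(NATIVE, path_from_w)
x_result = pmtud_send(M, path_from_x)
print("W:", w_result)
print("X:", x_result)
```

In this model 'W' loses one packet before settling on 1400,
while 'X' loses none; every additional router behind 'GW1'
that does not know about 'M' puts more hosts in W's position.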

Now, even if the reduced MTU 'M' were propagated deeply
throughout the 'GW1' network, hosts whose communications
remain mostly local would still be limited to a packet
size of at most 'M', even if their links natively support
a much larger MTU. Consider for example that link 'L2'
in the diagram has a native MTU of 9KB, but the effective
MTU across the 'GW1'<====>'GW2' tunnel is only 1400. 'GW1'
will advertise 1400 on link 'L2', and communications
between hosts 'X' and 'Y' will be restricted to using at
most 1400 byte packets when they could have used 9KB.

There are a couple of factors to consider in terms of
what might be a better solution. First, what are the
expected data rates over the 'GW1'<====>'GW2' tunnel,
and second what are the performance characteristics of
those gateways? If the data rates are such that GW1 and
GW2 are already operating at their peak performance even
without taking on any additional processing overhead,
then the best solution would be to make sure that all
links over which the tunnel might travel (e.g., 'L4',
'L5', 'L6', etc.) are large enough to "hide" the tunnel
encapsulation artifact. For example, if all links 'L4',
'L5', 'L6', etc. configure a native MTU of no smaller
than 1600 and the encapsulation overhead for the tunnel
is 100 bytes, then all routers and hosts on the LAN side
of 'GW1' would be able to happily use a 1500 MTU. If the
MTUs of 'L4', etc. cannot be controlled, however, then
there is no recourse but to use option (1) and cope with
the inefficiencies.
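
The arithmetic behind "hiding" the encapsulation can be
written down in a few lines of Python. This is just a
sketch of the calculation in the paragraph above; the
function names and the second example (one 1500-byte link
in the tunnel path) are mine.

```python
def tunnel_inner_mtu(underlay_link_mtus, encap_overhead):
    """Largest inner packet that crosses the tunnel unfragmented."""
    return min(underlay_link_mtus) - encap_overhead


def hides_encapsulation(underlay_link_mtus, encap_overhead, lan_mtu=1500):
    """True if the LAN side can keep using lan_mtu despite the tunnel."""
    return tunnel_inner_mtu(underlay_link_mtus, encap_overhead) >= lan_mtu


# All of L4, L5, L6 at 1600 with 100 bytes of overhead: 1500 survives.
print(tunnel_inner_mtu([1600, 1600, 1600], 100))
print(hides_encapsulation([1600, 1600, 1600], 100))

# Hypothetical case: one link in the path is only 1500.
print(tunnel_inner_mtu([1600, 1500, 1600], 100))
print(hides_encapsulation([1600, 1500, 1600], 100))
```

The smallest link governs, so a single uncontrolled link in
the tunnel path is enough to fall back to option (1).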

On the other hand, if the data rates across the tunnel
are nominal and/or 'GW1' and 'GW2' have more than
sufficient processing capability to take on a modest
amount of additional overhead, the GWs can use tunnel
fragmentation so that 'GW1' can present a solid MTU on
its LAN side interface that does not reflect the size
of the tunnel encapsulation headers. If the tunnel
fragmentation could accommodate an MTU of at least 1500
in this way, then 'GW1' would be able to observe the
"de facto Internet cell size" of 1500. If we further
assume that the vast majority of hosts in the world
today either limit their packet sizes to no more than
1500 bytes or are willing to assume the risk of silent
loss of packets larger than 1500 due to MTU restrictions,
then there is no need to place artificial restrictions
on the size of packets that can be used within the 'GW1'
network. This latter class of hosts (those that send
packets larger than 1500) would be best served by using
their own host-based MTU probing mechanisms when sending
packets larger than 1500, in case the network is
silently dropping PMTUD messages. RFC 4821 was
specifically designed for this purpose.
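
As a very rough illustration of host-based probing in the
spirit of RFC 4821, the Python sketch below searches for
the largest packet size that gets through without relying
on any ICMPv6 message. The binary search, the function
names, and the simulated path are my simplifications, not
the procedure the RFC specifies; a real implementation
would also need to tell a lost probe apart from ordinary
congestion loss.

```python
IPV6_MIN_MTU = 1280  # every IPv6 link must support at least this


def probe_path_mtu(probe, search_low=IPV6_MIN_MTU, search_high=9000):
    """Return the largest size in [search_low, search_high] that probe accepts.

    probe(size) -> bool is a hypothetical callback: True if a probe
    packet of that size was acknowledged by the peer.
    search_low is assumed to work without probing.
    """
    low, high = search_low, search_high
    while low < high:
        candidate = (low + high + 1) // 2  # round up so the loop makes progress
        if probe(candidate):
            low = candidate        # probe delivered: raise the floor
        else:
            high = candidate - 1   # probe lost: lower the ceiling
    return low


def make_silent_path(hidden_mtu):
    """Simulate a path that silently drops anything above hidden_mtu."""
    return lambda size: size <= hidden_mtu


print(probe_path_mtu(make_silent_path(1400)))
print(probe_path_mtu(make_silent_path(9000)))
```

Note that the search needs no PTB messages at all, which is
the point when filters are in the path.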

End result - whenever it is practically possible, tunnel
routers should use tunnel fragmentation and hosts should
use RFC4821.
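
For the tunnel fragmentation half of that recommendation,
here is a back-of-the-envelope Python sketch of what 'GW1'
would emit for one inner packet. I am assuming plain IPv6
fragmentation of the outer packet at 'GW1' with reassembly
at 'GW2', a 100-byte encapsulation overhead that includes
the 40-byte outer IPv6 header, and a 1500-byte path between
the gateways; other tunnel fragmentation schemes will
produce different numbers.

```python
IPV6_HEADER = 40     # fixed IPv6 header, bytes
FRAGMENT_HEADER = 8  # IPv6 Fragment extension header, bytes


def outer_packet_sizes(inner_len, encap_overhead, path_mtu):
    """Sizes of the outer packets emitted for one inner packet.

    encap_overhead is assumed to include the 40-byte outer IPv6 header.
    """
    outer_len = inner_len + encap_overhead
    if outer_len <= path_mtu:
        return [outer_len]  # fits as is, no fragmentation needed
    # Everything after the outer IPv6 header is fragmentable.
    fragmentable = outer_len - IPV6_HEADER
    # Every fragment except the last carries a multiple of 8 bytes.
    max_chunk = (path_mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while fragmentable > 0:
        chunk = min(max_chunk, fragmentable)
        sizes.append(IPV6_HEADER + FRAGMENT_HEADER + chunk)
        fragmentable -= chunk
    return sizes


print(outer_packet_sizes(1500, 100, 1500))
print(outer_packet_sizes(1400, 100, 1500))
```

Under these assumptions a 1500-byte inner packet becomes two
outer packets of 1496 and 160 bytes, so the LAN side keeps
its 1500 MTU at the price of the small second fragment that
the commenter objected to in approach (3).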

Fred
fred.l.templin@boeing.com

> --
> Ed Jankiewicz - SRI International
> Fort Monmouth Branch Office - IPv6 Research
> Supporting DISA Standards Engineering Branch
> 732-389-1003 or  ed.jankiewicz@sri.com