Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware

Robin Whittle <rw@firstpr.com.au> Mon, 08 February 2010 05:01 UTC

Hi Fred,

I am replying to your message (msg05937).  I will soon write, in a
different thread, my understanding of SEAL - which I hope you will
respond to with any necessary corrections.

You wrote:

Regarding some IPv4 or IPv6 routers sending a PTB with a zero value
in the 16 bit MTU field (IPv4) or the 32 bit MTU field (IPv6):

>> Are there really routers in use which don't include the 16 bit value?
>>
>> Even if there theoretically are not, SEAL still has to cope with the
>> possibility of zero or unreasonable values in this MTU field.
> 
> SEAL explicitly turns off PMTUD and uses its own tunnel
> endpoint-to-endpoint MTU determination, so in the normal
> case it does not expect to receive any ICMP PTBs from
> routers within the tunnel. 

My understanding is that this is only true for IPv4, because the SEAL
ITE (Ingress Tunnel Endpoint) sends packets with DF=0 to the ETE
(Egress Tunnel Endpoint).  For IPv6, the ITE can get PTBs from
routers in the tunnel, since routers cannot fragment IPv6 packets.
So I think it would not be true to state that the SEAL ITE "turns
off" the traditional IPv6 RFC 1981 PMTUD mechanism when it tunnels
packets to the ETE.
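
Just to illustrate the IPv4 half of this, here is a minimal sketch -
my own, not from the SEAL spec - of an ITE building its outer IPv4
header with DF=0, so routers in the tunnel can fragment rather than
send PTBs back.  The field values, protocol number and omitted
checksum are placeholders only.

    # Sketch only: outer IPv4 header with DF clear (not SEAL spec code).
    import socket
    import struct

    def outer_ipv4_header(src_ip, dst_ip, payload_len, df=False,
                          proto=4, ttl=64):
        ver_ihl   = (4 << 4) | 5                 # IPv4, 20-byte header
        total_len = 20 + payload_len
        flags_off = 0x4000 if df else 0x0000     # bit 14 is the DF flag
        return struct.pack("!BBHHHBBH4s4s",
                           ver_ihl, 0, total_len,
                           0x1234, flags_off,    # ID, Flags + Fragment Offset
                           ttl, proto, 0,        # TTL, Protocol, Checksum (omitted)
                           socket.inet_aton(src_ip),
                           socket.inet_aton(dst_ip))

    # DF=0 outer header for the IPv4 case; for IPv6 there is no such
    # escape hatch, so PTBs from routers in the tunnel can still arrive.
    hdr = outer_ipv4_header("192.0.2.1", "198.51.100.1", 1400, df=False)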

> SEAL *can* enable PMTUD for certain "expendable" packets, 

I don't recall what these would be.

> and can benefit from any
> ICMP PTBs coming from within the tunnel that contain
> sufficient information. But, that would simply be an
> optimization.

Is there a mechanism for SEAL, in IPv4, to send these "expendable"
packets with DF=1?


>>> In some environments, it may be necessary to insert a
>>> mid-layer UDP header in order to give ECMP/LAG routers
>>> a handle to support multipath traffic flow separation.
>>    http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing
>>
>> http://www.force10networks.com/CSPortal20/TechTips/0065_HowDoIConfigureLoadBalancing.aspx
>>
>> As far as I know, these techniques are not something to consider with
>> the RANGER CES, or with LISP or Ivip.  If the routers can handle
>> ordinary traffic packets they can handle encapsulated packets too.  I
>> haven't read about these techniques in detail.  I guess that within
>> RANGER, beyond its use as a CES scalable routing solution, you may
>> want to support ECMP and LAG.
> 
> There has been a great deal of talk about taking care
> of ECMP/LAG routers within the network that only
> recognize common-case protocols (i.e., TCP and UDP),
> which is why LISP has locked into using UDP encaps.

Do you expect this to be the case for IRON?  If so, then I guess that
SEAL in IRON must always use this UDP header before the SEAL header -
since no ITE could know for sure whether ECMP/LAG is in use on the
path to the ETE.
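
For concreteness, here is a rough sketch of what I understand such a
UDP shim to look like - my assumption, not the SEAL or IRON specs:
the source port carries a hash of the inner flow so that ECMP/LAG
routers which only parse TCP/UDP can still separate flows, much as
LISP does.  The port number 50000 and the hash are purely
illustrative.

    # Sketch only: UDP shim placed before the SEAL header (not spec code).
    import struct
    import zlib

    OUTER_UDP_PORT = 50000   # hypothetical, not an assigned SEAL port

    def udp_shim(inner_flow, seal_len):
        """8-byte UDP header to place before the SEAL header."""
        flow_hash = zlib.crc32(repr(inner_flow).encode())
        src_port  = 49152 + (flow_hash % 16384)   # ephemeral port range
        return struct.pack("!HHHH", src_port, OUTER_UDP_PORT,
                           8 + seal_len, 0)       # length, checksum 0

    shim = udp_shim(("10.0.0.6", "203.0.113.9", 6, 443, 51515),
                    seal_len=1400)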

If anyone can point me to good references on ECMP used with LAG, I
would really appreciate it.


Regarding how the ITE manages a sliding window in the (random
starting value, then monotonically incremented) 32 bit SEAL ID it
uses for each ETE - so it can recognise valid values in PTBs or
messages from the ETE, without having to cache each value used so
far, and while keeping the window small enough to make it hard for an
attacker to correctly guess a value which would be accepted:

>>> The above is all correct wrt the window management. The
>>> ITE can ensure that the window size remains bounded by
>>> sending periodic explicit probes (e.g., once explicit
>>> probe per every N data packets).
>>
>> I don't have a really clear idea of how SEAL sets the numeric window,
>> or what you mean by your second sentence above.  If you can give a
>> more complete description with an example, I would really appreciate
>> it - in part because I might want to use or adapt your technique for
>> Ivip, which currently uses nonces.
> 
> What I am asking for in SEAL is that the ITE sets the
> SEAL_ID in each packet in monotonically-incrementing
> fashion (modulo 32). Then, on every Nth packet (e.g.,
> 500th, 1000th, etc.) the ITE sets the "Acknowledgement
> Requested" bit and the ETE (upon seeing the bit set)
> sends back an explicit acknowledgement. The ITE can
> then keep a window of outstanding SEAL_ID's by keeping
> track of the most recently sent and most recently
> acknowledged SEAL_IDs.

OK, except for the possibility that packets arrive out of order.
For instance, let's say the random starting point for the SEAL ID
this ITE uses when sending packets to this ETE was 3,000,000,000.  On
the 500th packet, with SEAL ID 3,000,000,499, the "Acknowledgement
Requested" flag is set.  However, let's say that packet is received
at the ETE before the packet with SEAL ID 3,000,000,498 is received.

By the time the ITE gets the acknowledgement packet from the ETE, it
has sent packets using SEAL IDs up to 3,000,000,510.

The ITE will then refuse to accept as valid any messages from the ETE
- or PTBs from routers in the tunnel path - with values in these
ranges, working backwards around the modulo 2^32 circle:

   Between 3,000,000,499 and 0 inclusive, or between
   4,294,967,295 and 3,000,000,511 inclusive.
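
Here is a minimal sketch of the acceptance check as I read it - my
own illustration, not SEAL spec code.  The ITE keeps the most
recently sent and most recently acknowledged SEAL IDs per ETE and
accepts an echoed SEAL ID only if it falls inside that window on the
modulo 2^32 circle.

    # Sketch only: my reading of the SEAL ID acceptance window.
    MOD = 2 ** 32

    def in_window(candidate, last_acked, last_sent):
        """True if candidate lies in (last_acked, last_sent] mod 2^32."""
        window = (last_sent - last_acked) % MOD
        offset = (candidate - last_acked) % MOD
        return 0 < offset <= window

    # The example above: last acked 3,000,000,499, last sent 3,000,000,510.
    assert in_window(3_000_000_505, 3_000_000_499, 3_000_000_510)
    assert not in_window(3_000_000_498, 3_000_000_499, 3_000_000_510)  # late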

But let's say the ETE then responds to the packet with SEAL ID
3,000,000,498, or that the ETE responded to it before sending the
acknowledgement, but that the response was delivered to the ITE after
the acknowledgement.  Then this response will be considered invalid.

There is at least one reason for lowering the value of "500": to
reduce the proportion of random values in spoofed messages to the ITE
which it would accept.  A window of 500 (~2^9) already increases the
chance of a spoofed packet being accepted to about 2^-23.
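
For concreteness, this is the arithmetic behind that figure, taking
the window size W to be roughly the acknowledgement interval of 500
(about 2^9):

    P(\text{spoofed SEAL ID accepted}) = \frac{W}{2^{32}}
        \approx \frac{2^{9}}{2^{32}} = 2^{-23} \approx 1.2 \times 10^{-7}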

A reason for increasing it is to reduce the workload on the ITE and
ETE - though doing this once every 500 packets doesn't seem excessive.

An alternative arrangement to the above would be to implement in
the ITE a delay function, with a time-constant of a few seconds - say
3 seconds - so that at any instant, the ITE could find, for a given
ETE, the lowest SEAL ID it sent to this ETE within the last 3
seconds.  By using this and the last value used (which would already
be stored), the ITE could maintain a relatively narrow window
without the ETE having to acknowledge anything, and without the
out-of-order packet problem I described above.

This "3 second timer" function doesn't have to be particularly
accurate.  If the effective time is between 3 and 4 seconds, then it
would suffice to take a snapshot of ETE IP addresses and the current
state of their SEAL ID counters, every second, and then stash the
snapshots in one of 4 locations on a circular basis.  Then for the
function which looks up the "SEAL ID last used ~3 seconds ago" simply
find that ETE's SEAL ID value in the snapshot taken 3 to 4 seconds ago.
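
A rough sketch of what I have in mind - names and structure are mine,
just to illustrate; tick() would be driven by a once-per-second
timer, and the loose accuracy requirement above is all it needs:

    # Sketch only: per-second snapshots of each ETE's SEAL ID counter.
    import collections

    class SealIdHistory:
        """Keeps roughly 4 seconds of per-ETE SEAL ID counter snapshots."""

        def __init__(self, counters):
            self.counters = counters            # ETE address -> current SEAL ID
            self.snaps = collections.deque([dict(counters)], maxlen=4)

        def tick(self):
            """Call once a second; snapshots older than ~4 seconds fall off."""
            self.snaps.append(dict(self.counters))

        def id_about_3s_ago(self, ete_addr):
            """SEAL ID in use for this ETE 3 to 4 seconds ago - the lower
            edge of the acceptance window."""
            return self.snaps[0].get(ete_addr)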


>>> No, the sentence is correct. It is possible for the ITE
>>> to need to reduce its cached S_MSS value to a size less
>>> than 576 if there is truly a link with a small MTU (e.g.,
>>> 256) on the path. Although 576 is often considered to
>>> be the "nominal" minimum MTU for IPv4 links, the actual
>>> minimum MTU is only 68 bytes per RFC791.
>>
>> OK - I guess I wouldn't go this far for Ivip.  I will probably have
>> some guidance that if anyone has routers, tunnels or whatever with
>> PMTUs less than 1280 or some figure in that range, for both IPv4 or
>> IPv6, then they should make sure that ITRs or ETRs are not located so
>> that such low PMTU parts of the network are between the DFZ and these
>> ITRs or ETRs.
> 
> I want to be able to use SEAL over truly constrained
> links, such as certain wireless communications systems.
> On such links, there may be a reason to set an
> unusually small MTU.

OK.  I will assume for IRON that SEAL wouldn't be required to perform
heroics - to try to work with PMTUs below ~1200 or so.  I certainly
don't want Ivip to have to cope with people putting ITRs and ETRs in
places where there are MTUs less than this.  Such MTUs shouldn't
exist in the DFZ - if an ISP is paying for such a link, it should
get another one.  Likewise, there shouldn't be such low MTU
restrictions in any self-respecting ISP or other network.


>> Indeed you did - I had no idea things were this bad.
>>
>>   http://www.ietf.org/mail-archive/web/rrg/current/msg05907.html
>>   http://www.ietf.org/mail-archive/web/rrg/current/msg05910.html
>>
>>>> I just think it wrong in principle to develop messy new protocols
>>>> such as RFC 4821 to cope with these failings.
>>>
>>> In my opinion, packetization layers are operating "at risk"
>>> if they use packet sizes larger than 1500 but are not in
>>> some way checking with the final destination to ensure that
>>> the big packets are actually getting through. RFC4821 is
>>> a method for the source to do just that without requiring
>>> any changes on the destination. But to be sure, SEAL does
>>> not *depend* on RFC4821 but rather *sets the stage* for
>>> RFC4821 and/or any functional equivalents.
>>
>> OK.  I need to revisit all my thinking on PMTUD given the results of
>> your recent research.  But I haven't yet changed my thinking that RFC
>> 4821 is messy and expensive, and that it would be better to
>> straighten out whatever it is within the network which is stopping
>> PMTUD from working correctly.
> 
> The good thing about RFC4821 is that only the source
> host needs to implement it, so there is a natural path
> for incremental deployment. I guess there are other
> possibilities as well (e.g., similar schemes that do
> PMTUD black hole detection) so as long as the method
> chosen causes no harm the source host is welcome to
> implement whatever it sees fit.

It is up to whoever writes TCP packetization code in host stacks how
they want to set their packet length - and likewise application
developers who need to set their own packet lengths.

My view is that for IPv4, RFC 1191 PMTUD is an excellent system -
except that the PTB message should be made to follow the IPv6
requirement (ICMPv6 RFC 1885 section 3.2, as used by RFC 1981 PMTUD)
of sending back as much of the original packet as will fit without
the PTB exceeding 576 octets:

  http://tools.ietf.org/html/rfc1885#section-3.2

At present, the PTB spec in RFC 1191 (published in 1990) is to send
back only the IPv4 header and the next 8 octets.  It needs to be like
RFC 1981, so that when there are nested tunnels, a PTB sent by a
router in the innermost tunnel will contain enough of the original
packet (at least the IPv4 header plus 8 octets) for the outermost
tunnel entry router to craft a valid PTB to the sending host, without
having to cache anything from the original traffic packet.  This
would occur by a chain of events:

    A PTB from a router in the 3rd (innermost) tunnel is recognised
    by the ITE of that tunnel as valid, since it contains enough
    bytes after the IPv4 header.  For instance, these bytes would
    include the SEAL ID or any other extra stuff that ITE put into
    the headers to enable it to accept only genuine PTBs and reject
    ones spoofed by an off-path attacker.  So the ITE at the start
    of the 3rd tunnel sends:

    A PTB to the ITE of the 2nd tunnel, which is long enough, as
    described above, to be validated and to contain enough of the
    tunneled packet for the ITE of the 2nd tunnel to send:

    A PTB to the ITE of the 1st (outer) tunnel.  This contains
    enough of the original traffic packet (at least the IPv4
    header plus 8 octets) for this ITE to be able to construct
    a valid PTB to the sending host, without the need to cache
    the initial part of the packet.  Likewise, the PTB, as
    mentioned above, must contain whatever this ITE needs to
    validate it as genuine.
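
To make the relay step concrete, here is a minimal sketch of what
each ITE in that chain would do - my own illustration, with assumed
header sizes, not RFC or SEAL text: strip its own encapsulation from
the packet quoted in the PTB it received, check that at least the
inner IPv4 header plus 8 octets remain, and send a new PTB one level
out with the MTU reduced by its own overhead.

    # Sketch only: relaying a PTB one tunnel level out, without caching.
    PTB_SIZE_LIMIT  = 576        # total ICMP error size ceiling, as in RFC 1885
    MIN_QUOTE       = 20 + 8     # minimal IPv4 header plus 8 octets
    ENCAPS_OVERHEAD = 20 + 8     # hypothetical outer IPv4 + SEAL header lengths
    IPV4_MIN_MTU    = 68         # RFC 791 minimum

    def relay_ptb(quoted_packet, reported_mtu):
        """Return (quote, mtu) for the PTB this ITE sends one level out,
        or None if the quoted packet is too short to validate or relay."""
        inner = quoted_packet[ENCAPS_OVERHEAD:]   # drop this tunnel's headers
        if len(inner) < MIN_QUOTE:
            return None
        room = PTB_SIZE_LIMIT - 20 - 8            # leave space for IP + ICMP
        return inner[:room], max(reported_mtu - ENCAPS_OVERHEAD, IPV4_MIN_MTU)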

If the RFC 1191 designers had correctly anticipated the need for one
or more levels of tunneling to support their PMTUD system, then I
think they would have altered the PTB requirements to be as long as
those for IPv6's RFC 1981.  Then we probably would have tunnels today
which properly support RFC 1191 PMTUD.


Also, I think that DF=0 packets should be deprecated - unless perhaps
they are shorter than some constant such as 1200 bytes or so.  I
think it would be bad to expect ITRs and ETRs and the whole CES
system to work over paths with MTUs below this.  People shouldn't use
such short PMTU links in the DFZ and shouldn't place their ITRs or
ETRs anywhere where there are such short PMTU links between them and
the DFZ.

My view is that for IPv6, RFC 1981 is an excellent system.

From your research (msg05910), it seems that the current state of
PMTUD in IPv4 is a shambles - with some networks blocking PTBs, some
tunnels (or combinations of tunnels) not generating PTBs, and with
some hosts ignoring PTBs or not responding properly to them.  Also,
some hosts (Google's at least) send DF=0 packets of 1470 bytes.

As far as I know, everything generally works because many hosts are
configured not to send packets long enough to run into PMTU problems.

From where we are now, there's no way we can generally adopt
jumboframe paths in the DFZ as they appear.

Nor is there a way of introducing a tunneling-based CES architecture
which relies for its PMTUD on PTBs.  My IPTM approach and, I think,
your SEAL approach should be able to cope without relying on PTBs
from within the tunnel (but see my forthcoming message).  But what if
the ITRs (ITEs) can correctly sense the PMTU to the ETRs (ETEs), yet
are unable to alter the sending host's packet lengths?

This could be due to:

  1  A PTB sent by the ITR is dropped by some filtering system
     before it can get to the SH (sending host).  This seems more
     likely if the ITR is outside the ISP or end-user network where
     the SH is located than within it.

     If people filter PTBs from entering their system, or use an
     ISP which does the same, this is their own fault.

     The trouble is, they get away with it now, because the packets
     their hosts send are generally short enough not to run into MTU
     problems.  Unfortunately, such networks will perceive the
     difficulties resulting from their choices as being caused by
     sending packets to a host with an SPI ("edge") address in the
     CES architecture - and may not think it is their own filtering
     which is causing the trouble.

  2  The SH ignores or responds incorrectly to the PTB.

     As above - they get away with it now, and would perceive the
     problem as being caused by the destination network which
     is using the CES system's "edge" space.

  3  The SH sends DF=0 packets which, after encapsulation, are too
     long for some, many or all paths to ETRs.

     Again, as above, they get away with it now - but would blame
     the CES system, or rather the destination network which they
     may not know has adopted the "edge" space provided by the
     CES system.

     So does a CES system have to fragment every such packet?  It
     seems so (see the sketch below).
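
Here is the decision as I see it, in sketch form - my illustration,
not any spec; the encapsulation overhead is a placeholder: if a DF=0
packet would exceed the known path MTU to the ETR once encapsulated,
the ITR fragments it rather than sending a PTB which the source host
is entitled to ignore.

    # Sketch only: ITR forwarding decision for a packet bound for an ETR.
    ENCAPS_OVERHEAD = 28         # hypothetical outer IPv4 + SEAL header bytes

    def forwarding_action(pkt_len, df_set, pmtu_to_etr):
        if pkt_len + ENCAPS_OVERHEAD <= pmtu_to_etr:
            return "encapsulate and send"
        if df_set:
            return "send PTB back to the source host"      # RFC 1191 behaviour
        return "fragment, then encapsulate each fragment"  # the case in point 3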


I think that to implement defensive, complex protocols such as RFC
4821 would be to accept and entrench all these bad practices, and
would forever doom us to doing extra work, and suffering extra
flakiness, because of them.

RFC 4821 will always be a slower and less accurate method of
determining the PMTU to a given host than RFC 1191 or RFC 1981.  It
would also be prone to choosing a lower than proper value if there
was an outage for a while and it interpreted this as a PMTU
limitation.

 - Robin