Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware

"Templin, Fred L" <Fred.L.Templin@boeing.com> Mon, 08 February 2010 19:24 UTC

Return-Path: <Fred.L.Templin@boeing.com>
X-Original-To: rrg@core3.amsl.com
Delivered-To: rrg@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id E75DB28C0E8 for <rrg@core3.amsl.com>; Mon, 8 Feb 2010 11:24:19 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.618
X-Spam-Level:
X-Spam-Status: No, score=-6.618 tagged_above=-999 required=5 tests=[AWL=-0.019, BAYES_00=-2.599, RCVD_IN_DNSWL_MED=-4]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9An0uPrtI57j for <rrg@core3.amsl.com>; Mon, 8 Feb 2010 11:24:17 -0800 (PST)
Received: from slb-smtpout-01.boeing.com (slb-smtpout-01.boeing.com [130.76.64.48]) by core3.amsl.com (Postfix) with ESMTP id 8FDA63A70CB for <rrg@irtf.org>; Mon, 8 Feb 2010 11:24:17 -0800 (PST)
Received: from stl-av-01.boeing.com (stl-av-01.boeing.com [192.76.190.6]) by slb-smtpout-01.ns.cs.boeing.com (8.14.0/8.14.0/8.14.0/SMTPOUT) with ESMTP id o18JPEXk007448 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=FAIL); Mon, 8 Feb 2010 11:25:17 -0800 (PST)
Received: from stl-av-01.boeing.com (localhost [127.0.0.1]) by stl-av-01.boeing.com (8.14.0/8.14.0/DOWNSTREAM_RELAY) with ESMTP id o18JPEYB004428; Mon, 8 Feb 2010 13:25:14 -0600 (CST)
Received: from XCH-NWHT-08.nw.nos.boeing.com (xch-nwht-08.nw.nos.boeing.com [130.247.25.112]) by stl-av-01.boeing.com (8.14.0/8.14.0/UPSTREAM_RELAY) with ESMTP id o18JPDxL004412 (version=TLSv1/SSLv3 cipher=RC4-MD5 bits=128 verify=OK); Mon, 8 Feb 2010 13:25:14 -0600 (CST)
Received: from XCH-NW-01V.nw.nos.boeing.com ([130.247.64.120]) by XCH-NWHT-08.nw.nos.boeing.com ([130.247.25.112]) with mapi; Mon, 8 Feb 2010 11:25:13 -0800
From: "Templin, Fred L" <Fred.L.Templin@boeing.com>
To: Robin Whittle <rw@firstpr.com.au>, RRG <rrg@irtf.org>
Date: Mon, 08 Feb 2010 11:25:12 -0800
Thread-Topic: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware
Thread-Index: Acqoe+YotMLqVBvGQIeprOXISOZN/AAdA87A
Message-ID: <E1829B60731D1740BB7A0626B4FAF0A64951037EDB@XCH-NW-01V.nw.nos.boeing.com>
References: <4B5ED682.8000309@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F33198@XCH-NW-01V.nw.nos.boeing.com> <4B5F8E7E.1090301@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F332A8@XCH-NW-01V.nw.nos.boeing.com> <4B5FC783.4030401@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950F3333F@XCH-NW-01V.nw.nos.boeing.com> <4B6103C8.6090307@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950FEC1D3@XCH-NW-01V.nw.nos.boeing.com> <4B6473E5.1000508@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950FEC98C@XCH-NW-01V.nw.nos.boeing.com> <4B698565.8030301@firstpr.com.au> <E1829B60731D1740BB7A0626B4FAF0A64950FECB19@XCH-NW-01V.nw.nos.boeing.com> <4B6F9AD9.7020508@firstpr.com.au>
In-Reply-To: <4B6F9AD9.7020508@firstpr.com.au>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware
X-BeenThere: rrg@irtf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: IRTF Routing Research Group <rrg.irtf.org>
List-Unsubscribe: <http://www.irtf.org/mailman/listinfo/rrg>, <mailto:rrg-request@irtf.org?subject=unsubscribe>
List-Archive: <http://www.irtf.org/mail-archive/web/rrg>
List-Post: <mailto:rrg@irtf.org>
List-Help: <mailto:rrg-request@irtf.org?subject=help>
List-Subscribe: <http://www.irtf.org/mailman/listinfo/rrg>, <mailto:rrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Mon, 08 Feb 2010 19:24:20 -0000

Hi Robin,

> -----Original Message-----
> From: Robin Whittle [mailto:rw@firstpr.com.au]
> Sent: Sunday, February 07, 2010 9:02 PM
> To: RRG
> Cc: Templin, Fred L
> Subject: Re: [rrg] SEAL critique, PMTUD, RFC4821 = vapourware
>
> Hi Fred,
>
> I am replying to your (msg05937).  I will soon write, in a different
> thread, my understanding of SEAL - which I hope you will respond to
> with any necessary corrections.
>
> You wrote:
>
> Regarding some IPv4 or IPv6 routers sending a PTB with a zero value
> in the 16 bit field for MTU (IPv4) or the 32 bit field for MTU in IPv6:
>
> >> Are there really routers in use which don't include the 16 bit value?
> >>
> >> Even if there theoretically are not, SEAL still has to cope with the
> >> possibility of zero or unreasonable values in this MTU field.
> >
> > SEAL explicitly turns off PMTUD and uses its own tunnel
> > endpoint-to-endpoint MTU determination, so in the normal
> > case it does not expect to receive any ICMP PTBs from
> > routers within the tunnel.
>
> My understanding is that this is only true for IPv4, because the SEAL
> ITE (Ingress Tunnel Endpoint) sends packets with DF=0 to the ETE
> (Egress Tunnel Endpoint).  For IPv6, the ITE can get PTBs from
> routers in the tunnel since no packets are fragmentable.  So I think
> it would not be true to state that SEAL ITE "turns off" the
> traditional IPv6 RFC 1981 PMTUD mechanism when it tunnels packets to
> the ETE.

Yes, that's right. My mind has been so locked into the
IPv4 case that I forgot that IPv6 does not allow
fragmentation in the network. So, you are right that
IPv6 as the outer protocol requires RFC 1981 PMTUD
feedback from the network.

> > SEAL *can* enable PMTUD for certain "expendable" packets,
>
> I don't recall what these would be.

Out-of-band probes, for example.

> > and can benefit from any
> > ICMP PTBs coming from within the tunnel that contain
> > sufficient information. But, that would simply be an
> > optimization.
>
> Is there a mechanism for SEAL, in IPv4, to send these "expendable"
> packets with DF=1?

Yes; just set DF=1 in the outer IPv4 header and send it.

> >>> In some environments, it may be necessary to insert a
> >>> mid-layer UDP header in order to give ECMP/LAG routers
> >>> a handle to support multipath traffic flow separation.
> >>    http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing
> >>
> >> http://www.force10networks.com/CSPortal20/TechTips/0065_HowDoIConfigureLoadBalancing.aspx
> >>
> >> As far as I know, these techniques are not something to consider with
> >> the RANGER CES, or with LISP or Ivip.  If the routers can handle
> >> ordinary traffic packets they can handle encapsulated packets too.  I
> >> haven't read about these techniques in detail.  I guess that within
> >> RANGER, beyond its use as a CES scalable routing solution, you may
> >> want to support ECMP and LAG.
> >
> > There has been a great deal of talk about taking care
> > of ECMP/LAG routers within the network that only
> > recognize common-case protocols (i.e., TCP and UDP),
> > which is why LISP has locked into using UDP encaps.
>
> Do you expect this to be the case for IRON?  If so, then I guess that
> SEAL in IRON must always use this UDP header before the SEAL header -
> since no ITE could know for sure whether ECMP/LAG is in use on the
> path to the ETE.

Yes, I guess so.
>
> If anyone can point me to good references on ECMP used with LAG, I
> would really appreciate it.
>
>
> Regarding how the ITE manages a sliding window in the (random
> starting value, then monotonically incremented) 32 bit SEAL ID it
> uses for each ETE - so it can recognise valid values in PTBs or
> messages from the ETE, without having to cache each value used so
> far, and while keeping the window small enough to make it hard for an
> attacker to correctly guess a value which would be accepted:
>
> >>> The above is all correct wrt the window management. The
> >>> ITE can ensure that the window size remains bounded by
> >>> sending periodic explicit probes (e.g., one explicit
> >>> probe for every N data packets).
> >>
> >> I don't have a really clear idea of how SEAL sets the numeric window,
> >> or what you mean by your second sentence above.  If you can give a
> >> more complete description with an example, I would really appreciate
> >> it - in part because I might want to use or adapt your technique for
> >> Ivip, which currently uses nonces.
> >
> > What I am asking for in SEAL is that the ITE sets the
> > SEAL_ID in each packet in monotonically-incrementing
> > fashion (modulo 32). Then, on every Nth packet (e.g.,
> > 500th, 1000th, etc.) the ITE sets the "Acknowledgement
> > Requested" bit and the ETE (upon seeing the bit set)
> > sends back an explicit acknowledgement. The ITE can
> > then keep a window of outstanding SEAL_ID's by keeping
> > track of the most recently sent and most recently
> > acknowledged SEAL_IDs.
>
> OK, except for the possibility that the packets arrive out of order.
> For instance, let's say the random start point for the SEAL ID for
> this ITE sending packets to this ETE was 3,000,000,000.  On the 500th
> packet 3,000,000,499, the "Acknowledgement Requested" flag is set.
> However, let's say it is received at the ETE before the packet with
> SEAL-ID 3,000,000,498 is received.
>
> By the time the ITE gets the acknowledgement packet from the ETE, it
> has sent packets using SEAL IDs up to 3,000,000,510.
>
> The ITE will then refuse to accept as valid any messages from the ETE
> - or PTBs from routers in the tunnel path - with values in these
> ranges, working backwards around the modulo 2^32 circle:
>
>    Between 3,000,000,499 and 0 inclusive, or between
>    4,294,967,295 and 3,000,000,511 inclusive.
>
> But lets say the ETE then responds to the packet with SEAL ID
> 3,000,000,498, or that the ETE responded to this before sending the
> Acknowledgement, but that the response was delivered to the ITE after
> the acknowledgement.  Then this response will be considered invalid.
>
> There is at least one reason for lowering the value of "500": to
> reduce the proportion of random values in spoofed messages to the ITE
> which it would accept.  500 (~2^9) already increases the chance of a
> spoofed packet being accepted to 2^-23.
>
> A reason for increasing it is to reduce the workload on the ITE and
> ETE - though doing this once every 500 times doesn't seem excessive.
>
> An alternative arrangement to the above would be to implement in
> the ITE a delay function, with a time-constant of a few seconds - say
> 3 seconds - so that at any instant, the ITE could find for a given
> ETE what was the lowest SEAL ID it sent to this ETE within the last 3
> seconds.  By using this and the last value used (which would already
> be stored) then the ITE could maintain a relatively narrow window
> without the ETE having to acknowledge anything, and without the
> out-of-order packets problem I suggested above.
>
> This "3 second timer" function doesn't have to be particularly
> accurate.  If the effective time is between 3 and 4 seconds, then it
> would suffice to take a snapshot of ETE IP addresses and the current
> state of their SEAL ID counters, every second, and then stash the
> snapshots in one of 4 locations on a circular basis.  Then for the
> function which looks up the "SEAL ID last used ~3 seconds ago" simply
> find that ETE's SEAL ID value in the snapshot taken 3 to 4 seconds ago.

OK, that sounds good on the ITE side but what about the
ETE side? If the ETE is going to be tracking the SEAL_ID
for this ITE, can't it similarly keep a sliding window
based on the packets received within the last ~3sec?
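The snapshot scheme Robin describes can be sketched in a few lines. This is an illustrative sketch only - the class and method names are invented, not from the SEAL spec. The ITE keeps a small circular buffer of per-second SEAL_ID snapshots and accepts an ID only if it falls between the roughly-3-second-old snapshot and the most recently sent ID, working around the modulo-2^32 circle:

```python
MOD = 2**32  # SEAL_ID wraps modulo 2^32

class SealIdWindow:
    """Sketch of the snapshot-based sliding window described above.
    All names are invented for illustration."""

    def __init__(self, start_id, slots=4):
        self.last_id = start_id % MOD            # most recently used SEAL_ID
        self.snapshots = [self.last_id] * slots  # circular per-second buffer
        self.idx = 0

    def next_id(self):
        """Assign the SEAL_ID for the next packet sent to this ETE."""
        self.last_id = (self.last_id + 1) % MOD
        return self.last_id

    def tick(self):
        """Called once per second: overwrite the oldest snapshot slot."""
        self.idx = (self.idx + 1) % len(self.snapshots)
        self.snapshots[self.idx] = self.last_id

    def oldest(self):
        """SEAL_ID that was in use roughly 3-4 seconds ago."""
        return self.snapshots[(self.idx + 1) % len(self.snapshots)]

    def accepts(self, seal_id):
        """True iff seal_id lies between the ~3s-old snapshot and the
        most recently sent ID, modulo 2^32."""
        lo = self.oldest()
        span = (self.last_id - lo) % MOD
        return (seal_id - lo) % MOD <= span
```

Using Robin's numbers: starting at 3,000,000,000 and sending 500 packets, the window accepts IDs 3,000,000,000 through 3,000,000,499 and rejects everything else - with no acknowledgement traffic and no out-of-order problem.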

> >>> No, the sentence is correct. It is possible for the ITE
> >>> to need to reduce its cached S_MSS value to a size less
> >>> than 576 if there is truly a link with a small MTU (e.g.,
> >>> 256) on the path. Although 576 is often considered to
> >>> be the "nominal" minimum MTU for IPv4 links, the actual
> >>> minimum MTU is only 68 bytes per RFC791.
> >>
> >> OK - I guess I wouldn't go this far for Ivip.  I will probably have
> >> some guidance that if anyone has routers, tunnels or whatever with
> >> PMTUs less than 1280 or some figure in that range, for both IPv4 or
> >> IPv6, then they should make sure that ITRs or ETRs are not located so
> >> that such low PMTU parts of the network are between the DFZ and these
> >> ITRs or ETRs.
> >
> > I want to be able to use SEAL over truly constrained
> > links, such as certain wireless communications systems.
> > On such links, there may be a reason to set an
> > unusually small MTU.
>
> OK.  I will assume for IRON that SEAL wouldn't be required to perform
> heroics - to try to work with PMTUs below ~1200 or so.  I certainly
> don't want Ivip to have to cope with people putting ITRs and ETRs in
> places where there are MTUs less than this.  Such MTUs shouldn't
> exist in the DFZ - if an ISP is paying for such a link, they
> should get another one.  Likewise, there shouldn't be such low MTU
> restrictions in any self-respecting ISP or other network.
>
>
> >> Indeed you did - I had no idea things were this bad.
> >>
> >>   http://www.ietf.org/mail-archive/web/rrg/current/msg05907.html
> >>   http://www.ietf.org/mail-archive/web/rrg/current/msg05910.html
> >>
> >>>> I just think it wrong in principle to develop messy new protocols
> >>>> such as RFC 4821 to cope with these failings.
> >>>
> >>> In my opinion, packetization layers are operating "at risk"
> >>> if they use packet sizes larger than 1500 but are not in
> >>> some way checking with the final destination to ensure that
> >>> the big packets are actually getting through. RFC4821 is
> >>> a method for the source to do just that without requiring
> >>> any changes on the destination. But to be sure, SEAL does
> >>> not *depend* on RFC4821 but rather *sets the stage* for
> >>> RFC4821 and/or any functional equivalents.
> >>
> >> OK.  I need to revisit all my thinking on PMTUD given the results of
> >> your recent research.  But I haven't yet changed my thinking that RFC
> >> 4821 is messy and expensive, and that it would be better to
> >> straighten out whatever it is within the network which is stopping
> >> PMTUD from working correctly.
> >
> > The good thing about RFC4821 is that only the source
> > host needs to implement it, so there is a natural path
> > for incremental deployment. I guess there are other
> > possibilities as well (e.g., similar schemes that do
> > PMTUD black hole detection) so as long as the method
> > chosen causes no harm the source host is welcome to
> > implement whatever it sees fit.
>
> It is up to whoever writes TCP packetization code in host stacks how
> they want to set their packet length - and likewise application
> developers who need to set their own packet lengths.
>
> My view is that for IPv4, RFC 1191 PMTUD is an excellent system -
> except that the PTB message should be made to follow the RFC 1981
> requirement of sending back as much of the original packet as would
> not make the PTB exceed 576 octets:

How can a system with a going-in strategy of *throwing
away good data* be "excellent"?

>   http://tools.ietf.org/html/rfc1885#section-3.2
>
> At present, the PTB spec in RFC 1191 (in 1990) is to only send back
> the IPv4 header and the next 8 octets.  It needs to be like RFC 1981
> so that when there are nested tunnels, a PTB sent by a router in the
> innermost tunnel will contain enough of the original packet (IPv4
> header + 8 octets) for the outermost tunnel entry router to craft a
> valid PTB to the sending host, without having to cache anything from
> the original traffic packet.  This would occur by a chain of events:
>
>     A PTB from router in 3rd tunnel is recognised by the
>     ITE of that tunnel as valid, since it contains enough bytes
>     after the IPv4 header.  For instance, these bytes would include
>     the SEAL ID or any other extra stuff the router put into the
>     headers to enable it to accept only genuine PTBs and reject
>     ones spoofed by an off-path attacker.  So the ITE at the start
>     of the 3rd tunnel sends:
>
>     A PTB to the ITE of the 2nd tunnel, which is long enough, as
>     described above, to be validated and to contain enough of the
>     tunneled packet for the ITE of the 2nd tunnel to send:
>
>     A PTB to the ITE of the 1st (outer) tunnel.  This contains
>     enough of the original traffic packet (at least the IPv4
>     header plus 8 octets) for this ITE to be able to construct
>     a valid PTB to the sending host, without the need to cache
>     the initial part of the packet.  Likewise, the PTB, as
>     mentioned above, must contain whatever this ITE needs to
>     validate it as genuine.
>
> If the RFC 1191 designers had correctly anticipated the need for one
> or more levels of tunneling to support their PMTUD system, then I
> think they would have altered the PTB requirements to be as long as
> those for IPv6's RFC 1981.  Then we probably would have tunnels today
> which properly support RFC 1191 PMTUD.

But, if any one of those tunnels uses IPsec encryption
or the like there is no opportunity for performing the
necessary translation function. So if there were a
decent segmentation and reassembly capability it seems
like IPsec implementations would be wise to use it.
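To make the arithmetic behind Robin's nested-tunnel chain concrete, here is a hedged sketch of how many bytes of the offending packet a PTB would need to quote so that each tunnel entry, after stripping its own encapsulation, still leaves the next layer enough to work with. The header sizes are illustrative assumptions (20-byte IPv4 headers without options, an invented per-tunnel shim), not figures from the thread:

```python
IPV4_HDR = 20                    # minimum IPv4 header, no options (assumed)
ORIGINAL_NEEDED = IPV4_HDR + 8   # what the outermost ITE must recover
                                 # to craft a valid PTB for the host

def min_ptb_quote(encap_overheads):
    """Minimum bytes of the offending packet a router's PTB must quote.

    encap_overheads: per-tunnel bytes added between each outer IPv4
    header and the packet it carries, innermost tunnel first.
    Each nesting level adds one outer IPv4 header plus its shim.
    """
    needed = ORIGINAL_NEEDED
    for overhead in encap_overheads:
        needed += IPV4_HDR + overhead
    return needed
```

With no tunnels this reproduces the RFC 1191 minimum of 28 bytes; three nested tunnels each adding a hypothetical 8-byte shim push the requirement to 112 bytes - well past what RFC 1191 routers send, which is Robin's point.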

> Also, I think that DF=0 packets should be deprecated - unless perhaps
> they are shorter than some constant such as 1200 bytes or so.  I
> think it would be bad to expect ITRs and ETRs and the whole CES
> system to work over paths with MTUs below this.  People shouldn't use
> such short PMTU links in the DFZ and shouldn't place their ITRs or
> ETRs anywhere where there are such short PMTU links between them and
> the DFZ.

DF=0 has two benefits - it can allow good data to
get through in cases where DF=1 would have dropped
the data, and it can allow MTU indication through
to the ETE which can report back to the ITE.

> My view is that for IPv6, RFC 1981 is an excellent system.

How can a system that places blind faith in the network
be "excellent"?

> From your research (msg05910), it seems that the current state of
> PMTUD in IPv4 is a shambles - with some networks blocking PTBs, some
> tunnels (or combinations of tunnels) not generating PTBs and with
> some hosts ignoring PTBs, or not responding properly to them.  Also
> some hosts send DF=0 packets of 1470 bytes (Google at least).
>
> As far as I know, everything generally works because many hosts are
> configured not to send packets long enough to run into PMTU problems.

Agree.

> From the current basis, there's no way we can generally adopt
> jumboframe paths in the DFZ as they appear.

Also agree.

> Nor is there a way of introducing a tunneling-based CES architecture
> which relies for its PMTUD on PTBs.  My IPTM approach and I think
> your SEAL approach should be able to cope without relying on PTBs
> from within the tunnel (but see my forthcoming message).  But what if
> the ITRs (ITEs) can correctly sense the PMTU to the ETRs (ETEs) and
> are unable to alter the sending host's packet lengths?
>
> This could be due to:
>
>   1  A PTB sent by the ITR is dropped by some filtering system
>      before it can get to the SH.  This seems more likely if
>      the ITR is outside the ISP or end-user network where the
>      SH is located, than within it.
>
>      If people filter PTBs from entering their system, or use an
>      ISP which does the same, this is their own fault.
>
>      The trouble is, they get away with it now, because the packets
>      their hosts send are generally short enough not to run into MTU
>      problems.  Unfortunately, such networks will perceive the
>      difficulties resulting from their choices as being caused by
>      sending packets to a host with an SPI ("edge") address in the
>      CES architecture - and may not think it is their own filtering
>      which is causing the trouble.
>
>   2  The SH ignoring or responding incorrectly to the PTB.
>
>      As above - they get away with it now, and would perceive the
>      problem as being caused by the destination network which
>      is using the CES system's "edge" space.

Cases 1 and 2 are a problem of the end site and not of
the ITE. If the ITE as an edge router of the site is
sending PTBs and the source host is either not
getting them or not responding correctly, then the end
site has to find the problems and fix them.

>   3  The SH sends DF=0 packets which are too long, after
>      encapsulation for some, many or all paths to ETRs.
>
>      Again, as above, they get away with it now - but would blame
>      the CES system, or rather the destination network which they
>      may not know has adopted the "edge" space provided by the
>      CES system.
>
>      So does a CES system have to fragment every such packet?
>      It seems so.

The CES needs to select a "safe" size for performing inner
fragmentation while not choosing one so excessively small
as to invoke inner fragmentation very often.
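A minimal sketch of that size selection, with invented parameter values - the 1500-byte path guess, 28-byte encapsulation overhead, and 1200-byte floor are assumptions for illustration, not figures from SEAL:

```python
def safe_inner_fragment_size(path_mtu_guess=1500, encap_overhead=28,
                             floor=1200):
    """Pick a 'safe' size for inner fragmentation of oversized DF=0
    packets: small enough that each fragment fits a conservative path
    MTU after encapsulation, but never below a floor that would force
    fragmentation of nearly every packet."""
    size = path_mtu_guess - encap_overhead
    return max(size, floor)
```

With the defaults this yields 1472-byte fragments; only a path guess below 1228 would hit the floor.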

> I think that to implement defensive, complex protocols such as RFC
> 4821 would be to accept and allow all these bad practices, and would
> forever doom us to having to do extra work, and suffer extra
> flakiness, just because of these bad practices.
>
> RFC 4821 will always be a slower and less accurate method of
> determining PMTU to a given host than RFC 1191 or RFC 1981.  It would
> be subject to choosing a lower than proper value, if there was an
> outage for a while and it interpreted this as a PMTU limitation.

My belief is that SEAL used correctly has a chance
to establish a minimum "Internet cell size" of 1500.
Then, if end systems adopt the strategy of "use
classic PMTUD for packets no larger than 1500 and
use RFC4821 or equivalent for packets larger than
1500" then we would have a path to an MTU-clean
Internet that can scale to any future packet sizes.
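That end-system strategy reduces to a one-line decision; the function name and the fixed 1500-byte "cell size" are assumptions for illustration:

```python
def choose_pmtud_method(packet_size, cell_size=1500):
    """Sketch of the strategy above: trust classic (RFC 1191/1981)
    PMTUD for packets no larger than the assumed 1500-byte Internet
    cell size, and use RFC 4821-style end-to-end probing for
    anything larger."""
    if packet_size <= cell_size:
        return "classic-pmtud"   # rely on PTBs from routers
    return "rfc4821-probing"     # verify delivery with probes
```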

Fred
fred.l.templin@boeing.com

>  - Robin