[rrg] IRON: SEAL summary

Robin Whittle <rw@firstpr.com.au> Mon, 08 February 2010 09:35 UTC


This is a summary of my current probably imperfect understanding of
the parts of Fred Templin's SEAL tunneling protocol which are
important for his scalable routing proposal: RANGER (now IRON).

Please consult any discussion which follows - I wrote this without
checking it first with Fred, so he will no doubt provide corrections.

At the end I compare this understanding of SEAL with my IPTM
arrangement for Ivip, and with the non-stateful and stateful
approaches of LISP.


  - Robin



Fred's "RANGER" scalable routing proposal is now known as IRON:

   The Internet Routing Overlay Network (IRON)
   http://tools.ietf.org/html/draft-templin-iron-00

The proposal as currently described:
   http://tools.ietf.org/html/draft-irtf-rrg-recommendation-04#section-16

is a particular application of Fred's more general-purpose RANGER:

   http://tools.ietf.org/html/draft-templin-ranger-09

which itself is based on ISATAP (RFC 5214).  RANGER, and therefore
IRON, uses SEAL for tunneling with Path MTU Discovery management
functions.

Next I will write up my understanding of IRON.


The latest SEAL ID is:

  http://tools.ietf.org/html/draft-templin-intarea-seal-08

Perhaps SEAL is capable of tunneling IPv4 packets in IPv6 tunnels and
vice-versa, but I am only interested in SEAL for pure IPv4 and pure
IPv6, since this is how I am trying to understand IRON.

SEAL is intended to be suitable for the extremely ad-hoc tunneling
arrangements found in Core-Edge Separation solutions to the routing
scaling problem, where an ITR may have a sudden need to tunnel one or
more packets to an ETR it has never had anything to do with.

In SEAL, the router (or some function programmed into a server,
rather than a conventional router) which accepts packets and tunnels
them is known as the ITE (Ingress Tunnel Endpoint).  The router or
other device which the tunnel reaches to, and which decapsulates the
inner packet, is the ETE (Egress Tunnel Endpoint).

Please see the discussions between Fred and me in the thread: "SEAL
critique, PMTUD, RFC4821 = vapourware".

  http://www.ietf.org/mail-archive/web/rrg/current/msg05816.html RW
  http://www.ietf.org/mail-archive/web/rrg/current/msg05834.html   FT
  http://www.ietf.org/mail-archive/web/rrg/current/msg05843.html RW
  http://www.ietf.org/mail-archive/web/rrg/current/msg05902.html   FT
  http://www.ietf.org/mail-archive/web/rrg/current/msg05924.html RW
  http://www.ietf.org/mail-archive/web/rrg/current/msg05927.html   FT
  http://www.ietf.org/mail-archive/web/rrg/current/msg05976.html RW


SEAL for IPv4 and IPv6
----------------------

SEAL is capable of "segmenting" (fragmenting, but within the SEAL
protocol, rather than by using IPv4 or IPv6 fragmentation mechanisms)
packets which are known to be too long for the PMTU to a given ETE.
However, this is not intended to be used with IRON.

SEAL is intended to tunnel packets to an ETE, without any need to
establish tunneling arrangements, without expecting acknowledgements
of successful receipt by the ETE and without resending any packets.
The tunnel is a one-way (ITE to ETE), ad-hoc, arrangement - but the
ITE stores some state for each particular ETE it is tunneling to.

SEAL ITEs do not cache any part of the packets they send.  So in
order to generate a PTB to the sending host (which may itself be
another router, if this SEAL tunnel is already within an outer tunnel
- which I think is not the case with IRON) the ITE relies on
receiving enough of the packet from (IPv6-only) a PTB from a limiting
router on the path to the ETE, or (IPv4-only) a "Fragmentation
Experienced" message from the ETE itself.

I assume that for IRON, there will be no need for "mid-level headers"
such as UDP between the outer header and the SEAL (or IPv6
Fragmentation) header - but see msg05927 and msg05976 and search for
"ECMP".


There is some pretty confusing text about the Tunnel Interface MTU -
4.3.1.  One part discusses setting this to 1500 bytes or more.  Later
in that paragraph there is discussion of setting it to smaller values
than this. The next paragraph discusses setting it to an infinite
value.  I hope Fred will provide some guidance on how this would be
done for IRON.

Each ITE maintains some state for each ETE it tunnels packets to (by
each ETE IP address).  I guess there would be a process for deleting
this after a while if no packets are sent to that ETE, since over
time, the state could grow to a considerable size, depending on how
many ETEs there are in the world.

    SEAL-ID most recently used.  A 32 bit value which is initialised
    to some random value, and then incremented modulo 2^32 every time
    a packet is tunneled to this particular ETE.

    Further state as required to implement a window function or
    some other arrangement by which this ITE can test a SEAL-ID
    in an incoming message including at least these:

       PTB from a router in the tunnel path (IPv6 only).
       SEAL Packet Too Big message from ETE (Not needed in IRON?).
       SEAL "Fragmentation Experienced" (IPv4 only).

    Please see the last two messages listed above for Fred's approach
    to doing this and my critique and timer-based suggestion.
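
A minimal Python sketch of the per-ETE SEAL-ID state described above.
The class name, the window size and the window test itself are my
assumptions for illustration only - they are not taken from the SEAL ID.

```python
import random

MOD = 2 ** 32


class EteState:
    """Hypothetical per-ETE state held by an ITE (names are illustrative)."""

    def __init__(self, window=1024, rng=None):
        rng = rng or random.Random()
        # SEAL-ID is initialised to a random 32 bit value.
        self.seal_id = rng.randrange(MOD)
        self.window = window  # assumed size of the acceptance window
        self.sent = 0         # packets tunneled so far, capped at the window

    def next_seal_id(self):
        """Increment SEAL-ID modulo 2^32 and return the value to send."""
        self.seal_id = (self.seal_id + 1) % MOD
        self.sent = min(self.sent + 1, self.window)
        return self.seal_id

    def is_recent(self, reported_id):
        """True if reported_id is one of the most recently used SEAL-IDs."""
        age = (self.seal_id - reported_id) % MOD
        return age < self.sent
```

An ITE built this way would call is_recent() on the SEAL-ID recovered
from an incoming PTB or "Fragmentation Experienced" message and ignore
the message if the test fails.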

Other items of state are listed in 4.3.3:

    MHLEN   Constant mid-level header length.  In this attempt to
            describe SEAL, I will assume there is no need for these
            mid-level headers - between the outer IP header and the
            SEAL (or IPv6 fragmentation) header which precedes the
            encapsulated packet.  So I will assume this = 0.

    HLEN    Constant outer header length: 20 for IPv4 and 40 for
            IPv6 plus the length of the SEAL header (IPv4 only -
            8 bytes) or IPv6 Fragment Header (8 bytes).  So these
            are constants:

               IPv4 28           IPv6 48

    S_MSS   Variable.  SEAL Maximum Segment Size.  I think this is
            initialised to the value of (the ID says "no larger
            than") the MTU of the "underlying IP interface".  I guess
            this means that the ITE has a single interface for
            sending out encapsulated packets on.  (To be pernickety,
            it could be pointed out that the ITE might be a multiple
            interface router, with different MTUs on each, and that
            the best path to a given ETE might change from one
            interface to another.)

            According to the ID, this value may be adjusted upwards
            and downwards based on received SEAL Reassembly Report
            messages.

            However, I think these are not of interest in IRON - and
            that it is other messages which alter this value which
            we need to consider:

              IPv4: SEAL "Fragmentation Experienced" from the ETE.

              IPv6: PTB from a router in the tunnel path.

    S_MRU   Variable.  SEAL Maximum Reassembly Unit.  Initialised to
            "infinity", but the effective value of S_MRU is never
            more than 256 * S_MSS.  (Since S_MSS can rise and fall,
            this means there are really two items of state: one the
            limit and the other the effective value, which may be
            increased or decreased according to S_MSS * 256, as long
            as it remains less than or equal to the limit value.)
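
The arithmetic behind HLEN and the effective S_MRU can be sketched as
follows.  This assumes MHLEN = 0 and the 8 byte SEAL / Fragment header
length used in this summary; the function names are hypothetical.

```python
SEAL_OR_FRAG_HDR = 8                  # bytes, as assumed in this summary
OUTER_HDR = {"ipv4": 20, "ipv6": 40}  # outer IP header lengths


def hlen(family, mhlen=0):
    """Outer header plus SEAL/Fragment header plus mid-level headers."""
    return OUTER_HDR[family] + SEAL_OR_FRAG_HDR + mhlen


def effective_s_mru(s_mru_limit, s_mss):
    """Effective S_MRU: never more than 256 * S_MSS, nor the stored limit."""
    return min(s_mru_limit, 256 * s_mss)
```

For example, with the limit still at "infinity" and S_MSS = 1500, the
effective S_MRU is 256 * 1500 = 384000 bytes.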


SEAL for IPv4
-------------

DF=0 packets will be fragmented with standard IPv4 techniques before
other processing, if they are longer than:

   (MIN(S_MRU, S_MSS) - 28)

The fragments will be of this length and will be tunneled as
described below.

DF=1 packets which are longer than:

   (S_MRU - 28)

will result in a PTB being sent to the Sending Host (SH).  Apart from
being used to generate that PTB, the packet will be dropped.

Encapsulation involves:

  IPv4 outer header   With the 16 bit Identification field set to
                      the 16 most significant bits of SEAL-ID.
                      DF = 0, so the packet is fragmentable by
                      any router which finds the packet is too long
                      for its next-hop MTU.

  SEAL header (32 bits) as described below.

  The original traffic packet.

The SEAL header (only for IPv4) has a 16 bit field "ID-Extension"
which is set to the least significant 16 bits of SEAL-ID.  Section
4.2 discusses this.  I haven't figured out exactly how the other bits
would be set.
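
The split of the 32 bit SEAL-ID across the two 16 bit fields is simple
bit manipulation.  A small sketch, with hypothetical function names:

```python
def split_seal_id(seal_id):
    """Return (IPv4 Identification, SEAL ID-Extension) for a 32 bit SEAL-ID."""
    ident = (seal_id >> 16) & 0xFFFF   # 16 most significant bits
    id_ext = seal_id & 0xFFFF          # 16 least significant bits
    return ident, id_ext


def join_seal_id(ident, id_ext):
    """Rebuild the 32 bit SEAL-ID from the two 16 bit fields."""
    return ((ident & 0xFFFF) << 16) | (id_ext & 0xFFFF)
```

An ITE could use join_seal_id() on the two fields recovered from a
returned packet fragment to get back the SEAL-ID it originally sent.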

If the packet arrives intact, then the ETE decapsulates it and
forwards the traffic packet to wherever it needs to go.  In IRON,
most of the time, there is a single tunnel to the IRON router which
directly connects to the end-user network.  However initial packets
go to an IRON router (the VP router) which is typically not the
router which connects to the end-user network - so then the packet
would be decapsulated and tunneled in the same manner to the IRON
router which connects to the destination network.

If the packet is too long for one router in the ITE -> ETE path, then
that router will fragment the whole packet and the ETE will (usually)
receive the first and other fragments.  The first fragment should be
as long as the limit imposed at that router by the next-hop MTU.  If
there were two or more MTU limits, such as 1400 and then 1300, the
fragment generated at the first limiting router would be 1400 bytes
long, and this would be fragmented at the second router, so the first
fragment would be 1300 bytes.

In this way, the ETE discovers the PMTU of the ITE -> ETE tunnel path.

The ETE does not attempt to deliver the packet, but sends a
"Fragmentation Experienced" message to the ITE.  See 4.3.9.1.2 and
4.4.5.1.2.  This message is defined in Figure 6 and contains as much
of the first fragment as would make the total message 576 bytes in
length.  There are 20 bytes for the IPv4 header and 16 for this
particular SEAL header, so up to 540 bytes of the first fragment will
be sent back to the ITE.  There is no acknowledgement of reception by
the ITE.
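
The size arithmetic for the "Fragmentation Experienced" message, as I
understand it, in a short sketch.  The constant and function names are
mine, not the ID's.

```python
FE_TOTAL = 576     # total length of the message, bytes
IPV4_HDR = 20      # IPv4 header of the message
FE_SEAL_HDR = 16   # SEAL header of this particular message


def fe_quoted_bytes(first_fragment_len):
    """How many bytes of the first fragment are returned to the ITE."""
    return min(first_fragment_len, FE_TOTAL - IPV4_HDR - FE_SEAL_HDR)
```

So a 1300 byte first fragment has its first 540 bytes returned, while
a fragment shorter than 540 bytes is returned whole.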

The ITE can authenticate this message by looking at the 16 bit ID in
the IPv4 header of the enclosed portion of the first fragment - and
the 16 bits of "ID-Extension" in the SEAL header in that fragment.

The S_MSS field of this message is set to the length of the first
fragment, which is (or should be, and is assumed to be) the PMTU of
the ITE -> ETE path.

The S_MRU field may be set to zero.  As far as I know S_MRU is how
big a packet the ETE is prepared to reconstruct from packets which
are fragmented with IPv4's native fragmentation - but it may also
apply to the use of SEAL's internal segmentation system, which I
understand is not generally used for IRON.

There is a separate SEAL PTB message (4.4.5.1.3) which as far as I
know, is not relevant to IRON - since I think it concerns SEAL's
internal "segmentation" system.


I understand that the reason Fred uses DF=0 packets for IPv4 tunnels
is that IPv4 routers, when sending back a PTB, are required (by RFC
1191) to only return the IPv4 header and the next 8 bytes.  This is
enough for the ITE to authenticate the PTB as genuine, since these 8
bytes are the SEAL header. With the IPv4 header, the ITE can
therefore see the full 32 bit SEAL-ID.  However, there is nothing of
the original traffic packet in the PTB, so the only way the ITE could
generate a PTB to the SH would be to have cached the first 28 bytes
of the original traffic packet.  To avoid the need for caching, and
so as not to rely on PTBs from routers in the ITE -> ETE tunnel, SEAL
ITEs tunnel IPv4 packets using DF=0 packets.

The SEAL header used in this IPv4 process contains a flag which, when
set, signals that the ETE should report the successful reception of
the packet to the ITE.  I am not sure to what extent this would be
used for IRON.


SEAL for IPv6
-------------

IPv6 packets (effectively always DF=1, since IPv6 routers never
fragment) which are longer than:

   (S_MRU - 48)

will result in a PTB being sent to the Sending Host (SH).  Apart from
being used to generate that PTB, the packet will be dropped.

When encapsulating IPv6 packets, the SEAL ITE does not use a SEAL
header.  It uses the IPv6 Fragment Header:

  http://tools.ietf.org/html/rfc2460#section-4.5

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Next Header  |   Reserved    |      Fragment Offset    |Res|M|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Identification                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The single packet does not contain a "fragment" - this is just using
the header for another purpose.  This header contains a 32 bit
"Identification" field, which is set to the SEAL-ID value.

If the packet does not encounter any MTU limits and arrives at the
ETE, then the ETE decapsulates it and does whatever it needs to do
with it - deliver it directly to the end-user network or perhaps (if
the ETE is the VP IRON router) tunnel it again to an IRON router
which will deliver it to the end-user network.

If the packet hits an MTU limit, then the limiting router will send
back a PTB to the ITE, and since this is IPv6, the PTB will contain
enough of the packet for the ITE to construct a valid PTB for the SH.

There is no facility, as there was with IPv4, for the ITE to request
the ETE to confirm that the packet arrived.
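
The use of the IPv6 Fragment Header described in this section can be
sketched with Python's struct module.  The field layout is that of
RFC 2460 section 4.5; the function names are hypothetical and the Next
Header value is left to the caller, since I have not checked which
value SEAL specifies.

```python
import struct

# Next Header (8), Reserved (8), Offset/Res/M (16), Identification (32)
FRAG_HDR_FORMAT = "!BBHI"


def build_fragment_header(next_header, seal_id):
    """Fragment Header for an unfragmented packet carrying SEAL-ID."""
    offset_res_m = 0  # Fragment Offset = 0 and M = 0: not really a fragment
    return struct.pack(FRAG_HDR_FORMAT, next_header, 0, offset_res_m,
                       seal_id & 0xFFFFFFFF)


def read_identification(header):
    """Recover the 32 bit Identification (SEAL-ID) from the header."""
    return struct.unpack(FRAG_HDR_FORMAT, header[:8])[3]
```

The ITE would apply read_identification() to the Fragment Header found
in the packet quoted by an incoming PTB in order to check the SEAL-ID.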


Rejecting larger packets directly
---------------------------------

Once the ITE has established a value for S_MSS for a given ETE, then
any packet it needs to tunnel to that ETE will be rejected with a PTB
to the SH if, once it was encapsulated, it would be too long for the
S_MSS of this ETE.   So if a DF=1 IPv4 packet arrives with a length
longer than (S_MSS - 28), it will be rejected with a PTB.  Likewise
an IPv6 packet longer than (S_MSS - 48).

As far as I know, all IPv4 DF=0 packets longer than (S_MSS - 28) will
be fragmented into IPv4 fragments of this length (with the final one
potentially - typically - being shorter) and then the fragments will
be tunnelled.  At the ETE, each fragment will be forwarded to the
destination network and then host (or tunnelled to another IRON
router which will forward them to the destination network and then
host - and the destination host will reassemble the fragments using
the standard IPv4 system.)

The ITE can't reject any DF=0 packets.  It can however learn the PMTU
to the ETE by the ETE reporting the size of any fragments which
result from the tunneled versions of the IPv4 native fragments.  That
will enable the ITE to use IPv4 native fragmentation on subsequent
DF=0 packets to this ETE in order that they will fit within the newly
discovered PMTU limitation, which is stored in this ETE's S_MSS variable.
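
A sketch of the ITE's decision for each arriving traffic packet, as I
understand it.  It assumes SEAL segmentation is not used (as I expect
for IRON), so the limit is the smaller of S_MSS and S_MRU minus HLEN.
The function names and the returned strings are hypothetical.

```python
def ite_ipv4_decision(length, df, s_mss, s_mru=float("inf"), hlen=28):
    """Return (action, limit) for an IPv4 traffic packet of this length."""
    limit = min(s_mss, s_mru) - hlen
    if length <= limit:
        return "encapsulate", limit
    if df:
        # DF=1: drop the packet and send a PTB to the SH.
        return "reject-with-ptb", limit
    # DF=0: cannot be rejected, so fragment natively, then tunnel.
    return "fragment-then-encapsulate", limit


def ite_ipv6_decision(length, s_mss, s_mru=float("inf"), hlen=48):
    """Return (action, limit) for an IPv6 traffic packet of this length."""
    limit = min(s_mss, s_mru) - hlen
    action = "encapsulate" if length <= limit else "reject-with-ptb"
    return action, limit
```

With S_MSS = 1400, a 1400 byte DF=1 IPv4 packet is rejected, since the
limit is 1400 - 28 = 1372, while a 1500 byte DF=0 packet is fragmented
before encapsulation.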


Adapting to changed PMTUs
-------------------------

I am not sure if or where it is specified, but I understand that SEAL
will allow exploration of increased PMTUs.  According to RFC 1191 and
therefore RFC 1981, the SH can try sending a longer packet than it
has been told to send by a previous PTB, after 10 minutes have elapsed.

I assume that if the ITE is only handling DF=0 packets, or at least
that if there are no DF=1 packets which are longer than the current
limitation (S_MSS - 28) and that if DF=0 packets keep arriving,
requiring native IPv4 fragmentation before encapsulation (that is,
the DF=0 packets are longer than (S_MSS - 28)) then after 10 minutes
or so, the ITE should explore sending larger packets into the tunnel.

Except when lost packets prevent it, the SEAL ITE will instantly
discover a reduced PMTU to a given ETE and so reduce that ETE's S_MSS
value, due to:

  IPv4:    Limiting router fragments the DF=0 tunnel packet and
           the ETE reports to the ITE the length of the first
           fragment, which is the new PMTU limit of this path.

           If the original packet was DF=1, the ITE will generate
           a suitable PTB to the SH.

  IPv6:    A PTB arrives at the ITE, and the ITE sends a PTB to the
           SH.



Jumboframes
-----------

("Jumbograms" refer to IPv6 packets with special formats so they can
be longer than 2^16 bytes, and can be as long as 4 gigabytes.
Neither SEAL, Ivip's IPTM, nor any CES proposal attempts to deal with
these.)

As far as I understand how SEAL would work, SEAL will be able to
smoothly adapt to the appearance of jumboframe ~9k byte paths between
ITEs and ETEs.


Coping with blocked or missing PTBs
-----------------------------------

I understand that for IPv4, between the ITE and ETE, SEAL does not
use DF=1 packets and so doesn't rely at all on PTBs.   Sometimes, I
think, the DF=0 arrangement would produce faster results than by
using PTBs as is done with IPv6. Other times it might be slower - it
depends on the location and nature of the MTU limits along the tunnel.

It does rely on the limiting router(s) producing a first fragment
which is the same length as the limiting PMTU of the ITE -> ETE path
- which seems reasonable.

For IPv6, I understand that SEAL relies on PTBs being sent by routers
in the ITE -> ETE tunnel path - and these being received by the ITE.

My impression is that there are problems today with some tunnels or
combinations of tunnels not generating PTBs.  Also, some networks
apparently stop the reception of PTBs from the DFZ, so this would
probably prevent SEAL's PMTUD from working.  As far as I know, the
SEAL ITE doesn't have a way with its handling of ordinary IPv6
traffic packets to request that the ETE acknowledge their successful
reception.


For both IPv4 and IPv6, if the SH ignores PTBs, or if there is some
filtering between the ITE and the SH which drops PTBs, then there's
no way the SH is going to adapt its packets to the lengths required
by SEAL to send them without fragmentation, segmentation or whatever.

As far as I know, SEAL does have "segmentation" capabilities - but
these are not intended to be used within the IRON CES architecture.
So SEAL's segmentation is no solution to this kind of failure of
PMTUD.  There's nothing to be done about this - the fault is in the
SH and/or the filtering, so these need to be fixed.




Comparison with Ivip's IPTM
---------------------------

Here is a rough comparison between my understanding of SEAL and my
IPTM approach to PMTUD tunneling when Ivip uses encapsulation:

    http://www.firstpr.com.au/ip/ivip/pmtud-frag/

  IPTM involves the ITRs caching enough of the IPv4 traffic packet
  to generate a valid PTB to the SH.  This is more expensive than
  SEAL.  However, this is only done when sending a B and A pair of
  packets, which is only when the length (after encapsulation) is
  in the Zone of Uncertainty.  Once this Zone is reduced to zero,
  there's no need to use the IPTM protocol, since traffic packets
  are either encapsulated or rejected with a PTB.

  Maybe IPTM doesn't require caching of part of the IPv6 traffic
  packet, since the PTBs should contain enough of it to generate
  a PTB.  However, IPTM can work even if no PTBs are received.
  It is probably best for the ITR to cache the first 540 bytes or
  so of those IPv6 traffic packets which are used in this
  probing of PMTU.

  IPTM uses the same protocol for both IPv6 and IPv4: a dual
  packet arrangement in which, if both are received, the
  traffic packet will be delivered and the ITR will learn of this
  and so be able to raise the lower limit of the range of possible
  PMTU values to this ETR, so reducing the Zone of Uncertainty.
  Specifically, IPTM does not use DF=0 packets in the tunnel, like
  SEAL does.
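
  A minimal sketch of the Zone of Uncertainty bookkeeping for one
  ETR.  The class and method names are hypothetical, and lowering the
  upper limit to one byte below a failed length is my assumption about
  how failures would be recorded.

```python
class ZoneOfUncertainty:
    """Range of possible PMTU values from an ITR to one ETR."""

    def __init__(self, lower, upper):
        self.lower = lower  # largest length known to be delivered
        self.upper = upper  # largest length which might be delivered

    def classify(self, encapsulated_len):
        """Decide how to handle a packet of this encapsulated length."""
        if encapsulated_len <= self.lower:
            return "encapsulate"
        if encapsulated_len > self.upper:
            return "reject-with-ptb"
        return "probe-with-iptm"

    def on_probe_result(self, encapsulated_len, delivered):
        """Narrow the zone according to the outcome of an IPTM probe."""
        if delivered:
            self.lower = max(self.lower, encapsulated_len)
        else:
            self.upper = min(self.upper, encapsulated_len - 1)

    def width(self):
        return self.upper - self.lower
```

  Once width() reaches zero, classify() never returns
  "probe-with-iptm", which matches the point that IPTM is only needed
  while the Zone is non-zero.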

  IPTM makes use of PTBs from within the tunnel, but if none
  arrive at the ITR, then the ITR can still (usually) determine
  whether the long B packet arrived or not at the ETR.  When
  longer packets do not arrive and shorter ones do, in the
  absence of PTBs, the ITR can try sending shorter packets until
  a size is found which is reliably delivered.  (This is one
  of the many things I need to work on when developing IPTM.)

  If the ITR steps downwards by 20 bytes per attempt, it may
  overshoot the actual PMTU limit somewhat, but it will find
  a value which works, and is usually not too far below the
  real PMTU limit - all without any reliance on PTBs within
  the tunnel.  As far as I know SEAL needs PTBs from the tunnel
  routers to work with IPv6 - but perhaps I don't understand
  this part of SEAL correctly.

  So for both IPv4 and IPv6, IPTM should be able to estimate
  the PMTU even if no PTBs are received.
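
  The step-down search can be sketched like this.  The delivers()
  callback stands in for "the ITR learned that a probe of this length
  arrived", and the 576 byte floor is a hypothetical stopping point.

```python
def step_down_search(start_len, delivers, step=20, floor=576):
    """Try shorter lengths until one is delivered; None if none works."""
    length = start_len
    while length >= floor:
        if delivers(length):
            return length
        length -= step
    return None
```

  With a real PMTU of 1430 and a start at 1500, the lengths tried are
  1500, 1480, 1460, 1440 and 1420, so the search settles on 1420 - 10
  bytes below the real limit.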


  IPTM will probably involve some limit on the maximum size of
  IPv4 DF=0 packet which will be allowed.  I suppose the ITR could
  fragment really large ones - if a host sent ~9k byte DF=0
  packets - but I think that sort of packet should not be
  tolerated or encouraged.

  IPTM will adapt to larger PMTUs by trying larger packets
  - assuming a SH sends them - after 10 minutes.

  IPTM differs very significantly from SEAL in that all SEAL's
  packets are capable of reporting a lower PMTU to the ITE.  This
  is not the case with Ivip and IPTM.  IPTM is only used when
  the ITR is unsure whether the traffic packet will run into an
  MTU limitation.  This would typically lead to the reduction and
  soon the elimination of the Zone of Uncertainty as successive
  attempts using IPTM either succeed or fail.

  Whenever an Ivip ITR tunnels a packet to an ETR and its
  length, once encapsulated (just 20 or 40 bytes extra, for IP in
  IP encapsulation) is no greater than the current minimum estimate
  for the PMTU to this ETR, then the ITR encapsulates the packet
  normally.  This means the outer header source address is that of
  the SH.  So the ITR would not get any resulting PTB.  The SH would
  get a PTB, but would not recognise it.

  This means that the ITR may not notice a drop in PMTU until after
  10 minutes, when it would retry sending longer packets and so
  discover the new lower value.

  This is something I need to work on.  One approach is to have
  the ITR perform IPTM on these traffic packets every few minutes.

  IPTM as currently described uses 32 bit nonces.  Perhaps I
  will change this to something like Fred's arrangement of
  a sequentially increasing value for each ETR, but starting
  from a randomly initialised value.

  IPTM's encapsulation overhead is 20 and 40 bytes for IPv4 and
  IPv6 respectively.  SEAL's is 28 and 48 - which is less than
  LISP's, since LISP has an 8 byte UDP header plus the 8 byte
  LISP header: 36 and 56 bytes for IPv4 and IPv6 respectively.
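
  The overhead figures above, as a small calculation.  The dictionary
  and function names are just for illustration.

```python
OUTER = {"IPv4": 20, "IPv6": 40}  # outer IP header, bytes
EXTRA = {
    "IPTM": 0,       # plain IP in IP encapsulation
    "SEAL": 8,       # SEAL header or IPv6 Fragment Header
    "LISP": 8 + 8,   # UDP header plus LISP header
}


def overhead(scheme, family):
    """Encapsulation overhead in bytes for a scheme and IP version."""
    return OUTER[family] + EXTRA[scheme]
```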


Comparison with LISP
--------------------

LISP has a non-stateful and a stateful approach to handling PMTUD
problems, neither of which is mandatory.

The non-stateful approach might work if the constant was set well
below 1500, such as 1400 or so (depending on the lowest PMTU from any
ITR to any ETR) but it will lock the whole system into this size
without any possibility of using higher values closer to 1550 bytes,
or jumboframe ~9k byte paths in the DFZ as these become available.

I think the stateful approach, from the OPENLISP project in Belgium:

  http://tools.ietf.org/html/draft-ietf-lisp-06#section-5.4.2

needs more work.

It doesn't mention how DF=0 packets would be handled, how the system
could work if PTBs were not received from the limiting tunnel router,
or how the ITR would explore larger values of PMTU after 10 minutes.