Re: [rrg] IRON: SEAL summary V2

Robin Whittle <rw@firstpr.com.au> Tue, 09 February 2010 03:32 UTC

Based on discussions with Fred (my recent messages), here is my
revised attempt at describing SEAL, at least as far as it is used in
the IRON-RANGER Core Edge Separation scalable routing proposal.

Please see the initial attempt:

  http://www.ietf.org/mail-archive/web/rrg/current/msg05977.html

for references to the source documents and older messages and for the
comparisons with Ivip's IPTM and LISP, which Fred and I discussed in
(msg05979) and (msg05981).

 - Robin


ITEs and ETEs
-------------

An Ingress Tunnel Endpoint (ITE) is part of the VET interface in an
IRON router.  It uses SEAL to encapsulate and tunnel packets to a
distant IRON router.  SEAL's approach to PMTUD enables the ITE router
to send Packet Too Big (PTB) messages to the Sending Host (SH).  The
Egress Tunnel Endpoint (ETE) is implemented in the VET interface
section of the distant IRON router.

When SEAL and RANGER are updated as Fred plans, the ETE router will
be able to send a SEAL message to the ITE router telling it to
redirect packets to a given prefix, to another IRON router.  This
will involve a caching time and the SEAL_ID from the SEAL header of
the tunnel packet which gave rise to the redirection.  This is an
important part of IRON - which I will attempt to describe in a
separate message.


SEAL for IPv4 and IPv6
----------------------

SEAL is capable of "segmenting" (fragmenting, but within the SEAL
protocol, rather than by using IPv4 or IPv6 fragmentation mechanisms)
packets which are known to be too long for the PMTU to a given ETE.
However, this is not intended to be used with IRON.

SEAL is intended to tunnel packets to an ETE, without any need to
establish tunneling arrangements, without expecting acknowledgements
of successful receipt by the ETE and without resending any packets.
The tunnel is a one-way (ITE to ETE), ad-hoc, arrangement - but the
ITE stores some state for each particular ETE it is tunneling to.

SEAL ITEs do not cache any part of the packets they send.  So in
order to generate a PTB to the sending host (which may itself be
another router, if this SEAL tunnel is already within an outer tunnel
- which I think is not the case with IRON) the ITE relies on
receiving enough of the original packet in either (IPv6 only) a PTB
from a limiting router on the path to the ETE, or (IPv4 only) a
"Fragmentation Experienced" message from the ETE itself.

I assume that for IRON, there will be no need for "mid-level headers"
such as UDP between the outer header and the SEAL header - but see
msg05927 and msg05976 and search for "ECMP".  (msg05980 and msg05981
indicate there is some confusion about UDP headers and where they
would go.)


There is some pretty confusing text about the Tunnel Interface MTU -
4.3.1.  One part discusses setting this to 1500 bytes or more.  Later
in that paragraph there is discussion of setting it to smaller values
than this. The next paragraph discusses setting it to an infinite
value.  I hope Fred will provide some guidance on how this would be
done for IRON.

Each ITE maintains some state for each ETE it tunnels packets to (by
each ETE IP address).  I guess there would be a process for deleting
this after a while if no packets are sent to that ETE, since over
time, the state could grow to a considerable size, depending on how
many ETEs there are in the world.

    SEAL-ID most recently used.  A 32 bit value which is initialised
    to some random value, and then incremented modulo 2^32 every time
    a packet is tunneled to this particular ETE.

    Further state as required to implement a window function or
    some other arrangement by which this ITE can test a SEAL-ID
    in an incoming message including at least these:

       PTB from a router in the tunnel path (IPv6 only).
       SEAL Packet Too Big message from ETE (Not needed in IRON?).
       SEAL "Fragmentation Experienced" (IPv4 only).

    Please see previous messages for Fred's approach to doing this
    and my critique and timer-based suggestion.
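
As a rough sketch of the SEAL-ID part of this per-ETE state: the
class and function names below are hypothetical, and the dictionary
keyed by ETE address is just one assumed way of holding the state.
Expiry of idle entries is not shown.

```python
import secrets

SEAL_ID_MODULUS = 2 ** 32


class EteState:
    """Hypothetical per-ETE record kept by an ITE."""

    def __init__(self, initial_id=None):
        # Start from a random 32 bit value unless one is supplied
        # (supplying one is only useful for testing).
        if initial_id is None:
            initial_id = secrets.randbits(32)
        self.seal_id = initial_id % SEAL_ID_MODULUS

    def next_seal_id(self):
        # Increment modulo 2^32 for every packet tunneled to this ETE.
        self.seal_id = (self.seal_id + 1) % SEAL_ID_MODULUS
        return self.seal_id


# State is kept per ETE IP address (assumed to be a plain string here).
state_by_ete = {}


def seal_id_for(ete_address):
    if ete_address not in state_by_ete:
        state_by_ete[ete_address] = EteState()
    return state_by_ete[ete_address].next_seal_id()
```

Each call to seal_id_for() for the same ETE address returns the
previous value plus one, wrapping from 2^32 - 1 back to 0.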

Other items of state are listed in 4.3.3:

    MHLEN   Constant mid-level header length.  In this attempt to
            describe SEAL, I will assume there is no need for these
            "mid-level" (?) headers - between the outer IP header and
            the SEAL header which precedes the encapsulated packet.
            So I will assume this = 0.

    HLEN    Constant outer header length: 20 for IPv4 and 40 for
            IPv6 plus the length of the SEAL header - 8 bytes I guess
            but Fred hasn't yet defined it.  It needs 4 bytes just
            for the SEAL_ID.

               IPv4 28           IPv6 48

    S_MSS   Variable.  SEAL Maximum Segment Size.  I think this is
            initialised to the value of (the ID says "no larger
            than") the MTU of the "underlying IP interface".  I guess
            this means that the ITE has a single interface for
            sending out encapsulated packets on.  (To be pernickety,
            it could be pointed out that the ITE might be a multiple
            interface router, with different MTUs on each, and that
            the best path to a given ETE might change from one
            interface to another.)

            According to the ID, this value may be adjusted upwards
            and downwards based on received SEAL Reassembly Report
            messages.

            However, I think these are not of interest in IRON - and
            that it is other messages which alter this value which
            we need to consider:

              IPv4: SEAL "Fragmentation Experienced" from the ETE.

              IPv6: PTB from a router in the tunnel path.

    S_MRU   Variable.  SEAL Maximum Reassembly Unit.  Initialised to
            "infinity", but the effective value of S_MRU is never
            more than 256 * S_MSS.  (Since S_MSS can rise and fall,
            this means there are really two items of state: one the
            limit and the other the effective value, which may be
            increased or decreased according to S_MSS * 256, as long
            as it remains less than or equal to the limit value.)
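
The two-item reading of S_MRU above can be written out as follows.
This is only my interpretation, and the function and parameter names
are made up for illustration.

```python
import math


def effective_s_mru(s_mru_limit, s_mss):
    # The effective value tracks 256 * S_MSS but never exceeds the limit.
    return min(s_mru_limit, 256 * s_mss)


# With the limit still at its initial "infinity", S_MSS alone decides.
initial = effective_s_mru(math.inf, 1500)

# With a finite limit (65535 is an arbitrary example value), the limit wins.
capped = effective_s_mru(65535, 1500)
```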


SEAL for IPv4
-------------

DF=0 packets will be fragmented with standard IPv4 techniques before
other processing, if they are longer than:

   (MIN(S_MRU, S_MSS) - 28)

The fragments will be of this length and will be tunneled as
described below.

DF=1 packets which are longer than:

   (S_MRU - 28)

will result in a PTB being sent to the Sending Host (SH).  Apart from
being used to generate that PTB, the packet will be dropped.
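
A minimal sketch of these two IPv4 checks, assuming HLEN = 28 as
guessed earlier (the SEAL header length is not yet confirmed).  The
function name and the returned action strings are hypothetical.

```python
HLEN_IPV4 = 28  # 20 byte IPv4 header + assumed 8 byte SEAL header


def classify_ipv4(packet_len, df, s_mss, s_mru):
    """Decide what the ITE does with an arriving IPv4 traffic packet."""
    if df:
        # DF=1: too long for S_MRU means a PTB to the SH and a drop.
        if packet_len > s_mru - HLEN_IPV4:
            return "send_ptb"
        return "encapsulate"
    # DF=0: fragment with standard IPv4 techniques before encapsulation.
    if packet_len > min(s_mru, s_mss) - HLEN_IPV4:
        return "fragment_then_encapsulate"
    return "encapsulate"
```

For example, with S_MSS = 1500 and S_MRU = 256 * 1500, a 1500 byte
DF=0 packet exceeds 1472 and is fragmented first, while a 1472 byte
packet is encapsulated as it is.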

Encapsulation involves:

  IPv4 outer header   With the 16 bit Identification field set to
                      the 16 most significant bits of SEAL-ID.
                      DF = 0, so the packet is fragmentable by
                      any router which finds the packet is too long
                      for its next-hop MTU.

  SEAL header (32 bits) as described below.

  The original traffic packet.

The IPv4 SEAL header has a 16 bit field "ID-Extension"  which is set
to the least significant 16 bits of SEAL-ID.  Section 4.2 discusses
this.  I haven't figured out exactly how the other bits would be set.
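
The split of the 32 bit SEAL-ID across the two fields, as described
above, amounts to the following.  The function name is hypothetical
and the other SEAL header bits are not modelled.

```python
def split_seal_id(seal_id):
    # Most significant 16 bits go in the IPv4 Identification field.
    identification = (seal_id >> 16) & 0xFFFF
    # Least significant 16 bits go in the SEAL header's ID-Extension.
    id_extension = seal_id & 0xFFFF
    return identification, id_extension
```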

If the packet arrives intact, then the ETE decapsulates it and
forwards the traffic packet to wherever it needs to go.  In IRON,
most of the time, there is a single tunnel to the IRON router which
directly connects to the end-user network.  However initial packets
go to an IRON router (the VP router) which is typically not the
router which connects to the end-user network - so then the packet
would be decapsulated and tunneled in the same manner to the IRON
router which connects to the destination network.

If the packet is too long for one router in the ITE -> ETE path, then
that router will fragment the whole packet and the ETE will (usually)
receive the first and other fragments.  The first fragment should be
as long as the limit imposed at that router by the next-hop MTU.  If
there were two or more MTU limits, such as 1400 and then 1300, the
fragment generated at the first limiting router would be 1400 bytes
long, and this would be fragmented at the second router, so the first
fragment would be 1300 bytes.
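
The cascade can be modelled as below.  This is a sketch which assumes
a 20 byte IPv4 header with no options and routers which make the
first fragment as large as they can.  Since RFC 791 requires every
fragment except the last to carry a multiple of 8 bytes of data, the
first fragment after a 1400 byte limit would actually be 1396 bytes,
not 1400.  The result after the 1300 byte limit is still 1300.  The
function names are hypothetical.

```python
IPV4_HEADER = 20  # assumes no IPv4 options


def first_fragment_len(packet_len, mtu):
    """Length of the first fragment (or the whole packet if it fits)."""
    if packet_len <= mtu:
        return packet_len
    # Non-final fragments carry a multiple of 8 bytes of data (RFC 791).
    payload = (mtu - IPV4_HEADER) // 8 * 8
    return IPV4_HEADER + payload


def first_fragment_after_path(packet_len, mtus):
    """Follow the first fragment through successive next-hop MTUs."""
    length = packet_len
    for mtu in mtus:
        length = first_fragment_len(length, mtu)
    return length
```

So first_fragment_after_path(1500, [1400, 1300]) gives 1300, which is
what the ETE would see and report.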

In this way, the ETE discovers the PMTU of the ITE -> ETE tunnel path.

The ETE does not attempt to deliver the packet, but sends a
"Fragmentation Experienced" message to the ITE.  See 4.3.9.1.2 and
4.4.5.1.2.  This message is defined in Figure 6 and contains as much
of the first fragment as would make the total message 576 bytes in
length.  There are 20 bytes for the IPv4 header and 16 for this
particular SEAL header, so up to 540 bytes of the first fragment will
be sent back to the ITE.  There is no acknowledgement of reception by
the ITE.
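
The arithmetic for how much of the first fragment is returned,
using the header lengths given above (the constant and function
names are mine):

```python
MAX_MESSAGE_LEN = 576        # total length of the report message
IPV4_HEADER_LEN = 20
SEAL_REPORT_HEADER_LEN = 16  # SEAL header of this particular message


def quoted_fragment_bytes(first_fragment_len):
    # Room left for the quoted fragment after both headers.
    room = MAX_MESSAGE_LEN - IPV4_HEADER_LEN - SEAL_REPORT_HEADER_LEN
    # A short first fragment is quoted whole.
    return min(first_fragment_len, room)
```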

The ITE can authenticate this message by looking at the 16 bit ID in
the IPv4 header of the enclosed portion of the first fragment - and
the 16 bits of "ID-Extension" in the SEAL header in that fragment.
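
A sketch of that check follows.  Recombining the two 16 bit fields
is as described above.  The acceptance test is an assumption: a
simple "within the last N SEAL-IDs sent to this ETE" window, with an
arbitrary N, standing in for whatever window or timer arrangement is
finally chosen.

```python
def seal_id_from_quote(identification, id_extension):
    # Identification holds the upper 16 bits, ID-Extension the lower 16.
    return ((identification & 0xFFFF) << 16) | (id_extension & 0xFFFF)


def is_recent(seal_id, latest_sent, window=1024):
    # Distance back from the most recently used SEAL-ID, modulo 2^32,
    # so the comparison still works across the wrap-around.
    return (latest_sent - seal_id) % 2 ** 32 < window
```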

The S_MSS field of this message is set to the length of the first
fragment, which is (or should be, and is assumed to be) the PMTU of
the ITE -> ETE path.

The S_MRU field may be set to zero.  As far as I know S_MRU is how
big a packet the ETE is prepared to reconstruct from packets which
are fragmented with IPv4's native fragmentation - but it may also
apply to the use of SEAL's internal segmentation system, which I
understand is not generally used for IRON.

There is a separate SEAL PTB message (4.4.5.1.3) which is not
relevant to IRON - since it concerns SEAL's internal "segmentation"
system.


I understand that the reason Fred uses DF=0 packets for IPv4 tunnels
is that IPv4 routers, when sending back a PTB, are required (by RFC
1191) to only return the IPv4 header and the next 8 bytes.  This is
enough for the ITE to authenticate the PTB as genuine, since these 8
bytes are the SEAL header. With the IPv4 header, the ITE can
therefore see the full 32 bit SEAL-ID.  However, there is nothing of
the original traffic packet in the PTB, so the only way the ITE could
generate a PTB to the SH would be to have cached the first 28 bytes
of the original traffic packet.  To avoid the need for caching, and
so as not to rely on PTBs from routers in the ITE -> ETE tunnel, SEAL
ITEs tunnel IPv4 packets using DF=0 packets.

The SEAL header used in this IPv4 process contains a flag which, when
set, signals that the ETE should report the successful reception of
the packet to the ITE.  I am not sure to what extent this would be
used for IRON.


SEAL for IPv6
-------------

Packets (IPv6 has no DF flag, so every packet behaves as DF=1) which
are longer than:

   (S_MRU - 48)

will result in a PTB being sent to the Sending Host (SH).  Apart from
being used to generate that PTB, the packet will be dropped.

When encapsulating IPv6 packets, there will be a new SEAL header
which Fred is yet to define.

I guess it will be 8 bytes.  It will contain, at least, a 4 byte
SEAL_ID field and a flag to ask the ETE to tell the ITE the packet
was received correctly.

If the packet does not encounter any MTU limits and arrives at the
ETE, then the ETE decapsulates it and does whatever it needs to do
with it - deliver it directly to the end-user network or perhaps (if
the ETE is the VP IRON router) tunnel it again to an IRON router
which will deliver it to the end-user network.

If the packet hits an MTU limit, then the limiting router will send
back a PTB to the ITE, and since this is IPv6, the PTB will contain
enough of the packet for the ITE to construct a valid PTB for the SH.

If the ITE sets the "Acknowledgement Requested" bit then the ETE will
tell the ITE it has been received.  The ITE can therefore try
different length packets to determine the PMTU to this ETE, even if
it receives no PTBs from whichever router is causing the limitation.
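
One way the ITE could "try different length packets" is a binary
search.  That choice is my assumption, not something in the SEAL
documents.  Here probe() is a hypothetical callback which tunnels a
packet of the given length with "Acknowledgement Requested" set and
returns True if the ETE acknowledged it.

```python
def probe_pmtu(probe, low, high):
    """Largest length in [low, high] that is acknowledged.

    low must be a length already known to get through (for IPv6,
    1280 is a natural choice since every link must support it).
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid got through: search above it
        else:
            high = mid - 1   # no acknowledgement: search below it
    return low
```

A lost packet or lost acknowledgement looks the same as an MTU limit
here, so a real implementation would need retries before concluding
that a length does not fit.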


Rejecting larger packets directly
---------------------------------

Once the ITE has established a value for S_MSS for a given ETE, then
any packet it needs to tunnel to that ETE will be rejected with a PTB
to the SH if, once it was encapsulated, it would be too long for the
S_MSS of this ETE.   So if a DF=1 IPv4 packet arrives with a length
longer than (S_MSS - 28), it will be rejected with a PTB.  Likewise
an IPv6 packet longer than (S_MSS - 48).

As far as I know, all IPv4 DF=0 packets longer than (S_MSS - 28) will
be fragmented into IPv4 fragments of this length (with the final one
potentially - typically - being shorter) and then the fragments will
be tunnelled.  At the ETE, each fragment will be forwarded to the
destination network and then host (or tunnelled to another IRON
router which will forward them to the destination network and then
host - and the destination host will reassemble the fragments using
the standard IPv4 system.)

The ITE can't reject any DF=0 packets.  It can however learn the PMTU
to the ETE by the ETE reporting the size of any fragments which
result from the tunneled versions of the IPv4 native fragments.  That
will enable the ITE to use IPv4 native fragmentation on subsequent
DF=0 packets to this ETE in order that they will fit within the newly
discovered PMTU limitation, which is stored in this ETE's S_MSS variable.
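
A sketch of how the ITE might apply such a report for IPv4, again
assuming HLEN = 28.  The class and method names are hypothetical,
and the rule that a report can only lower S_MSS is my assumption
(increases would come from separate exploration).

```python
HLEN_IPV4 = 28  # 20 byte IPv4 header + assumed 8 byte SEAL header


class EtePath:
    """Hypothetical holder for one ETE's S_MSS at the ITE."""

    def __init__(self, s_mss):
        self.s_mss = s_mss

    def on_fragmentation_report(self, reported_first_fragment_len):
        # The reported first fragment length is taken as the path PMTU.
        if reported_first_fragment_len < self.s_mss:
            self.s_mss = reported_first_fragment_len

    def max_inner_len(self):
        # Longest traffic packet (or native IPv4 fragment) which can be
        # tunneled to this ETE without exceeding S_MSS.
        return self.s_mss - HLEN_IPV4
```

With S_MSS starting at 1500, a report of a 1300 byte first fragment
lowers the limit for traffic packets from 1472 to 1272 bytes.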

(See discussion in msg05979 and msg05981.)



Adapting to changed PMTUs
-------------------------

I am not sure if or where it is specified, but I understand that SEAL
will allow exploration of increased PMTUs.  According to RFC 1191 and
therefore RFC 1981, the SH can try sending a longer packet than it
has been told to send by a previous PTB, after 10 minutes have elapsed.

I assume that if the ITE is handling no DF=1 packets longer than the
current limitation (S_MSS - 28), and DF=0 packets longer than
(S_MSS - 28) keep arriving, requiring native IPv4 fragmentation
before encapsulation, then after 10 minutes or so the ITE should
explore sending larger packets into the tunnel.
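
The timing part of this could be as simple as the following, taking
the 10 minute figure from RFC 1191.  The names are hypothetical and
the timestamps are plain seconds, so any monotonic clock would do.

```python
PROBE_INTERVAL_SECONDS = 600  # 10 minutes


def should_probe_larger(now, last_reduction_time,
                        interval=PROBE_INTERVAL_SECONDS):
    # True once at least `interval` seconds have passed since S_MSS
    # for this ETE was last reduced.
    return now - last_reduction_time >= interval
```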

Except when lost packets prevent it, the SEAL ITE will instantly
discover a reduced PMTU to a given ETE and so reduce that ETE's S_MSS
value, due to:

  IPv4:    Limiting router fragments the DF=0 tunnel packet and
           the ETE reports to the ITE the length of the first
           fragment, which is the new PMTU limit of this path.

           If the original packet was DF=1, the ITE will generate
           a suitable PTB to the SH.

  IPv6:    A PTB arrives at the ITE, and the ITE sends a PTB to the
           SH.



Jumbograms and jumboframes
--------------------------

"Jumbograms" refer to IPv6 packets with special formats so they can
be longer than 2^16 bytes, and can be as long as 4 gigabytes.
Neither SEAL, Ivip's IPTM, nor any CES proposal attempts to deal with
these.

See discussion in msg05979 and msg05981 - Fred writes that SEAL can
accommodate jumbograms - but I can't see how.

As far as I understand how SEAL would work, SEAL will be able to
smoothly adapt to the appearance of jumboframe ~9k byte paths between
ITEs and ETEs.


Coping with blocked or missing PTBs
-----------------------------------

For IPv4, between the ITE and ETE, SEAL does not use DF=1 packets and
so doesn't rely at all on PTBs.   Sometimes, I think, this DF=0
arrangement would produce faster results than by using PTBs as is
done with IPv6. Other times it might be slower - it depends on the
location and nature of the MTU limits along the tunnel.

It does rely on the limiting router(s) producing a first fragment
which is the same length as the limiting PMTU of the ITE -> ETE path
- which seems reasonable.

For IPv6, SEAL relies on PTBs being sent by routers in the ITE -> ETE
tunnel path - and these being received by the ITE.

My impression is that there are problems today with some tunnels or
combinations of tunnels not generating PTBs.  Also, some networks
apparently stop the reception of PTBs from the DFZ, so this would
probably prevent SEAL's PMTUD from working.  When Fred revises SEAL,
the ITE will be able to tunnel IPv6 traffic packets and request that
the ETE acknowledge their successful reception.


For both IPv4 and IPv6, if the SH ignores PTBs, or if there is some
filtering between the ITE and the SH which drops PTBs, then there's
no way the SH is going to adapt its packets to the lengths required
by SEAL to send them without fragmentation, segmentation or whatever.

No CES architecture tries to cope with these problems - they must be
fixed directly.

As far as I know, SEAL does have "segmentation" capabilities - but
these are not intended to be used within the IRON CES architecture.
So SEAL's segmentation is no solution to this kind of failure of
PMTUD.  There's nothing to be done about this - the fault is in the
SH and/or the filtering, so these need to be fixed.