[RAM] 5 Database <--> ITR push, pull and notify

Robin Whittle <rw@firstpr.com.au> Mon, 02 July 2007 14:39 UTC

Return-path: <ram-bounces@iab.org>
Received: from [127.0.0.1] (helo=stiedprmman1.va.neustar.com) by megatron.ietf.org with esmtp (Exim 4.43) id 1I5N3Y-0004hO-HT; Mon, 02 Jul 2007 10:39:12 -0400
Received: from [10.91.34.44] (helo=ietf-mx.ietf.org) by megatron.ietf.org with esmtp (Exim 4.43) id 1I5N3X-0004hJ-FR for ram@iab.org; Mon, 02 Jul 2007 10:39:11 -0400
Received: from gair.firstpr.com.au ([150.101.162.123]) by ietf-mx.ietf.org with esmtp (Exim 4.43) id 1I5N3P-0000VF-Cl for ram@iab.org; Mon, 02 Jul 2007 10:39:11 -0400
Received: from [10.0.0.9] (dell.firstpr.com.au [10.0.0.9]) by gair.firstpr.com.au (Postfix) with ESMTP id D7AFD59E44; Tue, 3 Jul 2007 00:39:01 +1000 (EST)
Message-ID: <46890ECE.2030309@firstpr.com.au>
Date: Tue, 03 Jul 2007 00:42:22 +1000
From: Robin Whittle <rw@firstpr.com.au>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.13) Gecko/20060414
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: ram@iab.org
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Spam-Score: 0.0 (/)
X-Scan-Signature: 713965ef965c5660ee5d9a664a73ef4a
Subject: [RAM] 5 Database <--> ITR push, pull and notify
X-BeenThere: ram@iab.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Routing and Addressing Mailing List <ram.iab.org>
List-Unsubscribe: <https://www1.ietf.org/mailman/listinfo/ram>, <mailto:ram-request@iab.org?subject=unsubscribe>
List-Archive: <http://www1.ietf.org/pipermail/ram>
List-Post: <mailto:ram@iab.org>
List-Help: <mailto:ram-request@iab.org?subject=help>
List-Subscribe: <https://www1.ietf.org/mailman/listinfo/ram>, <mailto:ram-request@iab.org?subject=subscribe>
Errors-To: ram-bounces@iab.org

Here are some thoughts on how three classes of ITR function might
best get up-to-date information about the contents of the central
database.

Also some thoughts on the very simple nature of the Ivip mapping
database and how it could be compactly conveyed.

The first two types of ITR function are performed in routers, or in
a PC running special software to do the encapsulation, connected as
a stub to a BGP router, behaving as a BGP router and
anycast-advertising the Ivip-mapped master-subnets.

  ITRD     Gets the full Database in real-time via push.

  ITRC     Queries database by some means and Caches
           results.  It would also receive notifications.

The third is not really a router, but we think of it as being like
an ITR:

  ITFH    Encapsulating Tunneling Function in a Host.  It behaves
          like an ITRC, but only for the packets the host is
          sending.

I envisage some kind of redundant, multi-server database which
holds the entire mapping data and is subject to updates from
various sources, as discussed in the previous message.  I will
depict this as three actual Central Database Servers CDBS1, CDBS2
and CDBS3.  I know virtually nothing about databases, but let's say
these servers compare notes and keep themselves up to date, with
the entire system still able to take updates and feed out a stream
of change data to the ITRs, even if one or two of these servers
are dead.  In practice, there would be an expandable number of
these servers, in different locations.

I envisage this central system would periodically generate a
complete file of its contents, and then every second or so spit out
one or more packets containing all the changes to that data.  How
to get that information to the ITRs is a major challenge.

As I wrote in the previous message:

   I guess it may be possible for there to be multiple separate
   "central database" systems, each with their own trees of
   delegated update input control.  Each such separate system
   could cover a subset of the entire Ivip-mapped space.

   Then each ITRD would need to get update streams from all these
   systems.  Alternatively, there needn't be a single ITRD which
   handles every Ivip mapped IP address.  This could be an
   important scaling consideration, reducing the amount of RAM and
   FIB capacity in each ITRD, and performing the entire ITRD
   function with several  routers (or just plain servers with a
   gigabit Ethernet link to a router) each handling a subset of
   the total range of Ivip-mapped address space.

For now, I will assume that each ITRD needs to get a single update
stream from a single central Ivip database, and that it handles
all Ivip-mapped addresses.

The great thing about an ITRD is that its FIB is updated the
instant it receives the updates, so the router is always tunneling
packets
to the correct IP address (of an ETR, TTR or ETR-host).  This
means it can be used as a "final line of defense" to catch packets
addressed to Ivip-mapped address space which an ITRC or ITFH lets
through (or tunnels explicitly to this ITRD) because it doesn't
yet have the mapping information for that packet's Ivip-mapped
destination address.  (See message 1 today.)

As the number of Ivip-mapped addresses and prefixes grows, and as
the update rate grows, the burden on an ITRD becomes more and more
intense, irrespective of the actual number of packets flowing
through it.  However, the major router manufacturers will surely
be able to make such devices, for a suitable price.

I wonder about making a DIY ITRD with software on a suitably gutsy
server.  The idea would be to be able to provide full handling of
a limited data-rate stream of packets, with ITRCs and ITFHs
handling most of the load, but to do it without using a router.
Classifying, encapsulating and sending packets in software will be
pretty slow compared to a router's FIB, but it would be easy to
create such servers for very little cost.  A dual- or quad-core
CPU running in 64 bit mode, with 8 gigs of RAM, one or two gigabit
Ethernet interfaces on the motherboard, and booting from a USB
stick.  Total cost: well under USD$1k.  There's not even a hard disc
required.

To match the traffic to the capacity of these DIY ITRDs, there
could be at least two strategies, both potentially used at the
same time.

1 - Split the entire Ivip-mapped address space over several
    DIY ITRDs.

2 - Plant more of them around the edge network, so each one
    gets fewer packets.

This approach of having an ITRD work with only a subset of the
Ivip-mapped address space is likely to be a crucial scaling
technique in the future.  Therefore, there needs to be some way
that the stream of updates, and the periodic database copies, can
similarly be split.  For now, I won't consider this.  I will
assume every ITRD handles the full range of Ivip-mapped addresses.

I see three major tasks which need to be performed so the ITRDs,
ITRCs and ITFHs get what they need.  Since ITFHs are caching
("pull") ITRs like ITRCs, their needs are similar.  They don't get
a continual stream of updates - they need a fast response to their
queries.   I also propose a system of notifications, to
instantly push updated mapping information to those ITRCs and
ITFHs which need it.


Distributing the update stream to ITRDs
---------------------------------------

The most obvious way to do this is have the central database put
out N streams of updates, one to each of N replicators.  Each
replicator sends the same stream to further replicators etc.
Eventually, each ITRD gets a stream from one replicator.  This
could all work pretty quickly, with maybe a few seconds elapsing
between the end-user changing their mapping and the ITRDs all over
the world changing their FIB to tunnel packets to the new address.

This pure tree-structured system suffers from single-point of
failure problems and, if TCP is used, will frequently get stuck
for a moment when a packet goes missing.

I am trying to imagine a more diffuse system, which has some
tolerance for lost packets, but still delivers most of the updates
on time, and copes with those in lost packets either by getting
the same packet via another source, or by having some (admittedly
rather painful-sounding) system where a replicator can ask some
distant replicator for a missing packet.  Exactly how the
updates would be compressed and packed, or coded in a way which is
suitable for splitting to multiple ITRDs with different subsets of
the Ivip-mapped address space, I haven't considered.

I don't know much about Google's internal operations, but I like
their idea of throwing lots of cheap PCs at the problem and making
the system work reliably at a very high speed by sending out more
requests and generally getting more answers than is required for
success.

Let's say the first level of replicators takes the stream from the
central database (the three or more servers which act as a team
and somehow generate one or more streams, all the same).  The
streams probably need to be protected by occasional hashes,
signatures or whatever so each replicator can detect if it is
being fed something crooked by an impostor.

Then each of the second level of replicators gets streams from at
least two of the first level.  However, each replicator can
receive one stream and put out, say, 16 streams.

Somehow, these replicators form a meshed, downward packet-sending
structure - a "fanning-out net across the Net" - where they send
their
streams to ITRDs.  Maybe each ITRD gets two streams from two
replicators, via two links, so it is still in business if one link
or replicator fails.


     Tree-structured system of delegated
     authority and servers which at its
     leaves, have the facilities for end-users
     to securely control the Ivip-mapping of
     their one, few or many IP addresses.


        \      |    |    |      /
         \     |    |    |     /
          \    |    |    |    /
           \   |    |    |   |
           |   |    |    |   |
           V   V    V    V   V

             ...............
         .                    .
       .   CDBS1---------CDBS2  .    Distributed,
      .       \           /      .   redundant,
      .        \         /       .   central
      .         \       /        .   database.
      .          \     /         .
      .           \   /          .
       .          CDBS3         .
         .                     .
             ...............

           |  |  |  |  |  |  |       Generates periodic
           |  |  |  |  |  |  |       complete database
          /   |  |  |  |  |   \      files and a stream
         /   /   |  |  |   \   \     of second-by-second
        /   /   /   |  |    \   \    updates.
                    V
                    |
                Replicator (Level 1)
                ||||||||||
               / |||||||| \
              / / |||||| \ \
             / / / |||| \ \ \
            / / / / || \ \ \ \            From another
           / / / / / |  \ \ \ \           Level 1
                     V                    Replicator
                     |                       /
                     |                      /
                     |                     /
                     |                    /
                     |                   /
                     |   /------<-------/
                     |  /
                     | /
                     |/
                 Replicator (Level 2)
                 ||||||||||
                / |||||||| \
               / / |||||| \ \
              / / / |||| \ \ \
             / / / / || \ \ \ \
            / / / / / |  \ \ \ \


               etc. etc. etc.


       Two or more streams from different replicators
       perhaps at different levels.

                     |  /
                     | /
                     |/
                   ITRD


The ITRD could itself be a replicator - but I think it is best to
leave its CPU and links to routing, and let some cheap servers be
the replicators.


This diagram is for the continual update streams.  Once an ITRD
has a complete copy of the database, the streams will keep it
totally up to date.

But how does it get that way?   It would be possible to distribute
by the same system a separate Hourly Database Dump stream, in
parallel with the real-time update stream.  The Hourly Database
Dump could be ignored by (and not sent by a replicator to) those
ITRs which didn't want it.

Let's say the dump only took 20 minutes to send.  Exactly how big
the Ivip database is going to be, I am not sure.  But there are
ways of splitting the whole Ivip system into parallel systems
which are within scaling limits, such as by having a cluster of
ITRDs in one site which each handle a subset of the whole address
space.

I anticipate the database contents would be very simple.

For IPv4, each IP address or prefix or whatever with a single ETR
address will be described by:

32 bits:    Start address.
16 bits:    Span.  0 means a single IP address.
32 bits:    Address to tunnel packets to.

(We could use more than 16 bits for the span, but very few swaths of
Ivip address space with the same mapping will be more than 65,536
addresses long.  So it is better to keep the data compact and
specify any such long contiguous spans, including areas which are
not mapped to anything (0), with a few repetitions of the above
with a span of 65,535.)

These would probably compress pretty well too, since the span and
address fields will generally contain long runs of zeros.
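As a minimal sketch of this 10-byte record (the field names and the
use of Python's struct module are my own, not part of any defined
Ivip format):

```python
import struct
import socket

# Hypothetical 10-byte wire format for one IPv4 mapping record:
# 32-bit start address, 16-bit span (0 = one address),
# 32-bit address to tunnel packets to.
RECORD = struct.Struct("!IHI")   # network byte order

def pack_record(start_ip: str, span: int, etr_ip: str) -> bytes:
    """Pack one mapping record into 10 bytes."""
    start = struct.unpack("!I", socket.inet_aton(start_ip))[0]
    etr = struct.unpack("!I", socket.inet_aton(etr_ip))[0]
    return RECORD.pack(start, span, etr)

def unpack_record(data: bytes):
    """Recover (start address, span, tunnel address) from 10 bytes."""
    start, span, etr = RECORD.unpack(data)
    return (socket.inet_ntoa(struct.pack("!I", start)), span,
            socket.inet_ntoa(struct.pack("!I", etr)))

rec = pack_record("20.0.10.32", 5, "33.55.66.77")
assert len(rec) == 10
assert unpack_record(rec) == ("20.0.10.32", 5, "33.55.66.77")
```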

** There may also need to be a caching time with this data, so
** that when a query is made to a QSD (defined below) which has
** its own copy of the database, that the response contains a
** "time to cache".

When sending the whole database, all that is needed for each
master-subnet is the starting address and then repetitions of, for
instance:


8 bits:    Span.  0 means one IP address; 255 means 256 addresses.

32 bits:   Address to tunnel packets to.
           (0x00000000 means not mapped - drop the packet.
            0xFFFFFFFF means end of this master subnet.)

** Also maybe a caching time.

The whole database could then be specified in segments, one for
each non-contiguous range (a "master-subnet").  For instance, there
is a 32 bit number to specify the start, then a sequence of 8+32
bits like this.

When there are more than 256 addresses to be mapped to the same
address, simply use multiple such sequences.  There won't be many
of those compared to the large number of single IP addresses and
smaller ranges.  Note there is no requirement here for binary
boundaries, aggregation etc.
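A sketch of encoding and decoding one such master-subnet segment
(the exact byte layout here is my guess at the scheme described
above, not a defined format):

```python
import struct

DROP = 0x00000000   # not mapped - drop the packet
END  = 0xFFFFFFFF   # marks the end of this master-subnet segment

def encode_segment(start: int, runs) -> bytes:
    """Encode a master-subnet: a 32-bit start address, then
    (8-bit span, 32-bit address) pairs.  A span byte of 0 means
    1 address, 255 means 256."""
    out = [struct.pack("!I", start)]
    for count, addr in runs:            # count is 1..256 addresses
        assert 1 <= count <= 256
        out.append(struct.pack("!BI", count - 1, addr))
    out.append(struct.pack("!BI", 0, END))   # terminator
    return b"".join(out)

def decode_segment(data: bytes):
    """Recover the start address and the list of (count, addr) runs."""
    start = struct.unpack_from("!I", data, 0)[0]
    runs, off = [], 4
    while True:
        span, addr = struct.unpack_from("!BI", data, off)
        off += 5
        if addr == END:
            return start, runs
        runs.append((span + 1, addr))

# 20.0.10.32: six addresses mapped to 33.55.66.77, then two unmapped.
seg = encode_segment(0x14000A20, [(6, 0x2137424D), (2, DROP)])
assert decode_segment(seg) == (0x14000A20, [(6, 0x2137424D), (2, DROP)])
```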

The worst case, mapping N IPv4 addresses completely independently,
would take 5*N bytes.

(For IPv6, with Ivip working to a /64 granularity, it would be
 8 bits plus 64, so 9 bytes.)

So when the Ivip system handles a billion IPv4 addresses, it takes 5
gigabytes at worst case, probably 2 or so gigabytes, after
compression and allowing for some of the space being mapped as
prefixes (or any arrangement of contiguous addresses with the same
ETR address).

Allowing for inefficiencies, over a 1 Gbps link with 100 Mbps
devoted to receiving or sending the database dump, a 3 gigabyte
dump would take about five minutes.
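The arithmetic behind that figure:

```python
# Back-of-envelope check of the dump transfer time quoted above.
dump_bytes = 3e9          # ~3 GB dump, after compression
rate_bps   = 100e6        # 100 Mbps reserved for the dump
seconds = dump_bytes * 8 / rate_bps
print(seconds / 60)       # 4.0 minutes raw; call it five with overhead
```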

So let's make it the "Ten minute database dump"!

Let's say every ten minutes, the central system sends to the
replicators this dump, in parallel with real-time updates.

Then when an ITRD boots up, it waits for the start of the dump,
stores the whole thing as it comes in, while storing real time
updates.  When the whole dump arrives, it decodes it into RAM,
applies the realtime updates so far received, and keeps on
updating the data in RAM from then on.  From that moment on, it can
start crunching the data into whatever form drives its FIB.

Maybe, this periodic dump system would be a good enough way of
coping with occasional lost packets in the real-time updates.  But
then, we would need a more solid arrangement so the periodic dumps
are not corrupted or slowed much by lost packets.


Replicators and missing packets
-------------------------------

Let's say the one or more update streams have packets with
sequential 64 bit numbering, so any replicator which receives it
can easily tell which stream it belongs to, and at what point in
that stream.  Then these packets can be sent as UDP.
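A sketch of such a packet header - a stream identifier plus the
64-bit sequence number (the 32-bit stream ID field is my own
assumption):

```python
import struct

# Hypothetical update-packet header: a 32-bit stream ID plus a
# 64-bit sequence number, so any replicator receiving a UDP packet
# can tell which stream it belongs to and where it sits in it.
HEADER = struct.Struct("!IQ")

def make_update_packet(stream_id: int, seq: int, payload: bytes) -> bytes:
    """Prepend the 12-byte header to a batch of mapping updates."""
    return HEADER.pack(stream_id, seq) + payload

def parse_update_packet(pkt: bytes):
    """Split a received packet into (stream_id, seq, payload)."""
    stream_id, seq = HEADER.unpack_from(pkt, 0)
    return stream_id, seq, pkt[HEADER.size:]

pkt = make_update_packet(7, 123456789, b"mapping updates...")
assert parse_update_packet(pkt) == (7, 123456789, b"mapping updates...")
```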

With two separate feeds from two separate replicators, generally
each replicator is going to get at least one copy of all the
packets.  But what about the rare occasion when it doesn't . . .

This is a little tricky.  Maybe the ITRDs can be made to cope
gracefully, with some identifiable gap in their data, but not
actually wrong information.  Let's say there were gaps in the "10
minute database dump" with which the ITRD booted.  It would carry
on, simply dropping packets to that range of addresses it doesn't
have any mapping information for.  Ideally, this would be a small
range of addresses, and maybe no-one is sending packets to that
range through this ITR.  The problem should be fixed with the next
10 minute database dumps.

For gaps in subsequent 10 minute database dumps, the ITRD could
quite reasonably keep the mapping it had from the previous cycle,
since in principle, the 10 minute database dump is providing
identical data to that which the ITRD had figured out for itself
based on the last dump and ten minutes worth of updates.

So maybe the odd missing packet is not such a lot of fuss in this
system.  Below, I will assume it can't be tolerated.

A replicator C receives two or more streams from two other
replicators A and B.  If it doesn't get a particular packet (with a
particular sequence number) from either A or B, there are several
possible causes.  One is that both A and B sent the packets but they
were both lost in transit.

In that case, C could ask either A or B to resend this one packet,
or several.  A single message could request quite a number of
packets, and A or B could dutifully send them all, several times
with a few seconds between them, just to be sure.
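How replicator C might track gaps across the two merged streams and
decide which sequence numbers to ask for is sketched below (this
assumes the hypothetical 64-bit sequence numbering above; the class
and message form are my own):

```python
# Sketch of gap tracking at a replicator receiving the same
# sequence-numbered stream from two upstream replicators, A and B.
class GapTracker:
    def __init__(self):
        self.highest = -1     # highest sequence number seen so far
        self.missing = set()  # sequence numbers not yet received

    def receive(self, seq: int):
        """Record a packet seen on either stream."""
        if seq > self.highest:
            # everything between the old high mark and seq is missing
            self.missing.update(range(self.highest + 1, seq))
            self.highest = seq
        else:
            # a late or duplicate copy from the other stream fills a gap
            self.missing.discard(seq)

    def resend_request(self):
        """Sequence numbers to request from A, B or a distant replicator."""
        return sorted(self.missing)

c = GapTracker()
for seq in (0, 1, 2, 5, 3):    # copies from streams A and B, merged
    c.receive(seq)
assert c.resend_request() == [4]
```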

But maybe A or B didn't get the packets either.

In that case, it would be best if a packet with a sequence number
was always sent by A or B, but with a flag to say that the
replicator sending this packet has not yet received its data - so
"stay tuned" and there will be another such packet, with the same
sequence number, but the correct data, sent ASAP.

Maybe C needs to be able to talk to some rather distant
replicators, which are not on the upstream path of A or B, to ask
them to send the desired packets.  If it sends out two such requests
to two topologically different replicators, it will have a good
chance of getting the right data.  If no response comes in a
second or two, it can always try again.  A response may also come
back: "This replicator doesn't yet have the packet in question but
will send you a copy if and when it gets a copy."

The rolling "10 minute database dump" system, the checks on the
integrity of the data before it is used, the configuration of ITRs
so they can't tunnel packets to certain ranges of addresses (for
instance to protect critical infrastructure such as the DNS) may
be enough to allow a low level of holes in the dump and update data.

Another approach is to use Reed-Solomon Forward Error Correction
and interleaving so that even with some lost packets, the data
gets through, with an overhead of say 30 or 40% extra packets
required.  But I think these algorithms are probably really
difficult to do on ordinary CPUs.


Query Servers
-------------

There could be an IETF protocol for querying a server about Ivip
mapping for a particular IP address.  (An ITRC or ITFH only
requests how to encapsulate individual packets addressed to a single
Ivip-mapped address, but maybe it would be good to extend the
protocol to cover a range of addresses . . . I won't think about
that now.)

The client asks "What is the Ivip mapping for this IP address?"

The server responds with either:

1 - I don't know but am finding out . . . and will respond with
    another packet soon.  (Actually - this is probably not
    necessary in the scheme below.)

2 - This address is not within one of the master-subnets which
    the Ivip system covers.  (Ideally, the ITRs should already
    know this, and the list of master-subnets will change, so
    each ITR needs to get it securely from somewhere.)

3 - This address is not mapped at present, so drop packets sent
    to this address, or to any address M  to N inclusive.  (Or M
    with a span.) Cache this information for S seconds.

4 - This address is mapped to address Z and so are all the
    addresses from M  to N inclusive.  (Or M with a span.)
    Cache this information for S seconds.
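The four response types above could be sketched as tagged records
like this (the class and field names are my own, not a defined
protocol):

```python
from dataclasses import dataclass

@dataclass
class Pending:        # 1 - finding out; another packet will follow
    pass

@dataclass
class NotIvip:        # 2 - outside every Ivip master-subnet
    pass

@dataclass
class NotMapped:      # 3 - drop packets for M..N; cache S seconds
    first: str
    last: str
    cache_seconds: int

@dataclass
class Mapped:         # 4 - M..N all tunnel to etr; cache S seconds
    first: str
    last: str
    etr: str
    cache_seconds: int

reply = Mapped("20.0.10.32", "20.0.10.37", "33.55.66.77", 600)
assert reply.etr == "33.55.66.77"
```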


There are two types of query servers:

  QSD  Has a full copy of the Database, gained just like an ITRD
       does - from one or two or more streams of packets sent from
       replicators.  This function could be implemented in a router
       which is an ITRD, or in a replicator, but I think it is
       better to have a dedicated server for this purpose.
       Again, the QSD server doesn't need a hard drive - just
       booting from a USB stick and doing its work in RAM.

  QSC  Caching Query Server.  Doesn't have a copy of the database,
       but has at least one link to either a QSD or a QSC which
       does link (through potentially multiple levels) to one or
       more QSDs.

       The QSC function could be implemented in a server or
       perhaps in a router which is performing the ITRC function.


Each decent edge network would have one or two QSDs, probably each
also being a little Replicator to feed data to its one or two
ITRDs.  These may be fairly low-volume ITRDs, implemented on cheap
servers, because I plan that most traffic would be handled by
proper routers, with hardware FIBs, which implement the ITRC
function, for some level of traffic and number of concurrently
mapped IP addresses or prefixes.  (Unless the edge network was a
web-server farm in which case it would make sense for all the
servers to have their operating systems upgraded to perform ITFH,
rather than rely on expensive routers to do the work.)

The low-volume ITRDs are there to instantly tunnel the packets
which the ITRCs and ITFHs don't yet know how to handle.  They
don't know how to encapsulate a packet when its destination
address is within the Ivip-mapped range but which the ITRC or ITFH
has no cached data about.  Maybe it is just a one-off packet, so
the ITRC or ITFH either lets it pass, and be forwarded to a
low-volume ITRD, or alternatively the ITRC or ITFH explicitly
tunnels the "novel" packets to the nearest ITRD.  It is best not to
launch a query for every novel packet, because that would allow an
attacker with a scattergun set of short packets to tie up a lot of
resources in the edge network.

Meanwhile, each ITRC and ITFH decides for itself whether enough
novel packets have arrived for an address it currently doesn't have
the mapping information for.  Whenever that decision is positive,
the ITRC or ITFH sends a request to one or more upstream QSDs or QSCs.

Both the QSD and the QSC respond to queries in the same way, and
the idea is that the ITRCs and ITFHs get a response, from a fresh (a
few
seconds old at most) copy of the database, within a fraction of a
second.


       Two or more streams from different replicators
       perhaps at different levels.

     Internet        |  /
                     | /
 ....................|/.......................................
                Replicator
  Edge network      &QSD           (Query Server with a local
  which is          /|\            copy of the Database.)
  hip to Ivip      / | \
                  /  |  \-----------\
                 /  QSC             QSC
              ITRD  /|\             /|\
                   / | \           / | \
                  /  |  \         /  |  \
                 /   |  ITRC   ITRC QSC  \
                /   ITRC            /|    ITFH
               /                   / |
             ITFH                 / ITRC
                                ITRC




Notification of changed mapping
-------------------------------

There is a fundamental problem with the mismatch of the needs of
multihoming end-users with the idea of cached ITRs.

A multihomed edge network might run like a charm from ISP-A and its
ETR for months or years - but the moment something goes wrong,
no-one wants the Ivip system to dawdle in changing the mapping.

Meanwhile, no-one wants the ITRCs and ITFHs to get responses
saying "cache this result for ten seconds" so they have to keep
asking again and again, with the result never changing for months or
years.

Maybe the system could operate on a different basis from variable
caching times.  Maybe there could be a fixed caching time, to
simplify things.  I will explore a fixed caching time of 10
minutes (assuming a 10 minute full database dump cycle to all the
QSDs), but there could be arguments for other times and for each
response having a user-defined caching time.

Let's say that if a client of a QSD or QSC (which means the client
could be a QSC, an ITRC or an ITFH) makes a query about an IP
address, it gets back a response from this set, which is a
modification of the 1 to 4 above:

1 - (We don't need the response 1 above.)

2 - This address is not within one of the master-subnets which
    the Ivip system covers.  (ITRs shouldn't ask this - they
    should already know.)

3 - This address is not mapped at present, so drop packets sent
    to this address, or to any address M  to N inclusive.  (Or M
    with a span.)
    (No cache time - the client is assumed to cache it for
     10 minutes.)

4 - This address is mapped to address Z and so are all the
    addresses from M  to N inclusive.  (Or M with a span.)
    If this changes in the next ten minutes, you will get a
    notification.
    If you still want to know the mapping of this address in
    ten minutes time, ask again.


Now, the chain of 0 or more QSCs and the final query server, the
QSD, can retain a record of each query for a standard 10 minute
period (easy to implement if the whole database dump cycle is 10
minutes).  If the QSD gets a real-time update of an address or
prefix changing, it sends a message to that client, which passes it
on to every client that needs it.

Let's say ITRC7 asks QSC5 about 20.0.10.34.  (Exactly the same
behavior applies for a host-based ITFH function.)  Maybe QSC5 already
has cached an answer about this, so it responds immediately.
Alternatively, QSC5 asks its upstream query server QSC2 which then
asks QSD1, which knows the answer.  The responses go back down
to QSC2, QSC5 and then to ITRC7:

   This address is mapped to 33.55.66.77

   So is the range 20.0.10.32 to 20.0.10.37 inclusive.

There is an implicit 10 minute caching time for this information,
after which each of these devices can forget the information
unless there is some subsequent request.

But let's say after 7 minutes, a change arrives which affects the
validity of this response.  Maybe the change doesn't affect the IP
address for which the query was made.  Still, a notification needs
to be sent.  For instance:

   The range 20.0.10.37 to 20.0.10.37 inclusive is now
   mapped to 33.55.88.99.


Each QSC maintains a list of recent queries from its downstream
QSCs, ITRCs and ITFHs (it doesn't need to know what they are) and
it sends a Notification packet to them with the above updated
information.
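The bookkeeping a QSC needs for this is small - remember who asked
about which range within the 10 minute caching period, and fan the
change out to exactly those clients.  A sketch (names and the
callback interface are my own):

```python
import time

CACHE_SECONDS = 600   # the fixed 10-minute caching period

class QSC:
    def __init__(self):
        # (first_ip, last_ip) -> {client: time of most recent query}
        self.interested = {}

    def record_query(self, client, addr_range):
        """Remember that this downstream client asked about this range."""
        self.interested.setdefault(addr_range, {})[client] = time.time()

    def notify(self, addr_range, new_etr, send):
        """Forward a mapping change to every client that queried this
        range within the caching period.  `send` transmits the
        notification packet."""
        now = time.time()
        for client, asked in self.interested.get(addr_range, {}).items():
            if now - asked < CACHE_SECONDS:
                send(client, addr_range, new_etr)

qsc = QSC()
qsc.record_query("ITRC7", ("20.0.10.32", "20.0.10.37"))
sent = []
qsc.notify(("20.0.10.32", "20.0.10.37"), "33.55.88.99",
           lambda *args: sent.append(args))
assert sent == [("ITRC7", ("20.0.10.32", "20.0.10.37"), "33.55.88.99")]
```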

This way, ITRCs and ITFHs don't need to keep bugging their QSCs,
or at least not more than once every 10 minutes.  However, within
a few seconds, a change in the database filters through to every
ITRD in the Net, and to all the caching ITRs which need it.

There probably needs to be some small charge for these changes,
because they do burden the entire Ivip system - the databases, the
replicators, the ITRs (which are mainly run by and located within
edge networks).  That charge won't be a problem for multihoming
end-users, but it would give physically mobile mobile-IP end-users
some incentive not to chop and change too much, whilst being made
to financially contribute to the Ivip system with these fees  -
which I guess would be charged back to the various companies which
control the various master-prefixes.

So maybe we can make the cashed-up road-warrior Mobile IP folks
pay for the whole Ivip system!   The global network of ITRs beaming
their packets to whatever TTR they choose is probably their dream
come true.

The mobile IP users still need to have their independent networks of
TTRs, but they would have great mobility, access by any host, and
generally highly optimal packet paths in both directions.

They would literally be able to plug their laptop, phone, whatever
these things are combined into and called at the time, into any old
Ethernet socket with DHCP, or have its WiFi system connect to any
open Access Point, and the host software and the centralised
management system would automatically keep the device well
connected, always on its own IP address, with highly optimal paths
for most of the packets - without any prior arrangement with
edge networks.


 - Robin         http://www.firstpr.com.au/ip/ivip/





_______________________________________________
RAM mailing list
RAM@iab.org
https://www1.ietf.org/mailman/listinfo/ram