Re: [RAM] 5 Database <--> ITR push, pull and notify

Robin Whittle <rw@firstpr.com.au> Fri, 06 July 2007 08:45 UTC

Message-ID: <468E0106.8040007@firstpr.com.au>
Date: Fri, 06 Jul 2007 18:44:54 +1000
From: Robin Whittle <rw@firstpr.com.au>
Organization: First Principles
User-Agent: Thunderbird 2.0.0.4 (Windows/20070604)
MIME-Version: 1.0
To: ram@iab.org
Subject: Re: [RAM] 5 Database <--> ITR push, pull and notify
References: <46890ECE.2030309@firstpr.com.au>
In-Reply-To: <46890ECE.2030309@firstpr.com.au>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
List-Id: Routing and Addressing Mailing List <ram.iab.org>

Here is a decentralised alternative to what I proposed - to take
the place of the "Distributed redundant central database" and to
remove the need to send regular dumps of the database through the
replicator system.

In this new approach, the replicators, ITRDs and QSDs are doing
much the same thing as before, but the replicators only carry
real-time updates.  Rather than a single stream, there are
multiple streams, each stream carrying updates for one of the Ivip
system's "master-subnets".


Multiple companies, organisations etc. which have one or more
"master-subnets" in the Ivip system each run their own UAS
(Update Authorisation Server) system - in practice some kind of
redundant pair or set of servers which behaves as one and is not
subject to a single point of failure.

The diagram in the message a few days ago "4 User-interface and
delegation tree for central database (LISP/Ivip)" shows how the
UAS-X of company X delegates control of different sections of this
"master-subnet" and gets all the updates within a fraction of a
second of the end-user's making the changes.

The UAS-X (or redundant set of such servers) is the central store
of mapping data for this "master-subnet".  UAS-X could handle
multiple separate "master-subnets" but this example only considers
one IPv4 "master-subnet" 20.0.0.0/14.

There could be a potentially large number of such UAS systems, maybe
hundreds or up to tens of thousands.  Ideally there would be no
more than a few dozen or a few hundred.

Each UAS periodically, say every 10 minutes, generates a compressed
"dump" file of its database and makes it available for download by
HTTP or FTP on multiple redundant servers.  Each dump file has a
timestamp.  There needs to be some crypto system so that ITRDs and
QSDs can check the hash of the dump file they download against
some signed record of its proper value.  There also needs to be
some crypto validation of the update messages - which I haven't
thought about yet.
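As a rough sketch of the dump-and-hash step (in Python; the file
naming and the use of SHA-256 are purely my illustration - the
actual signing scheme is left open above):

```python
import gzip, hashlib, time

def write_dump(mapping, path_prefix):
    """Write a timestamped, compressed dump of the master-subnet's
    mapping data and return (path, SHA-256 hex digest).  The digest
    would be published in some signed record so that ITRDs and QSDs
    can verify the file they download."""
    path = "%s-%d.gz" % (path_prefix, int(time.time()))
    with gzip.open(path, "wb") as f:
        f.write(mapping)
    # Hash the compressed file as it will be served for download.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return path, h.hexdigest()
```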

Each UAS continually generates a UDP stream of updates, also
timestamped.  One of the messages in that stream may be "a dump
file was generated now".  In practice, the UAS system generates
several identical streams from different locations.  Maybe it
generates an update message as soon as the incoming updates fill a
UDP packet or after one second elapses.  If no updates come in for
ten seconds, maybe it sends a time-stamped update message anyway,
with no updates.  Each message needs a 64 bit sequence number, and
a 32 bit or similar identifier for which master-subnet it is
updating the mapping of.
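As a concrete wire layout for one such message header (Python;
the field order and exact widths beyond "64 bit sequence number,
32 bit master-subnet identifier" are my guesses, not part of the
proposal):

```python
import struct

# Hypothetical header, big-endian: 64-bit sequence number,
# 32-bit master-subnet identifier, 32-bit timestamp.
HEADER = struct.Struct("!QII")

def pack_update(seq, subnet_id, timestamp, body=b""):
    """Build one update message: header followed by the updates."""
    return HEADER.pack(seq, subnet_id, timestamp) + body

def unpack_update(packet):
    """Split a received message back into its header fields and body."""
    seq, subnet_id, timestamp = HEADER.unpack_from(packet)
    return seq, subnet_id, timestamp, packet[HEADER.size:]
```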

The distributed system of "replicators" is configured to replicate
the contents of the update stream produced by UAS-X.

A newly booted ITRD (Ingress Tunnel Router with full database) or
QSD (Query Server with full Database) performs the following
procedure, for each of the master-subnets in the Ivip system.  The
ITRD or QSD is receiving from the replicator system many streams
of updates, including for the master-subnet in question, which is
the one coming from UAS-X.  (UAS-X could also be responsible for
other master-subnets.)

The ITRD/QSD monitors the stream, waiting for the flag which says
a dump has been created.  It then buffers all subsequent updates
in the stream, waits until the dump file is available (which could
take some seconds) and then starts to download the dump file.

By the time the dump file arrives, perhaps a minute or so of
updates will have been buffered.

The ITRD/QSD unpacks the dump file into an array in RAM which is 4
bytes for every IP address in the master-subnet. (This is an IPv4
example.)  It then applies the buffered updates, bringing the data
totally up-to-date with the last received update - and then
continues to apply all subsequent update messages as they arrive
from the replicator system.
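A minimal sketch of that unpack-and-apply step (Python; the
4-bytes-per-address array follows the text, while the names and
the shape of a dump/update entry are illustrative):

```python
# Bootstrap sketch for the example IPv4 master-subnet 20.0.0.0/14.
# A /14 covers 2**18 addresses, so at 4 bytes of mapping data per
# address the in-RAM array is 1 MiB.

SUBNET_BASE = 20 << 24      # 20.0.0.0 as a 32-bit integer
SUBNET_SIZE = 1 << 18       # 2**18 addresses in a /14

def bootstrap(dump_entries, buffered_updates):
    """Each entry is an (ip, 4-byte mapping value) pair."""
    table = bytearray(4 * SUBNET_SIZE)        # 1 MiB for this /14
    for ip, mapping in dump_entries:          # unpack the dump...
        off = (ip - SUBNET_BASE) * 4
        table[off:off + 4] = mapping
    for ip, mapping in buffered_updates:      # ...then replay the
        off = (ip - SUBNET_BASE) * 4          # updates buffered while
        table[off:off + 4] = mapping          # the dump downloaded
    return table
```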

Then the ITRD/QSD has an up-to-date copy of UAS-X's database for
this master prefix.  A QSD can start answering queries about it.
An ITRD can advertise this master-subnet's BGP prefix.  Soon it
will receive packets addressed to this master-subnet and can
encapsulate them and forward them to peers towards their ETRs.

It would be important to have a close or perfect match between the
address range of each "master-subnet" and the BGP advertisement
which the ITRs make for it.  We want each ITR either advertising
or not advertising a BGP prefix.  We don't want excessive churn in
the advertisements, such as advertising a small subnet (longer
prefix) when one master-subnet's mapping is complete and then
withdrawing this to advertise a larger subnet when an adjacent
master-subnet's mapping data is complete.

On the other hand, even if there was a massive master-subnet, like
a whole /8, it would be good for some ITRDs to advertise a subset
of this, if the total ITR load was to be split among several ITRDs
by making each one only handle a subset of Ivip-mapped address space.

Periodically the ITRD/QSD could repeat the process of downloading
a dump file, building a second array in RAM while the current one
is being updated.  When the process was complete, it would switch
to using the second one for its queries or mapping functions,
freeing the first area of memory.

Since some or many of the packets coming from the UAS systems to
the Level 1 replicators might be short, perhaps those replicators
should have a way of combining shorter packets into longer ones,
to reduce the
total number which need to be sent through the rest of the
replicator system.  This could be dodgy, since a single missing
packet at that point could cause some difference in the streams
leaving each Level 1 replicator.


If there were 30 Level 1 replicators, UAS-X might generate streams
to every such replicator.  (Maybe two streams to every Level 1
replicator?) Level 2 replicators typically receive streams from
two level 1 replicators for redundancy.  There could be hundreds
of systems like UAS-X feeding UDP update streams to the Level 1
replicators.

This removes any central system for handling the data.  It does
away with my initial idea of sending regular (say every 10
minutes) full database dumps via the replicator network.

So an ITRD would boot and advertise various prefixes as it
acquired the full mapping information for each master-prefix.



(See "4 User-interface and delegation tree ...".)

\   |   /   }  Update information from end-users
 \  V  /    }  directly or via child UAS systems.
  \ | /
   \|/
  UAS-X --------->-------------------[Dump server 1]
   /|\                         \
  / | \                         \----[Dump server 2]
 /  |  \                         \
/   V   \                         \-- etc.
    |    \
    |
    |  30 UDP streams of identical realtime
    |  updates to the 30 Level 1 replicators
    |
    |
\   \    |    /   /     Each of the 30 Level 1 replicators gets a
 \   \   V   /   /      stream from every UAS such as UAS-X - one
  \   \  |  /   /       stream for every "master-subnet".
  [Replicator-N]
     / / | \ \
    / /  V  \ \         Each of 30 Level 1 replicators sends 30
   /  |  |  |  \        "full streams" (the sum of all the streams
         |              it receives from systems like UAS-X) to
         |              Level 2 replicators.
         \
          \         /
           \       /
            \     /     Level 2 replicator gets two (ideally
         [Replicator]   identical) full streams from two of the
           / / | \ \    Level 1 replicators.
          / /  V  \ \
         /  |  |  |  \
                  |
                  /
                 /
      \         /
       \       /
        \     /         Level 3 replicator gets two (ideally
     [Replicator]       identical) full streams from two of the
       / / | \ \        Level 2 replicators.
      / /  V  \ \
     /  |  |  |  \      All these replicators are cheap
        |     |         diskless Linux/BSD servers with one or
        |     |         two gigabit Ethernet links.  They would
        |     |         ideally be located on stub connections
        |     |         to transit routers, though the Level
   \    |     |         3 (or 4 etc. if desired) might be at
    \   |     |         the border of, or inside, provider and
     \  |     |         ASN-end-user networks.
      \ |      \
      ITRD     QSD      ITRDs and QSDs ideally get two or more
               /|\      ideally identical full feeds of updates -
              / | \     so generally a missing packet from one
             / QSC \    is fine since the other stream has the
            /  /|\  \   same packet.
           /  / |
          /  /  |       Both therefore have a real-time updated
         /  QSC |       copy of all the databases of all the UASes
      ITRC /|  ITRC     like UAS-X.  Queries go up to the QSD -
          / |           or to a QSC which has a cached answer.
         /  |           Responses go back down to the requester
       ITFH ITRC        which is either a QSC or one of the two
                        "pull and be notified" caching ITRs:
                        ITRC (ITR with Cached mapping) and
                        ITFH (Ingress Tunnel Function in Host).

The figures quoted below are wide guesses for the example.
Exactly what the database sizes are, the update rates, the data
rates of updates etc. depends on many factors.  I want a system
which can scale to handling one or two billion IPv4 addresses,
with some of these having reasonably frequent updates due to
mobility - with those mobile end-users paying per update to help
finance this system.

The average Level 2 replicator gets two full streams from two
widely separated Level 1 replicators.  This means there can be 450
Level 2 replicators, each of which sends out 30 streams to Level 3
replicators.  The pattern continues with 15 * 450 = 6750 Level 3
replicators, each of which has 30 output streams, with most ITRDs
and QSDs getting two streams, from widely separated Level 3
replicators.
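Those figures follow directly from the fan-out rules above;
spelled out (Python, constants straight from the example):

```python
# Fan-out arithmetic: each replicator sends 30 output streams and
# each downstream node consumes 2 of them for redundancy.
LEVEL1_REPLICATORS = 30
OUTPUT_STREAMS = 30
FEEDS_PER_NODE = 2

level2 = LEVEL1_REPLICATORS * OUTPUT_STREAMS // FEEDS_PER_NODE  # 450
level3 = level2 * OUTPUT_STREAMS // FEEDS_PER_NODE              # 6750
print(level2, level3)   # prints: 450 6750
```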

With this push to ITRDs and QSDs, pull via queries to QSDs, and
the QSDs notifying (a carefully directed push) those child QSCs or
ITRCs which recently (within 10 minutes?) made a query whose
answer covered one or more addresses affected by an update - the
entire global ITRD, ITRC and ITFH system should get updates within
a few seconds of the end-user making their change.


There would be some agreed, centrally coordinated, system by which
the Level 1 replicators and the ITRDs and QSDs could recognise
which UAS systems were currently a part of the Ivip system, and
their IP addresses.

That could be as simple as some organisation pointing to them with
DNS from a domain of theirs.  It could also be in the form of an
agreement for the replicators and ITRs to accept updates from a
generally expanding list of UASes.  This would involve central
coordination, but it doesn't involve centralised flows or storage
of data - just non-real-time configuration information which the
operators of replicators, ITRDs and QSDs would follow.


For scaling purposes, some ITRDs may not cover the entire set of
Ivip-mapped master-prefixes.  Then two or more ITRDs at the same
site could split the load among themselves.

In that case, an ITRD which covers a subset should be able to
request that its upstream (typically two, but maybe three or more)
replicators send only the updates for the master-prefixes it
is advertising.  This means the packet format needs to be easy for
the replicators to recognise and classify, which would be more
complex if some or all of the full stream packets contained data
collected from separate packets sent by separate UAS systems to
the Level 1 replicators.
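A sketch of such per-subscriber filtering at a replicator (Python;
the fixed header offset for the master-subnet identifier is my
assumption - and it is exactly the property that combining packets
at Level 1 would break):

```python
import struct

# Assume each update packet starts with a 64-bit sequence number
# followed by a 32-bit master-subnet identifier (hypothetical
# layout).  A replicator can then classify a packet with a single
# fixed-offset read, without parsing the update bodies.
def wants(packet, subscribed_subnet_ids):
    subnet_id = struct.unpack_from("!I", packet, 8)[0]
    return subnet_id in subscribed_subnet_ids
```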


In the previous message I mentioned how a replicator, ITRD or QSD
could request of an upstream replicator another copy of some
packets it was missing.  Maybe that could be done by making the
packets available in a structured manner, such as files named by
their sequence number in ASCII, on the same web servers which are
used to supply full dump files of the database for each
master-subnet.  These files only need to be
kept for 10 or 20 minutes at most, assuming there is a dump every
10 minutes or so.

In principle, the mapping data in each ITRD or ITRC could be
simply updated for months or years, leaving soft RAM errors to
fester for this time.  Each such error would tunnel packets for a
given IP address to the wrong location.  Fully refreshing the
mapping data in each ITRD from a dump every few hours, as CPU
capacity permits, would prevent soft errors accumulating for too
long in non-error corrected RAM.

   Robin       http://www.firstpr.com.au/ip/ivip/




_______________________________________________
RAM mailing list
RAM@iab.org
https://www1.ietf.org/mailman/listinfo/ram