[ogpx] My take on Teleports and protocol resilience

Carlo Wood <carlo@alinoe.com> Sun, 25 October 2009 12:15 UTC

Let's start with some brainstorming...

Things that might be relevant, in no particular order.

* State of avatar
* State of attachments
* Location of avatar
* Time of unresponsiveness of avatar (position) towards viewer
* Time of unresponsiveness of attachments (attaching/detaching/scripts)
* Perception of the viewer
* Perception of other viewers
* Region boundary crossing
* Teleport over larger distance
* User configurable parameters
* Did I miss something important?


Real message starts here:

Something that leads to problems, protocol-wise (or
implementation-wise), is having copies of the same data on
different hosts; this always leads to desynchronization of that
data, with heaps of hard-to-track problems (bugs).

The ONLY way to keep a copy of some data reliably synchronized is by
having a stream of state changes being sent from one host to another,
where the messages that contain those state changes always keep
the same order and the stream is never terminated/lost (in which case
a full resynchronization would be necessary). Note that this means
that the source of the state changes has to be a single point of
origin, which automatically means that we can identify the REAL
(original) copy of the data. Thus:

  [original] ---> stream of state changes --> [copy]

Let's call this a "unidirectional state" for lack of a better word.

[PS I'm ignoring actual implementation details here. I'm NOT
    saying that currently anyone is using a TCP-stream of
    state changes, so don't quote JUST this part with the comment
    that this is not how it currently works, Infinity :p
    Instead, this is a mathematical approach, the abstract
    equivalent of any possible deployment case].
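
A minimal sketch of such a unidirectional state, in Python. The
names (Original, Copy, the sequence numbers) are made up for
illustration and say nothing about any actual implementation; the
only assumption is that each state change carries a sequence
number so that the copy can detect a lost or reordered message.

```python
from dataclasses import dataclass, field


@dataclass
class Original:
    """Single point of origin: the only place where state is ever changed."""
    state: dict = field(default_factory=dict)
    seq: int = 0  # sequence number of the last change that was sent

    def change(self, key, value):
        """Apply a change locally and return the message for the stream."""
        self.state[key] = value
        self.seq += 1
        return (self.seq, key, value)

    def snapshot(self):
        """Full state, used for a full resynchronization."""
        return self.seq, dict(self.state)


@dataclass
class Copy:
    """Replica that only changes by consuming the stream in order."""
    state: dict = field(default_factory=dict)
    seq: int = 0
    in_sync: bool = True

    def receive(self, message):
        seq, key, value = message
        if not self.in_sync or seq != self.seq + 1:
            # A message was lost or reordered: incremental updates can no
            # longer be trusted, only a full resynchronization helps.
            self.in_sync = False
            return False
        self.state[key] = value
        self.seq = seq
        return True

    def resync(self, snapshot):
        self.seq, self.state = snapshot[0], dict(snapshot[1])
        self.in_sync = True


original, copy = Original(), Copy()
m1 = original.change("animation", "walk")
m2 = original.change("animation", "sit")
m3 = original.change("shirt", "uuid-1")

copy.receive(m1)
copy.receive(m3)  # m2 was lost, so the copy flags itself as out of sync
needs_resync = not copy.in_sync
copy.resync(original.snapshot())
```

Note that the copy never changes its own state: every change
comes from the original, either as a stream message or as a
full snapshot.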

Basically, we want to avoid copies of the same data.


The simulator contains a lot of state information that cannot
be moved away from the simulator, though, because that state is
needed to calculate the interactions between all objects and
avatars in the region. That calculation would become way too
slow if, for example, several agent domain services had to be
queried all the time. I can imagine that this also holds for
the attachments on an avatar.

This means that if an avatar moves from one region to another,
all this state information has to be transferred too.

Thus, the origin sends state information to the new region.
If this fails, then we want the user to resume at their old
location: we need to keep a copy of the state until we know
the teleport was successful. In other words, we will have
a copy of data at two different hosts, temporarily.

One way to make sure that this is not a problem is by
freezing all state; then copy it, and only once the copy
is successful, destroy the old data and resume the simulation
in the other region.
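
The freeze/copy/destroy/resume order can be sketched as follows
(Python). Region, receive_copy, handoff and the "accepts" switch
are hypothetical names used only to illustrate the ordering of
the steps, assuming both regions respond.

```python
class Region:
    def __init__(self, name, accepts=True):
        self.name = name
        self.accepts = accepts      # hypothetical switch to simulate a failed copy
        self.avatars = {}           # avatar id -> state dict
        self.frozen = set()         # avatars whose state may not change right now

    def receive_copy(self, avatar_id, state):
        if not self.accepts:
            return False
        self.avatars[avatar_id] = dict(state)
        self.frozen.add(avatar_id)  # the copy stays frozen until the handoff is final
        return True


def handoff(avatar_id, source, target):
    """Freeze, copy, and only then destroy the old data.

    Returns the region in which the avatar's simulation resumes.
    """
    source.frozen.add(avatar_id)                       # 1. freeze all state
    ok = target.receive_copy(avatar_id, source.avatars[avatar_id])  # 2. copy
    if not ok:
        source.frozen.discard(avatar_id)               # failure: resume at the old location
        return source
    del source.avatars[avatar_id]                      # 3. destroy the old data
    source.frozen.discard(avatar_id)
    target.frozen.discard(avatar_id)                   # 4. resume in the other region
    return target


a, b, c = Region("A"), Region("B"), Region("C", accepts=False)
a.avatars["avatar-1"] = {"scripts": {"hud": 42}}

where1 = handoff("avatar-1", a, c)       # copy fails: avatar stays in A
still_in_a = "avatar-1" in a.avatars
where2 = handoff("avatar-1", a, b)       # copy succeeds: avatar resumes in B
```

Because the state is frozen on both sides while two copies
exist, the copies cannot drift apart during the transfer.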

The problem we are trying to tackle now is the case where the old
region is not responsive...


In this case we have to notice that the AD can ALSO detect
that at least the new region is able to host the avatar:
the teleport can *partly* succeed. This can be detected
independently of copying all the state information.

Secondly, we have to realize that the 'location' of the
avatar is (state) data that does NOT have to be copied
(we're changing it anyway) and therefore is not part of
said state, and does not suffer from the copy-problem.

The same could be said about animations (which is currently
broken in Second Life: you must stop all animations
before teleporting or a desync happens): we could state
that after a teleport no animations are active, and leave
it to the viewer to re-initiate any required animations.
However, animations can be made unidirectional, meaning
that if anyone but the agent service (AS) wants to change the
animation, it has to send a request message to the AD,
which then grants it, so that the actual state change always
originates from the AS.

As a result, we can think of the animation state as:


  [AS:animation state] --- message stream --> [simulator:animation state copy]

which means that we can teleport, keeping the correct
animation(s) without bothering with the source region.
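
As an illustration, a small Python sketch of animation state
that lives on the AS, with the simulator only holding a copy.
AgentService, Simulator, place and request are invented names,
and the "grant" logic is reduced to accepting "start" and
"stop"; a real AD would apply its own policy there.

```python
class Simulator:
    def __init__(self, name):
        self.name = name
        self.animations = {}   # avatar id -> set of animations (a copy only)

    def apply(self, avatar_id, animations):
        self.animations[avatar_id] = set(animations)


class AgentService:
    """Holds the original animation state; simulators only hold copies."""

    def __init__(self):
        self.animations = {}   # avatar id -> set of active animations
        self.location = {}     # avatar id -> Simulator currently hosting it

    def place(self, avatar_id, simulator):
        """Set the avatar's location and push the full state to the new host."""
        self.location[avatar_id] = simulator
        simulator.apply(avatar_id, self.animations.setdefault(avatar_id, set()))

    def request(self, avatar_id, action, animation):
        """Anyone who wants a change asks here; the change originates here."""
        current = self.animations.setdefault(avatar_id, set())
        if action == "start":
            current.add(animation)
        elif action == "stop":
            current.discard(animation)
        else:
            return False       # request not granted
        self.location[avatar_id].apply(avatar_id, current)
        return True


agent_service = AgentService()
old, new = Simulator("old"), Simulator("new")

agent_service.place("avatar-1", old)
agent_service.request("avatar-1", "start", "dance")
agent_service.place("avatar-1", new)   # teleport: the old region is never consulted
```

The teleport is just another call to place(): the new simulator
gets its copy from the AS, so it does not matter whether the old
region still responds.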


The only clear case where this kind of 'trick' doesn't work
(I think it's not a trick, but the best way to implement this)
is for script states: those change way too frequently to host
them on the AS.


Conclusion

Thus, in the case of an unresponsive source region,
we CAN - without risks of desynchronization - immediately
transfer the avatar (location) and animations, and
any other INFREQUENTLY CHANGING, UNIDIRECTIONAL DATA
that can be stored on the AS (and buffered on the
simulator), like clothing UUIDs, and attachments.

However, the scripts in the attachments will not run
until their state is transferred from the old region.

Imho, this is hardly a problem: after a user-configurable
timeout we leave it up to the user what he wants to do:
log out or reset the scripts :p... Ok, we just reset
the scripts after some time, where the timeout is
determined by the AD with possible input from the user.
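
The script timeout could look roughly like this in Python. The
Arrival class, its fields, and the use of plain seconds are
assumptions for the sake of the example; the point is only that
scripts stay stopped until either the old region delivers their
state or the timeout expires and they are reset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Arrival:
    """Avatar as seen by the destination region after a teleport from an
    unresponsive source region."""
    as_data: dict                        # location, animations, clothing, attachments
    arrived_at: float                    # time of arrival, in seconds
    script_timeout: float                # set by the AD, possibly with user input
    script_state: Optional[dict] = None  # None until it is known
    scripts_running: bool = False

    def on_script_state(self, state):
        """The old region finally delivered the script state."""
        if not self.scripts_running:
            self.script_state = dict(state)
            self.scripts_running = True

    def tick(self, now):
        """Reset the scripts once the timeout expired without any state."""
        if not self.scripts_running and now - self.arrived_at >= self.script_timeout:
            self.script_state = {}       # reset: scripts start from scratch
            self.scripts_running = True


# Old region never answers: scripts are reset after the timeout.
late = Arrival({"animations": ["sit"]}, arrived_at=0.0, script_timeout=30.0)
late.tick(10.0)
running_early = late.scripts_running
late.tick(30.0)

# Old region answers in time: scripts resume with their real state.
lucky = Arrival({"animations": ["sit"]}, arrived_at=0.0, script_timeout=30.0)
lucky.on_script_state({"hud": {"counter": 7}})
lucky.tick(60.0)
```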

-- 
Carlo Wood <carlo@alinoe.com>