Re: [ogpx] Teleports and protocol resilience

Morgaine <morgaine.dinova@googlemail.com> Sat, 24 October 2009 03:45 UTC


Excellent observation, David!

To use a different form of words, fault conditions can occur in all the
possible states of the system, and if we want the protocol to be robust then
we'll probably have to examine all its major states and ensure that there
are always state transitions available out of them.
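
As a rough illustration of that kind of audit, here is a minimal sketch in Python. The state names and the transition table are purely hypothetical (nothing here comes from a VWRAP draft); the point is only that "every non-terminal state has an exit other than termination" is a property one can check mechanically once the states are written down.

```python
# Hypothetical session states and transitions, invented for illustration only.
TRANSITIONS = {
    "connected": {"teleport_requested", "terminated"},
    "teleport_requested": {"arriving", "source_unresponsive"},
    "source_unresponsive": {"arriving", "terminated"},
    "arriving": {"connected", "terminated"},
    "terminated": set(),
}
TERMINAL = {"terminated"}


def states_without_recovery(transitions, terminal):
    """Return the non-terminal states whose only way out (if any) is termination."""
    stuck = []
    for state, targets in transitions.items():
        if state in terminal:
            continue
        # A state is acceptable only if at least one exit avoids termination.
        if not (targets - terminal):
            stuck.append(state)
    return sorted(stuck)
```

Calling states_without_recovery(TRANSITIONS, TERMINAL) on the table above returns an empty list; removing the "arriving" exit from "source_unresponsive" would make that state show up as one that can only end in termination.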

In that light, being able to TP out of a stuck region is just one of those
recovery transitions without which the protocol would simply break as a
result of the exception.  There is of course no need to design VWRAP such
that exception handling always ends in session termination --- this would
result in a very poor user experience by default, and so it should be
avoided wherever possible.
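
To make that recovery transition concrete, the following is a sketch only of the phasing proposed in the original message quoted further down: the AD bars the source region from further agent state changes, starts the Derez without waiting for it, and proceeds with the Rez. The class, function names, outcome strings and timeout value are all assumptions made for the example, not anything defined by VWRAP.

```python
import asyncio


class AgentDomain:
    """Hypothetical AD bookkeeping; not taken from any VWRAP draft."""

    def __init__(self):
        self.frozen = set()    # regions barred from changing agent state
        self.location = None   # region the agent is currently placed in


async def teleport(ad, source, dest, derez, rez, derez_timeout=0.05):
    # Bar the source region first, so a brief double presence cannot
    # produce conflicting agent state.
    ad.frozen.add(source)
    # Start the Derez, but do not make the Rez depend on its completion.
    derez_task = asyncio.create_task(derez(source))
    await rez(dest)
    ad.location = dest
    try:
        await asyncio.wait_for(derez_task, derez_timeout)
        return "clean"
    except asyncio.TimeoutError:
        # A jammed source is left for its operator to reap.
        return "source_left_for_reaping"


async def jammed_derez(region):
    await asyncio.sleep(3600)  # stands in for a non-responsive simulator


async def quick_rez(region):
    await asyncio.sleep(0)


ad = AgentDomain()
outcome = asyncio.run(teleport(ad, "R1", "R2", jammed_derez, quick_rez))
```

In this run the agent ends up located in R2 even though R1 never answers, and R1 stays frozen in the AD's records. What should happen if the Rez itself fails, or if R1 later returns to contact, is deliberately left out of the sketch.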

My suggestion is therefore likely to be just one of many designed to avoid
session termination under conditions that, while not normal, are
nevertheless unfortunately rather common in practice.  The user's client
application should never be forced to disconnect completely (or even worse
to terminate) unless there is absolutely no alternative, and it's up to us
to provide such alternatives wherever we can.

Breaking the user's immersive experience through termination should only be
a solution of last resort.


Morgaine.





====================================

On Fri, Oct 23, 2009 at 7:09 PM, David W Levine <dwl@us.ibm.com> wrote:

>
> This is a good point, and highlights a very tricky issue in any distributed
> system design. A protocol should behave gracefully in the face of failure.
> One of the definite challenges here is partition vs. failure. It's impossible
> to distinguish between a failed service and one which is lost behind a
> network partition. In the case of a partition, with an avatar continuing to
> run on a momentarily isolated region, care needs to be taken to recover
> gracefully not only from the "can't get to the region service to get it to
> release the agent" case, but also from the "the region has returned to
> contact" case. So, I think when we talk about being resilient we need to
> think about the range of failure, from "the service I was using has crashed"
> to "the service I was using has become invisible due to a network
> partition." Of course, they look identical. This rather strongly implies
> good semantics on what happens to a "stub" session which becomes isolated,
> as well as to the "new" session which we create by bypassing a failed
> service. (In the case of teleport, the old region's idea of agent state and
> the new one's; in other cases, possibly other isolated state.)
>
> Food for thought
>
> - David
> ~ Zha
>
>
>
> Morgaine <morgaine.dinova@googlemail.com>
> Sent by: ogpx-bounces@ietf.org
> 10/23/2009 12:47 PM
> To: ogpx@ietf.org
> Subject: Re: [ogpx] Teleports and protocol resilience
>
>
>
>
> Looking back at the replies in this thread, I think that the goal and the
> means to achieve it didn't quite come across.
>
> I was trying to address only a very specific issue, just protocol
> resilience under source region non-responsiveness, since this is common
> enough that it merits addressing.  I did not suggest that there be any
> perceivable change of teleport semantics under normal operation (because no
> such change is needed), only a change in service coupling.  The semantics we
> experience in SL and in Opensim would remain completely unchanged, except in
> the single case of source region non-responsiveness.  Under this single
> anomalous case there *would* be a perceivable change, but that change
> would be a huge improvement.
>
> There would be no new decoherence introduced since exactly the same state
> changes would occur on TP as before, with no possibility of agent state
> change in the source region once the AD accepts the TP.
>
> All that's needed to achieve such resilience for teleport at the protocol
> level is a slight revision of operation phasing to permit greater execution
> overlap, as I outlined.  This is independent of anything else that happens
> in the course of the overall teleport operation --- the change of phasing
> would affect only the transfer of *agent location*, nothing else.
>
> In particular, it should not be confused with the separate requirement for
> instantiation of assets or objects at destination, nor with the matter of
> serializing and deserializing script states.  The latter has not even been
> defined for VWRAP, so it's hard to talk about changing it.  In any event,
> this isn't about those aspects of teleport, and doesn't affect them --- they
> would continue to work as before.
>
> One of the central aspects of VWRAP is that the protocol is based on a
> multiple services model, and one of the key approaches in highly scalable
> systems design is to keep services decoupled to the largest extent
> possible.  That's what I'm proposing here, a *partial decoupling* that has
> no normal semantic change but which does have benefits in anomalous
> situations.
>
> Agent location change *can* be decoupled significantly from asset
> instantiation change and script state transfer.  My suggestion referred to
> this decoupled *agent location change* only, not to asset and simulation
> services.  Those other two services undergo state transitions at the same
> time as change of agent location does on TP, but services should never be
> coupled together unnecessarily, and in this case the coupling can be left
> very weak.  The three types of service operations can proceed each at their
> own independent rates, coupled at TP initiation time and nowhere else.
>
> It should be noted that the legacy protocols do some of this already, in
> that the agent is already active in the destination region long before her
> avatar or objects have appeared.  Furthermore, the avatar currently
> continues to be visible in the source region for a while after the agent
> becomes active in the destination region, because of normal operation
> latencies, sim-side queueing, and client lag.  This is a normal part of
> current operation, and is not considered an anomaly. What's important is
> that no new state change to the agent is possible in the source region after
> TP is initiated, and that would remain true.
>
> The impact of this on the other parts of the puzzle needs to wait until
> those other parts are examined.  We're not there yet, but I would hope that
> improving teleport protocol resilience would be a desirable goal when the
> only noticeable change in semantics occurs under fault conditions and
> provides a major improvement on current behaviour.
>
>
> Morgaine.
>
>
>
>
>
>
> ======================================
>
> On Tue, Oct 13, 2009 at 6:13 AM, Morgaine <morgaine.dinova@googlemail.com> wrote:
> One of the advantages we have in developing the VWRAP protocols is that we
> are able to look back at legacy SL and Opensim protocols and recognize
> design mistakes or limitations in them.  This allows us to avoid repeating
> such mistakes or limitations in the next generation of systems.
>
> One of the most common sources of frustration and dissatisfaction is
> simulator non-responsiveness.  While this has many possible causes, in VWRAP
> we are not interested in the internal implementation of simulators, but we
> *ARE* interested in the ability of a protocol endpoint to perform its duty
> within the protocol.  A jammed simulator host is in many cases quite unable
> to perform its protocol duties, or in some cases only exceedingly slowly,
> often timing out in a TP for example.  We have a huge amount of experience
> of this happening in both SL and Opensim, so it is a practical reality.  On
> occasion, simulators will be unable to fulfil their part in a protocol, and
> this needs to be taken into account because it is *not uncommon*.
>
> One key area in which the above is relevant is in teleports *OUT* of a
> simulator that is under distress.  Quite often users wish nothing more than
> to *leave* the region being run by a dying simulator, but when
> teleport-out requires cooperation from the host that one is trying to leave
> then this is often not possible at all.  In this situation, the only remedy
> in existing systems is to forcibly terminate the client and relog in another
> region.  We should avoid such out-of-protocol remedies being necessary
> through good protocol design.
>
> In VWRAP, we have both Rez Avatar and Derez Avatar capabilities, which lead
> to corresponding protocol operations during teleport.  If R1 is a region
> being run by a non-responsive simulator from which we want to escape, and R2
> is another region to which we wish to go, if the protocol requires a Derez
> in R1 to be completed before a Rez in R2 can commence then the user will
> have difficulties.  Clearly we don't want this.
>
> In http://tools.ietf.org/html/draft-hamrick-ogp-intro-00, it is made clear
> that "*The agent domain MUST also remove the avatar from it's current
> location before placing the avatar in the destination location*."  This suggests
> that the protocol will be sensitive to R1 non-responsiveness.  While we do
> not yet have an actual VWRAP Teleport draft, it seems likely that its
> initial incarnation will have that same problem built in.
>
> I suggest that the protocol define Derez and Rez as *concurrent* and
> *non-dependent* operations to avoid this situation.  The AD can mark R1 as
> disabled for all further agent state changes --- this will provide all the
> protection needed to prevent brief double-presence anomalies from being
> significant.  If a jammed R1 refuses to give up its hold on the avatar, then
> at least the user will not suffer from it.  Reaping dead simulator sessions
> then becomes a problem for the region operator alone, and not for the AD,
> the user, and the region as happens now.
>
>
> Morgaine.
>
>
>