Re: [ogpx] Teleports and protocol resilience

David W Levine <dwl@us.ibm.com> Fri, 23 October 2009 18:09 UTC

In-Reply-To: <e0b04bba0910230947y5b756bb0uee30c1b37d397d21@mail.gmail.com>
References: <e0b04bba0910122213n66886b92x57446ad84def466f@mail.gmail.com> <e0b04bba0910230947y5b756bb0uee30c1b37d397d21@mail.gmail.com>
To: Morgaine <morgaine.dinova@googlemail.com>
MIME-Version: 1.0
X-KeepSent: 1FEFE8EB:C2DA4EA2-85257658:005C9A89; type=4; name=$KeepSent
X-Mailer: Lotus Notes Release 8.0.2 HF623 January 16, 2009
Message-ID: <OF1FEFE8EB.C2DA4EA2-ON85257658.005C9A89-85257658.0063C6C5@us.ibm.com>
From: David W Levine <dwl@us.ibm.com>
Date: Fri, 23 Oct 2009 14:09:50 -0400
X-MIMETrack: Serialize by Router on D01ML605/01/M/IBM(Release 8.5.1|September 28, 2009) at 10/23/2009 14:09:51, Serialize complete at 10/23/2009 14:09:51
Content-Type: multipart/alternative; boundary="=_alternative 0063C6C385257658_="
Cc: ogpx-bounces@ietf.org, ogpx@ietf.org
Subject: Re: [ogpx] Teleports and protocol resilience

This is a good point, and it highlights a very tricky issue in any
distributed system design. A protocol should behave gracefully in the face
of failure. One of the definite challenges here is partition vs. failure:
it's impossible to distinguish between a failed service and one that is
lost behind a network partition. In the case of a partition, with an
avatar continuing to run on a momentarily isolated region, care needs to be
taken to recover gracefully not only from the "can't get to the region
service to get it to release the agent" case, but also from the "the
region has returned to contact" case. So I think that when we talk about
being resilient, we need to think about the whole range of failures, from
"the service I was using has crashed" to "the service I was using has
become invisible due to a network partition." Of course, from the outside
they look identical. This rather strongly implies that we need good
semantics for what happens to a "stub" session which becomes isolated, as
well as for the "new" session which we create by bypassing a failed
service. (In the case of teleport, that means the old region's idea of
agent state and the new one's; in other cases, possibly other isolated
state.)
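
To make the "stub" vs. "new" session point concrete, here is a minimal
Python sketch. Nothing in it comes from a VWRAP draft: the AgentDomain
class, the epoch counter, and every method name are assumptions made up
for illustration. The idea is only that a missing reply looks the same for
a crash and a partition, so the agent domain records the bypassed session
and resolves it if the old region ever returns to contact.

```python
from dataclasses import dataclass, field


@dataclass
class AgentDomain:
    """Hypothetical agent-domain bookkeeping; all names are illustrative."""
    current: dict = field(default_factory=dict)  # agent_id -> (region, epoch)
    stubs: set = field(default_factory=set)      # (agent_id, region) left isolated

    def observe(self, reply):
        # A missing reply is all the caller ever sees, whether the region
        # crashed or is merely hidden behind a partition.
        return "responded" if reply is not None else "unreachable"

    def place(self, agent_id, region):
        # Each placement bumps an epoch so stale sessions can be recognised.
        _, epoch = self.current.get(agent_id, (None, 0))
        self.current[agent_id] = (region, epoch + 1)
        return epoch + 1

    def bypass(self, agent_id, new_region):
        # The old region is unreachable: remember its session as a stub
        # and create the new session elsewhere.
        old_region, _ = self.current[agent_id]
        self.stubs.add((agent_id, old_region))
        return self.place(agent_id, new_region)

    def on_region_returned(self, agent_id, region, claimed_epoch):
        # The region is back in contact and reports the epoch it holds.
        _, live_epoch = self.current[agent_id]
        if claimed_epoch < live_epoch:
            self.stubs.discard((agent_id, region))
            return "release"  # stale stub: ask the region to drop it
        return "keep"
```

An epoch counter is only one possible way to order the two sessions; any
scheme that lets the returning region discover its state is stale would
serve the same purpose.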

Food for thought

- David
~ Zha




On 10/23/2009 12:47 PM, Morgaine <morgaine.dinova@googlemail.com> wrote
(to ogpx@ietf.org, sent by ogpx-bounces@ietf.org):






Looking back at the replies in this thread, I think that the goal and the 
means to achieve it didn't quite come across.

I was trying to address only a very specific issue, just protocol 
resilience under source region non-responsiveness, since this is common 
enough that it merits addressing.  I did not suggest that there be any 
perceivable change of teleport semantics under normal operation (because 
no such change is needed), only a change in service coupling.  The 
semantics we experience in SL and in Opensim would remain completely 
unchanged, except in the single case of source region non-responsiveness.  
Under this single anomalous case there would be a perceivable change, but 
that change would be a huge improvement.

There would be no new decoherence introduced since exactly the same state 
changes would occur on TP as before, with no possibility of agent state 
change in the source region once the AD accepts the TP.

All that's needed to achieve such resilience for teleport at the protocol 
level is a slight revision of operation phasing to permit greater 
execution overlap, as I outlined.  This is independent of anything else 
that happens in the course of the overall teleport operation --- the 
change of phasing would affect only the transfer of agent location, 
nothing else.

In particular, it should not be confused with the separate requirement for 
instantiation of assets or objects at destination, nor with the matter of 
serializing and deserializing script states.  The latter has not even been 
defined for VWRAP, so it's hard to talk about changing it.  In any event, 
this isn't about those aspects of teleport, and doesn't affect them --- 
they would continue to work as before.

One of the central aspects of VWRAP is that the protocol is based on a 
multiple services model, and one of the key approaches in highly scalable 
systems design is to keep services decoupled to the largest extent 
possible.  That's what I'm proposing here, a partial decoupling that has 
no normal semantic change but which does have benefits in anomalous 
situations.

Agent location change can be decoupled significantly from asset 
instantiation change and script state transfer.  My suggestion referred to 
this decoupled agent location change only, not to asset and simulation 
services.  Those other two services undergo state transitions at the same 
time as change of agent location does on TP, but services should never be 
coupled together unnecessarily, and in this case the coupling can be left 
very weak.  The three types of service operations can proceed each at 
their own independent rates, coupled at TP initiation time and nowhere 
else.
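
As a rough illustration of "coupled at TP initiation time and nowhere
else", the following Python asyncio sketch starts the three operations
together and lets the agent location change finish while the other two are
still outstanding. The function names and the gate that holds the slower
services back are assumptions for the demonstration only, not part of any
VWRAP definition.

```python
import asyncio


async def initiate_teleport():
    """Start three hypothetical service operations at TP initiation."""
    log = []
    slow_gate = asyncio.Event()  # stands in for slow asset/script services

    async def move_agent_location():
        log.append("location")

    async def instantiate_assets():
        await slow_gate.wait()
        log.append("assets")

    async def transfer_script_state():
        await slow_gate.wait()
        log.append("scripts")

    # The only coupling point: all three are launched together.
    tasks = [
        asyncio.create_task(move_agent_location()),
        asyncio.create_task(instantiate_assets()),
        asyncio.create_task(transfer_script_state()),
    ]

    # Agent location completes without waiting for the other services.
    await tasks[0]
    done_when_location_finished = list(log)

    # Only now do the slower services get to finish, at their own rate.
    slow_gate.set()
    await asyncio.gather(*tasks)
    return done_when_location_finished, log
```

Calling asyncio.run(initiate_teleport()) returns what had completed at the
moment the location change finished, followed by the final log.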

It should be noted that the legacy protocols do some of this already, in 
that the agent is already active in the destination region long before her 
avatar or objects have appeared.  Furthermore, the avatar currently 
continues to be visible in the source region for a while after the agent 
becomes active in the destination region, because of normal operation 
latencies, sim-side queueing, and client lag.  This is a normal part of 
current operation, and is not considered an anomaly. What's important is 
that no new state change to the agent is possible in the source region 
after TP is initiated, and that would remain true.

The impact of this on the other parts of the puzzle needs to wait until 
those other parts are examined.  We're not there yet, but I would hope 
that improving teleport protocol resilience would be a desirable goal 
when the only noticeable change in semantics occurs under fault conditions 
and provides a major improvement on current behaviour.


Morgaine.






======================================

On Tue, Oct 13, 2009 at 6:13 AM, Morgaine <morgaine.dinova@googlemail.com> 
wrote:
One of the advantages we have in developing the VWRAP protocols is that we 
are able to look back at legacy SL and Opensim protocols and recognize 
design mistakes or limitations in them.  This allows us to avoid repeating 
such mistakes or limitations in the next generation of systems.

One of the most common sources of frustration and dissatisfaction is 
simulator non-responsiveness.  While this has many possible causes, in 
VWRAP we are not interested in the internal implementation of simulators, 
but we ARE interested in the ability of a protocol endpoint to perform its 
duty within the protocol.  A jammed simulator host is in many cases quite 
unable to perform its protocol duties, or in some cases only exceedingly 
slowly, often timing out in a TP for example.  We have a huge amount of 
experience of this happening in both SL and Opensim, so it is a practical 
reality.  On occasion, simulators will be unable to fulfil their part in a 
protocol, and this needs to be taken into account because it is not 
uncommon.

One key area in which the above is relevant is in teleports OUT of a 
simulator that is under distress.  Quite often users wish nothing more 
than to leave the region being run by a dying simulator, but when 
teleport-out requires cooperation from the host that one is trying to 
leave then this is often not possible at all.  In this situation, the only 
remedy in existing systems is to forcibly terminate the client and relog 
in another region.  We should avoid such out-of-protocol remedies being 
necessary through good protocol design.

In VWRAP, we have both Rez Avatar and Derez Avatar capabilities, which 
lead to corresponding protocol operations during teleport.  Suppose R1 is 
a region being run by a non-responsive simulator from which we want to 
escape, and R2 is another region to which we wish to go.  If the protocol 
requires a Derez in R1 to be completed before a Rez in R2 can commence, 
then the user will have difficulties.  Clearly we don't want this.

In http://tools.ietf.org/html/draft-hamrick-ogp-intro-00 , it is made 
clear that "The agent domain MUST also remove the avatar from it's current 
location before placing the avatar in the destination location."  This 
suggests that the protocol will be sensitive to R1 non-responsiveness.  
While we do not yet have an actual VWRAP Teleport draft, it seems likely 
that its initial incarnation will have that same problem built in.

I suggest that the protocol define Derez and Rez as concurrent and 
non-dependent operations to avoid this situation.  The AD can mark R1 as 
disabled for all further agent state changes --- this will provide all the 
protection needed to prevent brief double-presence anomalies from being 
significant.  If a jammed R1 refuses to give up its hold on the avatar, 
then at least the user will not suffer from it.  Reaping dead simulator 
sessions then becomes a problem for the region operator alone, and not for 
the AD, the user, and the region as happens now.
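
A minimal Python asyncio sketch of this suggestion follows. It is not taken
from any VWRAP or OGP draft: the AgentDomain class, its method names, and
the timeout value are all hypothetical. The AD first marks R1 as disabled
for agent state changes, then runs Derez and Rez concurrently, so a jammed
R1 cannot hold up arrival in R2.

```python
import asyncio


class AgentDomain:
    """Hypothetical AD that does not gate Rez in R2 on Derez in R1."""

    def __init__(self):
        self.disabled = set()  # regions barred from changing agent state
        self.location = None

    def accept_state_change(self, region):
        # Guard against double presence: a disabled region is ignored.
        return region not in self.disabled

    async def teleport(self, derez_r1, rez_r2, r1="R1", r2="R2",
                       derez_timeout=0.05):
        self.disabled.add(r1)                       # R1 can no longer alter the agent
        derez_task = asyncio.create_task(derez_r1())
        await rez_r2()                              # proceeds regardless of R1
        self.location = r2
        try:
            await asyncio.wait_for(derez_task, derez_timeout)
            return "derez completed"
        except asyncio.TimeoutError:
            # Reaping the dead session is left to the region operator.
            return "derez timed out"


async def jammed_derez():
    await asyncio.Event().wait()  # simulates an R1 that never answers


async def quick_rez():
    await asyncio.sleep(0)
```

With these stand-ins, asyncio.run(AgentDomain().teleport(jammed_derez,
quick_rez)) places the agent in R2 and reports the Derez timeout rather
than failing the whole teleport.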


Morgaine.


