Re: [ogpx] Teleports and protocol resilience
Morgaine <morgaine.dinova@googlemail.com> Sun, 25 October 2009 02:47 UTC
In-Reply-To: <9b8a8de40910240659h649d0c35kab5706b5a2cd6caf@mail.gmail.com>
References: <e0b04bba0910122213n66886b92x57446ad84def466f@mail.gmail.com>
<e0b04bba0910230947y5b756bb0uee30c1b37d397d21@mail.gmail.com>
<OF1FEFE8EB.C2DA4EA2-ON85257658.005C9A89-85257658.0063C6C5@us.ibm.com>
<e0b04bba0910232045t30e36d69g1ff04a50e899896@mail.gmail.com>
<4AE29CDF.1070208@cox.net>
<e0b04bba0910240308w74772eefoa7e0b2ebb34e5d4a@mail.gmail.com>
<9b8a8de40910240659h649d0c35kab5706b5a2cd6caf@mail.gmail.com>
Date: Sun, 25 Oct 2009 03:47:39 +0100
Message-ID: <e0b04bba0910241947w231b0026ofb1b8db4ca93d6d1@mail.gmail.com>
From: Morgaine <morgaine.dinova@googlemail.com>
To: ogpx@ietf.org
Content-Type: multipart/alternative; boundary=001517476732e6f78a0476b977f8
Subject: Re: [ogpx] Teleports and protocol resilience
On Sat, Oct 24, 2009 at 2:59 PM, Vaughn Deluca <vaughn.deluca@gmail.com> wrote:

>
> - The agent domain is able to complete a teleport (after a timeout period)
> even if the original region completely fails to respond, so *even* if there
> is no guarantee the invocation of the derez cap has been processed.
>

By "complete a teleport" I mean "*succeed in placing the agent in the destination region independent of the state of its avatar*". I'm making this distinction because that's precisely the semantic we have right now in SL and Opensim --- the avatar wearables and attachments arrive much later, asynchronously.

The *avatar* does *not* obey a semantic of "one place only" but has parallel existence for a short while in both regions even under normal operation. (Under abnormal operation of the source region, that overlap can last a very long time.) It's the *active agent* that has unique existence in only one region at a time, and that's the semantic which I want to preserve as well.

(It's worth mentioning that the avatar and its equipped objects are left to appear/disappear at their own sweet rate for a very good pragmatic reason. Teleport would take an eternity if it were defined as a fully synchronous operation for all relevant parts of agent and avatar together.)

> - To prevent true duplication of avatars, the agent domain keeps some
> record of the old connection, to be able to deal gracefully with the case
> that the old region comes back to life. This implies that in some cases two
> copies of the avatar might be visible, one "dead body" in the old region,
> and the live version in the new region. This might in some cases be
> preferable to a stranded avatar, yet in other situations could be
> unacceptable (e.g. in a combat game, where mere visual presence might be
> enough to influence the game).
>

Note that this dual presence of avatars is exactly the situation that we have currently, as I outlined above.
It's important to be clear about the distinction between agents and avatars in this area. The key semantic is that, even if an avatar is still visible in the source region after a teleport, there be *no active agent* associated with it. In other words, any events which the source region may send the agent after the AD has accepted the TP are discarded. Under normal operation there should be no such inappropriate events arriving because the source region will have processed the Derez, but under abnormal operation anything can happen so one has to handle it defensively.

As concurrent programming becomes more common in this emerging "age of multicore", we're going to be seeing these kinds of anomalies arise with ever greater frequency, particularly as regions become virtualized or implemented scalably on clusters.

> -Therefore, in spite of the freedom of the agent domain to complete a tp
> from a non-responding region, it's left up to the region to decide if it
> wants to accept the rezzing of an avatar that has not completed the
> de-rezzing in the old region.
>

Yes. As Lawson highlighted, "one size fits all" is not appropriate.

>
> - Since there is no way to get the most recent agent state from the old
> region, this special TP without confirmation from the originating region
> would bring back the agent in some default state in the new region. This
> process would lead to a graceful degradation of the system. With some clever
> setting of defaults it might not be too disturbing for the avatar. It is
> left up to the client to instruct the agent service about the way to deal
> with these types of anomalies.
>

Fortunately it's a very easy anomaly to handle, since the TP simply acquires the semantic of a first-time rez in the destination region at the point of source region timeout.

>
> -Finally, your suggestions imply that the Agent service MAY act as an
> intermediate in the TP, rather than the two regions communicating directly
> as the current TP draft specifies.
>

The AD carries the "first say" responsibility for agent teleport policy, that's its key role in teleports, and therefore it must always be an intermediary in teleports otherwise it cannot effect such agent policy. Region-to-region communication should really be seen as a behind-the-scenes optimization or an implementation detail for *avatar* asset/object/state transfer --- notice clearly it's *avatar* here, not agent. And since it breaks resilient teleport under source region anomalies, it needs to be optional.

>
> At first sight it all looks reasonably realistic, yet it is for sure more
> complicated than the current TP draft, and I am not really able to judge
> what we are actually bringing onto ourselves here.
>

These are pre-draft discussions for VWRAP --- I've purposely avoided focussing on the original OGP Teleport drafts since many things are changing for VWRAP anyway as a result of the introduction of RD-controlled services that do not feature in the original drafts. We've got to get the VWRAP teleport semantics well understood in our minds first before churning out a VWRAP Teleport draft spec for it. There are still many unknowns in this area. :-)

Morgaine.

========================================

On Sat, Oct 24, 2009 at 2:59 PM, Vaughn Deluca <vaughn.deluca@gmail.com> wrote:

> Morgaine,
>
> If I understand you right, you are suggesting the following:
>
> - The agent domain is able to complete a teleport (after a timeout period)
> even if the original region completely fails to respond, so *even* if there
> is no guarantee the invocation of the derez cap has been processed.
>
> - To prevent true duplication of avatars, the agent domain keeps some
> record of the old connection, to be able to deal gracefully with the case
> that the old region comes back to life. This implies that in some cases two
> copies of the avatar might be visible, one "dead body" in the old region,
> and the live version in the new region.
> This might in some cases be
> preferable to a stranded avatar, yet in other situations could be
> unacceptable (e.g. in a combat game, where mere visual presence might be
> enough to influence the game).
>
> -Therefore, in spite of the freedom of the agent domain to complete a tp
> from a non-responding region, it's left up to the region to decide if it
> wants to accept the rezzing of an avatar that has not completed the
> de-rezzing in the old region.
>
> - Since there is no way to get the most recent agent state from the old
> region, this special TP without confirmation from the originating region
> would bring back the agent in some default state in the new region. This
> process would lead to a graceful degradation of the system. With some clever
> setting of defaults it might not be too disturbing for the avatar. It is
> left up to the client to instruct the agent service about the way to deal
> with these types of anomalies.
>
> -Finally, your suggestions imply that the Agent service MAY act as an
> intermediate in the TP, rather than the two regions communicating directly
> as the current TP draft specifies.
>
> At first sight it all looks reasonably realistic, yet it is for sure more
> complicated than the current TP draft, and I am not really able to judge
> what we are actually bringing onto ourselves here.
>
> - Vaughn
>
> On Sat, Oct 24, 2009 at 12:08 PM, Morgaine <morgaine.dinova@googlemail.com>
> wrote:
>
>> Indeed! Not only are extra user options always a good thing, but a TP
>> from a region in one world could be to a region in a highly different world
>> with very specific requirements. As you point out, "one size fits all" is
>> definitely inappropriate.
>>
>> In this particular case, that could be handled by the AD sending the
>> client an event notification to the effect that the source region did not
>> respond within the timeout period.
>>
>> The client would then be free to implement disconnection if this is the
>> user's configured choice. I think it's fair to expect that most people
>> would not choose that option, but it should certainly be available to them
>> if needed.
>>
>> The agent would probably already be located in the destination region by
>> this time from the AD's perspective (depending on timeout length), but not
>> necessarily from the client's perspective --- that's a matter of client
>> implementation and user options. Some Second Life clients already implement
>> "Don't blank screen on TP", "Don't show TP progress bar" and "Display avatar
>> before wearables are available" as options, so they're already handling the
>> component parts of the overall TP operation quite flexibly.
>>
>> The AD of a world may itself have a policy of not allowing TPs to complete
>> when the source region is unavailable, but this should not be hardwired into
>> the protocol. As you say, one size does not fit all, particularly in the
>> case of choices that provide a bad user experience.
>>
>>
>> Morgaine.
>>
>> ===========================================
>>
>> On Sat, Oct 24, 2009 at 7:21 AM, Lawson English <lenglish5@cox.net> wrote:
>>
>>> Morgaine wrote:
>>>
>>>> Excellent observation, David!
>>>>
>>>> To use a different form of words, fault conditions can occur in all the
>>>> possible states of the system, and if we want the protocol to be robust then
>>>> we'll probably have to examine all its major states and ensure that there
>>>> are always state transitions available out of them.
>>>>
>>>> In that light, being able to TP out of a stuck region is just one of
>>>> those recovery transitions without which the protocol would simply break as
>>>> a result of the exception.
>>>> There is of course no need to design VWRAP such
>>>> that exception handling always ends in session termination --- this would
>>>> result in a very poor user experience by default, and so it should be
>>>> avoided wherever possible.
>>>>
>>>> My suggestion is therefore likely to be just one of many designed to
>>>> avoid session termination under conditions that, while not normal, are
>>>> nevertheless unfortunately rather common in practice. The user's client
>>>> application should never be forced to disconnect completely (or even worse
>>>> to terminate) unless there is absolutely no alternative, and it's up to us
>>>> to provide such alternatives wherever we can.
>>>>
>>>> Breaking the user's immersive experience through termination should only
>>>> be a solution of last resort.
>>>>
>>>>
>>>> Morgaine.
>>>>
>>>> ====================================
>>>>
>>>> On Fri, Oct 23, 2009 at 7:09 PM, David W Levine <dwl@us.ibm.com
>>>> <mailto:dwl@us.ibm.com>> wrote:
>>>>
>>>> This is a good point, and highlights a very tricky issue in any
>>>> distributed system design. A protocol should behave gracefully in
>>>> the face of failure. One of the definite challenges here is
>>>> partition vs. failure. It's impossible to distinguish between a
>>>> failed service and one which is lost behind a network partition.
>>>> In the case of a partition, with an avatar continuing to run on
>>>> a momentarily isolated region, care needs to be taken to recover
>>>> gracefully not only from the "can't get to the region service" to
>>>> get it to release the agent, but also the "The region has returned
>>>> to contact". So, I think when we talk about being resilient we
>>>> need to think about the range of failure, from "the service I was
>>>> using has crashed" to "the service I was using has become
>>>> invisible due
>>>> to a network partition." Of course, they look identical.
>>>> This
>>>> rather strongly implies good semantics on what happens to a "Stub"
>>>> session which becomes isolated as well as the "new" session
>>>> which we create by bypassing a failed service. (In the case of
>>>> teleport, the old region's idea of agent state and the new ones,
>>>> in other cases, possibly other isolated state)
>>>> Food for thought
>>>>
>>>> - David
>>>> ~ Zha
>>>>
>>>> *Morgaine <morgaine.dinova@googlemail.com
>>>> <mailto:morgaine.dinova@googlemail.com>>*
>>>> Sent by: ogpx-bounces@ietf.org <mailto:ogpx-bounces@ietf.org>
>>>> 10/23/2009 12:47 PM
>>>>
>>>> To: ogpx@ietf.org <mailto:ogpx@ietf.org>
>>>> cc:
>>>> Subject: Re: [ogpx] Teleports and protocol resilience
>>>>
>>>> Looking back at the replies in this thread, I think that the goal
>>>> and the means to achieve it didn't quite come across.
>>>>
>>>> I was trying to address only a very specific issue, just protocol
>>>> resilience under source region non-responsiveness, since this is
>>>> common enough that it merits addressing. I did not suggest that
>>>> there be any perceivable change of teleport semantics under normal
>>>> operation (because no such change is needed), only a change in
>>>> service coupling. The semantics we experience in SL and in
>>>> Opensim would remain completely unchanged, except in the single
>>>> case of source region non-responsiveness. Under this single
>>>> anomalous case there */would/* be a perceivable change, but that
>>>> change would be a huge improvement.
>>>>
>>>> There would be no new decoherence introduced since exactly the
>>>> same state changes would occur on TP as before, with no possibility
>>>> of agent state change in the source region once the AD accepts the
>>>> TP.
>>>>
>>>> All that's needed to achieve such resilience for teleport at the
>>>> protocol level is a slight revision of operation phasing to permit
>>>> greater execution overlap, as I outlined.
>>>> This is independent of
>>>> anything else that happens in the course of the overall teleport
>>>> operation --- the change of phasing would affect only the transfer
>>>> of /agent location/ alone, nothing else.
>>>>
>>>> In particular, it should not be confused with the separate
>>>> requirement for instantiation of assets or objects at destination,
>>>> nor with the matter of serializing and deserializing script
>>>> states. The latter has not even been defined for VWRAP, so it's
>>>> hard to talk about changing it. In any event, this isn't about
>>>> those aspects of teleport, and doesn't affect them --- they would
>>>> continue to work as before.
>>>>
>>>> One of the central aspects of VWRAP is that the protocol is based
>>>> on a multiple services model, and one of the key approaches in
>>>> highly scalable systems design is to keep services decoupled to
>>>> the largest extent possible. That's what I'm proposing here, a
>>>> /partial decoupling/ that has no normal semantic change but which
>>>> does have benefits in anomalous situations.
>>>>
>>>> Agent location change /can/ be decoupled significantly from asset
>>>> instantiation change and script state transfer. My suggestion
>>>> referred to this decoupled /agent location change/ only, not to
>>>> asset and simulation services. Those other two services undergo
>>>> state transitions at the same time as change of agent location
>>>> does on TP, but services should never be coupled together
>>>> unnecessarily, and in this case the coupling can be left very
>>>> weak. The three types of service operations can proceed each at
>>>> their own independent rates, coupled at TP initiation time and
>>>> nowhere else.
>>>>
>>>> It should be noted that the legacy protocols do some of this
>>>> already, in that the agent is already active in the destination
>>>> region long before her avatar or objects have appeared.
>>>> Furthermore, the avatar currently continues to be visible in the
>>>> source region for a while after the agent becomes active in the
>>>> destination region, because of normal operation latencies,
>>>> sim-side queueing, and client lag. This is a normal part of
>>>> current operation, and is not considered an anomaly. What's
>>>> important is that no new state change to the agent is possible in
>>>> the source region after TP is initiated, and that would remain true.
>>>>
>>>> The impact of this on the other parts of the puzzle needs to wait
>>>> until those other parts are examined. We're not there yet, but I
>>>> would hope that improving teleport protocol resilience would be a
>>>> desirable goal when the only noticeable change in semantics
>>>> occurs under fault conditions and provides a major improvement on
>>>> current behaviour.
>>>>
>>>>
>>>> Morgaine.
>>>>
>>>> ======================================
>>>>
>>>> On Tue, Oct 13, 2009 at 6:13 AM, Morgaine
>>>> <morgaine.dinova@googlemail.com
>>>> <mailto:morgaine.dinova@googlemail.com>> wrote:
>>>>
>>>> One of the advantages we have in developing the VWRAP protocols is
>>>> that we are able to look back at legacy SL and Opensim protocols
>>>> and recognize design mistakes or limitations in them. This allows
>>>> us to avoid repeating such mistakes or limitations in the next
>>>> generation of systems.
>>>>
>>>> One of the most common sources of frustration and dissatisfaction
>>>> is simulator non-responsiveness. While this has many possible
>>>> causes, in VWRAP we are not interested in the internal
>>>> implementation of simulators, but we *ARE* interested in the
>>>> ability of a protocol endpoint to perform its duty within the
>>>> protocol. A jammed simulator host is in many cases quite unable
>>>> to perform its protocol duties, or in some cases only exceedingly
>>>> slowly, often timing out in a TP for example.
>>>> We have a huge
>>>> amount of experience of this happening in both SL and Opensim, so
>>>> it is a practical reality. On occasion, simulators will be unable
>>>> to fulfil their part in a protocol, and this needs to be taken
>>>> into account because it is /not uncommon/.
>>>>
>>>> One key area in which the above is relevant is in teleports *OUT*
>>>> of a simulator that is under distress. Quite often users wish
>>>> nothing more than to /leave/ the region being run by a dying
>>>> simulator, but when teleport-out requires cooperation from the
>>>> host that one is trying to leave then this is often not possible
>>>> at all. In this situation, the only remedy in existing systems is
>>>> to forcibly terminate the client and relog in another region. We
>>>> should avoid such out-of-protocol remedies being necessary through
>>>> good protocol design.
>>>>
>>>> In VWRAP, we have both Rez Avatar and Derez Avatar capabilities,
>>>> which lead to corresponding protocol operations during teleport.
>>>> If R1 is a region being run by a non-responsive simulator from
>>>> which we want to escape, and R2 is another region to which we wish
>>>> to go, if the protocol requires a Derez in R1 to be completed
>>>> before a Rez in R2 can commence then the user will have
>>>> difficulties. Clearly we don't want this.
>>>>
>>>> In http://tools.ietf.org/html/draft-hamrick-ogp-intro-00 , it is
>>>> made clear that "/The agent domain MUST also remove the avatar
>>>> from it's current location *before* placing the avatar in the
>>>> destination location/." This suggests that the protocol will be
>>>> sensitive to R1 non-responsiveness. While we do not yet have an
>>>> actual VWRAP Teleport draft, it seems likely that its initial
>>>> incarnation will have that same problem built in.
>>>>
>>>> I suggest that the protocol define Derez and Rez as */concurrent/*
>>>> and */non-dependent/* operations to avoid this situation.
>>>> The AD
>>>> can mark R1 as disabled for all further agent state changes ---
>>>> this will provide all the protection needed to prevent brief
>>>> double-presence anomalies from being significant. If a jammed R1
>>>> refuses to give up its hold on the avatar, then at least the user
>>>> will not suffer from it. Reaping dead simulator sessions then
>>>> becomes a problem for the region operator alone, and not for the
>>>> AD, the user, and the region as happens now.
>>>>
>>> Beware of one-size-fits-all requirements. I can easily imagine there
>>> will be users who wouldn't want to have a degraded/alternate/unexpected
>>> experience on TP failure and who WOULD want the system just to reset. And no
>>> doubt other scenarios will make sense as well. After all, if someone uses
>>> VWRAP as the foundation for a gaming platform, they may have completely
>>> different rules that they expect users participating to follow as long as
>>> they are participating.
>>>
>>>
>>> Lawson
>>>
>>
>> _______________________________________________
>> ogpx mailing list
>> ogpx@ietf.org
>> https://www.ietf.org/mailman/listinfo/ogpx
>>
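
For readers who prefer code to prose, the phasing proposed in this thread (Derez and Rez started concurrently, R1 marked as disabled for further agent state changes, late events from R1 discarded, client notified on source timeout) might be sketched roughly as follows. This is only an illustration under stated assumptions, not anything taken from an OGP or VWRAP draft: the names Region, AgentDomain, teleport, accept_event, derez_timeout and the "source-region-timeout" notice are all hypothetical, and a real AD would invoke capabilities over the network rather than call local coroutines.

```python
import asyncio
from dataclasses import dataclass, field


class Region:
    """Hypothetical region service; responsive=False models a jammed simulator."""

    def __init__(self, name, responsive=True):
        self.name = name
        self.responsive = responsive

    async def derez(self, agent_id):
        if not self.responsive:
            await asyncio.sleep(3600)  # never answers within a sane timeout
        return True

    async def rez(self, agent_id):
        return True


@dataclass
class AgentDomain:
    """Hypothetical AD holding the single authoritative agent location."""

    derez_timeout: float = 5.0
    location: dict = field(default_factory=dict)  # agent_id -> region name
    disabled: set = field(default_factory=set)    # (agent_id, region name)
    notices: list = field(default_factory=list)   # events sent to the client

    async def teleport(self, agent_id, src, dst):
        # Once the AD accepts the TP, the source region can no longer
        # change agent state, whether or not it ever processes the Derez.
        self.disabled.add((agent_id, src.name))
        self.disabled.discard((agent_id, dst.name))

        # Derez and Rez start together; Rez does not wait for Derez.
        derez = asyncio.create_task(src.derez(agent_id))
        await dst.rez(agent_id)
        self.location[agent_id] = dst.name

        try:
            await asyncio.wait_for(derez, timeout=self.derez_timeout)
        except asyncio.TimeoutError:
            # Source region did not respond: notify the client, which is
            # free to apply its own policy (for example, disconnect).
            self.notices.append((agent_id, "source-region-timeout", src.name))

    def accept_event(self, agent_id, region_name):
        """Accept agent events only from the region holding the active agent."""
        if (agent_id, region_name) in self.disabled:
            return False
        return self.location.get(agent_id) == region_name
```

In this sketch the agent location moves to R2 as soon as the Rez succeeds, so a jammed R1 only delays the timeout notice and never blocks the user. Whether a destination region or an AD would refuse such a teleport is a policy matter that the sketch deliberately leaves out.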