Re: [ogpx] Teleports and protocol resilience

Vaughn Deluca <vaughn.deluca@gmail.com> Sat, 24 October 2009 13:59 UTC

In-Reply-To: <e0b04bba0910240308w74772eefoa7e0b2ebb34e5d4a@mail.gmail.com>
References: <e0b04bba0910122213n66886b92x57446ad84def466f@mail.gmail.com> <e0b04bba0910230947y5b756bb0uee30c1b37d397d21@mail.gmail.com> <OF1FEFE8EB.C2DA4EA2-ON85257658.005C9A89-85257658.0063C6C5@us.ibm.com> <e0b04bba0910232045t30e36d69g1ff04a50e899896@mail.gmail.com> <4AE29CDF.1070208@cox.net> <e0b04bba0910240308w74772eefoa7e0b2ebb34e5d4a@mail.gmail.com>
Date: Sat, 24 Oct 2009 15:59:18 +0200
Message-ID: <9b8a8de40910240659h649d0c35kab5706b5a2cd6caf@mail.gmail.com>
From: Vaughn Deluca <vaughn.deluca@gmail.com>
To: Morgaine <morgaine.dinova@googlemail.com>
Content-Type: multipart/alternative; boundary=000325559f3e19f9d90476aebc3b
Cc: ogpx@ietf.org
Subject: Re: [ogpx] Teleports and protocol resilience

Morgaine,

If I understand you right, you are suggesting the following:

- The agent domain is able to complete a teleport (after a timeout period)
even if the original region completely fails to respond, that is, *even* if
there is no guarantee that the invocation of the derez cap has been processed.

- To prevent true duplication of avatars, the agent domain keeps some record
of the old connection, to be able to deal gracefully with the case that the
old region comes back to life. This implies that in some cases two copies of
the avatar might be visible, one "dead body" in the old region, and the live
version in the new region. This might in some cases be preferable to a
stranded avatar, yet in other situations could be unacceptable (e.g. in a
combat game, where mere visual presence might be enough to influence the
game).

- Therefore, in spite of the freedom of the agent domain to complete a TP
from a non-responding region, it's left up to the destination region to
decide if it wants to accept the rezzing of an avatar that has not completed
the de-rezzing in the old region.

- Since there is no way to get the most recent agent state from the old
region, this special TP without confirmation from the originating region
would bring back the agent in some default state in the new region. This
process would lead to a graceful degradation of the system. With some clever
setting of defaults it might not be too disturbing for the avatar. It is
left up to the client to instruct the agent service about the way to deal
with these types of anomalies.

- Finally, your suggestions imply that the agent service MAY act as an
intermediary in the TP, rather than the two regions communicating directly
as the current TP draft specifies.
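
To check my own understanding, here is a minimal sketch in Python of the flow
in the points above. It is only an illustration under my own assumptions, not
anything taken from a draft: the class names (Region, AgentDomain), the method
names, the result strings, and the idea of modelling a derez timeout as a
None return value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Region:
    """Hypothetical stand-in for a region service (not from any draft)."""
    name: str
    responsive: bool = True
    accepts_unconfirmed_rez: bool = True  # assumed per-region policy switch
    avatars: Set[str] = field(default_factory=set)

    def derez(self, agent: str) -> Optional[dict]:
        # Returning None models "no answer within the timeout period".
        if not self.responsive:
            return None
        self.avatars.discard(agent)
        return {"agent": agent, "state": "latest"}

    def rez(self, agent: str, state: dict, derez_confirmed: bool) -> bool:
        # The destination region decides whether to accept an avatar whose
        # derez in the old region was never confirmed.
        if not derez_confirmed and not self.accepts_unconfirmed_rez:
            return False
        self.avatars.add(agent)
        return True


@dataclass
class AgentDomain:
    """Hypothetical agent domain acting as intermediary in the teleport."""
    default_state: dict = field(default_factory=lambda: {"state": "default"})
    # agent -> names of old regions that may still show a "dead body"
    stale: Dict[str, Set[str]] = field(default_factory=dict)

    def teleport(self, agent: str, src: Region, dst: Region) -> str:
        state = src.derez(agent)
        confirmed = state is not None
        if not confirmed:
            # No recent state is available, so fall back to defaults.
            state = dict(self.default_state, agent=agent)
        if not dst.rez(agent, state, confirmed):
            return "refused"
        if not confirmed:
            # Remember the old connection so it can be cleaned up later.
            self.stale.setdefault(agent, set()).add(src.name)
            return "completed-degraded"
        return "completed"

    def region_recovered(self, agent: str, region: Region) -> bool:
        # Called when an old region comes back to life: retry the derez.
        if region.name not in self.stale.get(agent, set()):
            return False
        if region.derez(agent) is None:
            return False  # still unreachable, keep the record
        self.stale[agent].discard(region.name)
        return True
```

In this sketch the avatar is briefly present in both regions after a degraded
teleport, and the agent domain's record is what lets it remove the "dead body"
once the old region answers again. A region with accepts_unconfirmed_rez set
to False simply refuses, which leaves the avatar stranded as it is today.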

At first sight it all looks reasonably realistic, yet it is for sure more
complicated than the current TP draft, and I am not really able to judge
what we are actually bringing upon ourselves here.

- Vaughn


On Sat, Oct 24, 2009 at 12:08 PM, Morgaine
<morgaine.dinova@googlemail.com> wrote:

> Indeed!   Not only are extra user options always a good thing, but a TP
> from a region in one world could be to a region in a highly different world
> with very specific requirements.  As you point out, "one size fits all" is
> definitely inappropriate.
>
> In this particular case, that could be handled by the AD sending the client
> an event notification to the effect that the source region did not respond
> within the timeout period.
>
> The client would then be free to implement disconnection if this is the
> user's configured choice.  I think it's fair to expect that most people
> would not choose that option, but it should certainly be available to them
> if needed.
>
> The agent would probably already be located in the destination region by
> this time from the AD's perspective (depending on timeout length), but not
> necessarily from the client's perspective --- that's a matter of client
> implementation and user options.  Some Second Life clients already implement
> "Don't blank screen on TP", "Don't show TP progress bar" and "Display avatar
> before wearables are available" as options, so they're already handling the
> component parts of the overall TP operation quite flexibly.
>
> The AD of a world may itself have a policy of not allowing TPs to complete
> when the source region is unavailable, but this should not be hardwired into
> the protocol.  As you say, one size does not fit all, particularly in the
> case of choices that provide a bad user experience.
>
>
> Morgaine.
>
>
>
>
>
>
>
> ===========================================
>
>
> On Sat, Oct 24, 2009 at 7:21 AM, Lawson English <lenglish5@cox.net> wrote:
>
>> Morgaine wrote:
>>
>>> Excellent observation, David!
>>>
>>> To use a different form of words, fault conditions can occur in all the
>>> possible states of the system, and if we want the protocol to be robust then
>>> we'll probably have to examine all its major states and ensure that there
>>> are always state transitions available out of them.
>>>
>>> In that light, being able to TP out of a stuck region is just one of
>>> those recovery transitions without which the protocol would simply break as
>>> a result of the exception.  There is of course no need to design VWRAP such
>>> that exception handling always ends in session termination --- this would
>>> result in a very poor user experience by default, and so it should be
>>> avoided wherever possible.
>>>
>>> My suggestion is therefore likely to be just one of many designed to
>>> avoid session termination under conditions that, while not normal, are
>>> nevertheless unfortunately rather common in practice.  The user's client
>>> application should never be forced to disconnect completely (or even worse
>>> to terminate) unless there is absolutely no alternative, and it's up to us
>>> to provide such alternatives wherever we can.
>>>
>>> Breaking the user's immersive experience through termination should only
>>> be a solution of last resort.
>>>
>>>
>>> Morgaine.
>>>
>>>
>>>
>>>
>>>
>>> ====================================
>>>
>>> On Fri, Oct 23, 2009 at 7:09 PM, David W Levine <dwl@us.ibm.com> wrote:
>>>
>>>
>>>    This is a good point, and highlights a very tricky issue in any
>>>    distributed system design. A protocol should behave gracefully in
>>>    the face of failure. One of the definite challenges here is
>>>    partition vs. failure. It's impossible to distinguish between a
>>>    failed service and one which is lost behind a network partition.
>>>    In the case of a partition, with an avatar continuing to run on
>>>    a momentarily isolated region, care needs to be taken to recover
>>>    gracefully not only from the "can't get to the region service to
>>>    get it to release the agent" case, but also from the "the region
>>>    has returned to contact" case. So, I think when we talk about
>>>    being resilient we need to think about the range of failure, from
>>>    "the service I was using has crashed" to "the service I was using
>>>    has become invisible due to a network partition." Of course, they
>>>    look identical. This rather strongly implies good semantics on
>>>    what happens to a "stub" session which becomes isolated, as well
>>>    as to the "new" session which we create by bypassing a failed
>>>    service. (In the case of teleport, that means the old region's
>>>    idea of agent state and the new one's; in other cases, possibly
>>>    other isolated state.)
>>>    Food for thought.
>>>
>>>    - David
>>>    ~ Zha
>>>
>>>
>>>
>>>    On 10/23/2009 12:47 PM, Morgaine <morgaine.dinova@googlemail.com>
>>>    wrote (sent by ogpx-bounces@ietf.org, to ogpx@ietf.org, subject
>>>    "Re: [ogpx] Teleports and protocol resilience"):
>>>
>>>    Looking back at the replies in this thread, I think that the goal
>>>    and the means to achieve it didn't quite come across.
>>>
>>>    I was trying to address only a very specific issue, just protocol
>>>    resilience under source region non-responsiveness, since this is
>>>    common enough that it merits addressing.  I did not suggest that
>>>    there be any perceivable change of teleport semantics under normal
>>>    operation (because no such change is needed), only a change in
>>>    service coupling.  The semantics we experience in SL and in
>>>    Opensim would remain completely unchanged, except in the single
>>>    case of source region non-responsiveness.  Under this single
>>>    anomalous case there */would/* be a perceivable change, but that
>>>    change would be a huge improvement.
>>>
>>>    There would be no new decoherence introduced since exactly the
>>>    same state changes would occur on TP as before, with no possibility
>>>    of agent state change in the source region once the AD accepts the TP.
>>>
>>>    All that's needed to achieve such resilience for teleport at the
>>>    protocol level is a slight revision of operation phasing to permit
>>>    greater execution overlap, as I outlined.  This is independent of
>>>    anything else that happens in the course of the overall teleport
>>>    operation --- the change of phasing would affect only the transfer
>>>    of /agent location/ alone, nothing else.
>>>
>>>    In particular, it should not be confused with the separate
>>>    requirement for instantiation of assets or objects at destination,
>>>    nor with the matter of serializing and deserializing script
>>>    states.  The latter has not even been defined for VWRAP, so it's
>>>    hard to talk about changing it.  In any event, this isn't about
>>>    those aspects of teleport, and doesn't affect them --- they would
>>>    continue to work as before.
>>>
>>>    One of the central aspects of VWRAP is that the protocol is based
>>>    on a multiple services model, and one of the key approaches in
>>>    highly scalable systems design is to keep services decoupled to
>>>    the largest extent possible.  That's what I'm proposing here, a
>>>    /partial decoupling/ that has no normal semantic change but which
>>>    does have benefits in anomalous situations.
>>>
>>>    Agent location change /can/ be decoupled significantly from asset
>>>    instantiation change and script state transfer.  My suggestion
>>>    referred to this decoupled /agent location change/ only, not to
>>>    asset and simulation services.  Those other two services undergo
>>>    state transitions at the same time as change of agent location
>>>    does on TP, but services should never be coupled together
>>>    unnecessarily, and in this case the coupling can be left very
>>>    weak.  The three types of service operations can proceed each at
>>>    their own independent rates, coupled at TP initiation time and
>>>    nowhere else.
>>>
>>>    It should be noted that the legacy protocols do some of this
>>>    already, in that the agent is already active in the destination
>>>    region long before her avatar or objects have appeared.
>>>    Furthermore, the avatar currently continues to be visible in the
>>>    source region for a while after the agent becomes active in the
>>>    destination region, because of normal operation latencies,
>>>    sim-side queueing, and client lag.  This is a normal part of
>>>    current operation, and is not considered an anomaly. What's
>>>    important is that no new state change to the agent is possible in
>>>    the source region after TP is initiated, and that would remain true.
>>>
>>>    The impact of this on the other parts of the puzzle needs to wait
>>>    until those other parts are examined.  We're not there yet, but I
>>>    would hope that improving teleport protocol resilience would be a
>>>    desirable goal when the only noticeable change in semantics
>>>    occurs under fault conditions and provides a major improvement on
>>>    current behaviour.
>>>
>>>
>>>    Morgaine.
>>>
>>>
>>>
>>>
>>>
>>>
>>>    ======================================
>>>
>>>    On Tue, Oct 13, 2009 at 6:13 AM, Morgaine
>>>    <morgaine.dinova@googlemail.com> wrote:
>>>
>>>    One of the advantages we have in developing the VWRAP protocols is
>>>    that we are able to look back at legacy SL and Opensim protocols
>>>    and recognize design mistakes or limitations in them.  This allows
>>>    us to avoid repeating such mistakes or limitations in the next
>>>    generation of systems.
>>>
>>>    One of the most common sources of frustration and dissatisfaction
>>>    is simulator non-responsiveness.  While this has many possible
>>>    causes, in VWRAP we are not interested in the internal
>>>    implementation of simulators, but we *ARE* interested in the
>>>    ability of a protocol endpoint to perform its duty within the
>>>    protocol.  A jammed simulator host is in many cases quite unable
>>>    to perform its protocol duties, or in some cases only exceedingly
>>>    slowly, often timing out in a TP for example.  We have a huge
>>>    amount of experience of this happening in both SL and Opensim, so
>>>    it is a practical reality.  On occasion, simulators will be unable
>>>    to fulfil their part in a protocol, and this needs to be taken
>>>    into account because it is /not uncommon/.
>>>
>>>    One key area in which the above is relevant is in teleports *OUT*
>>>    of a simulator that is under distress.  Quite often users wish
>>>    nothing more than to /leave/ the region being run by a dying
>>>    simulator, but when teleport-out requires cooperation from the
>>>    host that one is trying to leave then this is often not possible
>>>    at all.  In this situation, the only remedy in existing systems is
>>>    to forcibly terminate the client and relog in another region.  We
>>>    should avoid such out-of-protocol remedies being necessary through
>>>    good protocol design.
>>>
>>>    In VWRAP, we have both Rez Avatar and Derez Avatar capabilities,
>>>    which lead to corresponding protocol operations during teleport.
>>>    If R1 is a region being run by a non-responsive simulator from
>>>    which we want to escape, and R2 is another region to which we wish
>>>    to go, if the protocol requires a Derez in R1 to be completed
>>>    before a Rez in R2 can commence then the user will have
>>>    difficulties.  Clearly we don't want this.
>>>
>>>    In http://tools.ietf.org/html/draft-hamrick-ogp-intro-00 , it is
>>>    made clear that "/The agent domain MUST also remove the avatar
>>>    from it's current location *before* placing the avatar in the
>>>    destination location/."  This suggests that the protocol will be
>>>    sensitive to R1 non-responsiveness.  While we do not yet have an
>>>    actual VWRAP Teleport draft, it seems likely that its initial
>>>    incarnation will have that same problem built in.
>>>
>>>    I suggest that the protocol define Derez and Rez as */concurrent/*
>>>    and */non-dependent/* operations to avoid this situation.  The AD
>>>    can mark R1 as disabled for all further agent state changes ---
>>>    this will provide all the protection needed to prevent brief
>>>    double-presence anomalies from being significant.  If a jammed R1
>>>    refuses to give up its hold on the avatar, then at least the user
>>>    will not suffer from it.  Reaping dead simulator sessions then
>>>    becomes a problem for the region operator alone, and not for the
>>>    AD, the user, and the region as happens now.
>>>
>> Beware of one-size-fits-all requirements. I can easily imagine there
>> will be users who wouldn't want to have a degraded/alternate/unexpected
>> experience on TP failure and who WOULD want the system just to reset. And no
>> doubt other scenarios will make sense as well. After all, if someone uses
>> VWRAP as the foundation for a gaming platform, they may have completely
>> different rules that they expect users participating to follow as long as
>> they are participating.
>>
>>
>> Lawson
>>
>
>
>