[rtcweb] Review of draft-ietf-rtcweb-use-cases-and-requirements-04.txt

Dan Burnett <dburnett@voxeo.com> Thu, 08 September 2011 09:16 UTC

To: public-webrtc@w3.org
Cc: rtcweb@ietf.org

A while ago I promised a full read-through review of the use cases and requirements document [1], primarily from the API perspective, but I have included other comments as well.
The comments follow the order of the document.  Some are editorial, and some are more substantive.

Section 4:  As a general comment, the use cases occasionally stray into implementation rather than being worded purely in terms of user needs.  This has driven some of my wording change suggestions below.

The wording is a bit unclear as to whether this use case covers only a single peer-to-peer connection or multiple connections.  In particular, it points out that for a session there is a self-view and a remote view, but it's not clear at that point whether there might be *multiple* remote views simultaneously in the session.  However, later on in this use case it states that "Any session participant can end the session at any time."  Then there are what appear to be examples of different users, but it is not clear whether each of these kinds of users only needs to be supported singly, or whether it must be possible to support communication with all of them simultaneously.

Since there is a separate use case for multiparty video communication (4.2.7), I believe this use case should be cleaned up a bit.  I suggest the following text for this use case:

Two or more users have loaded a video communication web application into their browsers, provided by the same service provider, and logged into the service it provides.  The web service publishes information about user login status by pushing updates to the web application in the browsers.  When one online user selects a peer online user, a 1-1 video communication session between the browsers of the two peers is initiated.  The invited user might accept or reject the session.

During session establishment a self-view is displayed, and once the session has been established the video sent from the remote peer is displayed in addition to the self-view.  During the session, each user can select to remove and re-insert the self-view as often as desired.  Each user can also change the sizes of his/her two video displays during the session.  Each user can also pause sending of media (audio, video, or both) and mute incoming media.

It is essential that the communication cannot be eavesdropped.

Either session participant can end the session at any time.

The two users may be using communication devices of different makes, with different operating systems and browsers from different vendors.

One user has an unreliable Internet connection that sometimes loses packets and sometimes goes down completely.

One user is located behind a Network Address Translator (NAT).
******  I recommend some minor editorial changes, so that the second paragraph reads:

The communication device used by one of the users has several network adapters (Ethernet, WiFi, Cellular).  The communication device is accessing the Internet using Ethernet, but the user has to start a trip during the session.  The communication device automatically changes to use WiFi when the Ethernet cable is removed and then moves to cellular access to the Internet when moving out of WiFi coverage.  The session continues even though the access method changes.
******  "previos" -> "previous".  Also, the first use of "QoS" should define the term, as in "Quality of Service (QoS)".
Actually, QoS is more a derived functional requirement than a use case, especially when that specific term is used anywhere near IETF folks.  If what the user wants is that the call continues at the best available quality (to the possible detriment of other users of the same cell/DSL/whatever), we should say so.  It may be that the best way to do this lies in the codec or protocol rather than in existing QoS methods.

We should clarify that *in this use case*, the service providers are choosing to exchange no more information about the users than what can be carried using SIP.  In other words, this is not suggesting that all RTCWeb/WebRTC web application service providers must restrict themselves to exchanging only information that can be carried via SIP (whatever SIP means in this situation).  For example, in general the interoperability of sites could be done through any IM protocol, e.g., combined with, say, OAuth for identity.  We should not be mandating or preferring (even by implication) any specific protocol.  If websites choose to export presence and identity to support interoperability, that is up to them and does not necessarily require that the RTCWeb API provide such a mechanism.
I almost think that this implies a new, more precise requirement that the Web API MUST NOT prevent two webapps that happen to choose to peer with SIP from peering.  That makes clearer what our baseline minimum is without restricting the peering mechanisms of all webapps.  I say "clearer" rather than "clear" because "peer with SIP" is itself not very precise, but I still think it's better than what we have now.

I suggest a minor rewording of
"Each web service publishes information about user login status for users that have a relationship with the other user; how this is established is out of scope."
  to something more concrete, e.g.,
"For each user Alice who has authorized another user Bob to receive login status information, Alice's service publishes Alice's login status information to Bob.  How this authorization is defined and established is out of scope."

"thumbnail ot" -> "thumbnail of".  "can not" -> "cannot".

"simple video communication service" needs to reference 4.2.1.

The description should begin with "This use case is based on the previous one."  Also, "can not" -> "cannot" and "sound of the tank, that file" -> "sound of the tank; that file".
More substantially, the note in this section strongly suggests that the WebRTC/RTCWeb groups must be responsible for the mixing of sound objects with streams before rendering.  It might be clearer to state that our group's work MUST NOT prevent this and in fact should work with other groups' definitions of HTML5 audio rendering.

"mobile phone used" -> "mobile phone is used".  "can not" -> "cannot".
This use case is underspecified.  What does it mean for a user to "place and receive calls in the same way as when using a normal mobile phone"?  My mobile phone vibrates when I receive a call, and I can dial it by pressing and holding a digit on the keypad.  I don't even have a SIP softphone on my desktop that can do either one.

The login must also allow the user to manage their account, pay bills, add services, etc.  More interestingly, it should be possible to write a portal web app that, once the user is logged in, does not require the user to submit an additional set of credentials to access the phone functionality.

I don't believe this use case goes far enough.  The phone experience should be sufficiently embedded in the page that the user's context can be passed with the call, possibly resulting in a deep dial into an IVR tree or a customer service representative not having to ask questions that the user has already answered at the website level.  The key here is that we should be aspiring to a user experience that is *better* than that of a PSTN call, not just equivalent.

"can not" -> "cannot".  "All participant are authenticated" -> "All participants are authenticated".  "There exists several" -> "There exist several".  "one low resolution stream, the" -> "one low resolution stream, and the".  "c) each browser" -> "or c) each browser".  "just an high" -> "just a high".  "reslution" -> "resolution".
Also, we should probably note in this use case that the spatialization could happen not only as part of the server-side mixing but also by having the server tag the stream with spatialization info and having the browser render it.

F2:  "in presence of" -> "in the presence of"

F5:  ditto

F8:  "any more" -> "anymore"

F15:  I think this is venturing out of scope.  Perhaps a better phrasing is "The webrtc browser component MUST interoperate with other HTML5 methods for processing and mixing sound objects (media that is retrieved from another source than the established media stream(s) with the peer(s) with audio streams)."

F18:  While support for a minimum common codec is important, requiring it to be commonly supported by existing legacy telephony services is technically only a nice-to-have feature.  One might consider gsm610 as an alternative, for example.

F19:  The first letter needs to be capitalized.

F24:  "carried in SIP" is not sufficiently precise.  More clarity here might improve some of the discussions we are currently having.

F26:  "in presence of" -> "in the presence of"

General comment about all of the API requirements in section 5.3:  they are not written as API requirements, but as *web application* requirements.  Since many of the requirements on the web application could be met through means other than the WebRTC API, it is easy for people to agree with the requirement but strongly disagree on whether the API needs to be the *mechanism* by which the requirement is satisfied.  Although I have not reworded all of the requirements below, I think it would be much clearer if we only wrote the requirements that the Web API itself must satisfy as "The Web API MUST ...".  For example, "The Web API MUST inform a web application when a stream from a peer is no longer received."
I suspect that this will help make clear where we disagree on which requirements must be addressed by the Web API itself and which must merely not be prevented by the Web API (and thus could be satisfied external to the WebRTC API).
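
To make the shape of such an API-level requirement concrete, here is a minimal TypeScript sketch of "inform a web application when a stream from a peer is no longer received."  It is an illustration only: `PeerStreamMonitor`, `onStreamEnded`, and the timeout-based detection are hypothetical and are not taken from the draft or from any existing browser API.

```typescript
// Hypothetical sketch: none of these names come from the draft or a real API.
type StreamEndedListener = (streamId: string) => void;

class PeerStreamMonitor {
  private readonly timeoutMs: number;
  private readonly listeners: StreamEndedListener[] = [];
  private readonly lastSeen = new Map<string, number>();

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // The web application registers here to be informed about ended streams.
  onStreamEnded(listener: StreamEndedListener): void {
    this.listeners.push(listener);
  }

  // Called whenever media for a stream arrives; `now` is in milliseconds.
  mediaReceived(streamId: string, now: number): void {
    this.lastSeen.set(streamId, now);
  }

  // Reports each stream that has been silent for longer than the timeout,
  // exactly once, and then forgets about it.
  check(now: number): void {
    const ended: string[] = [];
    this.lastSeen.forEach((seen: number, streamId: string) => {
      if (now - seen > this.timeoutMs) ended.push(streamId);
    });
    for (const streamId of ended) {
      this.lastSeen.delete(streamId);
      for (const listener of this.listeners) listener(streamId);
    }
  }
}
```

The point of the sketch is only that the notification is delivered *by the API* to the application; how the browser decides that a stream is gone is a separate question and is assumed here to be a simple silence timeout.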

A8 and A10:  It would be good to clarify here somewhere what the difference is between pause/resume and cease/start for a stream.

A14:  As written this is not entirely in scope.  Perhaps the following phrasing would be more accurate?

"The Web API MUST NOT prevent panning, mixing, and other processing for individual streams."

A15:  This requirement is too specific in terms of how identifiers are shared.  Would the following perhaps be more accurate?

"For each stream, the Web API MUST provide to both parties of the communication an identifier for the stream that is a) the same at both ends, b) serializable, and c) unique relative to all other stream identifiers in use by either party."

The word "serializable" is not exactly correct, but the idea I'm trying to convey is that the identifier can safely be passed from one party to the other and back again, via WebRTC calls or otherwise.
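
As an illustration of the three properties, the following TypeScript sketch shows one possible identifier scheme.  The names (`StreamId`, `StreamIdAllocator`, `encodeStreamId`, `decodeStreamId`) and the origin-plus-counter format are assumptions made for this example, not something the draft specifies; uniqueness here also assumes that the two parties use distinct origin labels.

```typescript
// Hypothetical sketch; the identifier format is an assumption, not from the draft.
interface StreamId {
  readonly origin: string;  // label of the party that created the stream
  readonly counter: number; // per-origin sequence number
}

class StreamIdAllocator {
  private readonly origin: string;
  private next = 0;

  constructor(origin: string) {
    this.origin = origin;
  }

  // (c) unique: distinct origins plus a per-origin counter never collide.
  allocate(): StreamId {
    const id: StreamId = { origin: this.origin, counter: this.next };
    this.next += 1;
    return id;
  }
}

// (b) "serializable": the identifier survives a round trip through a string,
// so it can be passed to the other party and back by any means.
function encodeStreamId(id: StreamId): string {
  return JSON.stringify([id.origin, id.counter]);
}

function decodeStreamId(text: string): StreamId {
  const parsed: unknown = JSON.parse(text);
  if (!Array.isArray(parsed) || parsed.length !== 2 ||
      typeof parsed[0] !== "string" || typeof parsed[1] !== "number") {
    throw new Error("not a stream identifier");
  }
  return { origin: parsed[0], counter: parsed[1] };
}
```

Property (a), "the same at both ends," follows if both parties refer to a stream by the encoded string rather than by a local handle.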

A16:  A minor nit here -- we probably should not use the word "datagram" at this stage because of its implementation implications.  What about "In addition to the streams listed elsewhere, the Web API MUST provide a mechanism for sending and receiving isolated discrete chunks of data."
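
A minimal TypeScript sketch of what "isolated discrete chunks of data" could look like from the application's side follows.  `ChunkChannelEnd` and `connectLoopback` are hypothetical names, and the in-memory, synchronous loopback stands in for whatever transport an implementation would actually use; nothing here is taken from the draft.

```typescript
// Hypothetical sketch; these names are not from the draft or a real API.
type ChunkHandler = (chunk: Uint8Array) => void;

class ChunkChannelEnd {
  private handler: ChunkHandler | null = null;
  private peer: ChunkChannelEnd | null = null;

  attach(peer: ChunkChannelEnd): void {
    this.peer = peer;
  }

  onChunk(handler: ChunkHandler): void {
    this.handler = handler;
  }

  // Each call delivers exactly one chunk; chunks are never merged or split.
  send(chunk: Uint8Array): void {
    if (this.peer === null) throw new Error("not connected");
    // Copy so that later changes by the sender do not affect the receiver.
    this.peer.deliver(chunk.slice());
  }

  private deliver(chunk: Uint8Array): void {
    if (this.handler !== null) this.handler(chunk);
  }
}

// Builds two connected ends in memory, for illustration only.
function connectLoopback(): [ChunkChannelEnd, ChunkChannelEnd] {
  const a = new ChunkChannelEnd();
  const b = new ChunkChannelEnd();
  a.attach(b);
  b.attach(a);
  return [a, b];
}
```

The interface says nothing about ordering, reliability, or size limits, which is the sense in which it avoids the implementation implications of "datagram."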

A17:  Another minor nit -- presumably this only applies when the signal is audio.  Maybe we could reword as "For streams of type audio, it MUST be possible for the web application author to indicate, via the Web API, when the stream is speech."
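
The reworded requirement could be sketched in TypeScript as follows.  `StreamKind`, `StreamDescription`, and `describeStream` are hypothetical names chosen for this example; the draft does not define how the hint would be expressed.

```typescript
// Hypothetical sketch; these names are not from the draft or a real API.
type StreamKind = "audio" | "video";

interface StreamDescription {
  readonly kind: StreamKind;
  readonly isSpeech: boolean;
}

// The speech hint is optional and is only accepted for audio streams.
function describeStream(kind: StreamKind, isSpeech: boolean = false): StreamDescription {
  if (isSpeech && kind !== "audio") {
    throw new Error("the speech hint applies only to audio streams");
  }
  return { kind: kind, isSpeech: isSpeech };
}
```

Rejecting the hint on non-audio streams is one design choice; silently ignoring it would satisfy the reworded requirement equally well.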

7.2:  All but the last paragraph here should be written as requirements in section 5.2, not in the security considerations section.  They should be primary requirements for implementations, not security afterthoughts.
Additionally, I think we should be more explicit about consent revision to include revocation, i.e., "The browser is expected to provide mechanisms for users to revise and even completely revoke consent to use device resources such as cameras and microphones."
Along the same lines, I believe we also discussed at the WebRTC meeting in Quebec that the browser should provide a user-visible security indicator (such as a padlock) indicating the encryption level of the session.  Maybe this should be a requirement?
Also, "browser is needs" -> "browser needs".
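
As a rough illustration of "revise and even completely revoke consent," the following TypeScript sketch models consent as a per-origin set of devices.  `ConsentStore`, `CaptureDevice`, and the per-origin granularity are assumptions made for this example; the draft does not prescribe any particular model or user interface.

```typescript
// Hypothetical sketch; these names are not from the draft or a real API.
type CaptureDevice = "camera" | "microphone";

class ConsentStore {
  private readonly granted = new Map<string, Set<CaptureDevice>>();

  grant(site: string, device: CaptureDevice): void {
    let devices = this.granted.get(site);
    if (devices === undefined) {
      devices = new Set<CaptureDevice>();
      this.granted.set(site, devices);
    }
    devices.add(device);
  }

  // Revising consent: withdraw a single device while keeping the others.
  revoke(site: string, device: CaptureDevice): void {
    const devices = this.granted.get(site);
    if (devices !== undefined) devices.delete(device);
  }

  // Completely revoking consent for a site.
  revokeAll(site: string): void {
    this.granted.delete(site);
  }

  isAllowed(site: string, device: CaptureDevice): boolean {
    const devices = this.granted.get(site);
    return devices !== undefined && devices.has(device);
  }
}
```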

7.3:  This should be a requirement in section 5.3.


-- dan

[1] http://www.ietf.org/id/draft-ietf-rtcweb-use-cases-and-requirements-04.txt