Re: [Slim] Issue 43: How to know the modality of a language indication?

Gunnar Hellström <gunnar.hellstrom@omnitor.se> Mon, 16 October 2017 22:21 UTC

To: Bernard Aboba <bernard.aboba@gmail.com>
Cc: Paul Kyzivat <pkyzivat@alum.mit.edu>, slim@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/slim/qbszXwYxP66BOpjUWQo0W8YYlQo>

On 2017-10-16 at 01:21, Bernard Aboba wrote:
> Paul said:
>
> "- can the UA use this information to change how to render the media?"
>
> [BA]  If the video is used for signing, an application might infer an 
> encoder preference for frame rate over resolution (e.g. in WebRTC, 
> RTCRtpParameters.degradationPreference = "maintain-framerate" )
<GH>Right, that is a valid example of how real "knowledge" of the 
modality can be used by the application.


And, in response to issue #43,

A simple way is to say:

Video media descriptions shall only contain sign language tags.
Audio media descriptions shall only contain language tags for spoken 
language.
Text media descriptions shall only contain language tags for written 
language.
Use of language indications in other media descriptions, such as message 
and application, requires other specifications on how to assess the 
modality for non-signed languages.
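
The simple rule above could be sketched roughly as follows. This is only 
an illustration; the names DEFAULT_MODALITY and modality_for_media are 
hypothetical and appear nowhere in the draft.

```python
from typing import Optional

# Hypothetical table for the simple rule: one implied modality per media type.
DEFAULT_MODALITY = {
    "video": "signed",
    "audio": "spoken",
    "text": "written",
}


def modality_for_media(media_type: str) -> Optional[str]:
    """Return the modality implied by the media type.

    None means the simple rule does not define a modality (for example
    "message" or "application"), so another specification is needed.
    """
    return DEFAULT_MODALITY.get(media_type.lower())
```

With such a table the application never has to inspect the language tag 
itself. For media types outside the table it gets None and has to treat 
the modality as undefined.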

The current 5.4 does not mention our main problem with the language 
tags: they do not distinguish between use for spoken language and use 
for written language. We should have made better efforts to solve that 
problem long ago, but we have not.

5.4 can be modified to specify the simple limited case and the problems 
that block us from specifying other cases:


      5.4. Media and modality combination problems
      <https://tools.ietf.org/html/draft-ietf-slim-negotiating-human-language-17#section-5.4>




    The problem of indicating a language tag for the view of a speaking person in a video stream is out of scope for this document.

    The problem of indicating a language tag for use of written language coded as a component in a video stream is out of scope for this document.

    The use of language tags for negotiation of languages in other media than audio, video and text is not defined in this document.

    Which language tags are signed and which are not can be deduced
    from the IANA language tag registry. How this is done is out of scope of this document.


--------------------------------------------------------


But if we want to allow more cases, we need to consider the following 
complications:


1. To assess whether a language tag represents a sign language, the 
application can look for the word "sign" in the description in the IANA 
language registry or a copy thereof, as Randall already indicated.

2. For written languages used as a text component in a video stream, it 
is possible to code this for languages requiring a script subtag, but 
not for languages with suppressed script subtags.

3. We have also discussed proposals for how to code written language in 
a video stream for languages not requiring a script subtag, but our 
proposals did not gain acceptance. So we need to say that this is 
currently undefined.

4. We also discussed how to code a view of a speaking person in video 
and said that it could be done by using the "definitively not written" 
script subtag on a non-signed language tag in video. But that was not 
appreciated by the language experts. Another option was to not allow 
written language overlaid on video, and that is the option used lately 
(up to version -16 or so).

5. For spoken language (talking and hearing), the only case we have is 
language tags in audio media. So that is easy to code and assess.

6. For written language in text media, a check can be made whether 
"sign" is part of the language tag description; if it is not, it is a 
written language.

7. For signed language in text media, a check can be made whether 
"sign" is part of the language tag description; if it is, it is a 
signed language in text notation (extremely unusual).

8. For use of language tags in media other than audio, video and text, 
a description of how to assess the modality is needed before they are 
used, especially for non-signed languages.
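
To make points 1, 2, 6 and 7 concrete, here is a rough sketch of such a 
check against a local copy of the registry. The sample records are an 
abbreviated excerpt in the registry's record format, and the function 
names (parse_registry, is_signed, has_suppressed_script) are my own 
assumptions for illustration, not anything defined by the draft.

```python
import re

# Abbreviated, illustrative excerpt in the registry's record format.
# A real application would load a full local copy of the registry instead.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: ase
Description: American Sign Language
%%
Type: language
Subtag: en
Description: English
Suppress-Script: Latn
%%
Type: language
Subtag: sv
Description: Swedish
Suppress-Script: Latn
"""


def parse_registry(text):
    """Map each primary language subtag to its fields (name -> list of values)."""
    records = {}
    for chunk in text.split("%%"):
        fields = {}
        for line in chunk.strip().splitlines():
            if line[:1].isspace():
                continue  # simplification: folded continuation lines are ignored
            name, sep, value = line.partition(": ")
            if sep:
                fields.setdefault(name, []).append(value.strip())
        if fields.get("Type") == ["language"] and "Subtag" in fields:
            records[fields["Subtag"][0]] = fields
    return records


def is_signed(tag, records):
    """True if a description of the primary subtag contains the word 'sign'."""
    primary = tag.split("-")[0].lower()
    descriptions = records.get(primary, {}).get("Description", [])
    return any(re.search(r"\bsign\b", d, re.IGNORECASE) for d in descriptions)


def has_suppressed_script(tag, records):
    """True if the primary subtag carries a Suppress-Script field (point 2)."""
    primary = tag.split("-")[0].lower()
    return "Suppress-Script" in records.get(primary, {})


records = parse_registry(SAMPLE_REGISTRY)
print(is_signed("ase", records), is_signed("en-US", records))
print(has_suppressed_script("sv", records))
```

Note that this sketch treats an unknown tag as non-signed, and that it 
only looks at the primary language subtag. Whether a word match on the 
description is reliable enough is exactly the kind of question that 
would need to be settled before specifying more cases.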


We can construct a section 5.4 to describe this situation, but I doubt 
that it is worth the effort.


>
> See: 
> https://rawgit.com/w3c/webrtc-pc/master/webrtc.html#dom-rtcrtpparameters-degradationpreference
>
> On Sun, Oct 15, 2017 at 2:22 PM, Gunnar Hellström 
> <gunnar.hellstrom@omnitor.se> wrote:
>
>     On 2017-10-15 at 21:27, Paul Kyzivat wrote:
>
>         On 10/15/17 1:49 PM, Bernard Aboba wrote:
>
>             Paul said:
>
>             "For the software to know must mean that it will behave
>             differently for a tag that represents a sign language than
>             for one that represents a spoken or written language. What
>             is it that it will do differently?"
>
>             [BA] In terms of behavior based on the signed/non-signed
>             distinction, in -17 the only reference appears to be in
>             Section 5.4, stating that certain combinations are not
>             defined in the document (but that definition of those
>             combinations was out of scope):
>
>
>         I'm asking whether this is a distinction without a difference.
>         I'm not asking whether this makes a difference in the
>         *protocol*, but whether in the end it benefits the
>         participants in the call in any way.
>
>     <GH>Good point, I was on my way to make a similar comment earlier
>     today. The difference it makes for applications to "know" what
>     modality a language tag represents in its used position seems to
>     be only for imagined functions that are out of scope for the
>     protocol specification.
>
>         For instance:
>
>         - does it help the UA to decide how to alert the callee, so
>           that the callee can better decide whether to accept the call
>           or instruct the UA about how to handle the call?
>
>     <GH>Yes, for a regular human user-to-user call, the result of the
>     negotiation must be presented to the participants, so that they
>     can start the call with a language and modality that is agreed.
>     That presentation could be exactly the description from the
>     language tag registry, and then no "knowledge" is needed from the
>     application. But it is more likely that the application has its
>     own string for presentation of the negotiated language and
>     modality. So that will be presented. But it is still found by a
>     table lookup between language tag and string for a language name,
>     so no real knowledge is needed.
>     We have said many times that the way the application tells the
>     user the result of the negotiation is out of scope for the draft,
>     but it is good to discuss and know that it can be done.
>     A similar mechanism is also needed for configuration of the user's
>     language preference profile further discussed below.
>
>
>         - does it allow the UA to make a decision whether to accept
>         the media?
>
>     <GH>No, the media should be accepted regardless of the result of
>     the language negotiation.
>
>
>         - can the UA use this information to change how to render the
>         media?
>
>     <GH>Yes, for the specialized text notation of sign language we
>     have discussed but currently placed out of scope, a very special
>     rendering application is needed. The modality would be recognized
>     by a script subtag to a sign language tag used in text media.
>     However, I think that would be best to also use it with a specific
>     text subtype, so that the rendering can be controlled by
>     invocation of a "codec" for that rendering.
>
>
>         And if there is something like this, will the UA be able to do
>         this generically based on whether the media is sign language
>         or not, or will the UA need to already understand *specific*
>         sign language tags?
>
>     <GH>Applications will need to have localized versions of the names
>     for the different sign languages and also for spoken languages and
>     written languages, to be used in setting of preferences and
>     announcing the results of the negotiation. It might be overkill to
>     have such localized names for all languages in the IANA language
>     registry, so it will need to be able to handle localized names of
>     a subset of the registry. With good design however, this is just
>     an automatic translation between a language tag and a
>     corresponding name, so it does in fact not require any "knowledge"
>     of what modality is used with each language tag.
>     The application can ask for the configuration:
>     "Which languages do you want to offer to send in video?"
>     "Which languages do you want to offer to send in text?"
>     "Which languages do you want to offer to send in audio?"
>     "Which languages do you want to be prepared to receive in video?"
>     "Which languages do you want to be prepared to receive in text?"
>     "Which languages do you want to be prepared to receive in audio?"
>
>     And for each question provide a list of language names to select
>     from. When the selection is made, the corresponding language tag
>     is placed in the profile for negotiation.
>
>     If the application provides the whole IANA language registry to
>     the user for each question, then there is a possibility that the
>     user by mistake selects a language that requires another modality
>     than the question was about. If the application shall limit the
>     lists provided for each question, then it will need a kind of
>     knowledge about which language tags suit each modality (and media)
>
>
>
>         E.g., A UA serving a deaf person might automatically introduce
>         a sign language interpreter into an incoming audio-only call.
>         If the incoming call has both audio and video then the video
>         *might* be for conveying sign language, or not. If not then
>         the UA will still want to bring in a sign language
>         interpreter. But is knowing the call generically contains sign
>         language sufficient to decide against bringing in an
>         interpreter? Or must that depend on it being a sign language
>         that the user can use? If the UA is configured for all the
>         specific sign languages that the user can deal with then there
>         is no need to recognize other sign languages generically.
>
>     <GH>We are talking about specific language tags here and knowing
>     what modality they are used for. The user needs to specify which
>     sign languages they prefer to use. The callee application can be
>     made to look for gaps between what the caller offers and what the
>     callee can accept, and from that deduce which type of conversion
>     and which languages are needed, and invoke that as a relay
>     service. That invocation can be made completely table driven and
>     have corresponding translation profiles for available relay
>     services. But it is more likely that it is done by having some
>     knowledge about which languages are sign languages and which are
>     spoken languages and sending the call to the relay service to try
>     to sort out if they can handle the translation.
>
>
>
>     So, the answer is - no, the application does not really have any
>     knowledge about which modality a language tag represents in its
>     used position. If the user chooses to indicate very rare language
>     tags for a medium, then a match will just become very
>     unlikely.
>
>     Where does this discussion take us? Should we modify section 5.4
>     again?
>
>     Thanks
>     Gunnar
>
>             Thanks,
>             Paul
>
>                   5.4. Undefined Combinations
>             <https://tools.ietf.org/html/draft-ietf-slim-negotiating-human-language-17#section-5.4>
>
>
>
>                 The behavior when specifying a non-signed language tag
>                 for a video media stream, or a signed language tag for
>                 an audio or text media stream, is not defined in this
>                 document.
>
>                 The problem of knowing which language tags are signed
>                 and which are not is out of scope of this document.
>
>
>
>             On Sun, Oct 15, 2017 at 10:13 AM, Paul Kyzivat
>             <pkyzivat@alum.mit.edu> wrote:
>
>                 On 10/15/17 2:24 AM, Gunnar Hellström wrote:
>
>                     Paul,
>                     On 2017-10-15 at 01:19, Paul Kyzivat wrote:
>
>                         On 10/14/17 2:03 PM, Bernard Aboba wrote:
>
>                             Gunnar said:
>
>                             "Applications not implementing such
>                             specific notations may use the following
>                             simple deductions.
>
>                             - A language tag in audio media is
>                             supposed to indicate spoken modality.
>
>                             [BA] Even a tag with "Sign Language" in
>                             the description??
>
>                             - A language tag in text media is supposed
>                             to indicate written modality.
>
>                             [BA] If the tag has "Sign Language" in the
>                             description, can this document really say
>                             that?
>
>                             - A language tag in video media is
>                             supposed to indicate visual sign language
>                             modality except for the case when it is
>                             supposed to indicate a view of a speaking
>                             person mentioned in section 5.2
>                             characterized by the exact same language
>                             tag also appearing in an audio media
>                             specification.
>
>                             [BA] It seems like an over-reach to say
>                             that a spoken language tag in video media
>                             should instead be interpreted as a request
>                             for Sign Language.  If this were done,
>                             would it always be clear which Sign
>                             Language was intended?  And could we
>                             really assume that both sides, if
>                             negotiating a spoken language tag in video
>                             media, were really indicating the desire
>                             to sign?  It seems like this could easily
>                             result in interoperability failure.
>
>
>                         IMO the right way to indicate that two (or
>                         more) media streams are conveying alternative
>                         representations of the same language content
>                         is by grouping them with a new grouping
>                         attribute. That can tie together an audio with
>                         a video and/or text. A language tag for sign
>                         language on the video stream then clarifies to
>                         the recipient that it is sign language. The
>                         grouping attribute by itself can indicate that
>                         these streams are conveying language.
>
>                     <GH>Yes, and that is proposed in
>                     draft-hellstrom-slim-modality-grouping with two
>                     kinds of grouping: One kind of grouping to tell
>                     that two or more languages in different streams
>                     are alternatives with the same content and a
>                     priority order is assigned to them to guide the
>                     selection of which one to use during the call. The
>                     other kind of grouping telling that two or more
>                     languages in different streams are desired
>                     together with the same language content but
>                     different modalities (such as the use for
>                     captioned telephony with the same content provided
>                     in both speech and text, or sign language
>                     interpretation where you see the interpreter, or
>                     possibly spoken language interpretation with the
>                     languages provided in different audio streams). I
>                     hope that that draft can be progressed. I see it
>                     as a needed complement to the pure language
>                     indications per media.
>
>
>                 Oh, sorry. I did read that draft but forgot about it.
>
>                     The discussion in this thread is more about how an
>                     application would easily know that e.g. "ase" is a
>                     sign language and "en" is a spoken (or written)
>                     language, and also a discussion about what kinds
>                     of languages are allowed and indicated by default
>                     in each media type. It was not at all about
>                     falsely using language tags in the wrong media
>                     type as Bernard understood my wording. It was
>                     rather a limitation to what modalities are used in
>                     each media type and how to know the modality with
>                     cases that are not evident, e.g. "application" and
>                     "message" media types.
>
>
>                 What do you mean by "know"? Is it for the *UA*
>                 software to know, or for the human user of the UA to
>                 know? Presumably a human user that cares will
>                 understand this if presented with the information in
>                 some way. But typically this isn't presented to the
>                 user.
>
>                 For the software to know must mean that it will behave
>                 differently for a tag that represents a sign language
>                 than for one that represents a spoken or written
>                 language. What is it that it will do differently?
>
>                          Thanks,
>                          Paul
>
>
>                     Right now we have returned to a very simple rule:
>                     we define only use of spoken language in audio
>                     media, written language in text media and sign
>                     language in video media.
>                     We have discussed other use, such as a view of a
>                     speaking person in video, text overlay on video, a
>                     sign language notation in text media, written
>                     language in message media, written language in
>                     WebRTC data channels, sign, written and spoken in
>                     bucket media maybe declared as application media.
>                     We do not define these cases. They are just not
>                     defined, not forbidden. They may be defined in the
>                     future.
>
>                     My proposed wording in section 5.4 got too many
>                     misunderstandings so I gave up on it. I think we
>                     can live with 5.4 as it is in version -16.
>
>                     Thanks,
>                     Gunnar
>
>
>
>                         (IIRC I suggested something along these lines
>                         a long time ago.)
>
>                              Thanks,
>                              Paul
>
>                         _______________________________________________
>                         SLIM mailing list
>                         SLIM@ietf.org
>                         https://www.ietf.org/mailman/listinfo/slim
>
>
>
>
>
>
>
>     -- 
>     -----------------------------------------
>     Gunnar Hellström
>     Omnitor
>     gunnar.hellstrom@omnitor.se
>     +46 708 204 288
>
>

-- 
-----------------------------------------
Gunnar Hellström
Omnitor
gunnar.hellstrom@omnitor.se
+46 708 204 288