Re: [Slim] Issue 43: How to know the modality of a language indication?

Gunnar Hellström <gunnar.hellstrom@omnitor.se> Sun, 15 October 2017 22:37 UTC

To: Paul Kyzivat <pkyzivat@alum.mit.edu>
Cc: slim@ietf.org


On 2017-10-15 at 23:58, Paul Kyzivat wrote:
> On 10/15/17 5:22 PM, Gunnar Hellström wrote:
>> On 2017-10-15 at 21:27, Paul Kyzivat wrote:
>>> On 10/15/17 1:49 PM, Bernard Aboba wrote:
>>>> Paul said:
>>>>
>>>> "For the software to know must mean that it will behave differently 
>>>> for a tag that represents a sign language than for one that 
>>>> represents a spoken or written language. What is it that it will do 
>>>> differently?"
>>>>
>>>> [BA] In terms of behavior based on the signed/non-signed 
>>>> distinction, in -17 the only reference appears to be in Section 
>>>> 5.4, stating that certain combinations are not defined in the 
>>>> document (but that definition of those combinations was out of scope):
>>>
>>> I'm asking whether this is a distinction without a difference. I'm 
>>> not asking whether this makes a difference in the *protocol*, but 
>>> whether in the end it benefits the participants in the call in any way. 
>> <GH>Good point, I was on my way to make a similar comment earlier 
>> today. The difference it makes for applications to "know" what 
>> modality a language tag represents in its used position seems to be 
>> only for imagined functions that are out of scope for the protocol 
>> specification.
>>> For instance:
>>>
>>> - does it help the UA to decide how to alert the callee, so that the
>>>   callee can better decide whether to accept the call or instruct the
>>>   UA about how to handle the call?
>> <GH>Yes, for a regular human user-to-user call, the result of the 
>> negotiation must be presented to the participants, so that they can 
>> start the call with an agreed language and modality.
>
> *Today* we don't do this. We leave it for the end users to negotiate 
> the language they will use by other means - typically in-band by 
> informal negotiation. Most UAs don't have any provision to provide the 
> specified language to the end user.
<GH>Yes, and the main purpose of this work is to improve on that: to 
make sure that the users are informed so they can start the call in an 
appropriate language, and even invoke support if needed.
>
> This of course will also work for deaf users with sign language in the 
> absence of any interpreters in the call. When it doesn't work is when 
> there is a deaf user and a hearing user, and video isn't present at 
> both ends. But in that case there isn't much hope.
>
>> That presentation could be exactly the description from the language 
>> tag registry, and then no "knowledge" is needed from the application. 
>> But it is more likely that the application has its own string for 
>> presenting the negotiated language and modality, and that is what 
>> will be presented. Either way, the string is found by a table lookup 
>> from language tag to language name, so no real knowledge is needed.
>
> Clearly presenting the raw language tags won't be useful to most 
> users. So some sort of translation is needed. Such a hypothetical 
> translation table could also have properties, such as whether the 
> language is a sign language. Hence this doesn't seem to be a problem 
> we need to solve.
<GH>Right, we provide the information in language tag format to the 
devices; how the devices collect configuration information from the 
users and present the negotiation result to the users is not specified 
in this draft.
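
As a minimal sketch of such a table-driven presentation (the table 
content and names below are invented for illustration, not taken from 
the draft):

    # Hypothetical lookup from BCP 47 language tag to a localized
    # display name plus a modality property. A real application would
    # populate this from its own localization resources.
    LANGUAGE_TABLE = {
        "en":  {"name": "English",                "modality": "spoken/written"},
        "es":  {"name": "Spanish",                "modality": "spoken/written"},
        "ase": {"name": "American Sign Language", "modality": "signed"},
        "fsl": {"name": "French Sign Language",   "modality": "signed"},
    }

    def describe(tag):
        """Return a user-presentable description of a negotiated tag."""
        entry = LANGUAGE_TABLE.get(tag)
        if entry is None:
            return tag  # fall back to the raw tag for unknown entries
        return "%s (%s)" % (entry["name"], entry["modality"])

    # e.g. describe("ase") -> "American Sign Language (signed)"
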
>
>> We have said many times that the way the application tells the user 
>> the result of the negotiation is out of scope for the draft, but it 
>> is good to discuss and know that it can be done.
>> A similar mechanism is also needed for configuring the user's 
>> language preference profile, discussed further below.
>>>
>>> - does it allow the UA to make a decision whether to accept the media?
>> <GH>No, the media should be accepted regardless of the result of the 
>> language negotiation.
>>>
>>> - can the UA use this information to change how to render the media?
>> <GH>Yes, for the specialized text notation of sign language that we 
>> have discussed but currently placed out of scope, a very special 
>> rendering application is needed. The modality would be recognized by 
>> a script subtag on a sign language tag used in text media. However, 
>> I think it would be best to also use a specific text subtype, so 
>> that the rendering can be controlled by invoking a "codec" for that 
>> rendering.
>
> But won't the rendering also be in some way language specific? If so, 
> the device will need to be configured for the specific languages it 
> supports. Hence again a generic mechanism isn't needed.
<GH>I got lost here in the reasoning. We were looking for reasons for 
the device to do things differently for different modalities, and 
therefore to know the modality from a language tag and its position in 
the media description.
We have the unsolved case of a language tag for a spoken or written 
language used in video media. Since both cases use the same language 
tag, we do not know what is meant if it appears in a video media 
description. It can mean a view of a speaker, or it can mean text 
overlay on video, e.g. with MP4 coding carrying a text component 
together with the video (if I remember the use of the subtypes right). 
This problem is why we for a while had a complex rule allowing just 
one of these cases.

Rendering will be modality specific. Text embedded in video requires 
different rendering than a view of a speaking person. All the ways we 
have proposed to indicate the difference between these two modalities 
have been rejected.
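
To make the ambiguity concrete, here is a sketch of the simple 
per-media deductions in code (the tag set and function are invented 
for illustration; they are not defined by the draft):

    # Hypothetical deduction of modality from media type plus language
    # tag, following the simple per-media defaults discussed here.
    SIGN_LANGUAGE_TAGS = {"ase", "bfi", "fsl", "swl"}  # illustrative subset

    def deduce_modality(media, tag):
        if media == "audio":
            return "spoken"
        if media == "text":
            return "written"
        if media == "video":
            if tag in SIGN_LANGUAGE_TAGS:
                return "signed"
            # A spoken/written tag in video media is exactly the
            # unsolved case: speaker view or text overlay? Cannot tell.
            return "ambiguous"
        return "undefined"  # e.g. "application" or "message" media
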
>
>>> And if there is something like this, will the UA be able to do this 
>>> generically based on whether the media is sign language or not, or 
>>> will the UA need to already understand *specific* sign language tags?
>> <GH>Applications will need to have localized versions of the names 
>> for the different sign languages, and also for spoken and written 
>> languages, to be used when setting preferences and announcing the 
>> results of the negotiation. It might be overkill to have such 
>> localized names for all languages in the IANA language registry, so 
>> the application will need to be able to handle localized names for a 
>> subset of the registry. With good design, however, this is just an 
>> automatic translation between a language tag and a corresponding 
>> name, so it does not in fact require any "knowledge" of what 
>> modality is used with each language tag.
>
> My earlier comment applies to the above.
>
>> The application can ask for the configuration:
>> "Which languages do you want to offer to send in video"
>> "Which languages do you want to offer to send in text"
>> "Which languages do you want to offer to send in audio"
>> "Which languages do you want to be prepared to receive in video"
>> "Which languages do you want to be prepared to receive in text"
>> "Which languages do you want to be prepared to receive in audio"
>>
>> And for each question provide a list of language names to select 
>> from. When the selection is made, the corresponding language tag is 
>> placed in the profile for negotiation.
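
<GH>Assuming the hlang-send/hlang-recv attribute names from the 
current draft, such a configured profile maps directly onto the SDP 
offer. A sketch (the profile layout and function names are made up for 
illustration):

    # Hypothetical preference profile: media type -> language tags in
    # preference order, for each direction.
    profile = {
        "send": {"audio": ["en"], "text": ["en"], "video": ["ase"]},
        "recv": {"audio": ["en"], "text": ["en"], "video": ["ase"]},
    }

    def hlang_attributes(media):
        """Return the language attribute lines for one m= section."""
        lines = []
        if profile["send"].get(media):
            lines.append("a=hlang-send:" + " ".join(profile["send"][media]))
        if profile["recv"].get(media):
            lines.append("a=hlang-recv:" + " ".join(profile["recv"][media]))
        return lines

    # e.g. hlang_attributes("video") ->
    #     ["a=hlang-send:ase", "a=hlang-recv:ase"]
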
>
> Hence, there need not be any *generic* mechanism, since this is based 
> on configuration.
>
>> If the application provides the whole IANA language registry to the 
>> user for each question, then there is a possibility that the user by 
>> mistake selects a language that requires another modality than the 
>> question was about. If the application is to limit the lists 
>> provided for each question, then it will need some kind of knowledge 
>> about which language tags suit each modality (and media).
>>
>>
>>>
>>> E.g., A UA serving a deaf person might automatically introduce a 
>>> sign language interpreter into an incoming audio-only call. If the 
>>> incoming call has both audio and video then the video *might* be for 
>>> conveying sign language, or not. If not then the UA will still want 
>>> to bring in a sign language interpreter. But is knowing the call 
>>> generically contains sign language sufficient to decide against 
>>> bringing in an interpreter? Or must that depend on it being a sign 
>>> language that the user can use? If the UA is configured for all the 
>>> specific sign languages that the user can deal with then there is no 
>>> need to recognize other sign languages generically.
>> <GH>We are talking about specific language tags here and knowing 
>> what modality they are used for. The user needs to specify which 
>> sign languages they prefer to use. The callee application can be 
>> made to look for gaps between what the caller offers and what the 
>> callee can accept, and from that deduce which type of conversion, 
>> and between which languages, is needed, and invoke that as a relay 
>> service. That invocation can be made completely table driven, with 
>> corresponding translation profiles for the available relay services. 
>> But it is more likely that it is done by having some knowledge about 
>> which languages are sign languages and which are spoken languages, 
>> and sending the call to the relay service to try to sort out whether 
>> they can handle the translation.
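
<GH>A table-driven version of that gap check could look something like 
this sketch (the relay table and all names are invented for 
illustration):

    # Hypothetical table-driven relay selection. Language capabilities
    # are modeled as sets of (language_tag, media) pairs.
    RELAYS = [
        # (name, capability set on one side, capability set on the other)
        ("ASL video relay",     {("ase", "video")}, {("en", "audio")}),
        ("captioned telephony", {("en", "audio")},  {("en", "text")}),
    ]

    def pick_relay(caller, callee):
        """Return the name of a relay bridging the gap, or None."""
        if caller & callee:
            return None  # direct language match, no relay needed
        for name, side_a, side_b in RELAYS:
            if (side_a & caller and side_b & callee) or \
               (side_b & caller and side_a & callee):
                return name
        return None  # no known relay; try asking one, as noted above

    # pick_relay({("ase", "video")}, {("en", "audio")}) -> "ASL video relay"
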
>>>
>>>
>> So, the answer is: no, the application does not really have any 
>> knowledge about which modality a language tag represents in its used 
>> position. If the user chooses to indicate a very rare language tag 
>> for a medium, then a match will just become very unlikely.
>>
>> Where does this discussion take us? Should we modify section 5.4 again?
>
> Frankly, I see no need for section 5.4.
<GH>This discussion at least changes the need for section 5.4 
dramatically.
What we still might need to say is that we have no agreed way to 
differentiate between a view of a speaking person and text embedded in 
video with the defined notation. And we could warn against using 
unusual language tag and media combinations, since they will rarely be 
matched.
>
>     Thanks,
>     Paul
>
>> Thanks
>> Gunnar
>>>     Thanks,
>>>     Paul
>>>
>>>>     5.4. Undefined Combinations
>>>> <https://tools.ietf.org/html/draft-ietf-slim-negotiating-human-language-17#section-5.4>
>>>>
>>>>     The behavior when specifying a non-signed language tag for a video
>>>>     media stream, or a signed language tag for an audio or text media
>>>>     stream, is not defined in this document.
>>>>
>>>>     The problem of knowing which language tags are signed and which
>>>>     are not is out of scope of this document.
>>>>
>>>>
>>>>
>>>> On Sun, Oct 15, 2017 at 10:13 AM, Paul Kyzivat
>>>> <pkyzivat@alum.mit.edu> wrote:
>>>>
>>>>     On 10/15/17 2:24 AM, Gunnar Hellström wrote:
>>>>
>>>>         Paul,
>>>>         On 2017-10-15 at 01:19, Paul Kyzivat wrote:
>>>>
>>>>             On 10/14/17 2:03 PM, Bernard Aboba wrote:
>>>>
>>>>                 Gunnar said:
>>>>
>>>>                 "Applications not implementing such specific notations
>>>>                 may use the following simple deductions.
>>>>
>>>>                 - A language tag in audio media is supposed to
>>>>                 indicate spoken modality.
>>>>
>>>>                 [BA] Even a tag with "Sign Language" in the
>>>>                 description??
>>>>
>>>>                 - A language tag in text media is supposed to
>>>>                 indicate written modality.
>>>>
>>>>                 [BA] If the tag has "Sign Language" in the
>>>>                 description, can this document really say that?
>>>>
>>>>                 - A language tag in video media is supposed to
>>>>                 indicate visual sign language modality except for
>>>>                 the case when it is supposed to indicate a view of
>>>>                 a speaking person mentioned in section 5.2,
>>>>                 characterized by the exact same language tag also
>>>>                 appearing in an audio media specification.
>>>>
>>>>                 [BA] It seems like an over-reach to say that a spoken
>>>>                 language tag in video media should instead be
>>>>                 interpreted as a request for Sign Language. If this
>>>>                 were done, would it always be clear which Sign
>>>>                 Language was intended?  And could we really assume
>>>>                 that both sides, if negotiating a spoken language
>>>>                 tag in video media, were really indicating the
>>>>                 desire to sign?  It seems like this could easily
>>>>                 result in interoperability failure.
>>>>
>>>>
>>>>             IMO the right way to indicate that two (or more) media
>>>>             streams are conveying alternative representations of the
>>>>             same language content is by grouping them with a new
>>>>             grouping attribute. That can tie together an audio with a
>>>>             video and/or text. A language tag for sign language on the
>>>>             video stream then clarifies to the recipient that it is
>>>>             sign language. The grouping attribute by itself can
>>>>             indicate that these streams are conveying language.
>>>>
>>>>         <GH>Yes, and that is proposed in
>>>>         draft-hellstrom-slim-modality-grouping, with two kinds of
>>>>         grouping: one kind of grouping to tell that two or more
>>>>         languages in different streams are alternatives with the
>>>>         same content, with a priority order assigned to them to
>>>>         guide the selection of which one to use during the call;
>>>>         the other kind of grouping telling that two or more
>>>>         languages in different streams are desired together, with
>>>>         the same language content but different modalities (such as
>>>>         the use for captioned telephony with the same content
>>>>         provided in both speech and text, or sign language
>>>>         interpretation where you see the interpreter, or possibly
>>>>         spoken language interpretation with the languages provided
>>>>         in different audio streams). I hope that the draft can be
>>>>         progressed. I see it as a needed complement to the pure
>>>>         language indications per media.
>>>>
>>>>
>>>>     Oh, sorry. I did read that draft but forgot about it.
>>>>
>>>>         The discussion in this thread is more about how an
>>>>         application would easily know that e.g. "ase" is a sign
>>>>         language and "en" is a spoken (or written) language, and
>>>>         also a discussion about what kinds of languages are allowed
>>>>         and indicated by default in each media type. It was not at
>>>>         all about falsely using language tags in the wrong media
>>>>         type, as Bernard understood my wording. It was rather a
>>>>         limitation to what modalities are used in each media type,
>>>>         and how to know the modality in cases that are not evident,
>>>>         e.g. "application" and "message" media types.
>>>>
>>>>
>>>>     What do you mean by "know"? Is it for the *UA* software to
>>>>     know, or for the human user of the UA to know? Presumably a
>>>>     human user that cares will understand this if presented with
>>>>     the information in some way. But typically this isn't presented
>>>>     to the user.
>>>>
>>>>     For the software to know must mean that it will behave differently
>>>>     for a tag that represents a sign language than for one that
>>>>     represents a spoken or written language. What is it that it
>>>>     will do differently?
>>>>
>>>>              Thanks,
>>>>              Paul
>>>>
>>>>
>>>>         Right now we have returned to a very simple rule: we define
>>>>         only use of spoken language in audio media, written
>>>>         language in text media and sign language in video media.
>>>>         We have discussed other uses, such as a view of a speaking
>>>>         person in video, text overlay on video, a sign language
>>>>         notation in text media, written language in message media,
>>>>         written language in WebRTC data channels, and signed,
>>>>         written and spoken language in bucket media, maybe declared
>>>>         as application media. We do not define these cases. They
>>>>         are just not defined, not forbidden. They may be defined in
>>>>         the future.
>>>>
>>>>         My proposed wording in section 5.4 got too many
>>>>         misunderstandings, so I gave up on it. I think we can live
>>>>         with 5.4 as it is in version -16.
>>>>
>>>>         Thanks,
>>>>         Gunnar
>>>>
>>>>
>>>>
>>>>             (IIRC I suggested something along these lines a long
>>>>             time ago.)
>>>>
>>>>                  Thanks,
>>>>                  Paul
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

-- 
-----------------------------------------
Gunnar Hellström
Omnitor
gunnar.hellstrom@omnitor.se
+46 708 204 288