Re: [art] Artart last call review of draft-ietf-core-problem-details-05

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Thu, 23 June 2022 06:47 UTC

Message-ID: <dde9d36c-61e5-afcc-e15a-787c99d5fba9@it.aoyama.ac.jp>
Date: Thu, 23 Jun 2022 15:47:37 +0900
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0
Content-Language: en-US
To: Carsten Bormann <cabo@tzi.org>, Harald Alvestrand <harald@alvestrand.no>
Cc: art@ietf.org, core@ietf.org, draft-ietf-core-problem-details.all@ietf.org, last-call@ietf.org
References: <165511479760.19573.12671700576299137749@ietfa.amsl.com> <63D13796-758D-469B-AFA8-3050C9F87819@tzi.org>
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
In-Reply-To: <63D13796-758D-469B-AFA8-3050C9F87819@tzi.org>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/C_lv2frL7WO65oltxdqfoLEQPks>
Subject: Re: [art] Artart last call review of draft-ietf-core-problem-details-05

Dear Core and I18N experts,

Some comments on the I18N aspects of Tag 38 below.

[Sorry this answer took so long, and got so long. The two 'long's 
influenced each other :-).]

On 2022-06-16 01:23, Carsten Bormann wrote:
> 
> Hi Harald,
> 
> thank you for this thoughtful review.

>> The “Tag 38 internationalized string”
>> This document adds an appendix defining an “internationalized string” format
>> that adds a BCP 47 language tag and an Unicode-based direction indicator to an
>> UTF-8 string. This is laudable; RFC 2277 section 4 pointed out the need for
>> this ability 24 years ago.

I think that Language-Tagged Strings (CBOR Tag 38, 
https://datatracker.ietf.org/doc/html/draft-ietf-core-problem-details-06#appendix-A) 
are a very good step ahead. At least for CBOR, in many cases from now 
on, the answer might just be "use Tag 38" (assuming we get the details 
right).

>> Unfortunately neither definition is problem-free.
>>
>> First of all, this tag, if useful at all, is of far greater utility than the
>> error format. Burying it in an appendix of a document whose stated purpose is
>> something else makes it far more difficult to refer to than it needs to be.
> 
> That is usually not a problem.  The focal point for finding a CBOR tag for a specific application is the CBOR tag registry; this then points to the places where the specifications for the tags can be found (which in this case is easily expressed as “Appendix A of RFC XXXX”).

Separate Draft or Not
=====================

I agree with Harald that it should be a separate draft; it would 
definitely help with visibility of I18N in general and the issue of 
strings with language and directionality information inside and outside 
the IETF (not only the visibility within the CBOR community, which may 
be covered by the tag registry). Being able to say "look at RFC XXXX for 
a good example" is way better than being able to say "look at appendix X 
of RFC YYYY for a good example".

I understand Francesca's arguments, too, but I think the investment in a 
separate draft would be well worth the effort. I'm willing to contribute 
although I guess that Carsten would do the necessary work in less time 
than it takes him to get anybody else up to speed.

>> Second, the “detailed semantics” has chosen to include the quite complex BNF of
>> RFC 5646 translated into CDDL; this may have some use, but BCP 47 is a moving
>> target;
> 
> We intend tag38 to be useful for the current form of BCP 47, so it is hard to plan for the future.  If BCP 47 needs to be considered unstable, we could of course define a “bcp47-extension” alternative with a CDDL feature control operator.

(NOT!) Copying BCP 47 Grammar
=============================

I also agree with Harald that the definition of 'Language-Tagged 
Strings' has room for improvement. First, as Harald said, it repeats the 
BCP 47 grammar when we very well know that repeating grammars is usually 
a bad idea. I'm really not sure why CBOR wants to check each and every 
detail of the current language tag syntax. My understanding was that 
CBOR was (among other things, if not primarily) for constrained devices. I just 
cannot see the motivation of embedding a list of legacy tags into a 
constrained device.

I also don't know of any other technology at a similar level to CBOR that 
would do so. As an example, XML had productions 33-38 (see 
https://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag), but they were 
removed as early as 2000 (see 
https://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag), for very good 
reasons. I really have difficulty imagining why CBOR would want to 
make the same mistake that XML fixed more than 20 years ago.

Similarly, XML Schema Datatypes only gives a very simple regular 
expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes
(see https://www.w3.org/TR/xmlschema11-2/#language):

[[[[
Note: The regular expression above provides the only normative 
constraint on the lexical and value spaces of this type. The additional 
constraints imposed on language identifiers by [BCP 47] and its 
successor(s), and in particular their requirement that language codes be 
registered with IANA or ISO if not given in ISO 639, are not part of 
this datatype as defined here.
]]]]
Again, XML Schema would have done something more precise if anybody had 
been convinced that such precision made sense.

Another way to see this is that in general, when imposing restrictive 
syntactic rules, there's the question of "bang for the buck". The 
complexity of the language tag syntax rules, down to the legacy 
(grandfathered) stuff, means that the cost ("buck") is quite high. This 
not only includes implementation and memory footprint, but also testing 
and everything else.

On the other hand, the "bang" is quite low, for two reasons:
First, without a check against the registry, a lot of garbage still can 
go through. Think e.g. "en-UK", which looks reasonable and fits the 
grammar, but is not allowed (UK is not a country code, "en-GB" is 
correct). Second, most actual language tags, in particular for 
constrained devices, are more on the level of "fr" or "en-US", which 
means that on most actual data, the full syntax isn't really exercised. 
Which further means that software with implementation bugs in the syntax 
testing part doesn't get weeded out.

The main mechanisms (if any) that will help to make sure these language 
tags are correct are the following:
1) On the 'sender' side, texts will be translated, by "hand" or using 
some localization tools, and the correct language tags will be set there 
(because somebody translating to Ukrainian, or their tool, knows the 
correct tag is "uk", and not something else).
2) On the 'receiver' side, user preferences will be expressed as 
language tags (or prefixes,...), which should assure that correctly 
tagged data gets shown and incorrectly tagged data gets ignored.
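
To illustrate mechanism 2), here is a minimal Python sketch of prefix 
matching on the receiver side. It assumes that "prefixes" means something 
like the basic filtering of RFC 4647 (a range matches a tag if they are 
equal ignoring case, or if the range is a prefix of the tag followed by 
"-"). The function names and the sample data are made up for this example.

```python
def matches_range(language_range: str, tag: str) -> bool:
    """Basic filtering in the style of RFC 4647 (assumed here, not in the draft)."""
    if language_range == "*":
        return True
    wanted, actual = language_range.lower(), tag.lower()
    # Equal, or a proper prefix that ends at a subtag boundary.
    return actual == wanted or actual.startswith(wanted + "-")


def select_text(preferences, tagged_texts):
    """Return the first (tag, text) pair matching the earliest preference, else None."""
    for language_range in preferences:
        for tag, text in tagged_texts:
            if matches_range(language_range, tag):
                return tag, text
    return None


# Hypothetical parallel texts; the second one carries a wrong tag for Ukrainian.
messages = [("en-US", "text in English"), ("ua", "text in Ukrainian")]
print(select_text(["uk", "en"], messages))
```

A receiver that prefers "uk" never picks up the text mis-tagged as "ua"; 
it falls back to the "en-US" text instead. No syntax checking of the tag 
was needed to get that behaviour.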

To summarize, copying the grammar from BCP 47 brings extremely little 
bang for rather high costs. Get rid of it in the same way other 
standards which have thought this through have got rid of a detailed 
grammar. If you want something that gives you a minimal plausibility 
test (catch cases where e.g. the text and the language tag got swapped 
by some accident,...), do what XML Schema did.
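
As a rough sketch of what such a minimal plausibility test could look 
like, here it is in Python, using only the XML Schema regular expression 
quoted above. The function name is made up for this example, and this is 
of course not a proposal for the actual CDDL.

```python
import re

# The regular expression from XML Schema Datatypes (type "language").
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(candidate: str) -> bool:
    """Plausibility check only: no BCP 47 grammar and no registry lookup."""
    return XSD_LANGUAGE.fullmatch(candidate) is not None


# The last two candidates model a swapped text and an overlong subtag.
for candidate in ["fr", "en-US", "en-UK", "File not found", "abcdefghi"]:
    print(repr(candidate), looks_like_language_tag(candidate))
```

Note that "en-UK" passes here, just as it passes the full BCP 47 
grammar; only a registry check would catch it. The swapped text does 
get caught, which is about all the "bang" a syntax check can deliver.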

This will also be future proof. There are many changes to BCP 47 that 
have been discussed in the past (although none of these got traction, or 
are expected to get traction in the near future), but changing the basic 
syntax constraint expressed by XML Schema was never considered an 
option. On the other hand, it was always clear to the people involved 
that users of language tags shouldn't create artificial barriers to 
future changes. It would be really a pity if CBOR created such a barrier 
just because they could. Things such as "CDDL feature control operators" 
are great where they actually serve a purpose, here I don't think they 
would.

Directionality Information
==========================

Regarding language tags, in addition, there is the following note:
[[[[
NOTE: The Unicode Standard [Unicode-14.0.0] includes a set of
    characters designed for tagging text (including language tagging), in
    the range U+E0000 to U+E007F.  Although many applications, including
    RDF, do not disallow these characters in text strings, the Unicode
    Consortium has deprecated these characters and recommends annotating
    language via a higher-level protocol instead.  See the section
    "Deprecated Tag Characters" in Section 23.9 of [Unicode-14.0.0].
]]]]
It's weird for the IETF to refer (only) to the Unicode Standard here 
even though the IETF itself has deprecated this kind of language tagging 
in RFC 6082 (see https://www.rfc-editor.org/rfc/rfc6082.html). So please 
cite that RFC.

>> having CDDL parsers try to validate tags according to this grammar is
>> not going to be useful. If included at all, this needs to be clearly marked
>> with text saying that BCP 47 is normative for this grammar, and that language
>> tag parsers should NOT try to reject tags based on this grammar; instead, they
>> should be treated as strings, and looked up against relevant language handling
>> APIs. (“zh-ZZ” is perfectly valid according to the grammar, but is semantically
>> invalid according to BCP 47).
> 
> Here again, it is hard to capture semantics in a structural definition.
> Our document is going to reference RFC 5646 (including its ABNF), as that is the current definition; if BCP 47 is updated, the effect of that update on this document will need new consideration.

No, please. I understand that in some areas, you don't want to allow 
gratuitous changes to your network and software based on changes to 
technology that you use. But for language tags, such a mindset is really 
counterproductive. Some of the changes to BCP 47 that have been 
discussed are to include some subtags for dialects. Now if such a change 
happened, there are two questions relevant for CBOR:
1) How many cases would there be in the CBOR landscape where people 
would want to use such subtags? The answer would probably be: Very few, 
so a change (using a "CDDL feature control operator" or whatever) would 
have very low priority. But why should people be prohibited from using 
such subtags if they want to use them?
2) What's the problem in letting such subtags through the current 
infrastructure? My guess is that there's no problem at all. When there 
are parallel texts, one tagged with "en-US" and the other with one of 
these dialect subtags, the chance is very high that a recipient will be 
displaying the former. Would that be a problem?

>> Note also that the sentence “Data items with tag
>> 38 that do not meet the criteria above are invalid (see Section 5.3.2 of
>> [STD94]).” is really hard to parse semantically, given that section 5.3.2 of
>> RFC 8949 doesn’t use the word “invalid”, it uses “inadmissible value”. I do not
>> recommend rejecting unknown language tags.
> 
> They may not be rejected, they are just not “valid” in RFC 8949 sense (they are still well-formed).  I would expect language tags to evolve within the grammar defined by RFC 5646 (which does have an extension point); it that is a mistaken assumption, please let us know.

In the short term (my average guess at "short term" would be 10 years or 
so), evolution *within* RFC 5646 is definitely the main focus. In the 
really long term, I guess anything that fits the XML Schema production 
is fair game. That restriction has been there since the original RFC 
1766, and provides some actual "bang for the buck". It is also baked 
into technologies such as XML Schema, which would provide a very strong 
argument not to give up on it. In all the work on revising RFC 1766 
(which I co-chaired, and which was quite long-winded), the rule 
that each subtag had to be 8 characters or less was never strongly 
disputed at all.

>> Thirdly, the definition of the tri-state direction attribute can be made
>> clearer; in particular, the Unicode Bidirectional Algorithm (UAX#9) should be
>> referenced, with particular reference to
>> https://www.unicode.org/reports/tr9/tr9-44.html#Markup_And_Formatting - the
>> important property here is that the desired semantic is isolation - the markup
>> is intended to have zero influence on strings outside the embedded string - the
>> semantics of embedding in RLI…PDI is the desired effect.
> 
> Tag38 does not provide a way to handle embedding, so we are not trying to boil that ocean yet.

Again, I agree with Harald here. But first, please be careful. 
"embedding" has a very narrow technical meaning in the Bidi Algorithm 
(UAX #9). Tag 38 doesn't need a way to handle embeddings in this sense. 
When Harald used the term "embedded string", he didn't use "embedded" in 
this very narrow technical sense, but in a more general sense, namely 
that the string from Tag 38 is expected to be put into some 
(surrounding) context. That might mean that it shows up by itself 
somewhere, or that it gets included in a larger text of some sorts.

In the draft, you have the following text:
[[[[
    The optional third element, if present, is a Boolean value that
    indicates a direction: false for "ltr" direction, true for "rtl"
    direction.  If the third element is absent, no indication is made
    about the direction; it can be explicitly given as null to express
    the same while overriding any context that might be considered
    applying to this element.  Note that the proper processing of
    Language and Direction Metadata is an active area of investigation;
    the reader is advised to consult ongoing standardization activities
    such as [STRING-META] when processing the information represented in
    this tag.
]]]]

[override is also a technical term in the Bidi Algorithm]

I think this text is very important, so I'll go into some detail. 
First (minor nit), it says "If the third element is absent ...". Because 
this is in a paragraph that starts with "The optional third element 
...", I think it would better say "If this element is absent ...".

Next, let me make sure that I get this right: This is a Boolean value, 
but it can in effect have four different states, yes? That would be:
- True (rtl)
- False (ltr)
- null (no indication about direction, but overriding any context)
- absent (no indication about direction, but context may apply)
If that's true, then it might be good to put that into a more structured 
form (something like the above list).
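
To make the four states concrete, here is a small Python sketch. It 
assumes that the tag content has already been decoded into a Python list 
of the form [language tag, text] or [language tag, text, direction], with 
CBOR null decoded as None. That layout, the enum, and the function name 
are my assumptions for the example, not text from the draft.

```python
from enum import Enum


class Direction(Enum):
    LTR = "ltr"        # third element is false
    RTL = "rtl"        # third element is true
    NULL = "null"      # explicit null: no indication, context overridden
    ABSENT = "absent"  # element missing: no indication, context may apply


def direction_state(tag38_content: list) -> Direction:
    """Classify the optional third element of a decoded Tag 38 array."""
    if len(tag38_content) < 3:
        return Direction.ABSENT
    third = tag38_content[2]
    if third is None:
        return Direction.NULL
    if third is True:
        return Direction.RTL
    if third is False:
        return Direction.LTR
    raise ValueError("third element must be a Boolean or null")


print(direction_state(["en", "Hello"]))
print(direction_state(["en", "Hello", None]))
```

The point of the sketch is that "absent" and "null" can only be told 
apart by looking at the length of the array, which is easy to get wrong 
if the specification does not spell it out.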

[very major point] The main problem is with the last sentence. There's 
not much of a point in defining a field for directionality if it's not 
clear what that is supposed to be used for. I'm also not sure where the 
claim "the proper processing of Language and Direction Metadata is an 
active area of investigation" came from, and why it is here.

It is true that there are some areas of bidi processing (e.g. the best 
consistent way to display IRIs that contain pieces of text from both 
directionalities) that are not solved yet, or (as in the example just 
given) are not even being actively investigated, because the general 
agreement is that the problem is too difficult to have a solution.
It is also true that "Strings on the Web: Language and Direction 
Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.

But neither of these facts should have to influence the specification of 
Tag 38. [StringMeta] (3.4 What consumers need to do to support 
direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald 
and I all agree about what the right thing to do is: Use Bidi isolation 
(in the technical sense of 
https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).

So given all the above considerations, what about rewriting the 
paragraph under consideration along the following lines:

[[[[
    The optional third element, if present, is a Boolean value that
    indicates a direction, as follows:
    - false: LTR direction. The text is expected to be displayed
      with LTR base direction if standalone, and isolated with LTR
      direction (enclosed in LRI ... PDI or equivalent, see [1]) in
      the context of a longer string or text.
    - true: RTL direction. The text is expected to be displayed
      with RTL base direction if standalone, and isolated with RTL
      direction (enclosed in RLI ... PDI or equivalent, see [1]) in
      the context of a longer string or text.
    - absent: no indication is made about the direction
    - (explicit) null: no indication is made about the direction,
      but any directionality context applying to this element (e.g.,
      base directionality information for an entire CBOR message or
      part thereof) is ignored.
]]]]
[1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section 
2.7  Markup and Formatting Characters, 
https://www.unicode.org/reports/tr9/#Markup_And_Formatting

I'm not really sure yet about the 'absent' and 'null' entries, neither 
about whether they are really distinct nor about whether the 
specification is good enough (we might want to specify FIRST STRONG 
ISOLATE semantics).
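
For readers less familiar with isolation, here is a minimal Python 
sketch of what a consumer might do when inserting the text into a longer 
plain-text string. The code points are those defined in UAX #9: LRI 
opens a left-to-right isolate, RLI a right-to-left one. Using FSI when 
there is no direction indication is only the possibility floated above, 
not something the draft says, and the function name is made up.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, rtl=None) -> str:
    """Wrap text in a directional isolate before embedding it in other text.

    rtl is True for RTL, False for LTR, and None when no direction is
    indicated (FSI is used in that case, which is an assumption).
    """
    if rtl is True:
        opener = RLI
    elif rtl is False:
        opener = LRI
    else:
        opener = FSI
    return opener + text + PDI


# Hypothetical use: embedding a tagged string in a longer log line.
line = "Device reported: " + isolate("sensor offline", rtl=False)
print(line.encode("unicode_escape").decode("ascii"))
```

In markup-based environments the equivalent would be something like an 
isolating element with a direction attribute rather than these control 
characters; the sketch only covers the plain-text case.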

Hope this helps. Let's make sure together that we get this right.

Regards,    Martin.
