Re: [core] [art] Artart last call review of draft-ietf-core-problem-details-05

Hi Martin,

Thanks for your excellent and detailed comments and rationales.  I agree that
a detailed CDDL for language tags is a *bad* idea.

I agree that a separate draft for CBOR language tags would be preferable, to
raise the visibility of internationalization in other RFCs and in other SDOs.

I suggest that CORE or other IETF WGs specifically focused on constrained
devices are the wrong places to write and publish such a language tag RFC,
because CBOR is certainly *not* primarily for constrained devices anymore.
I would prefer to see the separate RFC developed in the CBOR WG.

In many IETF WGs (RATS, SACM, TEEP, etc.) and many other SDOs, CBOR
is being heavily used for the entire spectrum of computing devices (especially
routers and switches for telecom networks and servers for enterprise networks).
CBOR is simply a better mousetrap, IMO.

Cheers,
- Ira

On Thu, Jun 23, 2022 at 2:48 AM Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:

> Dear Core and I18N experts,
>
> Some comments on the I18N aspects of Tag 38 below.
>
> [Sorry this answer took so long, and got so long. The two 'long's
> influenced each other :-).]
>
> On 2022-06-16 01:23, Carsten Bormann wrote:
> >
> > Hi Harald,
> >
> > thank you for this thoughtful review.
>
> >> The “Tag 38 internationalized string”
> >> This document adds an appendix defining an “internationalized string” format
> >> that adds a BCP 47 language tag and a Unicode-based direction indicator to a
> >> UTF-8 string. This is laudable; RFC 2277 section 4 pointed out the need for
> >> this ability 24 years ago.
>
> I think that Language-Tagged Strings (CBOR Tag 38,
>
> https://datatracker.ietf.org/doc/html/draft-ietf-core-problem-details-06#appendix-A)
>
> are a very good step ahead. At least for CBOR, in many cases from now
> on, the answer might just be "use Tag 38" (assuming we get the details
> right).
>
>
> >> Unfortunately neither definition is problem-free.
> >>
> >> First of all, this tag, if useful at all, is of far greater utility than the
> >> error format. Burying it in an appendix of a document whose stated purpose is
> >> something else makes it far more difficult to refer to than it needs to be.
> >
> > That is usually not a problem.  The focal point for finding a CBOR tag
> > for a specific application is the CBOR tag registry; this then points to
> > the places where the specifications for the tags can be found (which in
> > this case is easily expressed as “Appendix A of RFC XXXX”).
>
> Separate Draft or Not
> =====================
>
> I agree with Harald that it should be a separate draft; it would
> definitely help with visibility of I18N in general and the issue of
> strings with language and directionality information inside and outside
> the IETF (not only the visibility within the CBOR community, which may
> be covered by the tag registry). Being able to say "look at RFC XXXX for
> a good example" is way better than being able to say "look at appendix X
> of RFC YYYY for a good example".
>
> I understand Francesca's arguments, too, but I think the investment in a
> separate draft would be well worth the effort. I'm willing to contribute
> although I guess that Carsten would do the necessary work in less time
> than it takes him to get anybody else up to speed.
>
>
> >> Second, the “detailed semantics” has chosen to include the quite complex BNF of
> >> RFC 5646 translated into CDDL; this may have some use, but BCP 47 is a moving
> >> target;
> >
> > We intend tag38 to be useful for the current form of BCP 47, so it is
> > hard to plan for the future.  If BCP 47 needs to be considered unstable, we
> > could of course define a “bcp47-extension” alternative with a CDDL feature
> > control operator.
>
> (NOT!) Copying BCP 47 Grammar
> =============================
>
> I also agree with Harald that the definition of 'Language-Tagged
> Strings' has room for improvement. First, as Harald said, it repeats the
> BCP 47 grammar when we very well know that repeating grammars is usually
> a bad idea. I'm really not sure why CBOR wants to check each and every
> detail of the current language tag syntax. My understanding was that
> CBOR was (among other things, if not primarily) for constrained devices. I just
> cannot see the motivation of embedding a list of legacy tags into a
> constrained device.
>
> I also don't know about other technology on a similar level as CBOR that
> would do so. As an example, XML had productions 33-38 (see
> https://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag), but they were
> removed as early as 2000 (see
> https://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag), for very good
> reasons. I really have difficulty imagining why CBOR would want to
> make the same mistake that XML fixed more than 20 years ago.
>
> Similarly, XML Schema Datatypes only gives a very simple regular
> expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes
> (see https://www.w3.org/TR/xmlschema11-2/#language):
>
> [[[[
> Note: The regular expression above provides the only normative
> constraint on the lexical and value spaces of this type. The additional
> constraints imposed on language identifiers by [BCP 47] and its
> successor(s), and in particular their requirement that language codes be
> registered with IANA or ISO if not given in ISO 639, are not part of
> this datatype as defined here.
> ]]]]
> Again, XML Schema would have done something more precise if anybody had
> been convinced that such precision made sense.
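
As a rough illustration of how small such a check is, here is a minimal
Python sketch that applies only the XML Schema regular expression quoted
above. The names LANGUAGE_SHAPE and looks_like_language_tag are made up for
this example; nothing here comes from the draft or from BCP 47 itself.

```python
import re

# Pattern quoted from XML Schema Datatypes; it constrains only the overall
# shape (subtags of 1 to 8 characters separated by hyphens).
LANGUAGE_SHAPE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(value: str) -> bool:
    """Return True if value has the general shape of a language tag.

    This is a plausibility test only: it does not consult the IANA
    registry and does not implement the RFC 5646 grammar.
    """
    return LANGUAGE_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["fr", "en-US", "en-UK", "Hello, world", ""]:
        print(repr(candidate), looks_like_language_tag(candidate))
```

Such a check accepts "en-UK" (well-shaped but not registered) and rejects
ordinary text such as "Hello, world", which is about all a shape test can
be expected to do.
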
>
>
> Another way to see this is that in general, when imposing restrictive
> syntactic rules, there's the question of "bang for the buck". The
> complexity of the language tag syntax rules, down to the legacy
> (grandfathered) stuff, means that the cost ("buck") is quite high. This
> not only includes implementation and memory footprint, but also testing
> and everything else.
>
> On the other hand, the "bang" is quite low, for two reasons:
> First, without a check against the registry, a lot of garbage still can
> go through. Think e.g. "en-UK", which looks reasonable and fits the
> grammar, but is not allowed (UK is not a country code, "en-GB" is
> correct). Second, most actual language tags, in particular for
> constrained devices, are more on the level of "fr" or "en-US", which
> means that on most actual data, the full syntax isn't really exercised.
> Which further means that software with implementation bugs in the syntax
> testing part doesn't get weeded out.
>
> The main mechanisms (if any) that will help to make sure these language
> tags are correct are the following:
> 1) On the 'sender' side, texts will be translated, by "hand" or using
> some localization tools, and the correct language tags will be set there
> (because somebody translating to Ukrainian, or their tool, knows the
> correct tag is "uk", and not something else).
> 2) On the 'receiver' side, user preferences will be expressed as
> language tags (or prefixes,...), which should assure that correctly
> tagged data gets shown and incorrectly tagged data gets ignored.
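
The receiver-side mechanism in 2) can be sketched in a few lines of Python.
This assumes prefix matching in the style of RFC 4647 basic filtering (a
range matches a tag if it equals the tag, or equals a prefix of it that is
followed by "-", compared case-insensitively). The function names and the
sample data are hypothetical.

```python
def matches_range(language_range: str, tag: str) -> bool:
    """Case-insensitive prefix match in the style of RFC 4647 basic filtering."""
    if language_range == "*":
        return True
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def select_texts(preferences, tagged_texts):
    """Return the texts matching the first user preference that matches anything.

    preferences: language ranges in priority order, e.g. ["uk", "en"].
    tagged_texts: (language_tag, text) pairs as received.
    """
    for pref in preferences:
        hits = [text for tag, text in tagged_texts if matches_range(pref, tag)]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    received = [("en-US", "Not found"), ("fr", "Introuvable")]
    print(select_texts(["uk", "en"], received))
```

No grammar check is involved: a mistagged or unknown tag simply fails to
match the user's preferences and is not shown.
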
>
> To summarize, copying the grammar from BCP 47 brings extremely little
> bang for rather high costs. Get rid of it in the same way other
> standards which have thought this through have gotten rid of a detailed
> grammar. If you want something that gives you a minimal plausibility
> test (catch cases where e.g. the text and the language tag got swapped
> by some accident,...), do what XML Schema did.
>
> This will also be future proof. There are many changes to BCP 47 that
> have been discussed in the past (although none of these got traction, or
> are expected to get traction in the near future), but changing the basic
> syntax constraint expressed by XML Schema was never considered an
> option. On the other hand, it was always clear to the people involved
> that users of language tags shouldn't create artificial barriers to
> future changes. It would be really a pity if CBOR created such a barrier
> just because they could. Things such as "CDDL feature control operators"
> are great where they actually serve a purpose, here I don't think they
> would.
>
>
> Directionality Information
> ==========================
>
> Regarding language tags, in addition, there is the following note:
> [[[[
> NOTE: The Unicode Standard [Unicode-14.0.0] includes a set of
>     characters designed for tagging text (including language tagging), in
>     the range U+E0000 to U+E007F.  Although many applications, including
>     RDF, do not disallow these characters in text strings, the Unicode
>     Consortium has deprecated these characters and recommends annotating
>     language via a higher-level protocol instead.  See the section
>     "Deprecated Tag Characters" in Section 23.9 of [Unicode-14.0.0].
> ]]]]
> It's weird for the IETF to refer (only) to the Unicode standard here
> even though the IETF has deprecated this kind of language tagging in RFC
> 6082 (see https://www.rfc-editor.org/rfc/rfc6082.html). So please cite
> that RFC.
>
>
> >> having CDDL parsers try to validate tags according to this grammar is
> >> not going to be useful. If included at all, this needs to be clearly marked
> >> with text saying that BCP 47 is normative for this grammar, and that language
> >> tag parsers should NOT try to reject tags based on this grammar; instead, they
> >> should be treated as strings, and looked up against relevant language handling
> >> APIs. (“zh-ZZ” is perfectly valid according to the grammar, but is semantically
> >> invalid according to BCP 47).
> >
> > Here again, it is hard to capture semantics in a structural definition.
> > Our document is going to reference RFC 5646 (including its ABNF), as
> > that is the current definition; if BCP 47 is updated, the effect of that
> > update on this document will need new consideration.
>
> No, please. I understand that in some areas, you don't want to allow
> gratuitous changes to your network and software based on changes to
> technology that you use. But for language tags, such a mindset is really
> counterproductive. Some of the changes to BCP 47 that have been
> discussed are to include some subtags for dialects. Now if such a change
> happened, there are two questions relevant for CBOR:
> 1) How many cases would there be in the CBOR landscape where people
> would want to use such subtags? The answer would probably be: Very few,
> so a change (using a "CDDL feature control operator" or whatever) would
> have very low priority. But why should people be prohibited from using
> such subtags if they want to use them?
> 2) What's the problem in letting such subtags through the current
> infrastructure? My guess is that there's no problem at all. When there
> are parallel texts, one tagged with "en-US" and the other with one of
> these dialect subtags, the chance is very high that a recipient will be
> displaying the former. Would that be a problem?
>
>
> >> Note also that the sentence “Data items with tag
> >> 38 that do not meet the criteria above are invalid (see Section 5.3.2 of
> >> [STD94]).” is really hard to parse semantically, given that section 5.3.2 of
> >> RFC 8949 doesn’t use the word “invalid”, it uses “inadmissible value”. I do not
> >> recommend rejecting unknown language tags.
> >
> > They may not be rejected, they are just not “valid” in the RFC 8949 sense
> > (they are still well-formed).  I would expect language tags to evolve
> > within the grammar defined by RFC 5646 (which does have an extension
> > point); if that is a mistaken assumption, please let us know.
>
> In the short term (my average guess at "short term" would be 10 years or
> so), evolution *within* RFC 5646 is definitely the main focus. In the
> really long term, I guess anything that fits the XML Schema production
> is fair game. That restriction has been there since the original RFC
> 1766, and provides some actual "bang for the buck". It is also baked
> into technologies such as XML Schema, which would provide a very strong
> argument to not give up on it. In all the work on revising RFC 1766
> (which I co-chaired, and which was quite long-winded), the rule
> that each subtag had to be 8 characters or less was never seriously
> disputed.
>
>
> >> Thirdly, the definition of the tri-state direction attribute can be made
> >> clearer; in particular, the Unicode Bidirectional Algorithm (UAX#9) should be
> >> referenced, with particular reference to
> >> https://www.unicode.org/reports/tr9/tr9-44.html#Markup_And_Formatting - the
> >> important property here is that the desired semantic is isolation - the markup
> >> is intended to have zero influence on strings outside the embedded string - the
> >> semantics of embedding in RLI…PDI is the desired effect.
> >
> > Tag38 does not provide a way to handle embedding, so we are not trying
> > to boil that ocean yet.
>
> Again, I agree with Harald here. But first, please be careful.
> "embedding" has a very narrow technical meaning in the Bidi Algorithm
> (UAX #9). Tag 38 doesn't need a way to handle embeddings in this sense.
> When Harald used the term "embedded string", he didn't use "embedded" in
> this very narrow technical sense, but in a more general sense, namely
> that the string from Tag 38 is expected to be put into some
> (surrounding) context. That might mean that it shows up by itself
> somewhere, or that it gets included in a larger text of some sorts.
>
> In the draft, you have the following text:
> [[[[
>     The optional third element, if present, is a Boolean value that
>     indicates a direction: false for "ltr" direction, true for "rtl"
>     direction.  If the third element is absent, no indication is made
>     about the direction; it can be explicitly given as null to express
>     the same while overriding any context that might be considered
>     applying to this element.  Note that the proper processing of
>     Language and Direction Metadata is an active area of investigation;
>     the reader is advised to consult ongoing standardization activities
>     such as [STRING-META] when processing the information represented in
>     this tag.
> ]]]]
>
> [override is also a technical term in the Bidi Algorithm]
>
> I think this text is very important, so I'll go into some details.
> First (minor nit), it says "If the third element is absent ...". Because
> this is in a paragraph that starts with "The optional third element
> ...", I think it would be better to say "If this element is absent ...".
>
> Next, let me make sure that I get this right: This is a Boolean value,
> but it can in effect have four different states, yes? That would be:
> - True (rtl)
> - False (ltr)
> - null (no indication about direction, but overriding any context)
> - absent (no indication about direction, but context may apply)
> If that's true, then it might be good to put that into a more structured
> form (something like the above list).
>
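
To make the four states listed above concrete, here is a hand-rolled Python
sketch of how they would differ on the wire. It assumes the Tag 38 content is
an array of [language tag, text, optional direction], in that order; the
element order, the helper names, and the _ABSENT sentinel are assumptions of
this sketch, not text from the draft. The byte values for false, true and
null (0xf4, 0xf5, 0xf6) are the standard CBOR simple values from RFC 8949.

```python
import struct

_ABSENT = object()  # sentinel: third element left out entirely


def _head(major: int, n: int) -> bytes:
    """Encode a CBOR initial byte plus argument for a major type and value n."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    raise ValueError("value too large for this sketch")


def _text(s: str) -> bytes:
    """Encode a CBOR text string (major type 3) as UTF-8."""
    data = s.encode("utf-8")
    return _head(3, len(data)) + data


# CBOR simple values for the three explicit direction states.
_SIMPLE = {False: b"\xf4", True: b"\xf5", None: b"\xf6"}


def encode_tag38(lang: str, text: str, direction=_ABSENT) -> bytes:
    """Encode tag 38 wrapping [lang, text] or [lang, text, direction]."""
    items = [_text(lang), _text(text)]
    if direction is not _ABSENT:
        items.append(_SIMPLE[direction])
    # Major type 6 is a tag, major type 4 is an array.
    return _head(6, 38) + _head(4, len(items)) + b"".join(items)


if __name__ == "__main__":
    for label, args in [("absent", ()), ("ltr", (False,)),
                        ("rtl", (True,)), ("null", (None,))]:
        print(label, encode_tag38("en", "Hello", *args).hex())
```

Under these assumptions, "absent" yields a two-element array, while the other
three states yield a three-element array that differs only in its final byte.
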
> [very major point] The main problem is with the last sentence. There's
> not much of a point in defining a field for directionality if it's not
> clear what that is supposed to be used for. I'm also not sure where the
> claim "the proper processing of Language and Direction Metadata is an
> active area of investigation" came from, and why it is here.
>
> It is true that there are some areas of bidi processing (e.g. the best
> consistent way to display IRIs that contain pieces of text from both
> directionalities) that are not solved yet, or (as in the example just
> given) are not even actively being investigated because the general
> agreement is that the problem is too difficult to have a solution.
> It is also true that "Strings on the Web: Language and Direction
> Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.
>
> But neither of these facts should have to influence the specification of
> Tag 38. [StringMeta] (3.4 What consumers need to do to support
> direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald
> and I all agree about what the right thing to do is: Use Bidi isolation
> (in the technical sense of
> https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).
>
> So given all the above considerations, what about rewriting the
> paragraph under consideration along the following lines:
>
> [[[[
>     The optional third element, if present, is a Boolean value that
>     indicates a direction, as follows:
>     - false: LTR direction. The text is expected to be displayed
>       with LTR base direction if standalone, and isolated with LTR
>       direction (enclosed in LRI ... PDI or equivalent, see [1]) in
>       the context of a longer string or text.
>     - true: RTL direction. The text is expected to be displayed
>       with RTL base direction if standalone, and isolated with RTL
>       direction (enclosed in RLI ... PDI or equivalent, see [1]) in
>       the context of a longer string or text.
>     - absent: no indication is made about the direction
>     - (explicit) null: no indication is made about the direction,
>       but any directionality context applying to this element (e.g.,
>       base directionality information for an entire CBOR message or
>       part thereof) is ignored.
> ]]]]
> [1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section
> 2.7  Markup and Formatting Characters,
> https://www.unicode.org/reports/tr9/#Markup_And_Formatting
>
> I'm not really sure yet about the 'absent' and 'null' entries, neither
> whether they are really distinct nor whether the specification is good
> enough (we might want to specify FIRST STRONG ISOLATE semantics).
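
For a consumer that inserts the string into a longer plain-text message,
isolation could look like the following Python sketch. The code points are
the Unicode isolate controls (LRI U+2066, RLI U+2067, FSI U+2068, PDI
U+2069). Mapping both 'absent' and 'null' to FIRST STRONG ISOLATE is an
assumption taken from the open question above, not settled behavior, and the
function name isolate is made up for this example.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction=None) -> str:
    """Wrap text in bidi isolate controls according to the direction element.

    direction: False for ltr, True for rtl, anything else is treated as
    "no indication" and falls back to FSI (an assumption of this sketch).
    """
    if direction is False:
        opener = LRI
    elif direction is True:
        opener = RLI
    else:
        opener = FSI
    return opener + text + PDI


if __name__ == "__main__":
    message = "Error: " + isolate("\u05e9\u05d2\u05d9\u05d0\u05d4", True) + " (code 404)"
    print([hex(ord(c)) for c in message])
```

Because the inserted string is closed with PDI, its direction has no effect
on the surrounding "Error: ... (code 404)" text, which is the isolation
property asked for in the review. Markup-based environments would use their
own equivalent instead of control characters.
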
>
>
> Hope this helps. Let's make sure together that we get this right.
>
> Regards,    Martin.