[core] CBOR Tag 38 (in draft-ietf-core-problem-details)

Carsten Bormann <cabo@tzi.org> Wed, 11 May 2022 12:17 UTC

Cc: core@ietf.org
To: cbor@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/core/gMSw1Ja9K5ZasWAlImhFP2NZM1E>

draft-ietf-core-problem-details is a specification that enters somewhat new territory for CoRE WG specifications:
It is concerned with carrying around data that include human-readable text, where the human is an end-user and not a programmer.

We could address this with something that is specific to CoRE, but it is better to use a mechanism that is common to the CBOR ecosystem.

Internationalization experts insist on having language-tags with human-readable text, as the rendering of the text may differ between different languages.
BCP47 is the way we do language-tags in the IETF.

We actually have a tag in the CBOR tags registry for that: CBOR tag 38, http://peteroupc.github.io/CBOR/langtags.html
This was registered by Peter Occil to be able to ship language-tags with text, and it solves the problem nicely.

Except that language-tags themselves have a problem:
Contrary to what may originally have been thought, they are mostly useless for establishing base direction.
Some writing systems are left-to-right (ltr), others are right-to-left (rtl) (and some are more complex, but not much in use commercially).

For most of us this problem is invisible, as we are firmly rooted in the left-to-right world.
W3C has written up some of this:

https://www.w3.org/International/questions/qa-direction-from-language.html
https://www.w3.org/TR/international-specs/#text_direction

TL;DR: As a language tag does not really indicate a writing system, you need to be able to establish a base direction separately, alongside the language-tag.
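
To make the point concrete, here is a minimal Python sketch of trying to derive a base direction from a language tag alone. The function name, the (incomplete) set of right-to-left scripts, and the "everything else is ltr" simplification are my own assumptions for illustration, not something taken from BCP47 or the W3C documents.

```python
# Hypothetical helper: guess a base direction from a BCP47 language tag.
# Assumption: only an explicit four-letter script subtag is trusted.
RTL_SCRIPTS = {"Arab", "Hebr", "Thaa", "Syrc", "Nkoo"}  # illustrative, incomplete

def direction_from_tag(tag):
    """Return "rtl", "ltr", or None if the tag carries no script subtag."""
    subtags = tag.split("-")
    for sub in subtags[1:]:
        # A four-letter alphabetic subtag after the language is treated as a script.
        if len(sub) == 4 and sub.isalpha():
            return "rtl" if sub.title() in RTL_SCRIPTS else "ltr"
    return None  # the tag alone does not tell us

print(direction_from_tag("az"))       # no script subtag: undetermined
print(direction_from_tag("az-Arab"))  # explicit Arabic script
print(direction_from_tag("az-Latn"))  # explicit Latin script
```

A bare "az" gives no answer, because Azerbaijani is written in both Latin and Arabic script; only the less commonly used script subtag would help, and even then this is a guess about the writing system rather than a statement about the base direction of the text.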

So, it seems that CBOR tag 38 has a bug, in that it doesn’t address this essential component of using language-tags with text.

How do we fix this bug?
(1) we could define another CBOR tag (or a pair of tags) that just establishes the base direction for its tag content.
(2) as the most straightforward way of addressing the bug, we could add an (optional) base direction to the array inside tag 38.

The latter could be done in two ways:
(2a) we define a new tag, let’s call it N38 here, that allows the optional element.
(2b) we could fix up 38 to allow the optional element.
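
As a rough illustration of what option 2 means on the wire, here is a small hand-rolled Python encoder for tag 38 with an optional third array element. The encoding of the direction as CBOR false/true is purely an assumption made for this sketch; this message does not specify how the element would be represented. The helper names are hypothetical as well.

```python
def encode_head(major, value):
    """Encode a CBOR initial byte plus argument (sketch: values below 2**16 only)."""
    if value < 24:
        return bytes([(major << 5) | value])
    if value < 256:
        return bytes([(major << 5) | 24, value])
    if value < 65536:
        return bytes([(major << 5) | 25]) + value.to_bytes(2, "big")
    raise ValueError("value too large for this sketch")

def encode_text(s):
    """Encode a CBOR text string (major type 3)."""
    data = s.encode("utf-8")
    return encode_head(3, len(data)) + data

# Assumption for illustration only: ltr -> CBOR false, rtl -> CBOR true.
DIRECTION = {"ltr": b"\xf4", "rtl": b"\xf5"}

def encode_tag38(lang, text, direction=None):
    """Tag 38 around [language-tag, text] plus an optional direction element."""
    items = [encode_text(lang), encode_text(text)]
    if direction is not None:
        items.append(DIRECTION[direction])
    # Major type 6 is a tag, major type 4 is an array.
    return encode_head(6, 38) + encode_head(4, len(items)) + b"".join(items)

print(encode_tag38("en", "Hello").hex())
print(encode_tag38("he", "שלום", "rtl").hex())
```

Without a direction, the output is byte-for-byte what a producer of today's tag 38 emits; only the three-element form is new. Under 2a, the same array would simply sit under a different tag number.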

So far, we have not done the equivalent of 2b to any CBOR tag.
E.g., for tags 4 and 5 there are analogous tags 264 and 265 that simply increase the range of exponents (by allowing big integers).
For tags 4 and 5, this is probably actually the right way to handle this belated extension: Most applications of tags 4/5 do not need the extended exponent range.

But tag 38 really is broken the way it is.
Of course, it works fine where you don’t happen to need the indication of a base direction.
But it is not like this is a different area of application (like with 264 and 4): *Any* language tagging really needs this.
Any application that limits itself to what tag 38 can do as it stands today is essentially broken from an internationalization point of view.

To me, this looks like a strong incentive to do 2b.
But this (expanding the domain of an existing tag) is a new thing in CBOR land, so I thought I might want to bring this up here.

Specification-technically, the reference document for tag 38 would change from Peter’s document to the appendix in the problem-details specification.

2a would probably need another tag from the 1+1-byte range for N38; otherwise applications may not migrate to the new tag (cognitive dissonance).
The use of N38 is never forward compatible:
An application that switches to the new tag can’t interoperate with one that expects 38 even if it does not need to establish a base direction.
Experience with such a situation indicates that this often creates a deployment barrier to migrating to the new form.

2b clearly is backward compatible: old data (and old producers) work with new consumers.
It is not forward compatible (new data/new producers do not work with non-updated old consumers) only when the base direction feature is actually used.
This is very similar to a lot of other extensions we do routinely, such as adding new Unicode characters.
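
The compatibility argument for 2b can be sketched in a few lines of Python operating on the already-decoded content of tag 38. This assumes an old consumer that strictly checks for a two-element array; the function names and the use of the string "rtl" as a stand-in for the direction element are hypothetical.

```python
def old_consumer(content):
    """Not-yet-updated consumer: accepts only [language-tag, text]."""
    if len(content) != 2:
        raise ValueError("expected [language-tag, text]")
    lang, text = content
    return lang, text, None

def new_consumer(content):
    """Updated consumer: also accepts an optional third element."""
    if len(content) == 2:
        lang, text = content
        return lang, text, None
    if len(content) == 3:
        lang, text, direction = content
        return lang, text, direction
    raise ValueError("expected 2 or 3 elements")

old_data = ["en", "Hello"]             # old producer
new_data_plain = ["en", "Hello"]       # new producer, feature unused
new_data_dir = ["ar", "مرحبا", "rtl"]  # new producer, feature used

print(new_consumer(old_data))        # backward compatible
print(old_consumer(new_data_plain))  # still fine for old consumers
try:
    old_consumer(new_data_dir)
except ValueError as err:
    print("old consumer rejects:", err)  # only this combination breaks
```

With N38 under 2a, by contrast, the old consumer would fail on all data from a new producer, as the tag number itself changes.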

Of course, I’m assuming here that applications that employ language tags will want to employ them correctly and therefore will be updated to consume base direction indication.
With that, I believe that 2b is the best way to handle this bug.
We don’t have a feature matrix showing CBOR implementations and tags supported right in the generic decoder, but I would surmise that the number of implementations that actually have to change to accommodate base directions is still quite manageable.

We will discuss problem-details in a couple of hours in the CoRE interim meeting:
See <https://mailarchive.ietf.org/arch/msg/core/t0vAGs8hCGqFgsSKgp9TEXEdzp0> for details.
If we want to delve more deeply into the impacts that updating this tag would have on the CBOR ecosystem, we can do this a week later in the CBOR interim.

As an SDO (3GPP) waits for us to complete problem-details, we’ll need a decision soon.
So please reply to this message (which reaches both WGs), or come to one of the meetings.
(I didn’t include an i18n mailing list, because the i18n discussion is complex, but has essentially converged.)

Oh, and while you read the W3C documents, please note that neither the existing tag 38 nor the various proposals to fix/complement it solve delimiting embedded runs; this would be getting a bit closer to creating a little markup language though.

Grüße, Carsten