Re: [Ltru] [apps-discuss] Fwd: Defining a CBOR tag for RFC 5646 Language Tags

"Doug Ewell" <> Thu, 15 May 2014 15:40 UTC

From: "Doug Ewell" <>
To: "Dave Cridland" <>
Date: Thu, 15 May 2014 08:39:55 -0700
Cc: LTRU Working Group <>
Subject: Re: [Ltru] [apps-discuss] Fwd: Defining a CBOR tag for RFC 5646 Language Tags

Dave Cridland <dave at cridland dot net> wrote:

> Of course, an invalid-UTF-8 based proposal simply means that it's no
> longer UTF-8 per-se, and so needs itself to be tagged differently.
> Other than that, I don't see it's a bad idea from a technical
> standpoint. The use of the word "invalid" probably scares people, but
> I note that's really a shorthand for "not backwards compatible by
> existing UTF-8 processors".

The proposal from 1997 ("MLSF") did call it an extra layer on top of
UTF-8, and included lots of health warnings that it was not really
UTF-8. That didn't remove the danger, though, because it looked so much
like UTF-8. John's response about decoders was spot-on.

> Exactly the same caveats apply to Plane 14 tagging, mind, and
> moreover, we could invent our own - indeed, that's what we're doing by
> having these arrays of (tag, string) tuples.

Since CBOR processors are likely to be bespoke, as you said, but the
underlying UTF-8 processor is not -- as you also said -- it would be
much easier to strip out Plane 14 characters for display if necessary
than to implement a whole new UTF-8–like decoding-stream layer that
understands MLSF.
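
To make the "strip for display" point concrete, here is a minimal sketch in Python. It assumes the text has already passed through an ordinary UTF-8 decoder and simply drops every code point in the Unicode Tags block (U+E0000..U+E007F). The function names are mine, chosen for illustration, and are not from any library or from the CBOR proposal.

```python
# Unicode "Tags" block in Plane 14: U+E0000..U+E007F.
TAGS_FIRST = 0xE0000
TAGS_LAST = 0xE007F


def strip_plane14_tags(text: str) -> str:
    """Return text with every Plane 14 tag character removed."""
    return "".join(
        ch for ch in text if not (TAGS_FIRST <= ord(ch) <= TAGS_LAST)
    )


def plane14_language_tag(lang: str) -> str:
    """Hypothetical helper: U+E0001 LANGUAGE TAG followed by tag
    characters that mirror the ASCII letters of `lang`."""
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)


# The tagged string is still valid UTF-8, so a stock decoder handles it.
raw = (plane14_language_tag("fr") + "bonjour").encode("utf-8")
decoded = raw.decode("utf-8")

# Stripping happens after decoding, as a plain filter over code points.
display_text = strip_plane14_tags(decoded)
```

No change to the decoder is needed here, which is the contrast with an MLSF-style layer that would have to sit inside the byte-level decoding step.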

As Mark knows, I never bought into the deprecation argument about how
evil Plane 14 tag characters are. Handling them correctly just isn't
that difficult. For CBOR, you may be better off with the tag/string
tuples; the tags in that case are much easier to see and don't need to
be stripped from the string for display or comparison. But if this
"tagged text" model is too far out of step with the CBOR/JSON way of
thinking, Plane 14 is out there.
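
For comparison, a rough sketch of the tag/string tuple model, again in Python. The layout (a list of (language tag, text) pairs) and the helper names are assumptions for illustration only; this is not the encoding from the CBOR proposal, and no CBOR library is involved.

```python
from typing import List, Tuple

# Hypothetical layout: each run is (language tag, text).
TaggedRun = Tuple[str, str]


def display_text(runs: List[TaggedRun]) -> str:
    """Join the text parts; the tags never enter the string."""
    return "".join(text for _tag, text in runs)


def same_text(a: List[TaggedRun], b: List[TaggedRun]) -> bool:
    """Compare two tagged sequences by text only, ignoring tags."""
    return display_text(a) == display_text(b)


runs = [("en", "Hello, "), ("fr", "monde")]
shown = display_text(runs)
```

Because the tags live beside the strings rather than inside them, display and comparison need no stripping pass at all.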

Doug Ewell | Thornton, CO, USA | @DougEwell