Re: [Ltru] [apps-discuss] Fwd: Defining a CBOR tag for RFC 5646 Language Tags

"Martin J. Dürst" <> Fri, 16 May 2014 09:18 UTC

Date: Fri, 16 May 2014 18:17:57 +0900
From: =?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?= <>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.5.0
MIME-Version: 1.0
To: Doug Ewell <>, Dave Cridland <>
References: <>
In-Reply-To: <>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: LTRU Working Group <>
Subject: Re: [Ltru] [apps-discuss] Fwd: Defining a CBOR tag for RFC 5646 Language Tags

On 2014/05/16 00:39, Doug Ewell wrote:
> Dave Cridland <dave at cridland dot net> wrote:
>> Of course, an invalid-UTF-8 based proposal simply means that it's no
>> longer UTF-8 per-se, and so needs itself to be tagged differently.

But this is highly counter-productive. US computer and Internet 
technology was so successful, among other things, because everything was 
ASCII. We are finally getting close to a place where (almost) everything 
is UTF-8. Some of us knew already in 1997 (or even earlier) that that was 
the direction to go. UTF-8 "variants" would have killed a lot of the 
advantages of moving towards UTF-8.

>> Other than that, I don't see it's a bad idea from a technical
>> standpoint. The use of the word "invalid" probably scares people, but
>> I note that's really a shorthand for "not backwards compatible by
>> existing UTF-8 processors".

It was not such a bad idea on the level of "let's use a screw to hold 
these two pieces of metal together". It was a very bad idea because much 
of technological (prospect of) success is or has to be measured not by 
how many good pieces of technology you have (how many different sizes of 
screws), but by how *few* of them you have.

> The proposal from 1997 ("MLSF") did call it an extra layer on top of
> UTF-8, and included lots of health warnings that it was not really
> UTF-8.

Only after quite a bit of pressure on the authors (including from me). 

> That didn't remove the danger, though, because it looked so much
> like UTF-8. John's response about decoders was spot-on.

It was an extremely ugly chameleon mixing character encoding with 
higher-level information, messing around with the structural cleanness 
and heuristic detectability of UTF-8, and inviting all kinds of other 
crazy kludges for other "UTF-8-but-not-quite" chimeras.

>> Exactly the same caveats apply to Plane 14 tagging, mind, and
>> moreover, we could invent our own - indeed, that's what we're doing by
>> having these arrays of (tag, string) tuples.

Plane 14 language tags are strictly within UTF-8 and also of course work 
with UTF-16. They are therefore quite a bit less bad than MLSF, but 
still bad enough.
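
To make the "strictly within UTF-8" point concrete, here is a minimal 
Python sketch. It assumes the usual Plane 14 layout (U+E0001 LANGUAGE TAG 
followed by tag characters that mirror ASCII at an offset of U+E0000); 
the helper names plane14_prefix and strip_plane14 are made up for 
illustration. It also shows the cost: the tags have to be stripped again 
before display or comparison.

```python
# Sketch only: helper names are hypothetical, not from any library.
LANGUAGE_TAG = 0xE0001   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000     # tag characters mirror ASCII at this offset


def plane14_prefix(lang: str) -> str:
    """Return the Plane 14 tag sequence for an ASCII language tag."""
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)


def strip_plane14(text: str) -> str:
    """Remove all Plane 14 tag characters (U+E0000..U+E007F) from text."""
    return "".join(c for c in text if not (0xE0000 <= ord(c) <= 0xE007F))


tagged = plane14_prefix("ja") + "日本語"

# The tagged string is ordinary Unicode, so it round-trips through
# standard UTF-8 and UTF-16 codecs without any special handling.
utf8_bytes = tagged.encode("utf-8")
utf16_bytes = tagged.encode("utf-16-le")
assert utf8_bytes.decode("utf-8") == tagged
assert utf16_bytes.decode("utf-16-le") == tagged

# But the visible text is only recovered after stripping the tags.
visible = strip_plane14(tagged)
```

Each tag character costs four bytes in UTF-8 and a surrogate pair in 
UTF-16, and every consumer has to know to call something like 
strip_plane14.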

> As Mark knows, I never bought into the deprecation argument about how
> evil Plane 14 tag characters are. Handling them correctly just isn't
> that difficult.

There are hundreds of ideas that look "not that difficult" to implement. 
But usually, everything turns out to be more difficult than estimated, 
and what's more important, the combinations of the different ideas turn 
out to be the killer.

In some ways, plane 14 language tags were born dead. Putting them in 
plane 14 was an explicit decision that sent a clear message that there 
was no expectation that they would or should be used frequently.

> For CBOR, you may be better off with the tag/string
> tuples; the tags in that case are much easier to see and don't need to
> be stripped from the string for display or comparison. But if this
> "tagged text" model is too far out of step with the CBOR/JSON way of
> thinking, Plane 14 is out there.

As far as I understand, the tagged text model should work well, about as 
well as lang/xml:lang attributes for HTML and XML.
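
As a rough illustration of that model, here is a small Python sketch 
that keeps the language tag next to the string rather than inside it. The 
type alias TaggedText and the helpers plain_text and runs_in are 
hypothetical names; no particular CBOR library or CBOR tag number is 
assumed, only the (tag, string) tuple shape discussed in this thread.

```python
from typing import List, Tuple

# Hypothetical shape: a list of (language tag, string) runs.
TaggedText = List[Tuple[str, str]]


def plain_text(runs: TaggedText) -> str:
    """Concatenate the strings; nothing needs to be stripped first."""
    return "".join(text for _lang, text in runs)


def runs_in(runs: TaggedText, lang: str) -> List[str]:
    """Return the strings whose language tag matches, ignoring case."""
    return [text for tag, text in runs if tag.lower() == lang.lower()]


greeting: TaggedText = [("en", "Hello, "), ("ja", "世界")]

display = plain_text(greeting)
japanese = runs_in(greeting, "JA")
```

The strings stay plain UTF-8 throughout, which is the same separation of 
text and metadata that lang/xml:lang gives in markup.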

Regards,   Martin.