Re: [Ltru] [apps-discuss] Fwd: Defining a CBOR tag for RFC 5646 Language Tags

Dave Cridland <> Thu, 15 May 2014 07:44 UTC

Date: Thu, 15 May 2014 08:44:00 +0100
From: Dave Cridland <>
To: Doug Ewell <>
Cc: LTRU Working Group <>

On 14 May 2014 22:47, Doug Ewell <> wrote:

> Mark Davis ☕️ <mark at macchiato dot com> wrote:
> > I'm sure you're not implying that I think invalid UTF-8 would have
> > been a good idea, but your statement might not be clear to others.
> To clarify, I got that impression from Dave's remark, which is why I
> originally quoted it:
> ] Many years ago, Mark Crispin and Chris Newman had a proposal for
> ] embedding language tags in invalid UTF-8; I seem to recall they
> ] publicly renounced their proposal rather dramatically in favour of a
> ] Unicode Consortium proposal for embedding the language tags somewhere
> ] in Plane 14 - published as RFC 2482.
> ]
> ] The fact it was all initiated in order to support the pressing needs
> ] of ACAP might give you some hints as to why it never really took off,
> ] but as a counter-proposal to language tags in metadata, it might be
> ] worth re-examining.
> I'm not sure, upon re-reading this, whether Dave meant to say that Plane
> 14 tags or invalid UTF-8 was worth re-examining.


Of course, an invalid-UTF-8 based proposal simply means that it's no longer
UTF-8 per se, and so needs itself to be tagged differently. Other than
that, I don't see that it's a bad idea from a technical standpoint. The use
of the word "invalid" probably scares people, but I note that's really a
shorthand for "not backwards compatible with existing UTF-8 processors".
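
As a rough illustration of "not backwards compatible", here is a minimal
Python sketch. The 0xFF marker byte and the embed() helper are purely
hypothetical stand-ins and are not the encoding from the original proposal;
the only point is that a strict, off-the-shelf UTF-8 decoder rejects such a
stream outright.

```python
# Hypothetical marker: 0xFF never occurs in well-formed UTF-8, so it is used
# here only as a stand-in for "some byte sequence that is not valid UTF-8".
HYPOTHETICAL_MARKER = b"\xff"


def embed(lang: str, text: str) -> bytes:
    """Wrap a language tag in marker bytes and prepend it to UTF-8 text."""
    return (
        HYPOTHETICAL_MARKER
        + lang.encode("ascii")
        + HYPOTHETICAL_MARKER
        + text.encode("utf-8")
    )


def strict_utf8_accepts(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under these assumptions, strict_utf8_accepts(embed("en", "Hello")) is False,
which is why such a stream would need its own type label rather than being
passed off as UTF-8.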

Exactly the same caveats apply to Plane 14 tagging, mind, and moreover, we
could invent our own - indeed, that's what we're doing by having these
arrays of (tag, string) tuples.
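
For readers who have not met Plane 14 tagging, a minimal Python sketch
follows. It assumes the RFC 2482 layout: U+E0001 LANGUAGE TAG followed by
tag characters that mirror printable ASCII at an offset of 0xE0000. The
function names are illustrative only.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def plane14_prefix(lang: str) -> str:
    """Spell a language tag such as "en" using Plane 14 tag characters."""
    for ch in lang:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError("language tag must be printable ASCII")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in lang)


def tag_text(lang: str, text: str) -> str:
    """Prefix text with an inline Plane 14 language tag."""
    return plane14_prefix(lang) + text


def strip_plane14(text: str) -> str:
    """Drop every character in the U+E0000..U+E007F tag block."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)
```

Unlike the invalid-UTF-8 approach, the tagged string still round-trips
through an ordinary UTF-8 encoder and decoder; a processor that does not
know about the tags simply sees extra characters it does not recognise.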

Inline tagging is out of fashion, I agree - and as I said, it never really
caught on in part because the only protocol that cared - ACAP - never
caught on itself. What I was suggesting was that this technique exists, and
a new type for "language tagged string" is not impossible to arrange, and
actually aligns with the goal here.

Whether this is the right solution here is an entirely different question,
but it is an option I was hoping wouldn't be rejected out of hand - in no
small part because all we're doing here is inventing a new bespoke format
for CBOR to do the same thing.

The main consideration, I think, is what happens when a CBOR processor
encounters a language-tagged string when it doesn't understand the concept.
Extending UTF-8 seems generally undesirable, since the UTF-8 processor is
likely to be an off-the-shelf black box; Plane 14 tags show up as unknown
characters; and the array-of-tuples approach shows up as plain arrays of
tuples. I'm not sure any of those are desirable.
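
To show what a generic CBOR processor would actually see in the
array-of-tuples case, here is a small hand-rolled Python sketch. It uses
only the basic CBOR encoding rules (major type 3 for text strings, major
type 4 for arrays); the function names are illustrative, and this is not a
proposed wire format.

```python
def _head(major: int, n: int) -> bytes:
    """Encode a CBOR initial byte plus length for small values of n."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + n.to_bytes(2, "big")
    raise ValueError("sketch only handles lengths below 65536")


def encode_text(s: str) -> bytes:
    """CBOR text string (major type 3) holding UTF-8 bytes."""
    raw = s.encode("utf-8")
    return _head(3, len(raw)) + raw


def encode_pair(lang: str, text: str) -> bytes:
    """One (tag, string) tuple as a two-element CBOR array (major type 4)."""
    return _head(4, 2) + encode_text(lang) + encode_text(text)


def encode_tagged_strings(pairs: list[tuple[str, str]]) -> bytes:
    """An array of (tag, string) tuples, one per language variant."""
    return _head(4, len(pairs)) + b"".join(encode_pair(l, t) for l, t in pairs)
```

With this sketch, encode_pair("en", "Hello") yields the bytes
82 62 65 6e 65 48 65 6c 6c 6f. Nothing in those bytes says "language
tag": a decoder without extra knowledge reports a two-element array of
text strings and leaves the interpretation to the application.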

I'd further note that CBOR processors are, at this stage, likely to be
bespoke, so changes there seem acceptable. Either of the latter two options
(Plane 14 tags or arrays of tuples) can be handled by changing the CBOR
processor.

Finally, I don't have a horse in this race, I was just wondering if anyone
else had noticed there were other horses.