Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

John C Klensin <john-ietf@jck.com> Thu, 30 April 2020 04:09 UTC

Date: Thu, 30 Apr 2020 00:09:32 -0400
From: John C Klensin <john-ietf@jck.com>
To: Asmus Freytag <asmusf@ix.netcom.com>, i18ndir@ietf.org
Message-ID: <7584276CBB9741AD64AC3863@PSB>
In-Reply-To: <0c3f5982-108d-81e0-29a4-ce67e7685f2e@ix.netcom.com>
References: <E552C138-7938-42BD-B2B2-26AD8AA43516@nostrum.com> <A93B38FC-7D55-4D06-80AE-F165F242F259@episteme.net> <31CF68D680D76D7F45FAB3E2@PSB> <A9854982-3696-46FF-AD5C-8088CFCDD8FC@frobbit.se> <E67F0F68A403F5E4E5D8F476@PSB> <0c3f5982-108d-81e0-29a4-ce67e7685f2e@ix.netcom.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/bj9D1ceI_lkIhJ6WS8sDN76Wl3w>
Subject: Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

Asmus, good to hear from you.

Inline below.

--On Wednesday, April 29, 2020 16:54 -0700 Asmus Freytag
<asmusf@ix.netcom.com> wrote:

>...
> +4 Aren't the UTF-16 BOMs illegal as UTF-8 byte
> sequences, or am I misremembering? If so, a UTF-16 BOM should
> override or flag as invalid any UTF-8 declaration.

Your memory is correct.  The question is, if charset="UTF-8" is
declared and the first two octets are a UTF-16 BOM, whether to
let the BOM override the declaration and treat the data as
UTF-16 -or- to figure out a way to say "this is nonsense".

Patrik interprets the document as requiring the override
behavior.  He and I both prefer the latter, I think for at least
three reasons:

(i) I think, but am not sure, that some valid UTF-16 sequences
are valid UTF-8, even if the BOMs aren't.  If so, a file that is
labeled as UTF-8 but contains a UTF-16 BOM because some idiot
(not necessarily a human one) put it there at the beginning,
could really get into a world of trouble.  For example, any
valid UTF-16 sequence, especially one that contains no zero
octets, could be a valid 8859 sequence or valid sequences in
most code pages.  So starting from "it was labeled as UTF-8 but
isn't" doesn't make it UTF-16, BOM or no BOM.

(ii) As you know, there are a number of sequences that look like
UTF-8 but are not valid, ones that Unicode now prohibits (my
vague memory is that it took a while for that prohibition to be
clearly established).  If something is labeled as UTF-8 (and/or
starts with a UTF-8 BOM) but contains one of those sequences,
the choice is whether to say "this is nonsense" in some way or
to try to guess at what was intended.  My assumption is that
those two nonsense cases are more or less equivalent and that,
if systems are sent ill-formed files, they are better off not
trying to guess... especially if the spec does not require
Unicode (which goes to your 21st Century comment, I think).

(iii) Mislabeled content is just bad news for all sorts of
reasons, as I have been discovering when one of the mail systems
of a very large vendor is identifying message bodies as
'text/plain; charset="UTF-8"' and then sending UTF-16, BOM and
all.  Anything that can be done to discourage that sort of
nonsense --or, if not, to describe carefully what should be done
when it is encountered-- would be, IMO, A Good Thing.
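
To make the "this is nonsense" option concrete, here is a minimal
Python sketch of a receiver that refuses to guess.  It is only an
illustration of the behavior argued for above, not anything the I-D
specifies; the function name check_declared_utf8 is hypothetical.  It
rejects data labeled UTF-8 that begins with a UTF-16 BOM, and it
rejects ill-formed UTF-8 via strict decoding.

```python
import codecs

# UTF-16 byte order marks: FF FE (little-endian) and FE FF (big-endian).
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def check_declared_utf8(data: bytes) -> str:
    """Decode data labeled charset="UTF-8"; reject it rather than guess.

    Hypothetical helper: raises ValueError if the data starts with a
    UTF-16 BOM or is not well-formed UTF-8.
    """
    if data.startswith(UTF16_BOMS):
        # The label and the BOM contradict each other; do not override.
        raise ValueError("labeled UTF-8 but begins with a UTF-16 BOM")
    # "utf-8-sig" drops a leading UTF-8 BOM (EF BB BF) if present.
    # Strict error handling raises UnicodeDecodeError, a subclass of
    # ValueError, on ill-formed sequences.
    return data.decode("utf-8-sig", errors="strict")


# Point (i): a BOM-less UTF-16 sequence can also be valid UTF-8.
# U+4142 in UTF-16BE is the octets 41 42, which read as "AB" in UTF-8.
ambiguous = "\u4142".encode("utf-16-be")
```

The last two lines show why a failed or suspicious label cannot be
"repaired" by inspection alone: the same two octets decode cleanly
under both encodings, so only the label (or a BOM) distinguishes them.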

The I-D, in its present form, does none of that.

stay safe and well,
   john