Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

John C Klensin <john-ietf@jck.com> Wed, 29 April 2020 22:30 UTC

Date: Wed, 29 Apr 2020 18:30:35 -0400
From: John C Klensin <john-ietf@jck.com>
To: Patrik Fältström <patrik@frobbit.se>, Pete Resnick <resnick@episteme.net>
cc: Internationalization Directorate <i18ndir@ietf.org>
Message-ID: <E67F0F68A403F5E4E5D8F476@PSB>
In-Reply-To: <A9854982-3696-46FF-AD5C-8088CFCDD8FC@frobbit.se>
References: <E552C138-7938-42BD-B2B2-26AD8AA43516@nostrum.com> <A93B38FC-7D55-4D06-80AE-F165F242F259@episteme.net> <31CF68D680D76D7F45FAB3E2@PSB> <A9854982-3696-46FF-AD5C-8088CFCDD8FC@frobbit.se>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/R84k1CQ-c5y8hR8MQIW5fqvDWbo>
Subject: Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)


--On Wednesday, April 29, 2020 20:58 +0200 Patrik Fältström
<patrik@frobbit.se> wrote:

> On 29 Apr 2020, at 3:41, John C Klensin wrote:
> 
>> Since some hours have gone by without a response to your
>> message and I was in need of an excuse to delay getting to an
>> unpleasant task...
> 
> Now I have read the draft as well.
> 
>> Moreover, if I correctly understand what seems like
>> unnecessarily convoluted text (in both versions), a BOM is
>> ignored in further processing if the character encoding
>> scheme is determined to be UTF-8 in 4.2(2) or 4.2(3) but not
>> ignored if charset="UTF-8" is present and the BOM occurs
>> anyway (something clearly allowed by RFC 3629).  That doesn't
>> appear to make sense.
> 
> I think I see the same thing as you, which is that even if the
> charset parameter states the encoding is UTF-8, if the data
> itself starts with a BOM, then the text is to be treated as
> UTF-16.

Actually, that is not what I noticed, and that reinforces my
view that, even if one ignores the specific i18n issues, the
text is just too convoluted.  I had read the text as saying
that, if the charset is labeled (part 1), the checks in parts 2
and 3 just don't get made.  Whether that is smart or not --
whether the protocol or application should make a sanity check
on whether something labeled as charset="UTF-8" is actually
conforming UTF-8 and/or whether any BOM present is consistent
with UTF-8 -- I don't really know, except that it should
probably be clear.  But you are right: if the text is
identified with charset="UTF-8", then it really, really had
better be UTF-8 and, if there is a BOM that suggests it is
something else, then the spec should say something quite
definite about that.

> That is just so wrong.

I was more concerned about something else (with the
understanding that it isn't my only issue).  As I read the spec,
the plan is, approximately:
  (1) Apply step 1.  If there is no charset parameter present,
      go to step 2.
  (2) Apply step 2, i.e., go looking for a BOM fingerprint.  If
      there isn't one present, go to step 3.
  (3) Step 3: decide it is UTF-8.
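
For concreteness, this reading of the three steps can be sketched
in Python.  It is only an illustration of the reading above: the
function name `pick_charset` and the exact set of BOMs checked in
step 2 are assumptions, not text taken from the draft.

```python
import codecs
from typing import Optional


def pick_charset(data: bytes, charset_param: Optional[str] = None) -> str:
    """Hypothetical sketch of the three-step plan as read above."""
    # Step 1: an explicit charset parameter wins; steps 2 and 3 are skipped.
    if charset_param is not None:
        return charset_param
    # Step 2: go looking for a BOM fingerprint (assumed set of BOMs).
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # Step 3: no label and no BOM, so decide it is UTF-8.
    return "UTF-8"
```

In this sketch a leading BOM is never examined once a charset
parameter is present, so a UTF-16 BOM under charset="UTF-8" passes
through step 1 unnoticed.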

Now the (or at least one) problem with that is that, absent a
rule that says "MUST use Unicode in some known encoding form",
there is no practical and reliable way to distinguish UTF-8 from
any part of ISO/IEC 8859 or, for that matter, any proprietary
code page.  UTF-16, with or without the BOM heuristics of your
choice, is better, but not much better.
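
A small Python illustration of the ambiguity, offered as a sketch
only: the helper name `plausible_decodings` and the choice of
candidate encodings are made up for this example.  The same two
bytes decode cleanly under UTF-8 and under two parts of ISO/IEC
8859, each time to different text, so a successful decode proves
nothing about which encoding was intended.

```python
def plausible_decodings(data: bytes) -> dict:
    """Return every candidate encoding under which data decodes cleanly."""
    candidates = ("utf-8", "iso-8859-1", "iso-8859-15")  # assumed candidates
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)  # strict decoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding; rule it out
    return results


# Two bytes that are valid in all three candidates, with different meanings.
ambiguous = plausible_decodings(b"\xc3\xa4")
```

A strict UTF-8 check can rule UTF-8 out (for instance, the
ISO 8859-1 bytes of "Fältström" are not valid UTF-8), but it can
never confirm that the sender meant UTF-8.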
   
> I went to the ECMA spec and see they use UTF-16 all over the
> place, and have to bend over backwards to get things right. It
> feels like reading a BER encoding spec (again). :-)

BER is at least precise about what it is talking about.  This
spec isn't.
 
> This "problem" does already exist in RFC 4329...
> 
> But, if they update RFC 4329 I think they should clean this
> up, and my suggestion would be:
> 
> The encoding must be what is actually labeled. If the encoding
> is UTF-16 (which it seems it often is), then it should be
> tagged as UTF-16, not UTF-8 with BOM.

Absolutely.  The easiest way out of both the problem you saw and
the one I saw is to get rid of steps 2 and 3 and insist on
labeling and conformance to what is labeled.  If that is
impractical for some reason, much more specificity is needed,
starting with a firm Unicode requirement.  And, unless they
intend to confine themselves to the BMP, they probably need to
talk about surrogates and their implications (or include such an
explanation by reference).  As to whether they "should" fix a
problem left over from 4329: they have changed the text in that
area, and this is a Known Technical Omission or Defect.
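
One possible shape for "labeling and conformance to what is
labeled", as a hedged Python sketch rather than anything the
draft specifies: the names `decode_as_labeled` and
`utf16_code_units` are hypothetical, and dropping a leading UTF-8
BOM is just one policy a spec could choose.  Strict decoding alone
already rejects a UTF-16 BOM under a UTF-8 label, because the
bytes FF and FE never occur in valid UTF-8.

```python
import codecs


def decode_as_labeled(data: bytes, charset: str) -> str:
    """Decode strictly per the label; raise ValueError on any mismatch."""
    name = codecs.lookup(charset).name  # normalizes e.g. "UTF-8" to "utf-8"
    if name == "utf-8" and data.startswith(codecs.BOM_UTF8):
        # Assumed policy: a UTF-8 BOM agrees with the label, so drop it.
        data = data[len(codecs.BOM_UTF8):]
    # UnicodeDecodeError is a subclass of ValueError.
    return data.decode(name, errors="strict")


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units; characters beyond the BMP take two each."""
    return len(text.encode("utf-16-be")) // 2
```

With this, `decode_as_labeled(b"\xff\xfev\x00", "UTF-8")` raises
instead of guessing, and `utf16_code_units("\U0001F600")` shows
the surrogate pair needed for a character outside the BMP.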

best,
   john