Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

John C Klensin <john-ietf@jck.com> Thu, 30 April 2020 19:47 UTC

Date: Thu, 30 Apr 2020 15:47:40 -0400
From: John C Klensin <john-ietf@jck.com>
To: Pete Resnick <resnick@episteme.net>
cc: i18ndir@ietf.org, John R Levine <johnl@taugh.com>
Message-ID: <477C5A18357719590D6336D9@PSB>
In-Reply-To: <8CE808C7-DF4F-45A9-9C17-2D82A8B78A9E@episteme.net>
References: <20200430014516.01551188B50A@ary.qy> <33a39102-0385-e235-1cdc-57cf6dad4f4b@ix.netcom.com> <7AD06F46449F354499AC2E24@PSB> <ACB0D0AB-2271-409D-A9A1-DFFD5A1AEE93@episteme.net> <alpine.OSX.2.22.407.2004301241440.26342@ary.qy> <8CE808C7-DF4F-45A9-9C17-2D82A8B78A9E@episteme.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/A83qz-DCjnzPMFqyG5REDghpe5c>

Pete,

I do not have time (especially today or tomorrow) to engage with
this further or try to explain again, but parts of John's
summary are, I believe, inconsistent with how Patrik, Asmus, and
I interpreted the document.  Specifically,

(1) We all interpreted it as "check the 'charset' parameter
first".  The question was what was to be done next.

(2) At least two of us expressed concern about the use of a file
name suffix as a classifier.  Even if that is a carry-forward
from 4329, it is a major step back from the reason why both
email and the web adopted media type labeling and, if we are
counting deployment and running code, I think it is safe to
suggest that those two applications (or, if you prefer, sets of
applications) are somewhat more broadly deployed than these
so-called "scripting media types".

(3) Other than the statement "Source text is expected to be in
Unicode Normalization Form C", there is apparently no
requirement that the underlying CCS be Unicode.  The statement
"Implementations are required to support the UTF-8 character
encoding scheme" does not impose that requirement either; it
just makes UTF-8 support mandatory to implement... if it does
that, because normative requirements of that type are normally
not buried in Security Considerations sections.  This I-D claims
authority for insisting on NFC from the Security Considerations
Section of RFC 3629, but that Section does not discuss
normalization forms at all: instead, it discusses "the same
thing" issue and then says that the problems are "amenable to
solutions" based on normalization (not just NFC) and UAX15.
Even if that was good advice based on our best understanding in
2003, it certainly is neither necessary nor sufficient now.
More generally, the current conventional wisdom (or, if you
will, "best practice") is that, except where special
circumstances apply (as they do with IDNA) normalization should
occur if needed at processing (especially comparison) time, not
at storage or transmission time.  Yet, this document specifies
NFC and does so without explanation other than, as discussed
above, blaming RFC 3629 which is not only not guilty but is not
applicable to any charset other than UTF-8.
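
As a rough illustration of the "normalize at comparison time,
not at storage or transmission time" practice mentioned in (3),
here is a minimal Python sketch.  The helper name same_text is
hypothetical and is not taken from the I-D or from any RFC; the
only library call used is the standard unicodedata.normalize.

```python
import unicodedata

def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both.

    The stored or transmitted strings are left untouched;
    normalization happens only for the comparison itself.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # 'e with acute' as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)            # raw code point comparison
print(same_text(composed, decomposed))   # comparison after normalization
```

The two strings differ as stored but compare equal once both
are normalized, which is the point: nothing upstream had to be
rewritten into NFC for the comparison to work.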

And, more broadly and probably more important:

(5) It is not an i18n issue specifically, but as a co-author of
RFC 6838 / BCP 13, I find it deeply troubling that this document
is put forward as a media type registration when what it appears
to do is to (i) allow the charset parameter but make it optional
and specifically provide that it is to be ignored if present for
Module Goals (Section 4.1) and then (ii) provide that, even for
Script Goals, it can be ignored if heuristics are applied that
suggest the charset (and encoding form) in use is something
else.  In particular, one can conform to the SHOULD in that
Section by specifying, for example, 'charset="ISO-8859-6"' and,
since the document doesn't specify what to do with Script goal
charsets one does not recognize, support for that drags in all
of the bidi and troublesome sequence issues associated with
Arabic without any of the support that Unicode and the
documents surrounding it provide.

(6) The text and section organization that I described as
"convoluted" is very troublesome.  It was bad in 4329; the
changes in the current I-D certainly do not fix the problem
and may make it worse.  When four of us (Patrik, Asmus, John
Levine, and myself), each with considerable experience in reading
technical specifications, can read the same spec and come to
four different conclusions as to what it says to do, that is a
deep, fundamental problem independent of any details.

What is even more troublesome is that they could rather easily
dig themselves out of most (sadly, not all) of this mess by, as
Asmus more or less put it, joining the 21st century.  For
example, well-placed requirements that state clearly:

(i) For Module goal sources, the information MUST be in Unicode,
using encoding form UTF-8.  A charset parameter SHOULD NOT be
specified; if it is, its value MUST be "UTF-8".

(ii) For Script goal sources, a charset parameter MUST be
specified and MUST be one of "UTF-8", "UTF-16BE", or "UTF-16LE".
If it is omitted, the receiving system MAY dig itself into as
deep a hole as it prefers, possibly using BOM heuristics if
there is an explicit "MUST use Unicode" requirement for Script
goals.

and getting all requirements on the spec itself moved out of
the Security Considerations section and stated as requirements,
without relying on requirements or recommendations of documents
like RFC 3629 that are somewhat outdated and/or don't say what
the I-D claims or implies that they say and/or are not
applicable to encoding forms (and non-Unicode CCSs) that the I-D
allows.
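
To make the shape of suggested requirements (i) and (ii) above
concrete, here is a minimal Python sketch of a label check.  It
is an illustration only, not text from the I-D: the function
name check_label, the set SCRIPT_CHARSETS, the goal strings and
the three result strings are all hypothetical, and matching
charset names case-insensitively is an assumption.

```python
from typing import Optional

# Charsets permitted for Script goal sources under suggestion (ii).
SCRIPT_CHARSETS = {"UTF-8", "UTF-16BE", "UTF-16LE"}

def check_label(goal: str, charset: Optional[str]) -> str:
    """Classify a (goal, charset) label as 'ok', 'discouraged' or 'invalid'."""
    # Assumption: charset names are compared case-insensitively.
    name = charset.upper() if charset is not None else None
    if goal == "module":
        if name is None:
            return "ok"            # (i): charset SHOULD NOT be specified
        # (i): if it is specified anyway, its value MUST be "UTF-8"
        return "discouraged" if name == "UTF-8" else "invalid"
    if goal == "script":
        if name is None:
            return "invalid"       # (ii): charset MUST be specified
        return "ok" if name in SCRIPT_CHARSETS else "invalid"
    raise ValueError(f"unknown goal: {goal!r}")

print(check_label("module", None))
print(check_label("script", "ISO-8859-6"))
```

With rules stated this way, every combination of goal and
charset has exactly one outcome, so two readers cannot come to
different conclusions about what a label means.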

_Recommendations_

(a) We have said in multiple places, most recently in what is
now RFC 8753, that this i18n stuff requires a collaborative
effort by people whose expertise comes from a variety of
different perspectives.  The comments from Patrik, Asmus, John
Levine, and myself illustrate the reasons for that.  So either
no review should ever go out unless it reflects multiple
sets of eyes and consensus (at least among those who were
willing to look) or it should bear a much stronger disclaimer
than is typical for "area review team" review assignments.  The
latter might say something like "while this review was assigned
to me by the i18n directorate, it represents my opinion only and
not consensus among the experts who make up that directorate,
even consensus that my summary of their discussion is accurate".


Consider what might happen in this case without one or the
other.  A review goes off that talks about the concerns of the
directorate and John's summary of those concerns ("We understand
it to say...").  The WG addresses those issues and the document
goes to IETF LC.  Some of Patrik, Asmus, and me (and maybe
others) respond to the IETF LC pointing out the issues raised in
our earlier notes and above, strongly suggesting that the WG
should have known about most of this, that they are depending on
documents that don't say what the WG claims they say and that
violate the letter and spirit of assorted BCPs.  We point out to
IANA that this document is not a proper Media Type registration
and that 4329 wasn't either.  The WG responds with dismay
because all of this is new to them.  And the ART ADs (who I
believe are on this list) end up with egg on their faces as does
the whole directorate and its leadership.

(b) Let's respond to the WG with the issue I think those of us
who have looked at the document are all agreed about: it is
_really_ hard to figure out just what the document specifies and
hence to comment on it in an authoritative way.   If they are
assuming Unicode, they need to make that a requirement, not hope
the reader figures it out.   The notorious Section 4 may need to
be split up into separate subsections for Module goals and
Script goals or otherwise structured to be sure it is clear what
one is to do in each case and with and without charset
parameters.  And probably (less important for this iteration
since no one else mentioned it, but I predict an extra iteration
if it is not done), normative requirements on the spec must not
appear only in the Security Considerations section and they had
better check the applicability of their references.  Only when
they fix enough of those things that we can all agree about what
the document says are they going to get a review of substantive
i18n issues.

Disgustedly,
   john


--On Thursday, April 30, 2020 12:30 -0500 Pete Resnick
<resnick@episteme.net> wrote:

> On 30 Apr 2020, at 12:22, John R Levine wrote:
> 
>>> the WG to take some action? If I don't hear from anyone,
>>> I'll start  accosting people privately.
>> 
>> Nooo, not the Private Accosting.
> 
> Obviously you have never experienced my full-out private
> accosting. :-)
> 
>> Summary:
>> 
>> The i18n directorate has some concerns about character set
>> handling in draft-ietf-dispatch-javascript-mjs-07.
>> 
>> We understand it to say that if a javascript MIME element
>> does not  have a name that ends with .mjs, a consumer ignores
>> the declared  charset and looks at the first few bytes of the
>> content for a byte  order mark (BOM.) If it finds one, it
>> uses the charset implied by the  BOM, which can be UTF-16BE,
>> UTF-16LE, or UTF-8.  If there's no BOM, it  uses the declared
>> charset unless there isn't one, in which case it  defaults to
>> UTF-8.
>> 
>> We are unaware of any other MIME type that uses this sort of
>> trick to  work around mislabelled content, and are concerned
>> that it leads to  failures in general MIME code that doesn't
>> handle this special case.   We also don't know how important
>> the workaround is in practice, e.g.,  how many MIME producers
>> still mislabel UTF-16 as UTF-8 or vice versa.
>> 
>> For better interoperation it could say something like
>> producers MUST  put the correct charset on any media (same as
>> any other media type)  and that consumers SHOULD use the
>> declared charset but MAY do the BOM  trick for backward
>> compatibility in certain cases.
>> 
>> It also says the BOM must be removed from the decoded text.
>> That's  confusing since ECMAscript treats a BOM as a space
>> which would be  harmless at the start of a block of code.
> 
> Thanks for taking up the pen John. If folks think something
> needs to be elaborated or added, or if you have some
> wordsmithing, do speak up.
> 
> I'll check with Barry whether he wants this on the official
> review form. If so, I'll assign the review to you in the
> datatracker. Otherwise, you can just email the dispatch list
> and sign it "John, stuckee for the directorate" or some such.
> 
> pr
> -- 
> Pete Resnick https://www.episteme.net/
> All connections to the world are tenuous at best
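
For reference, the reading given in John Levine's quoted
summary, for content whose name does not end in .mjs, can be
sketched in Python as below.  This encodes that summary only;
as the reply above notes, other readers interpreted the draft
differently, so it should not be taken as what the I-D actually
specifies.  The names BOMS and pick_script_charset are
hypothetical.

```python
import codecs
from typing import Optional

# Byte order marks for the three charsets named in the summary.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

def pick_script_charset(declared: Optional[str], body: bytes) -> str:
    """Choose a charset for non-.mjs content, per the quoted summary."""
    for bom, charset in BOMS:
        if body.startswith(bom):
            return charset         # a BOM overrides the declared charset
    # No BOM: use the declared charset, or default to UTF-8 if none.
    return declared if declared else "UTF-8"

mislabelled = codecs.BOM_UTF16_LE + "var x;".encode("utf-16-le")
print(pick_script_charset("UTF-8", mislabelled))
print(pick_script_charset(None, b"var x;"))
```

The first call shows the case the summary worries about: the
label says UTF-8, the BOM says UTF-16LE, and the BOM wins, which
is behavior general MIME code would not expect.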