Re: [I18ndir] I18ndir early review of draft-ietf-dispatch-javascript-mjs-07

John C Klensin <john-ietf@jck.com> Sat, 09 May 2020 16:49 UTC

To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Barry Leiba <barryleiba@computer.org>, John Levine <johnl@taugh.com>
Cc: i18ndir@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/ivG0eb8dcdnXd_xqwejsRu7H0no>

Martin,

I stand corrected, but the point I was trying to get at remains
unchanged.

Inline.

--On Saturday, May 9, 2020 16:36 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

 
> On 09/05/2020 13:01, John C Klensin wrote:
> 
>>> Section 4.2 still includes step #3 to deal with the (in
>>> practice quite common) case of a missing BOM and the media
>>> type missing a charset parameter.  There are also too many
>>> servers that set this to "ISO-8859-1" without otherwise
>>> examining the sources being served. We'll make it clearer
>>> this is a default/fallback case.
>> 
>> Noting (again) that, absent a BOM (or even with one) UTF-8
>> cannot be reliably distinguished from ISO-8859-X (for any
>> registered or unregistered value of X),
> 
> Sorry, not true. Please see
> https://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, page
> 20.

Very interesting and useful analysis, especially if one starts a
page or two earlier.  As I understand it, what you are saying
is that there are some 8859 parts (and some other code pages and
CCSs) in which some of the byte values are not populated and
hence cannot be confused with some strings of UTF-8.
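To make that concrete with a small Python illustration of my own
(not something from the paper):

    # "été" as ISO-8859-1 is the three octets E9 74 E9.  In UTF-8,
    # 0xE9 announces a three-byte sequence whose next two bytes must
    # be continuation bytes (0x80-0xBF), so this data cannot be UTF-8.
    latin1_data = b"\xe9t\xe9"
    print(latin1_data.decode("iso-8859-1"))      # 'été'
    try:
        latin1_data.decode("utf-8")
    except UnicodeDecodeError:
        print("rejected: not structurally valid UTF-8")

    # The UTF-8 encoding of the same text (C3 A9 74 C3 A9) also
    # decodes "successfully" as ISO-8859-1 -- as 'Ã©tÃ©' -- which is
    # the direction in which confusion remains possible.
    utf8_data = "été".encode("utf-8")
    print(utf8_data.decode("iso-8859-1"))        # 'Ã©tÃ©'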

The second sentence in the paragraph in the middle of page 19 is
particularly useful, IMO, in understanding what you are
suggesting:

	"To be able to assess the actual potential of confusion,
	it is important that all possible octet combinations are
	tested, within the limits of computational power and the
	repetitions that result in longer strings."

Let me come back to that but, in addition, some other
possibilities can be eliminated by careful script- or
language-based analysis.  Especially if one had a reasonable
hypothesis about the language, one could even go a bit further,
decoding the text both as UTF-8 and as some other candidate
encoding and CCS and then applying natural language analysis to
each result.  The odds that both would make sense in that
language (and, for this case, conform to Javascript syntax
rules) are vanishingly small; one or the other would approximate
gibberish or worse.  However, even tests like that can get us
only to "could not plausibly be anything but UTF-8" as distinct
from an absolute conclusion because, I suppose, the text could
be a digital signature or encrypted, in which case one would not
expect those strings to be anything but gibberish from a
normal/natural language standpoint.   Since I have not yet read
the entire paper, perhaps you make those points elsewhere.

So, what I would (or at least should) have said if I had your
level of understanding, or even had read this paper earlier,
would have been closer to:

	"Noting (again) that, absent a BOM (or even with one)
	UTF-8 cannot be reliably distinguished from most
	ISO-8859-X (for most registered or unregistered values
	of X) and many other 8bit CCSs without the use of
	careful testing on a per-alternative basis or an
	analysis of the text that was presumably encoded.
	Exceptions, which are fairly easily detected as not
	being UTF-8 if sufficient text is available, include any
	8bit CCS that does not assign values to any of the
	octets with hex values between x'80' and x'9F' 
  inclusive."
	
There are two problems with that revised explanation.  One is
that, in practice, most people reading the I-D would not read
it.  Their eyes would glaze over.  Possibly more important, the
odds that any browser (or similar application) would try to
perform any significant fraction of the needed tests (noting
your comment about testing "all possible octet combinations" as
a starting point) in real time are, IMO, just about zero.  So,
unless there is sufficient language tag or other identifying
information present to vastly narrow the alternative CCSs to be
considered as alternate possibilities (and maybe not then), the
substance of my original text does not change significantly.
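For what it is worth, the cheap end of such testing is easy
enough to write down; it is the per-alternative and per-language
analysis that is not.  A sketch of my own, in Python, of just
the "exceptions" test from the revised wording above:

    # If a candidate 8bit CCS assigns nothing to x'80'-x'9F', any
    # octet in that range rules that CCS out; UTF-8 continuation
    # bytes (0x80-0xBF) often land exactly there.
    def has_octets_in_c1_range(data: bytes) -> bool:
        return any(0x80 <= b <= 0x9F for b in data)

    def is_structurally_valid_utf8(data: bytes) -> bool:
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False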

Given the history of the use of non-Unicode CCSs on the web, a
far less sweeping statement than the one I made would be more
likely to be heeded and would make a similar point almost
equally well.  That statement would suggest that there are, in
all likelihood, only four possibilities (ignoring the wildness,
or lack thereof, of UTF-32):
    ISO/IEC 8859-1, UTF-8, UTF-16BE, and UTF-16LE
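
With those four candidates, a BOM settles three of them cheaply;
a sketch of my own (not text from the I-D):

    # Dispatch on the BOM, if any.  Absent a BOM, UTF-8 vs.
    # ISO-8859-1 is the hard case discussed just below.
    def sniff_bom(data: bytes):
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if data.startswith(b"\xfe\xff"):
            return "utf-16be"
        if data.startswith(b"\xff\xfe"):
            return "utf-16le"
        return None   # no BOM; fall back to charset labels or heuristics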

However, even with that restriction, UTF-8 cannot reliably be
distinguished from 8859-1 unless it is possible to apply an
additional constraint that C1 code points cannot occur in
Javascript, or at least any Javascript of interest, even in
embedded strings.  And, again, making the scans and tests needed
to make that distinction might involve a higher resource cost
than browsers (etc.) would find acceptable.
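A sketch, again mine rather than the I-D's, of what that
additional constraint would look like as a tie-breaker:

    # Bytes 0x80-0x9F decode as C1 control characters under
    # ISO-8859-1.  If such bytes appear and the data is also valid
    # UTF-8, then, assuming C1 controls cannot occur in Javascript
    # source, UTF-8 is the only plausible reading of the two.
    def utf8_or_latin1(data: bytes) -> str:
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return "iso-8859-1"   # not UTF-8; 8859-1 always decodes
        if any(0x80 <= b <= 0x9F for b in data):
            return "utf-8"        # Latin-1 reading would contain C1 controls
        return "ambiguous"        # pure ASCII, or genuinely undecidable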

Another difficulty with that restriction is that the I-D allows
a charset parameter without specifying that it "MUST" denote a
Unicode encoding form.  I think I suggested that in my earlier
note but don't think it made it into the review.

So, perhaps I should have said "cannot be easily and reliably
distinguished" (and then perhaps define what I mean by
"reliably") as well as substituting "most" for "any".

And perhaps we should suggest including the substance of the
above discussion, referencing Martin's paper of course, as an
appendix to the I-D or as a separate I-D that could then be
referenced.  Perhaps writing such a thing would be a good
project for the directorate.
 
>> I don't know what the
>> default/fallback case actually is or, more generally, what
>> the above paragraph means.  Unless I have misunderstood
>> something important, the reality is that, if there is
>> anywhere on the Internet that a web browser or server
>> (including decades-old embedded servers) treat ISO/IEC 8859-1
>> as either the default or legitimate, then there are two
>> possibilities: accurate labeling of the charset in use or use
>> of heuristics that, by their nature and the nature of
>> possible CCSs (not just encoding schemes), may fail.
 
> For people not familiar with the details, the following should
> be pointed out clearly: The heuristic of detecting UTF-8 as
> UTF-8 based on its very specific bit patterns is an extremely
> strong heuristic. There are cases of specific character
> combinations in specific encodings that can look like UTF-8
> bit combinations, but that is extremely rare and gets rarer
> and rarer the longer the data gets. This is already true if
> one looks just at the available characters/bit patterns, but
> even more so if one looks at the languages that use the
> respective encodings and their letter patterns.

While I think that is, like my original statement, somewhat
stronger than is justified (e.g., we have been repeatedly told
that implementers are not going to do analyses that require
examining long strings, language information is not guaranteed
to be available, and the current I-D suggests looking only at
initial characters that might be a BOM, not doing text
analysis), I think a paragraph along those lines would be
helpful.

Coming back to ISO/IEC 8859-1, is the sequence "ï" "»" "¿"
(for identification, equivalent to Unicode code points U+00EF,
U+00BB, U+00BF) at the beginning of a file of unencrypted
natural language text likely?  Nope.  Is it "extremely rare",
making the heuristic "extremely strong", when the language is
not known, when that is all the text one is going to look at,
and when, depending on Javascript restrictions and restrictions
I haven't found yet in the I-D, there may be no guarantee of
normal language and cleartext?  Seems to me that depends on a
judgment call about how rare or strong something needs to be to
rate "extremely", and I'd guess different people might have
different opinions about that.   But, again, my concern in
making those comments was that the I-D not say anything that can
be read as "this heuristic will work in all cases", and there I
think Martin and I agree, even if the cases in which it will
fail are "extremely rare".

It is also, IMO, consistent with my suggestion: 

>> That calls for at least a health warning in the
>> document, not proceeding as if the heuristics are foolproof.

best,
   john