Re: [I18ndir] I18ndir early review of draft-ietf-dispatch-javascript-mjs-07
John C Klensin <john-ietf@jck.com> Sat, 09 May 2020 16:49 UTC
Date: Sat, 09 May 2020 12:48:59 -0400
From: John C Klensin <john-ietf@jck.com>
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, Barry Leiba <barryleiba@computer.org>, John Levine <johnl@taugh.com>
cc: i18ndir@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/ivG0eb8dcdnXd_xqwejsRu7H0no>
Martin,

I stand corrected, but the point I was trying to get at remains
unchanged.  Inline.

--On Saturday, May 9, 2020 16:36 +0900 "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:

> On 09/05/2020 13:01, John C Klensin wrote:
>
>>> Section 4.2 still includes step #3 to deal with the (in
>>> practice quite common) case of a missing BOM and the media
>>> type missing a charset parameter. There are also too many
>>> servers that set this to "ISO-8859-1" without otherwise
>>> examining the sources being served. We'll make it clearer
>>> this is a default/fallback case.
>>
>> Noting (again) that, absent a BOM (or even with one) UTF-8
>> cannot be reliably distinguished from ISO-8859-X (for any
>> registered or unregistered value of X),
>
> Sorry, not true. Please see
> https://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf, page
> 20.

Very interesting and useful analysis, especially if one starts a
page or two earlier.  As I understand it, what you are saying is
that there are some 8859 parts (and some other code pages and
CCSs) in which some of the bytes are not populated and hence
cannot be confused with some strings of UTF-8.  The second
sentence in the paragraph in the middle of page 19 is
particularly useful, IMO, in understanding what you are
suggesting: "To be able to assess the actual potential of
confusion, it is important that all possible octet combinations
are tested, within the limits of computational power and the
repetitions that result in longer strings."

Let me come back to that but, in addition, some other
possibilities can be eliminated by careful script or
language-based analysis.  Especially if one had a reasonable
hypothesis about the language, one could even go a bit further,
assuming the text were in UTF-8 and some other candidate
encoding and CCS and then applying natural language analysis to
both.
The odds that both would make sense in that language (and, for
this case, conform to Javascript syntax rules) are vanishingly
small; one or the other would approximate gibberish or worse.
However, even tests like that can get us only to "could not
plausibly be anything but UTF-8" as distinct from an absolute
conclusion because, I suppose, text could be a digital signature
or encrypted, in which case one would not expect those strings
to be anything but gibberish from a normal/natural language
standpoint.  Since I have not yet read the entire paper, perhaps
you make those points elsewhere.

So, what I would (or at least should) have said if I had your
level of understanding, or even had read this paper earlier,
would have been closer to:

   "Noting (again) that, absent a BOM (or even with one) UTF-8
   cannot be reliably distinguished from most ISO-8859-X (for
   most registered or unregistered values of X) and many other
   8bit CCSs without the use of careful testing on a
   per-alternative basis or an analysis of the text that was
   presumably encoded.  Exceptions, which are fairly easily
   detected as not being UTF-8 if sufficient text is available,
   include any 8bit CCS that does not assign values to any of
   the octets with hex values between x'80' and x'9F'
   inclusive."

There are two problems with that revised explanation.  One is
that, in practice, most people reading the I-D would not read
it.  Their eyes would glaze over.  Possibly more important, the
odds that any browser (or similar application) would try to
perform any significant fraction of the needed tests (noting
your comment about testing "all possible octet combinations" as
a starting point) in real time are, IMO, just about zero.  So,
unless there is sufficient language tag or other identifying
information present to vastly narrow the alternative CCSs to be
considered as alternate possibilities (and maybe not then), the
substance of my original text does not change significantly.
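
[Editorial illustration, not part of the original message.]  The
octet-level distinction discussed above might be sketched roughly
as follows.  This is only a minimal sketch under stated
assumptions: the function names and the four result labels are
hypothetical, the only alternative considered is ISO/IEC 8859-1,
and no language or syntax analysis is attempted.  It relies on
Python's strict "utf-8" decoder to test well-formedness and
treats octets x'80' through x'9F' as evidence against an
8859-style reading.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets form well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def has_c1_octets(data: bytes) -> bool:
    """Return True if any octet falls in x'80'..x'9F' (C1 range in 8859-1)."""
    return any(0x80 <= b <= 0x9F for b in data)


def classify(data: bytes) -> str:
    """Hypothetical four-way classification; the labels are assumptions."""
    if all(b < 0x80 for b in data):
        # Pure ASCII reads the same in UTF-8 and in ISO-8859-X.
        return "ascii"
    if not is_valid_utf8(data):
        # Ill-formed as UTF-8, so some other encoding must be in use.
        return "not-utf8"
    if has_c1_octets(data):
        # Well-formed UTF-8 whose 8859-1 reading would need C1 controls.
        return "utf8-plausible"
    # Well-formed UTF-8 that is also printable text under 8859-1.
    return "ambiguous"


if __name__ == "__main__":
    samples = [
        b"var x = 1;",
        "é".encode("iso-8859-1"),
        "€".encode("utf-8"),
        "é".encode("utf-8"),
    ]
    for sample in samples:
        print(sample, classify(sample))
```

The last sample is the interesting one: the two octets that encode
"é" in UTF-8 are also two ordinary printable characters in ISO/IEC
8859-1, so an octet-level test alone cannot settle the question
for short inputs like that.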

Given the history of the use of non-Unicode CCSs on the web, a
far less sweeping statement than the one I made would be more
likely and make a similar point almost equally well.  That
statement would suggest that there are, in all likelihood, only
four possibilities (ignoring the wildness, or lack thereof, of
UTF-32):

   ISO/IEC 8859-1, UTF-8, UTF-16BE, and UTF-16LE.

However, even with that restriction, UTF-8 cannot reliably be
distinguished from 8859-1 unless it is possible to apply an
additional constraint that C1 code points cannot occur in
Javascript, or at least any Javascript of interest, even in
embedded strings.  And, again, making the scans and tests needed
to make that distinction might involve a higher resource cost
than browsers (etc.) would find acceptable.

Another difficulty with that restriction is that they allow a
charset parameter without specifying that it "MUST" denote a
Unicode encoding form.  I think I suggested that in my earlier
note but don't think it made it into the review.

So, perhaps I should have said "cannot be easily and reliably
distinguished" (and then perhaps defined what I mean by
"reliably") as well as substituting "most" for "any".  And
perhaps we should suggest including the substance of the above
discussion, referencing Martin's paper of course, as an appendix
to the I-D or as a separate I-D that could then be referenced.
Perhaps writing such a thing would be a good project for the
directorate.

>> I don't know what the
>> default/fallback case actually is or, more generally, what
>> the above paragraph means.
>> Unless I have misunderstood
>> something important, the reality is that, if there is
>> anywhere on the Internet that a web browser or server
>> (including decades-old embedded servers) treat ISO/IEC 8859-1
>> as either the default or legitimate, then there are two
>> possibilities: accurate labeling of the charset in use or use
>> of heuristics that, by their nature and the nature of
>> possible CCSs (not just encoding schemes), may fail.
>
> For people not familiar with the details, the following should
> be pointed out clearly: The heuristic of detecting UTF-8 as
> UTF-8 based on its very specific bit patterns is an extremely
> strong heuristic. There are cases of specific character
> combinations in specific encodings that can look like UTF-8
> bit combinations, but that is extremely rare and gets rarer
> and rarer the longer the data gets. This is already true if
> one looks just at the available characters/bit patterns, but
> even more so if one looks at the languages that use the
> respective encodings and their letter patterns.

While I think that is, like my original statement, somewhat
stronger than is justified (e.g., we have been repeatedly told
that implementers are not going to do analyses that require
examining long strings and language information is not
guaranteed to be available, and noting that the current I-D
suggests looking only at initial characters that might be a BOM,
not doing text analysis), I think a paragraph along those lines
would be helpful.

Coming back to ISO/IEC 8859-1, is the sequence "ï" "»" "¿" (for
identification, equivalent to Unicode code points U+00EF,
U+00BB, U+00BF) at the beginning of a file of unencrypted
natural language text likely?  Nope.
Is it "extremely rare", making the heuristic "extremely strong"
when the language is not known, that is all the text one is
going to look at, and when, depending on Javascript restrictions
and restrictions I haven't found yet in the I-D, there may be no
guarantee of normal language and cleartext?  Seems to me that
depends on a judgment call about how rare or strong something
needs to be to rate "extremely" and I'd guess different ones of
us might have different opinions about that.

But, again, my concern in making those comments was that the I-D
not say anything that can be read as "this heuristic will work
in all cases" and there I think Martin and I agree even if the
cases in which it will fail are "extremely rare".  It is also,
IMO, consistent with my suggestion:

>> That calls for at least a health warning in the
>> document, not proceeding as if the heuristics are foolproof.

best,
   john