Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

Asmus Freytag <asmusf@ix.netcom.com> Wed, 29 April 2020 23:54 UTC

To: i18ndir@ietf.org
References: <E552C138-7938-42BD-B2B2-26AD8AA43516@nostrum.com> <A93B38FC-7D55-4D06-80AE-F165F242F259@episteme.net> <31CF68D680D76D7F45FAB3E2@PSB> <A9854982-3696-46FF-AD5C-8088CFCDD8FC@frobbit.se> <E67F0F68A403F5E4E5D8F476@PSB>
From: Asmus Freytag <asmusf@ix.netcom.com>
Message-ID: <0c3f5982-108d-81e0-29a4-ce67e7685f2e@ix.netcom.com>
Date: Wed, 29 Apr 2020 16:54:21 -0700
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0
MIME-Version: 1.0
In-Reply-To: <E67F0F68A403F5E4E5D8F476@PSB>
Content-Type: multipart/alternative; boundary="------------C111DEF6FF82E717511FBC87"
Content-Language: en-US
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/fOJxBaOYKt_XOyxPhFLKeIvGQpY>
Subject: Re: [I18ndir] Review volunteer needed (Fwd: [dispatch] WGLC of draft-ietf-dispatch-javascript-mjs-07)

I don't have time to delve into this deeply, but I am following the
discussion here.

+1 on the text being too convoluted - I know my first reading was wrong.

+2 on seeing whether we can't use the revised spec to move to the 21st 
century in terms of character encoding.

(Their step 3 really attempts that, by making UTF-8 the default /
fallback when the content is unlabeled and has no BOM signature, but it
could be more explicit that you only get legacy behavior if the content
is labeled as such.)
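
A rough Python sketch of the three-step order as it is summarized in
this thread: label first, then BOM signature, then UTF-8 as the
fallback. It follows the reading in the quoted message below, where a
label short-circuits the BOM check. The helper name `pick_encoding` and
the returned encoding names are illustrative assumptions, not text from
the draft.

```python
import codecs
from typing import Optional

# BOM signatures looked for in step 2 (assumed set: UTF-8, UTF-16BE, UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def pick_encoding(data: bytes, charset: Optional[str] = None) -> str:
    """Hypothetical helper: return the encoding the three steps would select."""
    # Step 1: an explicit charset label is taken at face value.
    if charset is not None:
        return charset.lower()
    # Step 2: no label, so look for a BOM signature at the start of the data.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 3: unlabeled and no BOM, so fall back to UTF-8.
    return "utf-8"
```

Note that under this reading a labeled "UTF-8" payload that starts with
a UTF-16 BOM is never inspected, which is the gap the thread is
complaining about.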

+3 on stronger language about detecting mislabeled / inconsistent cases.
Some conditions are clear-cut enough to warrant rejection. Ideally,
you'd always reject rather than convert anything to U+FFFD. But, again,
existing legacy content may make that impossible.
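
To illustrate the difference between rejecting and substituting U+FFFD,
here is a small Python sketch using the standard `bytes.decode` error
handlers. The sample bytes are made up for illustration.

```python
# Latin-1 encoded "café": the final byte 0xE9 is not valid UTF-8 on its own.
data = b"caf\xe9"

# Lenient decoding: the ill-formed byte is silently replaced with U+FFFD.
lenient = data.decode("utf-8", errors="replace")

# Strict decoding: the same input is rejected instead of being patched up.
try:
    data.decode("utf-8", errors="strict")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

With the lenient handler the mislabeling goes unnoticed downstream; with
the strict handler the consumer at least learns that the label and the
bytes disagree.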

+4 Aren't the UTF-16 BOM byte sequences illegal as UTF-8, or am I
misremembering? If so, a UTF-16 BOM should override, or flag as
invalid, any UTF-8 declaration.
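
For what it's worth, the octets 0xFE and 0xFF never occur in well-formed
UTF-8 (RFC 3629), so neither UTF-16 BOM can be decoded as UTF-8. A
quick Python check, where both helper names are hypothetical:

```python
import codecs


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def utf16_bom_conflicts_with_utf8_label(data: bytes) -> bool:
    """Hypothetical check: content labeled UTF-8 that starts with a UTF-16 BOM."""
    return data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))


# Both UTF-16 BOMs (FE FF and FF FE) are ill-formed as UTF-8;
# the UTF-8 BOM (EF BB BF) is well-formed.
bom_be_ok = is_valid_utf8(codecs.BOM_UTF16_BE)
bom_le_ok = is_valid_utf8(codecs.BOM_UTF16_LE)
bom_utf8_ok = is_valid_utf8(codecs.BOM_UTF8)
```

So a receiver that sees charset="UTF-8" together with a leading FE FF or
FF FE has an unambiguous signal that the label and the data conflict.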

A./

On 4/29/2020 3:30 PM, John C Klensin wrote:
>
> --On Wednesday, April 29, 2020 20:58 +0200 Patrik Fältström
> <patrik@frobbit.se> wrote:
>
>> On 29 Apr 2020, at 3:41, John C Klensin wrote:
>>
>>> Since some hours have gone by without a response to your
>>> message and I was in need of an excuse to delay getting to an
>>> unpleasant task...
>> Now I have read the draft as well.
>>
>>> Moreover, if I correctly understand what seems like
>>> unnecessarily convoluted text (in both versions) a BOM is
>>> ignored in further processing if the character encoding
>>> scheme is determined to be UTF-8 in 4.2(2) or 4.2(3) but not
>>> ignored if charset="UTF-8" is present and the BOM occurs
>>> anyway (something clearly allowed by RFC 3629).  That doesn't
>>> appear to make sense.
>> I think I see the same thing as you, which is that even if the
>> charset parameter states the encoding is UTF-8, if the data
>> itself starts with a BOM, then the text is to be treated as
>> UTF-16.
> Actually, not what I noticed, and that reinforces my view that
> even if one ignores the specific i18n issues, the text is just
> too convoluted.  I had read the text as suggesting that, if the
> charset was labeled (part 1), that the checks in part 2 and 3
> just don't get made.  Whether that is smart or not -- whether
> the protocol or application should make a sanity check on
> whether something labeled as charset="UTF-8" is actually
> conforming UTF-8 and/or, if there is a BOM present, whether it
> is consistent with UTF-8 -- I don't really know except that it
> should probably be clear.  But, you are right: if the text is
> identified with charset="UTF-8", then it really, really, better
> be UTF-8 and, if there is a BOM that suggests it is something
> else, then the spec should say something quite definite about
> that.
>
>> That is just so wrong.
> I was more concerned about something else (with the
> understanding that it isn't my only issue).  As I read the spec,
> the plan is, approximately:
>    (1) Apply step one, if there is no charset parameter present,
> go to step 2
>    (2) Apply step 2, i.e., go looking for a BOM fingerprint.  If
> there isn't one present, go to step 3.
>    (3) Step 3: decide it is UTF-8.
>
> Now the (or at least one) problem with that is that, absent a
> rule that says "MUST use Unicode in some known encoding form",
> there is no practical and reliable way to distinguish UTF-8 from
> any part of ISO/IEC 8859 or, for that matter, any proprietary
> code page.  UTF-16, with or without the BOM heuristics of your
> choice is better, but not much better.
>     
>> I went to the ECMA spec and see they use UTF-16 all over the
>> place, and have to bend over backwards to get things right. It
>> feels like reading a BER encoding spec (again). :-)
> BER is at least precise about what it is talking about.  This
> spec isn't.
>   
>> This "problem" does already exist in RFC 4329...
>>
>> But, if they update RFC 4329 I think they should clean this
>> up, and my suggestion would be:
>>
>> The encoding must be what is actually labeled. If the encoding
>> is UTF-16 (which it seems it often is), then it should be
>> tagged as UTF-16, not UTF-8 with BOM.
> Absolutely.  The easiest way out of both the problem you saw and
> the one I saw is to get rid of steps 2 and 3 and insist on
> labeling and conformance to what is labeled.  If that is
> impractical for some reason, much more specificity is needed,
> starting with a firm Unicode requirement.  And, unless they
> intend to confine themselves to the BMP, they probably need to
> talk about surrogates and their implications (or include such an
> explanation by reference).   As to whether they "should" fix a
> problem left over from 4329, they have changed the text in that
> area and this is a Known Technical Omission or Defect.
>
> best,
>     john
>