Re: [Tools-discuss] BOMs and rfcmarkup

Henrik Levkowetz <henrik@levkowetz.com> Wed, 20 September 2017 22:08 UTC

To: Adam Roach <adam@nostrum.com>, Brian E Carpenter <brian.e.carpenter@gmail.com>, tools-discuss@ietf.org
References: <935d04f3-7605-dcd2-7366-d7e3f522e0e0@gmail.com> <175CFE36-A267-42CF-98FE-50855E355DD6@tzi.org> <1403EA2F-C81D-44FF-B644-AD2C49FA6832@tzi.org> <b0f562eb-bf27-5c48-03a0-54cf63125135@gmail.com> <1c939c42-6574-8f3d-c871-80b7a0930f05@levkowetz.com> <13b43544-ec18-ec61-4f1b-acfe5f431551@nostrum.com>
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <caea1415-66d8-e98a-3a59-d0a199a39372@levkowetz.com>
Date: Thu, 21 Sep 2017 00:08:37 +0200
In-Reply-To: <13b43544-ec18-ec61-4f1b-acfe5f431551@nostrum.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/IoO0SMua_S8eJ0CD-91zesiKvSQ>
Subject: Re: [Tools-discuss] BOMs and rfcmarkup

Hi Adam,

On 2017-09-20 23:44, Adam Roach wrote:
> On 9/20/17 1:34 PM, Henrik Levkowetz wrote:
>> Hi Brian,
>>
>> On 2017-09-20 01:47, Brian E Carpenter wrote:
>>> Just a note that https://tools.ietf.org/html/rfc8187 seems
>>> to be fine, iff your browser is set to assume UTF-8. But it
>>> does include the BOM.
>>>
>>> If your browser is set to assume what Firefox calls "Western",
>>> the document starts with ï»¿ (which is the ISO 8859 interpretation
>>> of a UTF-8 BOM, expressed in UTF-8, if that doesn't make your head
>>> hurt and if it survives the email system). Naturally, the £, € and
>>> ü do not display correctly.
>>>
>>> The BOM is in the generated HTML immediately after the <pre>.
>>> However, with Firefox it performs no useful function; all that matters
>>> is the View/Text Encoding setting. Exactly the same applies to Internet
>>> Explorer.
>>>
>>> Thus, at least for these two browsers, including the BOM in the HTML
>>> file seems to be pointless. I'd vote for removing it.
>> I've updated rfcmarkup to strip out BOMs; the link above should now
>> give you a document without a BOM after the <pre>.
> 
> Wait. WAIT! No. This is the wrong answer. If you are in a tool that 
> makes the BOM visible at all, the problem isn't the presence of the BOM 
> (which has well-defined rendering semantics in UTF-8 that basically say 
> "this shouldn't be visible in any way at all"); the problem is that you 
> are loading a UTF-8 document and treating it as _some_ _other_ _encoding_.

I'm sorry, but I disagree (at least in one sense).  Earlier, rfcmarkup
left a BOM in the middle of the HTML, which I believe is wrong.  You
could say that this is the result of not reading a UTF-8 file with BOM
correctly, and I would agree.  So now I'm reading the content as UTF-8,
but not keeping the BOM, and not placing it in the middle of an HTML file
(which is served with Content-Type: text/html; charset=UTF-8).  I think
that's right.
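
To illustrate reading a UTF-8 file without carrying the BOM into the
output, here is a minimal Python sketch.  It is not rfcmarkup's actual
code: the helper name read_text_without_bom and the sample bytes are
made up for the example.  It relies on Python's standard "utf-8-sig"
codec, which decodes UTF-8 and drops a leading BOM if one is present.

```python
import codecs


def read_text_without_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if present.

    Hypothetical helper for illustration only.
    """
    # 'utf-8-sig' decodes like 'utf-8' but skips an initial U+FEFF.
    return raw.decode("utf-8-sig")


# Sample input (assumed): a UTF-8 BOM followed by non-ASCII text.
sample = codecs.BOM_UTF8 + "Price: £5, €6, ü\n".encode("utf-8")

text = read_text_without_bom(sample)

# The text can now be wrapped in markup without a BOM after the <pre>.
html = "<pre>" + text + "</pre>"
```

Input without a BOM passes through unchanged, so the same call works
for both kinds of file.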

> It does no good to paper over the presence of a BOM, since that just 
> defers the problem to elsewhere (namely, any other non-ASCII codepoints 
> in the document).

Nope.  I've been handling UTF-8 correctly in rfcmarkup for several years;
it has behaved correctly for draft-ietf-httpbis-rfc5987bis for every
version which has had non-ASCII characters.
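
The difference between the BOM and the other non-ASCII characters can be
sketched in a few lines of Python.  The sample text is an assumption for
the example.  It shows the same bytes decoded as UTF-8 and as ISO 8859-1
(what a browser set to "Western" would roughly do), with and without
the BOM bytes.

```python
import codecs

# Sample content (assumed): two non-ASCII characters, preceded by a BOM.
body = "£ and ü"
with_bom = codecs.BOM_UTF8 + body.encode("utf-8")

# Correct decoding: UTF-8, with the leading BOM dropped.
decoded = with_bom.decode("utf-8-sig")

# Wrong decoding: the same bytes interpreted as ISO 8859-1.
# The three BOM bytes EF BB BF show up as visible characters.
misread = with_bom.decode("iso-8859-1")

# Removing the BOM bytes does not change how the rest is decoded:
# the other non-ASCII characters are still garbled under ISO 8859-1.
misread_no_bom = with_bom[len(codecs.BOM_UTF8):].decode("iso-8859-1")
```

Whether the remaining characters display correctly depends only on the
decoding being UTF-8, not on whether the BOM is present.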

> The proper behavior here, and especially for rfcdiff, is to make sure 
> everything is using UTF-8. Leaving the BOM in place serves as a useful 
> canary in this coal mine, and I would suggest *not* stripping it, at 
> least until we're sure the tools *otherwise* handle UTF-8 correctly. 
> That is, the BOM provides a good litmus test: if you can see it, you're 
> doing it wrong.

My experience is that rfcmarkup has been dealing just fine with UTF-8
in general, but the BOM (which I think is nonsense in UTF-8, but
that's a different discussion) has been visible, which I've now fixed.


Best regards,

	Henrik