Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

Henrik Levkowetz <henrik@levkowetz.com> Tue, 20 November 2018 17:40 UTC

To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
References: <87zhu4oc9d.fsf@fifthhorseman.net> <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com> <87pnuzohb9.fsf@fifthhorseman.net>
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <cd480b02-f907-1bc5-1265-570d4277b28d@levkowetz.com>
Date: Tue, 20 Nov 2018 18:40:19 +0100
In-Reply-To: <87pnuzohb9.fsf@fifthhorseman.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/TIRjTYCS0tjgSk9BkgQHhbeJU3U>

On 2018-11-20 17:16, Daniel Kahn Gillmor wrote:
> On Tue 2018-11-20 14:45:28 +0100, Henrik Levkowetz wrote:
>> On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
>>
>>> i found that there were several files in that repo that contained
>>> non-ASCII and non-UTF-8 octet sequences.
>>
>> Yes.  Many of these (possibly all, but I haven't inspected each file
>> individually) are due to control characters in the text documents.
>> SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
>> ACK, FS, and more.
> 
> Thanks!  Filtering those out for the HTML would be good.  But that's
> clearly not the only thing remaining, since some files contain high
> bytes.
> 
> I did a review of all of the documents that contain octets above 0x7f:

I've looked through the equivalent .txt files, and I believe that what
all of them except draft-ietf-2000-issue have in common is that they use
unknown code pages; not latin-1 and certainly not utf-8.  An attempt to
check some of them against other, less common code pages failed to find
a match.
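
For anyone who wants to repeat that kind of check, here is a minimal
sketch in Python.  It is not what the htmlizer does; the function names,
the candidate codec list and the sample bytes are all assumptions made
up for illustration.  The idea is to pull out the runs of octets above
0x7f and then see which codecs, if any, turn the raw bytes into the
text one would expect to find there.

```python
import re

# Assumed list of codecs to try; extend it with whatever code pages
# seem plausible for the document in question.
CANDIDATES = ["utf-8", "cp1252", "latin-1", "cp437", "cp850", "mac_roman"]


def high_byte_runs(data: bytes) -> list:
    """Return each maximal run of octets above 0x7f found in data."""
    return re.findall(rb"[\x80-\xff]+", data)


def probe(data: bytes, expected: str, candidates=CANDIDATES) -> list:
    """Return the codecs under which data decodes to text containing expected."""
    matches = []
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec at all.
            continue
        if expected in text:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Made-up sample, not taken from any of the drafts.
    sample = b"Montr\x82al"
    print(high_byte_runs(sample))
    print(probe(sample, "Montr\u00e9al"))
```

An empty result from probe() only says that none of the listed codecs
fit, which is consistent with the outcome described above; it does not
identify the code page that was actually used.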

Any fixup of these would have to be built into the htmlizer as custom
tweaks for each specific document (except for draft-ietf-2000-issue,
where some other solution would be needed).

All of these are old: I think the latest is from 2005, two are from
2004, and the rest are earlier.  I doubt it's meaningful to add custom
fixes...


Best regards,

	Henrik

>  * draft-ietf-2000-issue-05.html is the most troubling one -- it appears
>    to contain fragments of an embedded pdf document, if i'm reading it
>    correctly.
> 
>  * draft-felton-universal-language-00.html contains what appears to be
>    attempted embedded unicode Chinese, but it has been garbled.
> 
>  * draft-ietf-dnsop-interim-signed-root-01.html and
>    draft-ihren-dnsop-interim-signed-root-01.html contain more mangled
>    unicode in the author names.
> 
>  * draft-ietf-pkix-cmmf-02.html contains some kind of mangled em dashes
>    and smartquotes
> 
>  * draft-ietf-rmonmib-pi-ipv6-01.html has a lot of weirdness in the very
>    end of the document
> 
>  * draft-ietf-smime-domsec-00.html appears to have mangled smartquotes
> 
>  * draft-klensin-dns-role-01.html has some breakage around "Montréal"
>    and 'German glyph "ö"'
> 
>  * draft-loughney-sctp-sig-prot-00.html contains some mangled characters
>    in the organization addresses
> 
>  * draft-rabbat-ccamp-carrier-survey-01.html contains a weird character
>    in the page headers between "Expires" and "April 2005"
> 
>  * draft-vasseur-mpls-backup-computation-02.html mangles "España"
> 
>  * draft-worster-mpls-in-ip-02.html has a damaged Fax number for Rick
>    Wilder
> 
> Regards,
> 
>         --dkg
>