Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html
Henrik Levkowetz <henrik@levkowetz.com> Tue, 20 November 2018 17:40 UTC
To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
References: <87zhu4oc9d.fsf@fifthhorseman.net> <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com> <87pnuzohb9.fsf@fifthhorseman.net>
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <cd480b02-f907-1bc5-1265-570d4277b28d@levkowetz.com>
Date: Tue, 20 Nov 2018 18:40:19 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <87pnuzohb9.fsf@fifthhorseman.net>
Content-Type: multipart/signed; micalg="pgp-sha256"; protocol="application/pgp-signature"; boundary="r0ONwNh9MCqXPiF0FBMk0uB5oBnknI4ms"
X-Clacks-Overhead: GNU Terry Pratchett
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/TIRjTYCS0tjgSk9BkgQHhbeJU3U>
Subject: Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

On 2018-11-20 17:16, Daniel Kahn Gillmor wrote:
> On Tue 2018-11-20 14:45:28 +0100, Henrik Levkowetz wrote:
>> On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
>>
>>> i found that there were several files in that repo that contained
>>> non-ASCII and non-UTF-8 octet sequences.
>>
>> Yes.  Many of these (possibly all, but I haven't inspected each file
>> individually) are due to control characters in the text documents.
>> SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
>> ACK, FS, and more.
>
> Thanks!  Filtering those out for the HTML would be good.  But that's
> clearly not the only thing remaining, since some files contain high
> bytes.
>
> I did a review of all of the documents that contain octets above 0x7f:

I've looked through the equivalent .txt files, and I believe that what
all of them except draft-ietf-2000-issue have in common is that they use
unknown code pages; not latin-1 and certainly not utf-8.  An attempt to
check some of them against other, less common code pages failed to find
a match.

Any fixup of these would have to be built into the htmlizer as custom
tweaks for each specific document (except for draft-ietf-2000-issue,
where some other solution would be needed).

All of these are old: I think the latest is from 2005, two are from
2004, and the rest are earlier.  I doubt it's meaningful to add custom
fixes...


Best regards,

	Henrik

> * draft-ietf-2000-issue-05.html is the most troubling one -- it appears
>   to contain fragments of an embedded pdf document, if i'm reading it
>   correctly.
>
> * draft-felton-universal-language-00.html contains what appears to be
>   attempted embedded unicode Chinese, but it has been garbled.
>
> * draft-ietf-dnsop-interim-signed-root-01.html and
>   draft-ihren-dnsop-interim-signed-root-01.html contain more mangled
>   unicode in the author names.
>
> * draft-ietf-pkix-cmmf-02.html contains some kind of mangled em dashes
>   and smartquotes
>
> * draft-ietf-rmonmib-pi-ipv6-01.html has a lot of weirdness in the very
>   end of the document
>
> * draft-ietf-smime-domsec-00.html appears to have mangled smartquotes
>
> * draft-klensin-dns-role-01.html has some breakage around "Montréal"
>   and 'German glyph "ö"'
>
> * draft-loughney-sctp-sig-prot-00.html contains some mangled characters
>   in the organization addresses
>
> * draft-rabbat-ccamp-carrier-survey-01.html contains a weird character
>   in the page headers between "Expires" and "April 2005"
>
> * draft-vasseur-mpls-backup-computation-02.html mangles "España"
>
> * draft-worster-mpls-in-ip-02.html has a damaged Fax number for Rick
>   Wilder
>
> Regards,
>
>         --dkg
>
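
The two checks discussed in the message above (finding stray control characters and high octets, then trying a file against candidate code pages) could be sketched roughly as follows. This is only an illustration and not the htmlizer's actual code: the function names `scan_octets` and `candidate_decodings`, the list of candidate codecs, and the sample bytes are all assumptions made for the example. Note that single-byte codecs such as `cp437` decode any input without error, so a successful decode proves nothing; the output still has to be read by a person to judge whether it makes sense.

```python
# Hypothetical helpers for inspecting legacy draft text; names are illustrative.

# Control characters that are normal in text drafts: TAB, LF, FF, CR.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D}

# Assumed candidate code pages; the thread does not say which ones were tried.
CANDIDATE_CODECS = ("cp1252", "mac_roman", "cp437", "cp850")


def scan_octets(data: bytes) -> dict:
    """Report stray C0 control characters and octets above 0x7f."""
    controls = sorted({b for b in data if b < 0x20 and b not in ALLOWED_CONTROLS})
    high = sorted({b for b in data if b > 0x7F})
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {"controls": controls, "high": high, "valid_utf8": valid_utf8}


def candidate_decodings(data: bytes, codecs=CANDIDATE_CODECS) -> dict:
    """Decode data under each candidate codec; None where the codec rejects it."""
    results = {}
    for name in codecs:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    # Made-up sample: one high octet (0x8e) and a SUB (^Z, 0x1a) control character.
    sample = b"Montr\x8eal\x1a"
    print(scan_octets(sample))
    for codec, text in candidate_decodings(sample).items():
        print(codec, repr(text))
```

For the made-up sample, only the `mac_roman` reading yields a plausible word, which is the kind of manual judgment a per-document fix would depend on.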