Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html
Daniel Kahn Gillmor <dkg@fifthhorseman.net> Tue, 20 November 2018 16:18 UTC
Return-Path: <dkg@fifthhorseman.net>
X-Original-To: tools-discuss@ietfa.amsl.com
Delivered-To: tools-discuss@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2DD6812D4F2 for <tools-discuss@ietfa.amsl.com>; Tue, 20 Nov 2018 08:18:13 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.89
X-Spam-Level:
X-Spam-Status: No, score=-1.89 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, T_SPF_PERMERROR=0.01] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id T3j2bJXW5uA7 for <tools-discuss@ietfa.amsl.com>; Tue, 20 Nov 2018 08:18:11 -0800 (PST)
Received: from che.mayfirst.org (che.mayfirst.org [IPv6:2001:470:1:116::7]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 5177112F295 for <tools-discuss@ietf.org>; Tue, 20 Nov 2018 08:18:11 -0800 (PST)
Received: from fifthhorseman.net (unknown [38.109.115.130]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by che.mayfirst.org (Postfix) with ESMTPSA id 0088DF99A; Tue, 20 Nov 2018 11:18:08 -0500 (EST)
Received: by fifthhorseman.net (Postfix, from userid 1000) id 8601F202A8; Tue, 20 Nov 2018 11:16:29 -0500 (EST)
From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: Henrik Levkowetz <henrik@levkowetz.com>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
In-Reply-To: <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
References: <87zhu4oc9d.fsf@fifthhorseman.net> <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
Date: Tue, 20 Nov 2018 11:16:26 -0500
Message-ID: <87pnuzohb9.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg="pgp-sha512"; protocol="application/pgp-signature"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/Lam4F71WhBguGMAZOP-UsUO7FBg>
Subject: Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html
X-BeenThere: tools-discuss@ietf.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: IETF Tools Discussion <tools-discuss.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/tools-discuss>, <mailto:tools-discuss-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tools-discuss/>
List-Post: <mailto:tools-discuss@ietf.org>
List-Help: <mailto:tools-discuss-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/tools-discuss>, <mailto:tools-discuss-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 20 Nov 2018 16:18:13 -0000
On Tue 2018-11-20 14:45:28 +0100, Henrik Levkowetz wrote:
> On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
>
>> i found that there were several files in that repo that contained
>> non-ASCII and non-UTF-8 octet sequences.
>
> Yes. Many of these (possibly all, but I haven't inspected each file
> individually) are due to control characters in the text documents.
> SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
> ACK, FS, and more.

Thanks! Filtering those out for the HTML would be good. But that's
clearly not the only thing remaining, since some files contain high
bytes.

I did a review of all of the documents that contain octets above 0x7f:

 * draft-ietf-2000-issue-05.html is the most troubling one -- it
   appears to contain fragments of an embedded pdf document, if i'm
   reading it correctly.

 * draft-felton-universal-language-00.html contains what appears to be
   attempted embedded unicode Chinese, but it has been garbled.

 * draft-ietf-dnsop-interim-signed-root-01.html and
   draft-ihren-dnsop-interim-signed-root-01.html contain more mangled
   unicode in the author names.

 * draft-ietf-pkix-cmmf-02.html contains some kind of mangled em dashes
   and smartquotes

 * draft-ietf-rmonmib-pi-ipv6-01.html has a lot of weirdness in the
   very end of the document

 * draft-ietf-smime-domsec-00.html appears to have mangled smartquotes

 * draft-klensin-dns-role-01.html has some breakage around "Montréal"
   and 'German glyph "ö"'

 * draft-loughney-sctp-sig-prot-00.html contains some mangled
   characters in the organization addresses

 * draft-rabbat-ccamp-carrier-survey-01.html contains a weird character
   in the page headers between "Expires" and "April 2005"

 * draft-vasseur-mpls-backup-computation-02.html mangles "España"

 * draft-worster-mpls-in-ip-02.html has a damaged Fax number for Rick
   Wilder

Regards,

        --dkg
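
For anyone who wants to repeat this kind of review, here is a minimal sketch of how the three categories discussed above (stray control characters, octets above 0x7f, and octet sequences that are not valid UTF-8) might be detected. It is not the tooling used in this thread: the names `classify` and `scan` are hypothetical, and treating TAB, LF, FF and CR as the only acceptable control characters is an assumption.

```python
import re
from pathlib import Path

# Assumption: TAB (0x09), LF (0x0a), FF (0x0c) and CR (0x0d) are expected
# in the documents; every other C0 control character, plus DEL, is flagged.
CONTROL_RE = re.compile(rb"[\x00-\x08\x0b\x0e-\x1f\x7f]")


def classify(data: bytes) -> dict:
    """Summarise control characters, high bytes and UTF-8 validity."""
    # Distinct flagged control octets, as sorted integers (SUB is 0x1a).
    controls = sorted({m.group()[0] for m in CONTROL_RE.finditer(data)})
    has_high_bytes = any(b > 0x7F for b in data)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "controls": controls,
        "has_high_bytes": has_high_bytes,
        "valid_utf8": valid_utf8,
    }


def scan(directory: str):
    """Yield (filename, report) for each .html file that needs a closer look."""
    for path in sorted(Path(directory).glob("*.html")):
        report = classify(path.read_bytes())
        if report["controls"] or not report["valid_utf8"]:
            yield path.name, report
```

A file with high bytes that still decodes as UTF-8 is not reported by `scan`, because well-formed non-ASCII text (a correctly encoded "España", say) is not a defect in itself. Note that text which was mangled and then re-encoded can also decode cleanly, so a pass like this only narrows the list for manual review.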