Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

Daniel Kahn Gillmor <dkg@fifthhorseman.net> Tue, 20 November 2018 16:18 UTC

From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: Henrik Levkowetz <henrik@levkowetz.com>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
In-Reply-To: <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
References: <87zhu4oc9d.fsf@fifthhorseman.net> <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
Date: Tue, 20 Nov 2018 11:16:26 -0500
Message-ID: <87pnuzohb9.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg="pgp-sha512"; protocol="application/pgp-signature"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/Lam4F71WhBguGMAZOP-UsUO7FBg>
Subject: Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

On Tue 2018-11-20 14:45:28 +0100, Henrik Levkowetz wrote:
> On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
>
>> i found that there were several files in that repo that contained
>> non-ASCII and non-UTF-8 octet sequences.
>
> Yes.  Many of these (possibly all, but I haven't inspected each file
> individually) are due to control characters in the text documents.
> SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
> ACK, FS, and more.

Thanks!  Filtering those out for the HTML would be good.  But that's
clearly not the only thing remaining, since some files contain high
bytes.
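
For the control-character part, a minimal sketch of such a filter might look like the following. The name strip_controls is hypothetical and is not part of any existing IETF tooling; the choice to keep tab, LF, FF and CR is an assumption (FF marks page breaks in the text drafts), not something settled in this thread.

```python
# Hypothetical helper, for illustration only.
# Bytes we assume should survive: tab, LF, FF (page breaks), CR.
KEEP = {0x09, 0x0A, 0x0C, 0x0D}

# Every other C0 control byte (0x00-0x1f), plus DEL (0x7f), gets deleted.
_DELETE = bytes(b for b in range(0x20) if b not in KEEP) + b"\x7f"


def strip_controls(data: bytes) -> bytes:
    """Return data with C0 control characters and DEL removed.

    Bytes at or above 0x80 are left untouched, so this does nothing
    about the high-byte problems described below.
    """
    # With table=None, bytes.translate() only deletes the listed bytes.
    return data.translate(None, _DELETE)
```

This would drop SUB, BS, DC2, SI, ACK and FS while leaving ordinary text and line structure alone.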

I did a review of all of the documents that contain octets above 0x7f:

 * draft-ietf-2000-issue-05.html is the most troubling one -- it appears
   to contain fragments of an embedded PDF document, if I'm reading it
   correctly.

 * draft-felton-universal-language-00.html contains what appears to be
   an attempt at embedded Unicode Chinese text, but it has been garbled.

 * draft-ietf-dnsop-interim-signed-root-01.html and
   draft-ihren-dnsop-interim-signed-root-01.html contain more mangled
   Unicode in the author names.

 * draft-ietf-pkix-cmmf-02.html contains some kind of mangled em dashes
   and smartquotes.

 * draft-ietf-rmonmib-pi-ipv6-01.html has a lot of weirdness at the very
   end of the document.

 * draft-ietf-smime-domsec-00.html appears to have mangled smartquotes

 * draft-klensin-dns-role-01.html has some breakage around "Montréal"
   and 'German glyph "ö"'

 * draft-loughney-sctp-sig-prot-00.html contains some mangled characters
   in the organization addresses

 * draft-rabbat-ccamp-carrier-survey-01.html contains a weird character
   in the page headers between "Expires" and "April 2005"

 * draft-vasseur-mpls-backup-computation-02.html mangles "España"

 * draft-worster-mpls-in-ip-02.html has a damaged fax number for Rick
   Wilder.
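
For anyone who wants to repeat a review like this, here is a rough sketch of how the files could be sorted into buckets. The function name classify and the three labels are my own invention for illustration. Note the limitation: text that was mangled and then re-encoded (as some of the cases above may have been) can still be valid UTF-8, so a "utf8" result only means the octets decode, not that the text is correct.

```python
def classify(data: bytes) -> str:
    """Bucket a file's contents by what kind of octets it holds.

    Returns "ascii" if no octet is above 0x7f, "utf8" if there are
    high octets but the whole thing decodes as UTF-8, and
    "invalid-utf8" otherwise.
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    return "utf8"
```

Usage would be something like classify(open(path, "rb").read()) for each .html file in the rsync'ed tree; anything not reported as "ascii" is a candidate for manual inspection.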

Regards,

        --dkg