Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html
Henrik Levkowetz <henrik@levkowetz.com> Tue, 20 November 2018 13:45 UTC
To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
References: <87zhu4oc9d.fsf@fifthhorseman.net>
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
Date: Tue, 20 Nov 2018 14:45:28 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <87zhu4oc9d.fsf@fifthhorseman.net>
Content-Type: multipart/signed; micalg="pgp-sha256"; protocol="application/pgp-signature"; boundary="KR60lnfTbUSjS7NA8Txg74TLK7P4VrLU5"
X-Clacks-Overhead: GNU Terry Pratchett
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/YhMY_TJlJR0ruc_McW3Ql382IF8>
Subject: Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html
Hi Daniel,

On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
> I ran the following to get all html files:
>
>     rsync -az rsync.tools.ietf.org::tools.html/ html/
>
> But i found that there were several files in that repo that contained
> non-ASCII and non-UTF-8 octet sequences.

Yes.  Many of these (possibly all, but I haven't inspected each file
individually) are due to control characters in the text documents.
SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
ACK, FS, and more.

Yes, I could have a go at filtering out those at the same time as I
generate the htmlized versions.  I'll do so for control characters,
then we can revisit and see if additional action is needed.

Best regards,

    Henrik

> In particular, i ran:
>
>     find html/ -iname '*.html' -print0 | xargs -0 file | grep -v 'ASCII text' | grep -v 'UTF-8 Unicode text'
>
> and it produced this output:
>
> html/draft-calabrese-requir-logprot-04.html: data
> html/draft-dewinter-queue-start-00.html: data
> html/draft-dewinter-queue-start-01.html: data
> html/draft-duerst-iri-05.html: data
> html/draft-dusse-smime-msg-03.html: data
> html/draft-felton-universal-language-00.html: data
> html/draft-fujiwara-dnsop-resolver-update-00.html: data
> html/draft-iana-special-ipv4-03.html: data
> html/draft-iana-special-ipv4-04.html: data
> html/draft-ietf-2000-issue-00.html: data
> html/draft-ietf-2000-issue-03.html: data
> html/draft-ietf-2000-issue-04.html: data
> html/draft-ietf-2000-issue-05.html: data
> html/draft-ietf-2000-issue-06.html: data
> html/draft-ietf-ccamp-mpls-tp-rsvpte-ext-associated-lsp-08.html: data
> html/draft-ietf-dhc-mdhcp-00.html: data
> html/draft-ietf-dhc-multopt-00.html: data
> html/draft-ietf-disman-remops-mib-00.html: data
> html/draft-ietf-disman-remops-mib-01.html: data
> html/draft-ietf-dnsop-interim-signed-root-01.html: data
> html/draft-ietf-hubmib-mau-mib-03.html: data
> html/draft-ietf-hubmib-mau-mib-04.html: data
> html/draft-ietf-hubmib-repeater-dev-02.html: data
> html/draft-ietf-hubmib-repeater-dev-03.html: data
> html/draft-ietf-ipcdn-pktc-signaling-07.html: data
> html/draft-ietf-ippm-ipdv-01.html: data
> html/draft-ietf-isis-opexp-01.html: data
> html/draft-ietf-ldapext-matchedval-06.html: data
> html/draft-ietf-mpls-p2mp-loose-path-reopt-01.html: data
> html/draft-ietf-ospf-hmac-sha-01.html: data
> html/draft-ietf-pkix-cmmf-02.html: data
> html/draft-ietf-pkix-scvp-15.html: data
> html/draft-ietf-rmonmib-pi-ipv6-01.html: data
> html/draft-ietf-rtfm-new-traffic-flow-00.html: data
> html/draft-ietf-smime-domsec-00.html: data
> html/draft-ietf-spki-cert-req-00.html: data
> html/draft-ietf-spki-cert-structure-00.html: data
> html/draft-ihren-dnsop-interim-signed-root-01.html: data
> html/draft-klensin-dns-role-01.html: data
> html/draft-lear-ietf-rfc2026bis-00.html: data
> html/draft-leung-sigtran-stream-sctp-00.html: data
> html/draft-loughney-sctp-sig-prot-00.html: data
> html/draft-murray-auth-ftp-ssl-03.html: data
> html/draft-newnan-isomib-internet-00.html: data
> html/draft-rabbat-ccamp-carrier-survey-01.html: data
> html/draft-rfced-info-dudley-01.html: data
> html/draft-vasseur-mpls-backup-computation-02.html: data
> html/draft-vrancken-oauth-redelegation-01.html: data
> html/draft-wenzel-cctld-bcp-00.html: data
> html/draft-white-slapm-mib-00.html: data
> html/draft-worster-mpls-in-ip-02.html: data
> html/draft-zhu-rmcat-nada-02.html: data
> html/draft-zhu-rmcat-nada-03.html: data
> html/rfc1142.html: data
> html/rfc542.html: data
> html/rfc652.html: data
> html/rfc674.html: data
> html/rfc684.html: data
> html/rfc731.html: data
> html/rfc734.html: data
> html/rfc736.html: data
> html/rfc752.html: data
> html/rfc774.html: data
> html/rfc776.html: data
> html/rfc783.html: data
>
> Are these things that can be cleaned up by a future html re-rendering?
> they're likely to break html parsers or other transformations that work
> from the html source.
>
> Regards,
>
>         --dkg
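
For illustration, here is a minimal sketch of the kind of control-character
filtering discussed above. It is not the code used to generate the htmlized
versions; the function names and the choice of which characters to keep
(TAB, LF, FF, CR) are assumptions made for this example. Note that C0
control characters such as SUB (0x1A) are valid single-byte UTF-8, so a
plain UTF-8 decode check would not flag them; the sketch looks for the
bytes directly.

```python
from pathlib import Path
from typing import Dict, List

# Assumption for this sketch: keep TAB, LF, FF and CR, which occur
# legitimately in plain-text drafts and RFCs; drop every other C0
# control byte (SUB, BS, DC2, SI, ACK, FS, ...) as well as DEL.
KEEP = {0x09, 0x0A, 0x0C, 0x0D}
CONTROL_BYTES = bytes(b for b in range(0x20) if b not in KEEP) + b"\x7f"


def find_control_bytes(data: bytes) -> List[int]:
    """Return the sorted, de-duplicated unwanted control byte values in data."""
    return sorted({b for b in data if b in CONTROL_BYTES})


def strip_control_bytes(data: bytes) -> bytes:
    """Return data with every unwanted control byte deleted."""
    return data.translate(None, CONTROL_BYTES)


def report(root: str) -> Dict[str, List[int]]:
    """Map each *.html file name directly under root to the control bytes it contains."""
    results: Dict[str, List[int]] = {}
    for path in sorted(Path(root).glob("*.html")):
        found = find_control_bytes(path.read_bytes())
        if found:
            results[path.name] = found
    return results
```

Calling `report("html/")` on the rsync'ed directory would list which files
contain such bytes, and `strip_control_bytes()` could be applied to the text
before it is htmlized. Working on bytes rather than decoded strings avoids
having to guess the encoding of the older documents.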