Re: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

Henrik Levkowetz <henrik@levkowetz.com> Tue, 20 November 2018 13:45 UTC

To: Daniel Kahn Gillmor <dkg@fifthhorseman.net>, "tools-discuss@ietf.org" <tools-discuss@ietf.org>
References: <87zhu4oc9d.fsf@fifthhorseman.net>
From: Henrik Levkowetz <henrik@levkowetz.com>
Message-ID: <45bd7024-1866-88ff-0289-157aef2cfe99@levkowetz.com>
Date: Tue, 20 Nov 2018 14:45:28 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
In-Reply-To: <87zhu4oc9d.fsf@fifthhorseman.net>
Content-Type: multipart/signed; micalg="pgp-sha256"; protocol="application/pgp-signature"; boundary="KR60lnfTbUSjS7NA8Txg74TLK7P4VrLU5"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/YhMY_TJlJR0ruc_McW3Ql382IF8>

Hi Daniel,

On 2018-11-20 00:53, Daniel Kahn Gillmor wrote:
> I ran the following to get all html files:
> 
>      rsync -az rsync.tools.ietf.org::tools.html/ html/
> 
> But i found that there were several files in that repo that contained
> non-ASCII and non-UTF-8 octet sequences.

Yes.  Many of these (possibly all, but I haven't inspected each file
individually) are due to control characters in the text documents.
SUB (^Z) seems to be the most common one, but there are BS, DC2, SI,
ACK, FS, and more.
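
As a rough illustration of how one might tally which control characters
a given file contains, here is a minimal Python sketch.  The function
name, the NAMES table and the set of characters treated as harmless
(TAB, LF, FF, CR) are assumptions made for this example; this is not
the code used on the tools servers.

```python
from collections import Counter

# Names for a few of the control characters mentioned above (plus DEL).
# Anything not listed here is reported by its hex value instead.
NAMES = {0x06: "ACK", 0x08: "BS", 0x0F: "SI", 0x12: "DC2",
         0x1A: "SUB", 0x1C: "FS", 0x7F: "DEL"}

# Assumption: TAB, LF, FF and CR are legitimate in the text documents.
ALLOWED = {0x09, 0x0A, 0x0C, 0x0D}


def control_char_counts(data: bytes) -> Counter:
    """Count the unexpected control characters found in `data`."""
    counts = Counter()
    for b in data:
        if (b < 0x20 and b not in ALLOWED) or b == 0x7F:
            counts[NAMES.get(b, f"0x{b:02X}")] += 1
    return counts
```

Calling it on the bytes of one of the flagged files (read with
open(path, "rb")) would show which characters are responsible there.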

Yes, I could have a go at filtering those out at the same time as I
generate the htmlized versions.  I'll do so for control characters
first; then we can revisit and see whether additional action is needed.
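
A filter of that kind could look roughly like the sketch below.  It is
only an illustration under stated assumptions: strip_control_chars is a
hypothetical name, and the choice to keep TAB, LF, FF and CR while
dropping every other C0 control character and DEL is a guess at a
sensible policy, not a description of the actual htmlization code.

```python
# Assumption: TAB, LF, FF (page breaks) and CR should survive filtering.
KEEP = {0x09, 0x0A, 0x0C, 0x0D}

# All other C0 control characters, plus DEL, are deleted.
DROP = bytes(b for b in range(0x20) if b not in KEEP) + b"\x7f"


def strip_control_chars(data: bytes) -> bytes:
    """Return `data` with unwanted control characters removed."""
    # bytes.translate with a table of None only deletes the listed bytes.
    return data.translate(None, DROP)
```

Working on bytes rather than decoded text avoids having to guess the
encoding of the older documents before filtering them.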


Best regards,

	Henrik

> In particular, i ran:
> 
>     find html/ -iname '*.html' -print0 | xargs -0 file | grep -v 'ASCII text' | grep -v 'UTF-8 Unicode text'
> 
> and it produced this output:
> 
> html/draft-calabrese-requir-logprot-04.html: data
> html/draft-dewinter-queue-start-00.html: data
> html/draft-dewinter-queue-start-01.html: data
> html/draft-duerst-iri-05.html: data
> html/draft-dusse-smime-msg-03.html: data
> html/draft-felton-universal-language-00.html: data
> html/draft-fujiwara-dnsop-resolver-update-00.html: data
> html/draft-iana-special-ipv4-03.html: data
> html/draft-iana-special-ipv4-04.html: data
> html/draft-ietf-2000-issue-00.html: data
> html/draft-ietf-2000-issue-03.html: data
> html/draft-ietf-2000-issue-04.html: data
> html/draft-ietf-2000-issue-05.html: data
> html/draft-ietf-2000-issue-06.html: data
> html/draft-ietf-ccamp-mpls-tp-rsvpte-ext-associated-lsp-08.html: data
> html/draft-ietf-dhc-mdhcp-00.html: data
> html/draft-ietf-dhc-multopt-00.html: data
> html/draft-ietf-disman-remops-mib-00.html: data
> html/draft-ietf-disman-remops-mib-01.html: data
> html/draft-ietf-dnsop-interim-signed-root-01.html: data
> html/draft-ietf-hubmib-mau-mib-03.html: data
> html/draft-ietf-hubmib-mau-mib-04.html: data
> html/draft-ietf-hubmib-repeater-dev-02.html: data
> html/draft-ietf-hubmib-repeater-dev-03.html: data
> html/draft-ietf-ipcdn-pktc-signaling-07.html: data
> html/draft-ietf-ippm-ipdv-01.html: data
> html/draft-ietf-isis-opexp-01.html: data
> html/draft-ietf-ldapext-matchedval-06.html: data
> html/draft-ietf-mpls-p2mp-loose-path-reopt-01.html: data
> html/draft-ietf-ospf-hmac-sha-01.html: data
> html/draft-ietf-pkix-cmmf-02.html: data
> html/draft-ietf-pkix-scvp-15.html: data
> html/draft-ietf-rmonmib-pi-ipv6-01.html: data
> html/draft-ietf-rtfm-new-traffic-flow-00.html: data
> html/draft-ietf-smime-domsec-00.html: data
> html/draft-ietf-spki-cert-req-00.html: data
> html/draft-ietf-spki-cert-structure-00.html: data
> html/draft-ihren-dnsop-interim-signed-root-01.html: data
> html/draft-klensin-dns-role-01.html: data
> html/draft-lear-ietf-rfc2026bis-00.html: data
> html/draft-leung-sigtran-stream-sctp-00.html: data
> html/draft-loughney-sctp-sig-prot-00.html: data
> html/draft-murray-auth-ftp-ssl-03.html: data
> html/draft-newnan-isomib-internet-00.html: data
> html/draft-rabbat-ccamp-carrier-survey-01.html: data
> html/draft-rfced-info-dudley-01.html: data
> html/draft-vasseur-mpls-backup-computation-02.html: data
> html/draft-vrancken-oauth-redelegation-01.html: data
> html/draft-wenzel-cctld-bcp-00.html: data
> html/draft-white-slapm-mib-00.html: data
> html/draft-worster-mpls-in-ip-02.html: data
> html/draft-zhu-rmcat-nada-02.html: data
> html/draft-zhu-rmcat-nada-03.html: data
> html/rfc1142.html: data
> html/rfc542.html: data
> html/rfc652.html: data
> html/rfc674.html: data
> html/rfc684.html: data
> html/rfc731.html: data
> html/rfc734.html: data
> html/rfc736.html: data
> html/rfc752.html: data
> html/rfc774.html: data
> html/rfc776.html: data
> html/rfc783.html: data
> 
> Are these things that can be cleaned up by a future html re-rendering?
> they're likely to break html parsers or other transformations that work
> from the html source.
> 
> Regards,
> 
>     --dkg