[Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

Daniel Kahn Gillmor <dkg@fifthhorseman.net> Tue, 20 November 2018 00:50 UTC

From: Daniel Kahn Gillmor <dkg@fifthhorseman.net>
To: "tools-discuss@ietf.org" <tools-discuss@ietf.org>
Date: Mon, 19 Nov 2018 18:53:18 -0500
Message-ID: <87zhu4oc9d.fsf@fifthhorseman.net>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg="pgp-sha512"; protocol="application/pgp-signature"
Archived-At: <https://mailarchive.ietf.org/arch/msg/tools-discuss/qrKoyK9Q5_rFixW9zEWKlAb8q6k>
Subject: [Tools-discuss] non-ASCII, non-UTF-8 in rsync'ed html

I ran the following to get all html files:

     rsync -az rsync.tools.ietf.org::tools.html/ html/

But i found that several files in that repo contain octet sequences
that are neither ASCII nor valid UTF-8.

In particular, i ran:

    find html/ -iname '*.html' -print0 | xargs -0 file | grep -v 'ASCII text' | grep -v 'UTF-8 Unicode text'

and it produced this output:

html/draft-calabrese-requir-logprot-04.html:                     data
html/draft-dewinter-queue-start-00.html:                         data
html/draft-dewinter-queue-start-01.html:                         data
html/draft-duerst-iri-05.html:                                   data
html/draft-dusse-smime-msg-03.html:                              data
html/draft-felton-universal-language-00.html:                    data
html/draft-fujiwara-dnsop-resolver-update-00.html:               data
html/draft-iana-special-ipv4-03.html:                            data
html/draft-iana-special-ipv4-04.html:                            data
html/draft-ietf-2000-issue-00.html:                              data
html/draft-ietf-2000-issue-03.html:                              data
html/draft-ietf-2000-issue-04.html:                              data
html/draft-ietf-2000-issue-05.html:                              data
html/draft-ietf-2000-issue-06.html:                              data
html/draft-ietf-ccamp-mpls-tp-rsvpte-ext-associated-lsp-08.html: data
html/draft-ietf-dhc-mdhcp-00.html:                               data
html/draft-ietf-dhc-multopt-00.html:                             data
html/draft-ietf-disman-remops-mib-00.html:                       data
html/draft-ietf-disman-remops-mib-01.html:                       data
html/draft-ietf-dnsop-interim-signed-root-01.html:               data
html/draft-ietf-hubmib-mau-mib-03.html:                          data
html/draft-ietf-hubmib-mau-mib-04.html:                          data
html/draft-ietf-hubmib-repeater-dev-02.html:                     data
html/draft-ietf-hubmib-repeater-dev-03.html:                     data
html/draft-ietf-ipcdn-pktc-signaling-07.html:                    data
html/draft-ietf-ippm-ipdv-01.html:                               data
html/draft-ietf-isis-opexp-01.html:                              data
html/draft-ietf-ldapext-matchedval-06.html:                      data
html/draft-ietf-mpls-p2mp-loose-path-reopt-01.html:              data
html/draft-ietf-ospf-hmac-sha-01.html:                           data
html/draft-ietf-pkix-cmmf-02.html:                               data
html/draft-ietf-pkix-scvp-15.html:                               data
html/draft-ietf-rmonmib-pi-ipv6-01.html:                         data
html/draft-ietf-rtfm-new-traffic-flow-00.html:                   data
html/draft-ietf-smime-domsec-00.html:                            data
html/draft-ietf-spki-cert-req-00.html:                           data
html/draft-ietf-spki-cert-structure-00.html:                     data
html/draft-ihren-dnsop-interim-signed-root-01.html:              data
html/draft-klensin-dns-role-01.html:                             data
html/draft-lear-ietf-rfc2026bis-00.html:                         data
html/draft-leung-sigtran-stream-sctp-00.html:                    data
html/draft-loughney-sctp-sig-prot-00.html:                       data
html/draft-murray-auth-ftp-ssl-03.html:                          data
html/draft-newnan-isomib-internet-00.html:                       data
html/draft-rabbat-ccamp-carrier-survey-01.html:                  data
html/draft-rfced-info-dudley-01.html:                            data
html/draft-vasseur-mpls-backup-computation-02.html:              data
html/draft-vrancken-oauth-redelegation-01.html:                  data
html/draft-wenzel-cctld-bcp-00.html:                             data
html/draft-white-slapm-mib-00.html:                              data
html/draft-worster-mpls-in-ip-02.html:                           data
html/draft-zhu-rmcat-nada-02.html:                               data
html/draft-zhu-rmcat-nada-03.html:                               data
html/rfc1142.html:                                               data
html/rfc542.html:                                                data
html/rfc652.html:                                                data
html/rfc674.html:                                                data
html/rfc684.html:                                                data
html/rfc731.html:                                                data
html/rfc734.html:                                                data
html/rfc736.html:                                                data
html/rfc752.html:                                                data
html/rfc774.html:                                                data
html/rfc776.html:                                                data
html/rfc783.html:                                                data
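
The "data" label only says that file(1) could not classify the content
as text; it does not say where the problem is. A minimal sketch in
Python 3 of how the first offending octets in each file could be
located follows. The function names (first_bad_utf8, scan) are made up
for illustration, and the check is strict UTF-8 decoding, which is not
the same heuristic that file(1) uses: as far as can be told, file(1)
may also report "data" for other reasons (for instance stray control
bytes), and this sketch would not catch those.

```python
from pathlib import Path


def first_bad_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if data decodes cleanly.
    Pure ASCII is valid UTF-8, so it also returns None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


def scan(root):
    """Yield (path, offset, offending_bytes) for every *.html file
    (case-insensitive, like find -iname) under root that is not
    valid UTF-8."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".html":
            bad = first_bad_utf8(path.read_bytes())
            if bad is not None:
                yield path, bad[0], bad[1]


if __name__ == "__main__":
    # "html" is the local directory populated by the rsync command above.
    for path, offset, bad in scan("html"):
        print(f"{path}: byte {offset}: {bad!r}")
```

Run from the directory that contains html/, this prints one line per
file that fails to decode, with the byte offset of the first failure.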

Can these be cleaned up in a future HTML re-rendering? As they stand,
they're likely to break HTML parsers or other transformations that work
from the HTML source.
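
To illustrate the kind of breakage meant here, a minimal sketch in
Python 3: a consumer that decodes strictly as UTF-8 fails outright on
such a file, and the usual workaround is to fall back to lossy
decoding. The helper name decode_html is hypothetical, and whether any
given HTML parser decodes strictly is an assumption about that parser,
not something checked against these files.

```python
def decode_html(data: bytes):
    """Decode data as UTF-8.

    Returns (text, was_clean). If strict decoding fails, fall back to
    lossy decoding, where each undecodable sequence becomes U+FFFD
    (the replacement character), and report was_clean=False.
    """
    try:
        return data.decode("utf-8"), True
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="replace"), False


if __name__ == "__main__":
    text, clean = decode_html(b"<p>caf\xff</p>")
    print(clean, repr(text))
```

The fallback keeps a pipeline running, but the original octets are
lost, which is why fixing the files at the source would be preferable.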

Regards,

    --dkg