Re: [Rswg] Broken links - some data

Brian E Carpenter <brian.e.carpenter@gmail.com> Sat, 27 May 2023 03:38 UTC

Message-ID: <26f1bda8-2c0e-3f5c-55ab-7fe1cf60e561@gmail.com>
Date: Sat, 27 May 2023 15:38:33 +1200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0
Content-Language: en-US
To: Alexis Rossi <rsce@rfc-editor.org>, John C Klensin <john-ietf@jck.com>
Cc: rswg@rfc-editor.org
References: <A6EB6C3DC97A62AA1D45C29D@PSB> <245BCE33-D121-4F85-9E47-AA009E9EFA50@rfc-editor.org>
From: Brian E Carpenter <brian.e.carpenter@gmail.com>
In-Reply-To: <245BCE33-D121-4F85-9E47-AA009E9EFA50@rfc-editor.org>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/rswg/-aAQNN3gAEh8iL3BzfevKhXuslE>

On 27-May-23 13:08, Alexis Rossi wrote:
> 
>>
>> (5) Pre-web and URLs there was, IIRC, no standardized form for
>> FTP references and, despite the Berkeley Unix convention "ftp:"
>> might not have picked up all of them.  Do you have data on that
>> or do we believe that the number of FTP references in the sample
>> is small enough to be irrelevant?
> 
> The WM only started collecting data in 1996, and it did not do a great job of collecting anything other than http URLs. IA was also just learning how to crawl, so the early stuff is often sparse. My guess is that we are unlikely to be able to find direct fixes for references for non-http resources.

Well, some of the FTP ones I saw are not too hard to fix:

ftp://ds.internic.net/rfc/rfc1087

A number of RFCs used that format, before the RFC Editor was "on the web".
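
To illustrate why references in that format are easy to repair, here is a minimal sketch in Python. It assumes that the old InterNIC path /rfc/rfcNNNN can be mapped directly onto the RFC Editor's current URL scheme; the function name and the target base URL are illustrative choices, not something taken from the log or from the script that produced it.

```python
import re

# Assumption: old InterNIC references look like ftp://ds.internic.net/rfc/rfcNNNN,
# optionally with a ".txt" suffix, and map one-to-one onto RFC numbers.
_INTERNIC_RFC = re.compile(
    r"^ftp://ds\.internic\.net/rfc/rfc(\d+)(?:\.txt)?$", re.IGNORECASE
)

# Assumption: this base URL is used for illustration only.
RFC_EDITOR_BASE = "https://www.rfc-editor.org/rfc"


def rewrite_internic_rfc_url(url):
    """Return a modern URL for an old InterNIC RFC reference, or None."""
    match = _INTERNIC_RFC.match(url.strip())
    if match is None:
        return None  # not in the recognised format; leave it for manual review
    number = int(match.group(1))  # int() drops any leading zeros
    return f"{RFC_EDITOR_BASE}/rfc{number}"


print(rewrite_internic_rfc_url("ftp://ds.internic.net/rfc/rfc1087"))
```

Anything that does not match the pattern returns None, so other FTP references would still need to be handled by hand.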

Since people seem interested, I've posted the raw log file at https://github.com/becarpenter/misc/blob/main/URLcheck.log

(Note there are a few obvious bogons in there, due to the difficulties of parsing the plain text format and detecting URLs that are examples.)
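
One way such example URLs could be screened out is sketched below in Python. This is not how the checker that produced the log works; it simply assumes that many example URLs use the domain names reserved for documentation and testing in RFC 2606. The function name is hypothetical.

```python
from urllib.parse import urlsplit

# Names reserved by RFC 2606 for documentation and testing.
RESERVED_SLDS = {"example.com", "example.net", "example.org"}
RESERVED_TLDS = {"example", "test", "invalid", "localhost"}


def looks_like_example_url(url):
    """Heuristic: True if the URL's host falls under an RFC 2606 reserved name."""
    try:
        host = (urlsplit(url).hostname or "").lower().rstrip(".")
    except ValueError:
        return False  # malformed URL; leave the decision to a human
    if not host:
        return False
    labels = host.split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    return ".".join(labels[-2:]) in RESERVED_SLDS


for candidate in ("http://www.example.com/foo", "ftp://ds.internic.net/rfc/rfc1087"):
    print(candidate, looks_like_example_url(candidate))
```

A filter like this would only catch examples that use the reserved names; example URLs built on made-up hosts would still show up as bogons.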

Regards
    Brian