Re: [art] Draft submission: Robots Exclusion Protocol

"Martin Thomson" <mt@lowentropy.net> Mon, 08 July 2019 02:58 UTC

Date: Mon, 08 Jul 2019 12:58:10 +1000
From: Martin Thomson <mt@lowentropy.net>
To: art@ietf.org
Content-Type: text/plain
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/A2qjtTa0n434Na4EaYt0jWAHx0o>
Subject: Re: [art] Draft submission: Robots Exclusion Protocol

Hi Gary,

Thanks for taking the time to do this.  It's not glamorous work by any means, but it's worth writing down this stuff.

Some minor comments here on the draft, which overall looks like it is worth doing.

Section 2 could benefit from a brief description of what the protocol is and how it is used.  That is, there is a text file that describes what a robot is discouraged from accessing.  That file comprises multiple groups, with each group applying to one or more robots (each identified by the User-Agent value it advertises), or to all robots (*).

Please be doubly, extra-specific about where case-sensitivity applies.  These textual formats are really bad for case sensitivity issues, particularly where some parts ("allow") are insensitive and other parts (URLs) are sensitive.

There are probably security considerations here for servers that serve case-insensitive resources, because this format might fail to properly exclude paths in that case.
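
A minimal sketch of that concern, in Python.  It assumes rules are matched as case-sensitive path prefixes, which is an assumption for illustration and not text from the draft; the name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Return True if any rule is a case-sensitive prefix of path.

    Assumption: plain prefix matching with no wildcards.
    """
    return any(path.startswith(rule) for rule in disallow_rules)


rules = ["/Private/"]

# The crawler compares paths character for character, so only the exact
# casing used in the rule is excluded.
print(is_disallowed("/Private/report.html", rules))
print(is_disallowed("/private/report.html", rules))
```

The first check matches the rule and the second does not.  If the server treats both paths as the same resource, a robot that follows the rules to the letter can still fetch the content through the second spelling.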

   ; parser implementors: add additional lines you need (for
   ; example Sitemaps), and be lenient when reading lines that don't
   ; conform. Apply Postel's law.

Sorry, but this is a red flag to me.  When someone says that, I interpret this as "sorry, but we were too lazy to write a proper spec here".  I know that's not the intent, and the draft already basically says what you need it to say.  Section 2.3.1.5. is most of the way there, though a SHOULD leaves too much wiggle room to be an effective specification.

I would suggest saying explicitly: "Lines that a parser does not understand MUST be ignored."  And maybe add: "For example, the 'sitemap' rule applies."  That probably also comes with some better definition of lines (in prose, maybe with reference to the EOL rule; btw, check that rule, I think it's a run-on) and how groups are defined. The latter is, as far as I can see, a problem with this format.

An important question you then need to consider is whether this is one group or two:

  User-Agent : foo
  garbage: garbage
  User-Agent: bar
  disallow: /off-limits

That is, when garbage is ignored, does that mean that "foo" would be required to respect the disallow rule?
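
To make the two readings concrete, here is a rough Python sketch.  It is not the draft's algorithm: the parse_groups function, the unknown_breaks_group flag, and the set of known keys are all made up for illustration, and it assumes keys are compared case-insensitively.

```python
# Assumption: only these keys are understood; everything else is ignored.
KNOWN = {"user-agent", "allow", "disallow"}

EXAMPLE = """\
User-Agent : foo
garbage: garbage
User-Agent: bar
disallow: /off-limits
"""


def parse_groups(text, unknown_breaks_group):
    """Parse robots.txt-like text into a list of (agents, rules) groups.

    unknown_breaks_group=False: an ignored line is treated as if absent.
    unknown_breaks_group=True: an ignored line still ends the run of
    User-Agent lines, so the next User-Agent line starts a new group.
    """
    groups = []
    agents, rules = [], []
    collecting_agents = False  # True while reading consecutive User-Agent lines

    def flush():
        # Record the group built so far, if it names at least one agent.
        if agents:
            groups.append((list(agents), list(rules)))

    for line in text.splitlines():
        if ":" not in line:
            continue  # blank or malformed line: ignored
        key, value = line.split(":", 1)
        key, value = key.strip().lower(), value.strip()
        if key not in KNOWN:
            if unknown_breaks_group:
                collecting_agents = False
            continue
        if key == "user-agent":
            if not collecting_agents:
                flush()
                agents, rules = [], []
                collecting_agents = True
            agents.append(value)
        else:
            collecting_agents = False
            rules.append((key, value))
    flush()
    return groups


print(parse_groups(EXAMPLE, unknown_breaks_group=False))
print(parse_groups(EXAMPLE, unknown_breaks_group=True))
```

Under the first reading the sketch yields a single group in which both "foo" and "bar" are bound by the disallow rule.  Under the second it yields two groups, and "foo" ends up with no rules at all.  The spec needs to say which of these is intended.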

Section 5 of https://tools.ietf.org/html/draft-bormann-dispatch-modern-network-unicode includes advice on construction of ABNF you might want to read.  That doesn't need to be an RFC for you to follow its advice.

When talking about the /robots.txt resource, it might pay to cite RFC 7320/BCP 190 and explain why this document doesn't follow the advice there.

On Mon, Jul 8, 2019, at 06:45, Gary Illyes wrote:
> Hi ART,
> 
> As you may have seen, a group has gotten together with Martijn Koster 
> to create an internet-draft which restates the Robots Exclusion 
> Protocol using updated ABNF and prescriptions. The current draft is 
> (draft-koster-rep). If you get a chance, we'd appreciate your 
> confirming that it matches your understanding of current practice and 
> that the specification is clear even for edge cases. You can send mail 
> to me to forward on to the author team or address it to the emails 
> listed in the draft. 
> 
> Thanks for your help,
> Gary Illyes
> Google Switzerland
> 
> https://datatracker.ietf.org/doc/draft-koster-rep/