Re: [weirds] [Regops] Search Engines Indexing RDAP Server Content

"John R Levine" <johnl@taugh.com> Fri, 29 January 2016 16:47 UTC

Date: 29 Jan 2016 11:47:50 -0500
Message-ID: <alpine.OSX.2.11.1601291146570.26863@ary.lan>
From: "John R Levine" <johnl@taugh.com>
To: weirds@ietf.org
User-Agent: Alpine 2.11 (OSX 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Archived-At: <http://mailarchive.ietf.org/arch/msg/weirds/o1rXx5Cr15AtnGvr4SRiXQFryqk>
Subject: Re: [weirds] [Regops] Search Engines Indexing RDAP Server Content
List-Id: "WHOIS-based Extensible Internet Registration Data Service \(WEIRDS\)" <weirds.ietf.org>

> So I saw a tweet from Gavin Brown (@GavinBrown) that describes how one 
> particular search engine has indexed the RDAP server of a gTLD registry 
> operator:
> 
> https://twitter.com/GavinBrown/status/692718904058191872
> 
> This is all the more reason to work on a client authentication specification 
> that includes support for varying responses based on client identity and 
> authorization. I've been working on such a specification and welcome feedback 
> on the approach:

I don't see what the problem is.  If you set up an HTTP server that 
returns interlinked data, search engines will find it and index it.  All 
the information RDAP returns to an unauthenticated query is presumably 
public, so what's the harm in making it easier to find?

But anyway, if you don't want them to do that, there are plenty of ways to 
keep them out.

The easiest is to publish a /robots.txt file.  All legit search engines 
will stay out if that's what it says.
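
As a rough sketch of what that looks like: the two-line robots.txt below 
asks every crawler to stay out of the whole site, and the function checks 
it the way a well-behaved crawler would, using Python's standard 
urllib.robotparser.  The rdap.example URL in the usage note is made up 
for illustration.

```python
from urllib.robotparser import RobotFileParser

# Contents of a /robots.txt that asks every crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, crawler_may_fetch("googlebot", 
"https://rdap.example/domain/example.com") comes back False.  The file 
only helps with crawlers that choose to honor it.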

Another is to look at the User-Agent string the client sends.  Google's 
is googlebot, Bing's is bingbot, Yahoo's is slurp.  It's easy enough to 
find a list of common spider names.  If the agent is a spider, tell it to 
go away or redirect it to a help page.
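
A minimal sketch of the agent check, assuming a short hand-picked token 
list (a real deployment would use one of the published spider lists) and 
assuming a 403 is what "go away" means; a redirect to a help page would 
do as well.  The names SPIDER_TOKENS, is_spider and status_for are 
invented for this example.

```python
# Substrings that identify a crawler's User-Agent; illustrative, not complete.
SPIDER_TOKENS = ("googlebot", "bingbot")


def is_spider(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known spider token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SPIDER_TOKENS)


def status_for(user_agent: str) -> int:
    """Pick an HTTP status: 403 for spiders, 200 for everyone else."""
    return 403 if is_spider(user_agent) else 200
```

The match is a case-insensitive substring test because crawlers usually 
embed their name inside a longer string, such as "Mozilla/5.0 
(compatible; Googlebot/2.1; +http://www.google.com/bot.html)".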

Another is to look at the Accept: header.  An RDAP client should ask for a 
JSON media type.  For a client that asks for HTML or anything else, return 
an HTML page with robots meta tags in its head saying NOINDEX and 
NOFOLLOW.
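
Roughly, and assuming the server treats application/rdap+json (the RDAP 
media type) and plain application/json as "asked for JSON", the 
negotiation could look like the sketch below.  The function names and 
the HTML page are placeholders, and the Accept parsing ignores q-values 
for brevity.

```python
# Media types treated as "the client asked for JSON" (an assumption here).
JSON_TYPES = {"application/rdap+json", "application/json"}

# Placeholder HTML page whose robots meta tag asks crawlers not to index
# the page or follow its links.
NOINDEX_HTML = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow">'
    "</head><body>This is an RDAP service.</body></html>"
)


def wants_json(accept_header: str) -> bool:
    """Return True if any media type listed in Accept is a JSON type."""
    for item in (accept_header or "").split(","):
        # Drop parameters such as ";q=0.9" and normalize case.
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type in JSON_TYPES:
            return True
    return False


def choose_response(accept_header: str, rdap_json: str) -> tuple:
    """Return (content_type, body) for the given Accept header."""
    if wants_json(accept_header):
        return "application/rdap+json", rdap_json
    return "text/html", NOINDEX_HTML
```

A browser or crawler sending "text/html,*/*;q=0.8" gets the NOINDEX 
page, while an RDAP client sending "application/rdap+json" gets the 
JSON.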

The big search engines spider at low speed from hosts all over the world 
to avoid overloading the sites they index.  You're not going to keep them 
out via authentication without also keeping out everyone else who doesn't 
have a password.  I don't think that's a good idea.

R's,
John