Re: Google Scholar, was How to pay $47 for a copy of RFC 793

Harald Alvestrand <harald@alvestrand.no> Tue, 10 May 2011 21:01 UTC

Message-ID: <4DC9A5BD.1070509@alvestrand.no>
Date: Tue, 10 May 2011 22:53:17 +0200
From: Harald Alvestrand <harald@alvestrand.no>
To: John C Klensin <john-ietf@jck.com>
Subject: Re: Google Scholar, was How to pay $47 for a copy of RFC 793
References: <20110510152851.40727.qmail@joyce.lan> <4DC95CBE.60304@alvestrand.no> <1C26E7D5-1810-4B13-B51B-A1220121531F@vpnc.org> <4DC9824C.2070109@alvestrand.no> <457217AD26982E40EEF2C057@PST.JCK.COM>
In-Reply-To: <457217AD26982E40EEF2C057@PST.JCK.COM>
Cc: John Levine <johnl@iecc.com>, Paul Hoffman <paul.hoffman@vpnc.org>, ietf@ietf.org

On 05/10/2011 10:08 PM, John C Klensin wrote:
>
> --On Tuesday, May 10, 2011 20:22 +0200 Harald Alvestrand
> <harald@alvestrand.no>  wrote:
>
>>> If only there was someone who worked at Google on this list
>>> who could send an internal message to get this rectified....
>>> :-)
>> From what I could tell from the instructions, Scholar is
>> using some heuristics to figure out that "this is a paper" and
>> "this is not a paper". The highest one on the list was a
>> 3-slide presentation that really didn't say very much - I
>> think this is one where heuristics had failed.
>> I think someone at the site could help them a lot more.
> Harald,
>
> I'm not sure what you mean by "someone at the site".  Certainly,
> various of us could explain to them why the series should be
> more comprehensibly indexed.  But with Maps as a notable
> exception, I've found that suggesting that a particular
> heuristic is failing, or that something should have been indexed
> that isn't, is most likely to get a response whose essence is
> that the Google folks and their algorithms are ever so much smarter
> than us lusers, so what could we possibly know?
The instructions at Scholar were pretty comprehensive and specific:

- Make either your abstracts or your documents into HTML
- Put a very specific selection of tags into your documents
- Report your collection to the Scholar robot

We can either ignore this particular set of instructions, and get the 
result that the heuristics generate, or follow this set of instructions, 
and hope for a better result.
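
To make the first two steps concrete, here is a rough sketch of what a generated abstract page might look like. The citation_* tag names are what I recall from Scholar's inclusion guidelines and should be checked against the current version; the function name, the placeholder URL and the body text are made up for illustration, and this is not anything the RFC publisher actually runs.

```python
from html import escape


def scholar_abstract_page(title, authors, date, pdf_url, summary):
    """Return a minimal HTML abstract page carrying citation_* meta tags.

    The tag names are assumed from Scholar's inclusion guidelines;
    this helper itself is hypothetical.
    """
    meta = [("citation_title", title)]
    # One citation_author tag per author, in order.
    meta += [("citation_author", author) for author in authors]
    meta += [
        ("citation_publication_date", date),
        ("citation_pdf_url", pdf_url),
    ]
    tags = "\n".join(
        '  <meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in meta
    )
    return (
        "<html>\n<head>\n"
        "  <title>{0}</title>\n{1}\n"
        "</head>\n<body>\n"
        "  <h1>{0}</h1>\n  <p>{2}</p>\n"
        "</body>\n</html>\n"
    ).format(escape(title), tags, escape(summary))


page = scholar_abstract_page(
    title="Transmission Control Protocol",
    authors=["Postel, J."],
    date="1981/09",
    pdf_url="https://example.org/rfc/rfc793.pdf",  # placeholder, not a real location
    summary="RFC 793, September 1981.",
)
print(page)
```

Generating one such page per RFC from the existing index would cover the first two steps; reporting the collection to the robot remains a manual step for whoever controls the site.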

My point (if I have any) is that those instructions should be easy to 
follow for the people who control these sites, but are not so easy for 
anyone else (unless they want to act as if they are an "official mirror").

That puts the ball in the RFC publisher's court.
> Of course, my personal heuristic, and that of many folks I know
> who use Scholar much more intensely than I do, is that if a
> Scholar search fails or produces nonsense, I go to the
> general-purpose search engine.   For RFCs, it tends to do very
> well, both at finding the right stuff and at ranking the RFC
> text itself near the top.
>
> So, other than being lazy about not doing the second search,
> pedantic about what Scholar should be indexing and how, or
> demanding and expecting a more perfect universe, I'm not sure I
> see a real problem in this.
>
>      john
>
>