Re: Atom Link Extensions Use Case

Tim Bray <twbray@google.com> Fri, 08 June 2012 14:48 UTC

From: Tim Bray <twbray@google.com>
Date: Fri, 08 Jun 2012 07:41:00 -0700
Message-ID: <CA+ZpN27_XfQnedj1v0BgS7G1BLR2Yq5ETkwROLXnCZbSEJLZ5A@mail.gmail.com>
Subject: Re: Atom Link Extensions Use Case
To: Ed Summers <ehs@pobox.com>
Cc: atom-syntax <atom-syntax@imc.org>, James Snell <jasnell@gmail.com>

Why not just drop an element into the <entry> in your own namespace?  This
doesn’t feel like any kind of a link to me.

<feed xmlns:loc="http://whatever.loc.gov">
  ...
  <entry>
    ...
    <loc:checksum>3c89ea593c01483fd091</loc:checksum>
    ...
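
A minimal sketch of how a client might read such an extension element, assuming Python's standard xml.etree.ElementTree. The namespace URI is the placeholder from the snippet above, the function name entry_checksums is hypothetical, and the checksum algorithm is left unspecified because the snippet does not name one.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
LOC_NS = "http://whatever.loc.gov"  # placeholder namespace URI from the snippet above

# Cut-down feed for illustration; a real feed carries the other Atom elements too.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:loc="http://whatever.loc.gov">
  <entry>
    <title>part-00001.tar.bz2</title>
    <loc:checksum>3c89ea593c01483fd091</loc:checksum>
  </entry>
</feed>"""


def entry_checksums(feed_xml):
    """Map each entry title to the text of its loc:checksum child, if present."""
    root = ET.fromstring(feed_xml)
    ns = {"atom": ATOM_NS, "loc": LOC_NS}
    result = {}
    for entry in root.findall("atom:entry", ns):
        title = entry.findtext("atom:title", default="", namespaces=ns)
        checksum = entry.findtext("loc:checksum", default=None, namespaces=ns)
        if checksum is not None:
            result[title] = checksum.strip()
    return result


if __name__ == "__main__":
    print(entry_checksums(FEED))
```

Consumers that don't know the loc namespace can skip the element, so the extra child does not interfere with ordinary feed readers.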

On Fri, Jun 8, 2012 at 6:04 AM, Ed Summers <ehs@pobox.com> wrote:

>
> Hi all,
>
> I am using Atom to syndicate access to data dumps at the Library of
> Congress. We have a web application that provides access to historic
> newspapers [1], and we have received requests for access to the
> underlying OCR data for research and commercial purposes. Despite the
> fact that this is historic data, we are routinely adding new content
> as it is digitized. Rather than require clients to issue millions of
> requests to get at the OCR data (which is actually web addressable)
> the plan is to periodically create a tarred and compressed dump file
> of new OCR content, and publish the availability of the file in an
> Atom feed, which interested parties can subscribe to. It's a similar
> model to what Wikimedia does for various Wikipedia projects [2].
>
> Here's a minimal example, to give you an idea of what I mean (warning:
> URLs don't currently resolve):
>
> <?xml version="1.0" encoding="utf-8"?>
> <feed xmlns="http://www.w3.org/2005/Atom">
>    <title>Chronicling America OCR Dumps</title>
>    <link rel="self" type="application/atom+xml"
>          href="http://chroniclingamerica.loc.gov/dumps/ocr/feed/" />
>    <id>info:lc/ndnp/dumps/ocr</id>
>    <author>
>        <name>Library of Congress</name>
>        <uri>http://loc.gov</uri>
>    </author>
>    <updated>2012-06-08T08:35:27-04:00</updated>
>    <entry>
>        <title>part-00001.tar.bz2</title>
>        <link rel="alternate" type="application/x-bzip2"
>              href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
>        <id>info:lc/ndnp/dump/ocr/part-00001.tar.bz2</id>
>        <updated>2012-06-07T13:57:23-04:00</updated>
>        <summary type="xhtml"><div
> xmlns="http://www.w3.org/1999/xhtml">OCR dump file <a
> href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2">part-00001.tar.bz2</a>
> with size 162.7 MB generated June 7, 2012, 1:57 p.m.</div></summary>
>    </entry>
> </feed>
>
> So the reason why I am writing here is that I would like to add
> checksum information to the feed to let clients verify that they have
> downloaded the data dump file correctly. An argument could be made
> that it's not necessary since a corrupted bz2 file would likely not
> decompress. An argument could also be made that the Content-MD5 header
> could be used. But I like the idea of making an explicit assertion
> about the checksum in the Atom document.
>
> After a bit of googling I ran across James Snell's Atom Link
> Extensions draft [3], which provides a pattern for including an md5
> checksum in the <link> element like so:
>
>    <link rel="alternate" type="application/x-bzip2"
>          hash="md5:579758192095fde80896058af4ce0aee"
>          href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
>
> Unfortunately it looks like the draft has expired. I was wondering:
>
> a) are there other established patterns for adding checksum
> information for resources in Atom
> b) if it's worth it for James to update the draft and try to push it
> forward to Informational status
>
> As more and more data providers make dumps of their data available to
> reduce crawling (like Wikipedia) it seems like a good use case for
> Atom to support.
>
> //Ed
>
> [1] http://chroniclingamerica.loc.gov
> [2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml-rss.xml
> [3] http://tools.ietf.org/html/draft-snell-atompub-link-extensions-08
>
>
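
Whichever way the checksum is carried, the client-side check Ed describes could look roughly like the sketch below. It assumes the value has the "algorithm:hexdigest" shape shown in the quoted hash attribute; the function names parse_hash_attr and file_matches are hypothetical, and the draft's full attribute grammar is not reproduced here.

```python
import hashlib


def parse_hash_attr(value):
    """Split a value like 'md5:5797...' into (algorithm, hex_digest)."""
    algorithm, sep, digest = value.partition(":")
    if not sep or not algorithm or not digest:
        raise ValueError(f"expected 'algorithm:digest', got {value!r}")
    return algorithm.lower(), digest.lower()


def file_matches(path, hash_attr, chunk_size=1 << 20):
    """Return True if the file at path hashes to the digest in hash_attr."""
    algorithm, expected = parse_hash_attr(hash_attr)
    hasher = hashlib.new(algorithm)  # raises ValueError for unknown algorithms
    with open(path, "rb") as f:
        # Read in chunks so a multi-hundred-megabyte dump is never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected
```

For the entry above, a client would call file_matches("part-00001.tar.bz2", "md5:579758192095fde80896058af4ce0aee") after the download completes.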