Re: Atom Link Extensions Use Case

James M Snell <jasnell@gmail.com> Fri, 08 June 2012 14:48 UTC

From: James M Snell <jasnell@gmail.com>
Date: Fri, 08 Jun 2012 07:39:20 -0700
Message-ID: <CABP7RbduNRpCZ2aTEqKd+TtUmVKmYFVHihzfZDBzZaV=kjAbhQ@mail.gmail.com>
Subject: Re: Atom Link Extensions Use Case
To: Ed Summers <ehs@pobox.com>
Cc: atom-syntax <atom-syntax@imc.org>

Good stuff. I can definitely resurrect the draft and move it forward
if there's enough interest.

- James

On Fri, Jun 8, 2012 at 6:04 AM, Ed Summers <ehs@pobox.com> wrote:
> Hi all,
>
> I am using Atom to syndicate access to data dumps at the Library of
> Congress. We have a web application that provides access to historic
> newspapers [1], and we have received requests for access to the
> underlying OCR data for research and commercial purposes. Despite the
> fact that this is historic data, we are routinely adding new content
> as it is digitized. Rather than require clients to issue millions of
> requests to get at the OCR data (which is actually web addressable),
> the plan is to periodically create a tarred and compressed dump file
> of new OCR content and publish the availability of the file in an
> Atom feed, which interested parties can subscribe to. It's a similar
> model to what Wikimedia does for various Wikipedia projects [2].
>
> Here's a minimal example to give you an idea of what I mean (warning:
> the URLs don't currently resolve):
>
> <?xml version="1.0" encoding="utf-8"?>
> <feed xmlns="http://www.w3.org/2005/Atom">
>    <title>Chronicling America OCR Dumps</title>
>    <link rel="self" type="application/atom+xml"
>          href="http://chroniclingamerica.loc.gov/dumps/ocr/feed/" />
>    <id>info:lc/ndnp/dumps/ocr</id>
>    <author>
>        <name>Library of Congress</name>
>        <uri>http://loc.gov</uri>
>    </author>
>    <updated>2012-06-08T08:35:27-04:00</updated>
>    <entry>
>        <title>part-00001.tar.bz2</title>
>        <link rel="alternate" type="application/x-bzip2"
>              href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
>        <id>info:lc/ndnp/dump/ocr/part-00001.tar.bz2</id>
>        <updated>2012-06-07T13:57:23-04:00</updated>
>        <summary type="xhtml">
>            <div xmlns="http://www.w3.org/1999/xhtml">OCR dump file
>            <a href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2">part-00001.tar.bz2</a>
>            with size 162.7 MB generated June 7, 2012, 1:57 p.m.</div>
>        </summary>
>    </entry>
> </feed>
>
> So the reason why I am writing here is that I would like to add
> checksum information to the feed to let clients verify that they have
> downloaded the data dump file correctly. An argument could be made
> that it's not necessary since a corrupted bz2 file would likely not
> decompress. An argument could also be made that the Content-MD5 header
> could be used. But I like the idea of making an explicit assertion
> about the checksum in the Atom document.
>
> After a bit of googling I ran across James Snell's Atom Link
> Extensions draft [3], which provides a pattern for including an MD5
> checksum in the <link> element, like so:
>
>    <link rel="alternate" type="application/x-bzip2"
>          hash="md5:579758192095fde80896058af4ce0aee"
>          href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
>
> Unfortunately it looks like the draft has expired. I was wondering:
>
> a) are there other established patterns for adding checksum
> information for resources in Atom?
> b) is it worth it for James to update the draft and try to push it
> forward to Informational status?
>
> As more and more data providers make dumps of their data available to
> reduce crawling (as Wikipedia does), this seems like a good use case
> for Atom to support.
>
> //Ed
>
> [1] http://chroniclingamerica.loc.gov
> [2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml-rss.xml
> [3] http://tools.ietf.org/html/draft-snell-atompub-link-extensions-08
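
For illustration, here is a minimal sketch of how a client might use the hash attribute quoted above to check a downloaded dump file. It assumes the attribute holds a single "algorithm:hexdigest" token exactly as in Ed's example and that the algorithm name is one Python's hashlib accepts. The function names parse_link_hashes and verify_file are hypothetical and do not come from the draft.

```python
import hashlib
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_link_hashes(atom_xml):
    """Map each link href to (algorithm, hex digest) for links with a hash attribute.

    Assumption: the attribute is un-namespaced and holds one
    "algorithm:hexdigest" token, as in the quoted example.
    """
    root = ET.fromstring(atom_xml)
    hashes = {}
    for link in root.iter(ATOM_NS + "link"):
        value = link.get("hash")
        if value is None:
            continue  # link carries no checksum assertion
        algorithm, _, digest = value.partition(":")
        hashes[link.get("href")] = (algorithm.lower(), digest.lower())
    return hashes


def verify_file(path, algorithm, expected_digest, chunk_size=1 << 20):
    """Hash the file in chunks and compare against the advertised digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a multi-hundred-megabyte dump
        # does not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_digest
```

A client would fetch the feed, call parse_link_hashes on it, download each href to a local path, and then call verify_file with the algorithm and digest recorded for that href, retrying the download if the check fails.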