Atom Link Extensions Use Case

Ed Summers <ehs@pobox.com> Fri, 08 June 2012 13:11 UTC

Subject: Atom Link Extensions Use Case
From: Ed Summers <ehs@pobox.com>
To: atom-syntax <atom-syntax@imc.org>
Cc: James Snell <jasnell@gmail.com>

Hi all,

I am using Atom to syndicate access to data dumps at the Library of
Congress. We have a web application that provides access to historic
newspapers [1], and we have received requests for access to the
underlying OCR data for research and commercial purposes. Although
this is historic data, we are routinely adding new content as it is
digitized. Rather than require clients to issue millions of requests
to get at the OCR data (which is actually web addressable), the plan
is to periodically create a tarred and compressed dump file of new OCR
content and announce its availability in an Atom feed that interested
parties can subscribe to. It's a similar model to what Wikimedia does
for various Wikipedia projects [2].

Here's a minimal example to give you an idea of what I mean (warning:
the URLs don't currently resolve):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Chronicling America OCR Dumps</title>
    <link rel="self" type="application/atom+xml"
          href="http://chroniclingamerica.loc.gov/dumps/ocr/feed/" />
    <id>info:lc/ndnp/dumps/ocr</id>
    <author>
        <name>Library of Congress</name>
        <uri>http://loc.gov</uri>
    </author>
    <updated>2012-06-08T08:35:27-04:00</updated>
    <entry>
        <title>part-00001.tar.bz2</title>
        <link rel="alternate" type="application/x-bzip2"
              href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
        <id>info:lc/ndnp/dump/ocr/part-00001.tar.bz2</id>
        <updated>2012-06-07T13:57:23-04:00</updated>
        <summary type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml">OCR dump file
            <a href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2">part-00001.tar.bz2</a>
            with size 162.7 MB generated June 7, 2012, 1:57 p.m.</div></summary>
    </entry>
</feed>
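
A client could consume a feed like this with very little code. The
following is a minimal Python sketch, not part of the service itself:
the function name list_dumps is made up for illustration, and it
assumes each entry carries a rel="alternate" link pointing at the dump
file, as in the example above.

```python
import xml.etree.ElementTree as ET

# Atom namespace in ElementTree's "{uri}tag" notation.
ATOM = "{http://www.w3.org/2005/Atom}"


def list_dumps(feed_xml):
    """Return (updated, href) pairs for each entry's alternate link.

    feed_xml is the Atom document as bytes or str. The name and the
    return shape are illustrative assumptions, not an established API.
    """
    root = ET.fromstring(feed_xml)
    dumps = []
    for entry in root.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated")
        for link in entry.findall(ATOM + "link"):
            # Atom treats a link without a rel attribute as "alternate".
            if link.get("rel", "alternate") == "alternate":
                dumps.append((updated, link.get("href")))
    return dumps
```

A subscriber would fetch the feed periodically, call list_dumps on the
response body, and download any href it has not seen before.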

The reason I am writing is that I would like to add checksum
information to the feed so that clients can verify that they have
downloaded the data dump file correctly. One could argue that this is
unnecessary, since a corrupted bz2 file would likely fail to
decompress, or that the Content-MD5 HTTP header could be used instead.
But I like the idea of making an explicit assertion about the checksum
in the Atom document itself.
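
To make the client side of this concrete, here is a rough Python
sketch of the verification step. The helper names file_md5 and verify
are hypothetical, and it assumes the published checksum is a
hex-encoded MD5 digest of the complete dump file.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path.

    The file is read in chunks so that a large dump does not have to
    fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's MD5 matches the published hex digest."""
    return file_md5(path) == expected_hex.strip().lower()
```

A client would call verify() on the downloaded file with the digest
taken from the feed, and retry the download if it returns False.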

After a bit of googling I ran across James Snell's Atom Link
Extensions draft [3], which provides a pattern for including an MD5
checksum in the <link> element like so:

    <link rel="alternate" type="application/x-bzip2"
          hash="md5:579758192095fde80896058af4ce0aee"
          href="http://chroniclingamerica.loc.gov/data/dumps/ocr/part-00001.tar.bz2" />
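
For illustration, a client might pull the checksum out of such a link
along these lines. This is only a sketch in Python: parse_link_hash is
a made-up name, and it assumes the attribute holds a single
"algorithm:hexdigest" value exactly as in the example above. Anything
else the draft may allow is not handled.

```python
import xml.etree.ElementTree as ET


def parse_link_hash(link_xml):
    """Return (algorithm, hexdigest) from a <link> element's hash
    attribute, or None if the attribute is absent.

    Assumes a single "algorithm:hexdigest" value, as in the example.
    """
    link = ET.fromstring(link_xml)
    value = link.get("hash")
    if value is None:
        return None
    # Split on the first colon only: "md5:5797..." -> ("md5", "5797...").
    algorithm, _, digest = value.partition(":")
    return algorithm.lower(), digest.lower()
```

The returned digest can then be compared against one computed locally
from the downloaded file.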

Unfortunately it looks like the draft has expired. I was wondering:

a) are there other established patterns for adding checksum
information for resources in Atom?
b) is it worth it for James to update the draft and try to move it
forward to Informational status?

As more and more data providers make dumps of their data available to
reduce crawling (as Wikipedia does), this seems like a good use case
for Atom to support.

//Ed

[1] http://chroniclingamerica.loc.gov
[2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-abstract.xml-rss.xml
[3] http://tools.ietf.org/html/draft-snell-atompub-link-extensions-08