[Tools-discuss] Data sources / alternatives to screen scraping

<Pasi.Eronen@nokia.com> Mon, 15 February 2010 08:01 UTC

Hi all,

As Russ announced on the ietf-announce list a month ago, the upcoming
datatracker user interface changes will break scripts that do
"screen-scraping", or try to parse the HTML pages to extract some
information.

Much of the information is also available in forms that are intended
to be parseable by scripts; however, those data sources aren't exactly
well documented (and it's not always easy to tell which of them are
authoritative primary data, and which are derived).

I've now updated the data source documentation on the Tools wiki
(http://trac.tools.ietf.org/group/tools/trac/wiki/DataSources).
Here's a quick summary of the most important primary data sources:

---------

Status of all internet-drafts (tab-separated text; generated from the
database by a cron job once a day):
http://www.ietf.org/id/all_id.txt
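
A minimal sketch of how a script might read a tab-separated file like
this one in Python. The column layout isn't described in this mail, so
the code just keeps each row as a list of fields. Skipping lines that
start with "#" is an assumption about header comments, and the function
name and sample data are made up for illustration.

```python
import csv
import io


def parse_tab_separated(text, comment_prefix="#"):
    """Split tab-separated text into rows of fields.

    Blank lines and lines starting with comment_prefix are skipped
    (the comment convention is an assumption, not something the
    data source documents here).
    """
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith(comment_prefix):
            continue
        rows.append(fields)
    return rows


# Made-up sample data; the real file's columns may differ.
sample = (
    "# header comment\n"
    "draft-example-foo-00\t2010-02-15\tActive\n"
    "\n"
    "draft-example-bar-03\t2009-11-01\tExpired\n"
)
rows = parse_tab_separated(sample)
```

In a real script the text would come from fetching the URL above (for
example with urllib.request) instead of the inline sample.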

Title/authors/abstract/date for active internet-drafts (text;
generated from the database by cron jobs once a day. Although these
text files were probably originally intended mainly for human
consumption, their format has been very stable over the years, and as
they're currently used by a number of tools, I would expect this to
continue):
http://www.ietf.org/id/1id-index.txt
http://www.ietf.org/id/1id-abstracts.txt

All RFCs (XML/text; note that the XML has much more information than
the text version):
http://www.rfc-editor.org/rfc/rfc-index.xml
http://www.rfc-editor.org/rfc/rfc-index.txt
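
As a rough illustration of using the XML version, the sketch below
maps document IDs to titles. The element names "rfc-entry", "doc-id"
and "title" are assumptions about the index schema (check the actual
file before relying on them), and the function names are made up. The
code ignores XML namespaces so it doesn't depend on the exact
namespace URI.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def rfc_titles(xml_text):
    """Map document IDs to titles for every <rfc-entry> element.

    Element names are assumed, not taken from a published schema.
    """
    titles = {}
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "rfc-entry":
            continue
        # Collect the text of the entry's direct children by local name.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in elem}
        if "doc-id" in fields:
            titles[fields["doc-id"]] = fields.get("title", "")
    return titles


# Small hand-written sample in the assumed shape.
sample_index = """<rfc-index xmlns="http://www.rfc-editor.org/rfc-index">
  <rfc-entry>
    <doc-id>RFC0001</doc-id>
    <title>Host Software</title>
  </rfc-entry>
</rfc-index>"""
titles = rfc_titles(sample_index)
```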

Documents that are currently in IETF last call (Atom feed; generated
from the database on-the-fly):
http://datatracker.ietf.org/feed/last-call/
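
For the Atom feeds, something along these lines should work with just
the Python standard library. This is only a sketch: it assumes each
entry carries the standard Atom "title" and "updated" elements, and
the function name and sample feed are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The Atom namespace defined by RFC 4287.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def atom_entries(xml_text):
    """Return (title, updated) pairs for each entry in an Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="",
                               namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="",
                                 namespaces=ATOM_NS)
        entries.append((title.strip(), updated.strip()))
    return entries


# Hand-written sample feed; real feed contents will differ.
sample_feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>draft-example-foo-02</title>
    <updated>2010-02-15T08:00:00Z</updated>
  </entry>
</feed>"""
entries = atom_entries(sample_feed)
```

Matching on the Atom namespace rather than on the page layout is the
point here: the script depends only on the feed format, not on how the
HTML pages happen to look.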

Documents that are on the agenda of upcoming IESG telechats
(tab-separated text; generated from the database on-the-fly):
http://datatracker.ietf.org/iesg/agenda/documents.txt

Documents in the RFC editor queue (XML):
http://www.rfc-editor.org/queue2.xml

Detailed datatracker history for a particular draft (Atom feed;
generated from the database on-the-fly):
http://datatracker.ietf.org/feed/comments/draft-ietf-msec-newtype-keyid/

Information about active WGs, chairs, mailing lists, charters, etc
(text; generated from the database on-the-fly):
http://datatracker.ietf.org/wg/summary.txt
http://datatracker.ietf.org/wg/1wg-charters.txt

IPR disclosures by draft (tab-separated text; generated from the
database on-the-fly):
http://datatracker.ietf.org/ipr/by-draft/ 

---------

The upcoming datatracker UI changes will affect only HTML pages,
not any of the text/Atom/etc. URLs listed above.
 
Best regards,
Pasi