[Sipping] Alternate CLF syntax proposal

Adam Roach <adam@nostrum.com> Thu, 26 March 2009 02:01 UTC

Message-ID: <49CAE21C.5060309@nostrum.com>
Date: Wed, 25 Mar 2009 19:02:04 -0700
From: Adam Roach <adam@nostrum.com>
To: sipping WG <sipping@ietf.org>
Cc: draft-gurbani-sipping-clf@tools.ietf.org

In the spirit of "send text," I've put together a straw-man proposal for 
an easy-to-generate and fast-to-process extensible format for saving SIP 
log messages:

http://www.ietf.org/internet-drafts/draft-roach-sipping-clf-syntax-00.txt

As an example of the processing that can be performed on this format: 
consider that I have a large file (on the order of 1 GB of data), with 
1,232,896 records in it (to choose a nice, round number). I'd like to 
extract all the information about messages with a particular "To" value.

With a text-based format, I'll be reading and parsing 1,262,485,504 
bytes (every byte in the file, which works out to an average of 1,024 
bytes per record) in order to find delimiters.

With the format proposed in this document, I can open the file and then 
do the following about 1,232,896 times:

  - Read 4 bytes (total record length)
  - Fseek 32 bytes to reach the "To Value" pointer and length
  - Read 4 bytes (the pointer and length)
  - Fseek according to those 4 bytes to the literal value of the To 
    header field
  - Read the To header field (let's imagine it's 20 bytes)
  - Fseek to the next record (according to the total record length)
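
The steps above can be sketched in a few lines of Python. This is only 
an illustration under assumptions of my own, not the layout defined in 
the draft: I assume a 4-byte big-endian record length, then 32 bytes of 
other fixed fields, then a 4-byte field holding a 16-bit offset (from 
the start of the record) and a 16-bit length for the To value. The names 
scan_to_values and make_record are hypothetical helpers.

```python
import io
import struct

# Hypothetical layout, NOT taken from the draft:
#   bytes 0-3    total record length (unsigned 32-bit, big-endian)
#   bytes 4-35   other fixed fields, skipped here
#   bytes 36-39  To value: 16-bit offset from record start, 16-bit length
TO_POINTER_SKIP = 32  # bytes skipped after the length field


def scan_to_values(f):
    """Yield (record_offset, to_value) for each record in a seekable binary file."""
    while True:
        start = f.tell()
        header = f.read(4)                      # total record length
        if len(header) < 4:
            return                              # end of file
        (record_len,) = struct.unpack(">I", header)
        f.seek(TO_POINTER_SKIP, io.SEEK_CUR)    # skip to the To pointer and length
        to_off, to_len = struct.unpack(">HH", f.read(4))
        f.seek(start + to_off)                  # jump to the literal To value
        yield start, f.read(to_len)
        f.seek(start + record_len)              # jump to the next record


def make_record(to_value, padding=b""):
    """Build one record in the assumed layout, for demonstration only."""
    fixed = 4 + TO_POINTER_SKIP + 4             # length + skipped fields + pointer/length
    total = fixed + len(to_value) + len(padding)
    return (struct.pack(">I", total) + bytes(TO_POINTER_SKIP)
            + struct.pack(">HH", fixed, len(to_value)) + to_value + padding)


# Two records; the padding stands in for the rest of a logged message.
log = io.BytesIO(make_record(b"sip:alice@example.com", b"x" * 50)
                 + make_record(b"sip:bob@example.com", b"x" * 100))
matches = [off for off, to in scan_to_values(log) if to == b"sip:bob@example.com"]
```

Note that the loop never touches the padding bytes: per record it reads 
the length, the pointer/length pair, and the value itself, and seeks 
past everything else.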

In total, I'm reading 28 bytes per record 1,232,896 times, for a grand 
total of 34,521,088 bytes -- or about 2.7% as much data as I do with a 
text file.
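
For anyone who wants to check the arithmetic, here is a short Python 
sketch using only the figures given above. The variable names are mine, 
and the 20-byte value length is the same imagined figure as in the list 
of steps.

```python
records = 1_232_896
text_bytes = 1_262_485_504        # every byte of the text-format file

per_record = 4 + 4 + 20           # record length + pointer/length + 20-byte value
binary_bytes = records * per_record
ratio = binary_bytes / text_bytes

print(f"{binary_bytes:,} bytes read")    # 34,521,088 bytes read
print(f"{ratio:.1%} of the text file")   # 2.7% of the text file
```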

When you're dealing with terabytes of log data, and assuming processing 
time scales with the number of bytes read, this can make the difference 
between taking one minute to sift data and taking about 37 minutes 
(1,024 / 28) to do the same operation. And, of course, it has the 
advantage that you can add more (tagged) data to each record without 
adding any processing load to a lookup like this one.

/a