Re: [sip-clf] New CLF Syntax draft (text with index)

Hadriel Kaplan <> Fri, 12 June 2009 18:50 UTC

Sorry for the late comments on the draft.  Encoding it in ASCII is an ingenious concept, but I do wonder about the overhead in all this.  Personally, I care less about how much work it is for a decoder/tool to read a CLF file than about how much overhead it is for a SIP device to generate the CLF.  A few extra instructions overall to help the reader tool out are OK, but extra instructions/work per field is pushing it.  So my comments will probably seem obvious:

1) Terminating each field's contents with a TAB character is not only an extra instruction per field, but also requires a scan of the entire contents of each field to replace real TABs with spaces or whatever.  At least, it does if the goal of the encoding is to make the file usable by text-parsing tools that don't take the length values into account. (Is that the goal?)  I thought this was exactly the type of thing we were trying to avoid by using binary encoding to begin with: not having to search the received on-the-wire header content for the "special" character we use for separation in order to escape or replace it.
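
To make the cost in point 1 concrete, here is a rough sketch of what a logger would have to do for every field. It is illustrative only: the function names are made up, and replacing a TAB with a single space is an assumed policy, not something taken from the draft.

```python
def sanitize_field(value: str) -> str:
    """Scan the whole field and replace any literal TAB with a space.

    Hypothetical helper: the replacement character is an assumption.
    The point is that every byte of every field gets inspected.
    """
    return value.replace("\t", " ")


def encode_fields(fields: list[str]) -> str:
    """Join sanitized fields with TAB separators into one log line."""
    return "\t".join(sanitize_field(f) for f in fields) + "\n"


if __name__ == "__main__":
    # A header value that happens to contain a real TAB.
    line = encode_fields(["INVITE", "sip:bob@example.com", "Some\tAgent"])
    print(repr(line))
```

A length-prefixed binary record would copy each field as-is, with no per-byte inspection.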

2) Encoding the pointers/lengths in hex ASCII means an extra itoa(val,16)-type function call per length/pointer field.  I have no idea how much real work that is, but it's work that I wouldn't otherwise have to do.  It also makes the file larger, since a 2-byte value takes 4 bytes to encode.
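
A small sketch of the size difference in point 2, comparing a 16-bit value written as hex ASCII with the same value written as raw binary. The function names, the upper-case hex digits, and the big-endian byte order are assumptions made for illustration, not details from the draft.

```python
import struct


def encode_len_hex(value: int) -> bytes:
    """Encode a 16-bit value as 4 hex-ASCII characters (assumed layout)."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    return format(value, "04X").encode("ascii")


def encode_len_binary(value: int) -> bytes:
    """Encode the same value as 2 raw bytes (big-endian is an assumption)."""
    return struct.pack(">H", value)


if __name__ == "__main__":
    for v in (0, 300, 65535):
        print(v, encode_len_hex(v), encode_len_binary(v))
```

The hex form needs a number-to-text conversion per field and always takes twice the space of the binary form.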

3) This is sort of a nit, but the entire record length can be ~280 terabytes (2^48 bytes), if I'm doing the math right.  Obviously it's up to the logger if it ever wants to record that much, but I have a hard time really imagining the need for anything beyond 4GB (i.e., a 32-bit value).  So if we can encode binary as binary, the record length field could reasonably be 32 bits instead of this 48-bit thing.
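
The arithmetic behind point 3 can be checked quickly. This is just a sanity check of the two field widths mentioned above; the helper name is made up.

```python
def max_length_bytes(bits: int) -> int:
    """Largest byte count an unsigned length field of this width can hold."""
    return 2 ** bits - 1


if __name__ == "__main__":
    for bits in (32, 48):
        limit = max_length_bytes(bits)
        print(f"{bits}-bit length: up to {limit:,} bytes "
              f"(~{limit / 1e12:.3f} TB, {(limit + 1) / 2**30:,.0f} GiB)")
```

A 48-bit length tops out at about 281 TB (256 TiB), which matches the ~280 terabyte figure, while a 32-bit length tops out at 4 GiB.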

IANAE, but I think most systems these days are memory-constrained for performance, rather than instruction-constrained (i.e., accessing memory is the bottleneck, not instruction cycles).  So I don't know how much of the work above is really noticeable for systems, except for the issue of scanning field content.


> -----Original Message-----
> From: [] On Behalf
> Of Adam Roach
> Sent: Thursday, May 07, 2009 2:56 PM
> To:
> Subject: [sip-clf] New CLF Syntax draft (text with index)
> This version defines a text format in which each record is composed of
> two lines in a log file. The first line is primarily pointers into the
> second line. The second line contains the actual logged fields,
> separated by tab characters.
> This approach retains the fast-search capabilities that I've been
> advocating, while allowing the use of simple, text-based tools, as Vijay
> has been promoting. This hybrid approach does come with a slight
> increase in log file size; for example, writing out the same 100,000 log
> entries in each of the three formats proposed so far:
>    - Text:   25 Mb
>    - Binary: 29 Mb
>    - Hybrid: 37 Mb
> This is about a 20% premium over the binary format, and a 48% premium
> over the text format. Speed of generation should be on the same order of
> speed as the other two versions, and speed of processing should be
> approximately the same as processing binary.
> The new version of the document can be found here:
> /a