[sip-ops] SIP-CLF: Results on ASCII vs. binary representation

"Vijay K. Gurbani" <vkg@alcatel-lucent.com> Wed, 29 April 2009 14:31 UTC

Return-Path: <vkg@alcatel-lucent.com>
X-Original-To: sip-ops@core3.amsl.com
Delivered-To: sip-ops@core3.amsl.com
Received: from localhost (localhost [127.0.0.1]) by core3.amsl.com (Postfix) with ESMTP id 07C4D3A67F3; Wed, 29 Apr 2009 07:31:39 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.598
X-Spam-Level:
X-Spam-Status: No, score=-1.598 tagged_above=-999 required=5 tests=[AWL=-0.858, BAYES_20=-0.74]
Received: from mail.ietf.org ([64.170.98.32]) by localhost (core3.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id r9nbOZzqpNkk; Wed, 29 Apr 2009 07:31:37 -0700 (PDT)
Received: from ihemail2.lucent.com (ihemail2.lucent.com [135.245.0.35]) by core3.amsl.com (Postfix) with ESMTP id 5538D3A6809; Wed, 29 Apr 2009 07:30:47 -0700 (PDT)
Received: from umail.lucent.com (h135-3-40-61.lucent.com [135.3.40.61]) by ihemail2.lucent.com (8.13.8/IER-o) with ESMTP id n3TEW8PP003014 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 29 Apr 2009 09:32:08 -0500 (CDT)
Received: from [135.185.236.17] (il0015vkg1.ih.lucent.com [135.185.236.17]) by umail.lucent.com (8.13.8/TPES) with ESMTP id n3TEW8vi017596; Wed, 29 Apr 2009 09:32:08 -0500 (CDT)
Message-ID: <49F864E8.20005@alcatel-lucent.com>
Date: Wed, 29 Apr 2009 09:32:08 -0500
From: "Vijay K. Gurbani" <vkg@alcatel-lucent.com>
Organization: Bell Labs Security Technology Research Group
User-Agent: Thunderbird 2.0.0.6 (Windows/20070728)
MIME-Version: 1.0
To: sip-ops@ietf.org, dispatch@ietf.org
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.57 on 135.245.2.35
Subject: [sip-ops] SIP-CLF: Results on ASCII vs. binary representation
X-BeenThere: sip-ops@ietf.org
X-Mailman-Version: 2.1.9
Precedence: list
List-Id: SIP Operations <sip-ops.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/listinfo/sip-ops>, <mailto:sip-ops-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/sip-ops>
List-Post: <mailto:sip-ops@ietf.org>
List-Help: <mailto:sip-ops-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/sip-ops>, <mailto:sip-ops-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Apr 2009 14:31:39 -0000

[Bcc'd to sipping]

Hello:

During the SF IETF, the SIP CLF work [2] garnered support
and attention; the minutes of the ad-hoc are archived in [1].

While there was near universal support for having a common
log format, there was a lot of discussion about whether
the format should be text or binary, the argument for binary
being that it should be much faster to search.  An option
for text generation is in [2] and an option for
binary generation is in [3].

We realized the question is not "binary vs. text?" but
"should we optimize for log generation vs. optimize for
log processing?"  To that extent, this email is to socialize
the performance results we have obtained for generating
both binary and ASCII formats, including a simulation of a
worst-case analysis by retrieving the last record from large
binary and ASCII files.

To get these results, we generated 1 Million SIP CLF entries
into an ASCII file and the same 1 Million into a binary file.
The ASCII file followed the convention of [2] and the binary
file that of [3].  The last entry in these files was a SIP
request with a special Call-ID.  We measured the time it took
to search for the special Call-ID in both the binary and ASCII
files.  Here are the results, followed by some discussion; the
source code to the programs that generated these results is
also available (see [4].)

Total records in binary and ASCII CLF file: 1 Million
File size:
   Binary: 300,999,984 bytes
   ASCII:  258,999,984 bytes

Time taken to generate the CLF file with 1 Million records:
   Binary CLF: 138.60s
   ASCII CLF:    7.26s

   This is a difference of almost 20x in favor of the ASCII CLF.

Time taken to seek to the last record of the CLF file:
   Binary CLF:   3.08s
   ASCII CLF:   16.55s (using perl v5.6.1)
                42.92s (using perl v.5.8 and v5.10)

   The ASCII CLF seek is five times slower using perl v5.6.1,
   and 13x slower using perl v5.8 and v5.10.  It looks like later
   versions of perl may have inadvertently made the regex compiler
   less optimized.  We don't know why.

   The above data is from experiments ran on an Intel dual-core
   (T2500 @ 2.00 GHz) IBM T60 laptop running Linux 2.6.27 with 1
   GByte of memory.
   We also ran the programs on a more powerful machine: Intel
   dual-core (X6800 @ 2.93GHz) machines with 8GB RAM and a
   Linux 2.6.24 kernel.  The results scaled accordingly.

Clearly, the biggest difference in the above data is the time
taken to produce the CLF file.  ASCII is a lightweight approach
since the SIP entity producing the ASCII CLF file already has
the SIP message in text form.  It is then just a matter of
writing the fields out on disk.  With the binary form, the
SIP entity producing the binary CLF file has to calculate
offsets, which takes a non-negligible amount of time.  Since
the entity producing the SIP CLF log file should not be over-
burdened with the act of producing it, we feel that ASCII CLF
generation is the only choice here (i.e., we should optimize for
log generation.)  Otherwise, the SIP entity producing the
binary CLF file will spend an inordinate time in calculating
offsets, creating a table of contents, etc. to the detriment
of providing the service it is supposed to.

That said, it is also clear that the the worst-case search
for a record is at five to 13x slower when using ASCII.  But,
because searching is done offline, we feel that this sub-optimality
can well be tolerated.  We also feel that there is value in
specifying a binary format because it allows for SIP operators
who want to do such searches to convert their ASCII files to
binary for optimized traversal and other such uses.  A binary
format  must be defined so that offline processes can convert
the captured ASCII data to binary format for optimized
traversal.

Comments and discussions on these results are welcome.  If
you find any errors in the programs used to generate
these results, please do let us know.

[1] Thread "[Sipping] Meeting Minutes: Ad-hoc Common Log Format
  meeting," IETF SIPPING WG, March 27, 2009.  Archived at:
  http://www.ietf.org/mail-archive/web/sipping/current/msg17199.html

[2] V. Gurbani, E. Burger, T. Anjali, H. Abdelnur and O. Festor,
  "The Common Log File (CLF) format for the Session Initiation
  Protocol (SIP)," IETF Internet-Draft, work in progress, March 9,
  2009.  Archived at:
  http://tools.ietf.org/html/draft-gurbani-sipping-clf-01

[3] A. Roach, "Binary Syntax for SIP Common Log Format," IETF
  Internet-Draft, work in progress, March 25, 2009.  Archived at:
  http://tools.ietf.org/html/draft-roach-sipping-clf-syntax-00.

[4] Source code available at the following URLs; please see
  comment block in clf-write.c on how to generate ASCII and
  binary CLF files.
  http://ect.bell-labs.com/who/vkg/IETF/sip-clf/write-clf.c
  http://ect.bell-labs.com/who/vkg/IETF/sip-clf/clf.h
  http://ect.bell-labs.com/who/vkg/IETF/sip-clf/read-clf-record.c
  http://ect.bell-labs.com/who/vkg/IETF/sip-clf/read-clf-record.pl
  http://ect.bell-labs.com/who/vkg/IETF/sip-clf/Makefile

Thanks,

- vijay
-- 
Vijay K. Gurbani, Bell Laboratories, Alcatel-Lucent
1960 Lucent Lane, Rm. 9C-533, Naperville, Illinois 60566 (USA)
Email: vkg@{alcatel-lucent.com,bell-labs.com,acm.org}
Web:   http://ect.bell-labs.com/who/vkg/