Re: [sip-ops] [dispatch] SIP-CLF: Extensibility considerations (was Results on ASCII vs. binary representation)

Adam Roach <adam@nostrum.com> Thu, 30 April 2009 22:17 UTC

Message-ID: <49FA23CD.9060000@nostrum.com>
Date: Thu, 30 Apr 2009 17:18:53 -0500
From: Adam Roach <adam@nostrum.com>
To: "Vijay K. Gurbani" <vkg@alcatel-lucent.com>
References: <49FA0526.4010000@nostrum.com> <49FA142E.7060607@alcatel-lucent.com>
In-Reply-To: <49FA142E.7060607@alcatel-lucent.com>
Cc: "sip-ops@ietf.org" <sip-ops@ietf.org>, "dispatch@ietf.org" <dispatch@ietf.org>
Subject: Re: [sip-ops] [dispatch] SIP-CLF: Extensibility considerations (was Results on ASCII vs. binary representation)

Vijay K. Gurbani wrote:
> Adam Roach wrote:
>> Even knowing the list of purposes, if we go with a text format
>> similar to what is proposed, we are going to be forced to nail down
>> the complete set of log record fields now, with little hope for
>> backwards-compatible extensibility in the future. Admittedly, we
>> *could* go to a tagged text format (e.g. where the fields are
>> explicitly labeled instead of being inferred by position) to address
>> this shortcoming, but that's not what's being proposed at the moment.
>
> The tagged format will add further latency to an ASCII format,
> so I did not include it. In the best case, I am looking for an
> ASCII format that is amenable to taking a line and using a
> regexp to break it down to its constituent fields.

That's going to be pretty difficult with character escaping. You'll need 
powerful regex ninja skills to deal with the text format once you figure 
out how to handle things like spaces embedded within fields.
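
To make the escaping problem concrete, here is a minimal Python sketch. The record layout, the backslash-escape convention, and the function names are all assumptions made for illustration; neither draft is being quoted here.

```python
import re

# Hypothetical record: space-delimited fields, where a backslash escapes
# a literal space or backslash inside a field (an assumed convention).
line = r"1241129933 INVITE sip:bob@example.com Alice\ Smith 200"

def split_naive(record):
    # Splits on every space, including the escaped one inside a field.
    return record.split(" ")

def split_escaped(record):
    # A field is a run of escaped pairs (backslash + any character) or
    # characters that are neither a space nor a backslash.
    fields = re.findall(r"(?:\\.|[^ \\])+", record)
    # Drop the escaping backslashes to recover the original values.
    return [re.sub(r"\\(.)", r"\1", f) for f in fields]

print(split_naive(line))    # the display name is cut into two fields
print(split_escaped(line))  # the display name survives as one field
```

The naive split is what a one-line regexp tends to look like; the second version is already harder to read, and it still does not cover empty fields or embedded newlines.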


> Remember that we are not the consumers of the log file, rather
> it is the people who will be feeding SIP servers. And given
> that constituency, I think they'd rather prefer to write tools
> that operate on ASCII.

I don't think the constituent consumers will be particularly upset by 
the difference between:

   grep 'iaMb87@evil.example.com' sip-clf.log

and

   clfdump sip-clf.log | grep 'iaMb87@evil.example.com'


> Given the updated results and all, I will stop arguing on the
> grounds of efficiency. Reading binary CLF was always more
> efficient, and so is producing the binary CLF.
...
> But, there are and will be
> SIP servers that do not carry an internal binary representation
> of the SIP message. We will, in essence, force these servers
> to do so just to produce binary CLF. And that is a big tradeoff.


That's pure nonsense. You first acknowledge that the proposed binary 
format is faster to read and faster to write... and then make an 
argument that seems to be predicated on the opposite of that fact.

The servers will need to separate the components of interest out for 
both the format described in your document and the format described in 
mine. The binary format doesn't take any ASCII fields and turn them into 
a binary representation (with the exception of CSeq, but a single 
string-to-integer transformation -- one that almost every implementation 
needs to perform anyway -- certainly can't be what you're objecting to). 
The only real processing difference is how field and record delimitation 
is handled -- and we already know that the speed difference between the 
two approaches is negligible.
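
As a rough illustration of that delimitation difference, the following Python sketch starts from the same already-separated fields and writes them both ways. The field values, the escape rules, and the 2-byte length prefix are assumptions chosen for the example; this is not the layout from either document.

```python
import struct

# Fields the server has already pulled out of the SIP message (sample data).
fields = [b"INVITE", b"sip:bob@example.com", b"Alice Smith"]

def encode_text(fields):
    # Delimited form: escape backslashes and spaces, join with spaces,
    # and end the record with a newline.
    escaped = [f.replace(b"\\", b"\\\\").replace(b" ", b"\\ ") for f in fields]
    return b" ".join(escaped) + b"\n"

def encode_length_prefixed(fields):
    # Length-prefixed form: a 2-byte field count, then each field as a
    # 2-byte big-endian length followed by the raw bytes. No escaping.
    out = [struct.pack("!H", len(fields))]
    for f in fields:
        out.append(struct.pack("!H", len(f)) + f)
    return b"".join(out)

def decode_length_prefixed(buf):
    # Walk the buffer by lengths; field contents are never inspected.
    (count,) = struct.unpack_from("!H", buf, 0)
    offset = 2
    result = []
    for _ in range(count):
        (n,) = struct.unpack_from("!H", buf, offset)
        offset += 2
        result.append(buf[offset:offset + n])
        offset += n
    return result
```

In both cases the field values stay as the ASCII bytes they arrived as; only the framing around them differs. The text encoder in this sketch does not handle embedded newlines, which a real format would also have to escape.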


> A binary CLF can always be produced from an ASCII one using
> offline transformations.

And vice-versa -- at least, modulo the extensible tagging mechanisms 
inherent in the binary format.
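
A small Python sketch of why tagging helps with extensibility: a reader that only knows some tags can skip the rest by length. The tag numbers, the 1-byte tag plus 2-byte length header, and the names below are hypothetical, picked only to show the idea.

```python
import struct

# Tags this (older) reader understands; the numbers are made up.
KNOWN_TAGS = {1: "method", 2: "call_id"}

def encode_tlv(items):
    # Each item is written as: 1-byte tag, 2-byte big-endian length, value.
    return b"".join(struct.pack("!BH", tag, len(v)) + v for tag, v in items)

def decode_known(buf):
    # Collect the fields this reader knows; step over anything else.
    record = {}
    offset = 0
    while offset < len(buf):
        tag, length = struct.unpack_from("!BH", buf, offset)
        offset += 3
        value = buf[offset:offset + length]
        offset += length
        if tag in KNOWN_TAGS:
            record[KNOWN_TAGS[tag]] = value
    return record

# A newer writer adds tag 99; the older reader still parses the record.
buf = encode_tlv([(1, b"INVITE"), (99, b"future"), (2, b"iaMb87@evil.example.com")])
print(decode_known(buf))
```

A positional text line has no equivalent of the skip step: a field added in the middle shifts every field after it for readers written against the old layout.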

> It is just that producing an ASCII
> CLF is low-impact since the messages that enter and exit
> a SIP server are ASCII to begin with.

Unless you're dumping literal SIP messages (which is not what you're 
proposing), you're still having to parse out the fields and rearrange 
them. This is true for BOTH formats. The impact is not significantly 
higher for one versus the other, and you've acknowledged that empirical 
data confirms this fact.

> Before going down the path of mandating a binary-only option,
> I would at the very least like us to understand the tradeoffs
> of the decision and keep in mind who the ultimate consumers
> of the log file are.

While I'm sure some of the ops guys will appreciate the fact that they 
get to take a 20-minute coffee and/or smoking break every time they 
launch a query against a 60-million-record log, there's probably a 
larger percentage of them who would like to be able to do their jobs 
efficiently.

At least, that's what I suspect.

/a