[IPFIX] string vs octetArray (for non UTF-8 character sets)

Andrew Feren <andrewf@plixer.com> Wed, 24 October 2012 16:24 UTC

Hi all,

I ran into something the other day that doesn't appear to be an issue
with any current standard information elements, but given the number of
vendors exporting URL information, it seems likely to come up
eventually.

I originally gave URLs a data type of string, as this seemed like the
most appropriate type for a human-readable string, but then my parser
squawked about the following URL in an export:

http://www.plixer.com/blog/scrutinizer/the-null-scan-you're-being-watched/  <http://www.plixer.com/blog/scrutinizer/the-null-scan-you%E2%80%99re-being-watched/>

In the above URL the "you're" was seen on the wire as "you<92>re" (<92>
representing one byte, hex 92).  A lone 0x92 is not valid UTF-8, hence
the parser complaint.  I don't know what character set that is, but
Windows thinks it is an apostrophe (or RIGHT SINGLE QUOTATION MARK,
U+2019, if we were speaking Unicode).
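
For illustration, here is a minimal Python sketch of the kind of strict
check that trips on this byte. It assumes the source encoding was
Windows-1252 (Python codec name "cp1252"), which is only a guess
consistent with what Windows displays, not something confirmed from the
capture. The helper name is_valid_utf8 is made up for this example.

```python
# Fragment of the URL path as observed, with 0x92 in place of the apostrophe.
raw = b"you\x92re"

def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 0x92 is a continuation byte with no lead byte, so strict decoding fails.
utf8_ok = is_valid_utf8(raw)

# Assumption: treat the bytes as Windows-1252, where 0x92 maps to U+2019.
as_cp1252 = raw.decode("cp1252")
```

Under that assumption the decoded text contains U+2019 rather than an
ASCII apostrophe.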

This got me wondering.  What is the right thing to do when monitoring
text that is not necessarily UTF-8?

a) treat it as an octetArray?  This works, but doesn't feel quite
right.  It seems useful to have a distinction between raw bytes and
readable strings, for presentation purposes for example.

b) expect the exporter to convert to something that is UTF-8 and still
accurately reports what was observed.  (For example, in this case
converting the byte in question to an ASCII ' (<27>) looks right and is
valid UTF-8, but results in a 404, whereas converting to a real UTF-8
RIGHT SINGLE QUOTATION MARK works.)

c) define a new data type with character set information.

d) whistle and walk away

e) something else
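
To make options (b) and (c) a bit more concrete, here is a rough Python
sketch. It is not a proposal for an actual encoding. The source
character set is again assumed to be Windows-1252, and TaggedString is a
hypothetical type invented for this example, not anything defined by
IPFIX.

```python
from dataclasses import dataclass
from urllib.parse import quote

observed = b"you\x92re"      # bytes as seen on the wire
ASSUMED_CHARSET = "cp1252"   # assumption: not confirmed by the capture

# Option (b): the exporter transcodes to UTF-8 before export.
text = observed.decode(ASSUMED_CHARSET)
as_utf8 = text.encode("utf-8")                          # faithful conversion
as_ascii = text.replace("\u2019", "'").encode("ascii")  # lossy substitution

# Percent-encoding the UTF-8 bytes gives the form of the URL that resolves.
escaped = quote(as_utf8)

# Option (c): carry the character set alongside the untouched octets.
@dataclass(frozen=True)
class TaggedString:
    charset: str   # codec name the exporter believes applies
    octets: bytes  # exactly what was observed

    def text(self) -> str:
        """Decode for presentation; the raw octets stay unchanged."""
        return self.octets.decode(self.charset)

tagged = TaggedString(ASSUMED_CHARSET, observed)
```

With (b) the collector only ever sees UTF-8, but the original byte is
gone; with (c) the observation is preserved and decoding is left to
whoever presents it.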

Thoughts?
-Andrew