[IPFIX] string vs octetArray (for non UTF-8 character sets)

Andrew Feren <andrewf@plixer.com> Wed, 24 October 2012 16:24 UTC

Message-ID: <5088163B.8070004@plixer.com>
Date: Wed, 24 Oct 2012 12:24:27 -0400
From: Andrew Feren <andrewf@plixer.com>
To: IETF IPFIX Working Group <ipfix@ietf.org>
Subject: [IPFIX] string vs octetArray (for non UTF-8 character sets)

Hi all,

I ran into something the other day that doesn't appear to be an issue
with any current standard information elements, but given the number of
vendors exporting URL information, it seems likely to come up
eventually.

I originally gave URLs a data type of string, as this seemed like the
most appropriate type for a human-readable string, but then my parser
squawked about the following URL in an export:

http://www.plixer.com/blog/scrutinizer/the-null-scan-you%E2%80%99re-being-watched/

In the above URL the "you're" was seen on the wire as "you<92>re" (<92>
representing a single byte, hex 92). I don't know what character set
that is, but Windows thinks it is an apostrophe (or RIGHT SINGLE
QUOTATION MARK, U+2019, if we were speaking Unicode).
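
The observation above can be reproduced with a minimal sketch. It
assumes the stray byte came from Windows-1252 (Python's "cp1252"
codec), which the message itself does not establish. The `raw` value
and the `is_valid_utf8` helper are illustrative names, not part of any
exporter or collector.

```python
# Hypothetical fragment of the URL as captured on the wire.
raw = b"you\x92re"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0x92 is a UTF-8 continuation byte with no lead byte, so a
# strict UTF-8 parser rejects the value.
valid = is_valid_utf8(raw)

# ASSUMPTION: the sender used Windows-1252, where 0x92 maps to
# U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode("cp1252")
codepoint = hex(ord(as_cp1252[3]))

print(valid, codepoint)
```

Under that assumption the byte is a legitimate character in its own
character set, but the value as a whole is not a legal IPFIX string if
strings are required to be UTF-8.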

This got me wondering: what is the right thing to do when monitoring
text that is not necessarily UTF-8?

a) treat it as an octetArray? This works, but doesn't feel quite
right. It seems useful to have a distinction between raw bytes and
readable strings, for presentation purposes for example.

b) expect the exporter to convert to something that is UTF-8 and still
accurately reports what was observed? (For example, in this case
converting the byte in question to an ASCII ' (<27>) looks right and is
valid UTF-8, but results in a 404, whereas converting to a real UTF-8
RIGHT SINGLE QUOTATION MARK works.)

c) define a new data type with character set information.

d) whistle and walk away

e) something else
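
The two conversions mentioned under option (b) can be sketched as
follows. This is only an illustration under stated assumptions: the
source character set is taken to be Windows-1252 ("cp1252"), which an
exporter would have to know or guess, and the variable names are
hypothetical. It is not how any particular exporter behaves.

```python
from urllib.parse import quote

# Hypothetical path segment as observed on the wire.
observed = b"the-null-scan-you\x92re-being-watched"

# Lossy conversion: swap the byte for an ASCII apostrophe (0x27).
# The result is valid UTF-8 but no longer names the same resource.
ascii_guess = observed.replace(b"\x92", b"'").decode("utf-8")

# Transcoding: decode with the ASSUMED source character set, then
# re-encode as UTF-8 so the original character is preserved.
transcoded = observed.decode("cp1252")
utf8_bytes = transcoded.encode("utf-8")

# Percent-encoded form of the transcoded value (quote() encodes
# non-ASCII characters as UTF-8 by default).
path = quote(transcoded)

print(ascii_guess)
print(utf8_bytes)
print(path)
```

The transcoded form matches the percent-encoded URL quoted earlier in
this message, while the ASCII substitution produces a different path.
Note that transcoding only works if the exporter knows or correctly
guesses the source character set.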

Thoughts?
-Andrew
