[IPFIX] string vs octetArray (for non UTF-8 character sets)
Andrew Feren <andrewf@plixer.com> Wed, 24 October 2012 16:24 UTC
To: IETF IPFIX Working Group <ipfix@ietf.org>
Hi all,

I ran into something the other day that doesn't appear to be an issue with any current standard information elements, but given the number of vendors exporting URL information, it seems likely to come up eventually.

I originally gave URLs a data type of string, as this seemed like the most appropriate type for a human-readable string, but then my parser squawked about the following URL in an export:

http://www.plixer.com/blog/scrutinizer/the-null-scan-you're-being-watched/
<http://www.plixer.com/blog/scrutinizer/the-null-scan-you%E2%80%99re-being-watched/>

In the above URL the "you're" was seen on the wire as "you<92>re" (<92> representing one hex byte). I don't know what character set that is, but Windows thinks it is an apostrophe (or RIGHT SINGLE QUOTATION MARK if we were speaking UTF-8).

This got me wondering: what is the right thing to do when monitoring text that is not necessarily UTF-8?

a) Treat it as an octetArray? This works, but doesn't feel quite right. It seems useful to have a distinction between raw bytes and readable strings, for presentation purposes for example.

b) Expect the exporter to convert to something that is UTF-8 and still accurately reports what was observed. (For example, in this case converting the byte in question to an ASCII ' (<27>) looks right and is valid UTF-8, but results in a 404, whereas converting to a real UTF-8 RIGHT SINGLE QUOTATION MARK works.)

c) Define a new data type with character set information.

d) Whistle and walk away.

e) Something else.

Thoughts?

-Andrew
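
For anyone wanting to reproduce the byte-level behavior described above, here is a minimal Python sketch. It rests on one assumption that is not established in the message: that the stray byte <92> came from Windows-1252 (Python codec "cp1252"), where 0x92 maps to U+2019 RIGHT SINGLE QUOTATION MARK. The names `observed` and `is_valid_utf8` are purely illustrative. It shows that the raw bytes are not valid UTF-8, and that re-encoding under that assumption yields the percent-encoded form seen in the working link, while the ASCII apostrophe substitution from option b) yields a different URL.

```python
from urllib.parse import quote

# Path segment as described on the wire: "you<92>re", <92> being one byte.
observed = b"the-null-scan-you\x92re-being-watched"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASSUMPTION: the sender used Windows-1252, where 0x92 is U+2019.
as_text = observed.decode("cp1252")
as_utf8 = as_text.encode("utf-8")

# Option b), lossy variant: swap the byte for an ASCII apostrophe (0x27).
ascii_variant = observed.replace(b"\x92", b"'")

print(is_valid_utf8(observed))       # raw bytes, as captured
print(is_valid_utf8(as_utf8))        # after cp1252 -> UTF-8 conversion
print(quote(as_utf8))                # percent-encoded UTF-8 form
print(quote(ascii_variant))          # percent-encoded ASCII-apostrophe form
```

Under that assumption the third line printed matches the %E2%80%99 form in the link above, which is consistent with the observation that only the real RIGHT SINGLE QUOTATION MARK conversion resolves. An exporter that cannot know the source character set has no reliable way to make this conversion, which is the crux of options a) through c).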