Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Adam Roach <adam@nostrum.com> Tue, 23 October 2012 22:27 UTC

On 10/23/12 16:56, Gonzalo Salgueiro wrote:
> "For the purposes of this document, we define 'unprintable' to mean a string of octets that: (a) contains an octet with a value in the range of 0 to 31, inclusive; (b) contains an octet with a value of 127, (c) contains any octet greater than or equal to 128 which is a formatting or control character (such as 128 to 159) within the UTF-8 character set; or (d) falls outside the UTF-8 character range, as specified by [UNICODE]."
>
> Does that sound ok?

I think we're still talking past each other here.

"Outside the UTF-8 character range" simply isn't a sensible thing to 
say. What we're talking about putting into a log record is a series of 
*octets*, not a series of *characters*. UTF-8 is an encoding that 
defines how octets are put together to make characters.

Once you start talking about the octets as if they *are* characters, 
you're conflating two very different things. So, for example, you can't 
talk about "a string of octets that... falls outside the UTF-8 character 
range."

You can talk about a string of octets that does not form a valid UTF-8
sequence, and that's almost certainly what you want to say here.
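
A minimal sketch of what such a definition could look like, assuming
Python. The function name is hypothetical and does not come from the
draft. It keeps tests (a) and (b) from the quoted text and replaces (c)
and (d) with a single "does not form a valid UTF-8 sequence" test:

```python
def is_unprintable(octets: bytes) -> bool:
    """Hypothetical helper: True if the octet string should be treated
    as unprintable under the octet-based wording sketched above."""
    # (a) and (b): any octet in the range 0..31, or equal to 127.
    if any(b < 32 or b == 127 for b in octets):
        return True
    # Replacement for (c) and (d): the octets are not valid UTF-8.
    try:
        octets.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return True
    return False
```

Note that the test never asks which characters the octets decode to,
only whether they decode at all.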

I'm also getting a bit lost in what you mean when you say "which is a 
formatting or control character (such as 128 to 159)." Keep in mind that 
we're still talking about *octets* here, not characters. In UTF-8, 
there's nothing special about an octet with a value of 128. There's 
nothing special about an octet with a value of 159. Both can appear as 
the second octet in a two-octet character. Or the second or third octet 
in a three-octet character. And so on. The same goes for everything 
between 128 and 191.
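
To illustrate the point about 128 through 191, here is a small Python
sketch (the helper names are made up for this example). It checks that
each of those octet values is accepted as the trailing octet of a
two-octet and a three-octet sequence, and that none of them is valid
UTF-8 on its own:

```python
def valid_utf8(octets: bytes) -> bool:
    """True if the octets decode as strict UTF-8."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def appears_as_trailing_octet(value: int) -> bool:
    """Check one octet value in two positions where it can legally appear."""
    two = bytes([0xC2, value])          # encodes U+0080..U+00BF
    three = bytes([0xE1, 0x80, value])  # encodes U+1000..U+103F
    return valid_utf8(two) and valid_utf8(three)

# Every value from 128 to 191 works as a trailing octet...
print(all(appears_as_trailing_octet(v) for v in range(128, 192)))
# ...and none of them is a valid sequence by itself.
print(any(valid_utf8(bytes([v])) for v in range(128, 192)))
```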

Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8. 
If we wanted to be abundantly careful, we could call those out as being 
invalid. But I think we catch those just fine if we talk about octets 
that form valid UTF-8 sequences.
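
The claim about 192, 193, and 245-255 can be checked by brute force.
The following Python sketch (function name hypothetical) encodes every
Unicode scalar value and records which octet values ever show up; it
takes a moment to run because it walks about 1.1 million code points:

```python
def octets_used_by_utf8():
    """Collect every octet value that occurs in some valid UTF-8 encoding."""
    used = set()
    for cp in range(0x110000):
        # Surrogates (U+D800..U+DFFF) are not encodable in UTF-8.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        used.update(chr(cp).encode("utf-8"))
    return used

never_used = sorted(set(range(256)) - octets_used_by_utf8())
print(never_used)
```

Any octet string containing one of the values in `never_used` already
fails a "valid UTF-8 sequence" test, so no separate rule is needed.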

Or do you mean to call out Unicode code points like U+0080 (the
Latin-1 padding character)? Because that has nothing to do with an
*octet* with a value of 128. It would be encoded as a two-octet sequence
starting with 194. However, if we intend to go down the rabbit hole of
deciding whether to Base64-encode based on which Unicode code points we
want to consider "printable," then we've got years of draft refinement
ahead of us (I can already imagine the right-to-left mark arguments).
That way lies madness.
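
The difference between a code point and an octet is easy to see in a
few lines of Python (the helper name is made up for this example):

```python
def utf8_octets(code_point: int) -> list:
    """Return the UTF-8 encoding of one code point as a list of octet values."""
    return list(chr(code_point).encode("utf-8"))

print(utf8_octets(0x0080))  # the code point U+0080: two octets, 194 then 128
print(utf8_octets(0x200F))  # RIGHT-TO-LEFT MARK: three octets, 226, 128, 143
```

The octet value 128 appears in both encodings, and in neither case does
it stand for the code point U+0080 by itself.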

All of which is a very long-winded way to say: octets are not characters
and characters are not octets, and you need to write the text in a way
that does not mix them with each other.

/a