Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Adam Roach <> Tue, 23 October 2012 22:27 UTC

Date: Tue, 23 Oct 2012 17:26:57 -0500
From: Adam Roach <>
To: Gonzalo Salgueiro <>
List-Id: SIP Common Log File format discussion list <>

On 10/23/12 16:56, Gonzalo Salgueiro wrote:
> "For the purposes of this document, we define 'unprintable' to mean a string of octets that: (a) contains an octet with a value in the range of 0 to 31, inclusive; (b) contains an octet with a value of 127, (c) contains any octet greater than or equal to 128 which is a formatting or control character (such as 128 to 159) within the UTF-8 character set; or (d) falls outside the UTF-8 character range, as specified by [UNICODE]."
> Does that sound ok?

I think we're still talking past each other here.

"Outside the UTF-8 character range" simply isn't a sensible thing to 
say. What we're talking about putting into a log record is a series of 
*octets*, not a series of *characters*. UTF-8 is an encoding that 
defines how octets are put together to make characters.

Once you start talking about the octets as if they *are* characters, 
you're conflating two very different things. So, for example, you can't 
talk about "a string of octets that... falls outside the UTF-8 character 
range."

You can talk about a string of bytes that does not form a valid UTF-8 
sequence, and that's almost certainly what you want to say here.
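
As an illustration of the octet-level test described above, here is a 
minimal Python sketch. The function name `is_valid_utf8` is hypothetical 
and does not come from the draft; it simply relies on Python's strict 
UTF-8 decoder to decide whether a string of octets forms a valid UTF-8 
sequence.

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as well-formed UTF-8."""
    try:
        # Strict decoding raises on any ill-formed sequence.
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+00E9 encoded as the two-octet sequence 195, 169.
print(is_valid_utf8("caf\u00e9".encode("utf-8")))  # True
# A lone octet 233 (a three-octet lead with nothing after it).
print(is_valid_utf8(b"caf\xe9"))  # False
# A lone continuation octet with a value of 128.
print(is_valid_utf8(b"\x80"))  # False
```

Note that the question asked is about the whole octet string, not about 
any single octet value in isolation.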

I'm also getting a bit lost in what you mean when you say "which is a 
formatting or control character (such as 128 to 159)." Keep in mind that 
we're still talking about *octets* here, not characters. In UTF-8, 
there's nothing special about an octet with a value of 128. There's 
nothing special about an octet with a value of 159. Both can appear as 
the second octet in a two-octet character. Or the second or third octet 
in a three-octet character. And so on. The same goes for everything 
between 128 and 191.
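
To make this concrete, the following Python sketch prints the UTF-8 
octets of a few code points chosen purely for illustration (the helper 
name `utf8_octets` is hypothetical). Octets 128 and 159 show up as 
ordinary trailing octets of two-octet and three-octet characters.

```python
def utf8_octets(code_point: int) -> list:
    """Return the UTF-8 encoding of one code point as a list of octet values."""
    return list(chr(code_point).encode("utf-8"))


for cp in (0x00C0, 0x00DF, 0x2000, 0x27DF):
    print(f"U+{cp:04X} -> {utf8_octets(cp)}")

# U+00C0 -> [195, 128]
# U+00DF -> [195, 159]
# U+2000 -> [226, 128, 128]
# U+27DF -> [226, 159, 159]
```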

Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8. 
If we wanted to be abundantly careful, we could call those out as being 
invalid. But I think we catch those just fine if we talk about octets 
that form valid UTF-8 sequences.
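
The list of octet values that never occur can be checked by brute force. 
The sketch below (function name hypothetical) encodes every Unicode 
scalar value, skipping the surrogate range that UTF-8 cannot encode, and 
reports which of the 256 octet values are never produced.

```python
def octets_used_by_utf8() -> set:
    """Collect every octet value that occurs in the UTF-8 encoding of some scalar value."""
    # Surrogates (U+D800..U+DFFF) are not scalar values and cannot be encoded.
    scalars = (cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
    text = "".join(map(chr, scalars))
    return set(text.encode("utf-8"))


never_used = sorted(set(range(256)) - octets_used_by_utf8())
print(never_used)
# [192, 193, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255]
```

Any octet string containing one of these values fails a validity check 
on its own, which is why a separate rule for them is redundant.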

Or are you meaning to call out Unicode code points like U+0080 (the 
Latin-1 padding character)? Because that has nothing to do with an 
*octet* with a value of 128. It would be encoded as a two-octet sequence 
starting with 194. However, if we're intending to go down the rabbit 
hole of making decisions about whether to Base64-encode based on which 
Unicode code points we want to consider "printable," then we've got years 
of draft refinement ahead of us (I can already imagine the right-to-left 
mark arguments). That way lies madness.
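
The difference between the code point U+0080 and an octet with a value 
of 128 can be seen in a short Python sketch. The general-category lookup 
via the standard `unicodedata` module is included only to hint at what a 
code-point-level notion of "printable" would have to consider; it is an 
illustration, not a proposal for the draft.

```python
import unicodedata

pad = "\u0080"
encoded = list(pad.encode("utf-8"))
print(encoded)  # [194, 128]

# A code-point-level rule would need to reason about categories like these.
print(unicodedata.category(pad))       # Cc (control)
print(unicodedata.category("\u200f"))  # Cf (format; the right-to-left mark)

# The bare octet 128 is not U+0080; on its own it is not even valid UTF-8.
try:
    b"\x80".decode("utf-8")
    lone_octet_valid = True
except UnicodeDecodeError:
    lone_octet_valid = False
print(lone_octet_valid)  # False
```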

All of which is a very long-winded way to say: octets are not characters 
and characters are not octets; and you need to write the text in a way 
that does not conflate the two.