Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Peter Musgrave, Wed, 24 October 2012 17:01 UTC

Sorry to be so absent.

Great that we're gonna get this puppy done.

I will read through front to back this weekend.

Can I ask others to look at the diffs and comment by Monday Oct 29th?

Then it seems we'll need a minor fix up to -08 and we're good to write this
up again.

If anyone wants more time, please let me know.


Peter Musgrave (as WG chair)

On Wed, Oct 24, 2012 at 2:30 AM, Gonzalo Salgueiro wrote:

> Adam -
>
> Let me first state that I'm certainly no expert in UTF-8 and will happily
> defer to you on this. I certainly get your point about octets versus
> characters, especially since UTF-8 allows for multi-byte characters.  When
> I made the statement "outside the UTF-8 character range" I was referring to
> the discrete set of characters from U+0000 to U+7FFFFFFF  (or U+10FFFF if
> using the restricted RFC 3629 definition). I assumed this was a clearly
> bounded set.
>
> To be clear, my only intent was to expand your original definition of
> 'unprintable' beyond the C0 control codes (U+0000 to U+001F, U+007F) to
> also include the well known C1 control codes (U+0080 to U+009F). To avoid
> confusion, I should have used the Unicode code point notation.  I certainly
> don't want to try and go down the road of specifying what characters are
> 'printable', but I thought C1 control codes were worth stating explicitly
> in the same way you did C0 control codes. Your original definition seemed
> unnecessarily restrictive since the C1 control codes, for example, are not
> printable but are certainly a valid UTF-8 sequence of octets greater than
> or equal to 128. Does that make sense?
>
> If you think your original definition of 'unprintable' is broad enough,
> then I will use that.
>
> Cheers,
> Gonzalo
>
> On Oct 23, 2012, at 6:26 PM, Adam Roach wrote:
> > On 10/23/12 16:56, Gonzalo Salgueiro wrote:
> >> "For the purposes of this document, we define 'unprintable' to mean a
> >> string of octets that: (a) contains an octet with a value in the range of 0
> >> to 31, inclusive; (b) contains an octet with a value of 127, (c) contains
> >> any octet greater than or equal to 128 which is a formatting or control
> >> character (such as 128 to 159) within the UTF-8 character set; or (d) falls
> >> outside the UTF-8 character range, as specified by [UNICODE]."
> >>
> >> Does that sound ok?
> >
> > I think we're still talking past each other here.
> >
> > "Outside the UTF-8 character range" simply isn't a sensible thing to
> > say. What we're talking about putting into a log record is a series of
> > *octets*, not a series of *characters*. UTF-8 is an encoding that defines
> > how octets are put together to make characters.
> >
> > Once you start talking about the octets as if they *are* characters,
> > you're conflating two very different things. So, for example, you can't
> > talk about "a string of octets that... falls outside the UTF-8 character
> > range."
> >
> > You can talk about a string of bytes that does not form a valid UTF-8
> > sequence, and that's almost certainly what you want to say here.
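
A minimal Python sketch of an octet-level test along the lines Adam suggests. It assumes 'unprintable' reduces to clauses (a) and (b) of the quoted text plus "does not form a valid UTF-8 sequence"; the helper name is_unprintable is hypothetical and does not come from the draft. Python's strict decoder follows the restricted RFC 3629 definition (nothing above U+10FFFF).

```python
def is_unprintable(octets: bytes) -> bool:
    """Hypothetical helper: True if the octet string counts as 'unprintable'.

    Assumed definition (not taken verbatim from the draft):
      - any octet with a value of 0..31 or 127, or
      - the octets do not form a valid UTF-8 sequence.
    """
    # Octet-level check: values 0..31 and 127.
    if any(b <= 31 or b == 127 for b in octets):
        return True
    # Sequence-level check: strict UTF-8 decoding must succeed.
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

Note that the test never asks whether a decoded character is "printable"; it only looks at octet values and at whether the sequence decodes.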
> >
> > I'm also getting a bit lost in what you mean when you say "which is a
> > formatting or control character (such as 128 to 159)." Keep in mind that
> > we're still talking about *octets* here, not characters. In UTF-8, there's
> > nothing special about an octet with a value of 128. There's nothing special
> > about an octet with a value of 159. Both can appear as the second octet in
> > a two-octet character. Or the second or third octet in a three-octet
> > character. And so on. The same goes for everything between 128 and 191.
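
To illustrate the point about octets 128 and 159, here is a short Python sketch. The four sample characters are arbitrary choices made for this example (they are not from the thread), and octets_of and is_continuation are hypothetical helper names.

```python
def octets_of(ch: str) -> list:
    """Return the UTF-8 encoding of one character as a list of octet values."""
    return list(ch.encode("utf-8"))


def is_continuation(octet: int) -> bool:
    """True for octet values 128..191, the UTF-8 continuation range."""
    return 128 <= octet <= 191


# Arbitrary sample characters whose encodings contain 128 or 159.
two_octet_128 = octets_of("\u00c0")    # [195, 128]
two_octet_159 = octets_of("\u00df")    # [195, 159]
three_octet_128 = octets_of("\u2000")  # [226, 128, 128]
three_octet_159 = octets_of("\u27df")  # [226, 159, 159]
```

None of these four characters is a control character, yet each encoding contains an octet with a value of 128 or 159 in a non-leading position.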
> >
> > Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8.
> > If we wanted to be abundantly careful, we could call those out as being
> > invalid. But I think we catch those just fine if we talk about octets that
> > form valid UTF-8 sequences.
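
The list of octet values that never occur can be checked by brute force. The sketch below assumes the restricted RFC 3629 range (U+0000 to U+10FFFF, surrogates excluded), which is what Python's encoder implements; the variable names are illustrative only.

```python
# Encode every Unicode scalar value (surrogates U+D800..U+DFFF cannot be
# encoded as UTF-8, so they are skipped) and record which octet values occur.
all_scalars = "".join(
    chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
)
seen = set(all_scalars.encode("utf-8"))

# Octet values that never appear anywhere in any valid encoding.
never_seen = sorted(set(range(256)) - seen)
print(never_seen)
```

Under that assumption, never_seen comes out as 192, 193, and 245 through 255, matching the values named in the message.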
> >
> > Or are you meaning to call out UTF-8 code points like U+0080 (the
> > Latin-1 padding character)? Because that has nothing to do with an *octet*
> > with a value of 128. It would be encoded as a two-octet sequence starting
> > with 194. However, if we're intending to go down the rabbit hole of making
> > decisions about whether to Base-64 encode based on which UTF-8 codepoints
> > we want to consider "printable," then we've got years of draft refinement
> > ahead of us (I can already imagine the right-to-left mark arguments). That
> > way lies madness.
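
The difference between the code point U+0080 and an octet with a value of 128 can be seen directly in Python. This is only an illustrative sketch using the standard unicodedata module; the variable names are made up for this example.

```python
import unicodedata

# The C1 control code points U+0080..U+009F discussed in the thread.
c1_controls = [chr(cp) for cp in range(0x80, 0xA0)]
c1_encoded = [ch.encode("utf-8") for ch in c1_controls]

# U+0080 as a code point: two octets, the first of which is 194.
padding_octets = list("\u0080".encode("utf-8"))

# A lone octet with a value of 128 is not a valid UTF-8 sequence at all.
try:
    bytes([128]).decode("utf-8")
    bare_128_is_valid = True
except UnicodeDecodeError:
    bare_128_is_valid = False

# All C1 code points are in Unicode general category Cc (control).
c1_all_controls = all(unicodedata.category(ch) == "Cc" for ch in c1_controls)
```

Every C1 code point encodes as a two-octet sequence starting with 194, so a rule about C1 controls has to be written in terms of decoded code points, not raw octet values.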
> >
> > All of which is a very long winded way to say: octets are not characters
> > and characters are not octets; and you need to write the text in a way that
> > does not mix them with each other.
> >
> > /a
> >