Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Sorry to be so absent.

Great that we're gonna get this puppy done.

I will read through front to back this weekend.

Can I ask others to look at the diffs and comment by Monday Oct 29th?

Then it seems we'll need a minor fix-up in -08 and we're good to write this
up again.

If anyone wants more time, please let me know.

Regards,

Peter Musgrave (as WG chair)

On Wed, Oct 24, 2012 at 2:30 AM, Gonzalo Salgueiro <gsalguei@cisco.com> wrote:

> Adam -
>
> Let me first state that I'm certainly no expert in UTF-8 and will happily
> defer to you on this. I certainly get your point about octets versus
> characters, especially since UTF-8 allows for multi-byte characters.  When
> I made the statement "outside the UTF-8 character range" I was referring to
> the discrete set of characters from U+0000 to U+7FFFFFFF  (or U+10FFFF if
> using the restricted RFC 3629 definition). I assumed this was a clearly
> bounded set.
>
> To be clear, my only intent was to expand your original definition of
> 'unprintable' beyond the C0 control codes (U+0000 to U+001F, U+007F) to
> also include the well-known C1 control codes (U+0080 to U+009F). To avoid
> confusion, I should have used the Unicode code point notation.  I certainly
> don't want to go down the road of specifying which characters are
> 'printable', but I thought the C1 control codes were worth stating
> explicitly, in the same way you did the C0 control codes. Your original
> definition seemed unnecessarily restrictive since the C1 control codes, for
> example, are not printable yet are encoded as valid UTF-8 sequences of
> octets that are all greater than or equal to 128. Does that make sense?
>
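
To make the code-point view above concrete, here is a minimal sketch in Python. It assumes that "control code" means Unicode general category Cc, which is an assumption of this example and not something the draft states. The helper name is hypothetical.

```python
import unicodedata


def control_code_points():
    """Return every code point whose Unicode general category is Cc."""
    return [cp for cp in range(0x110000)
            if unicodedata.category(chr(cp)) == "Cc"]


controls = control_code_points()
# Split the result at the ASCII boundary to compare with the C0/C1 ranges.
c0_and_del = [cp for cp in controls if cp <= 0x7F]
c1 = [cp for cp in controls if cp >= 0x80]
```

Under that assumption, the list splits into U+0000 to U+001F plus U+007F on one side and U+0080 to U+009F on the other, matching the two ranges named in the message.
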
> If you think your original definition of 'unprintable' is broad enough,
> then I will use that.
>
> Cheers,
>
> Gonzalo
>
> On Oct 23, 2012, at 6:26 PM, Adam Roach wrote:
>
> > On 10/23/12 16:56, Gonzalo Salgueiro wrote:
> >> "For the purposes of this document, we define 'unprintable' to mean a
> >> string of octets that: (a) contains an octet with a value in the range
> >> of 0 to 31, inclusive; (b) contains an octet with a value of 127; (c)
> >> contains any octet greater than or equal to 128 which is a formatting or
> >> control character (such as 128 to 159) within the UTF-8 character set;
> >> or (d) falls outside the UTF-8 character range, as specified by
> >> [UNICODE]."
> >>
> >> Does that sound ok?
> >
> > I think we're still talking past each other here.
> >
> > "Outside the UTF-8 character range" simply isn't a sensible thing to
> > say. What we're talking about putting into a log record is a series of
> > *octets*, not a series of *characters*. UTF-8 is an encoding that
> > defines how octets are put together to make characters.
> >
> > Once you start talking about the octets as if they *are* characters,
> > you're conflating two very different things. So, for example, you can't
> > talk about "a string of octets that... falls outside the UTF-8 character
> > range."
> >
> > You can talk about a string of bytes that does not form a valid UTF-8
> > sequence, and that's almost certainly what you want to say here.
> >
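
One possible reading of the octet-based wording suggested above, sketched in Python. The function name is hypothetical, and the rule set (any octet in 0 to 31, any octet equal to 127, or octets that do not form a valid UTF-8 sequence) is an assumption drawn from this thread, not the draft's final text.

```python
def is_unprintable(octets: bytes) -> bool:
    """Classify a string of octets under the assumed three-part rule."""
    # (a) any octet in the range 0..31, or (b) any octet equal to 127
    if any(o < 32 or o == 127 for o in octets):
        return True
    # Octets that do not form a valid UTF-8 sequence
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

Under this sketch, b"\xc2\x80" (the encoding of U+0080) is not flagged: it is valid UTF-8 and contains no octet below 32 or equal to 127. That is exactly the C1 case raised elsewhere in the thread.
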
> > I'm also getting a bit lost in what you mean when you say "which is a
> > formatting or control character (such as 128 to 159)." Keep in mind that
> > we're still talking about *octets* here, not characters. In UTF-8,
> > there's nothing special about an octet with a value of 128. There's
> > nothing special about an octet with a value of 159. Both can appear as
> > the second octet in a two-octet character. Or the second or third octet
> > in a three-octet character. And so on. The same goes for everything
> > between 128 and 191.
> >
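
A small Python check of the point about octets 128 to 191. The lead octets 0xC2 and 0xE2 0x82 are arbitrary choices for this illustration, and validity here means "accepted by Python's strict UTF-8 decoder".

```python
def is_valid_utf8(octets: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Octets 128..191 have the bit pattern 10xxxxxx (continuation octets).
continuation = range(128, 192)

# Each one is acceptable as the second octet of a two-octet character...
as_second_octet = all(is_valid_utf8(bytes([0xC2, v])) for v in continuation)
# ...and as the third octet of a three-octet character...
as_third_octet = all(is_valid_utf8(bytes([0xE2, 0x82, v])) for v in continuation)
# ...but none of them is a valid sequence on its own.
on_their_own = any(is_valid_utf8(bytes([v])) for v in continuation)
```
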
> > Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8.
> > If we wanted to be abundantly careful, we could call those out as being
> > invalid. But I think we catch those just fine if we talk about octets
> > that form valid UTF-8 sequences.
> >
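
The list of octet values that never occur can be checked by brute force. This Python sketch assumes the RFC 3629 range (U+0000 to U+10FFFF, surrogates excluded). The function name is hypothetical.

```python
def octets_used_by_utf8():
    """Collect every octet value that occurs in some valid UTF-8 encoding."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used


# Octet values that appear in no valid encoding at all.
never_used = sorted(set(range(256)) - octets_used_by_utf8())
```

A decoder that rejects invalid sequences therefore rejects these octets without needing a separate rule for them.
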
> > Or are you meaning to call out Unicode code points like U+0080 (the
> > Latin-1 padding character)? Because that has nothing to do with an
> > *octet* with a value of 128. It would be encoded as a two-octet sequence
> > starting with 194. However, if we're intending to go down the rabbit
> > hole of making decisions about whether to Base-64 encode based on which
> > Unicode code points we want to consider "printable," then we've got
> > years of draft refinement ahead of us (I can already imagine the
> > right-to-left mark arguments). That way lies madness.
> >
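
The difference between the code point U+0080 and an octet with value 128 is easy to see in Python. The variable names are illustrative only.

```python
pad = "\u0080"  # the code point U+0080

# In UTF-8 the code point becomes two octets, the first of which is 194.
utf8_octets = list(pad.encode("utf-8"))

# In Latin-1 the same code point is the single octet 128, which is
# likely where the confusion between the two comes from.
latin1_octets = list(pad.encode("latin-1"))

# A lone octet 128 is not a valid UTF-8 sequence.
try:
    bytes([128]).decode("utf-8")
    lone_128_is_valid = True
except UnicodeDecodeError:
    lone_128_is_valid = False
```
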
> > All of which is a very long-winded way to say: octets are not characters
> > and characters are not octets; and you need to write the text in a way
> > that does not mix them with each other.
> >
> > /a
> >
>
> _______________________________________________
> sip-clf mailing list
> sip-clf@ietf.org
> https://www.ietf.org/mailman/listinfo/sip-clf
>