Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

<individual>
Ok - I know less about UTF-8 than either Adam or Gonzalo, but the rationale of being able to use text-based tools is one I support.

I read through the diffs and did not find anything else I wanted to bring up. 
</individual>

<as chair>
Once we agree on "printable", is there anything else people want to see altered?
</as chair>

Thanks, 

Peter

One nit: 
4.3 s/macine/machine/

On 2012-10-24, at 7:41 PM, Gonzalo Salgueiro <gsalguei@cisco.com> wrote:

> Thanks, Adam.  I agree with both your assumptions. I think where we diverged is whether the C1 control codes are on an equal footing with C0 as far as 'unprintability' goes.  If C1 is handled nicely by text processing tools then we can omit it from the definition. We all have a vested interest in minimizing how often we Base64 encode these optionally logged bodies and whole messages within SIP CLF, since the encoding drastically reduces the log's usefulness and can introduce information loss. Thus, I absolutely agree that we don't want an accented Latin character or other valid UTF-8 non-control characters (CJK/Cyrillic/Arabic/etc) to trigger a Base64 encoding.  At this point I think I will go with your original text for defining 'unprintable' within the context of this document.
> 
> Cheers,
> 
> Gonzalo
> 
> 
> On Oct 24, 2012, at 3:04 PM, Adam Roach wrote:
> 
>> So, I think I understand your point. (And, to be clear, I don't intend to hold myself out as a Unicode expert either). The rationale behind what I'm proposing has two main assumptions:
>> 	• The whole point of choosing a non-binary record format is allowing the use of commandline tools to work with records. Legacy commandline tools will frequently make assumptions about what octets that correspond to ASCII control characters mean (e.g. 10 and 13), and take certain actions based on them. By and large, I wouldn't expect the same kind of reaction to octets that represent extended control characters, like those in the C1 set. I would expect normal commandline tools to simply treat octets in the range of 128-255 as (relatively) non-special, and process them normally. Cursory experimentation with the commandline tools I have easily available leads me to believe that such is the case.
>> 
>> 	• I presume that we do not want to force Base-64 encoding due to the simple presence of an accented Latin character. Similarly, we probably don't want a small number of characters from the CJK/Cyrillic/Arabic/etc codepages to trigger Base-64 either. 
>> If I'm wrong about either of these assumptions, then we need to re-visit what I'm proposing.
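The two assumptions above can be sketched as a check over octets. This is only an illustration, assuming the definition under discussion is "contains an octet in 0-31, contains octet 127, or does not form a valid UTF-8 sequence"; the function name is_unprintable is hypothetical and is not taken from the draft.

```python
def is_unprintable(octets: bytes) -> bool:
    """Return True if the octet string would need Base64 under the assumed rule."""
    # ASCII control octets (0-31) and DEL (127) are the ones legacy
    # commandline tools tend to act on, e.g. 10 (LF) and 13 (CR).
    if any(b < 32 or b == 127 for b in octets):
        return True
    # Anything that is not a valid UTF-8 sequence is also treated as unprintable.
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# An accented Latin character or CJK text does not trigger Base64.
print(is_unprintable("café".encode("utf-8")))
print(is_unprintable("日本語".encode("utf-8")))
# A C1 control such as U+0085 is valid UTF-8 with no octet below 32,
# so it is not flagged under this rule.
print(is_unprintable("\u0085".encode("utf-8")))
# CR/LF and invalid octets are flagged.
print(is_unprintable(b"line1\r\nline2"))
print(is_unprintable(b"\xff"))
```

Note that the check works purely on octets and on UTF-8 validity; it makes no judgment about which Unicode characters count as "printable".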
>> 
>> /a
>> 
>> 
>> On 10/24/12 1:30 AM, Gonzalo Salgueiro wrote:
>>> Adam -
>>> 
>>> Let me first state that I'm certainly no expert in UTF-8 and will happily defer to you on this. I certainly get your point about octets versus characters, especially since UTF-8 allows for multi-byte characters.  When I made the statement "outside the UTF-8 character range" I was referring to the discrete set of characters from U+0000 to U+7FFFFFFF  (or U+10FFFF if using the restricted RFC 3629 definition). I assumed this was a clearly bounded set.
>>> 
>>> To be clear, my only intent was to expand your original definition of 'unprintable' beyond the C0 control codes (U+0000 to U+001F, U+007F) to also include the well known C1 control codes (U+0080 to U+009F). To avoid confusion, I should have used the Unicode code point notation.  I certainly don't want to try and go down the road of specifying what characters are 'printable', but I thought C1 control codes were worth stating explicitly in the same way you did C0 control codes. Your original definition seemed unnecessarily restrictive since the C1 control codes, for example, are not printable but are certainly a valid UTF-8 sequence of octets greater than or equal to 128. Does that make sense?
>>> 
>>> If you think your original definition of 'unprintable' is broad enough, then I will use that.
>>> 
>>> Cheers,
>>> 
>>> Gonzalo
>>> 
>>> On Oct 23, 2012, at 6:26 PM, Adam Roach wrote:
>>> 
>>> 
>>>> On 10/23/12 16:56, Gonzalo Salgueiro wrote:
>>>> 
>>>>> "For the purposes of this document, we define 'unprintable' to mean a string of octets that: (a) contains an octet with a value in the range of 0 to 31, inclusive; (b) contains an octet with a value of 127, (c) contains any octet greater than or equal to 128 which is a formatting or control character (such as 128 to 159) within the UTF-8 character set; or (d) falls outside the UTF-8 character range, as specified by [UNICODE]."
>>>>> 
>>>>> Does that sound ok?
>>>>> 
>>>> I think we're still talking past each other here.
>>>> 
>>>> "Outside the UTF-8 character range" simply isn't a sensible thing to say. What we're talking about putting into a log record is a series of *octets*, not a series of *characters*. UTF-8 is an encoding that defines how octets are put together to make characters.
>>>> 
>>>> Once you start talking about the octets as if they *are* characters, you're conflating two very different things. So, for example, you can't talk about "a string of octets that... falls outside the UTF-8 character range."
>>>> 
>>>> You can talk about a string of bytes that does not form a valid UTF-8 sequence, and that's almost certainly what you want to say here.
>>>> 
>>>> I'm also getting a bit lost in what you mean when you say "which is a formatting or control character (such as 128 to 159)." Keep in mind that we're still talking about *octets* here, not characters. In UTF-8, there's nothing special about an octet with a value of 128. There's nothing special about an octet with a value of 159. Both can appear as the second octet in a two-octet character. Or the second or third octet in a three-octet character. And so on. The same goes for everything between 128 and 191.
>>>> 
>>>> Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8. If we wanted to be abundantly careful, we could call those out as being invalid. But I think we catch those just fine if we talk about octets that form valid UTF-8 sequences.
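The octet ranges mentioned above can be tabulated with a small sketch. The helper name octet_role is hypothetical; the ranges follow the RFC 3629 form of UTF-8, and the cross-check relies on Python's built-in strict UTF-8 decoder.

```python
def octet_role(value: int) -> str:
    """Classify a single octet value by the role it can play in UTF-8."""
    if value <= 0x7F:
        return "ascii"          # 0-127: a complete one-octet character
    if value <= 0xBF:
        return "continuation"   # 128-191: second or later octet of a character
    if value in (0xC0, 0xC1) or value >= 0xF5:
        return "invalid"        # 192, 193, 245-255: never appear in valid UTF-8
    return "lead"               # 194-244: first octet of a multi-octet character


invalid_values = [v for v in range(256) if octet_role(v) == "invalid"]
print(invalid_values)

# Octets 128 and 159 are ordinary continuation octets, for example:
print(list("À".encode("utf-8")))   # U+00C0, second octet is 128
print(list("ß".encode("utf-8")))   # U+00DF, second octet is 159


def decodes_as_utf8(octets: bytes) -> bool:
    """Report whether the octets form a valid UTF-8 sequence."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

As the message says, a plain "is this a valid UTF-8 sequence" check already rejects every octet in the invalid group, so they do not need to be listed separately.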
>>>> 
>>>> Or do you mean to call out Unicode code points like U+0080 (the Latin-1 padding character)? Because that has nothing to do with an *octet* with a value of 128. It would be encoded as a two-octet sequence starting with 194. However, if we're intending to go down the rabbit hole of making decisions about whether to Base-64 encode based on which Unicode code points we want to consider "printable," then we've got years of draft refinement ahead of us (I can already imagine the right-to-left mark arguments). That way lies madness.
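The difference between the code point U+0080 and an octet with a value of 128 can be seen in a few lines. This is a small illustration using only Python's built-in codecs; the variable names are arbitrary.

```python
# The code point U+0080 is a character; its UTF-8 encoding is two octets.
pad_character = "\u0080"
pad_octets = list(pad_character.encode("utf-8"))
print(pad_octets)  # two octets, the first of which is 194

# A lone octet with value 128 is not a character in UTF-8 at all:
# it is a continuation octet with no lead octet before it.
lone_octet = bytes([128])
try:
    lone_octet.decode("utf-8")
    lone_octet_is_valid = True
except UnicodeDecodeError:
    lone_octet_is_valid = False
print(lone_octet_is_valid)

# Only in a single-octet encoding such as Latin-1 does octet 128 map
# directly to U+0080, which is likely where the conflation comes from.
print(lone_octet.decode("latin-1") == pad_character)
```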
>>>> 
>>>> All of which is a very long-winded way to say: octets are not characters and characters are not octets; and you need to write the text in a way that does not mix them with each other.
>>>> 
>>>> /a
>>>> 
>>>> 
>> 
>> 
> 