Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Peter Musgrave <musgravepj@gmail.com> Wed, 24 October 2012 17:01 UTC

From: Peter Musgrave <musgravepj@gmail.com>
To: Gonzalo Salgueiro <gsalguei@cisco.com>
Cc: "sip-clf@ietf.org Mailing" <sip-clf@ietf.org>
Subject: Re: [sip-clf] I-D Action: draft-ietf-sipclf-format-07.txt

Sorry to be so absent.

Great that we're gonna get this puppy done.

I will read through front to back this weekend.

Can I ask others to look at the diffs and comment by Monday Oct 29th?

Then it seems we'll need a minor fix up to -08 and we're good to write this
up again.

If anyone wants more time, please let me know.

Regards,

Peter Musgrave (as WG chair)


On Wed, Oct 24, 2012 at 2:30 AM, Gonzalo Salgueiro <gsalguei@cisco.com> wrote:

> Adam -
>
> Let me first state that I'm certainly no expert in UTF-8 and will happily
> defer to you on this. I certainly get your point about octets versus
> characters, especially since UTF-8 allows for multi-byte characters.  When
> I made the statement "outside the UTF-8 character range" I was referring to
> the discrete set of characters from U+0000 to U+7FFFFFFF  (or U+10FFFF if
> using the restricted RFC 3629 definition). I assumed this was a clearly
> bounded set.
>
> To be clear, my only intent was to expand your original definition of
> 'unprintable' beyond the C0 control codes (U+0000 to U+001F, U+007F) to
> also include the well-known C1 control codes (U+0080 to U+009F). To avoid
> confusion, I should have used the Unicode code point notation. I certainly
> don't want to go down the road of specifying which characters are
> 'printable', but I thought the C1 control codes were worth stating
> explicitly in the same way you did the C0 control codes. Your original
> definition seemed unnecessarily restrictive, since the C1 control codes,
> for example, are not printable, yet each encodes as a valid UTF-8 sequence
> whose octets are all greater than or equal to 128. Does that make sense?
>
> If you think your original definition of 'unprintable' is broad enough,
> then I will use that.
>
> Cheers,
>
> Gonzalo
>
> On Oct 23, 2012, at 6:26 PM, Adam Roach wrote:
>
> > On 10/23/12 16:56, Gonzalo Salgueiro wrote:
> >> "For the purposes of this document, we define 'unprintable' to mean a
> >> string of octets that: (a) contains an octet with a value in the range of 0
> >> to 31, inclusive; (b) contains an octet with a value of 127; (c) contains
> >> any octet greater than or equal to 128 which is a formatting or control
> >> character (such as 128 to 159) within the UTF-8 character set; or (d) falls
> >> outside the UTF-8 character range, as specified by [UNICODE]."
> >>
> >> Does that sound ok?
> >
> > I think we're still talking past each other here.
> >
> > "Outside the UTF-8 character range" simply isn't a sensible thing to
> > say. What we're talking about putting into a log record is a series of
> > *octets*, not a series of *characters*. UTF-8 is an encoding that defines
> > how octets are put together to make characters.
> >
> > Once you start talking about the octets as if they *are* characters,
> > you're conflating two very different things. So, for example, you can't
> > talk about "a string of octets that... falls outside the UTF-8 character
> > range."
> >
> > You can talk about a string of bytes that does not form a valid UTF-8
> > sequence, and that's almost certainly what you want to say here.
> >
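A minimal sketch of what "a string of bytes that does not form a valid UTF-8
sequence" could look like as a check, assuming Python 3. The function names
are hypothetical and not taken from the draft, and the rule shown combines
only items (a) and (b) of the proposed text with the validity test suggested
above. Python's strict decoder follows the RFC 3629 range (up to U+10FFFF).

```python
def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as UTF-8 under strict rules."""
    try:
        octets.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def is_unprintable(octets: bytes) -> bool:
    """Hypothetical rule: C0 octets (0-31), octet 127, or invalid UTF-8."""
    if any(b <= 31 or b == 127 for b in octets):
        return True
    return not is_valid_utf8(octets)


if __name__ == "__main__":
    for sample in (b"INVITE", b"a\x00b", "caf\u00e9".encode("utf-8"), b"\xff"):
        print(sample, is_unprintable(sample))
```

Note that the check is phrased entirely in terms of octets; no decision is
made about which decoded characters count as printable.
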
> > I'm also getting a bit lost in what you mean when you say "which is a
> > formatting or control character (such as 128 to 159)." Keep in mind that
> > we're still talking about *octets* here, not characters. In UTF-8, there's
> > nothing special about an octet with a value of 128. There's nothing special
> > about an octet with a value of 159. Both can appear as the second octet in
> > a two-octet character. Or the second or third octet in a three-octet
> > character. And so on. The same goes for everything between 128 and 191.
> >
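To illustrate the point about octets 128 and 159, here is a small sketch
(Python 3 assumed; the helper name and the sample characters are my own
choices for illustration) that prints the octets of a few ordinary,
printable characters.

```python
def is_continuation_octet(octet: int) -> bool:
    """Octets 128-191 (0x80-0xBF) are UTF-8 continuation octets."""
    return 0x80 <= octet <= 0xBF


if __name__ == "__main__":
    for ch in ("\u0400", "\u07df", "\u20ac"):
        print(f"U+{ord(ch):04X} ->", list(ch.encode("utf-8")))
    # U+0400 -> [208, 128]
    # U+07DF -> [223, 159]
    # U+20AC -> [226, 130, 172]
```

Octets 128 and 159 both show up as the second octet of a two-octet
character here, so an octet-level rule that singles out 128 to 159 would
also flag these sequences.
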
> > Now, octet values of 192, 193, and 245-255 won't appear in valid UTF-8.
> > If we wanted to be abundantly careful, we could call those out as being
> > invalid. But I think we catch those just fine if we talk about octets that
> > form valid UTF-8 sequences.
> >
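A quick spot check of that claim, assuming Python 3 and its strict UTF-8
decoder. This is not a proof, only a handful of placements per octet, and
the names below are hypothetical.

```python
NEVER_VALID = {192, 193} | set(range(245, 256))


def is_valid_utf8(octets: bytes) -> bool:
    """Return True if the octets decode as UTF-8 under strict rules."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def rejected_everywhere(octet: int) -> bool:
    """Try the octet alone, as a lead octet, and as a trailing octet."""
    candidates = [
        bytes([octet]),
        bytes([octet, 0x80]),
        bytes([octet, 0x80, 0x80, 0x80]),
        bytes([0xC2, octet]),
    ]
    return not any(is_valid_utf8(c) for c in candidates)


if __name__ == "__main__":
    print(all(rejected_everywhere(b) for b in NEVER_VALID))
    print(rejected_everywhere(194))  # 194 128 is a valid sequence
```

A plain validity test already rejects every placement tried here, which is
consistent with not listing those octet values separately.
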
> > Or are you meaning to call out Unicode code points like U+0080 (the
> > Latin-1 padding character)? Because that has nothing to do with an *octet*
> > with a value of 128. It would be encoded as a two-octet sequence starting
> > with 194. However, if we're intending to go down the rabbit hole of making
> > decisions about whether to Base-64 encode based on which Unicode code
> > points we want to consider "printable," then we've got years of draft
> > refinement ahead of us (I can already imagine the right-to-left mark
> > arguments). That way lies madness.
> >
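The difference between code point U+0080 and an octet with value 128 can be
seen in a short sketch (Python 3 assumed; variable names are illustrative
only). It encodes each C1 code point, U+0080 to U+009F, and shows that a
lone octet 128 only maps to U+0080 under Latin-1, not under UTF-8.

```python
# Encode every C1 control code point (U+0080..U+009F) as UTF-8.
c1_encodings = {cp: list(chr(cp).encode("utf-8")) for cp in range(0x80, 0xA0)}

if __name__ == "__main__":
    print(c1_encodings[0x80])  # [194, 128]
    print(c1_encodings[0x9F])  # [194, 159]

    # A lone octet 128 is U+0080 in Latin-1 ...
    print(bytes([128]).decode("latin-1") == "\u0080")  # True

    # ... but is not a valid UTF-8 sequence on its own.
    try:
        bytes([128]).decode("utf-8")
    except UnicodeDecodeError as err:
        print("invalid UTF-8:", err.reason)
```

Every C1 code point comes out as 194 followed by an octet in the 128 to 159
range, which may be where the "128 to 159" wording came from.
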
> > All of which is a very long-winded way to say: octets are not characters
> > and characters are not octets; and you need to write the text in a way that
> > does not mix them with each other.
> >
> > /a
> >