RE: Protocol Action: 'UTF-8, a transformation format of ISO 10646' to Standard (fwd)

Anton Okmianski <aokmians@cisco.com> Thu, 14 August 2003 22:19 UTC

Date: Thu, 14 Aug 2003 18:19:47 -0400
From: Anton Okmianski <aokmians@cisco.com>
Subject: RE: Protocol Action: 'UTF-8, a transformation format of ISO 10646' to Standard (fwd)
Message-ID: <20140418112159.2560.26771.ARCHIVE@ietfa.amsl.com>

Rainer et al:

I don't claim to have definitive answers here -- just some thoughts.

Can we define a new syslog standard that is UTF-8 based?  It will be
backwards compatible for US-ASCII.

If somebody fires a US-ASCII-only message encoded in UTF-8, it will be
the same 7-bit stuff, right?  So, this will be compliant with older
syslog implementations.
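
A quick way to check that claim is a minimal Python sketch like the one
below. The sample messages and the is_seven_bit helper are made up for
illustration; they are not taken from any syslog RFC or draft.

```python
# Hypothetical sample messages; the content is illustrative only.
ascii_msg = "<34>Aug 14 18:19:47 host su: login failed"
intl_msg = "<34>Aug 14 18:19:47 host su: Anmeldung für müller fehlgeschlagen"


def is_seven_bit(data: bytes) -> bool:
    """Return True if every octet has its high bit clear."""
    return all(octet < 0x80 for octet in data)


ascii_octets = ascii_msg.encode("utf-8")
intl_octets = intl_msg.encode("utf-8")

# A US-ASCII-only message yields the same octets under both encodings.
print(ascii_octets == ascii_msg.encode("ascii"))
# Those octets are all 7-bit.
print(is_seven_bit(ascii_octets))
# Non-ASCII characters produce octets with the high bit set.
print(is_seven_bit(intl_octets))
```

So only messages that actually carry non-US-ASCII characters would put
8-bit octets on the wire.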

If somebody wants to fire a message with non US-ASCII characters in
the syslog message, then they should only be fired to a syslog daemon
implementation that supports the new standard and UTF-8.  Isn't this
OK?

Or do we want to state it is our goal that the legacy syslog
implementation should be able to receive and store internationalized
messages?  Is it strictly necessary?  Yes, it would reduce the need to
upgrade infrastructure, but it would tie us to a less compact UTF-7.
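
To give a rough feel for the size difference, here is a small Python
sketch using the utf-7 and utf-8 codecs from the standard library. The
sample string is arbitrary, and the comparison only holds for this
sample; other scripts may come out differently.

```python
# Arbitrary sample text with two non-ASCII characters (an assumption
# made for illustration, not a real syslog message).
sample = "Grüße"

utf8_octets = sample.encode("utf-8")
utf7_octets = sample.encode("utf-7")

# UTF-7 stays within 7-bit octets, at the cost of extra length here.
print(len(utf8_octets), len(utf7_octets))
print(all(octet < 0x80 for octet in utf7_octets))
```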

One other bad thing about UTF-7 is that it represents all US-ASCII
unchanged *except* for "+" character because it is used as an escape
character. I also think it has some restrictions on "\", "~" and a
trailing "-".  So, UTF-7 is not actually fully US-ASCII compatible,
while UTF-8 is. Right?  Another annoyance is that, I think,
UTF-7 allows multiple ways to encode the same thing.
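
Both points can be tried out with Python's built-in utf-7 codec. This
is only a sketch of how that one codec behaves, with made-up inputs;
other UTF-7 implementations may make different choices when encoding.

```python
# "+" is the shift character in UTF-7, so a literal "+" has to be
# escaped, while UTF-8 passes it through unchanged.
plus_utf7 = "a+b".encode("utf-7")
plus_utf8 = "a+b".encode("utf-8")
print(plus_utf7, plus_utf8)

# The same character can arrive in more than one form: directly, or
# as a base64-shifted sequence. Both decode to the same text.
direct = b"A".decode("utf-7")
shifted = b"+AEE-".decode("utf-7")
print(direct == shifted)
```

A receiver that compares or filters raw octets would therefore have to
decode first, since equal text does not imply equal octets in UTF-7.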

Because of the history of syslog where implementations appeared before
the standard, I think it may be acceptable to eventually try to
standardize things instead of supporting legacy applications that are
not even known to follow the standard anyway.

The UTF-7 specification itself suggests that UTF-8 should be used
wherever possible: "UTF-7 should normally be used only in the
context of 7 bit transports, such as mail. In other contexts, straight
Unicode or UTF-8 is preferred."  I think we can afford to say that
legacy syslog implementations do not have to deal with
internationalized messages since they were not intended to. Then, it
follows that we can afford to support UTF-8. Right?

Supporting multiple encodings could be an answer, but not an elegant
one.  It would actually mean that new implementations need to support
both UTF-7 and UTF-8.  If we supported just UTF-8, then we may not
even need any new headers in the message (except for maybe a language
which is debatable).

Just as anecdotal evidence... There is definitely momentum behind
supporting just UTF-8 in new protocols.  I was giving a related
presentation today and people were somewhat skeptical when I mentioned
potentially using UTF-7.  Adopting UTF-7 does not seem to people like a
forward-looking move.

I understand your concerns though.  Just my 2.5 cents.

Thanks for investigating all this!

Anton.





> -----Original Message-----
> From: owner-syslog-sec@employees.org
> [mailto:owner-syslog-sec@employees.org]On Behalf Of Rainer Gerhards
> Sent: Thursday, August 14, 2003 4:41 PM
> To: Chris Lonvick; syslog-sec@employees.org
> Subject: RE: Protocol Action: 'UTF-8, a transformation format of ISO
> 10646' to Standard (fwd)
>
>
> Chris and all,
>
> I am still struggling with UTF-8 & ALL syslog RFCs.
>
> http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-05.txt,
> in section 4, says:
>
> "   For the convenience of implementors using ABNF, a definition of UTF-8
>    in ABNF syntax is given here.
>
>    A UTF-8 string is a sequence of octets representing a sequence of UCS
>    characters. An octet sequence is valid UTF-8 only if it matches the
>    following syntax, which is derived from the rules for encoding UTF-8
>    and is expressed in the ABNF of [RFC2234].
>
>    UTF8-octets = *( UTF8-char )
>    UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
>    UTF8-1      = %x00-7F
>    UTF8-2      = %xC2-DF UTF8-tail
>    UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
>                  %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
>    UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
>                  %xF4 %x80-8F 2( UTF8-tail )
>    UTF8-tail   = %x80-BF
> "
>
> If you look at this definition, 8 bit characters are required. All of
> the current RFCs/IDs describe 7 bit US-ASCII only. So I don't see any
> way to use UTF-8 in the current framework.
>
> Am I missing something?
>
> Rainer
>
>
> > -----Original Message-----
> > From: Chris Lonvick [mailto:clonvick@cisco.com]
> > Sent: Thursday, August 14, 2003 3:48 PM
> > To: syslog-sec@employees.org
> > Subject: Protocol Action: 'UTF-8, a transformation format of
> > ISO 10646' to Standard (fwd)
> >
> >
> > Since we're on the subject.
> >
> > Thanks,
> > Chris
> >
> > ---------- Forwarded message ----------
> > Date: Mon, 11 Aug 2003 16:17:04 -0400
> > From: The IESG <iesg-secretary@ietf.org>
> > To: IETF-Announce:  ;
> > Cc: Internet Architecture Board <iab@iab.org>,
> >      RFC Editor <rfc-editor@rfc-editor.org>
> > Subject: Protocol Action: 'UTF-8,
> >      a transformation format of ISO 10646' to Standard
> >
> > The IESG has approved the Internet-Draft 'UTF-8, a
> > transformation format of ISO 10646'
> > <draft-yergeau-rfc2279bis-05.txt> as a Standard. This
> > document has been reviewed in the IETF but is not the product
> > of an IETF Working Group. The IESG contact person is Ted Hardie.
> >
> > Technical Summary
> >
> > This document updates the specification of UTF-8,
> > an encoding of the UCS which is designed to be
> > compatible with many current applications and protocols.
> > UTF-8 has the characteristic of preserving the full US-ASCII
> > range, providing compatibility with file systems, parsers and
> > other software that rely on US-ASCII values but are
> > transparent to other values. This memo obsoletes and replaces
> > RFC 2279.
> >
> >
> > Working Group Summary
> >
> > This draft and the interoperability reports associated with
> > it were discussed on the IETF-charsets@iana.org mailing list.
> > Archives may be found at
> > http://lists.w3.org/Archives/Public/ietf-charsets/ among other
> > places.
> >
> >
> > Protocol Quality
> >
> > This specification was reviewed for the IESG by Patrik Falstrom.
> >
> >
> >
> >
> >
>
>
