Re: RFC 2047 and gatewaying

Bruce Lilly <blilly@erols.com> Thu, 09 January 2003 17:29 UTC

Message-ID: <3E1DB169.6080203@alex.blilly.com>
Date: Thu, 09 Jan 2003 12:29:13 -0500
From: Bruce Lilly <blilly@erols.com>
Reply-To: ietf-822@imc.org
Organization: Bruce Lilly
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021130
X-Accept-Language: en-us
MIME-Version: 1.0
To: Charles Lindsey <chl@clw.cs.man.ac.uk>
CC: ietf-822@imc.org
Subject: Re: RFC 2047 and gatewaying
References: <ylk7hnwi2d.fsf@windlord.stanford.edu> <8d9he9cmw-B@khms.westfalen.de> <20030104033518.GA16177@ussenterprise.ufp.org> <yln0mho6dl.fsf@windlord.stanford.edu> <3E1731A6.5030604@alex.blilly.com> <H8As9G.5C2@clw.cs.man.ac.uk>
In-Reply-To: <H8As9G.5C2@clw.cs.man.ac.uk>
Content-Type: text/plain; charset="us-ascii"; format="flowed"
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: by amavis-milter (http://amavis.org/)
Sender: owner-ietf-822@mail.imc.org
Precedence: bulk
List-Archive: <http://www.imc.org/ietf-822/mail-archive/>
List-ID: <ietf-822.imc.org>
List-Unsubscribe: <mailto:ietf-822-request@imc.org?body=unsubscribe>

Charles Lindsey wrote:
> In <3E1731A6.5030604@alex.blilly.com> Bruce Lilly <blilly@erols.com> writes:
> 
> 
>>It's worse than that; there are at least 3 different versions of UTF-8.
>>They differ in the longer multi-byte sequences.
> 
> 
> No, there is precisely one, as defined by the relevant Unicode documents.
> See RFC 2044 and draft-yergeau-rfc2279bis-02.txt, with all of which Usefor
> is fully compatible.

Charles, in spite of having been shown the differences in the past, you persist
in claiming that there are none.  RFC 2044 is not a Unicode document, and has
long been obsoleted.  One clue that you are wrong is the following quotation
from Unicode Technical Report #28:

"Most notable among the corrigenda to the Standard is a further tightening of the definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification of UTF-8 more completely into line with other specifications of UTF-8."

Obviously, if the Unicode Consortium states unequivocally that there are
multiple utf-8 specifications which differ, there cannot be "precisely
one" utf-8 specification.

Here, once again:

Unicode 2.0, table A-3 (applies through Unicode 3.0):

Unicode Value       1st Byte  2nd Byte  3rd Byte  4th Byte
000000000xxxxxxx    0xxxxxxx
00000yyyyyxxxxxx    110yyyyy  10xxxxxx
zzzzyyyyyyxxxxxx    1110zzzz  10yyyyyy  10xxxxxx
110110wwwwzzzzyy +
110111yyyyxxxxxx    11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
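
To make the last row of that table concrete, here is a minimal Python sketch of the surrogate-pair case. It is only an illustration, not text from the Unicode standard. The function name is hypothetical, and the relation uuuuu = wwww + 1 is an assumption that the table as quoted does not state; it follows from adding 0x10000 when a surrogate pair is combined into one value.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Map a UTF-16 surrogate pair to the 4-byte form in the table's last row."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # high = 110110wwwwzzzzyy, low = 110111yyyyxxxxxx
    wwww = (high >> 6) & 0xF
    zzzz = (high >> 2) & 0xF
    yy = high & 0x3
    yyyy = (low >> 6) & 0xF
    xxxxxx = low & 0x3F
    uuuuu = wwww + 1              # assumption: accounts for the 0x10000 offset
    yyyyyy = (yy << 4) | yyyy     # the six y bits are split across the pair
    return bytes([
        0xF0 | (uuuuu >> 2),                  # 11110uuu
        0x80 | ((uuuuu & 0x3) << 4) | zzzz,   # 10uuzzzz
        0x80 | yyyyyy,                        # 10yyyyyy
        0x80 | xxxxxx,                        # 10xxxxxx
    ])
```

For example, the pair D800 DC00 (the first value beyond FFFF) comes out as F0 90 80 80.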

RFC 2044 is obsolete; here's the table from RFC 2279:

    UCS-4 range (hex.)           UTF-8 octet sequence (binary)
    0000 0000-0000 007F   0xxxxxxx
    0000 0080-0000 07FF   110xxxxx 10xxxxxx
    0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

    0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
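
As a rough illustration of what that table permits, the following Python sketch encodes a single UCS-4 value by picking the row whose range contains it and filling in the x bits. The function name is hypothetical and this is not code from the RFC. As a simplification it does not treat the D800-DFFF range specially.

```python
def encode_rfc2279(ucs4: int) -> bytes:
    """Encode one UCS-4 value per the RFC 2279 table (1 to 6 octets)."""
    if not 0 <= ucs4 <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range covered by the table")
    if ucs4 <= 0x7F:
        return bytes([ucs4])
    # (upper bound of the range, number of octets, marker bits of first octet)
    rows = (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    )
    for limit, length, lead in rows:
        if ucs4 <= limit:
            break
    octets = []
    # Fill the trailing 10xxxxxx octets from the low-order end.
    for _ in range(length - 1):
        octets.append(0x80 | (ucs4 & 0x3F))
        ucs4 >>= 6
    # Whatever bits remain go into the first octet.
    octets.append(lead | ucs4)
    return bytes(reversed(octets))
```

With this sketch, 0020 0000 encodes as the five octets F8 88 80 80 80 and 7FFF FFFF as the six octets FD BF BF BF BF BF, neither of which has a row in the Unicode tables.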

Unicode 3.2, Unicode Technical Report #28:

Code Points          1st Byte   2nd Byte   3rd Byte   4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF     80..BF
U+0800..U+0FFF       E0         A0..BF     80..BF
U+1000..U+CFFF       E1..EC     80..BF     80..BF
U+D000..U+D7FF       ED         80..9F     80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF     80..BF     80..BF
U+10000..U+3FFFF     F0         90..BF     80..BF     80..BF
U+40000..U+FFFFF     F1..F3     80..BF     80..BF     80..BF
U+100000..U+10FFFF   F4         80..8F     80..BF     80..BF
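
The table above can be turned almost mechanically into a checker. The Python sketch below is only an illustration under that reading of the table; the names `WELL_FORMED` and `is_well_formed` are hypothetical, not taken from the Technical Report.

```python
# One entry per table row: the allowed (low, high) range for each byte.
WELL_FORMED = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]

def is_well_formed(data: bytes) -> bool:
    """Return True if data consists only of sequences listed in the table."""
    i = 0
    while i < len(data):
        for row in WELL_FORMED:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
            ):
                i += len(row)   # consume the matched sequence
                break
        else:
            return False        # no row matches at this position
    return True
```

Under this check, ED A0 80 (the three-byte form of D800) and the five-byte sequence F8 88 80 80 80 are both rejected, while F0 90 80 80 is accepted.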

Clearly, RFC 2279 provides for 5- and 6-byte utf-8 sequences, which are
not provided for by Unicode through 3.2.  And some 4-byte sequences
differ in different Unicode versions (particularly those corresponding
to surrogate pairs).  Whether or not Unicode 4.0 and/or the draft
mentioned above will introduce additional variants is another matter.