[idn] Re: permission <draft-ietf-idn-ace37-00.txt (attach)
Marc Blanchet <Marc.Blanchet@viagenie.qc.ca> Thu, 05 July 2001 18:00 UTC
Received: from psg.com (exim@psg.com [147.28.0.62]) by ietf.org (8.9.1a/8.9.1a) with SMTP id OAA21750 for <idn-archive@lists.ietf.org>; Thu, 5 Jul 2001 14:00:23 -0400 (EDT)
Received: from lserv by psg.com with local (Exim 3.31 #1) id 15IDBt-000IyZ-00 for idn-data@psg.com; Thu, 05 Jul 2001 10:45:25 -0700
Received: from jazz.viagenie.qc.ca ([206.123.31.2]) by psg.com with esmtp (Exim 3.31 #1) id 15IDBs-000IyT-00 for idn@ops.ietf.org; Thu, 05 Jul 2001 10:45:24 -0700
Received: from CLASSIC.viagenie.qc.ca (classic.viagenie.qc.ca [206.123.31.136]) by jazz.viagenie.qc.ca (Viagenie/8.11.0) with ESMTP id f65I4k152757; Thu, 5 Jul 2001 14:04:46 -0400 (EDT)
X-Accept-Language: fr,en,es
Message-Id: <5.1.0.14.1.20010705120420.01efc498@mail.viagenie.qc.ca>
X-Sender: blanchet@mail.viagenie.qc.ca
X-Mailer: QUALCOMM Windows Eudora Version 5.1
Date: Thu, 05 Jul 2001 12:08:02 -0400
To: Edmon <edmon@neteka.com>, idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] Re: permission <draft-ietf-idn-ace37-00.txt (attach)
In-Reply-To: <006601c10569$ffa20e00$1001a8c0@neteka.com>
References: <5.1.0.14.1.20010705084656.02b0e138@mail.viagenie.qc.ca>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"; format="flowed"
Content-Transfer-Encoding: 8bit
Sender: owner-idn@ops.ietf.org
Precedence: bulk
At/À 11:48 2001-07-05 -0400, Edmon you wrote/vous écriviez:
>Hi all,
>
>I was unaware that the workgroup no longer accepts new drafts.

see:

Message-Id: <5.1.0.14.1.20010626000156.03d85e50@mail.viagenie.qc.ca>
Date: Tue, 26 Jun 2001 00:05:28 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] wg next steps

and:

Message-Id: <5.1.0.14.1.20010629080012.02042a10@mail.viagenie.qc.ca>
Date: Fri, 29 Jun 2001 08:06:14 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] document pools active

And, as I wrote in the email, you are _encouraged_ to submit it as an
individual submission. The only differences are the filename and the
absence of a listing on the ietf idn wg charter web page.

Marc.

> Anyway, I
>have drafted a new ACE based on the simplicity of DUDE which has hugely
>improved compression. Worst case scenario CJK could have 21 han characters!
>Attached below is a copy of the draft (for my original submission), you can
>also find it at http://www.dnsii.org/idn-ace37-00.txt (easier to read) and
>hopefully in the i-d-n.net website soon.
>
>ACE37 is based on the one-pass one-mode scheme of DUDE (differential XOR),
>then utilizes a simple code block shifting (similar to the reference points
>in the AMC series) to hugely increase the capacity for CJK (worst case
>scenario 21 han characters!) and then utilizes base-32 for compression (as
>in LACE) (DUDE and AMC-w/v use base-32 only for flagging). In addition to
>base-32, a base-4 scheme is introduced by using the remaining characters
>{wxyz}. These contain 2 bits of character information and double as an
>indicator for codepoint brackets. All the while, the algorithm is kept
>as simple as DUDE.
>
>Hopefully you might find that it is interesting and appropriate to be
>considered as an ACE within the IETF. After all, it was intended to be an
>integrated version of the three primary ACEs: DUDE, LACE and the AMC series,
>identified by the ACE design team report.
> >Looking forward to all your inputs. > >Edmon > >PS. I have created an Excel worksheet to illustrate the Encoding and >Decoding procedures as well you can find them at >http://www.dnsii.org/ace37/ace37-encode.xls and >http://www.dnsii.org/ace37/ace37-decode.xls respectively. > > > >----- Original Message ----- >From: "Marc Blanchet" <Marc.Blanchet@viagenie.qc.ca> >To: "Natalia Syracuse" <nsyracus@ietf.org>; <edmon@neteka.com>; ><david@neteka.com> >Cc: <jseng@pobox.org.sg> >Sent: Thursday, July 05, 2001 8:50 AM >Subject: Re: permission <draft-ietf-idn-ace37-00.txt (attach) > > > > I'm sorry but the new wg policy is to not accept draft unless there is a > > demonstrated support. But drafts are _highly_ encouraged to be published >as > > individual submissions. I would recommend to put idn in the filename and > > use this filenaming convention: draft-<yourname>-idn-ace37-00.txt. After > > publication in the internet-draft, the author should announce it in the wg > > mailing list and I'll put a reference to it in the wg web page. > > > > So please publish it as individual submission. > > > > Marc. > > > > At/À 08:34 2001-07-05 -0400, Natalia Syracuse you wrote/vous écriviez: > > > > > > > > > > > >Internet Draft Edmon Chung, Neteka Inc. > > ><draft-ietf-idn-ace37-00.txt> David Leung, Neteka Inc. > > > June 2001 > > > > > > > > > > > > ACE Utilizing All 37 Alphanumeric Characters (ACE37) > > > > > > > > >STATUS OF THIS MEMO > > > > > > This document is an Internet-Draft and is in full conformance with > > > all provisions of Section 10 of RFC2026. > > > > > > Internet-Drafts are working documents of the Internet Engineering > > > Task Force (IETF), its areas, and its working groups. Note that > > > other groups may also distribute working documents as Internet- > > > Drafts. Internet-Drafts are draft documents valid for a maximum of > > > six months and may be updated, replaced, or obsoleted by other > > > documents at any time. 
It is inappropriate to use Internet-Drafts > > > as reference material or to cite them other than as "work in > > > progress." > > > > > > The reader is cautioned not to depend on the values that appear in > > > examples to be current or complete, since their purpose is primarily > > > educational. Distribution of this memo is unlimited. > > > > > > The list of current Internet-Drafts can be accessed at > > > http://www.ietf.org/ietf/1id-abstracts.txt > > > The list of Internet-Draft Shadow Directories can be accessed at > > > http://www.ietf.org/shadow.html. > > > > > >Abstract > > > > > > ACE37 is a combination of DUDE-02, AMC-W/V and LACE. ACE37 utilizes > > > the simple one pass algorithm of DUDE, the character block > > > considerations of AMC-W/V and the Base-32 compression of LACE. It > > > also fully utilizes entire LDH set currently allowed in the DNS (A- > > > z, 0-9 and "-") within its character repertoire to optimize > > > performance and compression. Even for the worst-case scenario in > > > ACE37, any name can have 21 characters including Chinese, Japanese > > > and Korean names. Two Excel spreadsheets for ACE37 encoding and > > > decoding can be found at http://www.dnsii.org/ace37/ace37-encode.xls > > > and http://www.dnsii.org/ace37/ace37-decode.xls respectively. > > > > > > While DUDE-02 provides a very efficient differential mechanism, its > > > compression is inefficient as it fails to take advantage of the > > > base-32 scheme in using all 5-bits for character information. The > > > AMC series is highly efficient in compression but requires > > > complicated mode changes and therefore inefficient in process. LACE > > > is rather moderate and requires a two-pass mechanism but utilizes > > > base-32 for good compression. 
> > > > > > > > >Chung & Leung [Page 1] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > > ACE37 uses simple character block shifting to achieve the > > > compression efficiency of the AMC series, retains the one-pass and > > > one mode XOR differential mechanism used by DUDE while embracing the > > > base-32 compression used by LACE for efficient character bit > > > information. > > > > > >Terminology > > > > > > The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", > > > and "MAY" in this document are to be interpreted as described in RFC > > > 2119 [RFC2119]. > > > > > > LDH: Letters, Digits and Hyphens: a string of characters that > > > consists only hyphens ("-"), English letters (A-z) and digits (0-9), > > > which might not be a result of an algorithm for transcoding > > > multilingual characters. For example: whatever-you-want.example > > > > > > ACE - ASCII Compatible Encoding: a string of characters resulting > > > from a particular algorithm for transforming multilingual character > > > information into an alphanumeric form acceptable by the existing > > > DNS. For example: bq--3bhc2zmh.tld. In essence, ACE is a subset of > > > LDH. > > > > > > Hexadecimal values are shown preceeded by "0x". For example, 0x60 > > > is decimal 96. Binary values are shown preceeded by "0b" for > > > example "0b1000" is decimal 8. As in the Unicode Standard > > > [UNICODE], Unicode code points are denoted by "U+" followed by four > > > to six hexadecimal digits, while a range of code points (or > > > hexadecimal numbers) is denoted by two hexadecimal numbers separated > > > by "..", with no prefixes. > > > > > > Octets: sequences of 8 bits; Quintets: sequences of 5 bits; > > > Quartets: sequences of 4 bits; Duplets: sequences of 2 bits. > > > > > > XOR: bitwise exclusive or. 
Given 2 nonnegative integers A and B, A > > > XOR B is the nonnegative integer value whose binary representation > > > is 1 wherever A and B disagrees, and 0 wherever they agree. > > > > > >Table Of Contents > > > > > > 1. Introduction....................................................3 > > > 2. Code Block Shifting.............................................4 > > > 3. Base-32 Characters..............................................5 > > > 4. Base-4 Characters...............................................6 > > > > > > 5. LDH Considerations..............................................9 > > > 6. Encoding Procedure..............................................9 > > > 7. Decoding Procedure.............................................11 > > > 8. Examples.......................................................13 > > > 9. Summary & Comparisons..........................................15 > > > 10. Security Considerations.......................................16 > > > 11. References....................................................16 > > > > > >Chung & Leung [Page 2] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > >1. Introduction > > > > > > ACE37 takes into account the recommendations and findings of the ACE > > > design team to create a "super-ACE" that incorporates the key > > > advantages of the various considered ACEs without complicated mode > > > changes. The encoding (Section 6) and decoding (Section 7) process > > > is largely similar to and as simple as DUDE-02. 
The encoding > > > processes for ACE37 in comparison with DUDE-02 could be summarized: > > > > > > ACE37 Encoding Procedure | DUDE Encoding Procedure > > > ---------------------------------+--------------------------------- > > > (1) let initial prev = 0x00 | (1) let initial prev = 0x60 > > > (2) if n = LDH output "-n" | (2) if n = hyphen output "-" > > > (3) code block shift to obtain | (3) diff = prev XOR n > > > ACE37 shifted n (Section 2)| (4) prepend "0" to the last > > > (4) diff = prev XOR n | quartet and "1" to others > > > (5) output in appropriate base-4 | (5) output a base-32 character > > > and base-32 form | for each corresponding > > > (Sections 3&4) | quintet > > > (6) let prev = n | (6) let prev = n > > > > > > Similarly, the decoding process can be described and compared: > > > > > > ACE37 Decoding Procedure | DUDE Decoding Procedure > > > ---------------------------------+--------------------------------- > > > (1) let initial prev = 0x00 | (1) let initial prev = 0x60 > > > (2) if char = hyphen discard "-" | (2) if char = hyphen consume > > > and output next char | and output 0x002D > > > (3) consume and convert char into| (3) consume and convert to > > > duplets and quintets | quintets until encoun- > > > (according to Sections 3&4)| erring a quintet with "0" > > > (4) concatenate to form diff | as first bit > > > (based on Sections 4.1&4.2)| (4) strip all first bits off > > > (5) let prev = prev XOR diff | (5) concatente to form diff > > > (6) reverse code block shifting | (6) let prev = prev XOR diff > > > (7) output Unicode code point | (7) output Unicode code point > > > > > > The features of ACE37 include: > > > > > > Unique & Reversible - the ACE37 encoding scheme yields a unique and > > > consistent result string for a given set of Unicode code points. > > > The encoded string could be decoded back to the original Unicode > > > code points without loss of character data. 
> > > > > > Simple - ACE37 utilizes a one-pass system and the XOR differential > > > function to encode and decode. Code block shifting is done by a > > > simple calculation instead of mapping or creation of arbitrary > > > reference points. Complex mode changes are not required. > > > > > > Spacious - With the code block shifting coupled with a base-32 > > > scheme, ACE37 can accommodate up to 21 unique Han characters > > > (including CJK) within the 63 octets allowed by the DNS. Other > > > Latin based scripts can reach up to 31 characters. > > >Chung & Leung [Page 3] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > > > > > Completeness - any sequence of Unicode code points > > > (U+0000..U+10FFFF) could be encoded. Restrictions of allowed code > > > points is not discussed, but is expected that Nameprep [Nameprep] > > > will be used prior to ACE37 encoding. > > > > > > In essence, it captures the focus criterions discussed by the > > > workgroup ACE design team - reversibility, simplicity and > > > compression capability. Moreover, ACE37 utilizes a very simple code > > > block shifting (Section 2) mechanism to allow up to any 21 CJK > > > ideographs to be encoded within the 63-octet constraint. > > > > > >2. Code Block Shifting > > > > > > While the DNS was not originally designed for multilingual > > > characters, Unicode was not designed with the DNS in mind and > > > therefore code points were apparently not allocated in an ACE- > > > friendly way. > > > > > > The AMC series [AMC-W & AMC-V] utilizes a number of reference points > > > to achieve better compression efficiency by anticipating and > > > minimizing delta between characters. For ACE37, a much simpler > > > rendering is used. More specifically, the entire character block > > > U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000. That > > > is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on. 
To > > > compensate for the downwards shift, the general script and symbol > > > characters in U+0000..U+2FFF will be shifted upwards by 0x7000. > > > Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so > > > on. All other code points (U+A000..U+10FFFF) are unchanged. > > > > > > Original Unicode Allocation | ACE37 Code Block Shifted > > > --------------------------------|------------------------------- > > > General Scripts U+0000 -+ | +- 0x0000 CJK Misc > > > U+1000 | | | 0x1000 CJK Ideographs > > > +- | -> | 0x2000 > > > Symbols U+2000 -+ \ | / | 0x3000 > > > \ |/ | 0x4000 > > > CJK Misc U+3000 -+ \/ | 0x5000 > > > CJK Ideographs U+4000 | /\ +- 0x6000 > > > U+5000 | / |\ > > > U+6000 +-- | \ +- 0x7000 General Scripts > > > U+7000 | | -> | 0x8000 > > > U+8000 | | | > > > U+9000 -+ | +- 0x9000 Symbols > > > | > > > Hangul U+A000 -+ | +- 0xA000 Hangul > > > U+B000 | | | 0xB000 > > > U+C000 +----|---> | 0xC000 > > > U+D000 | | | 0xD000 > > > : : -+ | +- : : > > > | > > > > > > > > >Chung & Leung [Page 4] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > > This shifting effectively moves the entire Han library to within > > > 0x6FFF and therefore could be represented in 15-bits or exactly 3 > > > base-32 characters. 
(details on base-32 characters in Section 3)
> > >
> > >   For example, the Chinese character for <change>, with the original
> > >   Unicode code point U+8F49, will be shifted to 0x5F49 and can be
> > >   represented in 3 quintets, and in turn with 3 base-32 characters:
> > >
> > >      Character:               <change>
> > >      Unicode Code Point:      U+8F49
> > >      ACE37 Shifted:           0x5F49
> > >      Corresponding Quartets:  0101 1111 0100 1001
> > >      Resulting Quintets:      10111 11010 01001
> > >      Base-32:                 nq9 (further discussed in Section 3)
> > >
> > >   This in turn means that any Chinese character can be represented
> > >   with 3 base-32 characters, so that a label can hold at least 21
> > >   such characters even without the further compression introduced by
> > >   the XOR differential process (Section 6).  The ACE37 code block
> > >   shifting process can be described as follows:
> > >
> > >      for each input code point = n
> > >         if n <= 0x9FFF
> > >            n = n - 0x3000      /*downwards shifting*/
> > >            if n < 0
> > >               n = 0xA000 + n   /*compensation for U+0000..U+2FFF*/
> > >
> > >   The character block shifting introduced here is extremely simple:
> > >   it uses a plain calculation and requires no mapping function.  At
> > >   the same time, it achieves the goal of adjusting the Unicode
> > >   allocation so that it becomes more ACE friendly.
> > >
> > >3. Base-32 Characters
> > >
> > >   Base-32 characters are used in LACE for compression, while DUDE-02
> > >   and the AMC series only utilize them for quartet flagging to
> > >   indicate the last quartet of each encoded code point.  ACE37
> > >   utilizes base-32 characters for compression while base-4
> > >   characters, which will be introduced in Section 4, determine the
> > >   compressed code point brackets.
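[Editorial sketch] The code block shifting of Section 2 can be illustrated with a few lines of Python. This is only a sketch that follows the prose description (U+3000..U+9FFF moved down by 0x3000, U+0000..U+2FFF moved up by 0x7000, everything else unchanged); the function names shift and unshift are hypothetical and do not come from the draft.

```python
def shift(cp: int) -> int:
    """Apply the Section 2 code block shift to a Unicode code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0x2FFF:           # general scripts and symbols move up by 0x7000
        return cp + 0x7000
    if cp <= 0x9FFF:           # the CJK block moves down by 0x3000
        return cp - 0x3000
    return cp                  # U+A000..U+10FFFF are unchanged


def unshift(value: int) -> int:
    """Reverse the shift, as the last step of decoding (Section 7)."""
    if not 0 <= value <= 0x10FFFF:
        raise ValueError("value out of range")
    if value <= 0x6FFF:        # shifted CJK block goes back up
        return value + 0x3000
    if value <= 0x9FFF:        # shifted general scripts and symbols go back down
        return value - 0x7000
    return value


if __name__ == "__main__":
    # The draft's own example: U+8F49 is shifted to 0x5F49.
    print(hex(shift(0x8F49)))
```

Because both directions are plain range checks and one addition or subtraction, no lookup table is involved, which is the property the draft emphasizes.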
> > >
> > >   The following table shows the 32 base-32 characters and their
> > >   corresponding quintets:
> > >
> > >      Base-32 Character =to= Corresponding Quintet
> > >      0 = 00000     8 = 01000     g = 10000     o = 11000
> > >      1 = 00001     9 = 01001     h = 10001     p = 11001
> > >      2 = 00010     a = 01010     i = 10010     q = 11010
> > >      3 = 00011     b = 01011     j = 10011     r = 11011
> > >      4 = 00100     c = 01100     k = 10100     s = 11100
> > >      5 = 00101     d = 01101     l = 10101     t = 11101
> > >      6 = 00110     e = 01110     m = 10110     u = 11110
> > >      7 = 00111     f = 01111     n = 10111     v = 11111
> > >
> > >   With this layout of base-32 characters, it is also possible to
> > >   implement a computation-based base-32 conversion instead of having
> > >   to resort to mapping and lookup tables:
> > >
> > >      For each quintet = q
> > >         if q <= 0x0F
> > >            then hex dump q to form base-32 character
> > >         if 0x10 <= q <= 0x1F
> > >            then q = q - 0x10
> > >            and char(q + 0x67) to form base-32 character
> > >
> > >   Note that 0x67 is the code value for the letter "g".  Therefore,
> > >   for example, if the quintet is 0b10001 its base-32 character can be
> > >   obtained by:
> > >
> > >      0x10 <= q=0b10001=0x11 <= 0x1F
> > >      therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
> > >      and base-32 character = char(0x01 + 0x67)
> > >      char(0x68) = "h"
> > >
> > >4. Base-4 Characters
> > >
> > >   ACE37 goes beyond the 32 characters (base-32) to include the
> > >   remaining 4 characters {w,x,y,z} in the alphabet.  These base-4
> > >   characters enable ACE37 to better utilize the existing "resources"
> > >   (the allowed characters) to represent IDN character information,
> > >   therefore making its encoding more efficient.
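[Editorial sketch] The computation-based base-32 conversion of Section 3 can be written out in Python roughly as follows. The helper names (quintet_to_char, char_to_quintet, to_base32) are hypothetical; the arithmetic follows the draft's description, with lowercase hex digits for 0x00..0x0F and letters starting at "g" (0x67) for 0x10..0x1F.

```python
def quintet_to_char(q: int) -> str:
    """Map a 5-bit value to its base-32 character without a lookup table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0..31")
    if q <= 0x0F:
        return format(q, "x")          # "hex dump": 0-9, a-f
    return chr(q - 0x10 + 0x67)        # 0x67 is "g", so 0x10..0x1F give g..v


def char_to_quintet(ch: str) -> int:
    """Inverse mapping; accepts upper or lower case input."""
    ch = ch.lower()
    if len(ch) == 1 and ch in "0123456789abcdef":
        return int(ch, 16)
    if len(ch) == 1 and "g" <= ch <= "v":
        return ord(ch) - 0x67 + 0x10
    raise ValueError("not a base-32 character")


def to_base32(value: int, count: int) -> str:
    """Emit the low 5*count bits of value as count characters, big-endian."""
    return "".join(
        quintet_to_char((value >> (5 * i)) & 0x1F)
        for i in reversed(range(count))
    )


if __name__ == "__main__":
    print(quintet_to_char(0b10001))    # the draft's worked example
    print(to_base32(0x5F49, 3))        # the shifted value from Section 2
```

The two printed values correspond to the draft's own examples: the quintet 0b10001 and the shifted code point 0x5F49.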
> > > > > > The set of base-4 characters are {w,x,y,z} and will be used to > > > represent the following duplets (duplets are groups containing 2 > > > bits): > > > > > > Base-4 Character =to= Corresponding Duplet > > > w = 00 > > > x = 01 > > > y = 10 > > > z = 11 > > > > > >4.1 Base-4 Indicators > > > > > > Base-4 characters while carrying character information, also doubles > > > as an indicator for code point brackets. In DUDE-02, an extra bit > > > was pre-pended to each quartet. The last quartet of each encoded > > > code point will be pre-pended with "0", marking the end of the code > > > point. In ACE37, base-4 characters will determine the length > > > (number of ACE37 characters) of the encoded code point. Actually, > > > to be more precise, the encoded bits are in fact the "diff" and not > > > the code point itself (diff carries the same meaning as in DUDE-02 > > > and is further discussed in Sections 6 & 7) > > > > > > > > > > > >Chung & Leung [Page 6] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > > The following table explains how base-4 characters are combined with > > > base-32 characters to form a representation of a diff (key: b4=base- > > > 4, b32=base-32): > > > > > > diff value |bits| ACE37 Form > > > -------------------------|----|---------------------------- > > > diff<=0x7F | 7 | <b4><b32> > > > 0x80<=diff<=0x7FFF | 15 | <b32><b32><b32> > > > 0x8000<=diff<=0x1FFFF | 17 | w<b4><b32><b32><b32> > > > 0x20000<=diff<=0xFFFFF | 20 | ww<b32><b32><b32><b32> > > > 0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32> > > > > > > Note that the "bits" column represents the maximum number of > > > significant bits for the given diff value. For example when > > > diff<=0x7F, the maximum value is 0b1111111, therefore the number of > > > significant bits is 7. 
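[Editorial sketch] The table of ACE37 forms above can be read as a simple size dispatch on the diff value. The Python below is one possible reading of that table for the general case only (it does not apply the first code point considerations of Section 4.2); the names B32, B4, b32 and encode_diff are hypothetical.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"   # quintet values 0..31
B4 = "wxyz"                                # duplet values 0..3


def b32(value: int, count: int) -> str:
    """Low 5*count bits of value as base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_diff(diff: int) -> str:
    """Pick the ACE37 form for a diff, following the Section 4.1 table."""
    if not 0 <= diff <= 0x10FFFF:
        raise ValueError("diff out of range")
    if diff <= 0x7F:                       # 7 bits:  <b4><b32>
        return B4[diff >> 5] + b32(diff, 1)
    if diff <= 0x7FFF:                     # 15 bits: <b32><b32><b32>
        return b32(diff, 3)
    if diff <= 0x1FFFF:                    # 17 bits: w<b4><b32><b32><b32>
        return "w" + B4[diff >> 15] + b32(diff, 3)
    if diff <= 0xFFFFF:                    # 20 bits: ww<b32><b32><b32><b32>
        return "ww" + b32(diff, 4)
    return B4[diff >> 20] + "w" + b32(diff, 4)   # 22 bits: <b4>w<b32> x 4


if __name__ == "__main__":
    for d in (0x41, 0x5F49, 0x8000, 0x20000, 0x10FFFF):
        print(hex(d), encode_diff(d))
```

In the 17-bit and 22-bit branches the leading duplet is never 00 for the stated ranges, so the base-4 character there is always one of x, y or z, which is what lets a decoder tell the forms apart.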
> > >
> > >   Note also that to encode a 17-bit diff, the letter "w" is used as an
> > >   indicator to distinguish the sequence from the 7-bit diff, where a
> > >   base-32 character is expected to follow a base-4 character.  Since
> > >   "w" represents "00", which carries no value, it will not be used in
> > >   the base-4 representation for a 17-bit diff (if a "00" is used, it
> > >   means that there are only 15 significant bits and therefore the
> > >   15-bit diff form should be used).  This is the case for the 20-bit
> > >   form as well.  The "w" is used as an arbitrary indicator in the
> > >   22-bit form and MUST be discarded during decoding.
> > >
> > >   By analyzing the ACE37 form, an encoded string can be successfully
> > >   returned to its original form.  There is no overlap and the form
> > >   can be determined precisely.  The following 5 rules dictate the 5
> > >   different ACE37 forms:
> > >
> > >   (1) Encode: if diff<=0x7F
> > >       Decode: if first character is <b4> AND next character NOT <b4>
> > >       Then it MUST be in 7-bit form: <b4><b32>
> > >
> > >   (2) Encode: if 0x80<=diff<=0x7FFF
> > >       Decode: if first character is <b32>
> > >       Then it MUST be in 15-bit form: <b32><b32><b32>
> > >
> > >   (3) Encode: if 0x8000<=diff<=0x1FFFF
> > >       Decode: if first character is "w" AND next character is <b4>
> > >               AND NOT "w"
> > >       Then it MUST be in 17-bit form: w<b4><b32><b32><b32>
> > >
> > >   (4) Encode: if 0x20000<=diff<=0xFFFFF
> > >       Decode: if first character is "w" AND next character is "w"
> > >       Then it MUST be in 20-bit form: ww<b32><b32><b32><b32>
> > >
> > >   (5) Encode: if 0x100000<=diff<=0x10FFFF
> > >       Decode: if first character is <b4> AND NOT "w"
> > >               AND next character is "w"
> > >       Then it MUST be in 22-bit form: <b4>w<b32><b32><b32><b32>
> > >
> > >   Note that the ACE37 scheme can effectively encode a diff of up to 22
> > >   significant bits or 0x3FFFFF.
The Unicode code points are expected > > > to range only between 0x0000..0x10FFFF, therefore ACE37 will be able > > > to handle any Unicode code point. > > > > > > Additionally, base-4 characters (and sometimes base-32 characters) > > > could be used for mixed-case annotation. This optional mixed-case > > > annotation mechanism is discussed in Appendix B. > > > > > >4.2 First Code Point Considerations > > > > > > There are additional considerations for the first code point that is > > > encoded or decoded to ensure that if the first code point is within > > > the first Unicode plane (U+0000..U+FFFF), it will not occupy more > > > than 4 ACE37 characters. > > > > > > This special consideration affects only Rules (1), (3) and (4) > > > explained in Section 4.1. Rule (1) is discarded for the first code > > > point, therefore any diff under 0x7FFF will be in the form > > > <b32><b32><b32>. The form for Rule (3) becomes simply > > > <b4><b32><b32><b32> without the "w" indicator. Similarly, the form > > > for Rule (4) becomes w<b32><b32><b32><b32> with one less "w". > > > > > > The first code point considerations can be summarized in the > > > following 4 rules: > > > > > > (a) Encode: if diff<=0x7FFF > > > Decode: if first character is <b32> > > > Then it MUST be in 15-bit form: <b32><b32><b32> > > > > > > (b) Encode: if 0x8000<=diff<=0x1FFFF > > > Decode: if first character is <b4> AND NOT "w" > > > Then it MUST be in 17-bit form: <b4><b32><b32><b32> > > > > > > (c) Encode: if 0x20000<=diff<=0xFFFFF > > > Decode: if first character is "w" > > > Then it MUST be in 20-bit form: w<b32><b32><b32><b32> > > > > > > (d) Encode & Decode: same as Rule (5) in Section 4.1 > > > > > > Besides special considerations for base-4 character usage, prev > > > setting is also specially considered for the first code point. As > > > laid out in Section 6, in order to detect for the first code point, > > > the prev is evaluated. 
If prev = 0x00, it is assumed that it is the > > > first code point as 0x00 SHOULD not be a permitted character for > > > input. When an LDH is the first code point, there is a need to make > > > a special consideration. Regularly, if n = LDH is encountered > > > (Section 5), it will be output as "-n" and prev is not changed. > > > However, if the first code point is an LDH, after outputting "-n", > > > prev is updated to = lowercase(n). This is to ensure and maintain > > > that only the first code point coming in will have a prev = 0x00. > > > > > >Chung & Leung [Page 8] > > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001 > > > > > >5. LDH Considerations > > > > > > Finally, the 37th character of the entire LDH repertoire, the hyphen > > > will be used to indicate LDH exceptions. Extending the hyphen > > > consideration of DUDE-02, ACE37 gives special consideration for the > > > entire LDH repertoire. All LDH characters will be encoded "as is" > > > with the addition of a leading hyphen. For example, the character > > > "a" will be encoded within ACE37 as "-a". The hyphen character "-" > > > will be encoded as "--". > > > > > > This ensures that each LDH character will only take up 2 character > > > spaces within an ACE37 encoded string and also will allow > > > administrators to see the actual characters, similar to the AMC > > > series. Unlike the AMC series however, the hyphen is not used to > > > indicate an ongoing mode change, but only the following character. > > > Therefore retaining the simplicity of the DUDE-02 single-mode, > > > single-pass philosophy. > > > > > >6. Encoding Procedure > > > > > > Similar to DUDE, all ordering of bits and quartets is big-endian. 
> > >   The following describes the encoding procedure:
> > >
> > >   Set initial value for prev = 0x00
> > >   for each input code point = n
> > >      if n is an LDH {A-Z, a-z, 0-9, -}
> > >         output "-n"                 (Section 5: LDH Considerations)
> > >         if prev = 0x00              (Section 4.2: First Code Point)
> > >            let prev = lowercase(n)
> > >      else perform code block shifting (Section 2: Code Block Shifting)
> > >         let diff = prev XOR n       (n after code block shifting)
> > >         select the output form      (Section 4: Base-4 Characters)
> > >         if diff<=0x7F
> > >            and if this is the first code point (Section 4.2)
> > >               then output 15-bit form: <b32><b32><b32>
> > >            else, output 7-bit form: <b4><b32>
> > >         if 0x80<=diff<=0x7FFF
> > >            output 15-bit form: <b32><b32><b32>
> > >         if 0x8000<=diff<=0x1FFFF
> > >            and if this is the first code point (Section 4.2)
> > >               then output 17-bit form: <b4><b32><b32><b32>
> > >            else, output 17-bit form: w<b4><b32><b32><b32>
> > >         if 0x20000<=diff<=0xFFFFF
> > >            and if this is the first code point (Section 4.2)
> > >               then output 20-bit form: w<b32><b32><b32><b32>
> > >            else, output 20-bit form: ww<b32><b32><b32><b32>
> > >         if 0x100000<=diff<=0x10FFFF
> > >            output 22-bit form: <b4>w<b32><b32><b32><b32>
> > >         let prev = n
> > >   end and obtain next n and return to: "for each input code point = n"
> > >
> > >   The following is a more comprehensive pseudo code:
> > >
> > >   let prev = 0x00
> > >   for each input integer n (in order) do begin
> > >      if n = "-" or "0..9" or "A..Z" or "a..z"
> > >         then output "hyphen"+"char(n)"
> > >         if prev = 0x00
> > >            let prev = lowercase(n)
> > >
> > >      else begin
> > >         if n = 0x00
> > >            then error and abort
> > >         if n <= 0x9FFF
> > >            n = n - 0x3000
> > >            if n < 0
> > >               then n = 0xA000 + n
> > >
> > >         let diff = prev XOR n
> > >
> > >         if diff <= 0x7F
> > >            if prev = 0x00
> > >               then output with 3 base-32 characters
> > >            else, output first 2 bits with a base-4 character {wxyz}
> > >               and remaining 5 bits with 1 base-32 character
> > >
> > >         if 0x80 <= diff <= 0x7FFF
> > >            then output all 15 bits with
> > >               base-32 characters
> > >
> > >         if 0x8000 <= diff <= 0x1FFFF
> > >            if prev = 0x00
> > >               then output first 2 bits with a base-4 {xyz} (except w)
> > >               and output remaining 15 bits with base-32
> > >            else, output "w"
> > >               and output first 2 bits with a base-4 {xyz} (except w)
> > >               and output remaining 15 bits with base-32
> > >
> > >         if 0x20000 <= diff <= 0xFFFFF
> > >            if prev = 0x00
> > >               then output "w"
> > >            else, output "ww"
> > >            and output all 20 bits with base-32 characters
> > >
> > >         if 0x100000 <= diff <= 0x10FFFF
> > >            then output first 2 bits with a base-4 {xyz} (except w)
> > >            and output "w"
> > >            and output remaining 20 bits with base-32
> > >
> > >         let prev = n
> > >      end
> > >   end
> > >
> > >   Nameprep [NAMEPREP] is not discussed in this document, but it is
> > >   expected that it will be implemented for IDN.  Hence, regardless of
> > >   the code point presented, an encoder MUST NOT produce an incorrect
> > >   output.  The encoder MUST fail if it encounters a negative input
> > >   value.
> > >
> > >   The initial value used is 0x00 so that all domains beginning with a
> > >   CJK ideograph or a character within row 0 (U+0000..U+0FFF) will be
> > >   shorter.  Note that after the code block shifting (Section 2), the
> > >   entire Han library is within 0x0000..0x6FFF, while row 0 is fitted
> > >   to 0x7000..0x7FFF.  Therefore, by using an initial value of 0x00,
> > >   the diff for all Han and row 0 characters will be at most 0x7FFF.
> > >   The initial value is also used as a check point for the first code
> > >   point considerations (Section 4.2).
> > >
> > >   Additionally, an optional mixed-case annotation mechanism is
> > >   discussed in Appendix B.
> > >
> > >7.
Decoding Procedure
> > >
> > >   A thorough description of the decoding rules, except for the final
> > >   reversal of the code block shifting, has been presented in Sections
> > >   4.1 and 4.2. The following description is a brief representation of
> > >   the decoding procedure:
> > >
> > >   let prev = 0x00
> > >   while the input string is not exhausted
> > >     if present character = hyphen      (Section 5: LDH Considerations)
> > >       discard and output next character
> > >     else, depending on the presented form (Section 4)
> > >       convert into duplets and quintets (Section 4 & 3)
> > >       and concatenate to form diff
> > >       let prev = prev XOR diff
> > >       reverse code block shifting: (Section 2)
> > >         if prev<=0x9FFF
> > >           and if prev<=0x6FFF
> > >             output character = prev + 0x3000
> > >           else, output character = prev - 0x7000
> > >         else output character = prev
> > >       output character
> > >   End
> > >
> > >   The following is a more comprehensive pseudo code for the decoding
> > >   procedure:
> > >
> > >   let prev = 0x00
> > >   while the input string is not exhausted do begin
> > >     if present character = hyphen  /*Section 5:LDH Considerations*/
> > >       then consume and discard hyphen
> > >       and obtain the next character
> > >       and output character
> > >       if prev = 0x00  /*Section 4.2:First Code Point*/
> > >         let prev = code block shifted lowercase output character
> > >
> > >     else,
> > >       if present character = Base-32 characters (0..v)
> > >         consume present character and next 2 characters
> > >         and convert them to quintets according to Base-32
> > >         concatenate the resulting quintets to form diff
> > >         /*15 bit form, 0x80<=diff<=0x7FFF*/
> > >
> > >       if present character = Base-4 characters {xyz} and NOT w
> > >         consume present character
> > >         and convert it to a duplet according to Base-4
> > >
> > >         if prev = 0x00
> > >           obtain and consume next 3 characters
> > >           and convert them to quintets according to Base-32
> > >           concatenate duplet with the 3 quintets to form diff
> > >           /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > >         else, if next character = Base-32 character (0..v)
> > >           then consume and convert to quintet according to Base-32
> > >           concatenate duplet with the quintet to form diff
> > >           /*7 bit form, diff<=0x7F*/
> > >
> > >         else, obtain next character
> > >           if next character = Base-4 characters {xyz} and NOT w
> > >             then fail and indicate error
> > >
> > >           else, if next character = w
> > >             then consume and discard w and obtain next 4 characters
> > >             consume and convert characters to
> > >             quintets according to Base-32
> > >             concatenate duplet with the 4 quintets to form diff
> > >             /*22 bit form, 0x100000<=diff<=0x10FFFF*/
> > >
> > >       if present character = w
> > >         discard "w" and obtain next character
> > >
> > >         if next character = Base-4 characters {xyz} and NOT w
> > >
> > >           and if prev = 0x00
> > >             obtain and consume next 4 characters
> > >             and convert characters to quintets based on Base-32
> > >             concatenate the 4 quintets to form diff
> > >             /*first code point: 20 bit form,*/
> > >             /*0x20000<=diff<=0xFFFFF        */
> > >
> > >           else, consume and convert to duplet according to Base-4
> > >             and obtain and consume next 3 characters
> > >             and convert to quintets according to Base-32
> > >             concatenate duplet with the 3 quintets to form diff
> > >             /*17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > >         else, if next character = w
> > >           then consume and discard w
> > >           and obtain and consume next 4 characters
> > >           and convert to quintets according to Base-32
> > >           concatenate the 4 quintets to form diff
> > >           /*20 bit form, 0x20000<=diff<=0xFFFFF*/
> > >
> > >         else, if next character = Base-32 character (0..v)
> > >           then convert to quintet according to Base-32
> > >           set quintet to diff
> > >           /*7 bit form, diff<=0x7F*/
> > >
> > >       fail upon encountering a non-ACE37 character
> > >       or end-of-input
> > >
> > >       let prev = prev XOR diff
> > >
> > >       if prev <= 0x9FFF            /*reversal of the code    */
> > >         and if prev <= 0x6FFF      /*block shifting described*/
> > >           output = prev + 0x3000   /*in Section 2            */
> > >         else, output = prev - 0x7000
> > >       else, output prev
> > >     end
> > >   end
> > >   encode the output sequence and compare it to the input string
> > >   fail if they do not match (case insensitively)
> > >
> > >8. Examples
> > >
> > >   ACE37 is likely to be implemented with an ACE prefix in the form
> > >   "xx--". The actual prefix to be used is not discussed in this
> > >   document. The following examples are taken from the mailing list as
> > >   well as from DUDE-02 and the AMC series. The resulting ACE37 string
> > >   is compared with that using DUDE:
> > >
> > >   (A) JPNIC (the registry of .jp domain)
> > >
> > >   Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
> > >            U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
> > >            U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
> > >            U+30FC
> > >   ACE37:   i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
> > >            (57 char)
> > >   DUDE-02: (error: result string exceeds 59 characters*)
> > >   * Note: 59 characters is the maximum allowable when the ACE
> > >     prefix "xx--" is included
> > >
> > >   (B) A health-insurance organization in Tokyo
> > >
> > >   Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
> > >            U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
> > >            U+5408
> > >   ACE37:   drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
> > >   DUDE-02: (error: result string exceeds 59 characters)
> > >
> > >   (C) 6 hangul syllables
> > >
> > >   Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
> > >   ACE37:   xg9orfsqssvfg3i8t2c (19 char)
> > >   DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)
> > >
> > >   (D) maji<de>koi<suru>5<byou><mae> (Latin, hiragana, kanji)
> > >
> > >   Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
> > >            U+3059 U+308B U+0035 U+79D2 U+524D
> > >   ACE37:   -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char)
> > >   DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)
> > >
> > >   (E) <pafii>de<runba> (Latin, katakana)
> > >
> > >   Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
> > >            U+30D0
> > >   ACE37:   06hw4zmyv-d-ewnwox3 (19 char)
> > >   DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)
> > >
> > >   (F) <sono><supiido><de> (hiragana, katakana)
> > >
> > >   Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
> > >   ACE37:   02txj06nzdx8xl05e (17 char)
> > >   DUDE-02: vsvpvd7hypuivf4q (16 char)
> > >
> > >   (G) 2 Arbitrary Plane Two Code Points
> > >
> > >   Unicode: U+261AF U+261BF
> > >   ACE37:   w4odfwg (7 char)
> > >   DUDE-02: uyt6rta (7 char)
> > >
> > >   (H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
> > >
> > >   Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
> > >            U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
> > >            U+00ED U+010D U+0065 U+0073 U+006B U+0079
> > >   ACE37:   -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
> > >   DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)
> > >
> > >   (I) Chinese
> > >
> > >   Unicode: U+4ED5 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
> > >            U+6587
> > >   ACE37:   7mmfm7oh3n7is3ts5gh57h47ata (27 char)
> > >   DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)
> > >
> > >9. Summary & Comparisons
> > >
> > >   In summary, ACE37 is based on the DUDE-02 process with an improved
> > >   compression scheme for code point sequences that are less likely to
> > >   cluster too closely together, such as CJK ideographs.
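
The decoding pseudo code quoted above can be sketched in Python for the simpler diff forms. This is only a rough illustration, not the authors' implementation: it assumes the base-32 alphabet is 0-9 followed by a-v and that w, x, y, z carry the duplet values 0 to 3 (both are consistent with examples (C), (E) and (F), but the defining sections are not quoted here). The names decode_subset, shift and unshift are made up for this sketch, shift is simply the inverse of the quoted reversal step, and the 20-bit and 22-bit forms and the final re-encode check are left out.

```python
BASE32 = "0123456789abcdefghijklmnopqrstuv"  # assumed quintet alphabet
BASE4 = "wxyz"                               # assumed duplet alphabet, w = 0


def unshift(value):
    """Reverse the code block shifting, as in the quoted pseudo code."""
    if value <= 0x6FFF:
        return value + 0x3000
    if value <= 0x9FFF:
        return value - 0x7000
    return value


def shift(code_point):
    """Forward shift, inferred here as the inverse of unshift()."""
    if 0x3000 <= code_point <= 0x9FFF:
        return code_point - 0x3000
    if code_point <= 0x2FFF:
        return code_point + 0x7000
    return code_point


def quintets(chars):
    """Concatenate base-32 characters into one integer, 5 bits each."""
    value = 0
    for ch in chars.lower():
        value = (value << 5) | BASE32.index(ch)  # ValueError if not base-32
    return value


def decode_subset(text):
    """Decode LDH, 7-bit, 15-bit and first-code-point 17-bit forms only."""
    out, prev, i = [], 0, 0

    def take(n):
        nonlocal i
        chunk = text[i:i + n]
        if len(chunk) < n:
            raise ValueError("unexpected end of input")
        i += n
        return chunk

    while i < len(text):
        ch = take(1).lower()
        if ch == "-":                        # LDH: next character is literal
            literal = take(1)
            out.append(ord(literal))
            if prev == 0:                    # only a leading LDH seeds prev
                prev = shift(ord(literal.lower()))
            continue
        if ch in BASE32:                     # 15-bit form: 3 quintets
            diff = quintets(ch + take(2))
        elif ch in "xyz" and prev == 0:      # first code point, 17-bit form
            diff = (BASE4.index(ch) << 15) | quintets(take(3))
        elif ch in BASE4:                    # 7-bit form: duplet + quintet
            nxt = take(1).lower()
            if nxt not in BASE32:
                raise ValueError("form not covered by this sketch")
            diff = (BASE4.index(ch) << 5) | quintets(nxt)
        else:
            raise ValueError("non-ACE37 character")
        prev ^= diff
        out.append(unshift(prev))
    return out
```

Calling decode_subset("xg9orfsqssvfg3i8t2c") gives the six hangul code points of example (C); the strings of examples (E) and (F) decode to their listed code points as well.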
> > >
> > >   The design team indicated that 30 characters should generally be
> > >   good enough. Given the considerable concern from the Asian
> > >   community that 14-15 characters is definitely limiting, and the
> > >   little indication from the Latin community that length is really a
> > >   concern, ACE37 has set its objective to raise the possible number
> > >   of characters in a worst-case scenario closer to 20.
> > >
> > >   ACE37 has succeeded in creating a very simple variation based on
> > >   the primary ACEs identified by the design team, yielding an ACE
> > >   that achieves dramatically better performance for CJK characters
> > >   while maintaining the simplicity of DUDE.
> > >
> > >   Key Improvements of ACE37 over DUDE-02
> > >   - much more spacious for Han characters. Improved worst-case
> > >     scenario to 21 Han ideographs by introducing code block shifting
> > >     and fully utilizing base-32 characters
> > >   - no need to arbitrarily prepend flagging bits to identify code
> > >     point brackets. Instead base-4 characters and diff forms are used
> > >   - base-32 and base-4 characters can be easily computed instead of
> > >     mapped using lookup tables
> > >
> > >   Key Improvements of ACE37 over the AMC series
> > >   - a simpler process, utilizing the one-pass differential
> > >     mechanism from DUDE-02
> > >   - a much simpler code block shifting process is used in ACE37 to
> > >     achieve a goal similar to that of the complex multiple reference
> > >     point system used by the AMC series
> > >   - base-32 and base-4 characters can be easily computed instead of
> > >     mapped using lookup tables
> > >
> > >   Key Improvements of ACE37 over LACE
> > >   - a simpler process, utilizing the one-pass differential
> > >     mechanism from DUDE-02
> > >   - much more spacious for Han characters. Improved worst-case
> > >     scenario to 21 Han ideographs by introducing code block shifting
> > >     and fully utilizing base-32 characters
> > >   - base-32 and base-4 characters can be easily computed instead of
> > >     mapped using lookup tables
> > >
> > >   Two Excel spreadsheets for ACE37 encoding and decoding can be found
> > >   at http://www.dnsii.org/ace37/ace37-encode.xls and
> > >   http://www.dnsii.org/ace37/ace37-decode.xls respectively. They
> > >   illustrate the simplicity of ACE37 and provide a handy tool for
> > >   checking ACE37 encoding and decoding algorithms. The ACE37-encode
> > >   spreadsheet also includes a DUDE-encode worksheet.
> > >
> > >10. Security Considerations
> > >
> > >   This document does not talk about DNS security issues, and it is
> > >   believed that the proposal does not introduce additional security
> > >   problems not already existent and/or anticipated by adding
> > >   multilingual characters to DNS and/or using ACE.
> > >
> > >11. References
> > >
> > >   [AMC-W]    Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.
> > >
> > >   [AMC-V]    Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.
> > >
> > >   [DUDE-02]  Mark Welter, Brian W. Spolarich & Adam M. Costello,
> > >              "Differential Unicode Domain Encoding (DUDE)",
> > >              June 7, 2001.
> > >
> > >   [LACE]     Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
> > >              based ASCII Compatible Encoding for IDN", January 5, 2001.
> > >
> > >   [Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
> > >              "Preparation of Internationalized Host Names",
> > >              February 24, 2001.
> > >
> > >Appendix A. Acknowledgements
> > >
> > >   The ACE37 draft is a combination of DUDE-02, the AMC series and
> > >   LACE, and takes into consideration the report of the ACE design
> > >   team. The authors would therefore like to thank the authors of
> > >   DUDE-02 - Mark Welter, Brian W. Spolarich & Adam M. Costello; the
> > >   author of the AMC series - Adam M. Costello; the authors of LACE -
> > >   Mark Davis & Paul Hoffman; and the ACE design team and its advisors
> > >   - Adam M. Costello, Paul Hoffman, Makoto Ishisone, David Laurence,
> > >   Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik Faltstrom and
> > >   Erik Nordmark for their inspiration.
> > >
> > >Appendix B. Mixed-case annotation
> > >
> > >   This section is taken from DUDE and modified for ACE37.
> > >
> > >   In order to use ACE37 to represent case-insensitive Unicode strings,
> > >   higher layers need to case-fold the Unicode strings prior to ACE37
> > >   encoding. The encoded string can, however, use mixed-case base-4
> > >   characters as an annotation telling how to convert the folded
> > >   Unicode string into a mixed-case Unicode string for display
> > >   purposes.
> > >
> > >   Each Unicode code point (unless it is an LDH) is represented by a
> > >   sequence of base-4 and base-32 characters, the first of which is
> > >   usually a base-4 character, which is always a letter {wxyz} (as
> > >   opposed to a digit). If that letter is uppercase, it is a
> > >   suggestion that the Unicode character be mapped to uppercase (if
> > >   possible); if the letter is lowercase, it is a suggestion that the
> > >   Unicode character be mapped to lowercase (if possible).
> > >
> > >   If the code point is an LDH, for example "a", it will be represented
> > >   as "-a". To mark the case for an LDH, simply set the LDH to the
> > >   desired case following the "-". For example, if an uppercase "A" is
> > >   desired, the encoded form SHOULD be "-A".
> > >
> > >   Note that there is a possibility that no base-4 character is present
> > >   for a code point representation. That is the case for a 15-bit diff
> > >   form. In this case, the base-32 characters will be used for case
> > >   suggestion (if possible), similar to that discussed for using a
> > >   base-4 character. However, also note that there is a very remote
> > >   possibility that all 3 base-32 characters are digits. If this
> > >   happens, case unfolding will be aborted. Since case annotation is
> > >   an optional feature and used for display purposes only, this is not
> > >   considered to be a major concern. Moreover, the possibility is
> > >   truly remote: only (32639/27)/1114109, or about 0.1%.
> > >
> > >   ACE37 encoders and decoders are not required to support these
> > >   annotations, and higher layers need not use them.
> > >
> > >   For example, in order to suggest that example (H) in Section 8,
> > >   "Examples", be displayed as:
> > >   Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)>
> > >          nemLUV<iacute(lowercase)><ccaron(lowercase)>esky
> > >
> > >   one could capitalize the ACE37 encoding as:
> > >   ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)
> > >
> > >Authors:
> > >
> > >Edmon Chung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >edmon@neteka.com
> > >
> > >David Leung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >david@neteka.com
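
The mixed-case annotation of Appendix B could be read back with something like the following Python sketch. It rests on an assumption the draft does not spell out: that the first letter found in a code point's group of characters (a base-4 letter, a base-32 letter, or the LDH after the hyphen) carries the hint. The function name case_hint is invented for this illustration, and splitting an encoded string into per-code-point groups is left to the decoder.

```python
def case_hint(group):
    """Return 'upper', 'lower', or None for one code point's characters.

    Assumption: the first letter in the group carries the suggestion;
    a group made only of digits (or a hyphen) carries none.
    """
    for ch in group:
        if ch.isalpha():
            return "upper" if ch.isupper() else "lower"
    return None


# Groups taken from the capitalized example (H) in Appendix B.
hints = [case_hint(g) for g in ("-P", "0BT", "WM", "0fm", "000")]
```

A group with no letters, such as "000", yields None, which corresponds to the case the draft describes as aborting case unfolding.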