Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)

liana Ye <liana.ydisg@juno.com> Tue, 04 December 2001 22:22 UTC

To: jseng@pobox.org.sg
Cc: DougEwell2@cs.com, liana.ydisg@juno.com, idn@ops.ietf.org
Date: Tue, 04 Dec 2001 13:36:50 -0600
Subject: Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)
Message-ID: <20011204.140704.-319607.1.liana.ydisg@juno.com>
X-Mailer: Juno 4.0.5
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
From: liana Ye <liana.ydisg@juno.com>
Sender: owner-idn@ops.ietf.org
Precedence: bulk

The StepCode discussion was dropped at the June
meeting; the stated reason was that "it only
concerns Chinese character encoding".
It has never been brought back for discussion
at all.  Version 01, submitted in October,
includes discussion of worldwide scripts.
That version has never been discussed
or readmitted into the pool.

Does this mean we cannot discuss the idea
on the list?  I am really puzzled.

Liana


On Tue, 4 Dec 2001 21:50:32 +0800 "James Seng/Personal"
<jseng@pobox.org.sg> writes:
> In my email dated 23rd Oct 2001,
> http://www.imc.org/idn/mail-archive/msg04363.html
> 
> I have indicated that "StepCode- A Mnemonic Internationalized Domain
> Name Encoding
> (draft-ietf-idn-step)" has been dropped from the WG Pool.
> 
> The following drafts remain in the wg pool:
> 
> Internationalizing Host Names In Applications (IDNA)
> (draft-ietf-idn-idna)
> Preparation of Internationalized Host Names 
> (draft-ietf-idn-nameprep)
> Proposal for a determining process of ACE identifier
> (draft-ietf-idn-aceid)
> Japanese characters in multilingual domain name label
> (draft-ietf-idn-jpchar)
> Traditional and Simplified Chinese Conversion 
> (draft-ietf-idn-tsconv)
> Hangeul NAMEPREP recommendation version 1.0
> (draft-ietf-idn-hangeulchar)
> Improving ACE using code point reordering v1.0
> (draft-ietf-idn-lsb-ace)
> AMC-ACE-Z (draft-ietf-idn-amc-ace-z)
> The Internationalized Domain Name System (draft-hall-dm-idns-00.txt)
> 
> I would like to remind the working group to remain focused on
> discussion of drafts within the wg pool. Therefore, further discussion
> of drafts outside the pool, such as StepCode, should be brought offline.
> 
> Thanks.
> 
> -James Seng
> 
> ----- Original Message -----
> From: "liana Ye" <liana.ydisg@juno.com>
> To: <DougEwell2@cs.com>
> Cc: <idn@ops.ietf.org>
> Sent: Tuesday, December 04, 2001 2:18 PM
> Subject: Re: Layer 2 and "idn identities" (was: Re: [idn] what are the
> IDN identifiers?)
> 
> 
> >
> >
> > On Mon, 3 Dec 2001 16:29:57 EST DougEwell2@cs.com writes:
> > > In a message dated 2001-12-03 2:01:56 Pacific Standard Time,
> > > liana.ydisg@juno.com writes:
> > >
> > > > I see that you have not read the I-D yet, and Deng Xiang
> > > > has replied to your Chinese vs. Japanese argument.  I
> > > > will wait for your comment on the language-tag issue,
> > > > or anything not up to your standard.
> > >
> > > Here are some specific concerns related to items in
> > > draft-Liana-idn-map-00.txt.
> > >
> > > | The proposed ACE is a mnemonic encoding scheme,
> > > | and is called StepCode [StepCode].
> > >
> > > Hasn't AMC-ACE-Z already been chosen as the standard ACE for IDN?  I
> > > would be surprised if the decision were made to use two different ACEs
> > > depending on the language, or script, of the encoded text.
> >
> > The current AMC treats all UCS code points the same.  It cannot
> > resolve look-alikes across different languages.  It does not make DNS
> > master records readable for administrators.  It does not make
> > zone files sortable by region or by sensible user groups.
> > It does not help a user who does not read Chinese but communicates
> > with Chinese partners.  But it does compress the data and feed
> > it into DNS.
> >
> > Using StepCode can group users by language, so sorting
> > the names makes much more semantic sense for administrators.
> > StepCode also gives each character its own ID, so that it can be
> > treated differently across different languages.
> >
> > But StepCode is applied only when the user wants it, and there may
> > be users who do not want it.  AMC should then be used to cover
> > such cases.  StepCode and AMC complement each other.
> > This is discussed in Section 4.5.
> >
> >
> > > | U-s   U-p A-p
> > > | U+0041  U+0061    a      (Latin Letter A case folding)
> > > | U+2fc2  U+2ee5    yv2    (Han character fish for Chinese case folding)
> > >
> > > Several Chinese speakers and other experts have already, repeatedly,
> > > claimed that SC/TC mapping is NOT a 1-1 operation like Latin case
> > > folding.  If you think your users will be satisfied with the 1-1
> > > solution only, go right ahead, but if this turns out to be inadequate
> > > and you need to propose a fix, get ready to hear a lot of people say
> > > "I told you so."
> >
> > For this, please see my post on data-centric programming
> > techniques applicable to the SC/TC problem.
> >
> >
> > >
> > > | To facilitate end users for the speed of IDN access as well as
> > > | compatibility with existing applications, it is RECOMMENDED that an
> > > | IDN code exchange table includes applicable local display standards
> > > | corresponding with each applicable codepoints in UCS.
> > >
> > > Backward mapping tables to convert Unicode to legacy standards, for
> > > the express purpose of allowing end-user software to delay the
> > > transition to Unicode?  Does this sound like a solution for the future?
> >
> > These legacy standards have to be on the servers in order to
> > switch a large user base to the new IDN.  After you have switched
> > the users, you can replace user software and hardware at the
> > suppliers' pace.  That means you can play the price/service game
> > to lure the users to switch.  Without this feature, most users will
> > never want to change, because change is too expensive for them
> > and they are happy with what they have: a reliable connection.
> >
> > If we make these legal with the exchange map, then
> > there will be no need to implement another code form
> > like UTF-8.
> >
> >
> > > | It is REQUIRED to register a language tag with IANA and its
> > > | associated script range whenever it is modified.
> > >
> > > There is already a perfectly good update process in place for both
> > > ISO 639 and RFC 3066.
> >
> > But IDN may not need to implement all of these tags.  Each tag
> > implemented needs script-specific procedures to be deployed.
> >
> > >
> > > | To use mixed scripts in one IDN label is NOT RECOMMENDED for an
> > > | early deployment of IDN.
> > >
> > > This immediately outcasts the Japanese, who have every reason to mix
> > > hiragana, katakana, kanji, and romaji.
> >
> > Wrong.  Japanese and Korean are the primary languages to
> > be tested in practice.  They are covered by the C, J, K tags, which
> > are also used in the I-D to show the feasibility of the implementation.
> >
> > The recommendation is there as a warning: although C, J, K
> > are shown to be feasible, no system is installed yet, so an
> > unrealistic jump-in is not encouraged, especially when
> > blanket treatment of UCS by AMC is possible.
> >
> > >
> > > |         Alphabet Sys.  Consonant Sys.  Character Sys.
> > > |
> > > | From: 0020            0530            2e80
> > > | to:   052f            1bff            d7af
> > > |
> > > | include:Latin           Armenian        CJK
> > > |         Greek           Hebrew          Kanji
> > > |         Cyrillic        Arabic          Kana
> > > |         IPA             Devanagari      Hangul
> > > |         Vietnamese      Malayalam       Yi
> > > |                         Thai
> > > |                         Lao
> > > |                         Tibetan
> > > |                         ...
> > >
> > > Sorry, it's just not that simple.  There are plenty of alphabets and
> > > alphabetic characters encoded above U+0530.  That's probably why the
> > > Unicode Consortium, while providing a list of blocks of code points
> > > like the following:
> > >
> > >     # Start Code..End Code; Block Name
> > >     0000..007F; Basic Latin
> > >     0080..00FF; Latin-1 Supplement
> > >     0100..017F; Latin Extended-A
> > >
> > > is careful not to imply that ranges of code points are permanently
> > > reserved for *classifications* of scripts like this.
> > >
> > > You can tell that the three ranges listed here are arbitrary and
> > > bogus even for the CJK scripts, by noting that Korean jamos
> > > (alphabetic) are located in the "consonant system" block, while the
> > > Japanese syllabaries (kana) and precomposed Korean syllables are in
> > > the "character system" block.
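
The objection to the three ranges can be checked mechanically.  What
follows is only a minimal Python sketch: it assumes nothing beyond the
range boundaries in the table quoted above, and the group labels and
the group_of helper are hypothetical names chosen for illustration,
not anything defined in the I-D.

```python
# Range boundaries copied from the quoted table; the labels are
# illustrative shorthand, not terms defined by the I-D.
GROUPS = [
    (0x0020, 0x052F, "alphabet"),
    (0x0530, 0x1BFF, "consonant"),
    (0x2E80, 0xD7AF, "character"),
]


def group_of(code_point):
    """Return the label of the range holding code_point, or None."""
    for low, high, label in GROUPS:
        if low <= code_point <= high:
            return label
    return None


if __name__ == "__main__":
    samples = [
        (0x1100, "Hangul jamo"),
        (0x3042, "Hiragana"),
        (0xAC00, "precomposed Hangul syllable"),
        (0x10D0, "Georgian letter"),
        (0x1EA1, "Latin letter used in Vietnamese"),
    ]
    for cp, description in samples:
        print(f"U+{cp:04X} {description}: {group_of(cp)}")
```

Under these boundaries the alphabetic Hangul jamo and the Georgian
letter land in the "consonant" range, kana and precomposed Hangul land
in the "character" range, and U+1EA1 falls in the gap between 1BFF and
2E80, so it belongs to no group at all.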
> >
> > These are rough groups for studying different cases
> > so as to cover the broadest language variations.  This grouping
> > was proposed by a well-known linguist.  While we do not
> > need to copy linguists' views (just as I am against copying
> > UTC's recommendation), it is necessary to learn
> > what different views linguists have proposed before
> > I feel confident proposing a reasonable solution.
> > No specifics are placed on these groups.  The real
> > terms are in the language tag definition file.  As you may see,
> > an indefinite number of code blocks are defined
> > in the data specification format, Section 3.2, and
> > associated with language-specific procedures.
> > That is the reason I proposed IANA registration for
> > the language tags we support.
> >
> >
> > >
> > > | Some cultures often use more than two scripts within the same group,
> > > | such as Japanese, but rarely using another script especially from a
> > > | different group.
> > >
> > > As noted above, the Japanese use four scripts from two different
> > > groups.
> > >
> > > | The main issue in IDN-Map
> > > | is to identify character equivalent sets, and reduce the number of
> > > | applicable IDN identifiers by 1) limiting the applicable IDN input
> > > | code points to Plane 0 of Unicode table,
> > >
> > > Has anyone else so far proposed that supplementary characters be
> > > flat-out prohibited from occurring in IDN identifiers?  Why should
> > > they be singled out as a way to "reduce the number of applicable IDN
> > > identifiers"?
> >
> > This was a statement in an early ACE I-D of this
> > group.  Since UTS released a new case folding map,
> > [nameprep] has taken it without question, and everyone
> > has dropped this issue.
> >
> > No one proposes to prohibit Plane 1 code points.
> > Here I am proposing to get the equivalence classes
> > working first, before we allow code points in Plane 1
> > and above.  In fact, the more you let in, the more it
> > supports my case for letting TC/SC in.  And this is
> > the proof:
> >
> > in the current [nameprep] specification:
> >
> > 0048; 0068; Case map
> >
> > 210B; 0068; Additional folding
> > 210C; 0068; Additional folding
> > 210D; 0068; Additional folding
> >
> > 1D407; 0068; Additional folding
> > 1D43B; 0068; Additional folding
> > 1D46F; 0068; Additional folding
> > 1D4D7; 0068; Additional folding
> > 1D573; 0068; Additional folding
> >
> > You can see this is a 9-to-1 case folding; how
> > will you recover the 9 cases?
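
The recoverability question can be made concrete.  The Python sketch
below uses only the nine entries listed above; the FOLD table, fold,
and preimages are hypothetical names for illustration and are not
taken from [nameprep] itself.

```python
# The nine fold entries listed above: source code point -> folded one.
FOLD = {
    0x0048: 0x0068,   # Case map
    0x210B: 0x0068,   # Additional folding
    0x210C: 0x0068,
    0x210D: 0x0068,
    0x1D407: 0x0068,
    0x1D43B: 0x0068,
    0x1D46F: 0x0068,
    0x1D4D7: 0x0068,
    0x1D573: 0x0068,
}


def fold(text):
    """Apply the table; code points not listed pass through unchanged."""
    return "".join(chr(FOLD.get(ord(ch), ord(ch))) for ch in text)


def preimages(table):
    """Invert the table: folded code point -> list of its sources."""
    inverse = {}
    for source, target in table.items():
        inverse.setdefault(target, []).append(source)
    return inverse


if __name__ == "__main__":
    sources = preimages(FOLD)[0x0068]
    print(len(sources), "code points fold to U+0068")
    # Two different inputs become indistinguishable after folding.
    print(fold("\u210Bost") == fold("Host"))
```

Because nine distinct sources share one target, no function of the
folded string alone can tell which source was typed; recovering it
needs extra information stored alongside the folded name.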
> >
> > >
> > > | It is RECOMMENDED that reasonable studies are given to each language
> > > | to classify script treatment model, and a cost vs. benefit analysis
> > > | in select a long term script specific processing protocol to be
> > > | embedded in IDN language specific modules.
> > >
> > > This won't disrupt the schedule of the working group, will it?
> >
> > I don't know what the WG schedule is based on.  If it waves
> > the CJK case away, it already met its schedule last year.
> > If you mean the current schedule, you have to
> > ask whether the WG has a clear picture of IDN or not.
> > If no one knows how to deal with CJK, any schedule is
> > meaningless.  That is the reason I do not comment
> > on WG milestones.
> >
> > >
> > > | canonicalization
> > >
> > > This word has no clear definition and is carefully avoided by
> > > Unicode, as Ken Whistler already explained.
> >
> > I think we are getting somewhere.  We are getting
> > down to the code points now.  When I don't need to use
> > all these vague terms, we are near the solution.
> >
> >
> > >
> > > | A string mixed with CJK and Kana is Japanese, CJK and Hangul mix is
> > > | Korean. However, an all CJK character string MUST be presumed to be
> > > | in the primary language tag, that is Chinese, and registered as the
> > > | only IDN name, unless the registrant requests a second and a third
> > > | language to access the same IDN name.
> > >
> > > Nothing prevents an all-Han string of any arbitrary length from
> > > being Japanese text.  The priority given to Chinese here is not
> > > likely to be well received by other groups.
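
The presumption rule under dispute can be sketched in a few lines of
Python.  This is only a reading of the rule as worded in the quoted
paragraph, not the I-D's procedure: the script ranges are standard
Unicode blocks picked as an assumption (CJK Unified Ideographs,
Hiragana plus Katakana, Hangul Syllables), and script_of and
presumed_tag are hypothetical helper names.

```python
def script_of(ch):
    """Rough script bucket for one character (assumed block ranges)."""
    cp = ord(ch)
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    if 0xAC00 <= cp <= 0xD7AF:
        return "hangul"
    return "other"


def presumed_tag(label):
    """Language tag the quoted rule would presume, or None if unstated."""
    scripts = {script_of(ch) for ch in label}
    if scripts == {"han"}:
        return "zh"          # all-Han labels are presumed Chinese
    if scripts == {"han", "kana"}:
        return "ja"
    if scripts == {"han", "hangul"}:
        return "ko"
    return None              # the quoted text gives no rule for other mixes


if __name__ == "__main__":
    print(presumed_tag("\u6771\u4eac"))                    # two Han characters
    print(presumed_tag("\u6771\u4eac\u30bf\u30ef\u30fc"))  # Han plus katakana
    print(presumed_tag("\u97d3\uac00"))                    # Han plus Hangul
```

The first sample is the two-character name of Tokyo, written entirely
in Han characters, and this reading of the rule tags it "zh".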
> >
> > Priority is given to Chinese for several reasons:
> > 1) The majority of these characters originated in China with
> >  their semantics and phonetics, and are naturally named by and
> >  known to the people who use them.  The number is
> >  100,000 - 20,003, about 80,000, on the way to be named.
> > 2) Kanji have more than two phonetics, and one of them
> >  is the Chinese phonetic.  So it is not the worst case for Kanji.
> > 3) An all-Kanji label automatically gets two registered names,
> >  one in Chinese and the other in Japanese.
> >
> > Japanese gets the Chinese registration for free; Chinese
> >  gets the work for nothing.  Who do you think is the biggest
> > beneficiary?
> >
> >
> > > | Also, it
> > > | introduces more policy decisions, for example, an all CJK character
> > > | trademark registrant may have to register in three languages to
> > > | ensure the legitimacy of the trademark.
> > >
> > > Wait just a minute.  Wasn't the whole idea of this language-tagging
> > > and CJK-folding scheme to PREVENT registrants from having to register
> > > an IDN identifier more than once?
> >
> > This registration is for the different user groups of the same
> > trade name, like AOL.com and AmericanOnLine.com in DNS,
> > but in IDN they are the same as <A><O><L>.com
> >
> > This is the IDN we have to work with: one match in IDN, one
> > match in DNS.  If more than one DNS name gives access to
> > one IDN label, IDN has to block them all at registration unless
> > they are registered.  That is what the Chinese group has been
> > saying: if we don't implement TC/SC, then there will be an
> > exponential number of DNS names for the same IDN label.
> >
> > >
> > > | After all, a useful tool is to let its
> > > | user to make decisions.
> > >
> > > Some tools are interactive, others are not.
> >
> > This depends on which layer of user you have
> > in mind.  I have several of them.
> >
> > >
> > > Finally, it is not yet clear to me whether the "idn-zh-" tag prefix
> >
> > Where does the idn- tag come in?  The zh-- tag shall be on the
> > same footing as the AMC tag bq-- and treated within the same
> > interface.  Please look through the idn-map I-D again.  If
> > I did not express that clearly, then tell me how to improve
> > it.
> >
> > > is supposed to be embedded within IDN identifiers or specified
> > > separately.  But between this additional label and the use of the
> > > less efficient StepCode instead of ACE-Z, it seems that several bytes
> > > out of the precious 63-byte limit are required as overhead to support
> > > this tagging scheme.  If
> >
> > Without a tag there is little chance you can process CJK and
> > similar problems, for example Latin and Armenian.
> > The tag takes the same number of bytes as the bq-- used in AMC.
> >
> > StepCode is not compressed, is human readable, and is
> > readable by foreigners.  You can weigh readability against
> > code length efficiency, but the judges are the administrators
> > of zone files, international workers in a foreign land,
> > and the IDN name owners.
> >
> >
> > > I remember correctly, it is CJK users (Soobok Lee is only the most
> > > vocal) who are most concerned about the space limitation and who want
> > > to find (or invent) the most efficient encoding system possible.  Will
> > > these other CJK users agree to this proposal?
> > >
> > > -Doug Ewell
> > >  Fullerton, California
> >
> > Each member joins this list independently.  You have
> > to ask them.
> >
> > Liana
> >
>