Re: [Ltru] Re: Review of 4646bis-10, sections 1 to 3.4
Addison Phillips <addison@yahoo-inc.com> Fri, 07 December 2007 21:30 UTC
Message-ID: <4759BB66.8010600@yahoo-inc.com>
Date: Fri, 07 Dec 2007 13:30:14 -0800
From: Addison Phillips <addison@yahoo-inc.com>
To: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Subject: Re: [Ltru] Re: Review of 4646bis-10, sections 1 to 3.4
References: <20071206163755.GP10807@mercury.ccil.org> <fjc93d$blk$1@ger.gmane.org>
In-Reply-To: <fjc93d$blk$1@ger.gmane.org>
Cc: ltru@lists.ietf.org
List-Id: Language Tag Registry Update working group discussion list <ltru.ietf.org>
Frank Ellermann wrote:
>
>> 3.1.1: change the definition of folding to "Folding is always
>> done on Unicode default grapheme boundaries".
>
> AFAIK the normal (2822upd) definition of folding is "at places
> where WSP is allowed", and unfolding means "replace FWS by SP".
>
> No grapheme boundaries involved. If 4646bis tries to invent
> its own folding rules this should be better explained. 4646bis
> shouldn't expose such Unicode oddities, checking my own code:
>
> | #### unfold field body ############################################
> | /^[\t ]/ { BODY = BODY " " STRIP( $0 )
> |            next
> |          }
>
> That replaces FWS by one SP, I'm almost certain that I don't want
> an SP between the "graphemes" of a "folded word".

Except... our record-jar format (and this document) doesn't follow that rule. In particular, if we permit the folding (which, in the case of the registry, is done by people and not machines) to occur mid-grapheme, it will break the character. For example, if you were to put U+0065 U+0300 as a sequence and the fold occurred between those two characters, the unfolded result would be U+0065 U+0020 U+0300, or e ̀ (e followed by accent) rather than è. IOW, the registry is supposed to be viewable and readable as a plain-text (UTF-8) file.

And note that my record-jar draft goes further and suppresses the addition of the space on unfolding (at the cost of complicating the syntax slightly with a line-continuation character).

>
>> also prohibits folding in the middle of a Hangul syllable
>> written as separate jamo.
>
> Do they have "WSP separated words", or is there a serious
> chance of more than 72-1 adjacent bytes belonging to one
> or more graphemes ?

Korean typically doesn't have WSP-separated words. The same applies to other languages, such as Thai, that also use combining sequences. There is no real (only theoretical) danger of 72-1 adjacent bytes belonging to ONLY one grapheme.

> Talking about "bytes" in conjunction
> with UTF-8 makes me nervous.

Why so? UTF-8 is a multibyte character encoding, so its code unit is the byte. When we speak of counting things, it is good to know what is being counted. In this case, we count bytes but ensure that breaks occur between characters; indeed, with John's suggestion, between *visual* "characters", aka graphemes.

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
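
To make the folding point above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names (unfold, clusters, fold) are hypothetical and come from neither 4646bis nor the record-jar draft, and "grapheme" is approximated as a base character plus any following combining marks. Real default grapheme boundaries are defined by UAX #29 and cover more cases, Hangul jamo among them, which this approximation does not handle.

```python
import unicodedata


def unfold(lines):
    """Join folded lines the way the quoted awk snippet does:
    strip each line and put one space where the fold was."""
    return " ".join(line.strip() for line in lines)


def clusters(text):
    """Approximate grapheme clusters: a base character plus any
    following combining marks. Not full UAX #29 segmentation."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)
    return out


def fold(text, max_bytes):
    """Greedy fold that counts UTF-8 bytes per line but only
    breaks between the approximate clusters defined above."""
    lines, current, size = [], "", 0
    for cluster in clusters(text):
        n = len(cluster.encode("utf-8"))
        if current and size + n > max_bytes:
            lines.append(current)
            current, size = "", 0
        current += cluster
        size += n
    if current:
        lines.append(current)
    return lines


# A fold placed between U+0065 and U+0300 separates the accent on unfolding.
print(ascii(unfold(["e", " \u0300"])))  # 'e \u0300'

# Counting bytes while breaking only between clusters keeps each pair intact.
for line in fold("e\u0300e\u0300e\u0300", 4):
    print(ascii(line), len(line.encode("utf-8")))
```

The limit is measured in bytes, since that is UTF-8's code unit, but the break points are chosen between visual characters, which is the distinction drawn above.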