[Ltru] Broken folding (was: (editor response) Review of 4646bis-10, sections 1 to 3.4)

"Frank Ellermann" <nobody@xyzzy.claranet.de> Sat, 08 December 2007 15:33 UTC


Addison Phillips wrote:

>> 3.1.1: change the definition of folding to "Folding is always done on
>> Unicode default grapheme boundaries".  That says what the current text
>> says, and also prohibits folding in the middle of a Hangul syllable
>> written as separate jamo.
 
> :: shiver ::
 
> Yes, I know. (laughing) I hoped to avoid implementing it in record-jar 
> though. But you're right.
 
> The sentence was rewritten as follows, to include the example:
 
> Folding is always done on Unicode default grapheme boundaries (that
> is, never in the middle of a multibyte UTF-8 sequence nor in the
> middle of a combining character sequence).

Do they offer a list of combining characters somewhere, or is that
a case of "grep the 996 KB list for a non-zero combining class"?
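
For what it's worth, the "grep for a non-zero combining class" approach
can be sketched with Python's standard unicodedata module.  This is only
a rough illustration: the helper name is made up, the result depends on
the Unicode version shipped with the interpreter, and a non-zero
combining class is not the same thing as a default grapheme boundary
(conjoining Hangul jamo, for instance, have combining class 0).

```python
import sys
import unicodedata


def combining_code_points():
    """Return every code point whose canonical combining class is non-zero.

    Hypothetical helper for illustration; it scans the whole code space
    using the Unicode data bundled with this Python interpreter.
    """
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.combining(chr(cp)) != 0]


marks = combining_code_points()
print(len(marks), "code points with a non-zero combining class")

print(unicodedata.combining("\u0301"))  # COMBINING ACUTE ACCENT -> 230
print(unicodedata.combining("\u1161"))  # HANGUL JUNGSEONG A -> 0
```

The second lookup shows why a combining-class list alone would not stop
a fold in the middle of a Hangul syllable written as separate jamo.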

You need a note that "folding" in your definition is supposed to be
replaced by NO space when it is unfolded.  In other words it MUST NOT
occur where WSP is allowed (replacing a real WSP by folding, which is
later unfolded into nothing, joins words):

1 - input was Example:<SP>fold<SP>me<CRLF>
2 - folded is Example:<SP>fold<CRLF><SP>me<CRLF>
3 - output is Example:<SP>foldme<CRLF>

Your "folding" lost the space here, I see no way to protect it.  
The folks who created STD 11 etc. knew what they were up to,
and the whole world knows how STD 11 folding works.
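
The three steps above can be replayed in a few lines of Python.  This is
a minimal sketch: the function names are invented here, and "unfold into
nothing" is the reading of the draft definition described above, not
text taken from the draft.  For comparison, STD 11 unfolding treats CRLF
followed by WSP as equivalent to the WSP alone.

```python
CRLF = "\r\n"


def fold_before_space(line, at):
    """Insert CRLF in front of the SP at index `at` (step 1 -> step 2)."""
    assert line[at] == " "
    return line[:at] + CRLF + line[at:]


def unfold_to_nothing(text):
    """Assumed reading of the draft: CRLF SP vanishes completely."""
    return text.replace(CRLF + " ", "")


def unfold_std11(text):
    """STD 11 style: only the CRLF goes away, the SP is kept."""
    return text.replace(CRLF + " ", " ")


original = "Example: fold me" + CRLF
folded = fold_before_space(original, original.rindex(" "))

print(repr(folded))                     # 'Example: fold\r\n me\r\n'
print(repr(unfold_to_nothing(folded)))  # 'Example: foldme\r\n'
print(repr(unfold_std11(folded)))       # 'Example: fold me\r\n'
```

Note that the folded text is byte for byte the same in both cases; only
the unfolding rule decides whether the space survives.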

 [In another message =========================================]
> IOW the registry is supposed to be viewable and readable as
> a plain-text (UTF-8) file.

The folding business should be similar with NCRs for graphemes:
not folding "within" an NCR is obvious, like not folding within
a UTF-8 code point.  Record-jar is rather pointless for UTF-8,
I always wanted "UTF-8 => XML", how about simply deleting the
line length limit of 72?  "72 bytes" is a pointless concept for
UTF-8, and 72 "graphemes" don't help with half- vs. full width.
(Actually I've no clue, maybe NFC gets rid of the width hurdle.)

> They don't have WSP separated words in Korean, typically

Ugh.  Transform the 72 MUST into a SHOULD, and allow folding only
at 1*WSP; allowing folding within a word can't work as expected.
Violating the SHOULD for longer words is perfectly fine.  We're
not interested in RFC 2047 / 2231 encoded "words" for the registry.

[ MIME has smart rules about SP between encoded words, but they are
 not trivial; this killed several EAI downgrade drafts from my POV ]
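
A rough sketch of "fold only at 1*WSP, with 72 as a SHOULD" could look
like the following.  The function name and the choice to count the limit
in Python characters (code points) are assumptions for illustration; the
whitespace stays at the start of the continuation line, so unfolding is
just dropping the CRLF, and an overlong word is simply left overlong.

```python
CRLF = "\r\n"


def fold_at_wsp(line, limit=72):
    """Fold `line` (without its trailing CRLF) only in front of SP or HTAB.

    `limit` is a soft limit: a word with no WSP inside the limit is kept
    whole and the line is cut at the next WSP after it, if there is one.
    """
    out = []
    while len(line) > limit:
        # Last WSP at index 1..limit; index 0 is skipped because a
        # continuation line already starts with its own WSP.
        cut = max(line.rfind(" ", 1, limit + 1),
                  line.rfind("\t", 1, limit + 1))
        if cut == -1:
            # Overlong word: violate the SHOULD and cut at the next WSP.
            later = [i for i in (line.find(" ", limit + 1),
                                 line.find("\t", limit + 1)) if i != -1]
            if not later:
                break
            cut = min(later)
        out.append(line[:cut])
        line = line[cut:]
    out.append(line)
    return CRLF.join(out)


text = "Description: " + " ".join(["word"] * 30)
folded = fold_at_wsp(text, limit=20)
print(folded)
print(folded.replace(CRLF, "") == text)  # True
```

Because no whitespace is ever consumed by the fold, the round trip is
lossless without any special rule for the unfolding side.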

>> Talking about "bytes" in conjunction with UTF-8 makes me nervous.

> Why so? UTF-8 is a multibyte character encoding. So its code
> unit is the byte.

Sure, but what are "72 bytes" supposed to do, if that could be 18
to 72 code points, and hopefully more than zero graphemes per line?
The 72 only made sense for the NCR registry when viewed as ASCII.
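
The "18 to 72 code points" range follows from UTF-8 using one to four
bytes per code point.  A small sketch (the helper name and the sample
characters are picked for illustration only):

```python
def code_points_in_budget(text, budget=72):
    """Count how many leading code points of `text` fit into `budget`
    UTF-8 bytes without splitting a multibyte sequence."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode("utf-8"))
        if used + size > budget:
            break
        used += size
        count += 1
    return count


print(code_points_in_budget("a" * 100))           # 72 (1 byte each)
print(code_points_in_budget("\u00e9" * 100))      # 36 (2 bytes each)
print(code_points_in_budget("\ud55c" * 100))      # 24 (3 bytes each)
print(code_points_in_budget("\U00010400" * 100))  # 18 (4 bytes each)
```

None of these numbers says anything about graphemes or display width,
which is the actual concern for a human reading the registry.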

 Frank


