Re: [Ltru] Re: Review of 4646bis-10, sections 1 to 3.4

Frank Ellermann wrote:
> 
>> 3.1.1: change the definition of folding to "Folding is always
>> done on Unicode default grapheme boundaries".
> 
> AFAIK the normal (2822upd) definition of folding is "at places
> where WSP is allowed", and unfolding means "replace FWS by SP".
> 
> No grapheme boundaries involved.  If 4646bis tries to invent
> its own folding rules this should be better explained.  4646bis
> shouldn't expose such Unicode oddities, checking my own code:
> 
> | #### unfold field body ############################################
> | /^[\t ]/ { BODY = BODY " " STRIP( $0 )
> |            next
> |          }
> 
> That replaces FWS by one SP, I'm almost certain that I don't want 
> an SP between the "graphemes" of a "folded word".

Except... our record-jar format (and this document) doesn't follow that 
rule. In particular, if we permit the folding (which is done by people 
and not machines, in the case of the registry) to occur mid-grapheme, it 
will break the character.

For example, if you were to put U+0065 U+0300 as a sequence and the fold 
occurred between those two characters, the result would be U+0065 U+0020 
U+0300, or e ̀ (e followed by accent) rather than è.
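
A rough illustration of that failure in Python. The helper name unfold_with_space is made up for this example; it simply mirrors the "replace FWS by one SP" rule quoted above and is not taken from 4646bis or any registry tool:

```python
import unicodedata

def unfold_with_space(lines):
    """Join a folded field body, replacing each fold with a single SP."""
    head, *rest = lines
    return head + "".join(" " + line.strip() for line in rest)

# U+0065 U+0300: "e" followed by COMBINING GRAVE ACCENT
word = "e\u0300"

# Hypothetical fold that lands between the base letter and the accent
folded = [word[0], " " + word[1]]
unfolded = unfold_with_space(folded)

# The unbroken sequence composes to U+00E8 under NFC; the unfolded one
# cannot, because a U+0020 now sits between the base and the mark.
intact_nfc = unicodedata.normalize("NFC", word)
broken_nfc = unicodedata.normalize("NFC", unfolded)
```

Here unfolded is U+0065 U+0020 U+0300, so the accent no longer attaches to the "e".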

IOW, the registry is supposed to be viewable and readable as a 
plain-text (UTF-8) file.

And note that my record-jar draft goes further and suppresses the 
addition of the space on unfolding (at the cost of complicating the 
syntax slightly with a line-continuation character).

> 
>> also prohibits folding in the middle of a Hangul syllable
>> written as separate jamo.
> 
> Do they have "WSP separated words", or is there a serious
> chance of more than 72-1 adjacent bytes belonging to one
> or more graphemes ? 

They don't have WSP separated words in Korean, typically. This also 
applies to other languages, such as Thai, that also use combining 
sequences. There is no real (only theoretical) danger of 72-1 adjacent 
bytes belonging to ONLY one grapheme.

> Talking about "bytes" in conjunction
> with UTF-8 makes me nervous.

Why so? UTF-8 is a multibyte character encoding, so its code unit is 
the byte. When we speak of counting things, it is good to know what is 
being counted. In this case, we count bytes, but ensure that breaks 
occur between characters or, with John's suggestion, between 
*visual* "characters", aka graphemes.
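
A minimal sketch of "count bytes, break between graphemes", in Python. It is only an approximation: it treats a base character plus any following combining marks (Unicode general category M*) as one unit, which is NOT the full default grapheme cluster definition and in particular does not keep Hangul jamo sequences together. The names clusters and fold, and the limit parameter, are assumptions for this example:

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

def fold(text, limit=72):
    """Split text into pieces of at most `limit` UTF-8 bytes,
    breaking only between clusters, never inside one.
    A single cluster longer than `limit` is emitted on its own."""
    lines, current = [], ""
    for cl in clusters(text):
        if current and len((current + cl).encode("utf-8")) > limit:
            lines.append(current)
            current = cl
        else:
            current += cl
    if current:
        lines.append(current)
    return lines

# Ten copies of U+0065 U+0300 (3 bytes each in UTF-8), folded at 7 bytes
sample = "e\u0300" * 10
pieces = fold(sample, limit=7)
```

The limit is measured in bytes, but the break positions are chosen per cluster, so no piece starts with a stranded accent. A real implementation would need proper grapheme segmentation rather than this category check.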

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG

Internationalization is an architecture.
It is not a feature.
