[rfc-i] Feedback on Section 3.4 in draft-iab-rfc-nonascii-02, U+ syntax

paul.hoffman at vpnc.org (Paul Hoffman) Wed, 31 August 2016 19:25 UTC

From: paul.hoffman at vpnc.org (Paul Hoffman)
Date: Wed, 31 Aug 2016 12:25:45 -0700
Subject: [rfc-i] Feedback on Section 3.4 in draft-iab-rfc-nonascii-02, U+ syntax
In-Reply-To: <ecd3d504-764e-2b6d-72bd-3343ad22660d@seantek.com>
References: <ecd3d504-764e-2b6d-72bd-3343ad22660d@seantek.com>
Message-ID: <C5791071-864F-47A8-916B-95D8BE985178@vpnc.org>

On 31 Aug 2016, at 10:02, Sean Leonard wrote:

> /(Sent this to the authors, and the suggestion was that this is the 
> right mailing list for public discussion.)/
>
> **********
> Hello draft-iab-rfc-nonascii-02 people, here is feedback on 
> draft-iab-rfc-nonascii-02.
>
> Section 3.4 of draft-iab-rfc-nonascii-02 provides no less than six 
> preferred alternatives for how to represent a single Unicode character 
> or code point. They all pretty much say “the ___ character (___)”
> in various permutations. None of these are inherently wrong.
>
> However, The Unicode Standard itself (9.0.0 and prior versions) 
> provides a specific convention in Appendix A:
> “U+[x][x]xxxx NAME OF CHARACTER”
>
> Notably, the convention does not use “the ___ character”
> formulation. Grammatically, the convention is a character, so an 
> article is omitted. A conforming example would be:
>
>  1.  Temperature changes in the Temperature Control Protocol are
>      indicated by U+2206 INCREMENT.
>
> I would like to propose that this be used as at least a priority 
> alternative.

Disagree. That formulation is harder to read in running text, and 
running text is exactly the formulation we are aiming for. The fact that 
TUC likes a particular format should not impinge on our choice for 
readability.
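
For reference, the Appendix A style quoted above can be generated mechanically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `appendix_a_form` and its `with_glyph` parameter are hypothetical and exist only for illustration.

```python
import unicodedata


def appendix_a_form(ch: str, with_glyph: bool = False) -> str:
    """Return 'U+XXXX NAME' for a single code point (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # At least four upper-case hex digits; longer code points keep all digits.
    code = f"U+{ord(ch):04X}"
    # unicodedata.name() raises ValueError for code points without a name.
    name = unicodedata.name(ch)
    if with_glyph:
        # Variant that also shows the character itself in curly quotes.
        return f"{code} \u201c{ch}\u201d {name}"
    return f"{code} {name}"


print(appendix_a_form("\u2206"))
print(appendix_a_form("\U0001F631", with_glyph=True))
```

The names come from whatever Unicode version the Python build ships, so very recent characters may be missing.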

>
> In The Unicode Standard, two other conventions are noted:
>
> U+1F631 “😱” FACE SCREAMING IN FEAR
>
> U+1F631 “😱”
>
> These conventions show all-caps, and small-caps (which for PDF 
> presentation purposes, are actually stored as lowercase). They also 
> show curly quotes. I asked the Unicode mailing list over the weekend 
> and the general sense is that the uppercase is normative in plain text 
> (as shown in the UCD) but case distinctions, along with space and 
> (nearly all) hyphens, are not relevant for unambiguous identification.

Neither of these is easier to read in running text than the ones in the 
draft.
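
To illustrate the point in the quoted text that case does not matter for unambiguous identification, here is a minimal sketch, assuming Python's standard `unicodedata` module. The helper `lookup_loose` is hypothetical; it only folds case and collapses whitespace, and it deliberately leaves hyphens alone rather than attempting the full loose-matching rules.

```python
import unicodedata


def lookup_loose(name: str) -> str:
    """Look up a character by name, ignoring case and extra whitespace.

    Hypothetical helper: hyphens are NOT normalized here.
    """
    # Upper-case the name and collapse runs of whitespace to single spaces.
    normalized = " ".join(name.upper().split())
    # unicodedata.lookup() raises KeyError if no character has this name.
    return unicodedata.lookup(normalized)


ch = lookup_loose("face  screaming in fear")
print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```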

>
> draft-iab-rfc-nonascii-02 is only concerned with characters, not 
> semantics or presentation formats (unlike xml2rfc format). Assuming 
> that plain text is the norm for purposes of draft-iab-rfc-nonascii-02, 
> I suppose that it is sufficient for the plain text to have an ALL-CAPS 
> name. I was going to suggest a novel xml2rfc element for Unicode code 
> points, such as <ucode name="yes">😱</ucode> that would be 
> transformed into the output above in plain text mode. However, the 
> xml2rfc transformer can detect such text by looking for the presence 
> of “U+1F631 FACE SCREAMING IN FEAR”, and apply CSS to it in the 
> html output instead, viz.:
> span.uniname { /* CHAR STYLES */
> text-transform: lowercase;
> font-variant: small-caps;
> font-size: 110%;
> }
>
> As discussed here: 
> <http://www.unicode.org/mail-arch/unicode-ml/y2016-m08/0055.html>
>
> Personally I do not see the need for quotations around the character. 
> U+____ SP ? SP NAME ought to be good enough: the single ? is 
> going to be non-ASCII anyway. However there are implications for 
> combining marks, with or without quotes; this needs to be thought 
> through. Consider:
> U+0308 “◌̈” COMBINING DIAERESIS vs.
> U+0308 ◌̈ COMBINING DIAERESIS vs.
> U+0308 “̈” COMBINING DIAERESIS vs.
> U+0308 ̈ COMBINING DIAERESIS.
> See 
> <http://stackoverflow.com/questions/2224772/whats-the-unicode-glyph-used-to-indicate-combining-characters>
>
> The question is what happens when the ? is a specific protocol 
> element, which frequently (but not always) is quoted, such as "+" and 
> treated as verbatim text <spanx style="verb"> or the new <tt> in 
> xml2rfc v3.

This is another good reason for the current rules.
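
The detection idea in the quoted text (find a code point followed by its name, then style it with CSS) can be sketched as follows. This is only an illustration, assuming Python's standard `re`, `html`, and `unicodedata` modules; it is not how xml2rfc behaves, and the function name `mark_uninames` is hypothetical. The class name `uniname` is taken from the quoted CSS.

```python
import html
import re
import unicodedata

# "U+" followed by four to six upper-case hex digits.
CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})\b")


def mark_uninames(text: str) -> str:
    """Wrap 'U+XXXX NAME' runs in <span class="uniname"> (hypothetical helper)."""
    out = []
    pos = 0
    for m in CODE_POINT.finditer(text):
        cp = int(m.group(1), 16)
        if cp > 0x10FFFF:
            continue  # not a valid code point
        # Empty string when the code point has no name in this Python build.
        name = unicodedata.name(chr(cp), "")
        end = m.end() + 1 + len(name)
        # Only accept the match if the text really carries the official name.
        if name and text[m.end():end] == " " + name:
            out.append(html.escape(text[pos:m.start()]))
            marked = html.escape(text[m.start():end])
            out.append(f'<span class="uniname">{marked}</span>')
            pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)


print(mark_uninames("indicated by U+2206 INCREMENT."))
```

Checking the name against the character database avoids styling arbitrary upper-case words that happen to follow a code point, but it does not address the quoting and combining-mark questions raised above.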

>
> Section 3.6 (and elsewhere) discusses “U+ notation” without a 
> reference. Appendix A of [UnicodeCurrent] is appropriate.

That seems fine.