Re: [precis] Adam Roach's Discuss on draft-ietf-precis-7564bis-08: (with DISCUSS)

John C Klensin <john-ietf@jck.com> Thu, 06 July 2017 12:47 UTC

Date: Thu, 06 Jul 2017 08:47:13 -0400
From: John C Klensin <john-ietf@jck.com>
To: Peter Saint-Andre <stpeter@stpeter.im>, Adam Roach <adam@nostrum.com>, The IESG <iesg@ietf.org>
cc: precis-chairs@ietf.org, draft-ietf-precis-7564bis@ietf.org, precis@ietf.org
Message-ID: <351FA1CF6BB27A6E3363C58E@PSB>
In-Reply-To: <ce0503e8-7d34-8cb7-0b9b-2ea9c7a2c5b4@stpeter.im>
References: <149929088066.19029.17184582029308905319.idtracker@ietfa.amsl.com> <ce0503e8-7d34-8cb7-0b9b-2ea9c7a2c5b4@stpeter.im>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Archived-At: <https://mailarchive.ietf.org/arch/msg/precis/x4XGCkPPH93xa1252mK24O5SH2M>
Subject: Re: [precis] Adam Roach's Discuss on draft-ietf-precis-7564bis-08: (with DISCUSS)
Precedence: list

--On Wednesday, July 5, 2017 16:58 -0600 Peter Saint-Andre
<stpeter@stpeter.im> wrote:

> On 7/5/17 3:41 PM, Adam Roach wrote:
>> Adam Roach has entered the following ballot position for
>> draft-ietf-precis-7564bis-08: Discuss
>> 
>> When responding, please keep the subject line intact and
>> reply to all email addresses included in the To and CC lines.
>> (Feel free to cut this introductory paragraph, however.)
>> 
>> 
>> Please refer to
>> https://www.ietf.org/iesg/statement/discuss-criteria.html for
>> more information about IESG DISCUSS and COMMENT positions.
>> 
>> 
>> The document, along with other ballot positions, can be found
>> here:
>> https://datatracker.ietf.org/doc/draft-ietf-precis-7564bis/
>> 
>> 
>> 
>> ----------------------------------------------------------------------
>> DISCUSS:
>> ----------------------------------------------------------------------
>> 
>> Section 12.5 contains the following normative statement:
>> 
>>>  Furthermore, because most languages are typically
>>>  represented by a single script or a small set of scripts,
>>>  and because most scripts are typically contained in one or
>>>  more blocks of code points, the software SHOULD warn the
>>>  user when presenting a string that mixes code points from
>>>  more than one script or block, or that uses code points
>>>  outside the normal range of the user's preferred
>>>  language(s).
>> 
>> This guidance seems broadly unimplementable for any users
>> whose native language uses a non-Latin script. Due in large
>> part to the Internet's ASCII heritage, and combined with the
>> somewhat ubiquitous use of Latin characters for other
>> worldwide purposes (e.g., a quick perusal of Russian- and
>> Chinese-language web sites shows numerous examples of Latin
>> representations for things like stock ticker symbols and
>> metric abbreviations), it seems that the normative
>> requirement to warn when "presenting a string that... uses
>> code points outside the normal range of the user's preferred
>> language(s)" will *either* warn non-Latin-character users
>> almost constantly (if Latin is considered outside the range),
>> or be broadly useless in preventing spoofing (if it is not).
>> 
>> I'm not clever enough to come up with a generalized solution
>> for users of all alphabets, so don't have a generic proposal
>> here; but I think that the guidance does at least need to be
>> properly scoped so that it bears only on warning Latin
>> alphabet users of the presence of non-Latin characters, while
>> acknowledging that it is probably rather useless when used in
>> the opposite direction. I imagine that it still makes sense
>> to warn non-Latin users of non-Latin characters outside the
>> codepoints used by their language (e.g., warning Greek
>> speakers of the presence of Cyrillic characters).
> 
> Good catch. Yes, we could add a carve-out for characters from
> the ASCII repertoire when the context is Internet applications
> that use such characters.

Please be a bit careful about how you frame this.  AFAICT, there
are actually four relevant cases [1]:

(i) Latin-only, but including both ASCII and "decorated"
characters.  Note that, in the real world, a common typographic
and matching convention for some languages is simply to drop
diacritical markings when they are inconvenient.

(ii)  Predominantly Latin (or Latin-expected) mixed with
"something else".  It is even possible that Latin+Greek and
Latin+Cyrillic (and perhaps other Latin-derived scripts) should
be treated differently from Latin+SomeUnrelatedScript.

(iii) Predominantly a non-Latin script mixed with Latin
characters.  Again, it may be that the Greek, Cyrillic, and
perhaps other Latin-derived script cases are different from
scripts not derived from Latin characters (again, see "digits"
below) [3].

(iv) Non-Latin script mixed with a different non-Latin script,
remembering that there are more "opportunities" for confusion
(and hence more need for caution) between Greek and Cyrillic
than between Latin and either one [2] [3].
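
As a rough illustration of how the four cases above might be told
apart mechanically, here is a minimal sketch.  It rests on a loud
assumption: the first word of a character's Unicode name is used
as a stand-in for its script, because Python's standard library
does not expose the real Script property.  The names
KNOWN_SCRIPTS, rough_script, and classify_mix are hypothetical,
and nothing here is taken from the PRECIS documents.

```python
import unicodedata

# Assumption: the first word of a character's Unicode name approximates its
# script. This is a heuristic, not the Unicode Script property.
KNOWN_SCRIPTS = {"LATIN", "GREEK", "CYRILLIC", "ARABIC", "HEBREW", "CJK"}


def rough_script(ch):
    """Return a rough script label for ch, or None if it looks script-neutral."""
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    return first_word if first_word in KNOWN_SCRIPTS else None


def classify_mix(text):
    """Map a string onto cases (i)-(iv) by counting characters per rough script."""
    counts = {}
    for ch in text:
        script = rough_script(ch)
        if script is not None:
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return "no-script"
    if len(counts) == 1:
        return "case (i)" if "LATIN" in counts else "single non-Latin script"
    if "LATIN" not in counts:
        return "case (iv)"
    # Ties go to whichever script appeared first in the string.
    dominant = max(counts, key=counts.get)
    return "case (ii)" if dominant == "LATIN" else "case (iii)"
```

A string of digits and punctuation falls outside all four cases
in this sketch, which is one reason the digit question needs its
own treatment.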

In addition, there is a nearly-orthogonal issue with so-called
"European digits".  Those digits are used with a variety of
scripts, including Arabic in parts of North Africa and
elsewhere, modern Hebrew, and, of course, Greek and Cyrillic.
If they are treated as ASCII (or undecorated Latin), rather than
as script-free, they imply a problem with the above rules
(remember that too many warnings about cases that are almost
always safe result in warnings being ignored and/or treated as
an irritant); if they are treated as script-independent, that may
eliminate possible warnings about the characters in many scripts
that look more or less like "l" or "0".
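
The effect of the two digit policies can be sketched as follows.
This is an illustration under stated assumptions, not a proposed
rule: scripts_in and would_warn are hypothetical names, the
digits_as_latin flag is invented for the example, and the script
label is again approximated by the first word of the Unicode
character name.

```python
import unicodedata


def scripts_in(text, digits_as_latin):
    """Collect rough script labels; the flag selects the digit policy."""
    found = set()
    for ch in text:
        if "0" <= ch <= "9":
            # European (ASCII) digits: either counted as Latin or ignored.
            if digits_as_latin:
                found.add("LATIN")
            continue
        if ch.isalpha():
            # Assumption: first word of the character name approximates script.
            found.add(unicodedata.name(ch, "UNKNOWN").split(" ", 1)[0])
    return found


def would_warn(text, digits_as_latin):
    """Warn when more than one rough script label is present."""
    return len(scripts_in(text, digits_as_latin)) > 1
```

With digits counted as Latin, a Hebrew word followed by "101"
triggers a warning even though that usage is ordinary.  With
digits treated as script-free, a Cyrillic word in which "0" has
been substituted for a letter passes silently.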

> (It's not *necessarily* the case that all applications using
> PRECIS are Internet applications or involve ASCII characters -
> e.g., perhaps an application deployed on a closed intranet
> within, say, the Chinese government could still use the PRECIS
> rules to handle input and output strings, without any ASCII
> characters ever shown to end users.)

Yes, although see the comments above and the notes below.  It
may also be worth remembering that, whatever normative authority
the PRECIS specs have (or we would like them to have), there are
separate recommendations for various aspects of the Web (from
W3C and WHATWG at least) and generic recommendations from The
Unicode Consortium as well as assorted national recommendations.
They are not all consistent with each other and there is a
tendency for each standards body to believe that its
recommendations should dominate all others.

I don't know how much of this note (including both the above and
the endnotes below) is worth incorporating in these documents
(or potential updates to other documents such as RFC 6943), but
any new text that is inserted into 7564bis or other PRECIS specs
should at least not make things worse.

best,
   john

"It is always more complicated" (V. Cerf, possibly derived from
Douglas Adams)

[1] With apologies to users of scripts developed in recent
centuries and largely derived from Latin, I have generally not
included them in the categorization above, but the implications
and risks should be clear.  

[2] While Latin script is probably the most common source of
borrowed (or appropriated) character forms in recent centuries,
it is by no means unique and there are consequently other
sub-cases of this case too.  This is further complicated by at
least two other issues: Sometimes Unicode has specified that
the characters borrowed from a base script should be represented
by the base script code points; sometimes the graphemes are
assigned code points in the [partially-] derived script.  In the
first case, it becomes nearly impossible to write languages
using the derived script without including code points that are
nominally part of the other script; in the second, there are
more opportunities for confusion.  The principles underlying
those Unicode decisions are, at best, not obvious to a casual
user.

[3] Whatever decisions are made about these issues, it may be
worth remembering that, if NFKC or NFKD are used, even
protectively, I know of nothing in Unicode that prevents a
character from having a compatibility equivalent in a different
script (I believe there are several such cases, but don't have
time to dig them out right now).  So, if one is going to specify
script-based rules and compatibility transformations (or other
compatibility relationships), it may be important to specify the
order of operations.
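
A small sketch of why the order matters, using one case I am
sure of: U+00B5 MICRO SIGN, which NFKC maps to U+03BC GREEK SMALL
LETTER MU.  As above, the script label is approximated by the
first word of the Unicode character name (an assumption, since
Python's standard library has no Script property), and
rough_scripts is a hypothetical helper.

```python
import unicodedata

SCRIPT_WORDS = ("LATIN", "GREEK", "CYRILLIC")


def rough_scripts(text):
    """Rough script labels for the letters in text; other characters are skipped."""
    labels = set()
    for ch in text:
        first_word = unicodedata.name(ch, "").split(" ", 1)[0]
        if first_word in SCRIPT_WORDS:
            labels.add(first_word)
    return labels


sample = "5\u00b5m"  # DIGIT FIVE, MICRO SIGN, LATIN SMALL LETTER M

# Script check first, normalization afterwards: only Latin is seen.
checked_before = rough_scripts(sample)

# Normalization first, script check afterwards: Greek appears as well.
checked_after = rough_scripts(unicodedata.normalize("NFKC", sample))

print(sorted(checked_before), sorted(checked_after))
```

A rule that checks scripts before applying NFKC would accept this
string as single-script, while the same rule applied after NFKC
would flag it as mixed.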
