Re: [precis] names and usernames

John C Klensin <> Mon, 27 February 2017 01:48 UTC

Date: Sun, 26 Feb 2017 20:26:30 -0500
From: John C Klensin <>
To: Peter Saint-Andre <>,
Message-ID: <6C5D333475A12AEAAC163D59@PSB>
In-Reply-To: <>
References: <> <2F562E0E75615D28FB8474A8@PSB> <> <>
X-Mailer: Mulberry/4.0.8 (Win32)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
X-SA-Exim-Scanned: No (on; SAEximRunCond expanded to false
Archived-At: <>
Subject: Re: [precis] names and usernames
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: Preparation and Comparison of Internationalized Strings <>
X-List-Received-Date: Mon, 27 Feb 2017 01:48:06 -0000

--On Sunday, February 26, 2017 17:21 -0700 Peter Saint-Andre
<> wrote:

> I'm still waiting to hear from experts with knowledge of Indic
> and eastern Arabic scripts. However, I have been corresponding
> offlist with someone who has knowledge of the second issue I
> raised...
>>>> Second, apparently some Chinese family names 
> According to my correspondent, the challenge is representation
> of some given names, not family names (e.g., legislation in
> Taiwan stipulates that a given name can include any character
> that has ever appeared in a dictionary, even dictionaries
> published hundreds of years ago).

Entirely possible that I misunderstood the precise nature of the
problem.  From the standpoint of what should be allowed in an
identifier that people would expect to reflect their names,
whether the name is a family name, a personal name, a tribal or
clan name, or a patronymic or matronymic would seem to me to
make little difference.

>>>> are typically
>>>> written (especially outside the People's Republic of China)
>>>> using characters that the Unicode Consortium assigns to
>>>> non-BMP code points 
> John, forgive my ignorance, but it seems to me that the plane
> is irrelevant here: in PRECIS we base decisions on code point
> properties. Thus, for instance, any code point whose Unicode
> general category is "Lo" (other letter) is allowed in the
> PRECIS IdentifierClass (per Section 9.1 of RFC 7564),
> regardless of the plane. As an example, a code point like
> U+2F804 (CJK COMPATIBILITY IDEOGRAPH-2F804) would be allowed,
> even though it is in the Supplementary Ideographic Plane.

If I recall, there is another issue with compatibility
ideographs (i.e., they may or may not be a good example), but
your statement above should certainly be correct.  My reason for
mentioning the BMP is that we have recently again heard from
someone who wants to confine an I18n library to work in UTF-16,
apparently without surrogates.  While any problems of that sort
would certainly be bugs, that sort of UTF-16 discussion leaves
me with the sense that higher planes may be a bit less useful in
practice than the BMP.  That is not, however, a protocol
specification problem in any way.
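
To make the plane-versus-property point concrete, here is a minimal
sketch using Python's standard unicodedata module.  The helper name
describe() is made up for illustration and is not part of any PRECIS
library.  It reports the general category, the plane, and the number
of UTF-16 code units for a code point, assuming the Unicode data
bundled with the interpreter.

```python
import unicodedata


def describe(cp: int) -> dict:
    """Summarize a code point (hypothetical helper, for illustration only)."""
    ch = chr(cp)
    return {
        "code_point": f"U+{cp:04X}",
        # General category, e.g. "Lo" for "other letter".
        "general_category": unicodedata.category(ch),
        # Plane 0 is the BMP; plane 2 is the Supplementary Ideographic Plane.
        "plane": cp >> 16,
        # Each UTF-16 code unit is two bytes; non-BMP code points need
        # a surrogate pair, i.e. two code units.
        "utf16_code_units": len(ch.encode("utf-16-be")) // 2,
    }


if __name__ == "__main__":
    # U+6708 is a BMP ideograph; U+2F804 is the example quoted above.
    for cp in (0x6708, 0x2F804):
        print(describe(cp))
```

Both code points should report the same general category; only the
plane and the UTF-16 length differ, which is where a library limited
to UTF-16 without surrogates would presumably run into trouble.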

>>>> or assigns in the BMP but as
>>>> compatibility decomposable characters (and thus disallowed
>>>> by RFC 7564 in the IdentifierClass).

> My correspondent said it should be fine to disallow
> compatibility decomposable characters such as U+328A (CIRCLED
> IDEOGRAPH MOON) because according to him they would not be
> used in given or family names.

Whether the rules of IDNA2008 are good enough is another matter
-- if nothing else, we had to work with the properties Unicode
gave us -- but the basic intent was certainly to disallow any
code point that, visually and in general understanding, was a
circled variation on a base character or a font variation that
was not normally considered a different base character.
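
As a rough illustration of the "compatibility decomposable" test, the
sketch below uses Python's standard unicodedata module.  The function
names has_compat() and decomposition_tag() are hypothetical, and the
check is only an approximation of the HasCompat idea in RFC 7564 (a
code point that changes under NFKC); it is not a full implementation
of the PRECIS derivation rules, and results depend on the Unicode
version shipped with the interpreter.

```python
import unicodedata
from typing import Optional


def has_compat(cp: int) -> bool:
    """True if NFKC normalization changes the code point (illustrative)."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


def decomposition_tag(cp: int) -> Optional[str]:
    """Return the compatibility tag such as "<circle>", or None.

    None means either no decomposition at all or a canonical one,
    since canonical decompositions carry no tag.
    """
    decomp = unicodedata.decomposition(chr(cp))
    if decomp.startswith("<"):
        return decomp.split(">", 1)[0] + ">"
    return None


if __name__ == "__main__":
    # U+328A CIRCLED IDEOGRAPH MOON versus its base character U+6708,
    # plus the compatibility ideograph U+2F804 mentioned earlier.
    for cp in (0x328A, 0x6708, 0x2F804):
        print(f"U+{cp:04X}", has_compat(cp), decomposition_tag(cp))
```

Under this check the circled form is flagged and its base character
is not.  U+2F804 is also changed by NFKC, though it carries no
compatibility tag, which suggests it may not be an ideal example of
an allowed code point.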

> All of this is second-hand, so take it with a grain of salt.

FWIW, consistent with my knowledge.