Re: [I18ndir] HTML, email addresses, etc

John C Klensin <john-ietf@jck.com> Thu, 11 June 2020 04:23 UTC

Date: Thu, 11 Jun 2020 00:23:36 -0400
From: John C Klensin <john-ietf@jck.com>
To: Nico Williams <nico@cryptonector.com>
cc: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, John Levine <johnl@taugh.com>, i18ndir@ietf.org, marc.blanchet@viagenie.ca
Message-ID: <0742AA76A3E8405855CC437C@PSB>
In-Reply-To: <20200610225842.GI3100@localhost>
References: <20200608145452.EB3E51A4BADD@ary.qy> <7bf0652a-01a9-79e9-6325-75ea5fa20fca@it.aoyama.ac.jp> <81773E93681984B8836B1915@PSB> <20200610225842.GI3100@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/NSQkwXSRnOYwRwe7eWRZ29o7sHw>
Subject: Re: [I18ndir] HTML, email addresses, etc


--On Wednesday, June 10, 2020 17:58 -0500 Nico Williams
<nico@cryptonector.com> wrote:

> On Tue, Jun 09, 2020 at 09:25:47AM -0400, John C Klensin wrote:
>> Absolutely.   And those are good examples of why a simplistic
>> rule banning mixed scripts would not be helpful.
> 
> As if people might not have names with mixed scripts.  What's
> next, mixed ethnicities?!  Especially for *input* fields,
> some here would not let me have the address I _want_?  In this
> day and age?

And that is what I said -- that a rule banning mixed scripts
would not be helpful -- so I am not sure what you are responding
to with apparent vehemence.

> There is no confusables issue here: these are addresses
> entered by the user, not displayed to the user (and if you
> display any, chances are it's the user's to the same user).

I think we almost agree.  Almost.  Let me try to break this down
a bit, focusing on where we might disagree.

(1) I think I still favor having a separate type, whether called
"i18email" or something else, rather than just saying "from now
on, 'email' means just about anything you can write in UTF-8
that has an '@' in the middle" or something close to it.  There
are two reasons.  One is that I've seen, too many times,
situations in which a back-end or intermediate system that is not
written well or robustly is handed UTF-8 (or anything much
different from ASCII) when ASCII is expected and basically
self-destructs, often leading to either (a) something completely
unpredictable and unintelligible (to the user) in the database
or passed on to other systems or (b) an error message with the
semantics and information content of "fail" or "you lose".
The other is that using a different keyword, even if one believes
the advantage is slight, gives us a chance to specify what the
associated field might mean and lets the page designer take
explicit responsibility rather than having one definition last
week and another definition next week.

(2) As to whether we should give them a rule or regular
expression to associate with the new type that goes much beyond 
   bunch-of-octets-for-local-part@bunch-of-octets-for-domain-part
I rather doubt it.  Experience indicates that, if people try to
perform more sophisticated checks, they will end up with false
negatives and thereby block things that the back-end systems or
delivery-end mail servers would accept.  I might even argue
against blocking foo@bar@example.!!!#$%^& because, as far as can
be known generally, it might make perfect sense to a local mail
system even if not on the Internet.  
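
To make the "not much beyond something@something" rule in (2) concrete,
here is a minimal sketch in Python.  The function name
looks_like_address is hypothetical and nothing here comes from an HTML
or SMTP specification; it only illustrates how little such a check
would do.  It splits on the last '@' so that any additional '@'
characters remain part of the local part.

```python
def looks_like_address(value: str) -> bool:
    """Hypothetical permissive check: something, an '@', something.

    Deliberately does NOT inspect which characters appear on either
    side; that decision is left to the back-end or mail system.
    """
    # rpartition splits on the LAST '@'; if there is none, `at` is "".
    local, at, domain = value.rpartition("@")
    return bool(at) and bool(local) and bool(domain)


if __name__ == "__main__":
    for candidate in ("user@example.com",
                      "foo@bar@example.!!!#$%^&",
                      "no-at-sign",
                      "@example.com"):
        print(repr(candidate), looks_like_address(candidate))
```

Under these assumptions even foo@bar@example.!!!#$%^& passes, which is
the point: the form field does not try to second-guess what a
particular mail system would accept.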

However, if one is going to adopt that model, then the rule has
to be "don't mess with it".  What is accepted goes directly to
the back-end or mail system: no compatibility equivalents (not
even wide and narrow width forms), no normalization at all, no
case-folding or lower-casing, no dropping of joiners,
non-joiners, or other code points that your note indicated were
unimportant.  Once one goes down that slippery slope, it is not
clear there is any rational stopping point.
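
As an illustration of what "don't mess with it" rules out, the
following Python sketch reports which common "cleanup" steps would
alter a submitted string.  The helper name transforms_that_change and
the particular set of transforms are assumptions chosen for the
example, not a list taken from any specification; only the standard
unicodedata module is used.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def transforms_that_change(value: str) -> list[str]:
    """Return the names of transforms that would alter `value`.

    An empty list means every transform below is a no-op for this
    input; anything else is a change the form should not make.
    """
    candidates = {
        "NFC": unicodedata.normalize("NFC", value),
        "NFKC": unicodedata.normalize("NFKC", value),
        "lower": value.lower(),
        "casefold": value.casefold(),
        "strip-joiners": value.replace(ZWJ, "").replace(ZWNJ, ""),
    }
    return [name for name, out in candidates.items() if out != value]


if __name__ == "__main__":
    # The second sample starts with FULLWIDTH LATIN CAPITAL LETTER U.
    for sample in ("user@example.com",
                   "\uff35ser@example.com",
                   "stra\u00dfe@example.de"):
        print(repr(sample), transforms_that_change(sample))
```

A form following the rule above would pass the original string along
unchanged whatever this reports; the sketch only shows how easily a
well-meaning transform produces a different address.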

(3) What advice to give people who are defining email addresses
for their mail servers (or deciding what to allow users to
create) is a different matter.  We know that trying to make case
distinctions and to either reject incoming messages where the
case isn't quite right or to map forms that differ only by case
causes problems.  Prohibiting that is very different from saying
"likely to cause problems, especially confused and unhappy users,
so don't do it unless you need to, understand the implications,
and are willing to explain why" (a more elaborate version of
what 5321 says).  Saying "following a reasonable PRECIS profile
is probably a good idea and is likely to avoid surprises -- if
you need to do something else or allow an exception, be sure you
know what you are doing" seems reasonable to me.  Different from
the requirement I think Marc was suggesting, but still
reasonable guidance.    

The similarity between advice like either of those examples and
the "you need to know what you are doing but here are some
general hints about things to avoid unless you really, really,
know what you are doing" advice of IDNA2008 (and the
clarifications in 5891bis) should be clear to everyone reading
this.

My thought about what to do with a clarification or guidance to
the SMTPUTF8 rules for local parts was similar.  Strongly
suggest avoiding anything but letters, digits, and symbols or
punctuation already in active use in 821/2821/5321 email (noting
that some of that punctuation would probably require some
qualifications to PRECIS rules).  Suggest that using a PRECIS
profile, where possible, is a much better idea than working out
per-protocol or per-mail-server rules and conventions.   And
stress that any of those code-point and rule-based mechanisms
are still a superset of what almost any sensible mail system
would allow, but the optimal subset for a given system is going
to need to depend on local needs and may require rules or
conventions about more than code points.

And, finally, whether emoji belong in local-parts or not is
something I'll happily encourage you to discuss on the PRECIS
list as long as you and others on that list remember at least
three things:  (i) the names by which Unicode calls them are not
necessarily the names by which they are called or read out loud,
especially in languages very different from English; (ii) the
graphemes associated with a given emoji code point are not
standardized and for some of them vary widely; (iii) combining
sequences are often problematic in identifiers [1].  Combining
sequences of emoji --either with emoji-specific qualifiers or
with ZWJ-- are not nearly as well structured and predictable as
ones with letters [2].  I'd probably think of more if it weren't
past midnight here.  So, if I were giving advice to a delivery
mail system administrator, it would be "don't unless you are
sure you know your users, their likely correspondents, and have
a plan about any support that might be necessary".   YMMV, of
course.

best,
   john



[1] This is, of course, one reason IDNA imposes an NFC
requirement and, if I recall, PRECIS does too, at least for
profiles that might be relevant.

[2] As an exercise, think about the emoji sequences for
"vomiting cowboy", "vomited up a cowboy", and "vomited on the
cowboy"
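
A small Python sketch of the contrast drawn in notes [1] and [2],
using only the standard unicodedata module.  The sample strings and the
helper name code_points are chosen for illustration and are not from
the thread.  With letters, NFC maps a base letter plus combining mark
to a single precomposed code point; an emoji ZWJ sequence has no such
composed form and remains several code points.

```python
import unicodedata

# Letters: two spellings of "é" that NFC makes identical.
DECOMPOSED = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Emoji: MAN, ZWJ, WOMAN, ZWJ, GIRL (commonly drawn as one "family").
FAMILY = "\U0001F468\u200d\U0001F469\u200d\U0001F467"


def code_points(text: str) -> list[str]:
    """List the code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    nfc = unicodedata.normalize("NFC", DECOMPOSED)
    print(code_points(DECOMPOSED), "->", code_points(nfc))
    print(code_points(FAMILY), "->",
          code_points(unicodedata.normalize("NFC", FAMILY)))
```

How such a sequence is rendered, and whether it is rendered as one
image at all, depends on the font and platform, which normalization
does not address.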