Re: [I18ndir] HTML, email addresses, etc
John C Klensin <john-ietf@jck.com> Thu, 11 June 2020 04:23 UTC
Date: Thu, 11 Jun 2020 00:23:36 -0400
From: John C Klensin <john-ietf@jck.com>
To: Nico Williams <nico@cryptonector.com>
cc: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, John Levine <johnl@taugh.com>, i18ndir@ietf.org, marc.blanchet@viagenie.ca
Message-ID: <0742AA76A3E8405855CC437C@PSB>
In-Reply-To: <20200610225842.GI3100@localhost>
References: <20200608145452.EB3E51A4BADD@ary.qy> <7bf0652a-01a9-79e9-6325-75ea5fa20fca@it.aoyama.ac.jp> <81773E93681984B8836B1915@PSB> <20200610225842.GI3100@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/NSQkwXSRnOYwRwe7eWRZ29o7sHw>
Subject: Re: [I18ndir] HTML, email addresses, etc

--On Wednesday, June 10, 2020 17:58 -0500 Nico Williams
<nico@cryptonector.com> wrote:

> On Tue, Jun 09, 2020 at 09:25:47AM -0400, John C Klensin wrote:
>> Absolutely. And those are good examples of why a simplistic
>> rule banning mixed scripts would not be helpful.
>
> As if people might not have names with mixed scripts. What's
> next, mixed ethnicities?! Especially for *input* fields,
> some here would not let me have the address I _want_? In this
> day and age?

And that is what I said -- that a rule banning mixed scripts would
not be helpful -- so I am not sure what you are responding to with
apparent vehemence.

> There is no confusables issue here: these are addresses
> entered by the user, not displayed to the user (and if you
> display any, chances are it's the user's to the same user).

I think we almost agree. Almost. Let me try to break this down a
bit, focusing on where we might disagree.

(1) I think I still favor having a separate type, whether called
"i18email" or something else, rather than just saying "from now on,
'email' means just about anything you can write in UTF-8 that has an
'@' in the middle" or something close to it. There are two reasons.
One is that I've seen, too many times, situations in which a
back-end or intermediate system that is not written well or robustly
is handed UTF-8 (or anything much different from ASCII) when ASCII
is expected and basically self-destructs, often leading to either
(a) something completely unpredictable and unintelligible (to the
user) in the database or passed on to other systems or (b) an error
message with the semantics and information content of "fail" or "you
lose". The other is that using a different keyword, even if one
believes the advantage is slight, gives us a chance to specify what
the associated field might mean and to let the page designer take
explicit responsibility, rather than having one definition last week
and another definition next week.

(2) As to whether we should give them a rule or regular expression
to associate with the new type that goes much beyond
bunch-of-octets-for-local-part@bunch-of-octets-for-domain-part,
I rather doubt it. Experience indicates that, if people try to
perform more sophisticated checks, they will end up with false
negatives and thereby block things that the back-end systems or
delivery-end mail servers would accept. I might even argue against
blocking foo@bar@example.!!!#$%^& because, as far as can be known
generally, it might make perfect sense to a local mail system even
if not on the Internet. However, if one is going to adopt that
model, then the rule has to be "don't mess with it". What is
accepted goes directly to the back-end or mail system: no
compatibility equivalents (not even wide and narrow width forms), no
normalization at all, no case-folding or lower-casing, no dropping
of joiners, non-joiners, or other code points that your note
indicated were unimportant. Once one goes down that slippery slope,
it is not clear there is any rational stopping point.

(3) What advice to give people who are defining email addresses for
their mail servers (or deciding what to allow users to create) is a
different matter. We know that trying to make case distinctions,
and either rejecting incoming messages where the case isn't quite
right or mapping forms that differ only by case, causes problems.
Prohibiting that is very different from saying "likely to cause
problems, especially confused and unhappy users, so don't do it
unless you need to, understand the implications, and are willing to
explain why" (a more elaborate version of what 5321 says). Saying
"following a reasonable PRECIS profile is probably a good idea and
is likely to avoid surprises -- if you need to do something else or
allow an exception, be sure you know what you are doing" seems
reasonable to me. Different from the requirement I think Marc was
suggesting, but still reasonable guidance.
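
The "don't mess with it" model in (2) could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
function name accept_address is hypothetical, and splitting at the
last '@' is one possible reading of "something with an '@' in the
middle", not a rule taken from any specification.

```python
from typing import Optional, Tuple


def accept_address(raw: str) -> Optional[Tuple[str, str]]:
    """Hypothetical shape-only check for a form field.

    Returns (local_part, domain_part) exactly as typed, or None if the
    input does not look like something@something.
    """
    # Assumption: split at the LAST '@', so an input with several '@'
    # characters is still accepted rather than blocked.
    local, at, domain = raw.rpartition("@")
    if not at or not local or not domain:
        return None
    # Deliberately no normalization, no case-folding, no width mapping,
    # and no removal of joiners: both parts are handed on untouched.
    return local, domain


print(accept_address("foo@bar@example.!!!#$%^&"))
print(accept_address("User@Example.COM"))
print(accept_address("no-at-sign"))
```

Whatever comes back non-None would be passed unchanged to the
back-end or mail system, which remains the only party that decides
whether the address actually makes sense.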

The similarity between advice like either of the two examples in (3)
and the "you need to know what you are doing but here are some
general hints about things to avoid unless you really, really, know
what you are doing" advice of IDNA2008 (and the clarifications in
5891bis) should be clear to everyone reading this.

My thought about what to do with a clarification or guidance to the
SMTPUTF8 rules for local parts was similar. Strongly suggest
avoiding anything but letters, digits, and symbols or punctuation
already in active use in 821/2821/5321 email (noting that some of
that punctuation would probably require some qualifications to
PRECIS rules). Suggest that using a PRECIS profile, where possible,
is a much better idea than working out per-protocol or
per-mail-server rules and conventions. And stress that any of those
code-point and rule-based mechanisms are still a superset of what
almost any sensible mail system would allow, but the optimal subset
for a given system is going to need to depend on local needs and may
require rules or conventions about more than code points.

And, finally, whether emoji belong in local-parts or not is
something I'll happily encourage you to discuss on the PRECIS list
as long as you and others on that list remember at least three
things:

(i) the names by which Unicode calls them are not necessarily the
names by which they are called or read out loud, especially in
languages very different from English;

(ii) the graphemes associated with a given emoji code point are not
standardized and for some of them vary widely;

(iii) combining sequences are often problematic in identifiers [1].
Combining sequences of emoji -- either with emoji-specific
qualifiers or with ZWJ -- are not nearly as well structured and
predictable as ones with letters [2].

I'd probably think of more if it weren't past midnight here.

So, if I were giving advice to a delivery mail system administrator,
it would be "don't unless you are sure you know your users, their
likely correspondents, and have a plan about any support that might
be necessary". YMMV, of course.

best,
   john

[1] This is, of course, one reason IDNA imposes an NFC requirement
and, if I recall, PRECIS does too, at least for profiles that might
be relevant.

[2] As an exercise, think about the emoji sequences for "vomiting
cowboy", "vomited up a cowboy", and "vomited on the cowboy".
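
To make footnote [1] and point (iii) a little more concrete, here is
a small sketch using Python's standard unicodedata module. The
sample strings and the helper name nfc are illustrative choices, not
anything taken from IDNA or PRECIS. It shows NFC reconciling two
spellings of an accented letter, while an emoji ZWJ sequence passes
through NFC as the same run of several code points.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of text."""
    return unicodedata.normalize("NFC", text)


# Two spellings of the same word: precomposed e-acute versus
# plain 'e' followed by U+0301 COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

print(composed == decomposed)       # the raw strings differ
print(nfc(decomposed) == composed)  # NFC makes them compare equal

# An emoji sequence joined with U+200D ZERO WIDTH JOINER
# (man, ZWJ, woman, ZWJ, girl).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(nfc(family) == family)        # NFC leaves the sequence as it is
print(len(family))                  # number of code points in the sequence
```

Normalization therefore gives a predictable answer for the letter
case, but it says nothing about how a joined emoji sequence is drawn
or read, which is left to fonts and platforms.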