Re: [I18ndir] HTML, email addresses, etc

John C Klensin <john-ietf@jck.com> Tue, 09 June 2020 05:59 UTC

Date: Tue, 09 Jun 2020 01:58:58 -0400
From: John C Klensin <john-ietf@jck.com>
To: Marc Blanchet <marc.blanchet@viagenie.ca>, John Levine <johnl@taugh.com>
cc: i18ndir@ietf.org
Message-ID: <A2B5494F35A428F832AE94AA@PSB>
In-Reply-To: <EEB31A7E-4A82-4BCF-B048-82C0BE66A3DB@viagenie.ca>
References: <20200608145452.EB3E51A4BADD@ary.qy> <EEB31A7E-4A82-4BCF-B048-82C0BE66A3DB@viagenie.ca>
Archived-At: <https://mailarchive.ietf.org/arch/msg/i18ndir/_fgq-8a8cjfxfokG38ZwHu1hLcI>

--On Monday, June 8, 2020 11:05 -0400 Marc Blanchet
<marc.blanchet@viagenie.ca> wrote:

> On 8 Jun 2020, at 10:54, John Levine wrote:
> 
>> In article <C6967F02-35D9-484B-9BF8-7436D1DF3B65@viagenie.ca>
>> you  write:
>>> this is what I was going to say. IMHO, any protocol
>>> identifier (such  as
>>> EAI) using something more than ASCII must define a PRECIS
>>> profile.  This
>>> is the only chance to get it working. Plain UTF8 for an
>>> identifier is just plain wrong.
>> 
>> That is true but I'm wondering how many profiles we need.  As
>> jck pointed out elsewhere, an address with a Tamil mailbox
>> and a Chinese domain name is unlikely to work.
> 
> one profile. profile will get rid of all irrelevant
> codepoints, specify the normalization form, etc… Profile
> won't go into specific scripts.

Combining a comment on your earlier note with this one:

One difficulty here is that, absent the ability to erase the
differences among languages and scripts and, at least, to make
sure everyone can conveniently read and enter every code point
and sequence of code points, we can only choose either to learn
as we go along and adjust as needed or to not learn and repeat
old mistakes.  IDNA2003 (and UTR#46) were
designed, more or less, around the assumptions that most Unicode
graphics should be allowed, that normative tables were a good
idea, and that case folding, NFKC, and ignoring code points that
were "ignorable" would protect us from any of the associated
issues without causing serious problems.  As described in RFC
4690 (which would have been longer if we knew then what we know
now, particularly about issues normalization does not address),
that didn't work out very well: names that people legitimately
wanted to use were excluded, there was a great deal of potential
for confusion of a type that would impact security, the
maintenance requirements were too high, etc.  So we went back
and did IDNA2008 and, in particular, tried to adapt the LDH
rule of traditional host names to Unicode (as well as addressing
some other issues).  That adaptation made it, nearly intact,
into PRECIS.
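
As a rough illustration of why "case fold plus NFKC" erases
distinctions that some users care about, here is a minimal Python
sketch using only the standard unicodedata module.  It is not
Nameprep and does not use the IDNA2003 tables; the function name
fold_and_normalize is hypothetical and the mapping is only an
approximation of that style of processing.

```python
import unicodedata


def fold_and_normalize(label: str) -> str:
    """Approximate a 'case fold, then NFKC' mapping.

    Assumption: this only mimics the general approach discussed
    above. It is NOT the Nameprep profile used by IDNA2003.
    """
    return unicodedata.normalize("NFKC", label.casefold())


# Distinct inputs collapse to the same string after mapping.
print(fold_and_normalize("Stra\u00dfe"))          # sharp s becomes "ss"
print(fold_and_normalize("\ufb01le"))             # "fi" ligature is expanded
print(fold_and_normalize("\uff21\uff22\uff23"))   # fullwidth ABC becomes ASCII
```

Whether collapsing such forms is a feature or a loss depends on
the user community, which is part of the problem described above.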

It might be worth remembering that, before any of the work that
began with Stringprep and IDNA got started, rather strong
arguments were made that domain names were protocol identifiers,
that they should stay in a limited subset of ASCII, and that
non-ASCII naming should be dealt with in other ways, probably at
another layer, and mapped as needed into the DNS.  Part of that
point of view was that non-ASCII labels in the DNS would just
lead to endless trouble about character equivalences, visual
confusion, difficulties caused by different principles (not just
different code points) among writing systems, a need for types
of aliasing the DNS model doesn't support, difficulties with the
DNS "feature" that octets with the high bit off were ASCII and
could exploit ASCII characteristics while other octets were just
octets, and so on.  I wouldn't go so far as to claim that those
taking that position have been proven right but it has now been
more than 17 years since the Stringprep and IDNA2003 specs were
published and at least 20 since the demands for IDNs got loud
and, lo, we are [still] dealing with those predicted issues.  To
some extent, we did IDNA at least as much because it was clear
that the alternative would have been many registries and
implementations making up and deploying their own, incompatible
and non-interoperable, ideas about how to have IDNs rather than
out of conviction that IDNs were a good idea.  And IDNA was a
brilliant (IMO) idea for avoiding tremendous transitional
disruptions at the application level.

Then we came to a demand for non-ASCII email mailbox names.  The
situation was a bit similar with some people arguing that they
would do nothing but cause trouble and fragment the mail system,
with personal name phrases being a more than adequate option,
especially in a world in which many popular MUAs didn't display
actual addresses unless coerced.  And, as with IDNs around 2001
and 2002, it was clear that, if the IETF didn't take action to
standardize something, we would end up with nasty
interoperability problems.  So what became the EAI WG was
created as a mix of people who really, really, wanted
internationalized addresses with people who had misgivings but
were concerned about the interoperability issues.

Drawing on experience with email local parts going back to RFCs
821 and 822 and the knowledge that there was a long history of
email being used to communicate with embedded devices (some
decades old), transmission of commands in subject lines,
per-message or per-recipient backward-pointing addresses, signed
local parts, and other strangenesses including addresses that
were deliberately extremely difficult to type, the WG
extrapolated from the 821 rules and allowed virtually any
Unicode code point in the local part, noting that, unlike the
DNS situation, local parts were firmly under the control of the
delivery system's management and that there were rules in the
basic protocol specification prohibiting anything else from
trying to interpret addresses or transform them into other
forms. The discussion is possibly worth having again, but I'm
not convinced either that the very permissive rule that had WG
consensus was unwise or that it should be changed, so the
situation is not like the PRECIS one.

However, RFC 5321 already notes that [ASCII] local parts are
case-sensitive but that any mail system that establishes
"mailbox" and "mAIlbOx" as pointing to separate mailboxes and
especially separate user accounts is looking for trouble.  If
and when we get around to revising 5321/5322 (and maybe sooner),
I'm going to argue that the quoting mechanisms for assorted
characters that cannot be used without quoting have never really
worked consistently well (often due to operating system
conventions and procedures not under the control of the mail
system), that people don't understand them, and that the same
"understand that, if you do this, you are looking for trouble"
warning might reasonably be applied to creation of such mailbox
names.   Suggesting (or even specifying) a PRECIS profile for
what is reasonably safe would seem reasonable; pushing to
require it, maybe not so much so.
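
A delivery system that wants to avoid the "mailbox" versus
"mAIlbOx" trap could check for near-duplicates when mailboxes are
created.  The Python sketch below is one hedged way to do that,
assuming NFC plus case folding is an acceptable comparison key for
the operator's own user base; comparison_key and find_collisions
are hypothetical names, and nothing here is a PRECIS profile.
Consistent with the rules mentioned earlier, this is something only
the final delivery system should do with its own local parts, never
a relay with someone else's.

```python
import unicodedata
from collections import defaultdict


def comparison_key(local_part: str) -> str:
    """Key used only to detect look-alike mailbox names.

    Assumption: NFC followed by case folding is a policy choice
    made by the operator, not a rule from any mail RFC.
    """
    return unicodedata.normalize("NFC", local_part).casefold()


def find_collisions(local_parts):
    """Group local parts that share a comparison key."""
    groups = defaultdict(list)
    for local_part in local_parts:
        groups[comparison_key(local_part)].append(local_part)
    # Keep only keys claimed by more than one spelling.
    return {key: names for key, names in groups.items() if len(names) > 1}


existing = ["mailbox", "mAIlbOx", "postmaster"]
print(find_collisions(existing))
```

An operator might refuse to create the second spelling, or deliver
both to the same mailbox; either choice stays within the delivery
system's own authority.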

But then we get back to John's comment and my problem example
and why PRECIS is useful, but not sufficient.  There are a
number of combinations -- of local parts and domains, and within
local parts -- that I don't expect we will ever see in the wild
because those who operate servers would not dream of allowing
them except as a concession to people whose desire to be cute or
to express themselves goes far enough to be a hazard to others.
It is reasonable to assume that, if such addresses show up on
the wire, they are associated with spam or other malicious or
destructive behavior.  And it is reasonable to advise server
operators that allowing those combinations is generally a bad
idea, both to provide support to them when someone tries to
object and to make the Internet a better and more reliable
place.  Generally, PRECIS (like IDNA) deals with code points
and not strings, and rules like "no mixed scripts" are far too
blunt as instruments for the cases of interest.  Instead,
proper treatment of those cases actually does need to deal with
individual sets of scripts and maybe even languages (my earlier
notes said "writing system" rather than "script" for a reason).
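
To see how blunt a "no mixed scripts" rule is, consider the Python
sketch below.  The standard library exposes no Unicode Script
property, so the first word of each character name is used as a
crude stand-in; that shortcut, and the names rough_scripts and
is_mixed, are assumptions made purely for illustration.

```python
import unicodedata


def rough_scripts(text: str) -> set[str]:
    """Very rough script guess from the first word of the
    character name (an assumption, not the Script property)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed(text: str) -> bool:
    """The blunt rule: more than one apparent script is 'mixed'."""
    return len(rough_scripts(text)) > 1


# Ordinary Japanese text combines Han, Hiragana and Katakana.
print(is_mixed("\u65e5\u672c\u306e\u30ab\u30e1\u30e9"))
# Latin with one Cyrillic letter substituted for "a".
print(is_mixed("p\u0430ypal"))
```

Both strings are flagged as mixed, although the first is a normal
use of one writing system and the second is a plausible spoof.
Telling them apart needs knowledge of which scripts belong
together, which is the point made above.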

That means that, for mailbox names, the idea of requiring a
PRECIS profile is both overly prescriptive and inadequate.  As a
source of advice: probably quite appropriate.

Just my opinion but one based on a few years of experience with
email and strong memories of the EAI WG discussions.

best,
    john