[dispatch] BCP proposal: regular expressions for Internet Mail identifiers

Sean Leonard <dev+ietf@seantek.com> Tue, 22 March 2016 22:53 UTC

References: <20160321235553.10930.4801.idtracker@ietfa.amsl.com>
To: ietf-smtp@ietf.org, dispatch@ietf.org
From: Sean Leonard <dev+ietf@seantek.com>
X-Forwarded-Message-Id: <20160321235553.10930.4801.idtracker@ietfa.amsl.com>
Message-ID: <56F1CD23.2040002@seantek.com>
Date: Tue, 22 Mar 2016 15:54:27 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
MIME-Version: 1.0
In-Reply-To: <20160321235553.10930.4801.idtracker@ietfa.amsl.com>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <http://mailarchive.ietf.org/arch/msg/dispatch/qQBFyXcFmO-DkuMqLmgFoq4uYEI>
Subject: [dispatch] BCP proposal: regular expressions for Internet Mail identifiers

Greetings IETF-SMTP Gods and Denizens (and dispatch):

Over the winter I worked on a new Internet-Draft that I would like to 
propose that the IETF adopt: Regular Expressions for Internet Mail. The 
draft focuses on two identifiers: email addresses and Message-IDs.

The purpose of this standard (proposed as a Best Current Practice) is to 
have *IETF-vetted* expressions that implementers and non-mail standards 
authors can plug-and-chug without futzing with trying to interpret 40 
years of (occasionally conflicting and arcane) RFCs and implementation 
lore. There are many non-mail systems out there (read: nearly every web 
app, reservation system, customer database, etc. on Earth) that use or 
consume email addresses as identifiers, and their inability to accept 
the most obvious valid characters (like "+" or even "-"; I have used 
apps that do not even accept "-") is a great source of interoperability 
problems. (This document is also relevant to some other threads about 
the nature of email address identifiers in security artifacts such as 
certificates, PGP keys, and DNS records: anyone who is vouching for an 
email address ought to be sure that they are recording something that 
actually is a valid email address in the first place.) We should get 
this right now, before Unicode/EAI makes interoperability issues 50000x 
more expensive to correct.
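
To illustrate the kind of problem described above, here is a minimal 
Python sketch. Neither pattern is taken from the draft: NAIVE is a 
hypothetical over-strict expression of the sort found in many web forms, 
and DOT_ATOM is a simplified approximation that allows the RFC 5322 
"atext" characters in a dot-atom local part and letter-digit-hyphen 
labels in the domain. Quoted local parts, address literals, and 
Unicode/EAI are deliberately left out.

```python
import re

# Hypothetical over-strict pattern (an assumption for illustration only).
NAIVE = re.compile(r"[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[A-Za-z]{2,}")

# Simplified sketch, not the draft's expression:
# "atext" characters as listed in RFC 5322, used in a dot-atom local part.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# A letter-digit-hyphen domain label of 1 to 63 characters.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*@{LABEL}(?:\.{LABEL})+")

samples = [
    "dev+ietf@seantek.com",   # "+" in the local part
    "jean-luc@example.com",   # "-" in the local part
    "a..b@example.com",       # consecutive dots: not a valid dot-atom
]

for address in samples:
    naive_ok = NAIVE.fullmatch(address) is not None
    dot_atom_ok = DOT_ATOM.fullmatch(address) is not None
    print(f"{address!r}: naive={naive_ok} dot_atom={dot_atom_ok}")
```

The naive pattern fails in both directions here: it rejects the "+" and 
"-" addresses while accepting the one with consecutive dots.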

The document is not meant to modify the mail standards, but merely to 
reflect and track them as they are updated over time.

As a first draft, the document is in rough shape and has extensive notes 
about issues that came up during R&D but have yet to be addressed. 
Significant areas that need adequate treatment include:
1. the impact of Unicode (EAI) on identifiers.
2. handling domain names, which comprise 50% of an email address, but 
perhaps 85% of the complexity when Unicode gets involved.
3. "deliverable email address" (complying with the modern SMTP 
infrastructure) vs. other kinds of email addresses (Internet Message 
Format, historic forms).
4. regular expression engines and grammars (i.e., which grammars to use, 
which are widely used and produce uniform results).
5. efficiency of the regular expressions.
6. different expressions for validation (testing), part extraction 
(capturing groups), decoding, encoding, and searching through text.
7. test vectors.
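
As a rough illustration of why validation, part extraction, and 
searching may each want their own expression, here is a Python sketch 
built on one shared core. The patterns and the helper names (is_valid, 
split_address, find_addresses) are assumptions made for this example, 
not expressions from the draft, and they cover only ASCII dot-atom 
addresses with letter-digit-hyphen domains.

```python
import re

# Simplified building blocks (assumptions, not the draft's grammar).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
LOCAL = rf"{ATEXT}+(?:\.{ATEXT}+)*"
DOMAIN = rf"{LABEL}(?:\.{LABEL})+"

# Validation: no capturing groups, applied to the whole string.
VALIDATE = re.compile(rf"{LOCAL}@{DOMAIN}")
# Part extraction: same core, with named capturing groups.
EXTRACT = re.compile(rf"(?P<local>{LOCAL})@(?P<domain>{DOMAIN})")
# Searching: unanchored, with a lookbehind so that a match cannot start
# in the middle of a longer run of address-like characters.
SEARCH = re.compile(rf"(?<![\w.+-]){LOCAL}@{DOMAIN}")


def is_valid(candidate):
    """Return True if the whole string matches the simplified pattern."""
    return VALIDATE.fullmatch(candidate) is not None


def split_address(candidate):
    """Return (local part, domain), or None if the string does not match."""
    match = EXTRACT.fullmatch(candidate)
    return (match["local"], match["domain"]) if match else None


def find_addresses(text):
    """Return every substring of text that matches the search pattern."""
    return SEARCH.findall(text)


print(is_valid("dev+ietf@seantek.com"))
print(split_address("dev+ietf@seantek.com"))
print(find_addresses("Contact dev+ietf@seantek.com or dispatch@ietf.org."))
```

Test vectors for such expressions could then be plain lists of strings 
paired with the expected result of each helper.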

Hopefully the adoption of this work as an IETF item, coupled with input 
from those with extensive experience

(Thanks to John Levine, Pete Resnick, and others for taking initial 
questions and discussion on the topic.)
Discussion welcome. Thanks.

Sean


-------- Forwarded Message --------
Subject: 	New Version Notification for draft-seantek-mail-regexen-00.txt
Date: 	Mon, 21 Mar 2016 16:55:53 -0700
From: 	internet-drafts@ietf.org



A new version of I-D, draft-seantek-mail-regexen-00.txt
has been successfully submitted by Sean Leonard and posted to the
IETF repository.

Name:		draft-seantek-mail-regexen
Revision:	00
Title:		Regular Expressions for Internet Mail
Document date:	2016-03-21
Group:		Individual Submission
Pages:		24
URL:            https://www.ietf.org/internet-drafts/draft-seantek-mail-regexen-00.txt
Status:         https://datatracker.ietf.org/doc/draft-seantek-mail-regexen/
Htmlized:       https://tools.ietf.org/html/draft-seantek-mail-regexen-00


Abstract:
    Internet Mail identifiers are used ubiquitously throughout computing
    systems as building blocks of online identity. Unfortunately,
    incomplete understandings of the syntaxes of these identifiers have
    led to interoperability problems and poor user experiences. Many
    users use specific characters in their addresses that are not
    properly accepted on various systems. This document prescribes
    normative regular expression (regex) patterns for all Internet-
    connected systems to use when validating or parsing Internet Mail
    identifiers, with special attention to regular expressions that work
    with popular languages and platforms.