Re: [ietf-smtp] [dispatch] BCP proposal: regular expressions for Internet Mail identifiers

Sean Leonard <dev+ietf@seantek.com> Tue, 29 March 2016 18:40 UTC

To: John C Klensin <john-ietf@jck.com>, dispatch@ietf.org, ietf-smtp <ietf-smtp@ietf.org>
References: <87a8lp10i2.fsf@hobgoblin.ariadne.com> <56F30A52.50305@seantek.com> <CAL0qLwagOOByZXsLcRN9CC0aARSGSh9kCGoO7hSMUhSdkHtssw@mail.gmail.com> <0AC7C26B5A969CA50015ACFB@JcK-HP8200.jck.com>
From: Sean Leonard <dev+ietf@seantek.com>
Message-ID: <56FAC574.1010503@seantek.com>
Date: Tue, 29 Mar 2016 11:12:04 -0700
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1
MIME-Version: 1.0
In-Reply-To: <0AC7C26B5A969CA50015ACFB@JcK-HP8200.jck.com>
Content-Type: text/plain; charset="windows-1252"; format="flowed"
Content-Transfer-Encoding: quoted-printable
Archived-At: <http://mailarchive.ietf.org/arch/msg/ietf-smtp/gJyrboqadQba9HExU7IofqG-mU4>
Cc: "Dale R. Worley" <worley@ariadne.com>, "Murray S. Kucherawy" <superuser@gmail.com>
Subject: Re: [ietf-smtp] [dispatch] BCP proposal: regular expressions for Internet Mail identifiers

On 3/28/2016 6:18 AM, John C Klensin wrote:
>
> --On Sunday, March 27, 2016 22:41 -0700 "Murray S. Kucherawy"
> <superuser@gmail.com> wrote:
>
>> ...
>> And if what you're really producing is regular expressions
>> that match anything that the ABNFs in the mail RFCs will
>> legitimately produce, you might want to do a standards track
>> document that explicitly updates those documents where those
>> ABNFs are listed.
> Murray,
>
> That captures my concern about this effort.  Based on prior
> experience (including RFC 3696 and even the effort to make
> RFCs 2821 and 5321 internally consistent), it is _really_ easy
> to express a requirement in two different ways and have them be
> _almost_ the same.   That is a problem because different people
> will read different docs.
>
> It seems to me that it would be much better to either do this as
> an Informational document that is clearly identified as Sean's
> opinion about regular expressions that impose the same
> requirements as 5321/5322 but that those continue to control or
> to do a standards-track document that contains both the regular
> expressions and ABNF, makes clear which one is primary, and
> updates the syntax requirements of the base specs.

As Dale expressed (thanks!), "BCPs are *standards* not for protocols but 
for *things that people do*. So in regard to 
[draft-seantek-mail-regexen], the "thing that people do" is "write code 
that validates e-mail addresses for further processing". And the point 
[...] is that people need to write correct code for validating e-mail 
addresses."

Sean's opinion about regular expressions for Mail Identifiers (email 
addresses, Message-IDs) is not interesting. If my opinion were all that 
interesting, I would just publish it on Stack Overflow and call it a day 
(see SO Questions [46155] and [201323]). What is interesting is the 
IETF's vetted and (rough)-consensus view on the topic.

This topic is a favorite pet project of programmers. It tends to go:
1) "oh, I know what an email address is! It has dots and alphas and 
maybe a hyphen" (WRONG),
2) "oh, I'll just read RFC 5322 and roll my own" (also wrong, but in 
more subtle ways...for one, RFC 5322 has distinct syntax from RFC 5321), or
3) "I'm lazy, let's just copy whatever regex shows up on Google first" 
(pragmatic, usually not right).
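
To make failure mode 1) concrete, here is a minimal Python sketch. NAIVE is a hypothetical pattern of the "dots and alphas and maybe a hyphen" kind; it is not taken from draft-seantek-mail-regexen, and the sample addresses are illustrative only.

```python
import re

# Hypothetical "dots, alphas and maybe a hyphen" pattern (illustration only).
NAIVE = re.compile(r"[A-Za-z.-]+@[A-Za-z.-]+")

# Each of these is a syntactically valid RFC 5321 Mailbox.
valid_samples = [
    "dev+ietf@seantek.com",     # '+' is allowed in an Atom
    "user123@example.com",      # digits in the local-part
    '"john doe"@example.com',   # Quoted-string local-part
    "o'brien@example.com",      # apostrophe is allowed in an Atom
]

# Valid addresses that the naive pattern fails to recognize.
rejected = [s for s in valid_samples if NAIVE.fullmatch(s) is None]

# The same pattern also accepts strings that are not valid addresses:
# consecutive dots in the local-part and a domain made of "-." only.
accepts_garbage = NAIVE.fullmatch("..@-.") is not None
```

The pattern is wrong in both directions: it rejects every sample above and still accepts a string that no mail system should treat as an address.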

Wouldn't it be better if programmers could uniformly go:
4) "Given my email address recognition problem, I'll just copy the regex 
from BCP xyz", rather than spending dozens if not hundreds of hours 
poring over email standards documents and testing them against millions 
of arcane email address combinations?

The current draft-seantek-mail-regexen is pretty clear that it does not 
attempt to change the Mail standards. If folks want to change those 
documents, may I suggest a separate Standards Track document that does 
exactly that.

Just because a document is labeled "BCP" (or, for that matter, 
"Standards Track") does not mean that every last single statement in the 
document is normative and error-free. Otherwise, the RFC 3280 and RFC 
5280 PKIX standards that say that you are supposed to compare an entire 
email address case-insensitively (Section 4.1.2.6 of RFC 3280, Section 
4.2.1.6 of RFC 5280) would have overridden RFCs 5322, 5321, 2822, 2821, 
etc. We have an errata process.
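
The difference between the two comparison rules can be sketched in Python. Both function names are hypothetical. The second function reflects the RFC 5321 position that the local-part is case-sensitive while the domain is not; it is a simplification that assumes an ASCII domain and ignores Quoted-string equivalences.

```python
def pkix_style_equal(a: str, b: str) -> bool:
    # Compare the entire address case-insensitively.
    return a.lower() == b.lower()


def smtp_style_equal(a: str, b: str) -> bool:
    # Split at the last '@': the local-part is compared exactly,
    # the domain is compared case-insensitively.
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

Under these assumptions, "Sean@Example.COM" and "sean@example.com" are equal for the first function but not for the second, which is exactly the kind of conflict described above.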

Basically if the regular expressions are wrong, they need to be made 
right. One can complain about problems, or one can fix them.

It turns out that regular expressions and ABNF are homomorphic under 
certain conditions. As shown in draft-seantek-mail-regexen, "deliverable 
email addresses" (RFC 5321 + RFC 6531) certainly meet those conditions, 
as they can be expressed in a regular language (i.e., recognized by a 
finite state automaton). Therefore, translating between the two is 
basically computationally verifiable. The results may not look pretty 
but they will work. Perhaps a bigger problem is one's view 
as to how normative ABNF is in the context of IETF standards documents. 
It is possible to have ABNF that says somename = *(ALPHA / DIGIT) but 
then have normative text that says that <somename> is limited to 31 
characters and MUST start with an alphabetic character. Moreover, some 
ABNF (RFC 5321 / RFC 5322 in particular) has "obsolete syntax"; whether 
to admit such syntax is a highly context-sensitive engineering decision. 
Addressing all of these points requires rubbing more than two brain 
cells together.
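
As a rough illustration of both points, here is a Python sketch under stated assumptions. The first part hand-translates the non-recursive RFC 5321 rules Dot-string = Atom *("." Atom) and Atom = 1*atext into a regex. The second part contrasts a literal translation of the somename rule with one that also folds in the prose constraints. All variable names are hypothetical, and none of these patterns is copied from draft-seantek-mail-regexen.

```python
import re

# atext as listed in RFC 5322: ALPHA / DIGIT / a fixed set of punctuation.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
ATOM = ATEXT + "+"                              # Atom = 1*atext
DOT_STRING = ATOM + r"(?:\." + ATOM + ")*"      # Dot-string = Atom *("." Atom)
DOT_STRING_RE = re.compile(DOT_STRING)

# Literal translation of  somename = *(ALPHA / DIGIT)
ABNF_ONLY = re.compile(r"[A-Za-z0-9]*")

# ABNF plus the normative prose: starts with ALPHA (so it cannot be
# empty) and is at most 31 characters long.
ABNF_PLUS_PROSE = re.compile(r"[A-Za-z][A-Za-z0-9]{0,30}")


def accepts(pattern: re.Pattern, s: str) -> bool:
    # True only when the whole string matches the pattern.
    return pattern.fullmatch(s) is not None
```

A string such as "9lives" passes ABNF_ONLY but fails ABNF_PLUS_PROSE, so a regex derived from the grammar alone would be faithful to the ABNF and still wrong with respect to the specification.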

[46155]: 
http://stackoverflow.com/questions/46155/validate-email-address-in-javascript
[201323]: 
http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address

>
> Perhaps a BCP that recommends use of strings that are clearly a
> proper subset of what the standard allows would be ok, but it
> needs to be frightfully clear that it is a recommended subset,
> not a requirement.

I am not really interested in subsets, except those subsets driven by 
the standards themselves. (ASCII-only vs. EAI is a reasonable subset, 
provided that both expressions are provided. I would rather do EAI-only 
but we can be pragmatic about that.)
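
A minimal Python sketch of what "both expressions" could look like for an unquoted local-part, assuming RFC 6531's extension of atext with non-ASCII characters. The names are hypothetical, the domain side is omitted, and the patterns are illustrative rather than the ones in the draft.

```python
import re

# ASCII-only atext (RFC 5322) and the EAI variant that also admits any
# non-ASCII code point, approximating UTF8-non-ascii on decoded text.
ASCII_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-]"
EAI_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-\u0080-\U0010FFFF]"


def dot_string(atext: str) -> re.Pattern:
    # Dot-string = Atom *("." Atom), with Atom = 1*atext
    atom = atext + "+"
    return re.compile(atom + r"(?:\." + atom + ")*")


ASCII_LOCAL = dot_string(ASCII_ATEXT)
EAI_LOCAL = dot_string(EAI_ATEXT)
```

With these definitions, an ASCII local-part such as "dev+ietf" matches both patterns, while a local-part such as "用户" matches only EAI_LOCAL, so the ASCII expression is a strict subset driven by the standards themselves.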

Best regards,

Sean