Re: [EAI] UTF-8 in Message-IDs

ned+ima@mrochek.com Mon, 15 August 2011 20:35 UTC

Return-Path: <ned+ima@mrochek.com>
X-Original-To: ima@ietfa.amsl.com
Delivered-To: ima@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 001AA11E8105 for <ima@ietfa.amsl.com>; Mon, 15 Aug 2011 13:35:27 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -3.34
X-Spam-Level:
X-Spam-Status: No, score=-3.34 tagged_above=-999 required=5 tests=[AWL=1.063, BAYES_00=-2.599, DATE_IN_PAST_03_06=0.044, GB_I_LETTER=-2, SARE_SUB_ENC_UTF8=0.152]
Received: from mail.ietf.org ([12.22.58.30]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HI1Ya+K5ex+x for <ima@ietfa.amsl.com>; Mon, 15 Aug 2011 13:35:26 -0700 (PDT)
Received: from mauve.mrochek.com (mauve.mrochek.com [66.59.230.40]) by ietfa.amsl.com (Postfix) with ESMTP id B9FB911E80E5 for <ima@ietf.org>; Mon, 15 Aug 2011 13:35:26 -0700 (PDT)
Received: from dkim-sign.mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01O4VQ5DWCHC00X2RW@mauve.mrochek.com> for ima@ietf.org; Mon, 15 Aug 2011 13:35:08 -0700 (PDT)
MIME-version: 1.0
Content-type: TEXT/PLAIN; charset="utf-8"
Received: from mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01O4CJSMR6GG00VHKR@mauve.mrochek.com> (original mail from NED@mauve.mrochek.com) for ima@ietf.org; Mon, 15 Aug 2011 13:35:02 -0700 (PDT)
From: ned+ima@mrochek.com
Message-id: <01O4VQ5BI2B200VHKR@mauve.mrochek.com>
Date: Mon, 15 Aug 2011 10:32:30 -0700
In-reply-to: "Your message dated Mon, 15 Aug 2011 12:47:01 -0400" <C31E821E731AC23ED7EE191F@PST.JCK.COM>
References: <CAHhFybo47--0YjCRcvSO4asoV_R89+ULDB3tyij+ba=O_6gKsQ@mail.gmail.com> <01O4T11O8X4M00VHKR@mauve.mrochek.com> <op.vz8z3v0a6hl8nm@clerew.man.ac.uk> <01O4VFNKDGEE00VHKR@mauve.mrochek.com> <C31E821E731AC23ED7EE191F@PST.JCK.COM>
To: John C Klensin <klensin@jck.com>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=mrochek.com; s=mauve; t=1313440403; bh=/j3ko5f7LzDqMTo1dV5qeS7MVnz13mOYxFCzgHpv9wY=; h=MIME-version:Content-type:From:Cc:Message-id:Date:Subject: In-reply-to:References:To; b=RO9vvCy3wCbW8NdwkdGdRNEV0JrysWE4/lFIDaT9DHHDRnlR9bTTrRB5H7i16XRWw kXYFCvkIWxwNxXULkdidGR+Y/47HxJ1MV+RQWin1bcaHJs3yY5dZchfKOj1nD1i9+f UCDdd3NFaspxJPGWxrYVt9vKBrm1FT6DjuX9Wgj0=
Cc: Charles Lindsey <chl@clerew.man.ac.uk>, IMA <ima@ietf.org>
Subject: Re: [EAI] UTF-8 in Message-IDs
X-BeenThere: ima@ietf.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: "EAI \(Email Address Internationalization\)" <ima.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ima>, <mailto:ima-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/ima>
List-Post: <mailto:ima@ietf.org>
List-Help: <mailto:ima-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ima>, <mailto:ima-request@ietf.org?subject=subscribe>
X-List-Received-Date: Mon, 15 Aug 2011 20:35:28 -0000

> Ned, Charles, and others,

> (strictly personal opinion)

> Let me give a different perspective on this, in the hope of
> drawing us together on the conclusions if not on the reasoning.

> I'm not (personally) wildly comfortable with unrestricted
> Message-IDs.  I don't find arguments based on "we chose this
> general model and therefore Message-IDs can't/shouldn't be
> restricted" persuasive,

That's good, because AFAIK nobody is making that argument. The argument we're
making is threefold:

(1) Because of structural issues in the RFC 5322 ABNF, it's much easier
    to make some changes to low-level rules than to try to add utf-8 at
    a higher level. But one consequence of the low-level approach is that
    message-ids are also affected. This is fixable with a handful of
    additional rules that are still much cleaner and simpler than the
    higher level approach, but it's still something of a wart.

(2) An overwhelming amount of deployed software generates message-ids by
    blindly appending @domain to some random string. A smaller but significant
    number also place some portion of the local part of the user's address
    on the left hand side. When domains and local parts start having
    utf-8 in them, the chances of all these code paths noticing and doing the
    right thing (which I suppose would be to encode it, but the encoding
    needs to be consistent or collisions are possible) are very, very low.
    So the issue of utf-8 in message-ids has to be faced and dealt with
    irrespective of what the standards say. 

(3) Most of the software that will be affected by utf-8 in message-ids is also
    going to be affected by utf-8 in addresses. (Actually, in most cases
    addresses present a much more difficult problem.) As long as you're prying
    the lid off, might as well deal with the issue a little more generally.

Given all this, the argument is we might as well bite the bullet and allow
it. Frankly, given how boundary markers are generated by some clients, I
think we're going to have enough fun dealing with that leakage case.
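
To make the leakage in point (2) concrete, here is a minimal sketch of the
kind of generator described there. It is not taken from any particular
product: the function names, the uuid-based token, and the sample addresses
are all assumptions made for illustration.

```python
import uuid


def naive_message_id(local_part: str, domain: str) -> str:
    """Hypothetical generator: a random token plus part of the user's
    local part, with @domain blindly appended. Nothing here inspects or
    encodes the inputs."""
    token = uuid.uuid4().hex
    return f"<{token}.{local_part}@{domain}>"


def is_ascii_only(value: str) -> bool:
    """True if every character in the value is ASCII."""
    return value.isascii()


# An all-ASCII address yields an all-ASCII message-id.
ascii_id = naive_message_id("ned", "example.com")

# Illustrative EAI address: the same code path passes the utf-8
# local part and domain straight through into the message-id.
eai_id = naive_message_id("用户", "例え.jp")

print(is_ascii_only(ascii_id))
print(is_ascii_only(eai_id))
```

The generator never looks at what it is concatenating, so once addresses
carry utf-8 the message-id does too, whatever the specification permits.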

> precisely because we've said "protocol
> identifier" before and, IMO, our ability to say that is critical
> to getting any i18n design process right.  In particular, I note
> that we have not permitted non-ASCII characters in new/optional
> header field names by modifying RFC5322's <optional-field> to
> permit <ftext> to be a much less restricted range of characters.

That's because none of the arguments regarding message-ids apply to <ftext>.
<ftext> isn't affected by the ABNF changes, there's essentially no leakage from
domains or addresses into field names, and common code paths between field name
processing and address or message-id processing would seem to be ... unlikely.

> I think that it would be a mistake to make that extension
> (indeed, I personally think that we'd be much better off if
> <ftext> were restricted to ASCII letters, digits, and hyphen
> rather than excluding only non-printable characters and colon).
> But, if one argues that Message-ID has to be permitted to be
> non-ASCII because we've adopted a "UTF-8 nearly everywhere"
> model, the case for excluding field names has not been made (or
> even really discussed)... and a field name of, e.g.,
>    דרעקק: ...
> has a certain appeal (as well as some nightmare aspects).

See above.

> However, other arguments are much more persuasive to me, with
> the two critical ones being:

> 	(1) An MUA, or MTA pretending to be an MUA for message
> 	analysis purposes, that can handle non-ASCII UTF-8 in
> 	addresses or other header fields will need to go to very
> 	small marginal extra effort to handle a Message-ID with
> 	non-ASCII content.  Indeed, most sensible
> 	implementations would see a restriction that Message-IDs
> 	be ASCII as a requirement for added code to support an
> 	unnecessary requirement.    We have not had good luck
> 	with restrictions like that.

Another good point.

> 	(2) Given what we know about how Message-IDs are
> 	constructed in practice, we can expect non-ASCII forms
> 	to appear no matter what the spec says.  There is no
> 	point in writing a spec that we know will be ignored.
> 	For that reason, I think those arguing to restrict
> 	Message-IDs to ASCII need to come up with a clear,
> 	concise, and powerful explanation of why non-ASCII
> 	Message-IDs would be harmful -- an explanation that can
> 	be included in our documentation to persuade
> 	implementers that the requirement is really worth
> 	enforcing.

This is the second point above, more or less.

> To me, the "unnecessary requirement" and "will be ignored in
> practice" positions strongly suggest that the burden of
> demonstrating likely harm -- and producing that clear, concise,
> and powerful explanation-- falls on those who would like to keep
> Message-IDs ASCII-only.   If one believes in the "we adopted a
> different model" story, it just strengthens that position.

> In all of these discussions, I haven't seen that explanation
> emerge.

> I can think of one set of scenarios that might provide a
> foundation for it, even though I'm not convinced that they are
> strong enough to restrict Message-IDs.  The scenario goes
> somewhat like this (and parallels some of the discussion about
> message changes in Submission Servers).

> 	(1) Suppose I'm a user who receives a message.

> 	(2) Suppose I have an EAI-capable delivery MTA, message
> 	store, MUA, and whatever connects the MUA to the message
> 	store (i.e., none of the edge cases that are a headache
> 	for POP and IMAP apply).  Suppose the message I receive
> 	contains a non-ASCII Message-ID and very little other
> 	non-ASCII header field content.   As one example among
> 	many, suppose that only the contents of the Subject
> 	header field are non-ASCII (in addition to the
> 	Message-ID).

> 	(3) For simplification, suppose that my outgoing mail
> 	server's primary domain name does not contain any IDN
> 	components and Message-IDs I generate are all-ASCII.

> 	(4) Now I decide to reply to the originator (note that
> 	neither my address nor the address that appears in the
> 	"From:" header are non-ASCII).  I also add a recipient
> 	to the reply whose system is not EAI-capable (I know
> 	that from out of band information -- how is not
> 	important).   So, in addition to adding the additional
> 	address, I edit the Subject line so it is ASCII-only.
> 	(Remember this is a reply and I'm the user, so both of
> 	those changes are perfectly legitimate according to
> 	everything we've specified).

> Now there is a problem.

No, not really. Again, this work is premised on semantic-preserving downgrades
not being necessary. You are considering an odd corner case where, were it not
for the utf-8 in the message-id, such a downgrade would be possible.

So what? If you really care about being able to send *any* subset of EAI
messages to non-EAI recipients without semantic loss, then this entire effort
is misdesigned: We should be using some 7-bit encoding scheme and not utf-8.

> An automatically-generated In-reply-to
> header field, generated as recommended in 5322 (unchanged in
> 5335bis) is going to contain that non-ASCII Message-ID.  That
> requires EAI handling (i.e., the UTF8SMTPbis extension) for the
> outgoing message even though the In-reply-to field is the only
> header field containing non-ASCII characters.  Given the
> scenario above, that makes the message non-deliverable to my
> intended new recipient.

> That is bad news... and bad news that could have been completely
> prevented by requiring that Message-IDs be restricted to ASCII.

Again, so what? We could send an even larger portion of those EAI messages to
the non-EAI recipient if we were to impose similar restrictions on addresses,
which we could easily do by using some other encoding of Unicode. And so on.
This logic puts you on a slippery slope that unavoidably leads to most of the
design decisions this group has made being wrong.
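
For reference, the corner case in John's scenario can be sketched as
follows. This is only an illustration under stated assumptions: the helper
names, the sample message-id, and the addresses are hypothetical, and the
ASCII check is a crude stand-in for deciding whether a message needs the
UTF8SMTPbis extension.

```python
def build_reply_headers(original_message_id, subject, recipients):
    """Hypothetical reply builder: In-Reply-To and References are copied
    verbatim from the original Message-ID, as a typical MUA would do."""
    return {
        "To": ", ".join(recipients),
        "Subject": subject,
        "In-Reply-To": original_message_id,
        "References": original_message_id,
    }


def non_ascii_fields(headers):
    """Names of the header fields whose values contain non-ASCII text."""
    return [name for name, value in headers.items() if not value.isascii()]


# The user has edited the Subject to ASCII and all addresses are ASCII,
# but the original message carried a non-ASCII Message-ID (illustrative).
reply = build_reply_headers(
    "<abc123@例え.jp>",
    "Re: meeting notes",
    ["originator@example.org", "legacy-user@example.net"],
)

print(non_ascii_fields(reply))
```

Only the fields derived from the original Message-ID come back as
non-ASCII, which is exactly the narrow case under discussion.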

> The question, IMO, is whether cases of that type are likely to
> be prevalent enough to justify a restriction.  I contend that
> they are more than sufficient to justify some explanatory text
> and a recommendation that there are cases in which having a
> sending system be conservative enough to generate only all-ASCII
> Message-IDs will provide some marginal robustness against
> delivery rejections.  But I don't think it is nearly persuasive
> enough to justify an outright ban on non-ASCII Message-IDs.

Well, while I don't find your example here persuasive, I also don't have a
problem with saying utf-8 SHOULD NOT be used in message-ids. My guess is this
will have no more effect on utf-8 showing up there than an outright ban will,
but it's a defensible recommendation.

				Ned