Re: [ietf-822] utf8 messages

Daniel Vargha <dvargha@mimecast.com> Wed, 13 August 2014 13:18 UTC

From: Daniel Vargha <dvargha@mimecast.com>
To: Brandon Long <blong@google.com>, Ned Freed <ned.freed@mrochek.com>
Thread-Topic: [ietf-822] utf8 messages
Thread-Index: AQHPtaaAWFoietPkYkCGwOkQxU65z5vMFRA1gAB7aoCAAIf8hoABBMgAgABpbQA=
Date: Wed, 13 Aug 2014 13:17:53 +0000
Message-ID: <D0111ECB.195FD%dvargha@mimecast.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com> <01PBABOOL4QO0000SM@mauve.mrochek.com> <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com>
In-Reply-To: <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com>
Accept-Language: en-GB, en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
user-agent: Microsoft-MacOutlook/14.4.3.140616
x-originating-ip: [94.194.106.16]
MIME-Version: 1.0
X-MC-Unique: BsUzVpinRr6OhuZreP5rIQ-2
Content-Type: multipart/alternative; boundary="_000_D0111ECB195FDdvarghamimecastcom_"
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/o1cmvIgF75a3xgvcsnhtPO6D_GI
Cc: "ietf-822@ietf.org" <ietf-822@ietf.org>
Subject: Re: [ietf-822] utf8 messages
X-BeenThere: ietf-822@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Discussion of issues related to Internet Message Format \[RFC 822, RFC 2822, RFC 5322\]" <ietf-822.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/ietf-822/>
List-Post: <mailto:ietf-822@ietf.org>
List-Help: <mailto:ietf-822-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 13 Aug 2014 13:18:08 -0000

I fully agree with Brandon: the standard SHOULD consider the use case where a
message is transferred from one system to another as a blob (e.g. a flat file) and
the only available "metadata" is that the message is in MIME format. Having
some sort of well-defined UTF8 indicator in the header section of the message
would make it much simpler to adopt the new standard, as it would require
substantially less development effort in most cases.

Regarding Ned's concern about inconsistent states, I think a workable
solution would be to honour the UTF8 indicator in the headers only when the
UTF8 flag is not available from metadata. In a known UTF8 context, where the
SMTP protocol or the message store already "knows" that the message is UTF8,
the indicator in the headers can be ignored.
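
A minimal sketch of that precedence rule, in Python. The header name and the function are hypothetical, made up purely for illustration; RFC 6532 defines no such indicator.

```python
from email import message_from_bytes
from typing import Optional

# Hypothetical header name, for illustration only; no standard defines it.
INDICATOR = "X-Hypothetical-UTF8-Message"


def is_utf8_message(raw: bytes, metadata_flag: Optional[bool] = None) -> bool:
    """Decide whether to treat raw as a UTF-8 header (RFC 6532 style) message."""
    # Out-of-band knowledge (SMTP envelope, message store flag) always wins.
    if metadata_flag is not None:
        return metadata_flag
    # No metadata available (e.g. a flat file): fall back on the header.
    value = message_from_bytes(raw).get(INDICATOR)
    return value is not None and str(value).strip().lower() == "yes"
```

Because the metadata is consulted first, the header can never contradict a system that already tracks the flag; it only matters when nothing else is known.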

I think it is generally desirable to reduce (or at least not increase) the number
of heuristics required to successfully parse a MIME message. We should try to
learn from previous mistakes instead of repeating them.

Daniel

From: Brandon Long <blong@google.com>
Date: Wednesday, 13 August 2014 09:00
To: Ned Freed <ned.freed@mrochek.com>
Cc: "ietf-822@ietf.org" <ietf-822@ietf.org>
Subject: Re: [ietf-822] utf8 messages




On Tue, Aug 12, 2014 at 7:54 AM, Ned Freed <ned.freed@mrochek.com> wrote:
> > It is, or is supposed to be, a sealed system implemented as a set of
> > interlocking extensions to existing email facilities.

> So, if I have an "email" message, I can no longer just parse it.  Instead,
> there are actually two
> types of email messages, and the only way to know how to parse it is to
> know a priori which type it is.
> Because all systems are "sealed" and there's never any leakage.

Wrong on all counts, I'm afraid. First and foremost, in practice you have never
been able to "just parse" email messages, for the simple reason that too many
message creation agents don't follow the rules and create wildly noncompliant
messages. So there are always some heuristics involved, unless of course you're
willing to only accept syntactically valid messages, in which case yes, you
can "just parse" messages, including EAI messages.

Where the lines are drawn has always been a tradeoff, and one which has
changed over the years. It used to be the case that a lot more crap was
generally tolerated. The spam problem has led to an overall tightening up
of what's tolerated.

Second, because of overall lack of compliance, there have always been
many types of messages.

Third, of course there's leakage.

Ok, let me try to rephrase my point.  Before, there were two types of messages: well-specified / syntactically correct messages, and the rest.  The not-well-specified messages amount to a non-trivial number, but best-effort/heuristic results are fine.

If we imagine handling 1B messages/day, 2% being the "not well specified", we have 20M messages handled by heuristic.  If we have only 1% heuristic failures, that's 200k bad messages a day.  Eh, maybe ok.  Especially since in practice, these tend to be spam messages or the 8bit chars in the header are limited to some "content preview" header or other boneheadedness, or best case, it's only a character or two that's broken.
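
The arithmetic above, spelled out as a quick Python check. The volume and both percentages are the hypothetical figures from the paragraph, not measurements.

```python
# Hypothetical figures taken from the paragraph above.
messages_per_day = 1_000_000_000   # 1B messages/day
not_well_specified_pct = 2         # share handled by heuristics
heuristic_failure_pct = 1          # share of those the heuristics get wrong

handled_by_heuristic = messages_per_day * not_well_specified_pct // 100
bad_messages_per_day = handled_by_heuristic * heuristic_failure_pct // 100

print(f"{handled_by_heuristic:,} heuristic, {bad_messages_per_day:,} bad per day")
```

Any new source of messages that lands in the heuristic pool scales the last number directly.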

Adding under-specified 6532 messages to the mix means that we are now generating messages that can easily slip into the second pool.  We try our best to only generate well-specified / syntactically correct messages ("be conservative in what you send" and all), but now we're being forced to take steps that we know will increase the number of failures.  We're attempting to adjust our heuristics to compensate, but that seems like a poor response compared to making the messages be well specified.

Now, one can add some bit outside of the message to say it's 6532... and then modify everything that exchanges or stores messages to also exchange and store that bit.  Passing messages to procmail?  Guess we need a new env var.  Maildrop programs used by SMTP servers to pass messages to IMAP servers.  Mailing list software.  Mailing list archives.  How you call spamassassin.  The thread about this during the discussion suggested adding a new field to the From_ mbox separator.  Guess a new attribute field in the maildir spec?

> > > The problem we're having with 6532 messages, is that we moved from
> > > explicitly identified charsets via 2047/etc mechanisms, to "it's just
> > > utf8"... and sometimes we mis-detect the utf8 as cp1250 or other
> > > encodings.
> >
> > No message created prior to the release of support for RFC 6532 can be
> > assumed to be an RFC 6532 message. Now, if you want to vet those messages
> > using some sort of process to ensure such a message meets the syntax and
> > at least looks semantically sensible, then I suppose you could set the
> > flag in the metadata.
> >
> > But if you can't distinguish such messages from legitimate RFC 6532
> > created and submitted by compliant clients, it sounds to me like you're
> > not retaining a really critical piece of envelope/metadata in your
> > implementation.
> >
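
The mis-detection mentioned in the quoted text is easy to reproduce. A small Python sketch, assuming only the two candidate charsets named there; the sample name and the helper function are illustrative, not anyone's production detector.

```python
def decode_candidates(raw: bytes) -> dict:
    """Try each candidate charset; map its name to the decoded text, or None."""
    results = {}
    for charset in ("utf-8", "cp1250"):
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError:
            results[charset] = None
    return results


# The same header text, as raw bytes in each encoding.
utf8_bytes = "Dániel".encode("utf-8")
cp1250_bytes = "Dániel".encode("cp1250")

# UTF-8 bytes also decode "successfully" as cp1250, just to the wrong text.
as_utf8 = decode_candidates(utf8_bytes)
# cp1250 bytes here are not valid UTF-8, so strict decoding rejects them.
as_legacy = decode_candidates(cp1250_bytes)
```

Strict UTF-8 validation catches the legacy bytes in this case, but nothing in the bytes themselves rules out the cp1250 reading of genuine UTF-8, which is why an unlabelled message leaves the choice to heuristics.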

> Yes.  And the most obvious place for that information, to me, is in the
> headers of the message.  Assuming that every mechanism for exchanging email
> messages needs an explicit external piece of data...

Sorry, I'm not going to revisit or defend past design decisions. My point was
and is that when you said that a piece of information was missing, that
statement was incorrect.

I'm saying that requiring the information to be external grossly increases the development required to support the standard.  I already pointed out some common tools above.  For us, not having to add a new piece of external metadata means that all we have to do is upgrade our parser... and validate how we use addresses and fix hopefully minor issues.  If we have to use external metadata, then we have 100s of data paths and data stores that need to be upgraded.

We've also already agreed that leaks happen, which means separating the metadata from the data itself guarantees that underspecified data will leak.  Making the message itself well specified means that re-synchronization can take place, that messages can pass through agnostic mechanisms (whether agnostic by choice or by happenstance) and be understood on the other side.

You may not like or agree with how this was done. (For that matter, I never
said I liked or agreed with how it was done.) But for better or worse, there's
now a standard in place, and absent compelling evidence of there being a
problem with implementing that standard - evidence which AFAICT you have not
provided - it's not appropriate to propose competing mechanisms to that
standard.

I thought implementation feedback was a useful thing, and generally desired prior to something becoming a standard.  I also thought that internet standards were generally considered a work in progress... 6532 is standards track but not a STD.

I also wasn't proposing a competing mechanism, I was proposing that implementation experience showed, to me, a strong possibility that a clarification or change to the standard would be beneficial.  Also, frankly, there is nothing in the standards that says we MUST NOT do this, so thanks for the exhortation from authority, I'll take it under advisement.

Frankly, I don't understand this concept of refusing to revisit or defend.  I tried to find and follow the discussion in the EAI WG archives... and to me, the decision seemed underexplored.  The original section spent a lot of real estate on the importance of such an explicit in-message spec, only to be removed.  I found two threads about it, about equal numbers of people contributing to either side (like 1-2 on each side), a somewhat heavy-handed dismissal of the need, a call for consensus that passed due mostly to non-contributors.  Perhaps more took place in meetings, or there were other discussions that someone could point me to.

Perhaps you can explain to me what 'compelling evidence of there being a problem implementing the standard' would mean in practice?

Brandon