Re: [ietf-822] utf8 messages

Brandon Long <blong@google.com> Tue, 12 August 2014 08:20 UTC

In-Reply-To: <01PB9FABWA4E0000SM@mauve.mrochek.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com>
Date: Tue, 12 Aug 2014 01:20:27 -0700
Message-ID: <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com>
From: Brandon Long <blong@google.com>
To: Ned Freed <ned.freed@mrochek.com>
Content-Type: multipart/alternative; boundary="089e01182d0a166d9605006a5891"
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/rO0AXu_E7ZFYO8gI_TVsG-ztlEk
Cc: ietf-822@ietf.org
Subject: Re: [ietf-822] utf8 messages

On Mon, Aug 11, 2014 at 4:08 PM, Ned Freed <ned.freed@mrochek.com> wrote:

> > In our recent launch of support for EAI, we noticed an issue with 6532
> > "utf8" messages.
>
> > As near as I can tell, there is nothing about a 6532 message which tells
> > you it is such a message... except the existence of 8bit characters in
> > the headers.  I.e., 7bit -> 5322, 8bit -> 6532.
>
> Sure there is: The smtputf8 flag in the envelope and/or utf8 flag in the
> IMAP metadata is what tells you that the message is in RFC 6532 format.
>
> It's true that there's no explicit indicator in the message content. There
> doesn't need to be, and since you have to have the indicator in the
> envelope and in the message metadata, putting it in both places just
> creates the possibility of a silly state.
>
> The basic theory of operation of EAI is if you decide to submit an RFC 6532
> message, you do so with the SMTPUTF8 extension engaged. If you decide to
> APPEND it to an IMAP folder, you do so with the utf8 data extension
> engaged. Same thing if you decide to use CATENATE to create the message.
> This tells whichever server that the message is in fact in RFC 6532 format
> so it can keep track of that fact.
>
> EAI only allows for the transport and movement of RFC 6532 messages if the
> necessary extensions are present. If they aren't, the message cannot be
> transferred or moved.
>
> The only time an RFC 6532 message can escape the EAI world is inside of a
> DSN, and in that case a message/global wrapper provides the indicator.
>
> It is, or is supposed to be, a sealed system implemented as a set of
> interlocking extensions to existing email facilities.
>

So, if I have an "email" message, I can no longer just parse it.  Instead,
there are actually two types of email messages, and the only way to know
how to parse one is to know a priori which type it is.  Because all systems
are "sealed" and there's never any leakage.

As for just checking for 8bit messages... on to the next part.
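
A minimal Python sketch of the "8bit in the headers" test described above,
assuming the raw message is available as bytes.  The function names are
hypothetical and this is not a description of any real implementation.  It
flags a legacy cp1250 header exactly as it flags a UTF-8 one, so on its own
it cannot tell a 6532 message from a broken 5322 one.

```python
def header_block(raw: bytes) -> bytes:
    """Return the header section: everything before the first blank line."""
    ends = [i for i in (raw.find(b"\r\n\r\n"), raw.find(b"\n\n")) if i != -1]
    return raw[:min(ends)] if ends else raw


def has_8bit_headers(raw: bytes) -> bool:
    """True if any header byte has the high bit set (i.e. is not 7bit)."""
    return any(b > 0x7F for b in header_block(raw))


# Hypothetical sample messages, for illustration only.
ascii_msg = b"Subject: hello\r\n\r\nbody with 8bit: \xc3\xa9"
utf8_msg = "Subject: h\u00e9llo\r\n\r\nbody".encode("utf-8")
legacy_msg = "Subject: h\u00e9llo\r\n\r\nbody".encode("cp1250")

for name, msg in [("ascii", ascii_msg), ("utf8", utf8_msg), ("legacy", legacy_msg)]:
    print(name, has_8bit_headers(msg))
```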

> > Our problem is that this isn't actually true in practice.  Prior to
> > launching support for 6532 messages, we've already had to support
> > widespread use of 8bit messages that were not always in utf8.  Since
> > these typically didn't specify which charset they were in, we used a
> > variety of techniques including direct charset detection on such
> > messages.
>
> It depends on what you mean by "8bit message". If you mean messages with
> 8bit in the body data, then sure, that's fully standard and widespread.
>
> But 8bit of any sort in a header at any level was a standards violation
> prior to RFC 6532. And since there are many 8bit charsets, and telling
> them apart is in general impossible (although intelligent guesses can be
> made) without labeling (which implies conversion to 7bit), this was never
> a terribly interoperable thing to be doing from day 1.
>

Yes, it wasn't a great idea.  Apparently, strict adherence to spec was not
a strong concern in the non-English-speaking world.  And this worked fine
before "global" services, since local sites would just assume local
charsets for these broken messages.  We've had to do auto-detection on
messages for quite a while, and yes, it's sometimes broken.

Last time we looked, it was 2% of the email in the subset we checked.
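
To make the detection problem concrete, here is a rough Python sketch of
one common heuristic: try strict UTF-8 first and fall back to a guessed
legacy charset only if that fails.  This is an illustration under stated
assumptions, not our actual detection code; the function name and the
cp1250 default are hypothetical.

```python
def decode_header_bytes(value: bytes, legacy_charset: str = "cp1250"):
    """Decode unlabeled 8bit header bytes; return (text, charset_used).

    Strict UTF-8 is tried first because arbitrary legacy text rarely
    happens to be valid UTF-8.  The fallback charset is only a guess.
    """
    try:
        return value.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return value.decode(legacy_charset, errors="replace"), legacy_charset


word = "Za\u017c\u00f3\u0142\u0107"  # Polish sample text, illustration only

print(decode_header_bytes(word.encode("utf-8")))
print(decode_header_bytes(word.encode("cp1250")))

# The reverse order is where mis-detection comes from: UTF-8 bytes also
# decode "successfully" as cp1250, just into the wrong characters.
mojibake = "\u00e9".encode("utf-8").decode("cp1250")
print(mojibake != "\u00e9")
```

The last lines show why a detector that prefers a legacy charset can
mis-read genuine UTF-8: the legacy decode does not fail, it just yields
different text.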

> > The problem we're having with 6532 messages is that we moved from
> > explicitly identified charsets via 2047/etc mechanisms, to "it's just
> > utf8"... and sometimes we mis-detect the utf8 as cp1250 or other
> > encodings.
>
> No message created prior to the release of support for RFC 6532 can be
> assumed to be an RFC 6532 message. Now, if you want to vet those messages
> using some sort of process to ensure such a message meets the syntax and
> at least looks semantically sensible, then I suppose you could set the
> flag in the metadata.
>
> But if you can't distinguish such messages from legitimate RFC 6532
> messages created and submitted by compliant clients, it sounds to me like
> you're not retaining a really critical piece of envelope/metadata in your
> implementation.
>

Yes.  And the most obvious place for that information, to me, is in the
headers of the message.  Assuming that every mechanism for exchanging email
messages needs an explicit external piece of data...

Is there a separate three letter filename extension for 6532 messages, for
example? (for platforms which use such things).
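
For anyone following along, the difference between the labeled 2047 form
and the unlabeled 6532 form can be seen with Python's standard email
package.  This is only a sketch: the helper name and the sample subject
and address are made up for illustration.

```python
from email import policy
from email.message import EmailMessage


def serialize_both(subject: str):
    """Return (rfc2047_bytes, rfc6532_bytes) for a one-header-set message."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"  # hypothetical address
    msg["Subject"] = subject
    msg.set_content("body")
    # policy.SMTP keeps headers 7bit by using RFC 2047 encoded-words,
    # which carry an explicit charset label.
    labeled = msg.as_bytes(policy=policy.SMTP)
    # policy.SMTPUTF8 writes the header as raw UTF-8, with no label.
    unlabeled = msg.as_bytes(policy=policy.SMTPUTF8)
    return labeled, unlabeled


labeled, unlabeled = serialize_both("Gr\u00fc\u00dfe")
print(labeled.split(b"\r\n\r\n")[0].decode("ascii"))
print(unlabeled.split(b"\r\n\r\n")[0].decode("utf-8"))
```

In the first form the charset travels with the header; in the second,
nothing in the bytes themselves says "this is utf8".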


> > This would all be solved if 6532 messages were actually denoted as such,
> > and I recall seeing at least one such X header used by another service
> > we've been interoperability testing with:
> > X-CM-HeaderCharset: UTF-8
> > CM no doubt standing for CoreMail, which is the software used:
> > X-Mailer: Coremail Webmail Server Version XT3.0.4 build
> >  20140526(27182.6409.6185) Copyright (c) 2002-2014 www.mailtech.cn
> >  coremail
>
> Say what? This solves nothing, since nobody else is going to bother doing
> this, and creates unnecessary silly states.
>
> Legitimate RFC 6532 messages are, or should be, labeled, and the label
> should be carried along with the message as it moves around.
>
> I suppose if you wanted to you could make the label manifest as a header
> field in the stored copy. You could even write a specification for such
> a header, but you won't be able to count on its presence.
>

[snip]


> > Now, as hinted at in the consensus to remove such a marker from the
> > draft, we can certainly add such a header when composing 6532 messages
> > or when we receive any message via SMTPUTF8 for our own utility, but I
> > would think there would be some utility in such a marker being mutually
> > understood and shared.
>
> Please don't. This is the sort of thing that wrecks standards.


So, it's OK to write a specification for such a header, but doing so
wrecks standards?

Or are you saying it would be fine for us to have such a header, but not to
leak it?

Brandon