Re: [ietf-822] utf8 messages

Ned Freed <ned.freed@mrochek.com> Mon, 11 August 2014 23:58 UTC

Date: Mon, 11 Aug 2014 16:08:15 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Mon, 11 Aug 2014 13:45:48 -0700" <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com>
To: Brandon Long <blong@google.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/ckVjH0N4hgr2b-e5SIRHtnZgzbo
Cc: ietf-822@ietf.org
Subject: Re: [ietf-822] utf8 messages

> In our recent launch of support for EAI, we noticed an issue with 6532
> "utf8" messages.

> As near as I can tell, there is nothing about a 6532 message which tells
> you it is such a message... except the existence of 8bit characters in the
> headers.  Ie, 7bit -> 5322, 8bit -> 6532.

Sure there is: The smtputf8 flag in the envelope and/or utf8 flag in the IMAP
metadata is what tells you that the message is in RFC 6532 format.

It's true that there's no explicit indicator in the message content. There
doesn't need to be, and since you have to have the indicator in the envelope
and in the message metadata, putting it in the content as well just creates
the possibility of a silly state.

The basic theory of operation of EAI is that if you decide to submit an RFC
6532 message, you do so with the SMTPUTF8 extension engaged. If you decide to
APPEND it to an IMAP folder, you do so with the utf8 data extension engaged.
Same thing if you decide to use CATENATE to create the message. This tells
whichever server is involved that the message is in fact in RFC 6532 format so
it can keep track of that fact.
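
As a rough illustration of the submission side, here is a minimal sketch of
a client deciding whether to engage SMTPUTF8 on MAIL FROM. The function name
and arguments are hypothetical, and the "any non-ASCII in addresses or
header values" test is an assumption about how a client might classify its
own message, not something taken from the specifications.

```python
from typing import Iterable


def mail_from_command(sender: str,
                      recipients: Iterable[str],
                      header_values: Iterable[str]) -> str:
    """Build a MAIL FROM command, adding the SMTPUTF8 parameter when the
    message needs RFC 6532 handling (hypothetical helper)."""
    texts = [sender, *recipients, *header_values]
    # Assumption: any non-ASCII character in an address or header value
    # means the message is in RFC 6532 format.
    needs_utf8 = any(not t.isascii() for t in texts)
    command = f"MAIL FROM:<{sender}>"
    if needs_utf8:
        command += " SMTPUTF8"
    return command


print(mail_from_command("a@example.com", ["b@example.com"], ["Grüße"]))
```

With Python's smtplib the same effect comes from passing
mail_options=["SMTPUTF8"] to sendmail() rather than building the command
by hand.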

EAI only allows for the transport and movement of RFC 6532 messages if the
necessary extensions are present. If they aren't, the message cannot be
transferred or moved.
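
The rule above can be sketched as a simple gate on the relay path. This is
only an illustration: the function name is hypothetical, and it assumes the
server has kept the "this is an RFC 6532 message" flag from the envelope and
has the next hop's EHLO keywords available as a list of strings.

```python
from typing import Iterable


def can_relay(message_is_utf8: bool, next_hop_extensions: Iterable[str]) -> bool:
    """Return True if the message may be handed to the next hop.

    message_is_utf8 is the flag carried in the envelope/metadata;
    next_hop_extensions are the keywords from the next hop's EHLO reply.
    """
    if not message_is_utf8:
        # Ordinary RFC 5322 message: no extension needed.
        return True
    advertised = {ext.upper() for ext in next_hop_extensions}
    # An RFC 6532 message may only move if SMTPUTF8 is on offer.
    return "SMTPUTF8" in advertised


print(can_relay(True, ["8BITMIME", "PIPELINING"]))
```

When the gate returns False the message stays where it is; what happens
next (bounce, DSN and so on) is outside this sketch.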

The only time an RFC 6532 message can escape the EAI world is inside of a DSN,
and in that case a message/global wrapper provides the indicator.

It is, or is supposed to be, a sealed system implemented as a set of
interlocking extensions to existing email facilities.

> Our problem is that this isn't actually true in practice.  Prior to
> launching support for 6532 messages, we've already had to support
> widespread use of 8bit messages that were not always in utf8.  Since these
> typically didn't specify which charset they were in, we used a variety of
> techniques including direct charset detection on such messages.

It depends on what you mean by "8bit message". If you mean messages with 8bit
in the body data, then sure, that's fully standard and widespread.

But 8bit of any sort in a header at any level was a standards violation prior
to RFC 6532. And since there are many 8bit charsets, and telling them apart is
in general impossible (although intelligent guesses can be made) without
labeling (which implies conversion to 7bit), this was never a terribly
interoperable thing to be doing from day 1.

> The problem we're having with 6532 messages, is that we moved from
> explicitly identified charsets via 2047/etc mechanisms, to "its just
> utf8"... and sometimes we mis-detect the utf8 as cp1250 or other encodings.

No message created prior to the release of support for RFC 6532 can be assumed
to be an RFC 6532 message. Now, if you want to vet those messages using some
sort of process to ensure such a message meets the syntax and at least looks
semantically sensible, then I suppose you could set the flag in the metadata.
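
A vetting pass of that kind might start out something like the following.
It is a minimal sketch with a hypothetical function name, and it checks
syntax only (strict UTF-8 decoding and no stray control characters); it says
nothing about whether the text is semantically sensible.

```python
def looks_like_rfc6532_header(raw: bytes) -> bool:
    """Return True if a raw header line contains 8bit data that is
    well-formed UTF-8 (a syntax check only)."""
    if raw.isascii():
        # Pure 7bit: an ordinary RFC 5322 header, nothing to flag.
        return False
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Reject control characters other than tab, CR and LF.
    return not any(ord(ch) < 32 and ch not in "\t\r\n" for ch in text)


utf8_bytes = "Subject: Grüße".encode("utf-8")
print(looks_like_rfc6532_header(utf8_bytes))
# The same bytes also decode without error under cp1250, which is why
# charset detection alone can mislabel valid UTF-8.
print(utf8_bytes.decode("cp1250"))
```

Passing this check does not prove the sender intended UTF-8; it only shows
the bytes are consistent with it.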

But if you can't distinguish such messages from legitimate RFC 6532 messages
created and submitted by compliant clients, it sounds to me like you're not
retaining a really critical piece of envelope/metadata in your implementation.

> Now, we can work on improving our detection and maybe start biasing it to
> utf8 or even just assuming utf8 for any 8bit message which is in
> interchange valid utf8.  Anything we do there will result in some potential
> for mistakes, of course.

Sure. There are always tradeoffs when handling invalid data with no clear
semantics. And those tradeoffs just got nastier in the case of EAI, because now
you can't simply violate the standards by puking out such messages to clients
willy-nilly. Why? Because a compliant EAI client is going to do an ENABLE
UTF8=whatever, and once it does that it's going to expect the messages it gets
that do contain 8bit in headers and addresses to actually comply with RFC 6532.
So you're going to have to do something to either downgrade those messages to
7bit headers or upgrade them to RFC 6532, and yes, mistakes will be made.
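
For the downgrade direction, the simplest case is an unstructured field
such as Subject, where the 8bit text can be rewritten as RFC 2047 encoded
words. The sketch below uses Python's email.header module; the function
name is hypothetical, and it assumes the value has already been decoded to
text correctly, which is exactly the step where mistakes get made. It does
not handle addresses, which cannot be downgraded this way.

```python
from email.header import Header, decode_header, make_header


def downgrade_unstructured(value: str) -> str:
    """Return a 7bit form of an unstructured header value, using RFC 2047
    encoded words when the value contains non-ASCII text."""
    if value.isascii():
        return value
    # Label the charset explicitly so the receiver need not guess.
    return Header(value, charset="utf-8").encode()


encoded = downgrade_unstructured("Grüße aus Köln")
print(encoded)
# Round trip back to text to confirm nothing was lost.
print(str(make_header(decode_header(encoded))))
```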

But this is not a lack of labeling issue. Rather, it's an issue of having
to deal with the garbage you let in.

One thing you could argue is that there should be a way to fetch the UTF8
state of a message - AFAIK that's not possible at present, which means
that in some cases clients have to scan the message to figure that out.
I think this is an oversight in the standard, but not a major one.
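
The client-side scan amounts to something like this. It is a minimal
sketch with a hypothetical function name, and it looks only at the
top-level header block; headers of nested MIME parts would need a real
parser.

```python
def headers_contain_8bit(raw_message: bytes) -> bool:
    """Return True if any byte in the top-level header block has the
    high bit set."""
    for line in raw_message.splitlines():
        if not line:
            # The first empty line ends the header block.
            break
        if any(byte >= 0x80 for byte in line):
            return True
    return False


sample = "Subject: Grüße\r\n\r\nbody\r\n".encode("utf-8")
print(headers_contain_8bit(sample))
```

Note that 8bit data in the body alone does not trip the check, which
matches the distinction drawn earlier between 8bit bodies and 8bit headers.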

> This would all be solved if 6532 messages were actually denoted as such,
> and I recall seeing at least one such X header used by another service
> we've been interoperability testing with:
> X-CM-HeaderCharset: UTF-8
> CM no doubt standing for CoreMail, which is the software used:
> X-Mailer: Coremail Webmail Server Version XT3.0.4 build
>  20140526(27182.6409.6185) Copyright (c) 2002-2014 www.mailtech.cn coremail

Say what? This solves nothing, since nobody else is going to bother doing this,
and creates unnecessary silly states.

Legitimate RFC 6532 messages are, or should be, labeled, and the label
should be carried along with the message as it moves around.

I suppose if you wanted to you could make the label manifest as a header
field in the stored copy. You could even write a specification for such
a header, but you won't be able to count on its presence.

> Thoughts?  It looks like there was a i-Email/Header-Type originally, but
> was removed early in the utf8smtp timeframe:
> http://www.ietf.org/mail-archive/web/ima/current/msg01358.html
> The general consensus for removal seemed to be "you'll know because it was
> specified at SMTP time", "just look for 8bit" and "its bad to duplicate
> data between the envelope and the headers".

Which was correct.

> Looks like it goes nearly to the beginning of the utf8smtp time frame:
> http://www.ietf.org/mail-archive/web/ima/current/msg00079.html

> It seems that the pre-existence of 8bit messages was not considered by
> those who felt it wasn't necessary, as least as far as I've read in the
> discussions (wow do I wish the mhonarc had been updated with an easier to
> explore/read model)

The pre-existence of 8bit messages is a problem, but the problem is the
lack of a proper label, not where the label happens to be stored.

> Now, as hinted at in the consensus to remove such a marker from the draft,
> we can certainly add such a header when composing 6532 messages or when we
> receive any message via SMTPUTF8 for our own utility, but I would think
> there would be some utility in such a marker being mutually understood and
> shared.

Please don't. This is the sort of thing that wrecks standards.

				Ned