[ietf-822] utf8 messages

Brandon Long <blong@google.com> Mon, 11 August 2014 20:45 UTC

Date: Mon, 11 Aug 2014 13:45:48 -0700
Message-ID: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com>
From: Brandon Long <blong@google.com>
To: ietf-822@ietf.org
Content-Type: multipart/alternative; boundary="089e01493922c988e0050060a3e5"
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/z3apPe_6hgR51uIHfDQiIY2An3k
Subject: [ietf-822] utf8 messages

In our recent launch of support for EAI, we noticed an issue with RFC 6532
"utf8" messages.

As near as I can tell, there is nothing about a 6532 message which tells
you it is such a message... except the existence of 8bit characters in the
headers.  I.e., 7bit -> 5322, 8bit -> 6532.

Our problem is that this isn't actually true in practice.  Prior to
launching support for 6532 messages, we already had to support
widespread use of 8bit messages that were not always in utf8.  Since these
typically didn't specify which charset they were in, we used a variety of
techniques, including direct charset detection, on such messages.

The problem we're having with 6532 messages is that we moved from
explicitly identified charsets via 2047/etc. mechanisms to "it's just
utf8"... and sometimes we mis-detect the utf8 as cp1250 or other encodings.
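
To illustrate why this kind of mis-detection is possible at all: a
single-byte code page such as cp1250 assigns a character to almost every
byte value, so well-formed utf8 usually decodes under it without any error.
A minimal Python sketch (the sample header value is made up for
illustration):

```python
# A made-up header value containing one non-ASCII character.
subject = "Café"

# The bytes as they would appear unencoded in a 6532 header.
raw = subject.encode("utf-8")

# Decoding the same bytes as cp1250 succeeds, but yields mojibake:
# the two-byte utf8 sequence for "é" becomes two unrelated characters.
as_cp1250 = raw.decode("cp1250")

print(raw)
print(as_cp1250)
```

Because no exception is raised, a detector has to rely on statistics or
heuristics rather than decode failures to tell the two apart.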

Now, we can work on improving our detection and maybe start biasing it to
utf8, or even just assume utf8 for any 8bit message which is
interchange-valid utf8.  Anything we do there will result in some potential
for mistakes, of course.
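
The "assume utf8 if it validates" option could be sketched roughly as
follows.  This is only an illustration under stated assumptions, not how
any particular service does it: the function name and the three labels are
hypothetical, and Python's strict utf8 decoder is used as a stand-in for
"interchange valid" (it rejects overlong forms and surrogates but still
accepts noncharacters, so it is slightly more permissive).

```python
def classify_header_bytes(raw: bytes) -> str:
    """Give a rough label to a raw header block.

    The labels are hypothetical names used only in this sketch:
      "rfc5322"     - pure 7bit
      "rfc6532"     - 8bit and well-formed utf8
      "legacy-8bit" - 8bit but not valid utf8; needs charset detection
    """
    if raw.isascii():
        return "rfc5322"
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "legacy-8bit"
    return "rfc6532"


print(classify_header_bytes(b"Subject: hello\r\n"))
print(classify_header_bytes("Subject: Café\r\n".encode("utf-8")))
print(classify_header_bytes("Subject: Café\r\n".encode("cp1250")))
```

The residual risk is the one noted above: legacy 8bit text in some other
charset that happens to be byte-for-byte valid utf8 would be labeled as
6532.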

This would all be solved if 6532 messages were actually denoted as such,
and I recall seeing at least one such X header used by another service
we've been interoperability testing with:
X-CM-HeaderCharset: UTF-8
CM no doubt standing for CoreMail, which is the software used:
X-Mailer: Coremail Webmail Server Version XT3.0.4 build
 20140526(27182.6409.6185) Copyright (c) 2002-2014 www.mailtech.cn coremail

Thoughts?  It looks like there was an i-Email/Header-Type header
originally, but it was removed early in the utf8smtp timeframe:
http://www.ietf.org/mail-archive/web/ima/current/msg01358.html
The general consensus for removal seemed to be "you'll know because it was
specified at SMTP time", "just look for 8bit" and "it's bad to duplicate
data between the envelope and the headers".

Looks like it goes back nearly to the beginning of the utf8smtp time frame:
http://www.ietf.org/mail-archive/web/ima/current/msg00079.html

It seems that the pre-existence of 8bit messages was not considered by
those who felt it wasn't necessary, at least as far as I've read in the
discussions (wow, do I wish the mhonarc archive had been updated with an
easier to explore/read model).

Now, as hinted at in the consensus to remove such a marker from the draft,
we can certainly add such a header when composing 6532 messages or when we
receive any message via SMTPUTF8 for our own utility, but I would think
there would be some utility in such a marker being mutually understood and
shared.
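
As a rough sketch of the "add it ourselves" option: the function below
prepends a marker header to a raw message that arrived via SMTPUTF8.  The
header name is simply borrowed from the Coremail example above and is not a
standard; the function name and the `via_smtputf8` flag are hypothetical,
and a real implementation would presumably sit on top of a proper message
parser.

```python
# Marker name borrowed from the Coremail example; not a standard header.
MARKER_NAME = b"X-CM-HeaderCharset"
MARKER_LINE = MARKER_NAME + b": UTF-8\r\n"


def mark_utf8_headers(raw_message: bytes, via_smtputf8: bool) -> bytes:
    """Prepend the marker header if the message came in via SMTPUTF8."""
    if not via_smtputf8:
        return raw_message

    # Only look at the header block, i.e. everything before the first
    # empty line, and leave the message alone if it is already marked.
    header_block = raw_message.partition(b"\r\n\r\n")[0]
    prefix = MARKER_NAME.lower() + b":"
    for line in header_block.split(b"\r\n"):
        if line.lower().startswith(prefix):
            return raw_message

    return MARKER_LINE + raw_message


msg = b"Subject: hi\r\n\r\nbody\r\n"
print(mark_utf8_headers(msg, via_smtputf8=True))
```

A marker added this way is only meaningful inside the system that added
it, which is the limitation raised above.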

Brandon