Re: [ietf-822] utf8 messages

Brandon Long <blong@google.com> Wed, 13 August 2014 23:23 UTC

In-Reply-To: <01PBBWUH11D60000SM@mauve.mrochek.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com> <01PBABOOL4QO0000SM@mauve.mrochek.com> <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com> <01PBBWUH11D60000SM@mauve.mrochek.com>
Date: Wed, 13 Aug 2014 16:21:56 -0700
Message-ID: <CABa8R6uJ--4Fcntdgef+h6ZXjP_q0q7hZaBW-SOozMTtiE918g@mail.gmail.com>
From: Brandon Long <blong@google.com>
To: Ned Freed <ned.freed@mrochek.com>
Content-Type: multipart/alternative; boundary="089e013d06e6e8b74d05008b0d4d"
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/WeoFCjpiXemqpdPlt1Kb-NK5b9A
Cc: ietf-822@ietf.org
Subject: Re: [ietf-822] utf8 messages

Let me try one more time, since something isn't making it through.

I have three messages.  One message has an entirely 7bit header with an
RFC 2047 encoded subject.  Another is an RFC 6532 message, with the subject
in utf8.  A third message has a cp-1250 8bit subject.  There are two 8bit
bytes in the subject in both of the last two messages, and in the cp1250
case, those two bytes happen to also form a valid utf8 character.
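
To illustrate why the first message is the easy one, here is a minimal
sketch using Python's standard email.header module.  The encoded-word below
is a hypothetical subject built for illustration, not taken from a real
message.  The point is only that an RFC 2047 encoded-word carries its own
charset label, while raw 8bit header bytes carry none.

```python
from email.header import decode_header

# Hypothetical RFC 2047 encoded subject (7bit on the wire, charset labeled).
encoded_subject = "=?utf-8?q?Zdj=C4=99cia?="

# decode_header returns (bytes, charset) pairs; the charset comes from the label.
raw_bytes, charset = decode_header(encoded_subject)[0]
labeled_text = raw_bytes.decode(charset)

# A raw 8bit subject has the same bytes but no label to consult.
unlabeled_subject = b"Zdj\xc4\x99cia"

print(charset, labeled_text)
print(raw_bytes == unlabeled_subject)
```

The bytes are identical in both cases; only the first form says how to
interpret them.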

We want to be able to parse all three of those and do so correctly.  We
know the third type is technically invalid, but we see millions of such
messages every day, and dropping all of them would be a disservice to our
users.  We currently see far more of such messages than we do of 6532
messages... though in practice, the most common charset now is utf-8, so I
guess those are now the same as 6532 messages that have leaked.

For example, we receive the Subject: Zdj\xc4\x99cia.  In UTF8,
that's Zdjęcia; in cp-1250, that's ZdjÄ™cia.

How do I tell which it's supposed to be?  Our encoding detector chose
incorrectly.
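
The ambiguity can be reproduced in a few lines of Python.  This is only a
sketch: decode_unlabeled is a hypothetical helper, and its "try strict
UTF-8 first, then fall back" rule is one common heuristic, not a
description of our detector.  It is still a guess, because nothing in the
bytes proves which charset the sender meant.

```python
def decode_unlabeled(raw, fallback="cp1250"):
    """Guess the charset of unlabeled 8bit header bytes (heuristic only)."""
    try:
        # Strict UTF-8 rejects most legacy 8bit text, so success is a strong hint.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # The fallback charset is an assumption; undefined bytes become U+FFFD.
        return raw.decode(fallback, errors="replace"), fallback

subject = b"Zdj\xc4\x99cia"

# Both decodings succeed, which is exactly the problem.
print(subject.decode("utf-8"))
print(subject.decode("cp1250"))

# The heuristic picks UTF-8 for the ambiguous bytes...
print(decode_unlabeled(subject))
# ...and falls back when the bytes are not valid UTF-8.
print(decode_unlabeled(b"Zdj\xeacia"))
```

The heuristic favors UTF-8 because accidental validity is rare, but a
genuine cp-1250 sender who really typed "Ä™" would be decoded wrongly.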

And my apologies if bringing this to ietf-822 instead of the eai-wg list
was the wrong choice.  It wasn't clear to me that the latter was still
active after the completion of the working group.  I assumed that with its
completion there's no longer a "split" between the two, and that a concern
specifically about the format of email messages (which now includes 6532)
would belong on the list about such things.

I also seem to have triggered some fear of a revolt, which I don't
understand.

And I realize that to someone who spent years working on this, being
asked to retread these things is annoying.  Unfortunately, the resulting
RFCs don't include summarized information on why other possible choices
were considered and rejected.  I'm unclear on how one is supposed to gain
this knowledge short of reading years' worth of mailing list archives across
multiple lists... and even that doesn't help with things discussed
off-line or on other lists I don't even know to look for.

Brandon