Re: [ietf-822] utf8 messages

Ned Freed <ned.freed@mrochek.com> Thu, 14 August 2014 04:19 UTC

Return-Path: <ned.freed@mrochek.com>
X-Original-To: ietf-822@ietfa.amsl.com
Delivered-To: ietf-822@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A1DE41A00B0 for <ietf-822@ietfa.amsl.com>; Wed, 13 Aug 2014 21:19:17 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.57
X-Spam-Level:
X-Spam-Status: No, score=-2.57 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RP_MATCHES_RCVD=-0.668, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id WwIhXOjF_9Tn for <ietf-822@ietfa.amsl.com>; Wed, 13 Aug 2014 21:19:15 -0700 (PDT)
Received: from mauve.mrochek.com (mauve.mrochek.com [66.159.242.17]) by ietfa.amsl.com (Postfix) with ESMTP id 2B90B1A00AE for <ietf-822@ietf.org>; Wed, 13 Aug 2014 21:19:15 -0700 (PDT)
Received: from dkim-sign.mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01PBCGZGYRAO0017T8@mauve.mrochek.com> for ietf-822@ietf.org; Wed, 13 Aug 2014 21:14:13 -0700 (PDT)
MIME-version: 1.0
Content-transfer-encoding: 7bit
Content-type: TEXT/PLAIN; CHARSET="US-ASCII"
Received: from mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01PBCA2453TC0000SM@mauve.mrochek.com>; Wed, 13 Aug 2014 21:14:10 -0700 (PDT)
Message-id: <01PBCGZERCU20000SM@mauve.mrochek.com>
Date: Wed, 13 Aug 2014 18:34:35 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Wed, 13 Aug 2014 16:21:56 -0700" <CABa8R6uJ--4Fcntdgef+h6ZXjP_q0q7hZaBW-SOozMTtiE918g@mail.gmail.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com> <01PBABOOL4QO0000SM@mauve.mrochek.com> <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com> <01PBBWUH11D60000SM@mauve.mrochek.com> <CABa8R6uJ--4Fcntdgef+h6ZXjP_q0q7hZaBW-SOozMTtiE918g@mail.gmail.com>
To: Brandon Long <blong@google.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/aLxGTTxaMH7npMgGCvyVU9qplCE
Cc: ietf-822@ietf.org, Ned Freed <ned.freed@mrochek.com>
Subject: Re: [ietf-822] utf8 messages
X-BeenThere: ietf-822@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Discussion of issues related to Internet Message Format \[RFC 822, RFC 2822, RFC 5322\]" <ietf-822.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/ietf-822/>
List-Post: <mailto:ietf-822@ietf.org>
List-Help: <mailto:ietf-822-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Aug 2014 04:19:18 -0000

> Let me try one more time, since something isn't making it through.

> I have three messages.  One message has an entirely 7bit header with 2047
> encoded subject.  Another message is a 6532 message, with the subject in
> utf8.  A third message has a cp-1250 8bit subject.  There are two 8bit
> bytes in the subject in both of the last two messages, and in the cp1250
> case, those two bytes happen to also be a valid utf8 character.

> We want to be able to parse all three of those and do so correctly.  We
> know the third type is technically invalid, but we see millions of such
> messages every day; dropping all of those would be a disservice to our
> users.  We currently see way more of such messages than we do of 6532
> messages... though in practice, the most common charset now is utf-8, so I
> guess those are now the same as 6532 messages that have leaked.
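
To make the ambiguity in the quoted scenario concrete, here is a minimal
Python sketch. The byte pair 0xC3 0xA9 is only an illustrative choice and
is not taken from any real message. It decodes cleanly both as UTF-8 and
as cp1250, whereas an RFC 2047 encoded-word names its charset explicitly.

```python
from email.header import decode_header, make_header

# Two raw bytes as they might appear in an unlabeled 8bit Subject.
# (Illustrative value only; any valid two-byte UTF-8 sequence whose bytes
# are also defined in cp1250 shows the same effect.)
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")      # one character: U+00E9
as_cp1250 = raw.decode("cp1250")   # two characters: U+0102 U+00A9


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The RFC 2047 form carries the charset in the encoded-word itself,
# so no guessing is required.
encoded_word = "=?utf-8?B?w6k=?="
labeled = str(make_header(decode_header(encoded_word)))
```

Both decodings of `raw` succeed, so UTF-8 validity by itself cannot tell
a leaked 6532-style header from a cp1250 one; any classifier has to lean
on other evidence.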

I thought I understood the problem you were attempting to solve, but now I'm
totally confused, because this seems to have nothing to do with additional
labeling of legitimate EAI messages at all.

You say you have to deal with invalid messages with 8bit in headers. You say
that there's a trend towards these using utf-8 rather than some other charset.
You say that EAI messages are in the distinct minority. And finally, you say
there are issues with your heuristics misidentifying the charset.

Given that EAI messages are currently in the minority, your first order of
business clearly needs to be work on those heuristics. Beyond that, it seems to
me that your focus needs to be on calling out the details of the nonstandard stuff
you're doing, rather than creating openings for silly states with the
standard stuff.

More specifically, when you receive an invalid message that needs or has
undergone heuristic processing, why not just label it as such? This way
there's a clear indicator that the message has issues and that there
may be problems interpreting it.

This label is actually orthogonal to marking the message as an EAI message.
If your heuristics say, with high probability, that this is an EAI
message, then you probably want to set the EAI bit so that other things
will treat it as such. But the additional label tells you that the EAI
label came about implicitly rather than explicitly.
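
The two independent indicators described above could be sketched roughly
as follows in Python. Everything named here is hypothetical: the header
name X-Heuristic-Processing, the Classification type, and the note format
are assumptions made for illustration, not an existing convention.

```python
from dataclasses import dataclass, field
from email.message import Message

# Hypothetical header name, chosen only for this sketch.
LABEL_HEADER = "X-Heuristic-Processing"


@dataclass
class Classification:
    is_eai: bool          # the "EAI bit": treat the message as EAI or not
    by_heuristic: bool    # True if that decision was reached by guessing
    notes: list = field(default_factory=list)  # free-form heuristic details


def apply_label(msg: Message, result: Classification) -> None:
    """Add the heuristic label only when heuristics were actually used.

    The label is independent of the EAI decision: it records how the
    decision was reached, not what the decision was.
    """
    if not result.by_heuristic:
        return
    value = "eai=%s" % ("yes" if result.is_eai else "no")
    if result.notes:
        value += "; " + "; ".join(result.notes)
    msg[LABEL_HEADER] = value


msg = Message()
msg["Subject"] = "test"
apply_label(msg, Classification(is_eai=True, by_heuristic=True,
                                notes=["charset-guess=utf-8"]))
```

A message that was explicitly EAI gets no label at all, so the presence
of the label is what signals that the EAI status was inferred.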

I also note that the existing stock of messages containing invalid 8bit in the
headers are not EAI messages by definition. And you can check this by looking
at the timestamps in the message, message metadata, or both. So the lack of
these labels on old messages is a nonissue.
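
A timestamp check of this kind might look like the Python sketch below.
The cutoff is an assumption: it uses February 2012, the publication month
of RFC 6532, and a real deployment would pick its own date. The function
name predates_eai is likewise hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Assumed cutoff for this sketch: RFC 6532 was published in February 2012.
EAI_CUTOFF = datetime(2012, 2, 1, tzinfo=timezone.utc)


def predates_eai(date_header: str, cutoff: datetime = EAI_CUTOFF) -> bool:
    """Return True if the Date header value is earlier than the cutoff."""
    when = parsedate_to_datetime(date_header)
    if when.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        when = when.replace(tzinfo=timezone.utc)
    return when < cutoff


old = predates_eai("Mon, 05 Jan 2009 10:00:00 +0000")
new = predates_eai("Wed, 13 Aug 2014 18:34:35 -0700")
```

Date headers are sender-supplied, so a store that keeps its own arrival
metadata would probably want to consult that as well, as suggested above.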

You can also use the label to write down the heuristics you have applied or
will apply, or whatever other contextual details exist that aren't stored
anywhere else and that may assist in the handling of the message.

What am I missing here?

				Ned