Re: [ietf-822] utf8 messages
Ned Freed <ned.freed@mrochek.com> Thu, 14 August 2014 04:19 UTC
Return-Path: <ned.freed@mrochek.com>
X-Original-To: ietf-822@ietfa.amsl.com
Delivered-To: ietf-822@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A1DE41A00B0 for <ietf-822@ietfa.amsl.com>; Wed, 13 Aug 2014 21:19:17 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.57
X-Spam-Level:
X-Spam-Status: No, score=-2.57 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RP_MATCHES_RCVD=-0.668, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id WwIhXOjF_9Tn for <ietf-822@ietfa.amsl.com>; Wed, 13 Aug 2014 21:19:15 -0700 (PDT)
Received: from mauve.mrochek.com (mauve.mrochek.com [66.159.242.17]) by ietfa.amsl.com (Postfix) with ESMTP id 2B90B1A00AE for <ietf-822@ietf.org>; Wed, 13 Aug 2014 21:19:15 -0700 (PDT)
Received: from dkim-sign.mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01PBCGZGYRAO0017T8@mauve.mrochek.com> for ietf-822@ietf.org; Wed, 13 Aug 2014 21:14:13 -0700 (PDT)
MIME-version: 1.0
Content-transfer-encoding: 7bit
Content-type: TEXT/PLAIN; CHARSET="US-ASCII"
Received: from mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01PBCA2453TC0000SM@mauve.mrochek.com>; Wed, 13 Aug 2014 21:14:10 -0700 (PDT)
Message-id: <01PBCGZERCU20000SM@mauve.mrochek.com>
Date: Wed, 13 Aug 2014 18:34:35 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Wed, 13 Aug 2014 16:21:56 -0700" <CABa8R6uJ--4Fcntdgef+h6ZXjP_q0q7hZaBW-SOozMTtiE918g@mail.gmail.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com> <01PBABOOL4QO0000SM@mauve.mrochek.com> <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com> <01PBBWUH11D60000SM@mauve.mrochek.com> <CABa8R6uJ--4Fcntdgef+h6ZXjP_q0q7hZaBW-SOozMTtiE918g@mail.gmail.com>
To: Brandon Long <blong@google.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/aLxGTTxaMH7npMgGCvyVU9qplCE
Cc: ietf-822@ietf.org, Ned Freed <ned.freed@mrochek.com>
Subject: Re: [ietf-822] utf8 messages
X-BeenThere: ietf-822@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Discussion of issues related to Internet Message Format \[RFC 822, RFC 2822, RFC 5322\]" <ietf-822.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=unsubscribe>
List-Archive: <http://www.ietf.org/mail-archive/web/ietf-822/>
List-Post: <mailto:ietf-822@ietf.org>
List-Help: <mailto:ietf-822-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ietf-822>, <mailto:ietf-822-request@ietf.org?subject=subscribe>
X-List-Received-Date: Thu, 14 Aug 2014 04:19:18 -0000
> Let me try one more time, since something isn't making it through.
> I have three messages. One message has an entirely 7bit header with 2047
> encoded subject. Another message is a 6532 message, with the subject in
> utf8. A third message has a cp-1250 8bit subject. There are two 8bit
> bytes in the subject in both of the last two messages, and in the cp1250
> case, those two bytes happen to also be a valid utf8 character.
> We want to be able to parse all three of those and do so correctly. We
> know the third type is technically invalid, but we see millions of such
> messages every day, dropping all of those would be a disservice to our
> users. We currently see way more of such messages than we do of 6532
> messages... though in practice, the most common charset now is utf-8, so I
> guess those are now the same as 6532 messages that have leaked.

I thought I understood the problem you were attempting to solve, but now I'm totally confused, because this seems to have nothing to do with additional labeling of legitimate EAI messages at all.

You say you have to deal with invalid messages with 8bit in headers. You say that there's a trend towards these using utf-8 rather than some other charset. You say that EAI messages are in the distinct minority. And finally, you say there are issues with your heuristics misidentifying the charset.

Given that EAI messages are currently in the minority, your first order of business clearly needs to be work on those heuristics. Beyond that, it seems to me that your focus needs to be on calling out the details of the nonstandard stuff you're doing, rather than creating openings for silly states with the standard stuff.

More specifically, when you receive an invalid message that needs or has undergone heuristic processing, why not just label it as such? This way there's a clear indicator that the message has issues and that there may be problems interpreting it.
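
To make the labeling idea concrete, here is a minimal sketch of how a receiver might decide that a header value needed guesswork. It is not taken from any system discussed in this thread: the function names, the label values, and the `declared_eai` flag (meaning "the message was explicitly marked as EAI by whatever means the system has") are all hypothetical.

```python
from typing import Optional


def classify_header_bytes(raw: bytes) -> str:
    """Return a coarse class for the raw bytes of one header field value."""
    if all(b < 0x80 for b in raw):
        # Pure 7bit, which includes RFC 2047 encoded-words.
        return "7bit"
    try:
        raw.decode("utf-8")  # strict decoding; raises on malformed sequences
    except UnicodeDecodeError:
        return "invalid-8bit"
    # Valid UTF-8, but the same bytes could have come from a legacy
    # 8-bit charset, so this is only a guess.
    return "utf8-guess"


def heuristic_label(raw: bytes, declared_eai: bool) -> Optional[str]:
    """Return a label value if guessing was needed, else None.

    Hypothetical policy: 7bit headers, and UTF-8 headers in a message
    explicitly marked as EAI, need no label; everything else gets one.
    """
    kind = classify_header_bytes(raw)
    if kind == "7bit" or (kind == "utf8-guess" and declared_eai):
        return None
    return kind


if __name__ == "__main__":
    for sample in (b"=?utf-8?q?caf=C3=A9?=", b"caf\xc3\xa9", b"caf\xe9"):
        print(sample, heuristic_label(sample, declared_eai=False))
```

The point of returning a label only when a guess was made is that a downstream consumer can tell "this is UTF-8 because the message said so" apart from "this is UTF-8 because a heuristic decided so".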
This label is actually orthogonal to marking the message as an EAI message. If your heuristics say, with high probability, that this is an EAI message, then you probably want to set the EAI bit so that other things will treat it as such. But the additional label tells you that the EAI label came about implicitly rather than explicitly.

I also note that the existing stock of messages containing invalid 8bit in the headers are not EAI messages by definition. And you can check this by looking at the timestamps in the message, message metadata, or both. So the lack of these labels on old messages is a nonissue.

You can also use the label to write down the heuristics you have applied, or will apply. Or whatever other contextual details exist that aren't stored anywhere else that may assist in the handling of the message.

What am I missing here?

				Ned
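
The timestamp check mentioned above could look something like the following sketch. The cut-off is an assumption made for illustration: it uses the start of February 2012, the month RFC 6532 was published, and the name `predates_eai` is hypothetical. A real system would likely also consult its own message metadata, as the message suggests, since a Date header can be wrong or forged.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumption: treat anything dated before RFC 6532's publication month
# (February 2012) as unable to be a legitimate EAI message.
EAI_CUTOFF = datetime(2012, 2, 1, tzinfo=timezone.utc)


def predates_eai(date_header: str) -> Optional[bool]:
    """True if the Date header is before the cut-off, False if not,
    None if the header cannot be parsed."""
    try:
        when = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if when.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        when = when.replace(tzinfo=timezone.utc)
    return when < EAI_CUTOFF


if __name__ == "__main__":
    print(predates_eai("Wed, 13 Aug 2014 18:34:35 -0700"))
    print(predates_eai("Mon, 05 Jan 2009 10:00:00 +0000"))
```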