Re: [ietf-822] utf8 messages

Ned Freed <ned.freed@mrochek.com> Wed, 13 August 2014 18:42 UTC

MIME-version: 1.0
Content-transfer-encoding: 7bit
Content-type: TEXT/PLAIN; CHARSET="us-ascii"
Received: from mauve.mrochek.com by mauve.mrochek.com (PMDF V6.1-1 #35243) id <01PB2RFWCBO00000SM@mauve.mrochek.com>; Wed, 13 Aug 2014 11:37:32 -0700 (PDT)
Message-id: <01PBBWUH11D60000SM@mauve.mrochek.com>
Date: Wed, 13 Aug 2014 09:08:51 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Wed, 13 Aug 2014 01:00:32 -0700" <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com> <01PBABOOL4QO0000SM@mauve.mrochek.com> <CABa8R6vBqS1ewmTtHh8tTOdzobsWpvSEokRxOqpj1Oq3hA+vsw@mail.gmail.com>
To: Brandon Long <blong@google.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/5nhtbpJ0niNKozZOCw_f_sD3WSQ
Cc: ietf-822@ietf.org, Ned Freed <ned.freed@mrochek.com>
Subject: Re: [ietf-822] utf8 messages

> Ok, let me try to rephrase my point.  Before, there were two types of
> messages, well specified / syntactically correct message, and not.  The not
> well specified messages amount to a non-trivial number, but best
> effort/heuristic results are fine.

And now there are still two types of messages by this criterion, but the
set of syntactically valid messages has increased to include those with utf-8
in certain places.

> If we imagine handling 1B messages/day, 2% being the "not well specified",
> we have 20M messages handled by heuristic.  If we have only 1% heuristic
> failures, that's 200k bad messages a day.  Eh, maybe ok.  Especially since
> in practice, these tend to be spam messages or the 8bit chars in the header
> are limited to some "content preview" header or other boneheadedness, or
> best case, its only a character or two that's broken.

Not sure of the relevance, but OK.

> Adding under-specified 6532 messages

What's underspecified about EAI messages? Either a message contains utf-8 in
certain places, making it an EAI message, or it doesn't, making it something
else.

Or perhaps more to the point, in making your argument you are
assuming the conclusion you want to reach: That such messages are
in fact "underspecified".

> to the mix means that we are now
> generating messages that can easily slip into the second pool.  We try our
> best to only generate well specified / syntactically correct messages.. "be
> conservative in what you send" and all, but now we're being forced to take
> steps that we know will increase the number of failures.  We're attempting
> to adjust our heuristics to compensate, but that seems like a poor response
> compared to making the messages be well specified.

I'm having a great deal of difficulty parsing this, but what I think you're
trying to say is that there will be cases where, despite your best efforts and
intentions, you will construct what you intend to be RFC 6532 messages but
those messages will contain 8bit material in headers not in utf-8. And
you are concerned that such messages may be misclassified.

First, you're right to be concerned, but not necessarily for the reasons you
think. I am skeptical that EAI implementations are going to take the lax attitude
towards noncompliant garbage in messages that they have in the past. (Remember that
the trend is consistently towards less and less tolerance as time goes on.) My
own EAI implementation-in-progress certainly doesn't tolerate such stuff - it
won't accept such messages.
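
To make the distinction concrete, here is a minimal sketch of the kind of
strict check being described, assuming the raw bytes of a header field are
available. The function name and the three labels are hypothetical and are
not taken from any actual implementation; note that this is a syntax check
only, since a legacy-charset string can occasionally happen to be valid utf-8.

```python
def classify_header_bytes(raw: bytes) -> str:
    """Classify one raw header field as 'ascii', 'utf8', or 'invalid'.

    Hypothetical helper: 'invalid' means 8bit material that is not
    well-formed utf-8, which a strict receiver could refuse.
    """
    # Pure 7bit content is valid with or without EAI.
    if all(b < 0x80 for b in raw):
        return "ascii"
    try:
        # Strict decoding rejects truncated, overlong, and stray sequences.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "invalid"
    return "utf8"
```

A receiver taking the strict line would reject anything classified as
"invalid" rather than guess at a charset.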

But in this case the label you think you need doesn't mean "EAI message",
rather, it means "this was intended to be an EAI message but may not be".

I seriously question the value of such a field. Or, for that matter,
using the SMTPUTF8 flag in this fashion.

> Now, one can add some bit outside of the message to say its 6532... and
> then modify everything that exchanges or stores messages to also exchange
> and store that bit.  Passing messages to procmail?  Guess we need a new env
> var.

Or not. procmail is a delivery agent. It can look for a "with SMTPUTF8"
clause in the Received: field.
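
As a rough sketch of what such a check could look like in a delivery-time
filter: the code below scans Received: fields for the "with" protocol
keywords that, to my understanding, RFC 6531 registers for SMTPUTF8 transfers
(UTF8SMTP, UTF8SMTPS, UTF8LMTP and their authenticated variants). The
function name is hypothetical, and the exact keyword set should be checked
against the RFC before relying on it.

```python
import re
from email import message_from_string
from typing import List

# Assumed keyword set: UTF8SMTP / UTF8LMTP, optionally followed by S, A, or SA.
_UTF8_WITH = re.compile(
    r"\bwith\s+(UTF8(?:SMTP|LMTP)(?:SA|S|A)?)\b", re.IGNORECASE
)

def utf8_transport_hops(raw_message: str) -> List[str]:
    """Return the UTF8* 'with' keywords found in Received: fields, in order."""
    msg = message_from_string(raw_message)
    hops = []
    for value in msg.get_all("Received", []):
        match = _UTF8_WITH.search(value)
        if match:
            hops.append(match.group(1).upper())
    return hops
```

An empty result means no hop recorded an SMTPUTF8 transfer; it does not by
itself prove the message is free of 8bit header content.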

> Maildrop programs used by smtp servers to pass messages to imap
> servers.

Absolutely have to be modified, because the IMAP server has to have this
information.

And guess what? EAI did exactly this: The SMTPUTF8 extension is valid in LMTP,
the maildrop protocol specified by the IETF.
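
For illustration, a maildrop client can tell whether the server side supports
this by looking for the SMTPUTF8 keyword in the LHLO (or EHLO) reply. The
parser below is a minimal sketch with a hypothetical function name; it assumes
the usual "250-..." / "250 ..." multi-line reply format and skips the first
line, which carries the server's greeting rather than an extension keyword.

```python
def advertises_smtputf8(reply: str) -> bool:
    """Return True if an EHLO/LHLO reply lists the SMTPUTF8 extension."""
    lines = reply.splitlines()
    # The first line is the greeting/domain, so extensions start at line 2.
    for line in lines[1:]:
        if len(line) < 4 or not line.startswith("250"):
            continue
        # Drop the "250-" or "250 " prefix, then take the keyword.
        parts = line[4:].split()
        if parts and parts[0].upper() == "SMTPUTF8":
            return True
    return False
```

Only when the keyword is advertised may the client add the SMTPUTF8 parameter
to MAIL FROM, which is how the bit travels with the envelope.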

> Mailing list software.  Mailing list archives.  How you call
> spamassassin.

Mailing lists need access in order to propagate the bit. But they are operating
at the envelope level in order to have access to the list name, so I'm not
seeing the problem.

Archiving and AS/AV software generally operates in "has to handle anything
tossed at it" mode, so aside from logging the bit or using it as another input, I'm
not sure there's much point. And if there is, compliance archiving typically
comes into play at the SMTP or IMAP APPEND, both of which have this information
readily available. Operational archiving requires access to store metadata for
other reasons. And milter in theory provides access to SMTP extension
information, so that case is also covered.

> The thread about this during the discussion suggested adding
> a new field to the From_ mbox separator.  Guess a new attribute field in
> the maildir spec?

Yes, the back end store format has to store this information for IMAP
to work properly.

> > > > > The problem we're having with 6532 messages, is that we moved from
> > > > > explicitly identified charsets via 2047/etc mechanisms, to "its just
> > > > > utf8"... and sometimes we mis-detect the utf8 as cp1250 or other
> > > > > encodings.
> > > >
> > > > No message created prior to the release of support for RFC 6532 can be
> > > > assumed to be a RFC 6532 message. Now, if you want to vet those messages
> > > > using some sort of process to insure such a message meets the syntax and
> > > > at least looks semantically sensible, then I suppose you could set the
> > > > flag in the metadata.
> > > >
> > > > But if you can't distinguish such messages from legitimate RFC 6532
> > > > created and submitted by compliant clients, it sounds to me like you're
> > > > not retaining a really critical piece of envelope/metadata in your
> > > > implementation.
> > >
> > > Yes.  And the most obvious place for that information, to me, is in the
> > > headers of the message.  Assuming that every mechanism for exchanging
> > > email messages needs an explicit external piece of data...
> >
> > Sorry, I'm not going to revisit or defend past design decisions. My point
> > was and is that when you said that a piece of information was missing,
> > that statement was incorrect.

> I'm saying that requiring the information to be external grossly increases
> the development required to support the standard.

So does using SMTP and IMAP extensions, as opposed to the just-send-UTF-8
approach your proposal is likely to result in, and which the WG rejected.  Your
point is ... what, exactly?

> I already pointed out
> some common tools above.  For us, not having to add a new piece of external
> metadata means that all we have to do is upgrade our parser... and validate
> how we use addresses and fix hopefully minor issues.  If we have to use
> external metadata, then we have 100s of data paths and data stores that
> need to be upgraded.

Quite possible.

> We've also already agreed that leaks happen, which means separating the
> metadata from the data itself guarantees that under specified data will
> leak.

Again, you're begging the question.

>  Making the message itself well specified means that
> re-synchronization can take place, that messages can pass through agnostic
> mechanisms (whether agnostic by choice or by happenstance) and be
> understood on the other side.

> > You may not like or agree with how this was done. (For that matter, I never
> > said I liked or agreed with how it was done.) But for better or worse,
> > there's now a standard in place, and absent compelling evidence of there
> > being a problem with implementing that standard - evidence which AFAICT you
> > have not provided - it's not appropriate to propose competing mechanisms to
> > that standard.

> I thought implementation feedback was a useful thing, and generally desired
> prior to something becoming a standard.  I also thought that internet
> standards were generally considered a work in progress... 6532 is standards
> track but not a STD.

This is implementation feedback? On a list where proposals for header
extensions are typically discussed, rather than either of the two lists for EAI
discussions and feedback?

Sure doesn't sound like it to me. Rather, it sounds like an argument for what
is effectively a competing proposal, and coming from someone at Google, that's
cause for major concern.

> I also wasn't proposing a competing mechanism, I was proposing that
> implementation experience showed, to me, a strong possibility that a
> clarification or change to the standard would be beneficial.  Also,
> frankly, there is nothing in the standards that say we MUST not do this, so
> thanks for the exhortation from authority, I'll take it under advisement.

Of course there isn't. There also isn't a statement to the effect that you
should develop your own private, different competing SMTP and IMAP extensions
that use GB18030 instead of utf-8. Rather, it's hoped that people doing such
things will take the possible consequences of their actions into account.

And in this case the consequences could be extremely dire. If people assume
such an indicator in the message body is effectively the same as the SMTPUTF8
extension, things could get very ugly indeed.

> Frankly, I don't understand this concept of refusing to revisit or defend.

Seriously? You think every decision we made, including the most fundamental
ones, should be open for review and required to be defended any time anyone
shows up with what they think is a problem?

You have way more time to spend on this stuff than I do.

>  I tried to find and follow the discussion in the eai wg archives... and to
> me, the decision seemed under explored.  The original section spent a lot
> of real estate on the importance of such an explicit in message spec, only
> to be removed.  I found two threads about it, about equal numbers of people
> contributing to either side (like 1-2 on each side), a somewhat heavy
> handed dismissal of the need, a call for consensus that passed due mostly
> to non-contributors.  Perhaps more took place in meetings, or there were
> other discussions that someone could point me to.

> Perhaps you can explain to me what 'compelling evidence of there being a
> problem implementing the standard' would mean in practice?

For me, compelling evidence would consist of a set of compelling use cases
along with an analysis showing the cost versus benefit. Thus far I've seen no
acknowledgment of the likely costs whatsoever.

				Ned