Re: [ietf-822] utf8 messages

Ned Freed <ned.freed@mrochek.com> Tue, 12 August 2014 15:26 UTC

MIME-version: 1.0
Content-transfer-encoding: 7bit
Content-type: TEXT/PLAIN; CHARSET="us-ascii"
Message-id: <01PBABOOL4QO0000SM@mauve.mrochek.com>
Date: Tue, 12 Aug 2014 07:54:30 -0700
From: Ned Freed <ned.freed@mrochek.com>
In-reply-to: "Your message dated Tue, 12 Aug 2014 01:20:27 -0700" <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com>
References: <CABa8R6tWEhjjZSvq6NbM7EimokOms3suZufn0-6N1SB_fzGM8Q@mail.gmail.com> <01PB9FABWA4E0000SM@mauve.mrochek.com> <CABa8R6tns-idiZTj=+vb9fVNyH-nNYT+w9oNMb80XbCs5osvFw@mail.gmail.com>
To: Brandon Long <blong@google.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/ietf-822/SXIndcvODvhjQDt9gPWiOwYpZtI
Cc: ietf-822@ietf.org, Ned Freed <ned.freed@mrochek.com>
Subject: Re: [ietf-822] utf8 messages
Precedence: list

> > It is, or is supposed to be, a sealed system implemented as a set of
> > interlocking extensions to existing email facilities.

> So, if I have an "email" message, I can no longer just parse it.  Instead,
> there are actually two types of email messages, and the only way to know how
> to parse it is to know a priori which type it is.
> Because all systems are "sealed" and there's never any leakage.

Wrong on all counts, I'm afraid. First and foremost, in practice you have never
been able to "just parse" email messages, for the simple reason that too many
message creation agents don't follow the rules and create wildly noncompliant
messages. So there are always some heuristics involved, unless of course you're
willing to only accept syntactically valid messages, in which case yes, you
can "just parse" messages, including EAI messages.
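
As a rough illustration of that trade-off, here is a minimal sketch using
Python's standard email package. The sample message is made up, and the
strict policy only rejects what this particular parser happens to detect;
it is not a full RFC 5322 validator.

```python
import email
import email.errors
import email.policy

# Made-up sample: the second line is not a valid header field and there is
# no blank line before it, so the header section is malformed.
RAW = b"Subject: test\nthis line has no colon\n\nbody text\n"


def parse_lenient(raw: bytes):
    """Parse tolerantly; problems are recorded in msg.defects, not raised."""
    return email.message_from_bytes(raw, policy=email.policy.default)


def parse_strict(raw: bytes):
    """Parse strictly; return None if the parser reports any defect."""
    try:
        return email.message_from_bytes(raw, policy=email.policy.strict)
    except email.errors.MessageDefect:
        return None


lenient = parse_lenient(RAW)
# The lenient parse still yields a usable Subject plus a list of defects.
print(lenient["Subject"], [type(d).__name__ for d in lenient.defects])
# The strict parse refuses the same input.
print(parse_strict(RAW))
```

Where to sit between these two extremes is exactly the line-drawing question
discussed next.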

Where the lines are drawn has always been a tradeoff, and one which has
changed over the years. It used to be the case that a lot more crap was
generally tolerated. The spam problem has led to an overall tightening up
of what's tolerated.

Second, because of the overall lack of compliance, there have always been
many types of messages.

Third, of course there's leakage.

> As for just check for 8 bit messages... on to the next part.

> > > Our problem is that this isn't actually true in practice.  Prior to
> > > launching support for 6532 messages, we've already had to support
> > > widespread use of 8bit messages that were not always in utf8.  Since these
> > > typically didn't specify which charset they were in, we used a variety of
> > > techniques including direct charset detection on such messages.
> >
> > It depends on what you mean by "8bit message". If you mean messages with 8bit
> > in the body data, then sure, that's fully standard and widespread.
> >
> > But 8bit of any sort in a header at any level was a standards violation prior
> > to RFC 6532. And since there are many 8bit charsets, and telling them apart is
> > in general impossible (although intelligent guesses can be made) without
> > labeling (which implies conversion to 7bit), this was never a terribly
> > interoperable thing to be doing from day 1.
> >
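
To make the "telling them apart is in general impossible" point concrete,
here is a small sketch. The octet and the three candidate charsets are
arbitrary choices for illustration; nothing here comes from a real message.

```python
# One octet with the high bit set, as it might appear unlabeled in a header.
OCTET = b"\xf5"

# A few single-byte charsets, picked arbitrarily for illustration.
CANDIDATES = ["cp1250", "cp1252", "koi8_r"]


def decode_all(data: bytes, charsets):
    """Decode the same bytes under every candidate charset."""
    return {name: data.decode(name) for name in charsets}


decoded = decode_all(OCTET, CANDIDATES)
for name, text in decoded.items():
    # Every decode succeeds, and each one yields a different character.
    print(f"{name}: U+{ord(text):04X}")
```

None of the decodes fails, so the bytes alone cannot say which answer is
right; a detector can only guess from context.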

> Yes, it wasn't a great idea.  Apparently, strict adherence to spec was not
> a strong concern from the non-English speaking world.  And, this worked
> fine before "global" services, such that local sites would just assume
> local charsets for these broken messages.  We've had to do auto-detection
> on messages for quite a while, and yes, it's sometimes broken.

Actually, no, it sucked from the get-go. Entire books were written on how to
deal with the CJK disaster long before any of this was anything even close to
being "global".

This mess is part of what led to the development of MIME. Yes, it really
does go back that far.

> > > The problem we're having with 6532 messages, is that we moved from
> > > explicitly identified charsets via 2047/etc mechanisms, to "it's just
> > > utf8"... and sometimes we mis-detect the utf8 as cp1250 or other encodings.
> >
> > No message created prior to the release of support for RFC 6532 can be assumed
> > to be an RFC 6532 message. Now, if you want to vet those messages using some
> > sort of process to ensure such a message meets the syntax and at least looks
> > semantically sensible, then I suppose you could set the flag in the metadata.
> >
> > But if you can't distinguish such messages from legitimate RFC 6532 messages
> > created and submitted by compliant clients, it sounds to me like you're not
> > retaining a really critical piece of envelope/metadata in your implementation.
> >
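
A minimal sketch of the octet-level half of such a vetting pass follows. The
function name and the three labels are hypothetical, and passing the check
shows only that the header bytes are well-formed UTF-8, not that the message
was created as an RFC 6532 message or that it is semantically sensible.

```python
def classify_header_octets(raw_headers: bytes) -> str:
    """Classify a raw header block by its octets alone (hypothetical labels)."""
    if raw_headers.isascii():
        # Pure 7bit: no 8bit octets in the headers at all.
        return "ascii"
    try:
        raw_headers.decode("utf-8")
    except UnicodeDecodeError:
        # 8bit but not UTF-8: the charset can only be guessed.
        return "unlabeled-8bit"
    # Well-formed UTF-8: a candidate only, since origin is still unknown.
    return "utf8-candidate"


print(classify_header_octets(b"Subject: hi\r\n"))
print(classify_header_octets("Subject: caf\u00e9\r\n".encode("utf-8")))
print(classify_header_octets("Subject: caf\u00e9\r\n".encode("cp1250")))
```

Whether a "utf8-candidate" result justifies setting the flag in the store's
metadata is a policy decision this sketch does not make.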

> Yes.  And the most obvious place for that information, to me, is in the
> headers of the message.  Assuming that every mechanism for exchanging email
> messages needs an explicit external piece of data...

Sorry, I'm not going to revisit or defend past design decisions. My point was
and is that when you said that a piece of information was missing, that
statement was incorrect.

You may not like or agree with how this was done. (For that matter, I never
said I liked or agreed with how it was done.) But for better or worse, there's
now a standard in place, and absent compelling evidence of there being a
problem with implementing that standard - evidence which AFAICT you have not
provided - it's not appropriate to propose competing mechanisms to that
standard.

> Is there a separate three letter filename extension for 6532 messages, for
> example? (for platforms which use such things).

Yes there is: .u8msg. See RFC 6532 section 3.7.

> > > Now, as hinted at in the consensus to remove such a marker from the draft,
> > > we can certainly add such a header when composing 6532 messages or when we
> > > receive any message via SMTPUTF8 for our own utility, but I would think
> > > there would be some utility in such a marker being mutually understood and
> > > shared.
> >
> > Please don't. This is the sort of thing that wrecks standards.

> So, it's ok to write a specification to have such a header but doing so
> wrecks standards?

> Or are you saying it would be fine for us to have such a header, but not to
> leak it?

It's one thing to manifest such a marker in your IMAP store as a header. People
have been exposing various sorts of store metadata that way for years, and it
seems to have been Mostly Harmless.

It's quite another to write a specification advocating using such a mechanism
to identify messages. If that's done it will create silly states when EAI
messages are sent with such a marker but without the SMTPUTF8 flag, and vice
versa.
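
The silly states can be enumerated directly. This is only a sketch: the two
booleans and the wording of the labels are assumptions made for illustration,
not anything defined by the EAI specifications.

```python
from itertools import product


def state(has_marker: bool, sent_with_smtputf8: bool) -> str:
    """Describe one combination of in-message marker and transport flag."""
    if has_marker and sent_with_smtputf8:
        return "consistent: flagged both ways"
    if not has_marker and not sent_with_smtputf8:
        return "consistent: flagged neither way"
    if has_marker:
        return "inconsistent: marker present, no SMTPUTF8"
    return "inconsistent: SMTPUTF8 used, marker missing"


for has_marker, via_smtputf8 in product([False, True], repeat=2):
    print(has_marker, via_smtputf8, "->", state(has_marker, via_smtputf8))
```

With two independent indicators, half of the four combinations disagree, and
every receiver has to decide which indicator to believe.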

				Ned
