Re: [apps-discuss] Feedback about "Update to MIME regarding Charset Parameter Handling in Textual Media Types"

Henri Sivonen <hsivonen@iki.fi> Thu, 23 February 2012 10:12 UTC

In-Reply-To: <01OCB41ZCJES00ZUIL@mauve.mrochek.com>
References: <CAJQvAudekOKa2mzas-igD_6pa2je000Darin2HDNda-sk9TLCQ@mail.gmail.com> <01OCB41ZCJES00ZUIL@mauve.mrochek.com>
Date: Thu, 23 Feb 2012 12:11:56 +0200
X-Google-Sender-Auth: XOEnrEdkzC1BaCx3toLQmDbCwkI
Message-ID: <CAJQvAufpNOJ85QpQs5DgWO1dztdxi-8DtQv0ZdVS-fs7reYB4Q@mail.gmail.com>
From: Henri Sivonen <hsivonen@iki.fi>
To: apps-discuss@ietf.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: Anne van Kesteren <annevk@opera.com>
Subject: Re: [apps-discuss] Feedback about "Update to MIME regarding Charset Parameter Handling in Textual Media Types"

On Thu, Feb 23, 2012 at 5:59 AM, Ned Freed <ned.freed@mrochek.com> wrote:
>> Additionally, media types should be able to define circumstances where
>> in-band indicators override the charset parameter even if the charset
>> parameter is present.
>
> That's a terrible way to do it

It's reality in WebKit and IE for text/html. So far, evidence suggests
that Gecko would serve users better if it changed to treat the BOM as
having higher precedence than the HTTP-level charset parameter for
text/html and text/javascript. (text/plain is loaded using the HTML
parser per the HTML spec and per reality, so it makes sense to do the
same for text/plain even though I have no evidence about
compatibility issues either way.)
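
To illustrate the precedence described above, here is a minimal sketch, not code from Gecko, WebKit, or IE. The function names and the "windows-1252" fallback are assumptions made for the example; real browsers use a configuration-dependent default.

```python
import codecs

# Byte order marks checked against the start of the payload.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(payload: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if payload.startswith(bom):
            return name
    return None


def pick_encoding(payload: bytes, http_charset=None, fallback="windows-1252"):
    """Hypothetical helper: the BOM wins over the HTTP charset parameter.

    The fallback value is an assumption standing in for a browser's
    locale-dependent default.
    """
    sniffed = sniff_bom(payload)
    if sniffed is not None:
        return sniffed
    if http_charset:
        return http_charset.strip().lower()
    return fallback
```

With this ordering, a payload that starts with EF BB BF is treated as UTF-8 even when the HTTP header says charset=ISO-8859-1.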

> - if the type is self-identifying in terms of
> charset, a charset parameter should simply not be defined for the type -
> exactly what the current specification says to do.

Logically, yes, but that's not how text/html, text/xml and text/css work.

>> In particular, media types should be allowed to override the charset
>> parameter if the first two or three bytes of the payload look like an
>> a UTF-16 or UTF-8 BOM.
>
> There are quite a few charsets in existence where it is perfectly permissible
> for the first few bytes to match a BOM, except that it means something entirely
> different.

Seems to work for IE and WebKit in practice.

>> This seems wrong. If the charset parameter is present, it has an
>> effect for text/xml.
>
> That's only because the definition of text/xml did it incorrectly.

There's text/xml legacy by now, so chances are it's too late to make
HTTP charset have no effect for text/xml.

>> > or b) require explicit unconditional inclusion of the "charset" parameter
>> > eliminating the need for a default value.
>
>> This seems naïve.

It's awesome that the IETF's own Web interface to the list archive
mangles the ï here. Yay dogfood.

>> Formats need to specify what happens when a charset
>> parameter is missing, since no matter how much the format says it's
>> "required", the party sending data can omit the charset parameter.
>
> Yep. They can also misspell the parameter name, misspell the charset name, use
> the wrong type/subtype, misspell the header field name, omit the field
> entirely, or any of a million other mistakes.
>
> And when (not if) this sort of thing happens, the receiver can elect to pursue
> whatever course of action it deems appropriate for invalid material.

No. To achieve interoperability, the behavior in error cases needs to
be well-defined. Broken content will depend on the behavior of
incumbent consumers. If that behavior isn't specified, new entrants to
the market face a barrier to competition when they need to reverse
engineer incumbents instead of just reading specs to see what needs to
be done.

>> > The default charset parameter value for text/plain is unchanged from
>> > [RFC2046] and remains as "US-ASCII".
>
> Note that this has the effect of rendering any content that contains 8bit
> as being invalid.

I'm OK with defining the absence of the charset parameter for
text/plain as invalid if the BOM is also absent.

>> This is incompatible with reality. Web browsers, for instance, assume
>> a configuration-dependent default (which correlates with browser
>> localization) and may also (depending on configuration which, again,
>> correlates with localization by default) perform a heuristic analysis
>> on the payload.
>
> Well, on this one you'll have to argue with someone else. I don't especially
> like the approach in the draft, but I have been unable to come up with any
> reasonable alternative. Your proposed alternative below is totally unworkable.

Why is it unworkable? It's what Gecko does with the modification that
the precedence of the BOM is as in WebKit and IE.

>> New text/* media types SHOULD use the following algorithm:
>
>> The character encoding is UTF-8. Terminate these steps.
>
> This approach is ridiculous when you consider how general the usage of some
> text types is and the close similarity of many charsets in common use.

The UTF-8-only recommendation is for new text/* types which (by
definition of "new") don't have any charsets in common use previously.

> It is
> not at all common for there to only be one or two characters in a large
> document that display incorrectly when the wrong charset is selected.

You mean like "naÃve" in the archive copy of your email? ;-)

> It is also
> unacceptable to ignore a valid label just because you don't support
> the encoding.

That's how browsers work.
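
As a rough sketch of that behavior (an illustration only, not browser code): a label the decoder does not support is treated as if it were absent, so the default applies. The names `usable_label` and `decode_text` and the UTF-8 default are assumptions for the example.

```python
import codecs


def usable_label(label):
    """Return the normalized label if this decoder supports it, else None."""
    if not label:
        return None
    try:
        codecs.lookup(label.strip())
    except LookupError:
        # Unsupported or misspelled label: behave as if no label was sent.
        return None
    return label.strip().lower()


def decode_text(payload: bytes, label=None, default="utf-8"):
    """Decode with the labeled encoding if supported, else the assumed default."""
    encoding = usable_label(label) or default
    # Malformed byte sequences become U+FFFD instead of raising an error.
    return payload.decode(encoding, errors="replace")
```

The point of the sketch is that the error path is deterministic: every consumer that follows the same rule decodes a mislabeled payload the same way.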

> Finally, I'd say the odds of any implementor following such a complex
> set of rules are remote in the extreme.

As an implementor intending to implement the steps, I disagree. Gecko
already implements those steps with the BOM steps in a different order
relative to the charset parameter. I expect to patch Gecko to make the
BOM take precedence over the charset parameter in due course to match
IE and WebKit in text/html handling (and text/plain reuses the
text/html code path).

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/