Unknown text/* subtypes

eric at w3.org (Eric Prud'hommeaux) Sun, 13 January 2008 19:41 UTC

From: "eric at w3.org"
Date: Sun, 13 Jan 2008 19:41:08 +0000
Subject: Unknown text/* subtypes
In-Reply-To: <Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
References: <20071218114549.GQ8244@w3.org> <6.0.0.20.2.20071226102314.083e2170@localhost> <fl1mtc$d82$1@ger.gmane.org> <Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
Message-ID: <20080113184012.GD23317@w3.org>
X-Date: Sun Jan 13 19:41:08 2008

* Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > 
> > Years later (after 2616bis) it might be possible to upgrade "default 
> > ASCII" to UTF-8, Latin-1 was a dead end.  As soon as we're back to 
> > "default ASCII" just let RFC 2277 finish it off.
> 
> FWIW, a number of specs are already overriding both MIME and HTTP when it 
> comes to character encodings. For example HTML4 says to not default to any 
> encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as 
> currently proposed defaults to an even more complicated heuristic [3], and 
> so on.
> 
> In the "real world" the implementations are following the heuristics 
> described in CSS2.1 and HTML5 (or something close to them), and those 
> differ for text/css and text/html, so it would seem pointless for HTTP to 
> try to define something here: it would just get ignored.
> 
> IMHO the best option is for HTTP to stay out of the discussion altogether 
> and let the lower level specs (MIME) and the higher level specs (XML, 
> HTML, CSS, etc, defining the formats) figure it out amongst themselves.

I think this is consistent with Martin's proposal that HTTP1.1bis not
set a default encoding
  http://www.w3.org/2008/01/rdf-media-types#noDefault
(noting that Frank Ellermann believed the default should be us-ascii for
 the same effect)
  http://www.w3.org/2008/01/rdf-media-types#defAscii

What we still need, however, is an update to 2046 that reflects
current practice (and eases the discovery process for folks
registering non-ascii text/ media types). Let's geek out the
changes we'd like to see.

• CRLF rules:
[[
  The canonical form of any MIME "text" subtype MUST always represent
  a line break as a CRLF sequence.  Similarly, any occurrence of CRLF
  in MIME "text" MUST represent a line break.  Use of CR and LF
  outside of line break sequences is also forbidden.
]] -- RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
is not respected by HTTP1.1, nor is it respected in general when
shipping text/xml.
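
To make the quoted rule concrete, here is a minimal Python sketch of a check for it. The function name is_canonical_text is hypothetical, and the check works on raw octets, so it assumes a charset in which CR and LF are the single octets 0x0D and 0x0A (it would not be meaningful for UTF-16, for example).

```python
import re

# A bare CR is a CR not followed by LF; a bare LF is an LF not preceded by CR.
_BARE_CR_OR_LF = re.compile(rb"\r(?!\n)|(?<!\r)\n")


def is_canonical_text(body: bytes) -> bool:
    """Return True if every CR and LF in body is part of a CRLF pair.

    This mirrors the RFC 2046 section 4.1.1 wording quoted above: CRLF is
    the only line break, and CR or LF outside of CRLF is forbidden.
    """
    return _BARE_CR_OR_LF.search(body) is None
```

A typical text/xml body saved on a Unix system uses bare LF and would fail this check, which is the mismatch described above.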

Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
web servers, proxies or clients)? I would like to think that MIME
shouldn't care about recognizing new lines in the text block.

If it can't go away, can it be relaxed in accordance with HTTP 1.1
[[
  The line terminator for message-header fields is the sequence CRLF.
  However, we recommend that applications, when parsing such headers,
  recognize a single LF as a line terminator and ignore the leading
  CR.
]] -- RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3
or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
[[
   1. the two-character sequence #xD #xA

   2. the two-character sequence #xD #x85

   3. the single character #x85

   4. the single character #x2028

   5. any #xD character that is not immediately followed by #xA or
      #x85.
]] -- XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

The XML 1.1 rule interacts with character encoding because, while most
character encodings line up with ascii on CR and LF, clearly none do
on #x85 and #x2028.
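
The two relaxed rules quoted above can be sketched in Python as follows. Both function names are hypothetical. The first follows the RFC 2616 recommendation for header lines (split on LF, ignore a leading CR); the second applies the five XML 1.1 cases to already-decoded text, replacing each with a single #xA. The final lines illustrate the encoding interaction: NEL and LINE SEPARATOR have no single ASCII-compatible octet, so they can only be recognised after the charset is known.

```python
import re


def split_header_lines(raw: bytes) -> list:
    """Split header octets on LF, dropping the CR that may precede it."""
    lines = raw.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # discard the empty piece after a final terminator
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]


# Alternatives are tried left to right, so the two-character sequences
# are matched before a lone #xD.
_XML11_LINE_END = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_xml11_line_ends(text: str) -> str:
    """Replace each XML 1.1 line-end sequence with a single #xA."""
    return _XML11_LINE_END.sub("\n", text)


# The same characters as octets in different encodings:
nel_utf8 = "\x85".encode("utf-8")            # two octets, C2 85
ls_utf8 = "\u2028".encode("utf-8")           # three octets, E2 80 A8
octet_85_cp1252 = b"\x85".decode("cp1252")   # an ellipsis, not a line end
```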


• character encoding:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive.  The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.

The specification for any future subtypes of "text" must specify
whether or not they will also utilize a "charset" parameter, and may
possibly restrict its values as well.  For other subtypes of "text"
than "text/plain", the semantics of the "charset" parameter should be
defined to be identical to those specified here for "text/plain",
i.e., the body consists entirely of characters in the given charset.
In particular, definers of future "text" subtypes should pay close
attention to the implications of multioctet character sets for their
subtype definitions.

The charset parameter for subtypes of "text" gives a name of a
character set, as "character set" is defined in RFC 2045.  The rules
regarding line breaks detailed in the previous section must also be
observed -- a character set whose definition does not conform to these
rules cannot be used in a MIME "text" subtype.
]] -- RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.
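
As an illustration of the current rule (case-insensitive charset values, US-ASCII when the parameter is absent), here is a small Python sketch using the standard library email package. The helper name declared_charset is hypothetical; get_content_charset is the stdlib call, which lower-cases the value and returns the fallback when no charset parameter is present.

```python
from email import message_from_string


def declared_charset(content_type_header: str) -> str:
    """Return the charset for a header line, defaulting to us-ascii."""
    msg = message_from_string(content_type_header + "\n\n")
    return msg.get_content_charset(failobj="us-ascii")
```

Under this reading, "charset=UTF-8" and "charset=utf-8" are the same value, and a bare "text/plain" is treated as US-ASCII regardless of what the body contains.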

When should the "default" character set apply?
  • no charset parameter
  • no charset parameter, no fixed encoding for the media type
  • no charset, no fixed encoding, no internal encoding declaration

The current text specifies the first, while HTML and CSS count on the
third. From the use case of "best effort rendering", we are already in
a state where users who are better-informed than their web or mail
clients manually set the encoding so they can see the right
characters. The following heuristics may meet or exceed the user
experience with today's data while advancing the state of the art to
enable better rendering with future data:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The first of the following
determinants that apply will identify the character set:

  1. charset parameter

  2. fixed encoding registered with the media type, if known

  3. encoding algorithm registered with the media type, if known

  4. UTF-8 if the document conforms to the UTF-8 encoding pattern

  5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]

  6. application preference
]]
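
The proposed ordering can be sketched in Python as below. Everything named here is an assumption for illustration: pick_charset, the two registry dictionaries (there is no such machine-readable registry in the proposal), the entry for text/wibbly, and the windows-1252 default for "application preference". The sketch also assumes that a registered algorithm which finds nothing falls through to the next step, which the list does not say.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for "registered with the media type".
FIXED_ENCODING: Dict[str, str] = {"text/wibbly": "utf-8"}
SNIFFERS: Dict[str, Callable[[bytes], Optional[str]]] = {}

# Octets allowed by step 5: CR, LF and 0x20 through 0x7E.
PLAIN_OCTETS = frozenset(b"\r\n") | frozenset(range(0x20, 0x7F))


def pick_charset(media_type: str, charset_param: Optional[str], body: bytes,
                 app_preference: str = "windows-1252") -> str:
    """Return the first charset determinant that applies, in list order."""
    if charset_param:                          # 1. charset parameter
        return charset_param.lower()
    media_type = media_type.lower()
    if media_type in FIXED_ENCODING:           # 2. fixed encoding
        return FIXED_ENCODING[media_type]
    sniff = SNIFFERS.get(media_type)
    if sniff is not None:                      # 3. registered algorithm
        found = sniff(body)
        if found:
            return found.lower()
    try:                                       # 4. conforms to UTF-8
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if all(octet in PLAIN_OCTETS for octet in body):   # 5. printable ASCII
        return "iso-8859-1"
    return app_preference                      # 6. application preference
```

One thing the sketch makes visible: if "conforms to the UTF-8 encoding pattern" is read as "decodes as UTF-8", every body that would satisfy step 5 already satisfies step 4, so step 5 is never reached. Step 4 may have been meant to require at least one multi-octet sequence.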

@@charset constraints -- can it have faux line feeds?

@@bidi? Martin, what do you think?

@@lowest common denominator:
  RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
Is it better to encourage the world to write "UTF-8" or "US-ASCII"
for ascii subset? tension between lcd and one common encoding.

@@Content-Transfer-Encoding: Base64
  Content-Type: text/wibbly
How does TE affect this? I suspect it's completely orthogonal.
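
A small Python sketch in support of the "orthogonal" suspicion, using the standard library email package. The helper name decode_text_part and the sample message are made up for illustration. The transfer encoding is undone first, which yields octets, and only then does the charset question arise.

```python
import base64
from email import message_from_string


def decode_text_part(raw_message: str) -> str:
    """Undo the transfer encoding, then interpret the octets by charset."""
    msg = message_from_string(raw_message)
    octets = msg.get_payload(decode=True)          # step 1: base64 to octets
    charset = msg.get_content_charset("us-ascii")  # step 2: charset lookup
    return octets.decode(charset)


# Sample message: a UTF-8 body carried as base64.
encoded_body = base64.b64encode("h\u00e9llo\r\n".encode("utf-8")).decode("ascii")
raw = (
    "Content-Type: text/wibbly; charset=UTF-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n" + encoded_body + "\n"
)
text = decode_text_part(raw)
```

In this sketch the line-break and charset rules apply to the decoded octets, not to the base64 lines on the wire.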

> -- Footnotes --
> 
> [1] http://www.w3.org/TR/html4/charset.html#h-5.2.2
> This text explicitly says that HTTP's default is useless. It then 
> recommends behaviour that is even more useless, but that's another 
> problem altogether...
> 
> [2] http://www.w3.org/TR/CSS21/syndata.html#charset
> 
> [3] http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#determining
> 
> Cheers,
> -- 
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

-- 
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.
>From ned.freed@mrochek.com  Mon Jan 14 01:26:01 2008
From: ned.freed at mrochek.com (Ned Freed)
Date: Mon Jan 14 01:44:11 2008
Subject: Unknown text/* subtypes
In-Reply-To: "Your message dated Sun, 13 Jan 2008 13:40:12 -0500"
	<20080113184012.GD23317@w3.org>
Message-ID: <01MQ1UZ0IIRA00004Z@mauve.mrochek.com>

> * Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> > On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > >
> > > Years later (after 2616bis) it might be possible to upgrade "default
> > > ASCII" to UTF-8, Latin-1 was a dead end.  As soon as we're back to
> > > "default ASCII" just let RFC 2277 finish it off.
> >
> > FWIW, a number of specs are already overriding both MIME and HTTP when it
> > comes to character encodings. For example HTML4 says to not default to any
> > encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> > currently proposed defaults to an even more complicated heuristic [3], and
> > so on.
> >
> > In the "real world" the implementations are following the heuristics
> > described in CSS2.1 and HTML5 (or something close to them), and those
> > differ for text/css and text/html, so it would seem pointless for HTTP to
> > try to define something here: it would just get ignored.
> >
> > IMHO the best option is for HTTP to stay out of the discussion altogether
> > and let the lower level specs (MIME) and the higher level specs (XML,
> > HTML, CSS, etc, defining the formats) figure it out amongst themselves.

> I think this is consistent with Martin's proposal that HTTP1.1bis not
> set a default encoding
>   http://www.w3.org/2008/01/rdf-media-types#noDefault
> (noting that Frank Ellermann believed the default should be us-ascii for
>  the same effect)
>   http://www.w3.org/2008/01/rdf-media-types#defAscii

> What we still need, however, is an update to 2046 that reflects
> current practice (and eases the discovery process for folks
> registering non-ascii text/ media types). Let's geek out the
> changes we'd like to see.

You might, and I emphasize might, be able to get this changed to a
protocol-specific restriction. (The MIME specifications specify both an
email-specific extension as well as some more generally useful facilities.)
There is no chance of this rule being lifted in general.

> • CRLF rules:
> [[
>   The canonical form of any MIME "text" subtype MUST always represent
>   a line break as a CRLF sequence.  Similarly, any occurrence of CRLF
>   in MIME "text" MUST represent a line break.  Use of CR and LF
>   outside of line break sequences is also forbidden.
> ]] -- RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
> is not respected by HTTP1.1, nor is it respected in general when
> shipping text/xml.

> Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
> web servers, proxies or clients)?

Not only does email depend on this, conformance to this has been dramatically
strengthened, not weakened, in subsequent revisions of the email protocol
specification. Specifically, RFC 821 was essentially silent on what bare CR and
LF mean, but 2821 and 2821bis (now in last call) both say that bare CR and LF
MUST NOT be sent and if received MUST NOT be treated as CRLF.

This, incidentally, is not the way I personally think things should have been
done. I like the "ignore bare CR, treat LF like CRLF" approach. But my personal
opinion isn't especially relevant - I mention it only to avoid "shoot the
messenger" sorts of responses.
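
For reference, the lenient approach described above (a personal preference, not what 2821 requires) can be sketched in Python like this. The function name lenient_to_crlf is hypothetical.

```python
def lenient_to_crlf(data: bytes) -> bytes:
    """Ignore bare CR and treat every LF as a CRLF line break.

    Dropping all CRs first and then expanding LF leaves existing CRLF
    pairs as CRLF, removes bare CRs, and turns bare LFs into CRLF.
    """
    return data.replace(b"\r", b"").replace(b"\n", b"\r\n")
```

Under the 2821 rules described above, a receiver would instead leave bare CR and bare LF alone and not treat them as line breaks.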

> I would like to think that MIME
> shouldn't care about recognizing new lines in the text block.

I'm sorry, but that's fanciful in the extreme.

> If it can't go away, can it be relaxed in accordance with HTTP 1.1
> [[
>   The line terminator for message-header fields is the sequence CRLF.
>   However, we recommend that applications, when parsing such headers,
>   recognize a single LF as a line terminator and ignore the leading
>   CR.
> ]] -- RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3

Again, I personally think this is the way to go. But that's not what
has happened.

> or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
> [[
>    1. the two-character sequence #xD #xA

>    2. the two-character sequence #xD #x85

>    3. the single character #x85

>    4. the single character #x2028

>    5. any #xD character that is not immediately followed by #xA or
>       #x85.
> ]] -- XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

> The XML 1.1 rule interacts with character encoding because, while most
> character encodings line up with ascii on CR and LF, clearly none do
> on #x85 and #x2028

> • character encoding:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive.  The default character set, which
> must be assumed in the absence of a charset parameter, is US-ASCII.

> The specification for any future subtypes of "text" must specify
> whether or not they will also utilize a "charset" parameter, and may
> possibly restrict its values as well.  For other subtypes of "text"
> than "text/plain", the semantics of the "charset" parameter should be
> defined to be identical to those specified here for "text/plain",
> i.e., the body consists entirely of characters in the given charset.
> In particular, definers of future "text" subtypes should pay close
> attention to the implications of multioctet character sets for their
> subtype definitions.

> The charset parameter for subtypes of "text" gives a name of a
> character set, as "character set" is defined in RFC 2045.  The rules
> regarding line breaks detailed in the previous section must also be
> observed -- a character set whose definition does not conform to these
> rules cannot be used in a MIME "text" subtype.
> ]] -- RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.

> When should the "default" character set apply?
>   • no charset parameter
>   • no charset parameter, no fixed encoding for the media type
>   • no charset, no fixed encoding, no internal encoding declaration

> The current text specifies the first, while HTML and CSS count on the
> third. From the use case of "best effort rendering", we are already in
> a state where users who are better-informed than their web or mail
> clients manually set the encoding so they can see the right
> characters. The following heuristics may meet or exceed the user
> experience with today's data while advancing the state of the art to
> enable better rendering with future data:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive. The first of the following
> determinants that apply will identify the character set:

>   1. charset parameter

>   2. fixed encoding registered with the media type, if known

>   3. encoding algorithm registered with the media type, if known

>   4. UTF-8 if the document conforms to the UTF-8 encoding pattern

>   5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]

>   6. application preference
> ]]

Again, there is absolutely no chance this will fly for email, so it cannot be
written with this degree of generality. And if this is made protocol-specific,
the specifics of any protocol other than email don't belong in an RFC 2046
revision.

> @@charset constraints -- can it have faux line feeds?

> @@bidi? Martin, what do you think?

> @@lowest common denominator:
>   RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
> Is it better to encourage the world to write "UTF-8" or "US-ASCII"
> for ascii subset? tension between lcd and one common encoding.

Marking something as utf-8 when it is in fact restricted to the us-ascii subset
has been known to cause problems. I think change in this area is unlikely.
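
The "lowest common denominator" labelling discussed here can be sketched in Python as follows. The function name minimal_charset_label is hypothetical, and the sketch only distinguishes US-ASCII from UTF-8; it is an illustration of the idea, not a statement of what any particular mail client does.

```python
def minimal_charset_label(text: str) -> str:
    """Label text as us-ascii when it fits, and as utf-8 only otherwise."""
    return "us-ascii" if text.isascii() else "utf-8"
```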

				Ned
>From justivo@gmail.com  Mon Jan 14 13:53:33 2008
From: justivo at gmail.com (=?UTF-8?Q?Ivo_Emanuel_Gon=C3=A7alves?=)
Date: Mon Jan 14 14:18:53 2008
Subject: Request for review of Ogg Media Types: video/ogg, audio/ogg,
	application/ogg
In-Reply-To: <dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com>
References: <dc107ee70712031027q47e6748bk957986ea5db7467c@mail.gmail.com>
	<fj1s1p$dkn$1@ger.gmane.org>
	<dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com>
Message-ID: <dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com>

Hello list,

This is the continuation of a thread started by me roughly a month
ago.  As a reminder, anyone may still post feedback regarding the Ogg
media types described in [1].  Feedback regarding good and bad aspects
is equally welcome.

So far, only Mr Ellermann commented on the registration proposal.
Please don't feel shy.

-Ivo

[1] http://www.ietf.org/internet-drafts/draft-goncalves-rfc3534bis-00.txt
>From simon@josefsson.org  Mon Jan 14 14:34:00 2008
From: simon at josefsson.org (Simon Josefsson)
Date: Mon Jan 14 14:34:15 2008
Subject: Request for review of Ogg Media Types: video/ogg, audio/ogg,
	application/ogg
In-Reply-To: <dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com>
	("Ivo Emanuel =?iso-8859-1?Q?Gon=E7alves=22's?= message of "Mon, 14 Jan
	2008 12:53:33 +0000")
References: <dc107ee70712031027q47e6748bk957986ea5db7467c@mail.gmail.com>
	<fj1s1p$dkn$1@ger.gmane.org>
	<dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com>
	<dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com>
Message-ID: <87prw4sf93.fsf@mocca.josefsson.org>

"Ivo Emanuel Gonçalves" <justivo@gmail.com> writes:

> Hello list,
>
> This is the continuation to a thread started by me roughly a month
> ago.  As a reminder, anyone may still post feedback regarding the Ogg
> media types described in [1].  Feedback regarding good and bad aspects
> is equally welcome.

I read the draft, and it looks fine to me.  There is a reference for
base64 to RFC2397 which looks odd to me; please consider using RFC4648
instead.

Thanks for including the liberal 'Copying Conditions' section, which
makes it possible to include the document in free software packages.

Thanks,
Simon
>From nobody@xyzzy.claranet.de  Tue Jan 15 04:12:32 2008
From: nobody at xyzzy.claranet.de (Frank Ellermann)
Date: Tue Jan 15 04:12:29 2008
Subject: Unknown text/* subtypes
References: <20071218114549.GQ8244@w3.org>
	<4767D82F.1060003@gmx.de><fk8puv$5f8$1@ger.gmane.org>
	<6.0.0.20.2.20071226102314.083e2170@localhost>
	<fl1mtc$d82$1@ger.gmane.org>
	<Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
Message-ID: <fmh89s$g2g$1@ger.gmane.org>

Ian Hickson wrote:
 
> For example HTML4 says to not default to any encoding at all [1]
[...]

Yes, but HTTP has to work for plain text, pre-HTML 4, etc., and I
think HTTP needs its own idea of what is allowed in an HTTP header.

If one side refuses to say what the body is, the other side needs
a working assumption for the job at hand (= HTTP transmission). 
How browsers display a body (if at all) is a different question.

"Assume it's something remotely related to ASCII, i.e. all octets
 that could be ASCII actually are ASCII" is good enough for HTTP,
isn't it?  I don't see where "assume Latin-1" is actually needed
today with respect to *HTTP*, even for HTML 2 (or arguably 3.2).  
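
One way to picture the "something remotely related to ASCII" assumption is the Python sketch below. The two helper names are hypothetical; the surrogateescape error handler is a standard Python feature. Octets below 0x80 are read as ASCII, and every other octet is carried along untouched so that the body can be passed on unchanged.

```python
def as_ascii_compatible(octets: bytes) -> str:
    """Read ASCII octets as ASCII and keep all other octets opaque."""
    return octets.decode("ascii", errors="surrogateescape")


def back_to_octets(text: str) -> bytes:
    """Recover exactly the original octets for retransmission."""
    return text.encode("ascii", errors="surrogateescape")
```

This is enough to find CR, LF and other ASCII delimiters without deciding whether the remaining octets are Latin-1, UTF-8 or something else.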

The W3C validator ignores this HTML detail - AFAIK I'm the only
user who ever asked if that's as it should be.  It is irrelevant
outside of validator torture tests... :-)

> it would seem pointless for HTTP to try to define something
> here: it would just get ignored.

I think we mean the same thing when I propose that it's pointless
to define "something different from MIME" in the HTTP spec., a
normative MIME reference (+ explanation of the change) will do.

 Frank