Unknown text/* subtypes
eric at w3.org (Eric Prud'hommeaux) Sun, 13 January 2008 19:41 UTC
From: "eric at w3.org"
Date: Sun, 13 Jan 2008 19:41:08 +0000
Subject: Unknown text/* subtypes
In-Reply-To: <Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
References: <20071218114549.GQ8244@w3.org> <6.0.0.20.2.20071226102314.083e2170@localhost> <fl1mtc$d82$1@ger.gmane.org> <Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
Message-ID: <20080113184012.GD23317@w3.org>
X-Date: Sun Jan 13 19:41:08 2008
* Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> On Fri, 28 Dec 2007, Frank Ellermann wrote:
> >
> > Years later (after 2616bis) it might be possible to upgrade "default
> > ASCII" to UTF-8, Latin-1 was a dead end. As soon as we're back to
> > "default ASCII" just let RFC 2277 finish it off.
>
> FWIW, a number of specs are already overriding both MIME and HTTP when it
> comes to character encodings. For example HTML4 says to not default to any
> encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> currently proposed defaults to an even more complicated heuristic [3], and
> so on.
>
> In the "real world" the implementations are following the heuristics
> described in CSS2.1 and HTML5 (or something close to them), and those
> differ for text/css and text/html, so it would seem pointless for HTTP to
> try to define something here: it would just get ignored.
>
> IMHO the best option is for HTTP to stay out of the discussion altogether
> and let the lower level specs (MIME) and the higher level specs (XML,
> HTML, CSS, etc, defining the formats) figure it out amongst themselves.

I think this is consistent with Martin's proposal that HTTP1.1bis not
set a default encoding
  http://www.w3.org/2008/01/rdf-media-types#noDefault
(noting that Frank Ellermann believed the default should be us-ascii for
the same effect)
  http://www.w3.org/2008/01/rdf-media-types#defAscii

What we still need, however, is an update to 2046 that reflects
current practice (and eases the discovery process for folks
registering non-ascii text/ media types). Let's geek out the
changes we'd like to see.

• CRLF rules:
[[
The canonical form of any MIME "text" subtype MUST always represent
a line break as a CRLF sequence. Similarly, any occurrence of CRLF
in MIME "text" MUST represent a line break. Use of CR and LF
outside of line break sequences is also forbidden.
]] -- RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
is not respected by HTTP1.1, nor is it respected in general when
shipping text/xml.

Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
web servers, proxies or clients)? I would like to think that MIME
shouldn't care about recognizing new lines in the text block.

If it can't go away, can it be relaxed in accordance with HTTP 1.1
[[
The line terminator for message-header fields is the sequence CRLF.
However, we recommend that applications, when parsing such headers,
recognize a single LF as a line terminator and ignore the leading
CR.
]] -- RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3

or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
[[
  1. the two-character sequence #xD #xA
  2. the two-character sequence #xD #x85
  3. the single character #x85
  4. the single character #x2028
  5. any #xD character that is not immediately followed by #xA or
     #x85.
]] -- XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

The XML 1.1 rule interacts with character encoding because, while most
character encodings line up with ascii on CR and LF, clearly none do
on #x85 and #x2028.

• character encoding:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The default character set, which
must be assumed in the absence of a charset parameter, is US-ASCII.

The specification for any future subtypes of "text" must specify
whether or not they will also utilize a "charset" parameter, and may
possibly restrict its values as well. For other subtypes of "text"
than "text/plain", the semantics of the "charset" parameter should be
defined to be identical to those specified here for "text/plain",
i.e., the body consists entirely of characters in the given charset.
In particular, definers of future "text" subtypes should pay close
attention to the implications of multioctet character sets for their
subtype definitions.

The charset parameter for subtypes of "text" gives a name of a
character set, as "character set" is defined in RFC 2045. The rules
regarding line breaks detailed in the previous section must also be
observed -- a character set whose definition does not conform to these
rules cannot be used in a MIME "text" subtype.
]] -- RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.

When should the "default" character set apply?
  • no charset parameter
  • no charset parameter, no fixed encoding for the media type
  • no charset, no fixed encoding, no internal encoding declaration
The current text specifies the first, while HTML and CSS count on the
third. From the use case of "best effort rendering", we are already in
a state where users who are better-informed than their web or mail
clients manually set the encoding so they can see the right
characters. The following heuristics may meet or exceed the user
experience with today's data while advancing the state of the art to
enable better rendering with future data:
[[
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The first of the following
determinants that apply will identify the character set:
  1. charset parameter
  2. fixed encoding registered with the media type, if known
  3. encoding algorithm registered with the media type, if known
  4. UTF-8 if the document conforms to the UTF-8 encoding pattern
  5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]
  6. application preference
]]

@@charset constraints
  • can it have faux line feeds?

@@bidi? Martin, what do you think?

@@lowest common denominator:
  RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
  Is it better to encourage the world to write "UTF-8" or "US-ASCII"
  for ascii subset? tension between lcd and one common encoding.

@@Content-Transfer-Encoding: Base64
  Content-Type: text/wibbly
  How does TE affect this? I suspect it's completely orthogonal.
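
[Editorial sketch] The six determinants proposed above could be written out
roughly as follows. This is only an illustration in Python, not part of the
proposal: the function name pick_charset, its parameters, and the
"windows-1252" stand-in for "application preference" are all hypothetical.
One assumption affects the logic: step 4 is read as applying only to bodies
that contain at least one octet above 0x7F, because a body made solely of
octets in [\r\n\x20-\x7e] is also well-formed UTF-8 and would otherwise never
reach step 5.

```python
import re
from typing import Callable, Optional

# Octets allowed by step 5 of the proposed list: CR, LF and printable ASCII.
_STEP5_OCTETS = re.compile(rb"[\r\n\x20-\x7e]*")


def _is_utf8(body: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8."""
    try:
        body.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def pick_charset(
    body: bytes,
    charset_param: Optional[str] = None,
    fixed_encoding: Optional[str] = None,
    sniff: Optional[Callable[[bytes], Optional[str]]] = None,
    app_preference: str = "windows-1252",  # hypothetical default
) -> str:
    """Return the first determinant that applies, in the proposed order."""
    # 1. charset parameter (values are not case sensitive, so normalize)
    if charset_param:
        return charset_param.lower()
    # 2. fixed encoding registered with the media type, if known
    if fixed_encoding:
        return fixed_encoding.lower()
    # 3. encoding algorithm registered with the media type, if known
    if sniff is not None:
        declared = sniff(body)
        if declared:
            return declared.lower()
    # 4. UTF-8, here only when a non-ASCII octet is present (assumption)
    if any(octet >= 0x80 for octet in body) and _is_utf8(body):
        return "utf-8"
    # 5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]
    if _STEP5_OCTETS.fullmatch(body):
        return "iso-8859-1"
    # 6. application preference
    return app_preference.lower()
```

A caller would pass whatever it knows about the media type; for example,
pick_charset(body, charset_param=None, sniff=my_css_sniffer) for a text/css
body, where my_css_sniffer is whatever routine implements that type's own
detection rules.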

> -- Footnotes --
>
> [1] http://www.w3.org/TR/html4/charset.html#h-5.2.2
>     This text explicitly says that HTTP's default is useless. It then
>     recommends behaviour that is even more useless, but that's another
>     problem altogether...
>
> [2] http://www.w3.org/TR/CSS21/syndata.html#charset
>
> [3] http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#determining
>
> Cheers,
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

--
-eric

office: +1.617.258.5741 NE43-344, MIT, Cambridge, MA 02144 USA
mobile: +1.617.599.3509

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

From ned.freed@mrochek.com Mon Jan 14 01:26:01 2008
From: ned.freed at mrochek.com (Ned Freed)
Date: Mon Jan 14 01:44:11 2008
Subject: Unknown text/* subtypes
In-Reply-To: "Your message dated Sun, 13 Jan 2008 13:40:12 -0500" <20080113184012.GD23317@w3.org>
Message-ID: <01MQ1UZ0IIRA00004Z@mauve.mrochek.com>

> * Ian Hickson <ian@hixie.ch> [2008-01-13 05:47+0000]
> > On Fri, 28 Dec 2007, Frank Ellermann wrote:
> > >
> > > Years later (after 2616bis) it might be possible to upgrade "default
> > > ASCII" to UTF-8, Latin-1 was a dead end. As soon as we're back to
> > > "default ASCII" just let RFC 2277 finish it off.
> >
> > FWIW, a number of specs are already overriding both MIME and HTTP when it
> > comes to character encodings.
> > For example HTML4 says to not default to any
> > encoding at all [1], CSS defaults to a complicated heuristic [2], HTML5 as
> > currently proposed defaults to an even more complicated heuristic [3], and
> > so on.
> >
> > In the "real world" the implementations are following the heuristics
> > described in CSS2.1 and HTML5 (or something close to them), and those
> > differ for text/css and text/html, so it would seem pointless for HTTP to
> > try to define something here: it would just get ignored.
> >
> > IMHO the best option is for HTTP to stay out of the discussion altogether
> > and let the lower level specs (MIME) and the higher level specs (XML,
> > HTML, CSS, etc, defining the formats) figure it out amongst themselves.

> I think this is consistent with Martin's proposal that HTTP1.1bis not
> set a default encoding
>   http://www.w3.org/2008/01/rdf-media-types#noDefault
> (noting that Frank Ellermann believed the default should be us-ascii for
> the same effect)
>   http://www.w3.org/2008/01/rdf-media-types#defAscii

> What we still need, however, is an update to 2046 that reflects
> current practice (and eases the discovery process for folks
> registering non-ascii text/ media types). Let's geek out the
> changes we'd like to see.

You might, and I emphasize might, be able to get this changed to a
protocol-specific restriction. (The MIME specifications specify both an
email-specific extension as well as some more generally useful facilities.)
There is no chance of this rule being lifted in general.

> • CRLF rules:
> [[
> The canonical form of any MIME "text" subtype MUST always represent
> a line break as a CRLF sequence. Similarly, any occurrence of CRLF
> in MIME "text" MUST represent a line break. Use of CR and LF
> outside of line break sequences is also forbidden.
> ]] -- RFC2046 §4.1.1 ¶1 http://www.rfc.net/rfc2046.html#s4.1.1.
> is not respected by HTTP1.1, nor is it respected in general when
> shipping text/xml.

> Does anyone rely on any vestige of this rule (e.g. mail clients, MTAs,
> web servers, proxies or clients)?

Not only does email depend on this, conformance to this has been dramatically
strengthened, not weakened, in subsequent revisions of the email protocol
specification. Specifically, RFC 821 was essentially silent on what bare CR
and LF mean, but 2821 and 2821bis (now in last call) both say that bare CR and
LF MUST NOT be sent and if received MUST NOT be treated as CRLF.

This, incidentally, is not the way I personally think things should have been
done. I like the "ignore bare CR treat LF like CRLF" approach. But my personal
opinion isn't especially relevant - I mention it only to avoid "shoot the
messenger" sorts of responses.

> I would like to think that MIME
> shouldn't care about recognizing new lines in the text block.

I'm sorry, but that's fanciful in the extreme.

> If it can't go away, can it be relaxed in accordance with HTTP 1.1
> [[
> The line terminator for message-header fields is the sequence CRLF.
> However, we recommend that applications, when parsing such headers,
> recognize a single LF as a line terminator and ignore the leading
> CR.
> ]] -- RFC2616 §19.3 ¶3 http://www.rfc.net/rfc2616.html#s19.3

Again, I personally think this is the way to go. But that's not what has
happened.

> or XML 1.1 (which includes NEXT LINE (NEL) and LINE SEPARATOR):
> [[
>   1. the two-character sequence #xD #xA
>   2. the two-character sequence #xD #x85
>   3. the single character #x85
>   4. the single character #x2028
>   5. any #xD character that is not immediately followed by #xA or
>      #x85.
> ]] -- XML 1.1 §2.11 ¶2 http://www.w3.org/TR/xml11/#sec-line-ends

> The XML 1.1 rule interacts with character encoding because, while most
> character encodings line up with ascii on CR and LF, clearly none do
> on #x85 and #x2028

> • character encoding:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive.
> The default character set, which
> must be assumed in the absence of a charset parameter, is US-ASCII.

> The specification for any future subtypes of "text" must specify
> whether or not they will also utilize a "charset" parameter, and may
> possibly restrict its values as well. For other subtypes of "text"
> than "text/plain", the semantics of the "charset" parameter should be
> defined to be identical to those specified here for "text/plain",
> i.e., the body consists entirely of characters in the given charset.
> In particular, definers of future "text" subtypes should pay close
> attention to the implications of multioctet character sets for their
> subtype definitions.

> The charset parameter for subtypes of "text" gives a name of a
> character set, as "character set" is defined in RFC 2045. The rules
> regarding line breaks detailed in the previous section must also be
> observed -- a character set whose definition does not conform to these
> rules cannot be used in a MIME "text" subtype.
> ]] -- RFC2046 §4.1.2 ¶2-4 http://www.rfc.net/rfc2046.html#s4.1.2.

> When should the "default" character set apply?
>   • no charset parameter
>   • no charset parameter, no fixed encoding for the media type
>   • no charset, no fixed encoding, no internal encoding declaration
> The current text specifies the first, while HTML and CSS count on the
> third. From the use case of "best effort rendering", we are already in
> a state where users who are better-informed than their web or mail
> clients manually set the encoding so they can see the right
> characters. The following heuristics may meet or exceed the user
> experience with today's data while advancing the state of the art to
> enable better rendering with future data:
> [[
> Unlike some other parameter values, the values of the charset
> parameter are NOT case sensitive. The first of the following
> determinants that apply will identify the character set:
>   1. charset parameter
>   2. fixed encoding registered with the media type, if known
>   3. encoding algorithm registered with the media type, if known
>   4. UTF-8 if the document conforms to the UTF-8 encoding pattern
>   5. ISO-8859-1 if all the octets are in [\r\n\x20-\x7e]
>   6. application preference
> ]]

Again, there is absolutely no chance this will fly for email so it cannot be
written with this degree of generality. And if this is made protocol specific
the specifics of any protocol other than email don't belong in a RFC 2046
revision.

> @@charset constraints
>   • can it have faux line feeds?

> @@bidi? Martin, what do you think?

> @@lowest common denominator:
>   RFC2046 §4.1.2 ¶22 http://www.rfc.net/rfc2046.html#s4.1.2.
>   Is it better to encourage the world to write "UTF-8" or "US-ASCII"
>   for ascii subset? tension between lcd and one common encoding.

Marking something as utf-8 when it is in fact restricted to the us-ascii
subset has been known to cause problems. I think change in this area is
unlikely.

Ned

From justivo@gmail.com Mon Jan 14 13:53:33 2008
From: justivo at gmail.com (Ivo Emanuel Gonçalves)
Date: Mon Jan 14 14:18:53 2008
Subject: Request for review of Ogg Media Types: video/ogg, audio/ogg, application/ogg
In-Reply-To: <dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com>
References: <dc107ee70712031027q47e6748bk957986ea5db7467c@mail.gmail.com> <fj1s1p$dkn$1@ger.gmane.org> <dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com>
Message-ID: <dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com>

Hello list,

This is the continuation of a thread started by me roughly a month
ago. As a reminder, anyone may still post feedback regarding the Ogg
media types described in [1]. Feedback regarding good and bad aspects
is equally welcome.

So far, only Mr Ellermann commented on the registration proposal.
Please don't feel shy.

-Ivo

[1] http://www.ietf.org/internet-drafts/draft-goncalves-rfc3534bis-00.txt

From simon@josefsson.org Mon Jan 14 14:34:00 2008
From: simon at josefsson.org (Simon Josefsson)
Date: Mon Jan 14 14:34:15 2008
Subject: Request for review of Ogg Media Types: video/ogg, audio/ogg, application/ogg
In-Reply-To: <dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com> ("Ivo Emanuel Gonçalves"'s message of "Mon, 14 Jan 2008 12:53:33 +0000")
References: <dc107ee70712031027q47e6748bk957986ea5db7467c@mail.gmail.com> <fj1s1p$dkn$1@ger.gmane.org> <dc107ee70712041813w63d28a2bga22f56c134a0854d@mail.gmail.com> <dc107ee70801140453j6e295d36q573596541a7b4ecc@mail.gmail.com>
Message-ID: <87prw4sf93.fsf@mocca.josefsson.org>

"Ivo Emanuel Gonçalves" <justivo@gmail.com> writes:

> Hello list,
>
> This is the continuation to a thread started by me roughly a month
> ago. As a reminder, anyone may still post feedback regarding the Ogg
> media types described in [1]. Feedback regarding good and bad aspects
> is equally welcome.

I read the draft, and it looks fine to me.

There is a reference for base64 to RFC2397 which looks odd to me; please
consider using RFC4648 instead.

Thanks for including the liberal 'Copying Conditions' section, which
makes it possible to include the document in free software packages.

Thanks,
Simon

From nobody@xyzzy.claranet.de Tue Jan 15 04:12:32 2008
From: nobody at xyzzy.claranet.de (Frank Ellermann)
Date: Tue Jan 15 04:12:29 2008
Subject: Unknown text/* subtypes
References: <20071218114549.GQ8244@w3.org> <4767D82F.1060003@gmx.de><fk8puv$5f8$1@ger.gmane.org> <6.0.0.20.2.20071226102314.083e2170@localhost> <fl1mtc$d82$1@ger.gmane.org> <Pine.LNX.4.62.0801130538560.13181@hixie.dreamhostps.com>
Message-ID: <fmh89s$g2g$1@ger.gmane.org>

Ian Hickson wrote:

> For example HTML4 says to not default to any encoding at all [1]
[...]

Yes, but HTTP has to work for plain text, pre-HTML 4, etc., and
I think HTTP needs its own idea of what is allowed in a HTTP
header. If one side refuses to say what the body is the other
side needs a working assumption for the job at hand (= HTTP
transmission). How browsers display a body (if at all) is a
different question.

"Assume it's something remotely related to ASCII, i.e. all octets
that could be ASCII actually are ASCII" is good enough for HTTP,
isn't it ?

I don't see where "assume Latin-1" is actually needed today with
respect to *HTTP*, even for HTML 2 (or arguably 3.2). The W3C
validator ignores this HTML detail - AFAIK I'm the only user who
ever asked if that's as it should be. It is irrelevant outside
of validator torture tests... :-)

> it would seem pointless for HTTP to try to define something
> here: it would just get ignored.

I think we mean the same thing when I propose that it's pointless
to define "something different from MIME" in the HTTP spec., a
normative MIME reference (+ explanation of the change) will do.

Frank
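
[Editorial sketch] Frank's working assumption above ("all octets that could
be ASCII actually are ASCII") can be illustrated with a few lines of Python.
The helper name ascii_view and the "?" placeholder are hypothetical, and the
sketch only makes sense for ASCII-compatible encodings such as UTF-8 or the
ISO-8859 family; it says nothing useful about UTF-16 and similar encodings.

```python
def ascii_view(body: bytes, placeholder: str = "?") -> str:
    """Interpret octets below 0x80 as ASCII and mask everything else.

    This mirrors the "remotely related to ASCII" assumption: the ASCII
    range is trusted, and no meaning is assigned to the other octets.
    """
    return "".join(chr(octet) if octet < 0x80 else placeholder for octet in body)


# The two octets of a UTF-8 encoded "e with acute" are masked individually.
print(ascii_view(b"caf\xc3\xa9\r\n"))
```

The point of the sketch is that a transport can locate ASCII delimiters
(CR, LF, colons, and so on) without knowing the real charset, leaving
display questions to the format specifications.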
- Unknown text/* subtypes Eric Prud'hommeaux
- Unknown text/* subtypes Eric Prud'hommeaux
- Unknown text/* subtypes Ian Hickson
- Unknown text/* subtypes Eric Prud'hommeaux
- Unknown text/* subtypes Larry Masinter
- Unknown text/* subtypes Frank Ellermann
- Unknown text/* subtypes Ian Hickson
- Unknown text/* subtypes Frank Ellermann
- Unknown text/* subtypes Ned Freed
- Unknown text/* subtypes Julian Reschke