Re: MPEG asks for MIME review for the MPEG21 file format

Martin Duerst <> Sat, 19 May 2007 07:17 UTC

Date: Sat, 19 May 2007 09:59:51 +0900
To: Dave Singer <>, Anne van Kesteren <>, Chris Lilley <>, Larry Masinter <>,
From: Martin Duerst <>
Subject: Re: MPEG asks for MIME review for the MPEG21 file format
Cc: "'Graham Klyne'" <>,, "'Christian Timmerer (ITEC)'" <>,,

Hello Dave,

Some very good questions.

At 00:27 07/05/19, Dave Singer wrote:
>For a lot of these encodings, of course, the initial string is identical (all the ones which have an 'ascii' core).  UTF-16 uses twice the bytes etc.
>But in general, given a MIME type with a "+xml" suffix, an XML reader should be prepared to do what?

At the minimum, handle it if it's UTF-8 or UTF-16 (with BOM in the latter
case). Everything else is optional.

>I think I am reading "treat the resource as being, in turn, all the encodings you know of, and if you treat it as an encoding, do you find a confirming "encoding" attribute?"

My reading of Appendix F of the XML Spec would be somewhat different.
First, it's not character encodings, but character encoding families,
that you try. This makes this process quite a bit faster.

Second, that appendix gives a list of character encoding families.
As the appendix is non-normative, it doesn't necessarily exclude
other character encoding families, but there aren't really any other
character encoding families that I know of.
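As a rough illustration of trying families rather than individual encodings, here is a minimal Python sketch of a check on the first few bytes. The byte patterns follow the table in Appendix F as I understand it; the function name, the labels, and the "no-signature" fallback are made up for this example, and a real parser would go on to read the encoding declaration within the detected family.

```python
# (leading bytes, family label) pairs. Four-byte patterns are listed before
# the two-byte BOMs so that FF FE 00 00 is not mistaken for UTF-16.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "ucs-4"),             # with byte order mark
    (b"\xff\xfe\x00\x00", "ucs-4"),
    (b"\x00\x00\xff\xfe", "ucs-4"),
    (b"\xfe\xff\x00\x00", "ucs-4"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\x00\x00\x00\x3c", "ucs-4"),             # without byte order mark
    (b"\x3c\x00\x00\x00", "ucs-4"),
    (b"\x00\x00\x3c\x00", "ucs-4"),
    (b"\x00\x3c\x00\x00", "ucs-4"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),         # "<?" in 16-bit units
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),  # "<?xm" in ASCII
    (b"\x4c\x6f\xa7\x94", "ebcdic"),            # "<?xm" in EBCDIC
]


def sniff_encoding_family(data: bytes) -> str:
    """Return a label for the encoding family suggested by the first bytes."""
    for prefix, family in SIGNATURES:
        if data.startswith(prefix):
            return family
    # No signature: UTF-8 without a declaration, or external information is needed.
    return "no-signature"
```

Only about a dozen patterns are compared, however many individual charsets exist, which is why this is faster than trying every encoding in turn.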

>Which means that encoding='EBCDIC' (I made that up, by the way) would work?

You didn't have to make that up. EBCDIC as a family is listed in said
appendix. The IETF charset registry lists close to 50 EBCDIC variants.
I guess that
for the 'original' EBCDIC, you'd have to write encoding='EBCDIC-US'.
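To make the EBCDIC case concrete, here is a small Python sketch. It assumes that Python's cp037 codec (IBM037) is an acceptable stand-in for "some EBCDIC variant"; whether it matches what the registry calls EBCDIC-US is not checked here, and the function name is hypothetical. The idea is that the characters used in an XML declaration are decoded with one member of the family, after which the declared name can be read.

```python
import re

# Assumption: cp037 (IBM037) stands in for "some member of the EBCDIC family".
EBCDIC_CODEC = "cp037"
EBCDIC_XML_START = bytes.fromhex("4c6fa794")  # "<?xm" encoded in EBCDIC


def declared_encoding_ebcdic(data: bytes):
    """Return the name in the encoding pseudo-attribute, or None."""
    if not data.startswith(EBCDIC_XML_START):
        return None
    head = data[:200].decode(EBCDIC_CODEC, errors="replace")
    match = re.search(
        r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""", head
    )
    return match.group(1) if match else None


sample = "<?xml version='1.0' encoding='IBM037'?><a/>".encode(EBCDIC_CODEC)
```

With this, declared_encoding_ebcdic(sample) gives back the name written in the declaration, which a parser would then use to pick the exact variant.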

>Is a ZIP compressed XML file servable under a +xml MIME type? "encoding='zipped Shift_JIS'"?

First, the encoding names allowed in the XML spec don't permit spaces,
but that's a detail.
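For illustration, the name syntax can be checked with a short Python sketch. The pattern is my transcription of the EncName production in the XML 1.0 spec (a Latin letter followed by letters, digits, '.', '_' or '-'); the function name is made up for this example.

```python
import re

# EncName from the XML 1.0 spec: [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """True if the whole string fits the EncName syntax."""
    return ENC_NAME.fullmatch(name) is not None
```

Here is_valid_enc_name("Shift_JIS") is true, while is_valid_enc_name("zipped Shift_JIS") is false because of the space.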

Second, I'm not familiar with ZIP encoding, but I guess that it's
not starting with any of the byte sequences mentioned in Appendix F.

The third point is that ZIP files are archives, not compressions of
single files. So you would have to restrict this kind of thing to
archives containing single files.

Fourth, ZIP files don't have any way to identify internal character
encodings. And polluting the charset space with zipped_foo,
zipped_bar, ... does not look like a good idea.
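The second and third points can be checked with a short Python sketch using the standard zipfile module. The member names and the list of leading byte patterns (transcribed from Appendix F) are assumptions of this example. It builds an archive in memory, looks at its first four bytes, and lists the members.

```python
import io
import zipfile

# Leading byte patterns from the Appendix F table (BOM and non-BOM cases).
APPENDIX_F_STARTS = [
    b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00", b"\x00\x00\xff\xfe",
    b"\xfe\xff\x00\x00", b"\xfe\xff", b"\xff\xfe", b"\xef\xbb\xbf",
    b"\x00\x00\x00\x3c", b"\x3c\x00\x00\x00", b"\x00\x00\x3c\x00",
    b"\x00\x3c\x00\x00", b"\x00\x3c\x00\x3f", b"\x3c\x00\x3f\x00",
    b"\x3c\x3f\x78\x6d", b"\x4c\x6f\xa7\x94",
]

# Build a small archive in memory; the member names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.xml", "<?xml version='1.0'?><a/>")
    archive.writestr("b.xml", "<?xml version='1.0'?><b/>")
zip_bytes = buffer.getvalue()

first_four = zip_bytes[:4]  # local file header signature, "PK" 03 04
looks_like_xml = any(zip_bytes.startswith(s) for s in APPENDIX_F_STARTS)

# An archive is a container: one ZIP file can hold several documents.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    member_names = archive.namelist()
```

The archive starts with the bytes "PK" 03 04, which match none of the patterns, and it holds two documents, so a reader following Appendix F has nothing to go on.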

>'Semantic' encodings (e.g. MPEG BiM, which uses the schema to be able to compact the XML) are even greyer;  the 'encoding=' is inserted by the BiM decoder, so what does it say then?  I think the 'sanity check' has to be not that the resulting 'encoding=' says BiM, but that the BiM decode worked;  it makes no sense for the BiM decoder to produce a text document that says "encoding='BiM'"!

Well, actually this is less of a problem. Or, to put it the other way round,
it's a problem that already turns up with plain old character encodings.
The easiest way to understand this is to think in terms of Java, because
Java has a very clear distinction between byte sequences (Streams) and
character sequences (Readers/Writers).

The whole encoding stuff is important as long as you are on the byte
level. Once the decoding is done, you have external information about
the encoding (in Java, you know it's UTF-16), so the encoding
pseudo-attribute in the XML declaration becomes irrelevant.
That's how the implementations I know handle this: you can hand
an XML document from a Reader or a String to a Java XML parser,
and the characters in there might read:  encoding='shift_jis',
but that's just ignored. There is not too much in the spec
that defines this explicitly, but it's pretty difficult to do

>This is all well off-topic for MPEG-21 of course, but by exploring these edge cases we might get some clarity on +xml, which would be a Good Thing.

Yes indeed. Thanks a lot.

Regards,    Martin.

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University