Re: Comments on MIME/SGML

"Daniel W. Connolly" <connolly@hal.com> Wed, 09 March 1994 00:46 UTC

Message-Id: <9403090024.AA05155@ulua.hal.com>
To: Ed Levinson <elevinso@accurate.com>
Cc: Multiple Recipients of List <ietf-822@dimacs.rutgers.edu>, MIME/SGML discussion group <mime-sgml@infoods.mit.edu>
Subject: Re: Comments on MIME/SGML
In-Reply-To: Your message of "Mon, 07 Mar 1994 17:10:54 EST." <9403072210.AA02671@Accurate.COM>
Date: Tue, 08 Mar 1994 18:24:42 -0600
Sender: ietf-archive-request@IETF.CNRI.Reston.VA.US
From: "Daniel W. Connolly" <connolly@hal.com>

In message <9403072210.AA02671@Accurate.COM>, Ed Levinson writes:
>
>The essence of my proposal is to replace the "dtd" parameter with "prolog"
>and to require both prolog and instance.  The reason I suggest this
>approach is practical, various implementations treat these two document
>elements differently.

Hmmm... my reasons were chiefly practical too; they were based on
experience with the SGMLs package. Could you give some background (or
pointers to materials I should read) about these "various
implementations" that treat the prologue and the instance differently?

>As to using text/sgml or application/sgml, I chose application to keep
>within expressed boundaries others in the MIME community have
>suggested.  Namely, that text be reserved for very simple things.

Formally speaking, it's a coin toss. I agree we should go with
whatever precedents are out there. application/sgml is fine. I just
don't like it when MH uses base64 encoding on my html body parts when
I know most of my audience can read html source -- perhaps I just need
to learn to use my tools better.

>The correspondences you provided I like, it may be easier to explain
>what is happening using your table.  I summarize it, with my own
>suggestions, below.
>
>Do you find my proposal acceptable?

It's acceptable, but I'm not sure it's optimal yet.
Let's take another whack at the SGML->MIME correspondence:
[I won't comment on the SDIF terms as I haven't read the SDIF standard.]

>	SGML:			MIME:
>	notation (type)		Content-Type:
>	SYSTEM identifier	Content-ID:
>	data entity		Body Part

I can't find the term "notation type" in the SGML standard. I have
found:

	4.75 data content notation: An application-specific
	interpretation of an element's data content, or of a non-SGML
	data entity, that usually extends or differs from the normal
	meaning of the document character set.

and
	4.213 notation identifier: An _external identifier_ that
	identifies a data content notation in a _notation
	declaration_. It can be a _public identifier_ if the notation
	is public, and, if not, a description or other information
	sufficient to invoke a program to interpret the notation.

Also, "data entity" is not a term from the standard. We could use:

	4.134 external entity: An entity whose text is not
	incorporated directly in an entity declaration; its
	system identifier and/or public identifier is specified
	instead.

When I look at this closely, there's some redundancy: in SGML, the
choice of notations is expressed in the ENTITY declaration along with
the "filename" info. In MIME, the content type is expressed in the
referenced body part. When using MIME/SGML, we have to put it in both
places.

Is the connection between SGML notation identifiers and the MIME
Content-Type syntax supposed to be explicit, or is there an implicit
correspondence between a MIME content type and an SGML data content
notation? For example, does the MIME content type show up explicitly
in the NOTATION declaration, like this:

	--8<
	Content-Type: application/postscript
	Content-ID: id1

	%!PS-Adobe...

	--8<
	Content-Type: application/sgml

	<!DOCTYPE T SYSTEM [
	<!NOTATION ps SYSTEM "application/postscript">
	<!ENTITY fig1 SYSTEM "id1" NDATA ps>
	]>
	...
	--8<--

or is it sufficient to write:

	--8<
	Content-Type: application/postscript
	Content-ID: id1

	%!PS-Adobe...

	--8<
	Content-Type: application/sgml

	<!DOCTYPE T SYSTEM [
	<!NOTATION ps PUBLIC "-//Adobe/PostScript" -- exact syntax? -->
	<!ENTITY fig1 SYSTEM "id1" NDATA ps>
	]>
	...
	--8<--

Hmmm... the implicit connection is probably more practical, but it
introduces redundancy and the chance for errors. The explicit mapping
causes the namespace of SYSTEM identifiers to include MIME
content-types. Blech.
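
To make the implicit option concrete, here is a minimal Python sketch of
the consistency check a receiver would need: look up the notation's
public identifier in a table and compare the result with the Content-Type
of the referenced body part. The table NOTATION_TO_MIME and the function
name are hypothetical, and the public identifier is the placeholder from
the example above, not a registered one.

```python
# Hypothetical table: SGML notation public identifier -> MIME content type.
# The identifier is the placeholder used in the example, not a registered one.
NOTATION_TO_MIME = {
    "-//Adobe/PostScript": "application/postscript",
}


def notation_matches(public_id, content_type):
    """Return True if the body part's Content-Type agrees with the notation."""
    expected = NOTATION_TO_MIME.get(public_id)
    if expected is None:
        return False  # unknown notation: nothing to compare against
    # MIME type names are case-insensitive; drop any parameters after ';'.
    actual = content_type.split(";", 1)[0].strip().lower()
    return actual == expected


print(notation_matches("-//Adobe/PostScript", "application/postscript"))
print(notation_matches("-//Adobe/PostScript", "image/gif"))
```

The table is exactly where the redundancy lives: every notation has to be
listed twice, once in SGML terms and once in MIME terms, and the two can
drift apart.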

>	marked up text		Application/SGML
>	document		Multipart/SGML

Up to here, we have been using terms from the standard. Your
suggestion to introduce the term "marked up text" is a departure from
what seemed like an otherwise elegant proposal. It's still
well-defined, but in application-specific terms rather than in SGML
standard terms. The question is whether practical considerations
sufficiently motivate the departure.

I suggested that we make the formal correspondence between the following:

	SGML entity		body of Application/SGML body part
cf.:
	4.284 SGML entity: An entity whose characters are interpreted
	as markup or data in accordance with this International
	Standard.

The idea here is that MIME plays the role of entity manager, and MIME
body parts map 1-1 to SGML entities. The first production in the
standard is:

	[1] SGML document = SGML document entity
		(SGML subdocument entity |
		SGML text entity | non-SGML data entity)*

You can't split the prologue and the instance across SGML entities.
But you _can_ split the SGML document entity across system-specific
objects:

	NOTES
	1 This International Standard does not constrain the physical
	organization of the document within the data stream, message
	handling protocol, filesystem, etc., that contains it. In
	particular, separate entities could occur in the same physical
	object, a single entity could be divided between multiple
	objects, and the objects could occur in any order.

Using the example I originally sent, we had:
	Content-ID:			Contents		SGML term or App convention
	<10024.761615492.3@ulua>	SGML document		App
	<10024.761615492.4@ulua>	external entity		SGML
	<10024.761615492.5@ulua>	SGML document entity	SGML
	<10024.761615492.6@ulua>	SGML text entity	SGML
	<10024.761615492.7@ulua>	SGML declaration	App

Your suggestion makes it look like:
	Content-ID:			Contents		SGML term or App convention
	<10024.761615492.3@ulua>	SGML document		App
	<10024.761615492.4@ulua>	external entity		SGML
	<10024.761615492.5@ulua>	prolog			App
	<10024.761615492.6@ulua>	external entity		SGML
	<10024.761615492.7@ulua>	declaration		App
	<10024.761615492.8@ulua>	instance		App

But in the end, it's not really critical that SGML text entities map
exactly to MIME body parts (even my proposal did app-specific stuff
with the SGML declaration). [Hmmm... until you start talking about
subdocument entities... I think a concrete example of this is in order.]

The critical thing is how all this interacts with available (and
conceivable) tools. For example, with either of the above examples, I
could do
	
	mhn store cur

and get several files: 4.sgml, 5.sgml, 6.sgml, ...
After I replace system identifiers (SYSTEM "10024.761615492.6@ulua")
with filenames (SYSTEM "6.sgml") in those files, I could validate the
document using:

	sgmls -s 7.sgml 5.sgml		# Connolly's version, or
	sgmls -s 7.sgml 5.sgml 8.sgml	# Levinson's version
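
The replacement step could be sketched in Python roughly as follows. This
is only a textual substitution under stated assumptions: the mapping
CID_TO_FILE, the function name, and the entity name "chap1" are made up
for illustration, and the regular expression only handles SYSTEM
identifiers written in double quotes.

```python
import re

# Hypothetical mapping from the Content-ID used as a SYSTEM identifier
# to the file name that "mhn store" produced for that body part.
CID_TO_FILE = {
    "10024.761615492.6@ulua": "6.sgml",
}

# Matches SYSTEM "..." with a double-quoted literal only.
SYSTEM_ID = re.compile(r'SYSTEM\s+"([^"]*)"')


def rewrite_system_ids(text, mapping):
    """Replace known SYSTEM identifiers with local file names.

    Purely textual: it does not parse SGML, so it cannot tell real
    markup from data that merely looks like markup.
    """
    def swap(match):
        ident = match.group(1)
        # Unknown identifiers are left unchanged.
        return 'SYSTEM "%s"' % mapping.get(ident, ident)

    return SYSTEM_ID.sub(swap, text)


src = '<!ENTITY chap1 SYSTEM "10024.761615492.6@ulua">'
print(rewrite_system_ids(src, CID_TO_FILE))
# prints: <!ENTITY chap1 SYSTEM "6.sgml">
```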

Hmmm... about replacing system identifiers... this could be a _really_
tedious process. I wonder if we could get rid of this step somehow
(with something like the original Content-Reference stuff?). Let's
see... you could leave the SGML declaration body part alone. Then you
have to process the other parts in the order they will be presented to
the SGML parser... in fact, I think you have to parse them! Consider
the following pathological case:

foo.sgml:
	<!DOCTYPE T [
	<!ELEMENT T - - ANY>
	<!ENTITY example SYSTEM "ex1.sgml">
	]>
	<T>blah blah, for example:
	<![ RCDATA [ &example; ]]>
	</T>

ex1.sgml:
	<!ENTITY foo SYSTEM "fake-file">

All the characters in ex1.sgml are data, even though they look like
markup.
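
A short Python sketch shows why a tool that only scans the text gets this
case wrong. The function scan_entity_decls is hypothetical and
deliberately naive: it reports anything that looks like an external
entity declaration, with no knowledge of the parsing context.

```python
import re

# Naive pattern: anything that looks like an external entity declaration.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+SYSTEM\s+"([^"]*)"')


def scan_entity_decls(text):
    """Return (name, system identifier) pairs found by textual scanning."""
    return ENTITY_DECL.findall(text)


foo_sgml = (
    '<!DOCTYPE T [\n'
    '<!ELEMENT T - - ANY>\n'
    '<!ENTITY example SYSTEM "ex1.sgml">\n'
    ']>\n'
    '<T>blah blah, for example:\n'
    '<![ RCDATA [ &example; ]]>\n'
    '</T>\n'
)
ex1_sgml = '<!ENTITY foo SYSTEM "fake-file">\n'

print(scan_entity_decls(foo_sgml))  # a real declaration
print(scan_entity_decls(ex1_sgml))  # a false positive: this text is data
```

The scanner reports an entity "foo" in ex1.sgml, although a parser that
reaches that text through the RCDATA marked section never sees a
declaration there.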

	[AAARGH!!! My X server just died and emacs lost my last 3 hours'
	work on this message!]

Quickly, before I forget:

* As it stands, the MIME/SGML packer/unpacker cannot be implemented as
an SGML layer over MIME or as a MIME layer over SGML -- it must be a
piece of software that understands both simultaneously (see the above
entity usage). I suggest that instead of messing with the SYSTEM
identifiers in the data stream, we do an external mapping. Using the
above example, the packer would write:

	Content-Type: multipart/sgml; boundary="xxx";
		document="id2"; entity-map="id1"

	--xxx
	Content-Type: application/sgml-entity-map

	<id2>	"foo.sgml"
	<id3>	"ex1.sgml"

	--xxx
	Content-Type: application/sgml; name="foo.sgml"

	<!DOCTYPE T [
	<!ELEMENT T - - ANY>
	<!ENTITY example SYSTEM "ex1.sgml">
	]>
	<T>blah blah, for example:
	<![ RCDATA [ &example; ]]>
	</T>

	--xxx
	Content-Type: application/sgml; name="ex1.sgml"

	<!ENTITY foo SYSTEM "fake-file">

	--xxx--

For most cases, this makes the packer and unpacker trivial -- it works
just like application/octet-stream. For cases where the sender's
filenames can't be encoded in the MIME name parameter, or cases where
the syntaxes of the sender and receiver's filesystems are different,
the entity-map provides sufficient information to make the necessary
translation.
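
A rough sketch of such an unpacker, using Python's standard email package.
The content types multipart/sgml and application/sgml-entity-map and the
"<id> filename" line syntax are only the ones proposed above, not
registered types; parse_entity_map and unpack are hypothetical names. The
sketch collects the parts by their name parameter and returns the entity
map alongside them, leaving any filename translation to the caller.

```python
import re
from email import message_from_string

RAW = (
    'Content-Type: multipart/sgml; boundary="xxx";\n'
    '    document="id2"; entity-map="id1"\n'
    '\n'
    '--xxx\n'
    'Content-Type: application/sgml-entity-map\n'
    '\n'
    '<id2>\t"foo.sgml"\n'
    '<id3>\t"ex1.sgml"\n'
    '\n'
    '--xxx\n'
    'Content-Type: application/sgml; name="foo.sgml"\n'
    '\n'
    '<!DOCTYPE T [\n'
    '<!ELEMENT T - - ANY>\n'
    '<!ENTITY example SYSTEM "ex1.sgml">\n'
    ']>\n'
    '<T>blah blah, for example:\n'
    '<![ RCDATA [ &example; ]]>\n'
    '</T>\n'
    '\n'
    '--xxx\n'
    'Content-Type: application/sgml; name="ex1.sgml"\n'
    '\n'
    '<!ENTITY foo SYSTEM "fake-file">\n'
    '\n'
    '--xxx--\n'
)

# One map entry per line: <id> "filename" (syntax assumed from the example).
MAP_LINE = re.compile(r'<([^>]+)>\s+"([^"]*)"')


def parse_entity_map(text):
    """Turn the entity-map body into a dict of id -> sender's file name."""
    return dict(MAP_LINE.findall(text))


def unpack(raw):
    """Split a multipart/sgml message into (entity map, {name: text})."""
    msg = message_from_string(raw)
    entity_map = {}
    files = {}
    for part in msg.get_payload():
        ctype = part.get_content_type()
        if ctype == "application/sgml-entity-map":
            entity_map = parse_entity_map(part.get_payload())
        elif ctype == "application/sgml":
            # The entity text is stored untouched, as for octet-stream.
            files[part.get_param("name")] = part.get_payload()
    return entity_map, files


entity_map, files = unpack(RAW)
print(entity_map)
print(sorted(files))
```

Note that the unpacker never looks inside the SGML text, which is the
point of keeping the mapping external.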

* The character set section of the MIME/SGML draft is overly brief and
uses the nebulous term "ASCII." It should use the term US-ASCII, which
is well-defined in the Internet community, and equate it to
ISO-646-1983, which is the character set from the default SGML
declaration. It should also give at least one complete example of
using another character set (for example ISO-Latin-1 -- I tried for
weeks to figure out how to spell that in SGML).

* We need examples of usage of subdocument entities. I think this is
another factor that motivates the mapping of an SGML document entity
onto a single MIME body part (the alternative is to represent an SGML
subdocument entity as another multipart/sgml body part, then extract
the prologue and instance body parts, and concatenate them together
-- then you have the subdocument entity. Workable, but clumsy...)

* It's not clear how the single application/sgml body part works. The
example given was:

	Content-Type: application/SGML;
	  dtd="-//USA-DOD//DTD MIL-M-21742 911001//EN"

	<! ... an SGML instance >

This implies an algorithm for producing an SGML document entity from
a public identifier for a DTD and an instance. I don't quite see how
to do this in general (what's the name of the DOCTYPE?).
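
One possible reading of that implied algorithm, sketched in Python. The
guess that the document type name equals the name in the first start-tag
of the instance is purely an assumption made here for illustration, as
are the function name and the sample instance; the draft does not say
where the name comes from.

```python
import re

# SGML names in the reference concrete syntax: a letter, then letters,
# digits, '.' or '-'. This does not match "<!" or "</".
START_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9.-]*)')


def make_document_entity(dtd_public_id, instance):
    """Prepend a DOCTYPE declaration to an instance.

    Assumption: the document type name is taken from the first start-tag
    in the instance. The guess fails when that start-tag is omitted
    through tag minimization.
    """
    match = START_TAG.search(instance)
    if match is None:
        raise ValueError("no start-tag found; cannot guess the DOCTYPE name")
    name = match.group(1)
    return '<!DOCTYPE %s PUBLIC "%s">\n%s' % (name, dtd_public_id, instance)


print(make_document_entity(
    "-//USA-DOD//DTD MIL-M-21742 911001//EN",
    "<doc><title>Example</title></doc>",  # hypothetical instance
))
```

If the DTD allows the outermost start-tag to be omitted, there is nothing
in the instance to guess from, so the name would have to travel in the
MIME headers some other way.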

Dan