Re: [apps-discuss] I-D Action: draft-ietf-appsawg-xml-mediatypes-05.txt

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Wed, 20 November 2013 11:08 UTC

Message-ID: <528C980D.7070106@it.aoyama.ac.jp>
Date: Wed, 20 Nov 2013 20:07:57 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: "Henry S. Thompson" <ht@inf.ed.ac.uk>
References: <20131119120919.12901.59046.idtracker@ietfa.amsl.com> <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
In-Reply-To: <f5b1u2cr365.fsf@troutbeck.inf.ed.ac.uk>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: apps-discuss@ietf.org
Subject: Re: [apps-discuss] I-D Action: draft-ietf-appsawg-xml-mediatypes-05.txt
Precedence: list

Hello Henry, others,

I'm sorry this is very late, but I managed to review most of the -04 
draft. I have checked my comments against -05, so they apply to -05.

First, let me say that my comments in the first review (months ago, 
mostly against closely interleaving spec parts and historical notes) 
have been addressed very well.

This time, the comments are much more on details.

Copyright notice: Given the long history of this draft, I'd guess that 
this document needs the following addition in the copyright:

 >>>>
    This document may contain material from IETF Documents or IETF
    Contributions published or made publicly available before November
    10, 2008.  The person(s) controlling the copyright in some of this
    material may not have granted the IETF Trust the right to allow
    modifications of such material outside the IETF Standards Process.
    Without obtaining an adequate license from the person(s) controlling
    the copyright in such materials, this document may not be modified
    outside the IETF Standards Process, and derivative works of it may
    not be created outside the IETF Standards Process, except to format
    it for publication as an RFC or to translate it into languages other
    than English.
 >>>>

(This can easily be produced with a setting on some attribute in the XML 
source.)

Section 3 and Section 8: These sections have more than a page of text 
before the first subsection. I suggest adding one or more additional 
subsection titles at the start or very close to the start of the section 
for better structuring.

Section 3 says:

 >>>>
   document entities  The media types application/xml or text/xml MAY be
       used.
 >>>>

First, it would be good to have some syntactic delimiter (colon maybe) 
between "document entities" and the rest. Same for the other items.

Second, RFC 2119 defines MAY as follows: "This word, or the adjective 
"OPTIONAL", mean that an item is truly optional." That makes the 
sentence above rather misleading. Using application/xml or text/xml for 
XML document entities is the default case, not merely an option. 
I suggest something like "The media types application/xml or text/xml, 
or a more specific media type, SHOULD be used." (A SHOULD without 
additional qualification is probably too strong.)

 >>>>
    external parsed entities  The media types application/xml-external-
       parsed-entity or text/xml-external-parsed-entity SHOULD be used.
       The media types application/xml and text/xml MUST NOT be used
       unless the parsed entities are also well-formed "document
       entities" and are referenced as such.
 >>>>

The last clause ("and are referenced as such") is confusing. Stuff is 
just served or sent; on the server side, it's unclear how it's being 
referenced, and so such a condition does not make sense operationally.

Note starting with:
 >>>>
       Note that [RFC3023] (which this specification obsoletes)
       recommended the use of text/xml and text/xml-external-parsed-
       entity for document entities and external parsed entities,
 >>>>

Because of the indenting, it looks as if this note only applies to the 
immediately preceding item ("external parameter entities"), but 
content-wise, it seems to apply more generally. The note should be 
outdented (if that's possible) or should be moved to another place where 
it's less confusing to the reader as to what it applies to.

Para starting with:
 >>>
    Compared to [RFC2376] or [RFC3023], this specification alters the
    charset handling of text/xml and text/xml-external-parsed-entity,
 >>>

This is very long, in particular the first sentence. A very easy first 
step towards improvement would be to use a period before the "however", 
and change "however" to "However". Any additional untangling would be 
appreciated, too. Also, "for the text/xml... types" should be changed to 
"for types with a top-level media type text". (several instances)

Last paragraph before Section 3.1: It was unclear what exactly the spec 
tried to say here. I suggest adding a sentence at the end, e.g. "Such 
processing is not specified in this document."

Section 3.1, Encoding considerations (and elsewhere): The term "charset 
encoding" shows up. This is an unfortunate mixture of terminology. MIME 
has the "charset" parameter, and XML has the "encoding" 
pseudo-attribute, but this doesn't mean that these two words should be 
combined just like this. Also, this isn't used uniformly through the 
spec, e.g. there are things like "ASCII-compatible character sets" (see 
also http://www.w3.org/MarkUp/html-spec/charset-harmful.html).

I suggest briefly explaining, in Section 2, that MIME has 
the "charset" parameter and XML has the "encoding" pseudo-attribute, 
and then using a single term. I'd personally suggest "character encoding" 
(see e.g. RFC 3986), but I'd be happy with any term that has been used 
widely already.
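
To make the distinction concrete, here is a small Python sketch (not taken 
from the draft) that reads the two labels from their two different places. 
The function names are hypothetical, and the regular expression is a 
deliberately simplified stand-in for real XML declaration parsing that only 
works for ASCII-compatible entities.

```python
import re
from email.message import Message


def mime_charset(content_type_value):
    """Return the MIME "charset" parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_param("charset")


# Simplified pattern for the "encoding" pseudo-attribute of an XML
# declaration. Assumption: the entity is ASCII-compatible; this is not
# a conforming XML parser.
_ENCODING_RE = re.compile(
    rb"""^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(data):
    """Return the value of the XML "encoding" pseudo-attribute, or None."""
    match = _ENCODING_RE.match(data)
    return match.group(1).decode("ascii") if match else None
```

Both functions return a character encoding label; only the carrier differs 
(a MIME header parameter versus the XML declaration inside the entity).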

Also, we have "7bit or 8bit data, for example data with charset encoding 
UTF-8 or US-ASCII". In general speech, a chiasmus is something nice, but 
it's generally only confusing in specs, so I'd change this to "7bit or 
8bit data, for example data with charset encoding US-ASCII or UTF-8".

Section 3.1, Applications that use this media type: There is a missing 
"and" but a superfluous comma: "is supported by a wide range of generic 
XML tools (editors, parsers, Web agents, ...)*,* *and* generic and 
task-specific applications." Probably, reordering makes this easier to 
read: "is supported by generic and task-specific applications and a wide 
range of generic XML tools (editors, parsers, Web agents, ...)."

Section 3.2: Text/xml Registration
This is defined as an "alias", but the Media Type registry (e.g. in 
contrast to the charset registry) doesn't know the concept of an alias. 
So this should be reworded, e.g. saying that the registration 
information is the same. This also applies to Section 3.4.

Section 3.3, Encoding considerations (and other items): There are two 
"as" prepositions in short succession. What about "Same as 
application/xml, see Section 3.1." or some such?

Section 3.3, Interoperability considerations:
 >>>>
                                                  Identifying XML
       external parsed entities with their own content type should
       enhance interoperability of both XML documents and XML external
       parsed entities.
 >>>>
Lowercase "should" SHOULD be avoided! (There are other cases, too.) I 
suggest changing it to "will", or just saying "enhances".

Section 3.6:
 >>>>
    XML MIME producers are RECOMMENDED to provide means for XML MIME
    entity authors to control the supply of charset parameters for their
    entities, for example by enabling user-level configuration of
    filename-to-Content-Type-header mappings on a file-by-file or suffix
    basis.
 >>>>
"control the supply" reads as if these charset parameters were in ample 
or short supply. I suggest to replace "supply" with "presence" or 
"presence or absence".

Section 3.6:
It may be helpful to create (sub)subsections for producers and 
consumers, because that's what many readers of the spec will look for.

Section 3.6, para starting with (and the following citation and para):
 >>>>
    When a charset parameter is specified for an XML MIME entity, then
 >>>>
This is way too lengthy and complicated. The first sentence is almost 
six lines long. This is another case of mixing history and justification 
with the hard facts, and should be untangled.

"Section 4.3.3 of the [XML] specification": Please fix this to "Section 
4.3.3 of [XML]" (as in other locations) or "Section 4.3.3 of the XML 
specification [XML]".

"When MIME producers conform to the requirements on them stated above,"
"on them" is redundant and should be removed.

Section 4: I'm not sure why this is a separate section, as the content 
is tightly related to Section 3.6. At the minimum, I suggest moving the 
stuff about BOMs from 3.6 to 4. A better solution would be to promote 
3.6 to a section, and include the current section 4 in there (with 
appropriate additional subsections as suggested above).

 >>>>
                                       byte order mark (BOM), which is a
    hexadecimal octet sequence 0xFE 0xFF (or 0xFF 0xFE, depending on
    endianness)
 >>>>
A byte order mark is a character, not an octet sequence. Also, better 
say which endianness is which. This would result in
 >>>>
                                       byte order mark (BOM), which
    appears as the hexadecimal octet sequence 0xFE 0xFF (big-endian)
    or 0xFF 0xFE (little-endian)
 >>>>
The change from "is" to "appears as" is also needed for UTF-8.
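
The distinction between the character and its octet sequences can be 
illustrated with a short Python sketch (not part of the draft; the helper 
name is hypothetical). The BOM is the single character U+FEFF, and what 
appears on the wire depends on the encoding:

```python
BOM = "\ufeff"  # the byte order mark is one character, U+FEFF


def bom_octets(encoding):
    """Return the octets that U+FEFF appears as in the given encoding."""
    return BOM.encode(encoding)


for name in ("utf-16-be", "utf-16-le", "utf-8"):
    # Prints the octets as space-separated hexadecimal pairs.
    print(name, bom_octets(name).hex(" "))
```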

 >>>>
    Applications which convert XML into "utf-8" SHOULD add a BOM after
    conversion is complete.
 >>>>
There are two problems there:
1) "after conversion is complete", if taken literally, would lead to 
very inefficient implementations (adding three bytes at the start of a 
long file). This clause should therefore be removed.
2) There is absolutely no need for a SHOULD. SHOULD is only used when 
there would otherwise be interoperability problems, but XML in UTF-8 
without a BOM 'should' not have any such problems. MAY seems much more 
appropriate here.
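
As a rough sketch of the first point (an illustration only, with a 
hypothetical function name), a streaming converter can emit the optional 
BOM before any converted data, so nothing has to be inserted at the start 
of the output afterwards. The sketch ignores the XML encoding declaration, 
which a real converter would also have to rewrite or remove.

```python
import codecs
import io


def transcode_to_utf8(chunks, source_encoding, out, add_bom=False):
    """Stream-convert byte chunks to UTF-8, optionally writing a BOM first."""
    decoder = codecs.getincrementaldecoder(source_encoding)()
    if add_bom:
        # The BOM goes out first, before any converted data.
        out.write(codecs.BOM_UTF8)
    for chunk in chunks:
        out.write(decoder.decode(chunk).encode("utf-8"))
    # Flush anything the decoder is still holding at end of input.
    out.write(decoder.decode(b"", final=True).encode("utf-8"))


buffer = io.BytesIO()
transcode_to_utf8([b"<a>\xe9", b"</a>"], "iso-8859-1", buffer, add_bom=True)
```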

Section 5: "IRI" is mentioned without a reference. The reference was 
dropped between -04 and -05 because it looked as if it wasn't needed, 
but it should be put back in (with "Dueerst" fixed to "Duerst").

 >>>>
    A registry of XPointer schemes [XPtrReg] is maintained at the W3C.
    Document authors SHOULD NOT use unregistered schemes.  Scheme authors
    SHOULD register their schemes ([XPtrRegPolicy] describes requirements
    and procedures for doing so).
 >>>>
I fully agree with the SHOULDs here, but they don't belong in this spec.

 >>>>
    When a URI has a fragment identifier, it is encoded by a limited
    subset of the repertoire of US-ASCII [ASCII] characters, as defined
    in [RFC3986].
 >>>>
I'm not sure how this helps here. A pointer to the relevant parts of 
the XPointer spec(s) would be better, because some issues with respect 
to XPointer encoding in URIs (and IRIs) can be rather tricky.

Section 6:
Again very complicated language. I'd shorten the background information 
drastically, e.g. as follows (replacing the first *two* paragraphs of 
Section 6):
 >>>>
    An XML MIME entity of type application/xml, text/xml,
    application/xml-external-parsed-entity or
    text/xml-external-parsed-entity MAY use the xml:base attribute, as
    described in [XMLBase], to establish a base URI for that entity
    (see Section 5.1 of [RFC3986]).
 >>>>
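
For readers less familiar with xml:base, the following Python sketch 
(an illustration, not text proposed for the draft) shows the effect for 
the root element only. The function name and the example URIs are 
hypothetical, and nested xml:base attributes are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the predefined XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def effective_base(root, retrieval_uri):
    """Resolve the root element's xml:base against the retrieval URI."""
    # With no xml:base attribute, urljoin returns the retrieval URI unchanged.
    return urljoin(retrieval_uri, root.get(XML_BASE, ""))


doc = '<doc xml:base="http://example.org/base/"><a href="x.xml"/></doc>'
root = ET.fromstring(doc)
base = effective_base(root, "http://example.com/dir/doc.xml")
target = urljoin(base, root.find("a").get("href"))
```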

Section 8: A Naming Convention...
I suggest removing the "a". In RFC 3023, this was in many ways just a 
trial, and so "a" was appropriate. Today, this doesn't have to be 
stressed anymore, and there are no other naming conventions for 
XML-Based Media Types.

"pattern '*/*+xml'": This is shell notation applied to something else 
than file names. It's close to the syntax allowed in an HTTP Accept: 
header, but (as correctly noted in the draft) not the same. It should be 
obvious to many readers, but it would be better if it were clearly 
explained.
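
One possible reading of the pattern, spelled out as a Python sketch (my 
interpretation, not wording from the draft; the function name is 
hypothetical): any top-level type, combined with any subtype that ends in 
the "+xml" suffix.

```python
def is_xml_suffixed(media_type):
    """Return True if the media type follows the '+xml' naming convention."""
    # Drop parameters such as charset, then compare case-insensitively.
    essence = media_type.split(";", 1)[0].strip().lower()
    top, slash, subtype = essence.partition("/")
    if not (top and slash):
        return False
    # The subtype must end in "+xml" and have something before the suffix.
    return subtype.endswith("+xml") and len(subtype) > len("+xml")
```

Note that application/xml and text/xml themselves do not match, which is 
one reason the pattern deserves an explicit explanation.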

 >>>>
       When an XML-based media type is restricted to UTF-8, it is not
       necessary to introduce the charset parameter.  "UTF-8 only" is a
       generic principle and UTF-8 is the default of XML.
 >>>>
I'm not sure what ""UTF-8 only" is a generic principle" is referring to. 
I guess it refers to the idea that using UTF-8 only for certain use 
cases on the Internet simplifies things a lot and is therefore a good 
idea. I fully agree. But a) this should be clearer, and b) it should be 
separated from "UTF-8 is the default of XML", because the former is a 
justification for the antecedent of the previous sentence (which I don't 
think is actually necessary), whereas the latter is a justification for 
the conclusion made in that sentence. So in order of decreasing preference:
1) Remove ""UTF-8 only" is a generic principle and"
2) Split the second sentence into two, explaining the "generic 
principle" in slightly more detail.

"Similarly, media subtypes that do not represent XML MIME...": I don't 
see any similarity to what comes before. If a connective is really 
needed, I'd use "Conversely", but the best solution would be to do 
without connectives altogether.

8.1: "Referencing": Please use a slightly longer subsection title to 
make it easier for readers to understand what this subsection is talking 
about. Maybe "Registration Template Details"?

"Registrations for new XML-based media types under top-level types"
Please remove "under top-level types". It doesn't add any information.

8.2, Reference:  Replace "This specification" with "RFC XXXX".
This makes the template more portable. There are more occasions that 
would benefit from the same change.

8.2, Fragment identifier considerations:
The two provisions "they MAY restrict the syntax to a specified subset 
of schemes" and "They MAY further require support for other registered 
schemes" look okay, but they leave open the question of what's the 
default. As far as I understand, barenames and element scheme pointers 
are the (bare :-) minimum, and so the first provision seems unnecessary. 
Removing that provision (and removing the "further" in the second 
provision) should make it clear that the minimum is the default.

 >>>>
             For fragment identifiers matching the syntax defined in
             [XPointerFramework], where the fragment identifier does
             _not_ resolve per the rules specified there, then process as
             specified in "xxx/yyy+xml";
 >>>>
Is this the case of an unregistered XPointer scheme? If yes, it would be 
good to mention here (not with MUSTard) that this is a bad idea. If not, 
I don't understand what case this addresses.

Section 9:
The last para before subsection 9.1 should be moved to the start of this 
section.

 >>>>
                            the charset portion, if any, of the value of
    the MIME Content-type header
 >>>>
I'd prefer to keep the full Content-type header, as in the examples in 
RFC 3023. Why? I think something like
Content-type: application/xml; charset="UTF-8"
is easier to read than
Content-type charset: charset="UTF-8"
Either this can use different types in the different examples, or use 
application/xml throughout. I'd personally prefer the latter.

 >>>>
                            and the XML MIME entity may contain other
    data in addition to the XML declaration;
 >>>>
The 'may' here is misleading a) because lower-case 'may' SHOULD be 
avoided and b) because (except for an XML declaration) there 'may' 
indeed be no data, but that would be exceedingly rare. So I would change 
this to something like "and the XML MIME entity will contain other data 
in addition to the XML declaration (or might be empty);",
where the parenthetical in my opinion isn't even necessary.

Section 9.1:

Why is there no <?xml version="1.0"?>, similar to 9.2? Why are there no 
cases without any XML declaration (difficult to represent in the current 
way but very realistic)?

 >>>>
    If sent using a 7-bit transport (e.g., SMTP[RFC0821]), the XML MIME
    entity MUST use a content-transfer-encoding of either quoted-
    printable or base64.
 >>>>
I may be wrong here, but in my understanding, it would be possible to 
send pure US-ASCII labeled as UTF-8 through 7-bit transport (i.e. what 
defines the transport is the actual data and not the potential bit 
patterns allowed by the charset).
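
A minimal Python sketch of what I mean (hypothetical helper name): the 
check looks at the octets actually present, not at the charset label. It 
is simplified, since the MIME 7bit rules also limit line length and 
constrain bare CR and LF, which are not checked here.

```python
def has_only_7bit_octets(data):
    """True if every octet is in the range 1..127 (no NUL, no high bit)."""
    return all(0 < octet < 0x80 for octet in data)


ascii_only = '<?xml version="1.0" encoding="UTF-8"?><a>hello</a>'.encode("utf-8")
with_accent = '<?xml version="1.0" encoding="UTF-8"?><a>\u00e9</a>'.encode("utf-8")
```

Here ascii_only passes the check even though it is labeled UTF-8, while 
with_accent does not.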

Section 9.1 and 9.2: The 7-bit/8-bit/binary considerations repeat what's 
already said in 3.1, so they should be replaced with a pointer to that 
section.

9.3: The title includes "and 8-bit MIME entity", but the considerations 
apply equally well e.g. to encoding="iso-2022-jp", which is 7-bit.
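
This can be checked with a few lines of Python (a sketch; the helper name 
is hypothetical): Japanese text encoded as ISO-2022-JP uses escape 
sequences and never sets the high bit.

```python
def encode_iso_2022_jp(text):
    """Encode text as ISO-2022-JP using Python's built-in codec."""
    return text.encode("iso2022_jp")


sample = "<a>\u65e5\u672c\u8a9e</a>"  # three kanji inside an element
octets = encode_iso_2022_jp(sample)
# ESC (0x1B) introduces the switch into and out of the JIS X 0208 set.
high_bit_octets = [octet for octet in octets if octet >= 0x80]
```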

9.5: This is lengthy. Remove "and UTF-8 Entity" from the title, and 
simply say that this is interpreted as UTF-8 because that's the default 
for XML.
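
The default can be seen with Python's standard library parser, used here 
only as one example of an XML processor (the function name is 
hypothetical): an entity with no XML declaration, no BOM and no external 
encoding information is read as UTF-8.

```python
import xml.etree.ElementTree as ET


def parse_without_declaration(text):
    """Encode text as UTF-8 with no XML declaration or BOM, then parse it."""
    data = text.encode("utf-8")
    return ET.fromstring(data)


element = parse_without_declaration("<a>\u00e9\u65e5</a>")
```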

9.6 "Observe that the BOM does not exist." -> "Observe that the BOM 
isn't present." or "Observe that there is no BOM." (the BOM exists as 
well as any other character :-)

That's as far as I got. More maybe over the weekend or next week, sorry.

Regards,   Martin.

On 2013/11/19 21:23, Henry S. Thompson wrote:
> internet-drafts writes:
>
>> A New Internet-Draft is available from the on-line Internet-Drafts
>> directories.  This draft is a work item of the Applications Area
>> Working Group Working Group of the IETF.
>>
>> 	Title           : XML Media Types
>> 	Filename        : draft-ietf-appsawg-xml-mediatypes-05.txt
>
> As for previous drafts, an editors' diff is available, at
>
>    http://www.w3.org/XML/2012/10/3023bis/draft-ietf-appsawg-xml-mediatypes-05_diff.html
>
> This draft contains only very modest changes, which address all but
> one outstanding comment.
>
> My reasoning for not changing 3.6 in response to Bjoern Hoehrmann's
> objection to the treatment of a BOM as authoritative even in the
> presence of a charset parameter [1] is set out in [2].  Martin Duerst
> (in another thread) summarises my reasoning, as follows:
>
>    What's most important now is to know what receivers actually
>    accept. We are not in a design phase, we are just updating the
>    definition ... and making sure we fix problems if there are
>    problems, but we have to use the installed base for the main
>    guidance
>
> Given the wide review these drafts have received (thanks in particular
> to Julian Reschke and Erik Wilde), and the recent endorsement by the
> W3C XML Core WG [3], I hope we can move this back to WG Last Call, get
> an official review, and take it forward.
>
> ht
>
> [1] http://www.ietf.org/mail-archive/web/apps-discuss/current/msg10810.html
> [2] http://www.ietf.org/mail-archive/web/apps-discuss/current/msg10883.html
> [3] http://www.ietf.org/mail-archive/web/apps-discuss/current/msg10849.html
