Re: [media-types] Thoughts on suffixes, single and multiple

Ted Thibodeau Jr <tthibodeau@openlinksw.com> Thu, 04 April 2024 20:27 UTC

From: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Message-Id: <E83E80FF-5810-4A53-85D8-E5095F9C1C1C@openlinksw.com>
Content-Type: multipart/signed; boundary="Apple-Mail=_B5031A95-9A67-4065-BB03-E2E517C803DD"; protocol="application/pkcs7-signature"; micalg="sha-256"
Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.21\))
Date: Thu, 04 Apr 2024 16:26:47 -0400
In-Reply-To: <1c404c4d-437c-464a-b414-4e0d39c1d8ea@alvestrand.no>
To: media-types@ietf.org
References: <2E20FEDE-C766-43EE-A6E2-1FB63E79CF0B@mnot.net> <1c404c4d-437c-464a-b414-4e0d39c1d8ea@alvestrand.no>
X-Mailer: Apple Mail (2.3445.104.21)
Archived-At: <https://mailarchive.ietf.org/arch/msg/media-types/5ZGZVNyYbBfMklZgnnxIEizUEMk>
Subject: Re: [media-types] Thoughts on suffixes, single and multiple

tl;dr: I am *not* suggesting deprecation of structured suffixes in toto.

I am strongly suggesting that the simple structure long in use be conformed with, reaffirmed, and clarified —

   type/[[sub-sub-subtype+]sub-subtype+]subtype[;parameter=value]

Read on:

On Apr 4, 2024, at 03:36 AM, Harald Alvestrand <harald@alvestrand.no> wrote:
> 
> On 4/3/24 08:29, Mark Nottingham wrote:
>> After the meeting in Brisbane, some of us went aside to continue to the multiple suffixes discussion. There, we quickly came to the conclusion that we should deprecate the concept of suffixes in media subtypes -- i.e., they would still be syntactically allowed, but would have no meaning or registry. Martin Thomson and I took an action to write something down about this.
>> Once I was home, I started to think more carefully about this and do research. One thing that I haven't yet seen is a summary of how suffixes are currently used (apologies if I missed someone else's effort there). These are the counts for each suffix in the registry that I came up with about a week ago:
>> +xml = 439
>> +json = 145
>> +ber = 0
>> +cbor = 16
>> +der = 1
>> +fastinfoset = 1
>> +wbxml = 7
>> +zip = 24
>> +tlv = 1
>> +json-seq = 2
>> +sqlite = 1
>> +jwt = 6
>> +gzip = 2
>> +cbor-seq = 4
>> +zstd = 0
>> +yaml = 2
>> +cose = 0
>> As you can see, we have a few very widely used suffixes (in a registry of 1,588 entries as of that survey), and many very seldom used ones - with a few not used at all.


Those counts don't represent media type *usage*. 

They represent media type *registrations*.

I wanted to put some *usage* statistics here, drawing on Google or another search engine for data about the media types they've encountered, but I cannot find a way to search on `Content-Type` headers. The nearest I found was Google's `filetype:<type>` search term, which matches on filename extension rather than on the served media type, and so is only minimally helpful.

I think some analysis of what's out there on the (public) web, in actual *use*, would be far more informative than counting what is in the IANA registry, at present. (It would be great to also get some analysis of what's out there in various LANs, VPNs, and other non-public web, but this seems beyond hope.)
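
For anyone who does have a crawl or server log to hand, the tally I have in mind could be sketched roughly as follows. This is only an illustration: the function name `tally_suffixes` and the `(none)` bucket are my own inventions, and it assumes you already have a list of observed `Content-Type` header values from somewhere.

```python
from collections import Counter


def tally_suffixes(content_types):
    """Count the final '+suffix' of each observed Content-Type value.

    `content_types` is any iterable of raw header values (an assumption:
    the caller has already collected them from a crawl or a log).
    """
    counts = Counter()
    for value in content_types:
        # Drop any parameters and normalise case before inspecting.
        essence = value.split(";", 1)[0].strip().lower()
        _, _, subtype = essence.partition("/")
        if "+" in subtype:
            # Only the text after the last '+' is treated as the suffix.
            counts["+" + subtype.rsplit("+", 1)[1]] += 1
        else:
            counts["(none)"] += 1
    return counts
```

Feeding it real observations, rather than registry entries, is the whole point; the registry counts quoted above say nothing about how often each type is actually served.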


>> The widespread use of +xml and +json in particular made me more cautious about deprecating suffixes altogether -- especially since we still sort-of believe that they are indeed used by (or at least potentially useful to) things like editors to hint syntactic conventions.
>> So, that leaves a few different options, considering the constraints we have:
>> 1) Disallow more than one "+" sign in media subtypes, as floated at the meeting. This would put a fair amount of pressure on the registry's ability to reflect reality, depending on how widely deployed some things get (although we could grandfather some types in to ease the pressure here).

I would be strongly against this idea.


> There's also the option that was floated at the meeting: Allow registration of types with any number of + signs, in any position, without regard to the suffix registry (which would close); explain in informational (not normative) text what + signs have been used for and why it's not a good idea to police them.

I would be similarly strongly against this idea.


> This would allow, for instance, the registration of text/code-c++ as a legal media type....
> 
>> 2) Syntactically allow suffixes before the last one, but not assign them any meaning or register them; e.g., application/foo+bar+xml would be an XML format, but who knows what bar is; effectively, it's just part of "foo+bar". This would allow people to define suffix-like things, but wouldn't give them any recognition or coordination -- potentially leading to the need to formalise things more down the road, just as we did in the first round of suffixes.
> 
> That's the opposite sequence of the one in the current draft. I think we saw dislike expressed for this earlier.

Multiple `+` have always been syntactically valid (perhaps because no one thought to forbid them; oops, that horse has long since left the barn). The major issue, in my experience, has been that there's no definitive guidance on how a consumer encountering such a media type should treat it. I think that's not hard to resolve. (Please read on, dear reader...)


>> 3) Consider multiple suffixes, when they occur, to be unrelated hints as to the syntax of the format -- i.e., there is no processing model, there is no ordering (although a registrant would have to choose an order; registrations with different orderings should be refused). Effectively, suffixes would just be a 'bag of hints' about the format being used.
> 
> In the "we don't police this" model, there would be advice that registering both foo+bar+baz and foo+baz+bar is likely to be a Bad Idea, but we wouldn't want to actually forbid it - someone might come up with an use case.

Only one of these two hypotheticals should be valid, because the nature of compound media types has been (and should remain) understood (though not specified; hence our draft RFC) as —

   type/[[sub-sub-subtype+]sub-subtype+]subtype

Working from this understanding, either `bar` is a subtype of `baz`, or `baz` is a subtype of `bar`. Both cannot be true, unless `baz` and `bar` are the same, in which case, only one *should* be registered (except to the limited degree that aliases are permitted; `bar` might be an alias of `baz`, or vice versa), and any subtype `foo` should be registered as a subtype of that one (i.e., `foo+bar` *or* `foo+baz` but not both). 
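
A registry-side check for this "same segments, different order" situation could look something like the sketch below. The helper names `split_media_type` and `conflicting_order` are hypothetical, and the check reflects only my reading of the structure above, not any existing registration procedure.

```python
def split_media_type(media_type: str) -> tuple[str, list[str]]:
    """Split 'type/a+b+c;param=x' into ('type', ['a', 'b', 'c'])."""
    # Parameters play no part in the ordering question, so drop them.
    essence = media_type.split(";", 1)[0].strip().lower()
    top, _, subtype = essence.partition("/")
    if not top or not subtype:
        raise ValueError(f"not a media type: {media_type!r}")
    return top, subtype.split("+")


def conflicting_order(a: str, b: str) -> bool:
    """True when a and b use the same '+' segments in a different order."""
    top_a, parts_a = split_media_type(a)
    top_b, parts_b = split_media_type(b)
    return (
        top_a == top_b
        and parts_a != parts_b
        and sorted(parts_a) == sorted(parts_b)
    )
```

With this, `conflicting_order("application/foo+bar+baz", "application/foo+baz+bar")` is true, which is exactly the pair of which I say only one should be registrable.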


>> I'd be interested in hearing people's reactions to these.
>> Separately, I think we need to settle a few other matters to make progress:
>> ### Defining What Suffixes Are For (no matter how many there are)
>> After the discussion in Brisbane, I strongly believe that suffixes should ONLY be for hinting about the syntax or format convention in use, as an aid eg to editors, syntax highlighters, etc. This is the proven use case for media type suffixes. Suffixes should not be used to hint semantics; only syntax. We should have strong language about the dangers of using suffixes to hint particular kinds of processing; cf the previous discussion on the 'polyglot problem' and the potential security issues around performing processing based upon suffixes.
>> The suffix registration process should be designed to assure that only such suffixes are registered.
>> Note that in this view, "+ld" is very likely unregistrable.
>> ### Cleaning Up Existing Suffixes
>> +gzip and +zstd are problematic; the former should be disallowed for new registrations, and the latter should be removed or obsoleted in the registry. Likewise, I am highly suspicious of +jwt and +cose. +zip _is_ a format convention, so I suppose it's OK?


I am disturbed by your suspiciousness, which doesn't seem to be based on anything more than goosebumps. Do your suspicions have any technical basis?

JWT and COSE both have uniquely specified syntaxes, and I believe this qualifies them to be subtype registrations (i.e., application/jwt and application/cose), and there may well eventually be *appropriately differentiated* sub-subtypes of these (e.g., application/foo+jwt and application/bar+cose). *Today*, I don't think such sub-subtypes exist, and any hinting at the content type enveloped by these subtypes should be provided by `;contents=<content media type>` or similar.

GZIP is a compressed archive of a single document. It *should* be typed as `application/gzip`, which could reasonably be decorated with `;contents=<content media type>` or similar (there certainly could be a better parameter name).
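
As a rough sketch of how a consumer might read such a hint, Python's standard `email.message.Message` can already pick a parameter out of a `Content-Type` value. Note that `contents` is only the placeholder parameter name floated above, not a registered parameter, and `archive_contents_hint` is a name I made up for illustration. The parameter value has to be quoted because it contains a `/`.

```python
from email.message import Message


def archive_contents_hint(content_type: str, param: str = "contents"):
    """Return (archive media type, hinted content media type or None).

    `param` defaults to the hypothetical `contents` parameter; no such
    parameter is registered for application/gzip today.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_type() yields the bare type/subtype in lowercase;
    # get_param() returns the unquoted parameter value, or None if absent.
    return msg.get_content_type(), msg.get_param(param)
```

The consumer dispatches on `application/gzip` alone; the hint only becomes relevant after decompression.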

Best I can tell, the registered `+zstd` structured suffix has not been used by anyone for any registered media type. The only entry in the main table is for `application/zstd` itself:

   https://www.iana.org/assignments/media-types/media-types.xhtml

If it *were* used, my (quick, cursory) reading suggests that each Zstd archive would expand/extract to a single document, whose media type, I submit, *should* be hinted by `;contents=<content media type>` or similar.

Structured suffixes are useful, and even necessary. When properly specified, they tell the consumer how to parse the document for internal fragment identifiers, vital for processing of Linked Data (including but not *only* JSON-LD), among other things.

I think that whether a proposed structured suffix should be allowed/registered depends on the associated format specification, and whether that specification defines a format which is processed/interpreted differently than the formats identified by existing structured suffixes.

There may be other considerations, but they really ought to be approached from a position of "I don't know whether this is good or bad, yet; here are my technical concerns..." not from a prejudged position of "Someone once wrote a screed against things like this, so all specifics, iterations, and variations must be bad," which prejudgement comes across through much of the recent writings on this subject.


Media types have always seemed clear to me, though the internal addition of sub-subtype strings (the long used and apparently generally accepted, until relatively recently, single `+` assembly, `type/sub-subtype+subtype`) has always seemed a bit odd.

Still — in relatively simple language, we have, for instance —

   application/json

— which is a super type of —

   application/ld+json

— which is in turn a super type of the currently anticipated-to-be-registered —

   application/vc+ld+json

Perhaps in an easier to understand order, we have the anticipated-to-be-registered —

   application/vc+ld+json

— which is a sub type of —

   application/ld+json

— which is in turn a sub type of —

   application/json


A consumer encountering a document with media type `application/vc+ld+json` can *fully* process it as just that, or can process it with *limited understanding* as `application/ld+json`, and with *more limited understanding* as `application/json`.
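
That degradation can be sketched in a few lines. This is only an illustration of the reading I am arguing for, not anything a current RFC specifies; `fallback_chain` and `pick_handler` are hypothetical names, and `handlers` stands for whatever set of media types a given tool knows how to process.

```python
def fallback_chain(media_type: str) -> list[str]:
    """List the media type followed by each of its supertypes.

    Each step drops the leftmost '+'-separated segment of the subtype.
    """
    essence = media_type.split(";", 1)[0].strip().lower()
    top, _, subtype = essence.partition("/")
    parts = subtype.split("+")
    chain = []
    for i in range(len(parts)):
        chain.append(top + "/" + "+".join(parts[i:]))
    return chain


def pick_handler(media_type: str, handlers):
    """Return the most specific entry of `handlers` that applies, or None."""
    for candidate in fallback_chain(media_type):
        if candidate in handlers:
            return candidate
    return None
```

Here `fallback_chain("application/vc+ld+json")` gives `application/vc+ld+json`, then `application/ld+json`, then `application/json`, so a tool that knows only JSON-LD and JSON would settle on `application/ld+json`.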

Note that this is not the damned (by some few writings — have a look at some *positive* writings! <https://www.google.com/search?q=polyglot+code>) "polyglot" that has varying *interpretation* by different consumers, but rather a variance of *depth of interpretation*. 

This variance is somewhat analogous to the depth of zoom shown on a Google Earth location. At one mile to the pixel you get one idea of what you're looking at; at 1/2 mile to the pixel, you get a different idea; at one yard to the pixel, you get a very different idea. None of these ideas is *wrong*; it's just a different depth.

Another analogy — `application/json`, `application/ld+json`, and `application/vc+ld+json` are somewhat akin to Simple, Compound, and Complex sentences, written once but readable with varying degrees of "understanding". See <https://www.google.com/search?q=simple+to+compound+to+complex> for other discussions of this matter.



In clearer English, we could suppose —

   shape/quadrilateral

   shape/rectangle+quadrilateral

   shape/square+rectangle+quadrilateral

All quadrilaterals are shapes. But some shapes (e.g., circles, triangles) are not quadrilaterals.

All rectangles are quadrilaterals. But some quadrilaterals (e.g., kite, trapezoid) are not rectangles.

All squares are rectangles. But some rectangles are not squares.

Some tools may be able to operate on squares, but not on rectangles, nor on other non-square shapes in general.

Other tools may be able to operate on quadrilaterals, but not have any features specifically relevant to rectangles or to squares, nor be able to handle circles or triangles at all.

In similar fashion, some tools may be able to operate on JSON (`application/json`) which includes JSON-LD (`application/ld+json`), but not have any features which specifically address JSON-LD.

Other tools may fully consume the additional features delivered by JSON-LD, or the nascent `application/vc+ld+json`.

There is *one* structured suffix in action here -- `+json`.  The fragment rules for interpretation of `application/ld+json` and `application/vc+ld+json` do not differ in any way which would require definition of `+ld+json` as a structured suffix.


There have been some (to my mind) regrettable media type registrations, such as `image/svg+xml`, which degrades to `image/xml`, but which I think really ought to degrade to `application/xml`, meaning that its primary registration should be `application/svg+xml` *even though its desirably rendered content is an image*.


More regrettable is the pattern of appending a compression or archive structured suffix on another media type. Consumption of these documents requires handling of the compression/archive structure, but requires no handling of the contents of the compression/archive document!

Anything currently `application/blah+gzip` or `application/blah+zip` or `application/blah+tar` should be reduced to `application/gzip` or `application/zip` or `application/tar`, respectively, with the possible addition of a `;profile=application/blah` or `;profile=image/blah`. Of these, however, such `;profile=` additions *only* work for `application/gzip`, which always expands to a single document. They do not work for `application/zip` or `application/tar`, each of which typically expands to multiple documents, each with its own media type.


To be clear: I strongly believe that archives should always have the media type associated with that archive format. They should not include in that media type any "hint" of their contents *except* when the content is singular for all archives of that sort, and then the hint should be provided via `;parameter=`, whether that `parameter` is `;profile=` or otherwise. The hint should *never* be wedged into the media type a la `application/vc+ld+json+sd-jwt`. That should be simply `application/sd-jwt`, and such a document should use the `cty` (for "content type") header (already built into SD-JWT, per RFC 7515) for the *enveloped contents* of the *enveloping archive* SD-JWT. (The nearby `typ` header should be `sd-jwt`.)

See —

   https://www.rfc-editor.org/rfc/rfc7515.html#section-4.1.9

— and —

   https://www.rfc-editor.org/rfc/rfc7515.html#section-4.1.10
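
To make that concrete, here is a minimal sketch of a JOSE header carrying both parameters. It covers only the protected header, not signing or the rest of SD-JWT; the header values are illustrative, and the three function names are mine. RFC 7515 recommends omitting the `application/` prefix from `typ` and `cty` values when no other `/` appears, which is why the short forms are used and then expanded.

```python
import base64
import json


def encode_jose_header(header: dict) -> str:
    """Serialise a header as unpadded base64url JSON, as JWS segments are."""
    raw = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_jose_header(segment: str) -> dict:
    """Reverse encode_jose_header, restoring the stripped '=' padding."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def full_media_type(value: str) -> str:
    """Restore the 'application/' prefix when the short form was used."""
    return value if "/" in value else "application/" + value


# Illustrative header only: `typ` names the envelope, `cty` its contents.
example_header = {"alg": "ES256", "typ": "sd-jwt", "cty": "vc+ld+json"}
```

A consumer that decodes this header learns the envelope type from `typ` and the enveloped type from `cty`, with nothing wedged into the media type itself.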


(Further, it appears to me that the archive media types should probably all be converted from, e.g., `application/zip`, to `multipart/zip`, as they are indeed multipart!)

The keen-eyed reader will note that I am suggesting that some RFCs and their registered media types require revision, because the registrations and recommendations of media types they currently include violate both the spirit and the letter of previous RFCs, particularly those specifying media types themselves.


Again —

I am *not* suggesting deprecation of structured suffixes in toto.

I am strongly suggesting that the simple structure long in use be conformed with, reaffirmed, and clarified —

   type/[[sub-sub-subtype+]sub-subtype+]subtype[;parameter=value]

End of rant. Sorry it went so long.

Be seeing you,

Ted







--
A: Yes.                          http://www.idallen.com/topposting.html
| Q: Are you sure?           
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.           //                tel:+1-781-273-0900,1,32
Senior Support & Evangelism  //        mailto:tthibodeau@openlinksw.com
                             //              http://twitter.com/TallTed
OpenLink Software, Inc.      //               http://www.openlinksw.com
    117 Kendrick Street, Suite 300, Needham Heights, MA 02494-2722
     Weblog    -- http://www.openlinksw.com/blogs/
     Community -- https://community.openlinksw.com/
     LinkedIn  -- http://www.linkedin.com/company/openlink-software/
     Twitter   -- http://twitter.com/OpenLink
     Facebook  -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers