[websec] When is sniffing heuristic?

Larry Masinter <masinter@adobe.com> Mon, 09 January 2012 00:06 UTC

From: Larry Masinter <masinter@adobe.com>
To: Bjoern Hoehrmann <derhoermi@gmx.net>, Adam Barth <ietf@adambarth.com>
Date: Sun, 08 Jan 2012 16:06:03 -0800
Thread-Topic: When is sniffing heuristic?
Thread-Index: AczOYnThWYRE6GV9RBic65DXqdBRWg==
Message-ID: <C68CB012D9182D408CED7B884F441D4D06123B4E4C@nambxv01a.corp.adobe.com>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: "IETF WebSec WG (websec@ietf.org)" <websec@ietf.org>
Subject: [websec] When is sniffing heuristic?
Precedence: list

There are several different situations where sniffing is of necessity heuristic, because you are 'guessing' the intent of the content.
These arise because the set of possible valid Content-Type values does not partition the space of possible bodies: a given body can be valid under more than one type.

There may be other situations where sniffing is heuristic, but in these cases, sniffing is *necessarily* heuristic because there are multiple results which are valid, and knowing the right result requires additional information about the intent of the communication. The heuristic presumably comes from manually examining some web material for which the intent is known, and then assuming that the resulting generalization applies to all such material and cases.

a) Specializations:

A file which is, for example, application/xhtml+xml is, of necessity, also a valid file of type application/xml. If you were to "sniff" some content that was valid application/xhtml+xml, you could also legitimately claim it was application/xml.
Most data types which are 'text' are also text/plain.
Every type is a subset of application/octet-stream.
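
To make the specialization chain concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any browser sniffs: the function name candidate_types is hypothetical, "text" is approximated as "decodes as UTF-8", and "XHTML" is approximated as "well-formed XML whose root element is html in the XHTML namespace". It returns every type from that small set for which the body qualifies, from most general to most specific.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def candidate_types(body):
    """Return all types (from a small illustrative set) the body qualifies as."""
    # Every sequence of bytes is a valid application/octet-stream.
    candidates = ["application/octet-stream"]

    # Assumption: treat "decodes as UTF-8" as a stand-in for text/plain.
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return candidates
    candidates.append("text/plain")

    # Well-formed XML is also application/xml.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return candidates
    candidates.append("application/xml")

    # Assumption: an html root in the XHTML namespace stands in for
    # "valid application/xhtml+xml"; real validity checking is stricter.
    if root.tag == "{%s}html" % XHTML_NS:
        candidates.append("application/xhtml+xml")
    return candidates


doc = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
       b'<head><title>t</title></head><body/></html>')
print(candidate_types(doc))
```

For the XHTML document, all four types come back. Nothing in the bytes says which one the sender intended, which is why choosing among them is a guess.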

There are numerous examples of this, and a large number of failure cases, e.g., zip-based packaging formats being sniffed as zip when the specialization isn't correctly recognized, image/dng being sniffed as image/tiff, etc.
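
The zip failure case can be sketched as follows. This is a hedged illustration, not a description of any real sniffer: sniff_zip and make_zip are hypothetical names, and the only specialization it knows about is the convention used by EPUB and OpenDocument of storing an entry named "mimetype" first in the archive. A sniffer that stops at the "PK" signature reports application/zip for all of these.

```python
import io
import zipfile


def sniff_zip(body):
    """Guess a media type for a body that may be a zip-based package."""
    # Local file header signature; a signature-only sniffer stops here.
    if not body.startswith(b"PK\x03\x04"):
        return "application/octet-stream"
    try:
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            names = archive.namelist()
            # Assumption: only the "mimetype as first entry" convention
            # (EPUB, OpenDocument) is recognized as a specialization.
            if names and names[0] == "mimetype":
                return archive.read("mimetype").decode("ascii").strip()
    except (zipfile.BadZipFile, UnicodeDecodeError):
        pass
    return "application/zip"


def make_zip(entries):
    """Build an in-memory zip from (name, data) pairs, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in entries:
            archive.writestr(name, data)
    return buffer.getvalue()


plain = make_zip([("readme.txt", "hello")])
epub_like = make_zip([("mimetype", "application/epub+zip"),
                      ("content.opf", "<package/>")])
print(sniff_zip(plain), sniff_zip(epub_like))
```

A package format that does not follow this convention would still be reported as application/zip, which is the failure described above.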

b) "Polyglot":

This is a situation where data is intentionally prepared to be interpretable as two different media types, possibly to be served and later processed as either, where the intention is for the content to behave similarly under ordinary processing, but to be amenable to specialized processing only defined for one or the other media type. The XHTML/HTML polyglot spec

http://dev.w3.org/html5/html-xhtml-author-guide/

is of course the most relevant use case. The same content could be sniffed to be either type. This is different from the specialization case because neither of the media types is a subset of the other.
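
As a rough illustration of the polyglot case, the Python sketch below feeds the same markup to an XML parser and to an HTML parser and extracts the title from each. The document string and the helper names (TitleCollector, title_as_html, title_as_xhtml) are made up for this example, and Python's html.parser is only a stand-in for a real HTML parser; it does not implement the HTML5 parsing algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical document written to be acceptable as both HTML and XHTML.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    '<head><title>Hello</title></head>'
    '<body><p>Same markup, two parsers.</p></body></html>'
)


class TitleCollector(HTMLParser):
    """Collect the text inside the title element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_as_html(doc):
    """Process the document as text/html."""
    parser = TitleCollector()
    parser.feed(doc)
    parser.close()
    return parser.title


def title_as_xhtml(doc):
    """Process the same document as application/xhtml+xml."""
    root = ET.fromstring(doc)
    return root.findtext("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))


print(title_as_html(POLYGLOT), title_as_xhtml(POLYGLOT))
```

Both processing paths succeed and agree, so a sniffer examining only the bytes has no basis for preferring one type over the other.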

c) "Multiview":

I don't know exactly what to call this, but it is the situation where the same content is intentionally valid as two different media types, neither media type is a subset of the other, and the treatment as the different types is intentionally different. The use case for multiview I was looking at was one where the same content could be viewed as XHTML (for a presentational view) and also as RDF (for a data point of view).

This is different from specialization (since the two types overlap but one is not a subset of the other), and polyglot (since the material is intended to have different meaning in its ordinary application).
