[websec] When is sniffing heuristic?

Larry Masinter <masinter@adobe.com> Mon, 09 January 2012 00:06 UTC

From: Larry Masinter <masinter@adobe.com>
To: Bjoern Hoehrmann <derhoermi@gmx.net>, Adam Barth <ietf@adambarth.com>
Date: Sun, 08 Jan 2012 16:06:03 -0800
Message-ID: <C68CB012D9182D408CED7B884F441D4D06123B4E4C@nambxv01a.corp.adobe.com>
Cc: "IETF WebSec WG (websec@ietf.org)" <websec@ietf.org>
Subject: [websec] When is sniffing heuristic?

There are several situations where sniffing is of necessity heuristic, because you are 'guessing' the intent of the content.
These arise because the set of possible valid Content-Type values does not partition the space of possible bodies: a single body can be valid under more than one type.

There may be other situations where sniffing is heuristic, but in these cases sniffing is *necessarily* heuristic because multiple results are valid, and knowing the right one requires additional information about the intent of the communication. The heuristic presumably comes from a manual examination of some web material where such information about intent is known, and from projecting that the resulting generalization applies to all such material and cases.

a) Specializations:

A file which is, for example, application/xhtml+xml is, of necessity, also a valid file of type application/xml. If you were to "sniff" some content that was valid application/xhtml+xml, you could also legitimately claim it was application/xml.
Most data types which are 'text' are also text/plain.
Every type is a subset of application/octet-stream.
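
To make the nesting concrete, here is a minimal Python sketch (my own illustration, not part of any sniffing spec). The function name plausible_types is hypothetical, and checking only the root element's namespace is an assumed stand-in for real XHTML validity. It returns every type in the chain that a body could legitimately be labeled with, rather than a single answer:

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"


def plausible_types(body: bytes) -> List[str]:
    """Return every media type in the chain that `body` could be labeled as.

    Hypothetical helper: the XHTML check only looks at the root element,
    which is a rough approximation of validity, not a real validator.
    """
    # Every body is at least a sequence of octets.
    types = ["application/octet-stream"]
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return types
    # Decodable text can always be presented as plain text.
    types.append("text/plain")
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return types
    # Well-formed XML is a candidate for the generic XML type.
    types.append("application/xml")
    if root.tag == "{%s}html" % XHTML_NS:
        # The specialization: also a candidate for XHTML.
        types.append("application/xhtml+xml")
    return types
```

For an XHTML body this returns all four types. Picking one of them is the heuristic step, because nothing in the bytes says which one the sender intended.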

There are numerous examples of this, and a large number of failure cases, e.g., zip-based packaging formats being sniffed as zip when the specialization isn't correctly recognized, image/dng being sniffed as image/tiff, etc.
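
The zip failure case can be sketched as follows. This is an illustration under stated assumptions: it uses EPUB as the example packaging format, which stores an uncompressed first entry named "mimetype" containing "application/epub+zip". The helper names make_epub_like, sniff_by_magic and sniff_with_specialization are hypothetical, and the archive built here is not a complete EPUB.

```python
import io
import zipfile

EPUB_TYPE = "application/epub+zip"


def make_epub_like() -> bytes:
    """Build a tiny in-memory zip whose first entry declares an EPUB type."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", EPUB_TYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.txt", "hello")
    return buf.getvalue()


def sniff_by_magic(body: bytes) -> str:
    """Naive sniffer: looks only at the leading zip local-file-header bytes."""
    if body.startswith(b"PK\x03\x04"):
        return "application/zip"
    return "application/octet-stream"


def sniff_with_specialization(body: bytes) -> str:
    """Same as above, but also recognizes the one specialization it knows."""
    generic = sniff_by_magic(body)
    if generic != "application/zip":
        return generic
    with zipfile.ZipFile(io.BytesIO(body)) as zf:
        names = zf.namelist()
        if names and names[0] == "mimetype":
            declared = zf.read("mimetype").decode("ascii", "replace").strip()
            if declared == EPUB_TYPE:
                return EPUB_TYPE
    return generic
```

Both answers are valid descriptions of the same bytes. The second sniffer is only as good as its list of known specializations, so any zip-based format it has not been taught still falls back to application/zip.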


b) "Polyglot":

This is a situation where data is intentionally prepared to be interpretable as two different media types, possibly to be served and later processed as either. The content is intended to behave similarly under ordinary processing, while remaining amenable to specialized processing that is defined for only one or the other media type. The XHTML/HTML polyglot spec

http://dev.w3.org/html5/html-xhtml-author-guide/

is of course the most relevant use case. The same content could be sniffed to be either type. This is different from the specialization case because neither of the media types is a subset of the other.
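
As a rough illustration of the polyglot property, the sketch below feeds one small document to two Python standard-library parsers and compares the element sequence each one sees. The sample markup and the helper names are my own assumptions. Note also that html.parser is a tolerant tokenizer rather than a conforming HTML5 parser, so this shows the idea, not a conformance test.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List

# Hypothetical sample document written to be acceptable to both parsers.
POLYGLOT = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">'
    '<head><title>Polyglot</title></head>'
    '<body><p>Hello<br/>world</p></body></html>'
)


class StartTagCollector(HTMLParser):
    """Records the name of every start tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def elements_as_html(text: str) -> List[str]:
    collector = StartTagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


def elements_as_xml(text: str) -> List[str]:
    root = ET.fromstring(text)
    # Strip the "{namespace}" prefix ElementTree puts on each tag name.
    return [el.tag.split("}", 1)[-1] for el in root.iter()]
```

Both functions report the same element sequence for this document, so a sniffer looking at the bytes alone has no basis for preferring text/html over application/xhtml+xml.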

c) "Multiview":

I don't know exactly what to call this, but it is the situation where the same content is intentionally valid as two different media types. Neither media type is a subset of the other, but the treatment as the different types is intentionally different. The use case for multiview I was looking at was one where the same content could be viewed as XHTML (for a presentational view) and also as RDF (for a data point of view).

This is different from specialization (since the two types overlap but one is not a subset of the other), and from polyglot (since the material is intended to have different meaning in its ordinary application).