Re: [websec] #22: content-type sniffing should include charset sniffing

"Martin J. Dürst" <duerst@it.aoyama.ac.jp> Mon, 24 October 2011 06:37 UTC

Message-ID: <4EA5079B.9050700@it.aoyama.ac.jp>
Date: Mon, 24 Oct 2011 15:37:15 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100722 Eudora/3.0.4
MIME-Version: 1.0
To: Larry Masinter <masinter@adobe.com>
References: <059.f8bc48a3163d95d888ee3a23a2ca7fb9@trac.tools.ietf.org> <4EA4DA67.3000502@gondrom.org> <CAJE5ia_7PO_g-0P9OvXsSazwkTkgWz6-Vs4N5tFvg=VygfFt5g@mail.gmail.com> <C68CB012D9182D408CED7B884F441D4D0605EFA3C4@nambxv01a.corp.adobe.com> <CAJE5ia_tq8D4wTc51rMV68N6KhUTMpeT_FTW6Xag7bT76tz5Xw@mail.gmail.com> <C68CB012D9182D408CED7B884F441D4D0605EFA3C6@nambxv01a.corp.adobe.com>
In-Reply-To: <C68CB012D9182D408CED7B884F441D4D0605EFA3C6@nambxv01a.corp.adobe.com>
Content-Type: text/plain; charset="UTF-8"; format="flowed"
Content-Transfer-Encoding: 7bit
Cc: "websec@ietf.org" <websec@ietf.org>
Subject: Re: [websec] #22: content-type sniffing should include charset sniffing

I agree with Adam and Tobias that we should not pull all of charset
sniffing into this document. Many charset details depend on the MIME
type in the first place, and are carefully described in the respective
specs. For some transfer protocols, the question of charset may be
irrelevant (e.g. for text over WebSocket, which prescribes and checks
for UTF-8).

Larry is right that in some cases, some preliminary charset sniffing is 
necessary to get at some information at the start of the document, but I 
think we should strictly limit this draft to these cases.
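
To illustrate how small such "preliminary" charset sniffing can be, here
is a minimal sketch, assuming we only look at a byte-order mark or an
XML declaration in the first bytes of the body. The function name
preliminary_charset and the table _BOMS are hypothetical, chosen for
this example; this is not the algorithm of any draft or browser.

```python
import re
from typing import Optional

# Byte-order marks, longest first, so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Matches an ASCII-compatible XML declaration and captures the encoding name.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def preliminary_charset(head: bytes) -> Optional[str]:
    """Guess a charset from the first bytes only; None means 'unknown'."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # Anything beyond this (e.g. <meta> prescans) is left to the
    # format-specific spec, in line with the scoping argued for above.
    return None
```

Once a charset is known, the start of the document can be decoded far
enough to read a DOCTYPE or root element; everything else stays with the
spec for the media type in question.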

Regards,    Martin.

On 2011/10/24 13:14, Larry Masinter wrote:
> I was talking about the necessary dependency of the specifications -- that you couldn't specify media type sniffing completely without making at least a normative reference to charset sniffing.
>
> The fact that the code works that way is evidence, of course, but we're not talking about the possibility of implementation (where a single implementation is evidence) but rather the orthogonality of interfaces (where the question is whether ALL implementations must follow this pattern).
>
> Larry
>
>
>
>
> -----Original Message-----
> From: Adam Barth [mailto:ietf@adambarth.com]
> Sent: Sunday, October 23, 2011 8:37 PM
> To: Larry Masinter
> Cc: Tobias Gondrom; websec@ietf.org
> Subject: Re: [websec] #22: content-type sniffing should include charset sniffing
>
> I mean, that's how the code works, so it must be possible.  :)
>
> Adam
>
>
> On Sun, Oct 23, 2011 at 8:32 PM, Larry Masinter <masinter@adobe.com> wrote:
>> I know it's complicated, but scanning text is necessarily part of determining which application/something+xml you have. I think (but should really check before saying this) that XML media type registrations describe what the DOCTYPE, XML namespace, or root element is, and that, to properly "sniff" them, you'd have to scan text. But before you scan text, you have to determine the charset.
>>
>> So if we're going to support sniffing of media types in general, I don't see how we can do that without also specifying charset determination.
>>
>>
>>
>> Larry
>>
>> -----Original Message-----
>> From: websec-bounces@ietf.org [mailto:websec-bounces@ietf.org] On Behalf Of Adam Barth
>> Sent: Sunday, October 23, 2011 8:28 PM
>> To: Tobias Gondrom
>> Cc: websec@ietf.org
>> Subject: Re: [websec] #22: content-type sniffing should include charset sniffing
>>
>> The charset sniffing is also complicated by the fact that sometimes user agents need to parse some of the HTML to find a <meta> element.
>> In some situations, user agents need to restart the parsing algorithm, which is quite delicate and better to describe in the same document as HTML parsing (at least for use by HTML processing engines).
>>
>> Adam
>>
>>
>> On Sun, Oct 23, 2011 at 8:24 PM, Tobias Gondrom <tobias.gondrom@gondrom.org> wrote:
>>> <hat="individual">
>>> I tend not to agree with that.
>>>
>>> The fact that charset sniffing might happen at the same time as
>>> mime-sniffing does not seem like a strong argument to include this in
>>> the draft.
>>>
>>> Furthermore I would rather have these issues separate:
>>> First you determine the content-type and then after that you may want
>>> to determine the charset used within that content-type (if you really
>>> have to sniff the charset). I can also imagine that the charset sniffing
>>> algorithm might depend on the application identified by the sniffed
>>> mime-type, which again would speak against throwing it in together
>>> with mime-sniffing.
>>>
>>> Kind regards, Tobias
>>>
>>>
>>>
>>> On 24/10/11 00:55, websec issue tracker wrote:
>>>>
>>>> #22: content-type sniffing should include charset sniffing
>>>>
>>>>   the HTML5 spec contains some algorithms for sniffing charset,
>>>>   overriding labeled charset, etc.
>>>>
>>>>   MIME parameters like charset are as much a part of the content-type
>>>>   as the base internet media type, and any sniffing of parameters and
>>>>   other metadata (overriding content-type or guessing where it is not
>>>>   supplied or wrong) should be included in this document, since the
>>>>   sniffing will happen at the same time.
>>>>
>>>
>>> _______________________________________________
>>> websec mailing list
>>> websec@ietf.org
>>> https://www.ietf.org/mailman/listinfo/websec
>>>