[Json] JSON encodings

Willem Bogaerts <w.bogaerts@kratz.nl> Wed, 30 April 2014 10:18 UTC

Message-ID: <5360CDFA.9040803@kratz.nl>
Date: Wed, 30 Apr 2014 12:18:34 +0200
From: Willem Bogaerts <w.bogaerts@kratz.nl>
To: json@ietf.org
Archived-At: http://mailarchive.ietf.org/arch/msg/json/lI-Uclyv82Po6806DLl6YPVXwZ0
Subject: [Json] JSON encodings

Dear Group,

I am new here, so this issue may have been handled in the past. I'd like
to solve an issue with the JSON standard.

The issue is that, according to RFC 7159, JSON is both text-based and
binary, and those two are not exactly compatible. With a MIME type of
"application/json", one would expect a binary format, which it really is
not. A MIME type like "text/json;charset=utf8" would be more natural,
for example. This would allow the parser to "get ready" for the data
before choking on it. In fact, most implementations check for an
encoding, call some library to convert the data to an internally used
encoding, and only then call the JSON parser, which does not need to be
aware of all the possible encodings.
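
That decode-first, parse-second flow can be sketched roughly as follows.
This is only an illustration: the function name parse_json_body is made
up, and the fallback to UTF-8 when no charset is given is an assumption,
not something RFC 7159 prescribes for this parameter.

```python
import json
from email.message import Message


def parse_json_body(body: bytes, content_type: str):
    """Decode the body with the declared charset, then parse it as JSON.

    Hypothetical helper; the UTF-8 fallback is an assumption.
    """
    # Message provides a standard parser for MIME header parameters.
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset("utf-8")
    # Strict decoding: ill-formed input raises instead of being guessed at.
    text = body.decode(charset)
    # The JSON parser only ever sees already-decoded text.
    return json.loads(text)
```

The JSON parser itself never needs to know which encoding was on the
wire; that knowledge stays in the one line that calls decode().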

To do that, of course, the encoding should be separate from the data
itself. Not separating the two is like sending a locked safe with its
key inside: you will have to break open the safe to find the key that
opens it. In fact, some references to the JSON standard provide an
example of how to "crack the safe" by looking at the first 4 bytes.
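
For readers who have not seen it, the "first 4 bytes" trick looks roughly
like the sketch below. It follows the null-byte patterns described in
RFC 4627, section 3, which rely on the first two characters of a JSON
text being ASCII. The function name and the handling of inputs shorter
than 4 bytes are my own assumptions.

```python
def sniff_json_encoding(body: bytes) -> str:
    """Guess the encoding from the null-byte pattern of the first 4 bytes."""
    if len(body) < 4:
        # Assumption: too short to sniff, so treat it as UTF-8.
        return "utf-8"
    nulls = tuple(byte == 0 for byte in body[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (typically xx xx xx xx) is taken to be UTF-8.
    return patterns.get(nulls, "utf-8")
```

Note that the result is only a guess taken from the data itself; nothing
outside the request confirms it.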

This can never be the intention. I know that HTML has a history of
"cracking" pages open to find the equivalent of an HTTP header in a meta
tag, but that only serves as an example of how it should NOT be done.

I have encountered content types like "application/json;charset=utf8"
that try to overcome the issue. Although this seems (and is) a little
odd, I think this is the best way of preparing the client for the data
it is getting. Although JSON has an "application" base type, it really
is text and therefore needs an encoding.

So my comment on RFC 7159 is that it should be possible to send the
character encoding as metadata when sending a JSON request. The absence
of this encoding would then cause the request to be treated as UTF-8.

In the section "IANA Considerations", the standard now says "n/a" for
optional parameters of the MIME type. I'd like to propose that "charset"
be registered as an optional parameter.

This can also be a security consideration. There are illegal sequences
that can be misinterpreted and cause havoc. This is less likely with a
"pure data" format like JSON, but you never know where the ill-formed
data is going next. Somebody could try to inject characters that are not
recognised as quotes by a web page script, but are regarded as such by
the database that parses it as part of a query. You hardly need more for
a successful SQL injection attack. If I recall correctly, attacks like
this happened with UTF-7 encodings in the past.

Also, imagine that somebody follows the advice on looking at the first 4
bytes for the encoding and uses a forgiving encoding translator for the
rest. I am sure that it would be possible to start with, say, a
UTF-16LE encoded white space and add some nasty surprises in another
encoding for the rest of the request. The encoding that is "cracked" out
of the first 4 bytes is not likely to be checked, as it was cracked out
of the request in the first place: there is nothing to validate it with.
Treating a piece of text as binary makes this kind of attack possible.

Stating the character encoding beforehand can prevent any surprises in
how the request is parsed, can serve as a meaningful validation, and
allows the programmer to send meaningful error messages for any
ill-encoded request.
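
As a rough sketch of what such validation could look like: decode
strictly against the declared charset and turn any failure into a
readable error. The names BadEncodingError and decode_declared are
hypothetical, and the UTF-8 default is the assumption proposed above.

```python
class BadEncodingError(ValueError):
    """Hypothetical error for a body that does not match its declared charset."""


def decode_declared(body: bytes, charset: str = "utf-8") -> str:
    """Decode strictly with the declared charset, or explain why not."""
    try:
        # errors="strict" rejects ill-formed sequences instead of
        # replacing or skipping them as a forgiving translator would.
        return body.decode(charset, errors="strict")
    except LookupError:
        raise BadEncodingError(f"unknown charset {charset!r}") from None
    except UnicodeDecodeError as exc:
        raise BadEncodingError(
            f"body is not valid {charset}: "
            f"byte offset {exc.start}: {exc.reason}"
        ) from None
```

For instance, a body that opens with UTF-16LE white space and continues
in UTF-8, such as b" \x00 \x00" + b'{"a":1}' declared as "utf-16-le", is
rejected here with an error message rather than passed on as garbage.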

Best regards,
-- 
Willem Bogaerts

Application Smith
Kratz Business Solutions
http://www.kratz.nl/