Re: [Json] "Generators SHOULD escape all Unicode whitespace characters"?

Jacob Davies <jacob@well.com> Thu, 13 June 2013 23:48 UTC

MIME-Version: 1.0
Sender: cromis@gmail.com
In-Reply-To: <257919C3-279E-47CA-9430-17FD52F82745@lindenbergsoftware.com>
References: <CAO1wJ5S_c_4H5PD5HAZo9UR2KbhDHqfXjo=C3GAGJeGEqCSFHA@mail.gmail.com> <257919C3-279E-47CA-9430-17FD52F82745@lindenbergsoftware.com>
From: Jacob Davies <jacob@well.com>
Date: Thu, 13 Jun 2013 16:47:47 -0700
Message-ID: <CAO1wJ5TDUh8T-gbovjU4qJbHay0eH6Fk8YhcBVV9WQO36Qv8iw@mail.gmail.com>
To: Norbert Lindenberg <ietf@lindenbergsoftware.com>
Cc: "json@ietf.org" <json@ietf.org>
Subject: Re: [Json] "Generators SHOULD escape all Unicode whitespace characters"?
Precedence: list

>
> This list includes some but not all Unicode control characters in addition
> to space characters.
>

Yes, and languages vary in what they actually consider "whitespace". I
think in general we're concerned with "non-printing or whitespace
characters other than a simple space".

> > "Whitespace smuggling" is a mild security concern and, from
> > experience, can be quite hard to debug if non-0x20 spaces are not
> > escaped. There is a small overhead of a couple of characters in doing
> > so.
>
> Can you provide more detail on the problem that this proposal is intended
> to solve?


Sure. What sometimes happens is that different parts of a system disagree
about what counts as whitespace. For instance, a server may strip whitespace
using Java's built-in check, Character.isWhitespace
(http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isWhitespace(int)),
which does not recognize all of the above-mentioned characters, when the
intent was to remove all whitespace. Malicious users may use characters that
evade this check to introduce whitespace into content in ways that are
unwanted or misleading. In other cases a system may tokenize on whitespace,
and differing definitions of whitespace then produce different tokenizations
and security concerns (as in the case Stephen Dolan mentioned earlier).
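To make the disagreement concrete, here is a minimal sketch in Python (not the Java check linked above). The "narrow" definition is an assumption chosen for illustration: it is just the four characters JSON itself treats as insignificant whitespace. The helper names are hypothetical.

```python
import re

# Assumed "narrow" definition: space, tab, LF, CR only.
NARROW_WS = " \t\n\r"

def strip_narrow(s):
    # Strips only the four narrow whitespace characters.
    return s.strip(NARROW_WS)

def strip_unicode(s):
    # str.strip() with no argument uses Python's Unicode-aware definition.
    return s.strip()

def tokens_narrow(s):
    # Tokenize on runs of narrow whitespace, dropping empty tokens.
    return [t for t in re.split("[ \t\n\r]+", s) if t]

def tokens_unicode(s):
    # str.split() with no argument splits on Unicode whitespace.
    return s.split()

# U+00A0 NO-BREAK SPACE in front, U+2003 EM SPACE in the middle.
sample = "\u00a0admin\u2003user "

print(repr(strip_narrow(sample)))   # leading U+00A0 survives
print(repr(strip_unicode(sample)))  # leading U+00A0 removed
print(tokens_narrow(sample))        # one token
print(tokens_unicode(sample))       # two tokens
```

The same input is one token to one component and two tokens to another, which is the kind of mismatch described above.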

Escaping may also assist in debugging seemingly identical JSON strings that
differ only by invisible or indiscernible whitespace, whether malicious or
unintentional.
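As a rough illustration of that debugging problem, the Python sketch below lists the characters in a string that are not a plain 0x20 space but fall into a control, format, or separator category. The function name and the choice of categories are my assumptions, not part of any proposal text, and the result depends on the Unicode version bundled with the Python build.

```python
import unicodedata

# Assumed set of "non-printing or whitespace" general categories.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}

def describe_invisibles(s):
    """Return (index, code point, name) for each suspicious character."""
    found = []
    for i, ch in enumerate(s):
        if ch != " " and unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Control characters have no name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((i, "U+%04X" % ord(ch), name))
    return found

a = "total 42"        # ordinary space
b = "total\u00a042"   # NO-BREAK SPACE, looks the same in most fonts

print(describe_invisibles(a))
print(describe_invisibles(b))
```

Without a tool like this, the two strings look the same on screen yet compare unequal.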

> Does the proposal really solve the problem, given that generators don't
> have to implement it, that they cannot implement it for characters added to
> Unicode in a Unicode version later than the one they're based on, and that
> parsers cannot rely on generators to have implemented it?
>

It certainly does not solve it. It mitigates it in the same way that
escaping control characters and ASCII whitespace in strings mitigates
similar concerns: it makes it easier to see exactly what a string is
intended to contain, in human-readable characters.
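A generator following the suggestion might look roughly like the Python sketch below. It is one possible approach under stated assumptions, not a reference implementation. It post-processes the output of `json.dumps` and rewrites every character other than a plain space whose general category is Cc, Cf, Zs, Zl, or Zp as a `\uXXXX` escape. This relies on `json.dumps` with default separators emitting only 0x20 spaces outside strings. The function name and category set are hypothetical, and coverage is limited to the Unicode version the runtime ships with.

```python
import json
import unicodedata

# Assumed set of categories to escape; plain U+0020 is left alone.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}

def _escape(ch):
    cp = ord(ch)
    if cp > 0xFFFF:
        # JSON escapes are 16-bit, so use a UTF-16 surrogate pair.
        cp -= 0x10000
        return "\\u%04x\\u%04x" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    return "\\u%04x" % cp

def dumps_visible(obj):
    # ensure_ascii=False keeps printable non-ASCII text readable;
    # json.dumps already escapes control characters below 0x20.
    text = json.dumps(obj, ensure_ascii=False)
    return "".join(
        _escape(ch)
        if ch != " " and unicodedata.category(ch) in INVISIBLE_CATEGORIES
        else ch
        for ch in text
    )

print(dumps_visible({"a": "x\u00a0y"}))
print(dumps_visible({"a": "x y"}))
```

The escaped output still parses back to the original value with any conforming parser, so the cost is only the few extra characters mentioned earlier.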

The case it helps mitigate is the common one where a non-malicious
generator is sending the JSON you're trying to understand, as for instance
when a site's JavaScript is communicating with its own server. One of the
nice things about JSON is that it is easy to debug problems in JSON data
using primitive tools: dumping text into page content, or hitting a URL
directly and looking at the JSON in the browser. As much as possible,
implementations should assist with that kind of debugging.

The recommendation could list a specific set of current characters and
additionally refer to the whitespace and control characters in the latest
Unicode version. As a mitigation measure it helps even though it is partial.

This may be a candidate for a best practice recommendation instead; I
thought it was worth mentioning one way or another.
