Re: [Json] "Generators SHOULD escape all Unicode whitespace characters"?

Jacob Davies <jacob@well.com> Thu, 13 June 2013 23:48 UTC

To: Norbert Lindenberg <ietf@lindenbergsoftware.com>
Cc: "json@ietf.org" <json@ietf.org>

>
> This list includes some but not all Unicode control characters in addition
> to space characters.
>

Yes, and languages vary in what they actually consider "whitespace". I
think in general we're concerned with "non-printing or whitespace
characters other than a simple space".

> > "Whitespace smuggling" is a mild security concern and, from
> > experience, can be quite hard to debug if non-0x20 spaces are not
> > escaped. There is a small overhead of a couple of characters in doing
> > so.
>
> Can you provide more detail on the problem that this proposal is intended
> to solve?


Sure - what sometimes happens is that various parts of a system disagree
over what is whitespace. For instance, a server may strip whitespace using
Java's built-in Character.isWhitespace(int) check, which does not recognize
all of the above-mentioned characters, when the intent was to remove all
whitespace:
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isWhitespace(int)
Malicious users may use characters that evade this check to introduce
whitespace into content in ways that are unwanted or misleading. In other
cases a system may tokenize by whitespace, but varying definitions of what
whitespace is result in different tokenizations and security concerns (as
in the case Stephen Dolan mentioned earlier).
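
To make the tokenization disagreement concrete, here is a minimal sketch in
Python (not Java, and not taken from any real system). The function names
and the sample payload are hypothetical. It assumes one component splits on
ASCII whitespace only while another uses Python's str.split(), which also
treats U+00A0 NO-BREAK SPACE as whitespace.

```python
import re

# ASCII-only notion of whitespace, as a hand-written filter might define it.
ASCII_WS = re.compile(r"[ \t\n\r\f\v]+")

def tokenize_ascii(text):
    """Split on ASCII whitespace only (hypothetical component A)."""
    return [tok for tok in ASCII_WS.split(text) if tok]

def tokenize_unicode(text):
    """Split on everything str.split() treats as whitespace (component B)."""
    return text.split()

# U+00A0 NO-BREAK SPACE sits between the two parts of this made-up payload.
payload = "role=user\u00a0role=admin"

print(tokenize_ascii(payload))    # one token: the NBSP is not a separator here
print(tokenize_unicode(payload))  # two tokens: the NBSP is a separator here
```

The first component sees a single opaque token and the second sees two, so
a check applied by one can be bypassed in the other.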

Escaping may also assist in debugging seemingly-identical JSON strings that
differ only by invisible or indiscernible whitespace, whether malicious or
accidental.
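
As a small illustration of the debugging point, the sketch below compares
two strings that render almost identically. It relies on Python's
json.dumps, whose default ensure_ascii=True escapes every non-ASCII
character (a broader rule than the one proposed here). The helper name
reveal and the sample strings are made up for the example.

```python
import json

def reveal(text):
    """Return the JSON encoding of text with non-ASCII characters escaped."""
    return json.dumps(text)  # ensure_ascii=True is the default

plain = "total: 10"
sneaky = "total:\u00a010"  # NO-BREAK SPACE where a plain space is expected

print(plain == sneaky)  # the strings differ, although they look alike
print(reveal(plain))
print(reveal(sneaky))   # the escaped form exposes the odd character
```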

> Does the proposal really solve the problem, given that generators don't
> have to implement it, that they cannot implement it for characters added to
> Unicode in a Unicode version later than the one they're based on, and that
> parsers cannot rely on generators to have implemented it?
>

It certainly does not solve it. It mitigates it in the same way that
escaping control characters and ASCII whitespace in strings mitigates
similar concerns: it makes it easier to see exactly what a string is
intended to contain, in human-readable characters.

The case it helps mitigate is the common one where a non-malicious
generator is sending the JSON you're trying to understand, as for instance
when a site's JavaScript is communicating with its own server. One of the
nice things about JSON is that it is easy to debug problems in JSON data
using primitive tools - dumping text into page content, or hitting a URL
directly and looking at the JSON in the browser. As much as possible,
implementations should assist.

The recommendation could list a specific set of current characters and
additionally refer to the whitespace and control characters in the latest
Unicode version. As a mitigation measure it helps even though it is partial.
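
A rough sketch of what such a generator could look like, in Python. The
choice of Unicode general categories (Zs, Zl, Zp, Cc, Cf) is my assumption
about what "whitespace and control characters" might cover, not wording
from any draft, and the function names are hypothetical. The result also
depends on the Unicode version of the runtime's character database, which
is exactly the limitation raised above.

```python
import json
import unicodedata

# Assumed categories: space/line/paragraph separators, controls, format chars.
ESCAPED_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}

def escape_char(ch):
    """Return the JSON \\uXXXX escape for ch, using a surrogate pair if needed."""
    cp = ord(ch)
    if cp > 0xFFFF:
        cp -= 0x10000
        return "\\u%04x\\u%04x" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    return "\\u%04x" % cp

def dumps_escaping_whitespace(value):
    """Serialize value to JSON, escaping non-0x20 whitespace and controls."""
    # With ensure_ascii=False and no indent, the only characters outside
    # string literals are ASCII punctuation and plain spaces, so rewriting
    # the whole text only ever touches string contents.
    text = json.dumps(value, ensure_ascii=False)
    out = []
    for ch in text:
        if ch != " " and unicodedata.category(ch) in ESCAPED_CATEGORIES:
            out.append(escape_char(ch))
        else:
            out.append(ch)
    return "".join(out)

print(dumps_escaping_whitespace({"name": "caf\u00e9\u00a0bar\u200b"}))
```

Ordinary non-ASCII letters such as the e-acute pass through unchanged; only
the no-break space and the zero-width space are escaped, and a conforming
parser reads back the original string.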

This may be a candidate for a best practice recommendation instead; I
thought it was worth mentioning one way or another.