Re: If not JSON, what then ?
James M Snell <jasnell@gmail.com> Mon, 01 August 2016 15:10 UTC
In-Reply-To: <77778.1470037414@critter.freebsd.dk>
References: <77778.1470037414@critter.freebsd.dk>
From: James M Snell <jasnell@gmail.com>
Date: Mon, 01 Aug 2016 07:57:26 -0700
Message-ID: <CABP7RbefhRj1BgZQ67MKOu-xBD+r2zdO6zVrVckQ_SRx=zPxdg@mail.gmail.com>
To: Poul-Henning Kamp <phk@phk.freebsd.dk>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Content-Type: text/plain; charset="UTF-8"
Subject: Re: If not JSON, what then ?
Archived-At: <http://www.w3.org/mid/CABP7RbefhRj1BgZQ67MKOu-xBD+r2zdO6zVrVckQ_SRx=zPxdg@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
X-Mailing-List: <ietf-http-wg@w3.org> archive/latest/32117
X-Loop: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list
List-Id: <ietf-http-wg.w3.org>
List-Help: <http://www.w3.org/Mail/>
List-Post: <mailto:ietf-http-wg@w3.org>
List-Unsubscribe: <mailto:ietf-http-wg-request@w3.org?subject=unsubscribe>
phk,

I'm very happy to see the discussion of efficient binary encoding of
HTTP headers coming back around. This is an area that I had explored
fairly extensively early in the process of designing HTTP/2 with the
"Binary-optimized Header Encoding" I-Ds (see
https://tools.ietf.org/html/draft-snell-httpbis-bohe-13). While HPACK
won out as the header compression scheme used for HTTP/2, there is
still quite a bit in the BOHE drafts that could be useful here.

- James

On Mon, Aug 1, 2016 at 12:43 AM, Poul-Henning Kamp <phk@phk.freebsd.dk> wrote:
> Based on discussions in email and at the workshop in Stockholm,
> JSON doesn't seem like a good fit for HTTP headers.
>
> A number of inputs came up in Stockholm which inform the process:
> Mark's earlier attempt to classify header syntax into groups and the
> desire for an efficient binary encoding in HTTP[3-6] (or HPACK++).
>
> My personal intuition was that we should find a binary serialization
> (like CBOR), and base64 it into HTTP1-2: i.e., design for the future
> and shoe-horn into the present. But no obvious binary serialization
> seems to exist, CBOR was panned by a number of people in the WS as
> too complicated, and gag-reflexes were triggered by ASN.1.
>
> Inspired by Mark's HTTP-header classification, I spent the train-trip
> back home to Denmark pondering the opposite attack: Is there a
> common data structure which (many) existing headers would fit into,
> which could serve our needs going forward?
>
> This document chronicles my deliberations, and the strawman I came
> up with: Not only does it seem possible, it has some very interesting
> possibilities down the road.
>
> Disclaimer: ABNF may not be perfect.
>
> Structure of headers
> ====================
>
> I surveyed current headers, and a very large fraction of them
> fit into this data structure:
>
>     header: ordered sequence of named dictionaries
>
> The "ordered" constraint arises in two ways: We have explicitly
> ordered headers like {Content|Transfer}-Encoding and we have headers
> which have order by their q=%f parameters.
>
> If we unserialize this model from RFC723x definitions, then ',' is
> the list separator and ';' the dictionary indicator and separator:
>
>     Accept: audio/*; q=0.2, audio/basic
>
> The "ordered ... named" combination does not map directly to most
> contemporary object models (JSON, python, ...) where dictionary
> order is undefined, so a definition list is required to represent
> this in JSON:
>
>     [
>         [ "audio/*", { "q": 0.2 }],
>         [ "audio/basic", {}]
>     ]
>
> It looks tempting to find a way to make the toplevel JSON a dictionary
> too, but given the use of wildcards in many of the keys ("text/*"),
> and the q=%f ordering, that would not be helpful.
>
> Next we want to give people the ability to have deeper structure,
> and we can either do that recursively (ie: nested ordered seq of
> dict) or restrict the deeper levels to only dict.
>
> That is probably a matter of taste more than anything, but the
> recursive design will probably appeal aesthetically to more than
> just me, and as we shall see shortly, it comes with certain economies.
>
> So let us use '<...>' to mark the recursion, since <> are shorter than
> [] and {} in HPACK/huffman.
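
To make the flat (non-recursive) model above concrete, here is a
minimal sketch in Python. The function names are hypothetical, and it
assumes no quoted-strings containing ',' or ';' appear in the value; a
real parser would have to handle those.

```python
import re


def _scalar(text):
    """Convert a parameter value to int or float when it looks numeric."""
    if re.fullmatch(r"-?[0-9]+", text):
        return int(text)
    if re.fullmatch(r"-?[0-9]+\.[0-9]*", text):
        return float(text)
    return text


def parse_flat_header(value):
    """Parse 'name;k=v, name2' into an ordered list of [name, dict] pairs.

    ',' separates list elements and ';' introduces dictionary entries.
    A parameter without '=' is stored with an empty dict as its value.
    """
    result = []
    for element in value.split(","):
        parts = [part.strip() for part in element.split(";")]
        name, params = parts[0], {}
        for param in parts[1:]:
            key, sep, raw = param.partition("=")
            params[key.strip()] = _scalar(raw.strip()) if sep else {}
        result.append([name, params])
    return result


print(parse_flat_header("audio/*; q=0.2, audio/basic"))
```

The outer list keeps the element order, which is the reason the JSON
form above is a list of pairs rather than a dictionary.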
>
> Here is a two-level example:
>
>     foobar: foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar
>
> Parsed into JSON that would be:
>
>     [
>         [
>             "foo",
>             {
>                 "p1": 1,
>                 "p4": {},
>                 "p3": [
>                     [
>                         "x1",
>                         {}
>                     ],
>                     [
>                         "x2",
>                         {}
>                     ],
>                     [
>                         "x3",
>                         {
>                             "y2": 2,
>                             "y1": 1
>                         }
>                     ]
>                 ],
>                 "p2": "abc"
>             }
>         ],
>         [
>             "bar",
>             {}
>         ]
>     ]
>
> (NB: shuffled dictionary elements to show that JSON dicts are unordered)
>
> And now comes the recursion economy:
>
> First we wrap the entire *new* header in <...>:
>
>     foobar: <foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>
>
> This way, the first character of the header tells us that this header
> has "common structure".
>
> That explicit "common structure" signal means privately defined
> headers can use "common structure" as well, and middleware and
> frameworks will automatically Do The Right Thing with them.
>
> Next, we add a field to the IANA HTTP header registry (one can do
> that I hope ?) classifying their "angle-bracket status":
>
>     A) not angle-brackets -- incompatible structure, use topical parser
>         Range
>
>     B) implicit angle-brackets -- Has common structure but is not <> enclosed
>         Accept
>         Content-Encoding
>         Transfer-Encoding
>
>     C) explicit angle-brackets -- Has common structure and is <> enclosed
>         all new headers go here
>
>     D) unknown status.
>         As it says on the tin.
>
> Using this as a whitelist, and given suitable schemas, a good number
> of existing headers can go into the common parser.
>
> And then for the final trick: We can now define new variants of
> existing headers which "sidegrade" them into the common parser:
>
>     Date: < 1469734833 >
>
> This obviously needs a signal/negotiation so we know the other side
> can grok them (HTTP2: SETTINGS, HTTP1: TE?)
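
The recursive '<...>' form can be handled by a small recursive-descent
parser. The following is only a sketch under stated assumptions: the
function names are hypothetical, quoted-strings are not handled, and
'<' must follow '=' with no whitespace in between.

```python
import re


def _scalar(text):
    """Convert a parameter value to int or float when it looks numeric."""
    if re.fullmatch(r"-?[0-9]+", text):
        return int(text)
    if re.fullmatch(r"-?[0-9]+\.[0-9]*", text):
        return float(text)
    return text


def _read_atom(s, i):
    """Read up to the next structural character; return (text, index)."""
    j = i
    while j < len(s) and s[j] not in ",;=<>":
        j += 1
    return s[i:j].strip(), j


def _parse_list(s, i):
    """Parse elements until end of input or an unmatched '>'."""
    items = []
    while True:
        name, i = _read_atom(s, i)
        params = {}
        while i < len(s) and s[i] == ";":
            key, i = _read_atom(s, i + 1)
            value = {}  # a parameter without '=' maps to an empty dict
            if i < len(s) and s[i] == "=":
                if i + 1 < len(s) and s[i + 1] == "<":
                    # Nested ordered sequence: recurse, then expect '>'.
                    value, i = _parse_list(s, i + 2)
                    if i >= len(s) or s[i] != ">":
                        raise ValueError("missing '>'")
                    i += 1
                else:
                    raw, i = _read_atom(s, i + 1)
                    value = _scalar(raw)
            params[key] = value
        items.append([name, params])
        if i < len(s) and s[i] == ",":
            i += 1
            continue
        return items, i


def parse_common_structure(value):
    """Parse a header value, with or without the enclosing '<...>'."""
    text = value.strip()
    if text.startswith("<") and text.endswith(">"):
        text = text[1:-1]
    items, end = _parse_list(text, 0)
    if end != len(text):
        raise ValueError(f"unexpected {text[end]!r} at offset {end}")
    return items


print(parse_common_structure(
    "<foo;p1=1;p2=abc;p3=<x1,x2,x3;y1=1;y2=2>;p4, bar>"))
```

The same function accepts both the implicit form (category B) and the
explicit form (category C), since the outer brackets are stripped
before the list is parsed.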
>
> Next:
>
> Data Types
> ==========
>
> I think we need these fundamental data types, and subtypes:
>
>     1) Unicode strings
>
>     2) ascii-string (maybe)
>
>     3) binary blob
>
>     4) Token
>
>     5) Qualified-token
>
>     6) Number
>
>     7) integer
>
>     8) Timestamp
>
> In addition to these subtypes, schemas can constrain types
> further, for instance integer ranges, string lengths etc.;
> more on this below.
>
> I will talk about each type in turn, but it goes without saying
> that we need to fit them all into RFC723x, in a way that is not
> going to break anything important and HPACK should not hate
> them either.
>
> In HTTP3+, they should be serialized intelligently, but that
> should be trivial and I will not cover that here.
>
> 1) Unicode string
> -----------------
>
> The first question is do we mean "unrestricted unicode" or do
> we want to try to sanitize it.
>
> An example of sanitation is RFC7230's "quoted-string" which bans
> control characters except forward horizontal white-space (=TAB).
>
> Another is I-JSON (RFC7493)'s:
>
>     MUST NOT include code points that identify Surrogates or
>     Noncharacters as defined by UNICODE.
>
> As far as I can tell, that means that you have to keep a full UNICODE
> table handy at all times, and update it whenever additions are made
> to unicode. Not cool IMO.
>
> Imposing an RFC7230-like restriction on unicode gets totally
> rococo: What does "forward horizontal white-space" mean on
> a line which uses both left-to-right and right-to-left alphabets ?
> What does it mean in alphabets which write vertically ?
>
> Let us absolve the parser from such intimate unicode scholarship
> and simply say that the data type "unicode string" is what it says,
> and use the schemas to sanitize its individual use.
>
> Encoding unicode strings in HTTP1+2 requires new syntax and
> for any number of reasons, I would like to minimize that
> and {re-|ab-}use quoted-strings.
>
> RFC7230 does not specify what %80-%FF means in quoted-string, but
> hints that it might be ISO8859.
>
> Now we want it to become UTF-8.
>
> My proposal at the workshop, to make the first three characters
> inside the quotes a UTF-8 BOM, is quite pessimal in HPACK's huffman
> encoding: It takes 68 bits.
>
> Encoding the BOM as '\ufeff' helps but still takes an unreasonable
> 48 bits in HPACK/huffman.
>
> In both H1 and H2 defining a new "\U" escape seems better.
>
> Since we want to carry unrestricted unicode, we also need escapes
> to put the <%20 codepoints back in. I suggest "\u%%%%" like JSON.
>
> (We should not restrict which codepoints may/should use \u%%%% until
> we have studied if \u%%%% may HPACK/huffman better than "raw" UTF-8
> in asian codepages.)
>
> The heuristic for parsing a quoted-string then becomes:
>
>     1) If the quoted-string's first two characters are "\U"
>         -> UTF-8
>
>     2) If the quoted-string contains a "\u%%%%" escape anywhere
>         -> UTF-8
>
>     3) If the quoted-string contains only %09-%7E
>         -> UTF-8 (actually: ASCII)
>
>     4) If the quoted-string contains any %7F-%8F
>         -> UTF-8
>
>     5) If header definition explicitly says ISO-8859
>         -> ISO8859
>
>     6) else
>         -> UTF-8
>
> 2) Ascii strings
> ----------------
>
> I'm not sure if we need these or if they are even a good idea.
>
> The "pro" argument is if we insist they are also english text
> so we have something the entire world stands a chance to understand.
>
> The "contra" argument is that some people will be upset about that.
>
> If we want them, they're quoted-strings from RFC723x without %7F-%FF.
>
> It is probably better to schema them from unicode strings.
>
> 3) Binary blobs
> ---------------
>
> Fitting binary blobs from crypto into RFC7230 should squeeze into
> quoted-string as well, since we cannot put any kinds of markers or
> escapes on tokens without breaking things.
>
> Proposal:
>
>     Quoted-string with "\#" as first two chars indicates base64
>     encoded binary blob.
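
As a rough sketch of how the six-step quoted-string heuristic and the
"\#" blob marker might fit together in Python: the function names are
hypothetical, the input is the raw bytes between the quotes, "%%%%" is
assumed to mean four hex digits as in JSON, and checking for "\#"
before the text heuristic is my assumption, not something the proposal
states.

```python
import base64
import re

# Assumption: "\u%%%%" means a backslash, 'u', and four hex digits.
_U_ESCAPE = re.compile(rb"\\u[0-9A-Fa-f]{4}")


def classify_quoted_string(body, header_says_iso8859=False):
    """Return 'blob', 'utf-8' or 'iso-8859' for a quoted-string body."""
    if body.startswith(b"\\#"):          # binary blob marker
        return "blob"
    if body.startswith(b"\\U"):          # rule 1
        return "utf-8"
    if _U_ESCAPE.search(body):           # rule 2
        return "utf-8"
    if all(0x09 <= b <= 0x7E for b in body):   # rule 3 (plain ASCII)
        return "utf-8"
    if any(0x7F <= b <= 0x8F for b in body):   # rule 4
        return "utf-8"
    if header_says_iso8859:              # rule 5
        return "iso-8859"
    return "utf-8"                       # rule 6


def decode_blob(body):
    """Decode a body of the form '\\#<base64>' into bytes."""
    if not body.startswith(b"\\#"):
        raise ValueError("not a binary blob")
    return base64.b64decode(body[2:], validate=True)


print(classify_quoted_string(b"caf\xe9", header_says_iso8859=True))
print(decode_blob(b"\\#aGk="))
```

Note that with the rules applied in the listed order, only rule 5 ever
yields ISO8859, and only when none of rules 1-4 matched first.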
>
> I chose "\#" because "#" is not in the base64 set, so if some
> nonconforming implementation eliminates the "unnecessary escape"
> it will be clearly visible (and likely recoverable) rather than
> munge up the content of the base64.
>
> Base64 is chosen because it is the densest well known encoding which
> works well with HPACK/huffman: The b64 characters on average emit
> 6.46 bits.
>
> I have no idea how these blobs would look when parsed into JSON,
> probably as base64 ? But in languages which can, they should
> probably become native byte-strings.
>
> 4) Token
> --------
>
> As we know it from RFC7230:
>
>     tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
>             "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
>     token = 1*tchar
>
> 5) Qualified Token
> ------------------
>
>     qualified_token = token 0*1("/" token)
>
> All keys in all dictionaries are of this type. (In JSON/python...
> the keys are strings)
>
> Schemas can restrict this further.
>
> 6) Numbers
> ----------
>
> These are signed decimal numbers which may have a fraction.
>
> In HTTP1+2 we want them always in "%f" format and we want them to
> fit in IEEE754 64 bit floating point, which leads to the following
> definition:
>
>     0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT )    n+m < 15
>
> (15 digits fit in IEEE754 64 bit binary floating point.)
>
> These numbers can (also) be used for millisecond resolution absolute
> UNIX-epoch relative timestamps for all foreseeable future.
>
> 7) Integers
> -----------
>
>     0*1"-" 1*15 DIGIT
>
> Same restriction as above to fit into IEEE 754.
>
> Range can & should be restricted by schemas as necessary.
>
> 8) Timestamps
> -------------
>
> I propose we do these as subtype of Numbers, as UNIX-epoch relative
> time. That is somewhat human-hostile and is leap-second-challenged.
>
> If you know from the schema that a timestamp is coming, the parser
> can easily tell the difference between an RFC7231 IMF-fixdate and a
> Number-Date.
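
The Number and Integer grammars, and the schema-guided choice between
a Number-Date and an IMF-fixdate, could be sketched in Python roughly
as follows. The function names are hypothetical; I read "n+m < 15" as
"at most 15 digits in total", and I lean on the standard library's
email.utils date parser for the IMF-fixdate case.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# 0*1"-" DIGIT 0*nDIGIT 0*1("." 0*mDIGIT)
_NUMBER = re.compile(r"-?([0-9]+)(?:\.([0-9]*))?")
# 0*1"-" 1*15 DIGIT
_INTEGER = re.compile(r"-?[0-9]{1,15}")


def is_number(text):
    """True if text matches the Number grammar with <= 15 digits total."""
    match = _NUMBER.fullmatch(text)
    if match is None:
        return False
    return len(match.group(1)) + len(match.group(2) or "") <= 15


def is_integer(text):
    """True if text matches the Integer grammar."""
    return _INTEGER.fullmatch(text) is not None


def parse_timestamp(text):
    """Parse a value the schema says is a timestamp.

    A Number is taken as UNIX-epoch seconds; anything else is handed
    to the IMF-fixdate parser.
    """
    text = text.strip()
    if is_number(text):
        return datetime.fromtimestamp(float(text), tz=timezone.utc)
    return parsedate_to_datetime(text)


print(parse_timestamp("1469734833"))
print(parse_timestamp("Sun, 06 Nov 1994 08:49:37 GMT"))
```

Because a Number can never start with a letter, the schema hint makes
the two date forms trivially distinguishable.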
>
> Without guidance from a schema it becomes inefficient to determine
> if it is an IMF-fixdate, since the week day part looks like a token,
> but it is not impossible.
>
>
> Schemas
> =======
>
> There needs to be an "ABNF"-parallel to specify what is mandatory and
> allowed for these headers in "common structure".
>
> Ideally this should be in machine-readable format, so that
> validation tools and parser-code can be produced without
> (too much) human intervention. I'm tempted to say we should
> make the schemas JSON, but then we need to write JSON schemas
> for our schemas :-/
>
> Since schemas basically restrict what you are allowed to
> express, we need to examine and think about what restrictions
> we want to be able to impose, before we design the schema.
>
> This is the least thought about part of this document, since
> the train is now in Lund:
>
> Unicode strings:
> ----------------
>
> * Limit by (UTF-8) encoded length.
>     Ie: a resource restriction, not a typographical restriction.
>
> * Limit by codepoints
>     Example: Allow only "0-9" and "a-f"
>     The specification of code-points should be a list of codepoint
>     ranges. (Ascii strings could be defined this way)
>
> * Limit by allowed strings
>     ie: Allow only "North", "South", "East" and "West"
>
> Tokens
> ------
>
> * Limit by codepoints
>     Example: Allow only "A-Z"
>
> * Limit by length
>     Example: Max 7 characters
>
> * Limit by pattern
>     Example: "A-Z" "a-z" "-" "0-9" "0-9"
>     (use ABNF to specify ?)
>
> * Limit by well known set
>     Example: Token must be ISO3166-1 country code
>     Example: Token must be in IANA FooBar registry
>
> Qualified Tokens
> ----------------
>
> * Limit each of the two component tokens as above.
>
> Binary Blob
> -----------
>
> * Limit by length in bytes
>     Example: 128 bytes
>     Example: 16-64 or 80 bytes
>
> Number
> ------
>
> * Limit resolution
>     Example: exactly 3 decimal digits
>
> * Limit range
>     Example: [2.716 ... 3.1415]
>
> Integer
> -------
>
> * Limit range
>     Example: [0 ... 65535]
>
> Timestamp
> ---------
>
> (I can't think of usable restrictions here)
>
>
> Aaand... I'm in Copenhagen...
>
> Let me know if any of this looks usable...
>
> --
> Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
> phk@FreeBSD.ORG         | TCP/IP since RFC 956
> FreeBSD committer       | BSD since 4.3-tahoe
> Never attribute to malice what can adequately be explained by incompetence.
>
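
To illustrate how the Unicode string restrictions listed under Schemas
might be expressed in machine-readable form, here is a small Python
sketch. The class and field names are hypothetical; this is one
possible shape, not a proposed schema format.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class StringSchema:
    """Restrictions a schema may place on a unicode string value."""

    # Resource limit on the UTF-8 encoded length, in bytes.
    max_utf8_bytes: Optional[int] = None
    # Inclusive (low, high) codepoint ranges the string may use.
    codepoint_ranges: Optional[Sequence[Tuple[int, int]]] = None
    # Closed set of permitted strings.
    allowed: Optional[Sequence[str]] = None

    def accepts(self, value: str) -> bool:
        """Return True if value satisfies every configured restriction."""
        if (self.max_utf8_bytes is not None
                and len(value.encode("utf-8")) > self.max_utf8_bytes):
            return False
        if self.codepoint_ranges is not None:
            for ch in value:
                if not any(lo <= ord(ch) <= hi
                           for lo, hi in self.codepoint_ranges):
                    return False
        if self.allowed is not None and value not in self.allowed:
            return False
        return True


# Allow only "0-9" and "a-f".
hexdigits = StringSchema(codepoint_ranges=[(0x30, 0x39), (0x61, 0x66)])
compass = StringSchema(allowed=["North", "South", "East", "West"])

print(hexdigits.accepts("deadbeef"), hexdigits.accepts("xyz"))
print(compass.accepts("North"), compass.accepts("Up"))
```

Expressing codepoint limits as a list of ranges means an ASCII-only
string type falls out as just another schema, as suggested above.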