Re: UTF-8 text (was: Header Serialization Discussion)

James M Snell <jasnell@gmail.com> Wed, 17 April 2013 18:47 UTC

Resent-Date: Wed, 17 Apr 2013 18:46:57 +0000
Resent-Message-Id: <E1USXNh-0000nk-AQ@frink.w3.org>
MIME-Version: 1.0
In-Reply-To: <1369687790.71571377.1366222842246.JavaMail.root@zimbra71-e12.priv.proxad.net>
References: <CABP7RbdtNtBin_mq0qVNDChCKq8jvjFKUU6oMZy7XZMnb_JwTw@mail.gmail.com> <1369687790.71571377.1366222842246.JavaMail.root@zimbra71-e12.priv.proxad.net>
From: James M Snell <jasnell@gmail.com>
Date: Wed, 17 Apr 2013 11:46:06 -0700
Message-ID: <CABP7Rbdd8tPQXCgbzJ-PH76Nz3SswO0LJxPvwUt2HMCFJ6Pz3Q@mail.gmail.com>
To: Frédéric Kayser <f.kayser@free.fr>
Cc: "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Received-SPF: pass client-ip=209.85.214.180; envelope-from=jasnell@gmail.com; helo=mail-ob0-f180.google.com
Subject: Re: UTF-8 text (was: Header Serialization Discussion)
Archived-At: <http://www.w3.org/mid/CABP7Rbdd8tPQXCgbzJ-PH76Nz3SswO0LJxPvwUt2HMCFJ6Pz3Q@mail.gmail.com>
Resent-From: ietf-http-wg@w3.org
Resent-Sender: ietf-http-wg-request@w3.org
Precedence: list

On Wed, Apr 17, 2013 at 11:20 AM, Frédéric Kayser <f.kayser@free.fr> wrote:
> I think this is the way to go. The codepoints between 80 and FF where in the tree to encode ISO Latin-1 more precisely what is called Latin-1 Supplement in Unicode, it's the second Unicode block, leading octets C2 and C3 are used to access it in UTF-8. Allowing ISO Latin-1 in the header probably had sens in the 90s, nowadays Unicode and UTF-8 make a lot more sens.

Yep.. I have this working now and it does encode much more compactly.
I've ensured that the leading octet is encoded as an 8-bit prefix,
allowing us to encode the UTF-8 characters compactly with 2-fewer bits
per continuation byte.

> But the question remains could we use UTF-8 text fields to store IRIs (RFC3987) in native form without punycode and percent escaping? More and more countries allow IDNs (Internationalized Domain Name) even ccTLDs can be internationalized nowadays. DNS protocols were stuck to ASCII but HTTP/2 has the opportunity to embrace Unicode.
>

This is a more difficult question. In theory, yes, we ought to be able
to support these, but there's the question of backwards compatibility.
We could define that the new :path field (and referer, location, link,
etc) contain a UTF-8 encoded IRIs, so for backwards compatibility with
HTTP/1, an implementation would need to do the appropriate standard
conversion to a URI. Going the other direction, an impl could choose
to leave it as a URI or convert it to it's IRI form. I think this
makes a lot of sense and has a very clear http/2 <--> http/1
translation. So I'm +1 on it.

- James

> Frédéric
>
> ----- Mail d'origine -----
> De: James M Snell <jasnell@gmail.com>
> À: Frédéric Kayser <f.kayser@free.fr>
> Cc: ietf-http-wg@w3.org
> Envoyé: Wed, 17 Apr 2013 19:31:18 +0200 (CEST)
> Objet: Re: Header Serialization Discussion
>
> Yeah, I was looking at that also. My concern was mainly how adding
> symbols for each leading octet would affect the efficiency of the
> ascii portion of the code tree but looking it over, the existing code
> tree contains symbols for codepoints >= 127 already. If we drop those
> generally unused higher codepoints, change the EOF from 256 to 127 and
> add in the 50 leading octet symbols the resulting tree is very
> reasonable. I've been playing with the frequency numbers and was able
> to generate a tree where the leading octets can be encoded in exactly
> 8-bits. Then, as you suggest, write out 6-bits from each trailing
> octet. This should give us a fairly compact encoding for all
> codepoints >= 128.
>
> On Wed, Apr 17, 2013 at 8:14 AM, Frédéric Kayser <f.kayser@free.fr> wrote:
>> Ok based on Huffman.java you have apparently changed the other prefix codes accordingly and your sample was in fact for the request table.
>>
>> Since UTF-8 is a superset of ISO Latin-1 and based on storeUtf8Codepoint it would cost at most 6+11 that's 17 bits to send an ISO Latin-1 code point encoded in UTF-8 compared to at least 25 bits in native form. I see two problems here:
>> 17 bits is longer than 16 bits the raw UTF-8 size of such a code point
>> Why keep the long Latin-1 codes when their UTF-8 counter parts are shorter ?
>>
>> I would suggest to build a new table keeping the statistics gathered for ASCII chars but using the upper part of the table to build prefix codes for UTF-8 leading octet C2—F4 range.
>> When encoding/decoding an UTF-8 codepoint if the leading octet was in the C2—DF range write/read 6 bits (a single continuation octet), if it was in the E0—EF range 12 bits (two continuation octets), and in the F0—F4 range 18 bits (three continuation octets).
>>
>> Frédéric
>>
>> ----- Mail d'origine -----
>> De: Frédéric Kayser <f.kayser@free.fr>
>> À: ietf-http-wg@w3.org
>> Envoyé: Wed, 17 Apr 2013 16:02:37 +0200 (CEST)
>> Objet: Re: Header Serialization Discussion
>>
>> Hello,
>> how could 110101 be used as prefix code for 257 since it's the beginning of prefix code 1101010 (61), the same goes for 110110 (258) beginning of 1101100 (83)...
>>
>> Frédéric
>> ----- Mail d'origine -----
>> De: James M Snell <jasnell@gmail.com>
>> À: ietf-http-wg@w3.org
>> Envoyé: Wed, 17 Apr 2013 04:01:30 +0200 (CEST)
>> Objet: Re: Header Serialization Discussion
>>
>> Another update... I just checked in preliminary support for UTF-8
>> support in the static huffman code. The basics are...
>>
>> Building off of Roberto's dictionary, I added additional symbols to
>> the static huffman code tree, each representing a possible UTF-8
>> encoded length. Right now I have symbols for 2, 3, 4, 5 and 6 byte
>> lengths (yes, I know we don't need all those). For instance, in the
>> response table I have:
>>
>> (257) |110101 [6]
>> (258) |110110 [6]
>> (259) |01111 [5]
>> (260) |10000 [5]
>> (261) |10001 [5]
>>
>> Where 257 is for 2 byte characters, 258 is for 3 byte, etc (I'm going
>> to be tweaking the tables so that this is better later).
>>
>> As I iterate through each codepoint in the source text, I check to see
>> if the codepoint is < 256, if it is, I encode it normally. If the
>> codepoint > 256, I determine the appropriate escape code from the
>> table, then write the exact number of bits for the codepoint. So, for
>> instance, the supplementary character U+10400 is encoded out as [78 41
>> 00 00], consuming a total of 26 bits.
>>
>> This is purely experimental at this point, but the code is functional
>> and checked in to github. There are obviously a number of improvements
>> that could be made and I'm definitely open to more suggestions! One
>> problem with this, of course, is that it's not very efficient if the
>> field value consists primarily of extended characters -- that is, this
>> approach is optimized specifically for the case where we have the
>> random odd UTF-8 character thrown in to a mostly ASCII string.
>>
>> One important side effect of this is that we ought to be able to "Just
>> Use UTF-8" across the board now, without requiring the ISO-8859-1
>> escape hatch.
>>
>> - James
>>
>> On Sat, Apr 13, 2013 at 3:01 PM, James M Snell <jasnell@gmail.com> wrote:
>>> Ok... so I've implemented HeaderDiff serialization [1] in java [2]...
>>> just the serialization so far, will do deserialization early next
>>> week. The implementation is very rough and definitely needs
>>> improvement / optimization but it's functional enough to start from.
>>>
>>> [1] http://tools.ietf.org/html/draft-ruellan-headerdiff-00
>>> [2] https://github.com/jasnell/http2/tree/master/src/snell/http2/headers/headerdiff
>>>
>>> I know that Roberto is working on refinements to his own concept of
>>> the delta header serialization and hopefully he'll be sharing those
>>> thoughts soon, but I wanted to get my current thoughts captured for
>>> discussion.
>>>
>>> The HeaderDiff approach is straightforward and effective and makes
>>> efficient use of available bits. It's variable length integer syntax
>>> is a bit complicated with the notion of bit-prefix lengths but once
>>> you've got it it's easy to understand what's going on. The
>>> implementation is a bit less complicated than delta, which is good and
>>> I like that there's no "Header Group" notion. Headers are either on or
>>> off per request. There are several concerns, however.
>>>
>>> - The true utility of the common prefix length mechanism is
>>> questionable. Aside from the potential security risks, I questioning
>>> just how effective it's going to be in practice. (What header fields
>>> do we expect to actually use it in practice?)
>>>
>>> - The fact that items are never removed from the Name Table is
>>> concerning and poses a potential security risk. The the decompressor
>>> is forced to maintain a symbol table for every header name it
>>> encounters, a malicious compressor could cause overruns by sending a
>>> high number of junk header names. The compressor ought to be able to
>>> treat the Name Table as an LRU, and ought to be able to place strict
>>> limits on it's size, just as it does with the Header Table. Delta does
>>> not have this issue because it's name indices are already tied to the
>>> LRU.
>>>
>>> With HeaderDiff in mind, I'm thinking about how we can bring
>>> HeaderDiff and Delta together and roll in the additional concepts of
>>> Typed Codecs. Here's what I have in mind.
>>>
>>> Syntax:
>>>
>>> The serialization syntax borrows concepts from both HeaderDiff and
>>> Delta and is divided into two parts: Basic Header Serialization and
>>> Value Serialization.
>>>
>>> As with Delta currently, there are four operation codes (opcodes),
>>> represented by two-bit identifiers. Each op code has a single-bit
>>> "ephemeral flag". Operations are serialized into batches of no more
>>> than 32 instances.
>>>
>>> 00 0 00000 { HEADER }
>>>
>>> The first two most significant bits...
>>>
>>> 00 = Index (what Delta calls "Toggl")
>>> 01 = Index Range (what Delta calls "Trang")
>>> 10 = Clone
>>> 11 = Literal (what Delta calls "KVSto")
>>>
>>> The third most significant bit is the ephemeral toggle
>>>
>>> 0 = Persist (the header value is to be persisted in the LRU table)
>>> 1 = Ephemeral (the header value is not to be persisted in the LRU table)
>>>
>>> The remaining five bits specify the number of distinct instances in
>>> this batch. This counter is 0-based, so 0x0 = 1 item in the batch,
>>> 0x1F = 32 items in the batch.
>>>
>>> The ephemeral flag is ignored for opcodes 00 and 01.
>>>
>>> Indexed headers are identified by a single byte index value.
>>>
>>> 0 0000000
>>>
>>> The first most significant bit indicates whether or not the index is
>>> part of the static header dictionary or lru cache.
>>>
>>> 0 = LRU
>>> 1 = Static
>>>
>>> This gives us a strict maximum of 128 storage slots for the LRU.
>>> That's limited *on purpose* to place strictly finite constraints on
>>> the amount of state any single compressor can require. This also
>>> ensures that we are always able to encode indices using only a single
>>> byte.
>>>
>>> As for state management, we would essentially use the Header Diff
>>> model minus the separate Name Table. That is, there is no persistent
>>> Header Group in use; just the Header Table. State management depends
>>> on the specific opcodes used. State management occurs as the headers
>>> are encountered in the serialized header block.
>>>
>>> 00 = Use an indexed header instance from the header table for this request
>>> 01 = Use multiple sequentially indexed header instances from the
>>> header table for this request
>>> 10 = Use a new value for an already existing header name, index
>>> references an existing header from the table, persist the new value if
>>> the ephemeral bit is not set.
>>> 11 = Use a new key and value for this request. Persist the new pair
>>> if the ephemeral bit is not set.
>>>
>>> So, for example, the serialization:
>>>
>>> 00000000 10000000 =
>>> Use index #0 from the static dictionary
>>>
>>> 00000001 10000000 00000001 =
>>> Use index #0 from the static dictionary and index #1 from the lru cache
>>>
>>> 01000001 10000000 10000001 00000001 00000010 =
>>> Use indices #0-#1 from the static dictionary and indices #1-#2
>>> from the lru cache
>>>
>>> 10000000 10000000 {VALUE}
>>> Provide a new value for the header name identified by index #0 in
>>> the static dictionary and persist the value in the lru cache
>>>
>>> 10100000 10000000 {VALUE}
>>> Provide a new value for the header name identified by index #0 in
>>> the static dictionary but do not persist the value in the lru cache
>>>
>>> Non-ephemeral items are pushed into the next available slot in the
>>> LRU. Once all available slots have been used, the indices rotate. The
>>> LRU can also be limited by storage space. As storage space is
>>> required, the least recently items are dropped. Keeping in mind that
>>> there are only 128 available storage slots, it will be important for
>>> compressors to be selective about which fields are persisted.
>>>
>>> (note.. the LRU is really a "least recently written" cache.. not
>>> "least recently used"...)
>>>
>>> These values are tracked only for the current message. There is no
>>> Header Group as currently defined by Delta so there is no need to
>>> track whether or not any given header is currently on or off. It's
>>> either referenced for this request or it's not.
>>>
>>> If there are more than 32 instances of any single opcode per
>>> serialized header block (which is unlikely), a single opcode may
>>> appear multiple times within the serialized block.
>>>
>>> As far as Value serialization is concerned... As I have discussed
>>> previously, I have been looking at support for four distinct value
>>> types, each represented by their own two-bit identifier. As I will
>>> explain further in a second, these types are represented using the two
>>> most significant bits in a byte:
>>>
>>> 00 = Text
>>> 01 = Number
>>> 10 = Date Time
>>> 11 = Binary
>>>
>>> Text can be either UTF-8 or ISO-8859-1, indicated by a single bit flag
>>> following the type code. All text strings are prefixed by it's length
>>> given as an unsigned variant length integer
>>> Numbers are encoded as unsigned variable length integers.
>>> Date Times are encoded as Numbers representing the number of seconds
>>> that have passed since the epoch (GMT).
>>> Binary data is encoded as raw binary octets prefixed by an unsigned
>>> variable length integer specifying the number of bytes.
>>>
>>> Any single header value instance can consist of up to 32 distinct
>>> value tokens. The number of tokens is specified using the 5 least
>>> significant bytes of type type flag.
>>>
>>> For instance:
>>>
>>> 00000000 00000001 01100001 =
>>> A single ISO-8859-1 Text value with length = 1
>>>
>>> 00100000 00000001 01100001 =
>>> A single UTF-8 Text value with length = 1
>>>
>>> 00100001 00000001 01100001 00000001 01100010 =
>>> Two UTF-8 Text values, each with length = 1
>>>
>>> 01000000 00000001 =
>>> A single Number value
>>>
>>> 10000000 00000001 =
>>> A single Date value (1 second after the epoch)
>>>
>>> 11000000 00000001 00000001 =
>>> A single Raw Binary value
>>>
>>> For ISO-8859-1 Text, the Static Huffman Code used by Delta would be
>>> used for the value. If we can develop an approach to effectively
>>> handling Huffman coding for arbitrary UTF-8, then we can apply Huffman
>>> coding to that as well.
>>>
>>> For the Number and DateTime serialization:
>>>
>>> - 16-bit numbers serialize with a strict maximum of 3 bytes
>>> - 32-bit numbers serialize with a strict maximum of 5 bytes.
>>> - 64-bit numbers serialize with a strict maximum of 10 bytes.
>>> - Date Times will serialize with five bytes for the reasonably
>>> relevant future, then six bytes for quite some time after that. Dates
>>> prior to the epoch cannot be represented.
>>>
>>> In order to properly deal with the backwards compatibility concerns
>>> for HTTP/1, there are several important rules for use of Typed Codecs
>>> in HTTP headers:
>>>
>>> 1. Headers must be explicitly defined to use the new header types. All
>>> existing HTTP/1 headers, then, will continue to be required to be
>>> represented as ISO-8859-1 Text unless their standard definitions are
>>> updated. The HTTP/2 specification would update the definition of
>>> specific known headers (e.g. content-length, date, if-modified-since,
>>> etc).
>>>
>>> 2. Extension headers that use the typed codecs will have specific
>>> normative transformations to ISO-8859-1 defined.
>>> a. UTF-8 Text will be converted to ISO-8859-1 with extended
>>> characters pct-encoded
>>> b. Numbers will be converted to their ASCII equivalent values.
>>> c. Date Times will be converted to their HTTP-Date equivalent values.
>>> d. Binary fields will be Base64-encoded.
>>>
>>> 3. There will be no normative transformation from ISO-8859-1 values
>>> into the typed codecs. Implementations are free to apply
>>> transformation where those impls determine it is appropriate, but it
>>> will be perfectly legal for an implementation to pass a text value
>>> through even if it is known that a given header type has a typed codec
>>> equivalent (for instance, Content-Length may come through as a number
>>> or a text value, either will be valid). This means that when
>>> translating from HTTP/1 -> HTTP/2, receiving implementations need to
>>> be prepared to handle either value form.
>>>
>>> This approach combines what I feel are the best concepts from
>>> HeaderDiff and Delta and provides simple, straightforward header
>>> serialization and state management. Obviously, lots of testing is
>>> required, however.
>>>
>>> As always, thoughts/opinions/gripes are appreciated.
>>>
>>> - James
>>
>>
>>
>>
>>
>
>
>

Header Serialization Discussion James M Snell
RE: Header Serialization Discussion RUELLAN Herve
Re: Header Serialization Discussion James M Snell
Re: Header Serialization Discussion Adrien W. de Croy
Re: Header Serialization Discussion Frédéric Kayser
RE: Header Serialization Discussion RUELLAN Herve
RE: Header Serialization Discussion RUELLAN Herve
Re: Header Serialization Discussion James M Snell
Re: Header Serialization Discussion James M Snell
Re: Header Serialization Discussion Roberto Peon
Re: Header Serialization Discussion James M Snell
Re: Header Serialization Discussion Martin Nilsson
Re: Header Serialization Discussion James M Snell
Re: Header Serialization Discussion Frédéric Kayser
Re: Header Serialization Discussion Frédéric Kayser
Re: Header Serialization Discussion James M Snell
Re: UTF-8 text (was: Header Serialization Discuss… Frédéric Kayser
Re: UTF-8 text (was: Header Serialization Discuss… James M Snell
Re: UTF-8 text (was: Header Serialization Discuss… James M Snell
Re: Header Serialization Discussion Martin Thomson
Re: UTF-8 text James Cloos