[Sip] UDPComp: parallels with header compression

"Price, Richard" <richard.price@roke.co.uk> Thu, 23 August 2001 13:29 UTC

Message-ID: <76C92FBBFB58D411AE760090271ED4186F9E2E@RSYS002A>
From: "Price, Richard" <richard.price@roke.co.uk>
To: 'Jonathan Rosenberg' <jdrosen@dynamicsoft.com>, "'sip@ietf.org'" <sip@ietf.org>, "'rohc@cdt.luth.se'" <rohc@cdt.luth.se>
Date: Thu, 23 Aug 2001 14:00:27 +0100
Subject: [Sip] UDPComp: parallels with header compression
List-Id: Session Initiation Protocol <sip.ietf.org>

Hi,

The idea of pre-agreed Huffman codes is something we already use in
header compression, as part of our framework for writing new
compression algorithms for new protocol stacks. I believe, however,
that downloading the Huffman codes "on the fly" is more appropriate
for SIP compression.

In fact, the framework we developed for header compression shares
many ideas with UDPComp, so it might be worth comparing the two schemes
to see how much we can reuse. Streamlining the terminology would be
particularly beneficial, as it will help us header compression folks
understand UDPComp all the more quickly.

-   UDPComp has a set of tokens (READ, WRITE etc.) for transmitting
    instructions from the compressor to the decompressor. The same
    system is employed in RFC3095, but the tokens are referred to as
    "encoding methods".

-   UDPComp uses a dictionary of indexed fields to store information
    at the compressor and decompressor. This is also done in RFC3095,
    but referred to as the "context".

The job of the compression algorithm is to send enough tokens to
correctly transmit each packet to the decompressor. In RFC3095 the
instructions for which tokens to send at which point are hand-coded
in the standard, which gives a high compression efficiency but is
somewhat inflexible.

In TCP compression (draft-ietf-rohc-tcp-epic-01.txt) we have the idea
of standardising a new compression algorithm as a set of instructions,
each containing an index to a dictionary item, followed by a type of
token (READ, WRITE etc.) and a probability. For example:

  encode Traffic_Class  as STATIC          99.9%
                        or IRREGULAR(6)     0.1%
    
  encode ECT_Flag       as STATIC          99.9%
                        or IRREGULAR(1)     0.1% 
    
  encode CE_Flag        as VALUE(1,0)        99%
                        or VALUE(1,1)         1%

It may not be obvious at a first glance, but each line in the set of
instructions corresponds precisely to the "longhand" form of a UDPComp
token. The token type (STATIC, IRREGULAR, VALUE etc.) is given together
with the dictionary entry that is to be read from or written to (note
that we use names rather than numbers to index the dictionary entries).
The probabilities are converted by the decompressor into a set of
Huffman codes; the compressor need only send the Huffman code to invoke
the corresponding token at the decompressor.
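To make that conversion concrete, here is a rough Python sketch (entirely illustrative, not from any draft; the three-token probability split below is invented for the example, since a two-token field gives trivially short codes) of how a decompressor could turn advertised probabilities into a local set of Huffman codes:

```python
import heapq
import itertools

def huffman_codes(weighted_tokens):
    """Build prefix codes from (token, probability) pairs via Huffman's
    algorithm: repeatedly merge the two least likely subtrees."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(prob, next(counter), {token: ""}) for token, prob in weighted_tokens]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        # Prefix the merged subtrees' codes with 0 and 1 respectively.
        merged = {t: "0" + code for t, code in c0.items()}
        merged.update({t: "1" + code for t, code in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]

# Hypothetical three-way split for a single dictionary item.
codes = huffman_codes([("STATIC", 0.90),
                       ("VALUE(1,0)", 0.08),
                       ("IRREGULAR(6)", 0.02)])
# The common STATIC token gets the shortest code; rare tokens get longer ones.
```

Since both endpoints run the same deterministic construction over the same probabilities, the compressor can then invoke a token by sending its code alone.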

The key difference between this scheme and UDPComp is how the set of
instructions is given to the decompressor. For header compression the
performance of the scheme must be optimal, so the set of instructions
is programmed into the decompressor offline. In UDPComp however, the
idea is to download the instruction set one line at a time whenever
each line is needed. This is very useful for SIP compression because
we can avoid standardising the compression algorithm altogether, and
just leave it as a local implementation decision at the compressor.

Since the two schemes are essentially identical save for how the
shorthand Huffman codes are communicated to the decompressor, it would
be very useful to develop them in parallel. The best possible outcome
would be for the two schemes to share a common dictionary structure: in
that case we would have a single unified framework for header and
signalling compression! For each protocol we could choose to standardise
the compression algorithm explicitly, or just leave it up to the
implementers to download suitable Huffman codes to the decompressor on
the fly.

Streamlining the two documents shouldn't affect the deadline for
standardising UDPComp: in fact it might help speed things along because
we have a decent tree-based dictionary structure worked out already.

More comments included inline.

Richard P

> -----Original Message-----
> From: Jonathan Rosenberg [mailto:jdrosen@dynamicsoft.com]
> Sent: Wednesday, August 22, 2001 5:51 PM
> To: Price, Richard; Jonathan Rosenberg; 'sip@ietf.org';
> 'rohc@cdt.luth.se'
> Subject: RE: Encapsulating EPIC in UDPComp
> 
> > -----Original Message-----
> > From: Price, Richard [mailto:richard.price@roke.co.uk]
> > Sent: Wednesday, August 22, 2001 8:27 AM
> > To: 'Jonathan Rosenberg'; 'sip@ietf.org'; 'rohc@cdt.luth.se'
> > Subject: RE: Encapsulating EPIC in UDPComp
> > 
> > > Richard is exactly correct that the cost of a general purpose 
> > > decompressor like this is a hit in overhead. With purely static
> > > codebook indices, the overhead is quite large as the results
> > > below show. I agree totally that some kind of Huffman coding is
> > > needed. I have been thinking about this, and I have a proposal
> > > that is different from what Richard is proposing.
> > 
> > On the contrary, your proposal is exactly how I hoped Huffman
> > coding would be introduced!
> 
> Great! I had thought you were proposing that a pre-agreed 
> upon Huffman code be referenced during startup time, or that it
> was downloaded at startup time. The proposal here is that the
> Huffman codes are constructed as you go, since you need that to
> support dynamic codebooks, which I think we all agree is needed.
> Of course, one can easily support downloading of the codes at
> start time if a primitive is defined which is "WRITE this, 
> but don't output anything into a decompressed message".

I believe that we could still have a dynamic dictionary even if we
agree/download the Huffman codes offline. In fact, this is how we
distribute the Huffman codes for header compression. However for SIP
compression it is better to download the Huffman codes as they are
needed, so that we can avoid standardising them in the Internet Draft.

> > 
> > > My proposal would be to make the support for huffman coding 
> > > very general purpose, and based on a tree-like structure where
> > > the codes are dependent on previous elements. Such a hierarchical
> > > structure is needed, I think, to support the TCCB and EPIC
> > > algorithms.
> > 
> > Yes. In EPIC, the Huffman codes are local to each dictionary item.
> 
> I think that you will need such a thing in general to achieve 
> really good compression.

Agreed. Without a local set of Huffman codes we don't get any better
performance than Lempel-Ziv.

> 
> > > So, the idea is that when you write into the codebook, you 
> > > provide a bit which says whether this write takes place at the
> > > current "level" in the hierarchy, or back at the top. Then, you
> > > provide a bitstring that represents the huffman code for the token,
> > > followed by the token.
> > 
> > I would propose to take this one step further, and make the Huffman
> > code indicate not only the location of the dictionary item, but also
> > the type of token (READ, WRITE etc.).
> 
> Excellent idea! Our goal should ultimately be that the encoder output
> is always a series of huffman codes, with some extra stuff to write
> the strings into the codebooks that are not already there. Some more 
> thoughts on that below.
> 
> 
> > We could do this as follows:
> > 
> > -   Keep all of the existing token types (READ, WRITE,
> >     WRITE-REFERENCE) but prefix them with a large bit pattern,
> >     say "11111111".
> > 
> > -   Reserve the remaining prefix codes ("00000000" to "11111110")
> >     for the Huffman codes.
> > 
> > -   When a token is sent explicitly, attach a "shorthand" Huffman
> >     code that can be used to send the token in future.
> > 
> > When we send the Huffman code, it implicitly specifies the token
> > type (READ, WRITE etc.) as well as the location of the item to be
> > read or written.
> 
> I agree with that, but I'm not sure that you want to use these large
> bit patterns to represent the initial commands. The result will be that
> initial WRITEs and READs are really, really large. You only need these
> large codes when there are lots of Huffman codes in any particular table. 
> That won't always be the case at all.
> 
> One way to solve this is to have a really nice way to dynamically
> construct the Huffman codes. Let's say you have a Huffman code
> represented by some tree. Adding a new string can be done in only one
> way. You take an existing vertex somewhere in the tree, and remove
> the subtree (if present). Add two new branches below that vertex. Add
> the subtree to the right branch, and place your new string on the
> other. 
> 

This is a particularly cunning method of sending the Huffman codes to
the decompressor!

> Here's an example Huffman tree. Assume 0 is left, 1 is right:
> 
>        
>        /\
>       /\ A
>       B C
> 
> This corresponds to:
> 
> 00  B
> 01  C
> 1   A
> 
> Now, we wish to add D. To do that, we send a command that 
> reads like: "WRITE into the table, splitting at 00, and the
> value is D". So, the decoder finds the vertex 00. To do this,
> it starts at the root, goes left, and then left again. This
> is the vertex which currently encodes B. It removes B, adds two
> branches, places B on the right, and the new element on the left:
> 
>     /\
>    /\ A
>   /\ C      
>   D B
> 
> Now, this table looks like:
> 
> 1  A
> 01 C
> 001 B
> 000 D
> 
> So, the idea is that we always start with a default table that has
> two codes in it, which says "WRITE" in one of the two places:
> 
> initial table:
> 
> Code       output
> -------------------
> 0          WRITE the string which follows by splitting the
>            indicated vertex in the current table
> 1          WRITE the string which follows into the root table
>            (i.e., change contexts) by splitting the indicated vertex
> 
> This code would need to be followed by the identity of the 
> vertex, and then the text itself.
> 
> This way, you start off with very efficient codes that write 
> into the table.
> As the table gets larger, you use the efficient codes for the common
> strings, and the code for writing into the table gets longer.
> 
> Rather than simply enclosing the text as a string, we should 
> either Huffman encode that with a standardized table, or else take
> all of the text in a message, put them together, compress it as an
> independent blob, and then send that together.

I believe that algorithms like EPIC and Lempel-Ziv do this anyway. In
EPIC we send Huffman codes for the longest matching string we can find,
where the string can be 1 character long if we can't find a better
match. So when encapsulated in UDPComp we never send a WRITE token
containing any more than 1 ASCII character. Everything else is done
using READ and WRITE-REFERENCE.
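As a rough illustration of that longest-match behaviour (a simplified toy, not the actual EPIC or Lempel-Ziv algorithm; the function and token names are my own), the tokenisation could look like:

```python
def tokenize(message, dictionary):
    """Greedy longest-match: emit READ for known strings, WRITE for
    single new characters (which are then added to the dictionary)."""
    tokens, i = [], 0
    while i < len(message):
        # Find the longest dictionary entry matching at position i.
        match = max((s for s in dictionary if message.startswith(s, i)),
                    key=len, default=None)
        if match:
            tokens.append(("READ", match))
            i += len(match)
        else:
            tokens.append(("WRITE", message[i]))  # never more than 1 char
            dictionary.add(message[i])
            i += 1
    return tokens
```

Running `tokenize("aaab", set())` gives WRITE "a", READ "a", READ "a", WRITE "b": everything beyond single new characters is expressed as dictionary reads.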

> 
> 
> 
> > 
> > The advantage of this approach is that schemes like EPIC and
> > Lempel-Ziv only ever send Huffman codes when reading and writing
> > entries in the dictionary. Once UDPComp has successfully
> > transmitted these Huffman codes to the decompressor there is no
> > need to send the tokens explicitly any more. This makes the
> > overhead of UDPComp negligible (in fact, the only overhead is the
> > reserved bit pattern "11111111" that we used to send the tokens
> > explicitly in the first place).
> 
> Right. My proposal above reduces that as well, since this overhead
> is not once per message, but once per new token.
> 
> 
> > 
> > > It is the responsibility of the encoder to make sure that each
> > > code is a valid prefix code. When the encoder sends a read
> > > command, it contains a bit which indicates whether the read is
> > > from the top of the tree or the current position, followed by
> > > the code.
> > 
> > Agreed. However we should make this indicator flag part of the
> > Huffman code as well, so that we don't have to send it explicitly
> > every time.
> 
> Right, that is the proposal above.
> 
> 
> > > An interesting improvement is that I can specify a command which
> > > says "READ code XX, and then read the next N symbols in the tree
> > > if the number of children at each node is 1, in which case you
> > > have no choice about the next symbol anyway". This is useful for
> > > trees that are really deep, as opposed to wide. If you think about
> > > it, this covers LZSS, which can be viewed as a "hierarchical"
> > > codebook where each node is a single character, and the tree
> > > is one where each node has only one child, the next character.
> > > 
> > 
> > A similar enhancement might be to allow 1 Huffman code to indicate
> > several consecutive choices of branches on the tree. This is
> > especially useful when there is a single very likely route through
> > the tree, as we can specify this route using only a single short
> > Huffman code.
> 
> Hmm, interesting idea. You need to be careful about this, though,
> since management of this tangle of tables can be complex. It would 
> be really nice to maintain a simple structure. 

As with all of the other ideas for improving the compression ratio, we
need to weigh the performance gain against the additional complexity.
For this particular modification the improvement in compression ratio
is just over 30%.

> 
> I think the way to do what you are proposing is to define a 
> command which is "WRITE into the current table, splitting the
> specified vertex. The element to write is found by reading from
> here with the following huffman codes: XXX, and then the next table
> is the existing one at the end of the read".
> 
> This turns the tree of tables into a more general purpose 
> digraph, which is probably a really good idea. It can save a lot
> of overhead in the management of the table. Protocols, like SIP,
> have many "paths" through the message which can arrive at a common
> piece of data (for example, the SIP URL). This way, the encoding
> for the SIP URL appears only once in memory, rather than
> based on each path to get to it.

I believe this also improves the compression ratio, because the same
URL can appear several times in a given SIP message. If we use a
digraph then we can avoid sending the URL more than once.

> 
> Maybe time to start writing all of this up in an updated I-D 
> for UDPComp, since it's probably hard to get it from these terse
> emails....
> 
> -Jonathan R.
> 
> 
> ---
> Jonathan D. Rosenberg, Ph.D.                72 Eagle Rock Ave.
> Chief Scientist                             First Floor
> dynamicsoft                                 East Hanover, NJ 07936
> jdrosen@dynamicsoft.com                     FAX:   (973) 952-5050
> http://www.jdrosen.net                      PHONE: (973) 952-5000
> http://www.dynamicsoft.com
>