Re: [art] Request for review of draft-diaz-lzip-07

Antonio Diaz Diaz <antonio@gnu.org> Thu, 27 April 2023 12:43 UTC

Message-ID: <644A6E12.3030001@gnu.org>
Date: Thu, 27 Apr 2023 14:44:02 +0200
From: Antonio Diaz Diaz <antonio@gnu.org>
User-Agent: Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.9.1.19) Gecko/20110420 SeaMonkey/2.0.14
MIME-Version: 1.0
To: "Dale R. Worley" <worley@ariadne.com>
CC: art@ietf.org, Antonio Diaz Diaz <antonio@gnu.org>
References: <87jzy0ofwb.fsf@hobgoblin.ariadne.com>
In-Reply-To: <87jzy0ofwb.fsf@hobgoblin.ariadne.com>
Content-Type: text/plain; charset="ISO-8859-15"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/art/_XM07Yx3RxIGCJPhzX8bDNcUQ9E>
Subject: Re: [art] Request for review of draft-diaz-lzip-07

Dale,

Thank you very much for your detailed remarks.

Dale R. Worley wrote:
> It would help if you would put in your message what the goal of the
> document is, so that people would be motivated to review it.

You are right. I should have stated it explicitly instead of just linking to 
the original request.

I'm trying to publish draft-diaz-lzip as an RFC so that I can register at
IANA a media type, a content encoding, and a structured syntax suffix for
lzip compressed data.

> Note this is a media type registration in the standards tree, and so
> must follow the procedures of RFC 6838 sec. 5, and it appears that this
> document and review is part of that process.

Yes, I'm trying to follow the procedures of RFC 6838 sec. 5.

> Generally, I think this document is in good condition to submit to IANA
> for registration, and to the Independent Stream Editor for publication.

Thanks.

Maybe the independent stream is not the right one for draft-diaz-lzip, 
because section 5.2 of RFC 6838 states that: "Normal IETF processes need to 
be followed for all IETF registrations in the standards tree", which seems 
to imply an "IETF Review" registration policy. Then, section 4.8 of RFC 8126 
states that: "With the IETF Review policy, new values are assigned only 
through RFCs in the IETF Stream -- those that have been shepherded through 
the IESG as AD-Sponsored or IETF working group documents [RFC2026] 
[RFC5378], have gone through IETF Last Call, and have been approved by the 
IESG as having IETF consensus". This seems consistent with the fact that 
both brotli (RFC 7932) and zstd (RFCs 8478 and 8878) were submitted as 
"individual in art area".

>     ... provides a 3 factor integrity checking to maximize
>     interoperability and optimize safety.
>
> This doesn't read well, as I am led to think that "3 factor integrity
> checking" is similar to "2 factor authentication".  The meaning seems to
> be that Lzip has internal checking fields to defend against
> unintentional data corruption, of approximately the reliability of a CRC.

IMHO it is similar to "2 factor authentication" in the sense that each 
factor is independent of the others and may therefore help in discriminating 
real corruption from a false positive for corruption. BTW, the reliability 
of the integrity checking of lzip is estimated to be more than 49 million 
times higher than that of a CRC32 alone for the single-bit error model (bit 
flip). For the zeroed-block error model (whole sector read error) 
reliability is even better, and I estimate that the lower bound for the MTBF 
in this case is about 2.5 times longer than the age of the universe.

> I also have a gripe with the use of "maximize" and "optimize" when the
> thing in question isn't the *maximum* or *optimum* that can be obtained
> under the stated constraints, but rather is used just to mean "better",
> often without stating what it is "better" than.

Yes, I have not yet managed to write that phrase in a clear way.

> So could you rewrite this phrase to be clearer?

Sure. The complete phrase is:

"Lzip uses a simplified form of the LZMA stream format and provides a 3 
factor integrity checking to maximize interoperability and optimize safety."

Note that the "simplified form of the LZMA stream format" is what "maximizes 
the interoperability". I think that this part of the phrase is exact. I have 
removed all options from the original LZMA stream format; lzip only uses 
default values for LZMA parameters, and always finishes the LZMA stream with 
an "End Of Stream" marker (optional in the original LZMA). Also, integrity 
checking in lzip is unique and mandatory. This means that all lzip 
implementations produce and decode exactly one and the same format, which 
makes it impossible for one implementation to fail to decode a file 
produced by another implementation.

OTOH, the "3 factor integrity checking" is what "optimizes the safety". This 
may be inexact, but I haven't found a better way to express it. Lzip 
provides a complete set of fields in its trailer, and uses all of them to 
check integrity as completely as possible. Gzip does not have a "member 
size" field, and bzip2 lacks both member and data sizes. Writing that "lzip 
provides a 3 factor integrity checking that improves safety compared with 
gzip and bzip2" does not seem more informative than the current wording.
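
To make the three factors concrete, here is a minimal Python sketch. It is
an illustration only, not code from the draft or from any lzip
implementation. It assumes the 20-byte trailer layout described in the lzip
manual (CRC32 of the uncompressed data, data size, member size, all
little-endian) and that the CRC is the one computed by zlib.crc32. The
name check_trailer is hypothetical.

```python
import struct
import zlib

TRAILER_SIZE = 20  # assumed layout: CRC32 (4) + data size (8) + member size (8)


def check_trailer(member: bytes, decompressed: bytes) -> dict:
    """Compare each trailer field with an independently computed value."""
    crc, data_size, member_size = struct.unpack("<IQQ", member[-TRAILER_SIZE:])
    return {
        "crc32": crc == zlib.crc32(decompressed),       # factor 1
        "data_size": data_size == len(decompressed),    # factor 2
        "member_size": member_size == len(member),      # factor 3
    }
```

Because the three results are computed independently, a single mismatch
with the other two matching points at a damaged trailer field rather than
at damaged data.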

Maybe I could change the phrase to something like:

"Lzip uses a simplified form of the LZMA stream format to maximize 
interoperability and provides an accurate and robust 3 factor integrity 
checking."

>     interoperability
>
> The document uses "interoperable" in a number of places but it's not
> clear to me it what ways it is interoperable, or perhaps better to say,
> it's not clear in what ways a data compression algorithm from bytes to
> bytes might not be interoperable.  And the media type registration
> itself just says "Interoperability considerations:  N/A"

Interoperability problems, as in "my decompressor for format X can't decode 
some files in format X" do not seem to be uncommon nowadays. See for example 
section 3 of RFC 8878 (zstd):

    A compliant decompressor must be able to decompress at least one
    working set of parameters that conforms to the specifications
    presented here.  It may also ignore informative fields, such as the
    checksum.  Whenever it does not support a parameter defined in the
    compressed stream, it must produce an unambiguous error code and
    associated error message explaining which parameter is unsupported.

or section 3.1.2:

    Skippable frames defined in this specification are compatible with
    skippable frames in [LZ4].

which may make it difficult for a decompressor to tell a zstd file from an 
LZ4 file if both start with one (or more) skippable frames.
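
The ambiguity can be sketched with a small format sniffer in Python. This
is a hedged illustration, not a real detection tool: guess_format is a
hypothetical helper, the zstd magic numbers are those of RFC 8878, and the
LZ4 frame magic and the "LZIP" ID string are taken from the respective
format descriptions as assumptions.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528      # zstd frame magic (RFC 8878), stored little-endian
LZ4_MAGIC = 0x184D2204       # LZ4 frame magic (assumed from the LZ4 frame format)
SKIPPABLE_BASE = 0x184D2A50  # skippable frames use 0x184D2A50..0x184D2A5F
SKIPPABLE_MASK = 0xFFFFFFF0


def guess_format(data: bytes) -> str:
    """Guess the compressed format from the first four bytes only."""
    if data[:4] == b"LZIP":
        return "lzip"
    if len(data) < 4:
        return "unknown"
    (magic,) = struct.unpack_from("<I", data)
    if magic == ZSTD_MAGIC:
        return "zstd"
    if magic == LZ4_MAGIC:
        return "lz4"
    if (magic & SKIPPABLE_MASK) == SKIPPABLE_BASE:
        # Both formats accept these frames, so the prefix alone is ambiguous.
        return "zstd or lz4 (skippable frame)"
    return "unknown"
```

A sniffer like this has to skip every leading skippable frame before it
can decide between zstd and LZ4.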

Lzip is free from all these interoperability problems.

I'll think about how to explain this in the draft.

>     1.1.  Purpose
>
>     The data can be produced or consumed, even for an arbitrarily long
>     sequentially presented input data stream, using only an a priori
>     bounded amount of intermediate storage, and hence can be used in data
>     communications.
>
> This depends on exactly what is meant by data communications.

I think I have used the same meaning as the RFCs for gzip, brotli, and zstd, 
as all of them use a similar wording:

gzip RFC 1952:
           * Can compress or decompress a data stream (as opposed to a
             randomly accessible file) to produce another data stream,
             using only an a priori bounded amount of intermediate
             storage, and hence can be used in data communications or
             similar structures such as Unix filters;

brotli RFC 7932:
       *  can be produced or consumed, even for an arbitrarily long,
          sequentially presented input data stream, using only an a
          priori bounded amount of intermediate storage; hence, it can be
          used in data communications or similar structures, such as Unix
          filters.

zstd RFC 8878:
    The data can be produced or consumed, even for an arbitrarily long
    sequentially presented input data stream, using only an a priori
    bounded amount of intermediate storage; hence, it can be used in data
    communications.

> If there are timing considerations regarding the end-to-end delays
> involved, it's necessary to be able to "flush" the compressed data
> stream, that is, to be able to command the compressor to emit enough
> bytes to get the decompressor to output the last-input source byte,
> without the source needing to supply any further bytes of the data
> stream.  (Imagine a line-by-line dialog like a Telnet session.)  I'd call
> this the difference between "file compression" and "(real-time) data
> communications compression".

I think gzip, brotli, zstd, and lzip all belong to the "file compression" 
category.

> Now it does appear to me that Lzip has this facility.  I haven't dug
> into the LZMA algorithm, but it seems like the compressor can end
> creating a member at any point in the data stream that it chooses, and
> sent the tail of the member to the decompressor.  That appears to
> require 26 bytes of overhead, so it doesn't seem a good strategy.  Is
> there a better way?  If you want to state "Lzip can be used in data
> communications", it's worth documenting an efficient way to flush the
> compression.

Neither of the two formats I know best (gzip and lzip) provides any flush 
mechanism beyond starting a new member. For the kind of "(real-time) data 
communications" you have in mind, you would need to use either zlib or 
lzlib. Note that zlib and gzip have separate media types. See RFC 6713 (The 
'application/zlib' and 'application/gzip' Media Types).
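
As an illustration of library-level flushing, here is a minimal Python
sketch using the standard zlib module (lzlib is not shown). The helper
names send_line and receive are hypothetical. Each line is followed by a
Z_SYNC_FLUSH so that the receiver can decode it without waiting for more
input.

```python
import zlib

compressor = zlib.compressobj()
decompressor = zlib.decompressobj()


def send_line(line: bytes) -> bytes:
    """Compress one line and flush so the peer can decode it immediately."""
    return compressor.compress(line) + compressor.flush(zlib.Z_SYNC_FLUSH)


def receive(chunk: bytes) -> bytes:
    """Decompress whatever has arrived so far on the same stream."""
    return decompressor.decompress(chunk)


first = receive(send_line(b"login: guest\n"))
second = receive(send_line(b"password: \n"))
```

Both lines travel in one continuous zlib stream. No new member or stream
is started between them.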

>     2.  File Format
>
>     A lzip file consists of a series of independent "members" (compressed
>     data sets).
>
> Given the common use of "members" to mean the individual files within an
> archive format (and the IBM usage of "data set" to mean a file), it
> would help if you explain this a bit more.

"member" has been used since about 1992 by GNU Gzip with exactly the same 
meaning as it has in lzip. You can find it in RFC 1952 (GZIP file format 
specification version 4.3). Moreover, the expression "multimember file" is 
the standard way of referring to the files created by plzip (the parallel 
implementation of lzip).

The structure of a lzip member is described in the paragraph following the 
sentence above. Therefore I expect the reader to quickly notice that a lzip 
member is not a tar member or a zip member without the need to mention 
archive formats.

> AFAICT, the intention is
> that when the Lzip file is decompressed, each member is decompressed
> separately, and then the decompressed members are concatenated to form
> the full output data.  In particular, the "members" have no identities
> that are visible from outside the file structure.

Correct. But I consider these properties evident and do not think they need 
to be documented. When decompressing compressed blocks/members/streams, 
either sequentially or in parallel, the resulting decompressed 
blocks/members/streams are concatenated to form the full output data in 
every compressed format I know.
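
The concatenation property can be shown with a short Python sketch. It
uses gzip as a stand-in, on the assumption that the Python standard library
has no lzip decoder. The behavior for multimember lzip files is analogous.

```python
import gzip

# Two independently compressed members, concatenated byte for byte.
part1 = gzip.compress(b"first member\n")
part2 = gzip.compress(b"second member\n")
multimember = part1 + part2

# Decompression yields the concatenation of the decompressed members.
restored = gzip.decompress(multimember)
```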

> Also, must there be at least one member?  Best to state that
> unambiguously.

Thanks. I have changed the phrase above to:

    A lzip file consists of one or more independent "members" (compressed
    data sets).

Note that there is nothing in a gzip/lzip file but members. A file with no 
members is an empty file, not a lzip file.

>     DS (coded dictionary size, 1 byte)
>        ...
>        Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).
>        Bits 7-5 contain the numerator of the fraction (0 to 7) to
>        ...
> How are bits within this byte numbered?  That doesn't seem to be
> specified.

Thanks. I have added the following phrase to section 2 (File Format):

    In a byte, bit 7 is the most significant bit (MSB), while bit 0 is
    the least significant bit (LSB).
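
With that bit numbering, decoding the DS byte can be sketched in Python as
follows. This is an illustration only. The text quoted above elides part of
the field description, so the sketch assumes, following the lzip manual,
that the fraction is expressed in sixteenths of the base size. The function
name dictionary_size is hypothetical, and the sketch does not check that
the result lies within the valid range of dictionary sizes.

```python
def dictionary_size(ds: int) -> int:
    """Decode the coded dictionary size (DS) byte of a lzip header."""
    base_log = ds & 0x1F          # bits 4-0: base 2 logarithm of the base size
    numerator = (ds >> 5) & 0x07  # bits 7-5: numerator of the fraction
    if not 12 <= base_log <= 29:
        raise ValueError(f"invalid base size exponent: {base_log}")
    base_size = 1 << base_log
    # Assumption: the fraction is numerator/16 of the base size.
    return base_size - numerator * (base_size // 16)
```

For example, 0xD3 has bits 4-0 equal to 19 and bits 7-5 equal to 6, which
under this assumption gives 2^19 - 6 * 2^15 = 320 KiB.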

>     Member size (8 bytes)
>        Total size of the member, including header and trailer.  This
>        field acts as a distributed index, allows the verification of
>        stream integrity, and facilitates the safe recovery of undamaged
>        members from multimember files.
>
> You might want to explain how this "facilitates the safe recovery of
> undamaged members" -- I would assume that the recovery process is to
> advance through the Lzip file looking for an ID, then seeing if decoding
> can proceed successfully from that point.

It basically allows all the members to be located efficiently (index), 
unless some header or member size is corrupt, in which case it still helps 
in validating the ID in the headers of the undamaged members. Explaining 
all the ways in which lziprecover (the data recovery tool for lzip files) 
can use the "member size" field is complicated, and possibly out of scope 
for a specification. It would perhaps be better for anybody interested to 
read the documentation of lziprecover.
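
The index use alone can be sketched in a few lines of Python. This is a
simplified illustration and not how lziprecover works. It assumes an
undamaged file, a 6-byte header starting with the ID string "LZIP", and a
20-byte little-endian trailer whose last 8 bytes are the member size.
list_members and fake_member are hypothetical helpers, and fake_member
builds test fixtures that do not contain a real LZMA stream.

```python
import struct

MAGIC = b"LZIP"
HEADER_SIZE = 6    # assumed: ID string (4) + version (1) + DS (1)
TRAILER_SIZE = 20  # assumed: CRC32 (4) + data size (8) + member size (8)


def list_members(data: bytes) -> list:
    """Walk backward from the end of the file, one member size at a time."""
    members = []
    end = len(data)
    while end > 0:
        if end < HEADER_SIZE + TRAILER_SIZE:
            raise ValueError("truncated member")
        (member_size,) = struct.unpack_from("<Q", data, end - 8)
        start = end - member_size
        if member_size < HEADER_SIZE + TRAILER_SIZE or start < 0:
            raise ValueError("bad member size")
        if data[start:start + 4] != MAGIC:
            raise ValueError("no header at the computed offset")
        members.append((start, member_size))
        end = start
    members.reverse()
    return members  # list of (offset, size) pairs in file order


def fake_member(payload: bytes) -> bytes:
    """Build a header + payload + trailer fixture (payload is not real LZMA)."""
    size = HEADER_SIZE + len(payload) + TRAILER_SIZE
    return MAGIC + bytes([1, 0x0C]) + payload + struct.pack("<IQQ", 0, 0, size)
```

No compressed data is decoded at any point. Each step reads only one
trailer field and checks one header.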

>        Member size should be limited to
>        2 PiB to prevent the data size field from overflowing.
>
> It seems to me that preventing Data Size from overflowing would be done
> in the compressor by counting the number of bytes that it is processing
> and breaking off generating the current member before the count gets to
> 16 EiB.  More subtly, in the worst case, Member Size can somewhat exceed
> Data Size, so the compressor also has be counting the size of the
> generated member.

Overflowing the data size field should not have bad effects on the decoding 
itself. It will simply show a truncated decompressed size, just as "gzip -l" 
has done until recently (but with a 32-bit data size field). Limiting the 
"member size" field to 2 PiB is the easiest way of preventing the overflow, 
but it can be prevented by other means, or not prevented at all. (This is 
why I have used "should"). In any case, I expect that anybody trying to 
compress such a large file will use plzip to create a multimember file, or 
else it may take ages to compress it. The compression speed of the 
single-threaded lzip at compression level -0 (fast) on my AMD Athlon 64 X2 
Dual Core Processor 5200+ is about 20 MB/s. Compressing just 2 PB of 
uncompressed data would take 100 million seconds (more than 3 years).
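
The estimate can be checked with a few lines of Python. The figures are
the ones given above, using decimal units (1 PB = 10^15 bytes, 1 MB = 10^6
bytes).

```python
SIZE_BYTES = 2 * 10**15          # 2 PB of uncompressed data
SPEED_BYTES_PER_S = 20 * 10**6   # about 20 MB/s

seconds = SIZE_BYTES // SPEED_BYTES_PER_S
years = seconds / (365.25 * 24 * 3600)

print(seconds, round(years, 2))
```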

>     3.  Format of the LZMA stream in lzip files
>
>     The EOS marker is the only marker allowed in lzip files.
>
> I'm guessing that "marker" is a term of art in regard to LZMA
> compression.  You probably want to write this sentence to make that
> clear.  E.g. "The EOS marker is the only LZMA "marker" allowed in lzip
> files."

Done. Thanks.

>     What follows is a description of the decoding algorithm for
>     LZMA-302eos streams using as reference the source code of "lzd", an
>     educational decompressor for lzip files which can be downloaded from
>     the lzip download directory.  Lzd is written in C++11 and its source
>     code is included in appendix A.
>
> I think you want to reverse these clauses.  If you're publishing an RFC
> to register Lzip, the reference definition of the decoding process is
> the code in the RFC, which is Appendix A of this document.  That's the
> archival version of the code, not a web site.

Also done. Thanks again.

>     5.  Security Considerations
>
>     Any data appended to the
>     end of the stream is easily detected, and an error can be signaled.
>
> At first look, this appears to be incorrect, as an additional member can
> be appended, resulting in a valid Lzip file.

I meant "Any non-lzip data". I have fixed it in my copy of the draft.

Of course one or more additional members can be appended, just like in most 
compressed formats, and even in plain uncompressed data. But I don't see how 
this could be avoided.

> But in any case, Lzip (like every other compression format per se) offers
> no protection against deliberate modification of the data stream, as the
> attacker can decompress, modify, and recompress the compressed data
> stream.

True.

Best regards,
Antonio.