Re: [Cfrg] Merkle trees [was Re: streamable AEAD construct for stored data?]

Bryan Ford <brynosaurus@gmail.com> Sat, 07 November 2015 13:45 UTC

Return-Path: <brynosaurus@gmail.com>
X-Original-To: cfrg@ietfa.amsl.com
Delivered-To: cfrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id CBC3C1A00FE for <cfrg@ietfa.amsl.com>; Sat, 7 Nov 2015 05:45:07 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.999
X-Spam-Level:
X-Spam-Status: No, score=-1.999 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zHLVeqEpmSav for <cfrg@ietfa.amsl.com>; Sat, 7 Nov 2015 05:45:04 -0800 (PST)
Received: from mail-pa0-x22c.google.com (mail-pa0-x22c.google.com [IPv6:2607:f8b0:400e:c03::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 41EF11A00FA for <cfrg@irtf.org>; Sat, 7 Nov 2015 05:45:04 -0800 (PST)
Received: by pasz6 with SMTP id z6so155501632pas.2 for <cfrg@irtf.org>; Sat, 07 Nov 2015 05:45:03 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=content-type:mime-version:subject:from:in-reply-to:date:cc :message-id:references:to; bh=bZzjKeICraoYnLNRkY4U21zwNDGs3bOLropBLM1cN3s=; b=M3z3ZBYuLXtPw5FkWJ9n6n5p3JX4Ui1XtCOHmJycUESvtJ9/NUiEwytgyDwkmjRYdV /NDazbfWrkmb7D0LL232hYHYzEYMV42mjzcpCwmLm1C4o/uj6g61M0iyqzvfI+xVVhqa 1QV7s7cTK18tMBesuULxwA8Z9NzQTT2goSxTGguJZFPO4wz3F+rDtNDkLZAtrvIwxPIB nq1lWwUkw29EKyDpjji0mtXbTpt671QuAb3K1xagPghh3zdwyL4s6nn/uWRu9NvDZQ0n C2ruyxq6xa+Sxjn6RcDSWddmjsMzXx3JI1BBXbAy/GLcZL2nTYvdmcQKFh2JOYOlf6HO 4dRg==
X-Received: by 10.68.106.69 with SMTP id gs5mr25037903pbb.91.1446903903286; Sat, 07 Nov 2015 05:45:03 -0800 (PST)
Received: from [10.11.0.66] (y255183.ppp.asahi-net.or.jp. [118.243.255.183]) by smtp.gmail.com with ESMTPSA id rv8sm6004586pbc.84.2015.11.07.05.44.59 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Sat, 07 Nov 2015 05:45:01 -0800 (PST)
Content-Type: multipart/signed; boundary="Apple-Mail=_E115DEB3-8764-42DF-97F2-4CFB2494BFF7"; protocol="application/pkcs7-signature"; micalg="sha1"
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
From: Bryan Ford <brynosaurus@gmail.com>
In-Reply-To: <CAM_a8JzO9ThuAoo+yymzj_Gjbo4+4=W1ox2Faj6apmV6gq2r6g@mail.gmail.com>
Date: Sat, 07 Nov 2015 22:45:00 +0900
Message-Id: <F97D91CF-376F-4EC8-8963-A6DE2464ABDA@gmail.com>
References: <20151031193619.2CCF260350@jupiter.mumble.net> <5638E22D.9060003@st.com> <CAG5KPzzV0pOJV-fSoBue+5PuVc5FLvscugYAwCGNpwLCBFVzng@mail.gmail.com> <CAM_a8JzO9ThuAoo+yymzj_Gjbo4+4=W1ox2Faj6apmV6gq2r6g@mail.gmail.com>
To: Zooko Wilcox-OHearn <zooko@leastauthority.com>
X-Mailer: Apple Mail (2.2104)
Archived-At: <http://mailarchive.ietf.org/arch/msg/cfrg/OtI-o4OgU-nFXafCHz9M_okdYMg>
Cc: "cfrg@irtf.org" <cfrg@irtf.org>
Subject: Re: [Cfrg] Merkle trees [was Re: streamable AEAD construct for stored data?]
X-BeenThere: cfrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Crypto Forum Research Group <cfrg.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/cfrg>, <mailto:cfrg-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/cfrg/>
List-Post: <mailto:cfrg@irtf.org>
List-Help: <mailto:cfrg-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/cfrg>, <mailto:cfrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Sat, 07 Nov 2015 13:45:08 -0000

Hi Zooko,

> On Nov 4, 2015, at 11:01 AM, Zooko Wilcox-OHearn <zooko@leastauthority.com> wrote:
> 
> Preventing length-extension attack was a prime requirement for SHA-3,
> because users had sometimes interpreted "can produce a terminal node
> that has this data as input" as a proof of "knows this data". This
> resulted in security failures, for example against naive key-prefix
> MACs.
> 
> Now, what if I give you the root of a Merkle Tree but withhold from
> you the data (the inputs to the leaves). Should you be able to
> generate a new Merkle Tree which has my Merkle Tree as one of its
> subtrees? Obviously not! That would be a length-extension attack.
> Sakura ¹ and BLAKE2 ² (designed after SHA-3) don't allow that.

In at least some Merkle tree usage scenarios I would argue are “reasonable,” that is precisely what you want.  In particular, you sometimes want Merkle trees to “compose” cleanly, allowing large Merkle trees to be built from smaller Merkle trees whose (individual) creators didn’t necessarily know or care how big the large, composed Merkle tree would be.

Case in point: the collective timestamping and time-freshness-proof we’ve implemented in the collective signing/witnessing scheme I presented in the Trans WG (see http://datatracker.ietf.org/doc/draft-ford-trans-witness/ <http://datatracker.ietf.org/doc/draft-ford-trans-witness/> and http://arxiv.org/abs/1503.08768 <http://arxiv.org/abs/1503.08768>) and briefly in the NTP WG.

In the timestamping use-case for this protocol, we have a “leader” timeserver of some kind, and any number of followers or “witnesses”.  Every so often, say every 10 seconds or so, the leader initiates a new round in which it proposes a new uint64 timestamp representing what it thinks is the current time, and broadcasts a message to all its witnesses soliciting contributions to a large Merkle tree of hashes to be timestamped in that rounds.  Both the leader and all followers can accept timestamp requests from clients, in which a client simply submits a hash or other random value of the client’s choosing, and waits for an answer until the next round completes.  

Each server (leader or follower) takes all the hashes/nonces submitted in each round, and composes them into a local Merkle tree, yielding a hash representing the root of that tree.  For efficiency the followers are organized into a communication-tree structure (not to be confused with the Merkle trees).  Each leaf follower in this communication tree passes the root of its local Merkle tree up to its parent, who composes its own local Merkle tree and those of its children into a larger Merkle tree to be passed up to its parent, and so on, until the leader finally forms the top levels of a global Merkle tree accounting for all hashes/nonces submitted by all clients to any of the servers in that round.

Now the leader initiates a second round-trip down through the communication-tree, which serves two purposes: (a) it allows the leader to collect the parts of a collective signature on the root of the global Merkle tree, and (b) more importantly for the present purposes, it enables servers higher in the communication-tree to “hand down” to each of their immediate children an inclusion-proof (Merkle tree path) of each immediate child’s smaller Merkle tree within the global Merkle tree.  These inclusion-proofs further compose going down the communication-tree towards the leaves.  Any witness server anywhere in the communication-tree who observes that its local Merkle tree wasn’t included in the global one to be signed (i.e., if the inclusion proof doesn’t check out) can raise an alarm and refuse to co-sign.  Finally, once each server has an inclusion-proof relating its local Merkle tree to the collectively-signed global root for this round, each server can compose that with an inclusion-proof within its local Merkle tree to produce an inclusion-proof for each of its waiting clients, proving to that client (and anyone else) that the client’s submission in that round was included (and hence timestamped) under the collective authority of the collective signature on the global Merkle tree root for that round.

I think it’s fairly important for the efficiency of this scheme that (a) each individual server gets to create a local tree of whatever size it “needs” in that round (depending on the number of clients that show up and submit hashes to timestamp in that round), without having to coordinate in advance with other servers or their parents in the communication-tree; (b) the higher-level servers in the communication tree need not know or care how big their descendants’ local Merkle trees are, or whether their sizes or depths are “balanced” with respect to each other; and (c) most important, none of these servers ever need to ship around the bulk contents of any of these Merkle trees.  The most that gets communicated between any pair of communicating partners in any round (between parent and child servers or between any server and one of its clients) is a single inclusion proof.  Clients who just wanted a particular hash timestamped just get one inclusion proof (and the collectively signed global Merkle tree root) in return, and don’t need to know or care what part of that inclusion proof connects the global root to their server’s local root and which part connects the local root to their hash.

The type of “length-extension attack” you’re worried about here (shouldn’t it be called a “tree-extension attack”?) aren’t a security problem at least in the timestamping case because the root of trust is the global root of the tree: anyone is free to compose that tree or any (perhaps unknown to them) subtree of it into some other Merkle tree, but no one is going to use that for anything (at least in this protocol) unless it’s appropriately signed.  Further, anyone is free to take a subtree (or an entire global root) from one round and submit it as a “hash to be timestamped” in a future round by the same leader, or in another time-sequence driven by another leader entirely; doing so is useful because it can create provable “happens-before” relationships relating one leader’s time-series to any other's.  If handled appropriately, this type of “ad hoc composition” can ensure that there is eventually a demonstrable “happens-before” relationship, evidenced by a single, efficient, uniform Merkle inclusion proof, between any two “timestamps” on any two trees, provided their trees cross-fertilize regularly in this way.

Outside of the scalable collective timestamping example, I think it’s often desirable to be able to build a Merkle tree incrementally without initially knowing its complete size, or to create incremental modifications (“edits”) of an existing Merkle tree efficiently, by creating a new tree that just re-uses parts of an older one.  For example, in a (perhaps distributed) file system in which large files or directories are represented by Merkle trees, it can be nice not to have to re-create the entire Merkle tree when one part of it changes, but to allow some number of incremental changes that reuse parts of the old tree and thus require less metadata-update (perhaps until the tree gets too messy or unbalanced).  In the discussion on streaming-mode encryption for OpenPGP, there was an example of the encryptor needing to build a Merkle tree incrementally without knowing in advance how big it will eventually - although in this case it at least can always build it “left-to-right, bottom-to-top”, which simplifies things.

For many use-cases, especially where we expect Merkle trees always to be built “all at once”, I agree that it’s probably best to include the type of “unexpected-reuse-protection” or “tree-extension protection” mechanisms you have in mind.  But my point is simply that there are also situations in which I would want to be able to turn them off. :)

B