Re: [openpgp] Reducing the meta-data leak

Bryan Ford <brynosaurus@gmail.com> Sat, 07 November 2015 11:34 UTC

From: Bryan Ford <brynosaurus@gmail.com>
Content-Type: multipart/signed; boundary="Apple-Mail=_2DB6DB06-D96C-4406-9458-FC9C96D1CD24"; protocol="application/pkcs7-signature"; micalg="sha1"
Message-Id: <160A8D98-3DF8-4F51-A38C-EF3E0DAE71EE@gmail.com>
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\))
Date: Sat, 07 Nov 2015 20:33:59 +0900
References: <mailman.92.1446580813.31211.openpgp@ietf.org> <86CB1513-F594-4A9B-A3B6-17ECB9CA9EB6@isoc.org>
To: "openpgp@ietf.org" <openpgp@ietf.org>
In-Reply-To: <86CB1513-F594-4A9B-A3B6-17ECB9CA9EB6@isoc.org>
X-Mailer: Apple Mail (2.2104)
Archived-At: <http://mailarchive.ietf.org/arch/msg/openpgp/TfV8pAs001mXwna_k_26vIcubSY>
Subject: Re: [openpgp] Reducing the meta-data leak

I’m glad to see the metadata-leakage protection topic drawing some interest, including some healthy skepticism about whether it’s practical. :)

I’ll try to summarize my scheme at least somewhat concisely, and include this in a draft-of-a-draft that I have in progress.  A bare-bones (i.e., incomplete, badly documented) but working proof-of-concept of this scheme already exists in the dedis crypto library for Go, available here:

	Code: https://github.com/DeDiS/crypto/tree/master/nego
	GoDoc: https://godoc.org/github.com/DeDiS/crypto/nego

But I would recommend reading the conceptual summary below before trying to understand the code. :)

The main desirable properties of the scheme are: (a) it produces uniform random blobs with no unencrypted metadata; (b) it supports multiple encryption schemes, both passphrase-based and public-key - though “too many” schemes would get costly; (c) it supports multiple passphrases and/or recipient public keys with each scheme; and (d) for a given scheme and a given passphrase or private key held by a recipient, the recipient needs to perform at most one public-key crypto operation and O(log N) symmetric-key crypto operations - and to read a similarly small part of the input - to determine whether the file is encrypted for that passphrase/keypair, and to unlock it if so.

Let’s work from a simple case to progressively more “interesting” and general cases:

1. Encryption using a single scheme (e.g., a single well-known “ciphersuite”) via a single passphrase.
2. Single symmetric-key scheme, but multiple alternative passphrases using that scheme.
3. Single public-key encryption scheme, with message encrypted for one or multiple public keys compatible with that scheme.
4. Multiple symmetric and/or public-key schemes, allowing multiple different passphrases and recipient public-keys each, respectively.

—
1. Suppose simplistically we only want single-passphrase encryption using a single well-known encryption scheme.  More-or-less as usual for PGP, we (a) choose a random session key, (b) use a password-hashing scheme such as Argon2 to produce a symmetric encryption key that is used to AEAD-encrypt the file’s session key, and (c) use the session key to AEAD-encrypt the file’s content.  Assume for now that we encrypt the file in one big chunk, though chunking (with metadata encryption) can be added as well using existing techniques (e.g., those mentioned earlier analyzed in the context of SSH).

So we basically have three important blobs of data to place in the file: (a) the salt for the password-hashing scheme, (b) the AEAD-encrypted session key encrypted using the password-hasher’s output, and (c) the AEAD-encrypted file content encrypted using the session key.  Item (a) needs to be encoded in the file in “cleartext” since it needs to be available to the decryptor before it can decrypt anything, but fortunately the salt can just be a uniformly random blob anyway (of a length fixed by this well-known scheme).  So for the moment let’s just put it at the very beginning of the encoded file.  Then place the AEAD-encrypted session key blob (b) immediately afterwards, whose size can also easily be fixed for this scheme.  This fixed-length session-key blob may contain encrypted metadata in addition to the session key, such as the file offset of the AEAD-encrypted file content, the (possibly padded) total size of the AEAD-encrypted blob, and perhaps the size of the “useful payload” within that blob after removing any padding.  This metadata will of course appear as uniform random bits to a non-recipient as long as the AEAD encryption scheme is doing its job.  Finally, place the AEAD-encrypted file content (c), including any padding, after the encrypted session-key blob as the rest of the file.
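To make the single-passphrase layout concrete, here’s a rough Go sketch of the [salt | AEAD session-key blob | AEAD content] layout.  All names and sizes here are mine, and SHA-256 stands in for a real password hasher like Argon2 purely to keep the sketch dependency-free; a real implementation must use a memory-hard KDF, and would carry the offset/length metadata inside the session-key blob as described above.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// aeadSeal/aeadOpen wrap AES-256-GCM; keys here are single-use, so a
// fixed all-zero nonce is safe in this sketch.
func aeadSeal(key, plaintext []byte) []byte {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	return gcm.Seal(nil, make([]byte, 12), plaintext, nil)
}

func aeadOpen(key, ciphertext []byte) ([]byte, error) {
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	return gcm.Open(nil, make([]byte, 12), ciphertext, nil)
}

// encrypt emits: (a) 16-byte salt, (b) 48-byte encrypted session-key
// blob (32-byte key + 16-byte GCM tag), (c) encrypted content.
func encrypt(passphrase, content []byte) []byte {
	salt := make([]byte, 16)
	rand.Read(salt)
	sessionKey := make([]byte, 32)
	rand.Read(sessionKey)

	passKey := sha256.Sum256(append(append([]byte{}, salt...), passphrase...)) // stand-in KDF

	out := append([]byte{}, salt...)
	out = append(out, aeadSeal(passKey[:], sessionKey)...)
	return append(out, aeadSeal(sessionKey, content)...)
}

func decrypt(passphrase, file []byte) ([]byte, error) {
	salt, rest := file[:16], file[16:]
	passKey := sha256.Sum256(append(append([]byte{}, salt...), passphrase...))
	sessionKey, err := aeadOpen(passKey[:], rest[:48])
	if err != nil {
		return nil, err // wrong passphrase: AEAD authentication fails
	}
	return aeadOpen(sessionKey, rest[48:])
}

func main() {
	blob := encrypt([]byte("hunter2"), []byte("hello metadata-free world"))
	got, err := decrypt([]byte("hunter2"), blob)
	fmt.Println(err == nil, bytes.Equal(got, []byte("hello metadata-free world")))
}
```

Everything after the salt is AEAD output, so the whole file is indistinguishable from random bits to anyone without the passphrase.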

—
2. Single passphrase-encryption scheme, multiple passphrases:

Now assume we want to encrypt with multiple passphrases.  Perhaps not a highly compelling use case, but it’s conceptually simpler to start with than multiple recipient public keys, and it raises the same technical issues.  The approach Neal suggested of using Bloom filters is not far off, but we at least wouldn’t want to use *unencrypted* Bloom filters that might look distinguishable from random bits.

As a straw-man design suppose we just pick an upper-bound on the number, say N, of different passphrases with which we expect anyone might ever want to encrypt a file.  Then we lay out the encrypted file as follows: (a) a single, uniformly random password-salt value of fixed length comes first in the file, followed by (b) a hash table with room for not just one but N consecutive AEAD-encrypted session-key blobs of the same kind as for the single-password scheme above, and finally (c) the AEAD-encrypted file content.  Both the encryptor and decryptor use the salt to hash the user’s password, then hash the output of that to form both the AEAD-encryption key for the session-key blob and an index into the N-entry hash table.  The decryptor then simply attempts to AEAD-decrypt and check that particular hash-table entry - or to allow for rare hash collisions, attempts to decrypt (say) three consecutive hash-table entries starting at that position and wrapping around mod N.  If any of those AEAD-decryptions works, the decryptor wins; if not, the decryptor gives up and decides the passphrase doesn’t work.

Since each passphrase will produce a different (pseudo-random) hash-index, the session-key blobs for different passphrases will typically occupy different positions in the hash-table, thereby allowing decryption using any of the corresponding passphrases.  Hash collisions may occur, but if the number of actual passphrases is not too close to N (i.e., the hash table not too full), the encryptor should be able to squeeze in session-key blobs for all of them using the “slop” allowed by the 3-tries (or k-tries) search rule above, and/or by simply retrying everything from scratch with a fresh salt if that doesn’t work.  Once the hash table is laid out and filled with all useful session-key blobs, all remaining unused entries are just filled with uniform random bits, so anyone not holding one of the passphrases can’t distinguish “full” from “empty” hash-table entries.

Now the next step we’d like is to avoid having to pick a particular value of N: too large and we’re wasting a lot of space at the beginning of every file on a hash table that often will probably have only 1 entry populated; too small and we don’t allow for multiple session key blobs.  So instead of including just one hash table in the file, we include a variable-length sequence of hash tables whose sizes increase by powers of two.  Thus, immediately after the password-hashing salt, we encode a 1-entry hash table, followed by a 2-entry hash table, followed by a 4-entry hash table, etc., stopping wherever the encoder feels like stopping.  The encoder takes the passphrases it’s supposed to encrypt the file with, and successively creates an encrypted session-key blob for each passphrase, storing it at the appropriate hashed position (for that passphrase) - or within a 3-tries distance - in the lowest-numbered hash table that has room for it.  Then the encoder writes out to the encrypted file: (a) the common password-encryption salt, (b) all the *non-empty* hash tables starting from the size-1 table, with any empty entries filled with uniform random bits, and (c) the AEAD-encrypted file content, as usual.  Note that the encryptor can “lay out” the header and figure out how many hash tables are actually needed before it does the AEAD-encryption of the session key blobs, which means those session key blobs can still contain the file offset and length metadata pointing to the encrypted file content.

On the decryptor side, given the salt at the beginning of the file and the user’s passphrase, the decryptor now needs to generate a hash-index for each possible hash table and try decrypting a few consecutive entries per hash table.  Neither the decryptor, nor anyone else, initially knows the number of hash tables in the file.  But note that the decryptor only needs to do a single expensive password-hashing operation (with that one salt and the entered password).  Also, since the total number of hash tables is log2(N) where N is an upper-bound on the number of recipients, the total number of symmetric-key trial-decryptions the recipient needs to perform to see if the passphrase “works” is O(log N), regardless of the number of passphrases the file was actually encrypted for.  We might want to set some kind of upper bound on the maximum size that the header portion might be, of course, but that can probably be fairly large (e.g., 1MB) since it’s just an upper bound.  In the common case when a file is encrypted with only one passphrase, only the first, size-1 hash table is non-empty, and the encrypted file is just as compact as it would be in the “naive” base-case design above that only supports a single passphrase.
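The power-of-two table sequence can be sketched like this, abstracting away the crypto: the encoder places each blob in the lowest table (sizes 1, 2, 4, ...) with a free slot within the 3-tries rule of its hashed position, and the decryptor probes the same few slots per table.  A real encoder would store AEAD-encrypted session-key blobs rather than the bare string ids used here.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

const probes = 3 // the k-tries search rule, k=3

// slotFor derives the (pseudo-random) index for a blob in a table of
// the given size; each passphrase hashes to a different index.
func slotFor(id string, size int) int {
	h := sha256.Sum256([]byte(id))
	return int(binary.BigEndian.Uint64(h[:8]) % uint64(size))
}

// layout assigns each blob to a (table, slot) pair, using the lowest
// table with room; returns the placements and the number of tables used.
func layout(ids []string) (map[[2]int]string, int) {
	placed := map[[2]int]string{}
	numTables := 0
	for _, id := range ids {
		for t := 0; ; t++ { // table t has size 1<<t
			size := 1 << t
			base := slotFor(id, size)
			done := false
			for p := 0; p < probes && !done; p++ {
				key := [2]int{t, (base + p) % size}
				if _, used := placed[key]; !used {
					placed[key] = id
					done = true
				}
			}
			if done {
				if t+1 > numTables {
					numTables = t + 1
				}
				break
			}
		}
	}
	return placed, numTables
}

// find is the decryptor's side: O(log N) probes across the tables.
func find(id string, placed map[[2]int]string, numTables int) bool {
	for t := 0; t < numTables; t++ {
		size := 1 << t
		base := slotFor(id, size)
		for p := 0; p < probes; p++ {
			if placed[[2]int{t, (base + p) % size}] == id {
				return true
			}
		}
	}
	return false
}

func main() {
	placed, n := layout([]string{"alice", "bob", "carol", "dave"})
	fmt.Println(n, find("alice", placed, n), find("dave", placed, n))
}
```

With one passphrase only the size-1 table is used, matching the compact base case; the table count grows only logarithmically with the number of passphrases.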

—
3. Single public-key encryption scheme, multiple recipient public keys:

We now assume that we wish to encrypt a file so as to be decryptable by multiple receivers, the encoder has public keys for all those receivers, and (most importantly) all those receivers’ public keys are in the same group: e.g., they are all Ed25519 keys or all Ed448 keys.  This assumption is not very realistic for “legacy” DSA keys and not realistic at all for RSA keys, but let’s assume we’re willing to live with that restriction at least for “new” files encrypted this way.

Structurally, the public-key multiple-receivers scheme works exactly the same as above for passphrase encryption, but the encryptor (a) replaces the password-hashing salt at the beginning of the file with an Elligator-encoded ephemeral Diffie-Hellman public key, whose corresponding private key was freshly picked by the encryptor; (b) computes a shared Diffie-Hellman master secret using this ephemeral private key and each receiver’s public key (which as stated above we assume all to be in a common group); and (c) AEAD-encrypts all the session-key blobs in the hash tables using these Diffie-Hellman master secrets instead of password-based secrets.  Everything else about hash table layout and such works exactly the same way as discussed above for multiple passphrases.

For those not familiar with Elligator and its follow-on work (e.g., Elligator 2, Elligator Squared, Binary Elligator Squared), all are ways to encode an elliptic curve point usable for Diffie-Hellman key exchange, such that the encoding appears indistinguishable from uniform random bits.  Different schemes apply to different subsets of elliptic curves and impose different tradeoffs: e.g., the original scheme works only for certain curves (including Ed25519) and is very compact but may require the encoder to retry the generation of the ephemeral DH point several times (no big deal in practice); other later schemes are a bit less compact (e.g., require 2x the space for the representation) but less constrained and can encode any point on the curve rather than only about half the points, etc.  The details aren’t important for present purposes; only the fact that they exist and they work. :)

So because Elligator encodes the initial DH key in a uniform representation, and everything else in the file is either an AEAD-encrypted blob or just random bits, the whole file looks like uniform random bits to anyone not holding one of the recipients’ private keys.  A non-recipient can’t even tell whether the file is encrypted to just one recipient (the Elligator-encoded point followed by the size-1 hash table followed by the encrypted file content) or to many (the Elligator point followed by several hash tables).  The decryptor doesn’t know this either without trying, but the decryptor only needs to do the one public-key DH agreement calculation to compute its ephemeral secret shared with the encryptor, and then perform at most O(log N) AEAD trial decryptions to see if its key works near the appropriate indexes of any of the O(log N) hash tables.

—
4. Multiple symmetric-key and/or public-key schemes:

This is where things get a bit hairier, although the underlying principles remain pretty similar.  The problem is that we may now have multiple different, mutually incompatible encryption schemes: e.g., we might want to encrypt a file both with a passphrase *and* for several public-key recipients, some of whom have (say) Ed25519 keypairs while others have Ed448 keypairs and still others have NIST P-* keypairs.  We’ll assume (a) that all public-key schemes are in DH-compatible, Elligator-encodable groups of some kind, and (b) that we will administratively avoid needing to support a huge number of different schemes all at once.  (It feels like this assumption is compatible with where the CFRG and OpenPGP WGs are headed anyway.)

So as should be clear from the above by now, the “cornerstone” of each scheme is a salt value in the case of a passphrase scheme, and an Elligator-encoded point in the case of a public-key scheme.  Passphrase salts can just be unstructured random bits, so in that case one might imagine just starting the file with a single “common salt” large enough to seed all the passphrase-based schemes we might want to support (if we even want to support more than one in a given file, which might be unlikely).  But an Elligator-encoded point must be created rather carefully such that the encryptor knows the corresponding private key for DH agreement, so it’s not feasible to dump some random bits at the beginning of the file and expect it to be usable as multiple different Elligator points for multiple different curves: we really need to encode multiple distinct Elligator points for distinct curves so that they’re all independently decodable.  

Further compounding the challenge, we don’t want the decryptor to have to perform multiple expensive public-key operations (or multiple expensive password-hashing operations) per scheme.  We can’t avoid the decryptor having to do *one* expensive public-key or password-hashing operation per scheme that the user might hold a key for: e.g., if the user’s keychain holds both an Ed25519 private key and an Ed448 private key, we’ll have to do two Elligator point decodes and two DH key agreements, but we want to avoid having to do more than two.  (And for the common-case of users holding only one private key, we want the user to have to do only one DH key agreement.)

So to solve this, we’ll use a multiple-hash-table approach similar to the above to allow a file to contain multiple “cornerstone” values (password salts and/or Elligator points) at the beginning, but slightly modified to avoid the need for multiple trial decryptions.  For each scheme, we take that scheme’s cornerstone value (e.g., Elligator point), and first round its size up to the next power of two.  Next, for each scheme, we pick (via standardization) an upper bound on the total number of other *schemes* we want this scheme to be compatible with now and in the near future (i.e., number of other distinct cornerstone values we need to encode at the beginning of the file).  For each scheme, we then standardize a small number of possible positions for that scheme’s cornerstone value, perhaps chosen more-or-less randomly (but statically), one position per hash table within a small number of hash tables laid out in increasing power-of-two sizes exactly as earlier for session-key blobs.  We ensure, again via standardization, that among the set of schemes we want to be compatible with each other, each scheme has at least one possible position (perhaps the largest) for which its cornerstone can be encoded so as to be disjoint and non-overlapping with at least one possible position of every other scheme.

For example, suppose we want to support Ed25519, Ed448, and P-384, and may want to support other schemes later on whose recipient keys can be intermixed with keys from these schemes.  Let’s say we can Elligator-encode an Ed25519 point in 32 bytes (true), and an Ed448 point in 64 bytes (I’m not sure about this, but either that or a 128-byte encoding should be feasible).  We might pick 3 possible cornerstone-value positions for each of these schemes as follows:

Ed25519: offset 0, offset (32 or 64), and offset (96, 128, 160, or 192).
Ed448: offset 0, offset (64 or 128), and offset (192, 256, 320, or 384).

For each of the parenthesized alternatives, give the OpenPGP WG chairs a coin during some meeting and have them flip it to pick one of the two or four alternatives for the second and third possible position for each scheme. :)

Future schemes we might add will come with their own list of possible cornerstone-value positions, which can be of varying lengths, and similarly can overlap (though all of them should have 0 as the base-case offset) provided we maintain the invariant that each scheme we still care about has some unique position that doesn’t overlap with any other scheme we still care about.  We can also optimize the possible positions more carefully if we choose; the above example is just to keep things conceptually simple.

Now, when the encryptor is encrypting an OpenPGP file, it creates a cornerstone value suitable for each scheme, and picks a position for each scheme’s cornerstone value from the fixed list of available positions for that scheme, such that the actually-picked cornerstone value positions are as low as feasible but don’t overlap.  For example, if a file is encrypted with both Ed25519 and Ed448 receiver keys, the “best” set of positions from those above might be position 0 for the Ed448 key and position 64 for the Ed25519 (other choices would also work but be a bit less space-efficient).  But now the encryptor needs to encode these cornerstone values so that the decryptor can get exactly the cornerstone value it needs, on the “first try”, without actually knowing at which possible position for a given scheme it was encoded.

So here’s the trick: we use a variant of DC-nets or PIR-type encoding techniques to “anonymize” the true position at which a given cornerstone value is encoded.  In particular, the encoder arranges the file such that for each scheme, the decoder reads *all* of the possible cornerstone value positions for that scheme, and XORs them together, to reconstruct the “actual” cornerstone value for that scheme.  That is, to get the Elligator-encoded Ed25519 point, the decoder will read the 32 bytes starting at each of the three positions defined for Ed25519, then XOR these 32-byte sequences together, and use that as the Ed25519 point.  Similarly for the other schemes.

The important point is that as long as the encoder picks some “workable” order in which to encode all the cornerstone values, it can treat previously-encoded cornerstone values in one scheme as “noise” to be canceled (via XOR) when encoding cornerstone values for other schemes.  For example, suppose the encryptor encodes the Ed448 point first at file offset 0.  The encoder fills the second and third possible positions for Ed448 with random bits, then XORs those alternative positions with the actual Ed448 cornerstone value and writes the result at offset 0.  Now the byte-ranges corresponding to those positions are fixed and no longer changeable, but can be included as “noise” to be canceled in writing other cornerstone values.  For example, say now the encryptor wants to encode the Ed25519 point: it obviously can’t put it at position 0, but hopefully the second or third alternative position for Ed25519 should be non-overlapping with the already-locked Ed448 point positions.  Say the second Ed25519 position is still available: the encryptor fills the third position with random bits (and the first, offset-0 position has already been filled with random-looking bits), then XORs the contents of the first and third Ed25519 positions and the true cornerstone value to fill the second Ed25519 position.

OK, so it might look like we’ll always need to reserve a fairly large chunk of header space for these cornerstone values (e.g., ~448 bytes in the above example depending on the precise choices of the cornerstone value positions).  But it turns out that’s not true, provided we (a) first reserve space for only *one* suitably-chosen cornerstone-encoding position per scheme, (b) then lay out all the hash tables containing symmetric-key blobs for all the schemes, (c) then lay out and encrypt the file’s contents (or its first substantial-size chunk of data in the streaming case), (d) encrypt all the session-key blobs of all schemes into appropriate, already-reserved positions in those schemes’ respective hash tables, (e) fill all remaining unreserved space in this whole variable-length header region with random bits, and finally (f) encode the cornerstone values at the very end.  This way, some of the possible positions for the cornerstone values of each scheme can actually overlap with symmetric-key blobs generated by the same or other schemes, since all those random-looking bits will just be treated as pre-existing noise that gets XORed together with the “true” cornerstone value to be filled into the “last position” for the corresponding scheme.

Going back to the question of the hash tables for the session-key blobs, note that although we now have multiple hash tables of different sizes for several different schemes, all those hash tables can actually overlap, provided the encoder handles its layout properly.  For example, each scheme’s smallest, size-1 hash table might always be defined to start immediately after the offset-0 position for that scheme’s cornerstone value, and successive hash tables grow from there.  As the encryptor is choosing positions for session-key blobs in any schemes, it basically just looks for a byte-range in one of that scheme’s hash tables that hasn’t yet been allocated by that or any other scheme.

Thus, in the very likely common case of a file being encrypted to only one recipient under only one scheme (e.g., Ed25519), the file can always be “as small as possible” and consist of simply (a) the Elligator-encoded Ed25519 cornerstone point for DH agreement, (b) the single session-key blob encrypted using the DH key shared between the encryptor’s cornerstone keypair and the recipient’s keypair, and (c) the variable-length encrypted file content itself, all back-to-back in one uniformly-random-looking blob.  If a few but not many different recipients are to be included in one or a few schemes, the incurred header overhead grows gracefully with the number of recipients, and no one without a key can tell how many recipients the file is actually encrypted for.  And the recipient only needs to do at most one public-key (or password-hashing) operation per scheme, and O(log N) AEAD trial decryptions per scheme for which the recipient holds a key.

As a potentially nice little bonus, with this general scheme it’s entirely possible for an encoder to create a file such that different passphrases or recipient keypairs actually lead to, and enable access to, *different* file data chunks, effectively giving different receivers the ability to “open” different parts - or different subsets - of the encrypted data.  And one receiver will not necessarily be able to tell either how many other receivers there are or how much “other” data might be hiding in the file, beyond some upper bound.  Kind of like Truecrypt partitions, but with any number of hidden partitions scattered anywhere you like in the volume with different passphrases. :)  This kind of thing may well reasonably be out-of-scope for what we want OpenPGP to do, but it’s interesting to think about.

Cheers
Bryan