[MLS] MLS in decentralised environments

Matthew Hodgson <matthew@matrix.org> Tue, 03 April 2018 09:29 UTC

To: mls@ietf.org
From: Matthew Hodgson <matthew@matrix.org>
Organization: Matrix.org Foundation
Message-ID: <6745e49d-9826-ac74-03b6-e6adbde7e805@matrix.org>
Date: Tue, 03 Apr 2018 10:29:06 +0100
Archived-At: <https://mailarchive.ietf.org/arch/msg/mls/MnLJkbJ_Mwe8Oz0Ll6delGJLPz4>

Hi all,

[TL;DR: MLS doesn't seem to support decentralisation.  Can we fix that, 
especially given it'll help solve other problems too?  I have a 
handwavey proposal.]

Since IETF101 I've been trying to work out how well MLS could be applied 
in fully decentralised environments which lack any single controlling 
server, as it seems that the current proposal effectively rules this out 
due to the state sequencing requirements (thanks to Ben Schwartz for 
spelling this out to me after the BOF!).

This feels like it could be quite a large oversight, given there are 
many real-time communication protocols and services (e.g. Matrix[1], 
Tox[2], Briar[3], Secure Scuttlebutt[4], Whisper[5], PSS[6], psyc2[7], 
XMPP FMUCs[8] (and MIX?), even NNTP and SMTP) where messages are 
replicated over a network of peers without any single controlling server 
- all of which could benefit from the well-specified, interoperable & 
scalable group e2e encryption that MLS promises!  There's also a more 
ideological argument that interoperable communication is such a 
fundamental right that the IETF should support services which support 
communication without a necessary central logical point of control (much 
as the internet itself is decentralised).

More practically, there may be some benefit to considering 
decentralisation in MLS in general: for instance, general solutions to 
races and key synchronisation within a decentralised network could also 
solve races in a centralised deployment.  It could also improve 
scalability and geo-redundancy by avoiding the need for an atomic 
ordering system for state changes.

To give some concrete context: all conversations in Matrix are expressed 
as Merkle DAGs of messages which are replicated over the participating 
servers using eventual consistency semantics (a bit like a full mesh of 
Git repositories all constantly pushing commits to one another).  As a 
result the conversation DAG often forks: either due to races, 
partitioned networks, disconnected or offline servers etc - but these 
temporary forks are very much a desirable feature, allowing partially 
disconnected operation (e.g. letting a site continue communicating 
locally even when isolated from the wider network), and aiding 
scalability (no need for global locks or sequencing).
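
To illustrate the forking behaviour described above, here is a minimal 
Python sketch of a hash-linked message DAG.  It is not Matrix's actual 
event format or hashing scheme: the names event_id, ConversationDag and 
heads are hypothetical and exist only for illustration.  A fork shows 
up as more than one "head" (an event that nothing else references yet), 
and a later event naming both heads as parents merges the fork.

```python
import hashlib
import json


def event_id(body, parents):
    """Content-address an event by hashing its body and its parents' IDs."""
    payload = json.dumps({"body": body, "parents": sorted(parents)}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ConversationDag:
    """Toy append-only DAG; a stand-in for a replicated conversation."""

    def __init__(self):
        self.parents = {}  # event ID -> sorted list of parent event IDs

    def append(self, body, parents):
        eid = event_id(body, parents)
        self.parents[eid] = sorted(parents)
        return eid

    def heads(self):
        """Events that no other event references as a parent."""
        referenced = {p for ps in self.parents.values() for p in ps}
        return sorted(set(self.parents) - referenced)


dag = ConversationDag()
root = dag.append("A creates the room", [])
left = dag.append("message sent via server 1", [root])
right = dag.append("message sent via server 2", [root])
forked = len(dag.heads()) == 2  # two heads: the conversation has forked

merge = dag.append("message sent after the partition heals", [left, right])
healed = dag.heads() == [merge]  # a single head again
```

Neither side needs a lock or a sequencer to append; the fork is simply 
recorded and resolved by whichever event next references both heads.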

Currently Matrix uses Olm[9] (a Double Ratchet Algorithm implementation) 
to maintain a full mesh of secure 1:1 channels between devices, and then 
shares a group ratchet (Megolm[10]) per sender over these channels.  The 
Megolm ratchet is a simple hash ratchet which advances with every message, 
and is replaced after a set number of messages (or when group membership 
changes).  The cost of replacement is O(N) in the size of the group, as is the 
cost of adding users to the group, which obviously makes ART & MLS’s 
O(log N) behaviour appealing.
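
As a rough illustration of the idea (and explicitly not Megolm's actual 
construction, which is specified in [10]), a per-sender hash ratchet and 
its O(N) distribution cost might be sketched in Python as below.  The 
names HashRatchet and distribute, the domain-separation labels and the 
use of plain SHA-256 are all assumptions made for this example.

```python
import hashlib


class HashRatchet:
    """Toy one-way ratchet: each step replaces the state with a hash of itself."""

    def __init__(self, seed: bytes):
        self.state = hashlib.sha256(b"init" + seed).digest()
        self.index = 0

    def next_message_key(self) -> bytes:
        # Derive a key for this message, then advance the state one way,
        # so that earlier states cannot be recomputed from later ones.
        key = hashlib.sha256(b"key" + self.state).digest()
        self.state = hashlib.sha256(b"step" + self.state).digest()
        self.index += 1
        return key


def distribute(seed: bytes, member_channels: list) -> int:
    """Share a fresh ratchet seed with every member: one send each, so O(N)."""
    sends = 0
    for channel in member_channels:
        channel.append(seed)  # stands in for an encrypted send over a 1:1 channel
        sends += 1
    return sends
```

A sender and a receiver that start from the same seed derive the same 
sequence of message keys, and replacing the ratchet means calling 
distribute again for every device in the group.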

However, the eventual consistency semantics of a decentralised protocol 
introduce challenges for E2E encryption: for instance, the membership of 
a given room is not well defined, as there may be a partition of the 
room (due to either a race or a netsplit) which includes devices or users 
that a sender is not yet aware of.  This mandates a way of 
retrospectively syncing message keys between devices after such a fork: 
deliberately prioritising UX (ensuring messages that users expect to be 
able to decrypt can be decrypted) over forward secrecy.  The same 
mechanism can be used for cross-device history sync or sharing history 
from before users join a group.

In Matrix we solve this by letting devices share group ratchet key state 
between each other over Olm by making so-called 'keyshare requests'. 
Devices must have explicitly verified and trusted the identity of the 
requesting device before they share keys with it (and currently, keyshare 
is only supported between a given user's devices - in future we could 
also support keyshare between all the group's devices if the requester 
is verified and can provide a proof of permission to view the requested 
content).  This obviously puts a large onus on the device verification 
mechanism to ensure an attacker isn't able to exfiltrate keys, but our 
experience has been that the improved UX is worth the risk (plus it can 
always be disabled if needed, e.g. in a single-server centralised 
deployment).
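
The current policy described above (share only with verified devices 
belonging to the same user) could be sketched in Python roughly as 
follows.  This is not the Matrix client API: Device, should_share_keys 
and the verified set are hypothetical names for illustration, and the 
sketch leaves out the future "proof of permission" case entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    user_id: str
    device_id: str


def should_share_keys(owner: Device, requester: Device, verified: set) -> bool:
    """Answer a keyshare request only for verified devices of the same user."""
    if requester not in verified:
        return False  # never share with a device we have not explicitly verified
    return requester.user_id == owner.user_id


laptop = Device("@alice:example.org", "LAPTOP")
phone = Device("@alice:example.org", "PHONE")
stranger = Device("@mallory:example.org", "PHONE")

share_with_phone = should_share_keys(laptop, phone, verified={phone})
share_with_stranger = should_share_keys(laptop, stranger, verified={phone, stranger})
```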

So, how can we support something like this in MLS?

There seem to be two main obstacles: 1) the requirement for strict 
sequencing of state changes, 2) the lack of keyshare semantics to 
recover from the missing key data which is inevitable in an eventually 
consistent view of a room.  I'm going to ignore the second for now as it 
can be fixed out of band (although much like attachments, it feels like 
something which MLS should make /some/ recommendation on, given it can 
be incredibly useful as a primitive).

However, for state sequencing: am I right in saying that a race between 
B and C joining a group can cause one client to see a DH binary tree 
with frontier (AB, C) while another sees (AC, B), so that the two have 
inconsistent root group keys, and messages encrypted during the 
partition will inevitably be undecryptable by the other side?
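
To make the question concrete, here is a toy Python sketch of why the 
tree shape matters.  It uses a plain hash of the two child secrets as a 
stand-in for the DH step in ART, so it is not the real key derivation; 
combine, root_key and the leaf secrets are all made up for the example. 
The point is only that the root depends on how the leaves are paired.

```python
import hashlib


def combine(left: bytes, right: bytes) -> bytes:
    """Stand-in for deriving a parent secret from its two children."""
    return hashlib.sha256(left + right).digest()


def root_key(tree, leaf_secrets) -> bytes:
    """Fold a nested-tuple tree such as (("A", "B"), "C") into one root secret."""
    if isinstance(tree, str):
        return leaf_secrets[tree]
    left, right = tree
    return combine(root_key(left, leaf_secrets), root_key(right, leaf_secrets))


leaf_secrets = {name: hashlib.sha256(name.encode("utf-8")).digest() for name in "ABC"}

view_1 = (("A", "B"), "C")  # this client saw B join first
view_2 = (("A", "C"), "B")  # this client saw C join first

roots_agree = root_key(view_1, leaf_secrets) == root_key(view_2, leaf_secrets)
```

Here roots_agree is False: the same three members, paired differently, 
end up with different root secrets.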

Is there a way in which MLS could allow a group to recover from a partition 
like this by (partially?) rebalancing the DH binary tree into a 
canonical form once the partition heals?  Thus if a partition was 
detected at the application layer (in Matrix's case, we'd do this by 
noting that the B-join and C-join events share the same A-join parent), 
the servers participating in the room would rebalance (AB,C) and (AC,B) 
to both be (AB,C) and then the conversation could proceed as normal. 
One might be able to avoid rebalancing the whole tree to minimise CPU 
impact (although in practice these races are fairly rare edge 
cases).  Obviously this assumes that one has a way to request keys to 
recover the messages lost when the groups were out of sync.
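
A handwavey Python sketch of the detection and canonicalisation steps 
is below.  It assumes join events are available as a mapping from event 
to parent event, and it picks a canonical order by simply sorting the 
racing members by name; that tie-break rule, the left-deep tree shape 
and the names racing_joins and canonical_tree are assumptions of this 
example, not anything specified by MLS or Matrix.

```python
from collections import defaultdict


def racing_joins(join_parent: dict) -> dict:
    """Group join events by parent; a parent with several children is a fork."""
    children = defaultdict(list)
    for event, parent in join_parent.items():
        if parent is not None:
            children[parent].append(event)
    return {p: sorted(c) for p, c in children.items() if len(c) > 1}


def canonical_tree(creator: str, racing_members: list):
    """Rebuild the tree in a deterministic order, whatever order was observed."""
    tree = creator
    for member in sorted(racing_members):  # assumed tie-break: sort by name
        tree = (tree, member)
    return tree


joins = {"A-join": None, "B-join": "A-join", "C-join": "A-join"}
forks = racing_joins(joins)  # {"A-join": ["B-join", "C-join"]}

# Each side feeds in the members in the order it happened to see them,
# and both arrive at the same shape: (("A", "B"), "C").
side_1 = canonical_tree("A", ["B", "C"])
side_2 = canonical_tree("A", ["C", "B"])
```

Because both sides compute the same tree from the same set of racing 
members, they converge without needing a sequencer to order the joins.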

If this mechanism makes sense, it seems that it could provide a way to 
eliminate the "Sequencing of State Changes" requirement entirely from 
MLS - or at least provide an alternative for folks where either 
server-side or client-side strict sequencing isn't an option, and so 
solve the general 'how to handle races' problem mentioned at the BOF 
(whilst also necessitating an interoperable solution to history/key 
sharing, similar to the earlier “Use cases for avoiding forward secrecy” 
thread[11]).

Anyway, apologies for the stream of consciousness - feedback from those 
who properly understand ART & MLS would be hugely appreciated :)

thanks,

Matthew

[1] https://matrix.org/docs/spec
[2] https://toktok.ltd/spec.html#introduction
[3] https://code.briarproject.org/akwizgran/briar-spec/blob/master/protocols/BTP.md
[4] https://ssbc.github.io/scuttlebutt-protocol-guide/
[5] https://github.com/ethereum/go-ethereum/wiki/Whisper
[6] https://gist.github.com/zelig/d52dab6a4509125f842bbd0dce1e9440
[7] https://gnunet.org/sites/default/files/gnunet-psyc.pdf
[8] https://xmpp.org/extensions/xep-0289.html
[9] https://git.matrix.org/git/olm/about/docs/olm.rst
[10] https://git.matrix.org/git/olm/about/docs/megolm.rst
[11] https://mailarchive.ietf.org/arch/msg/mls/b5YoQfdeFcoLYrFdxbZmdX__jWA

P.S. This has probably already been fixed, but the [signal] footnote on 
https://tools.ietf.org/html/draft-barnes-mls-protocol-00 is credited to 
"T. and M. Marlinspike" - I think you are missing a 'Perrin'?


-- 
Matthew Hodgson
Matrix.org
