Re: [TLS] About encrypting SNI - Traffic Analysis Attacks?

Tom Ritter <tom@ritter.vg> Thu, 15 May 2014 03:31 UTC

From: Tom Ritter <tom@ritter.vg>
Date: Wed, 14 May 2014 23:31:04 -0400
Message-ID: <CA+cU71nD89Wos7WLBggvo69DWfDKdCOX5_N9wFh3jUFP4an8yg@mail.gmail.com>
To: Michael StJohns <msj@nthpermutation.com>
Archived-At: http://mailarchive.ietf.org/arch/msg/tls/RgasPjSIZ4NH4khPhH5cPQmHA74
Cc: "tls@ietf.org" <tls@ietf.org>
Subject: Re: [TLS] About encrypting SNI - Traffic Analysis Attacks?

I have a lot of points on Encrypted SNI, so I'm looking forward to
tomorrow, but I wanted to start putting some into email.  I'd like to
start with the notion that HTTPS traffic fingerprinting makes
encrypting the SNI useless.

One of the best surveys of the literature is Roger Dingledine's blog
post at https://blog.torproject.org/blog/critique-website-traffic-fingerprinting-attacks.
(It does not include the most recent paper, "I know why you went to
the Clinic" by Miller et al.) There are lots of scenarios for HTTPS
fingerprinting, and it's important to note that Roger concerns himself
with Tor's use case: the adversary is trying to do fingerprinting when
the client is potentially accessing the entire Internet.  Some of the
other scenarios considered in the literature are a) determining which
webPAGE is visited within a specific webSITE and b) determining which
webPAGE you are visiting among many webPAGES on many (but finite)
webSITES.

An argument that Encrypted SNI or an encrypted Server Certificate
(taken together, an Encrypted Handshake) is not useful is absolutely
correct if there is only one webSITE hosted on an IP address. The IP
gives it away, regardless of DNS privacy, an Encrypted Handshake, etc.
(so long as the attacker can index the site, and we assume they can).
So I'd like to instead talk about the case where there are multiple
webSITEs hosted at an IP, and an attacker wishes to determine which
webSITE in a set of webSITEs the user is visiting - because that is
the metadata that an Encrypted Handshake protects.

Unfortunately, there are no studies that attempt this exact scenario,
of course.  But we can still reason through what we do know.

First off, because we assume that the attacker knows all the websites
hosted on an IP, this is a 'closed world' study - an advantage to the
attacker, since closed-world attacks are much easier.  However, even a
'closed world' is not truly closed. For a true closed world, an
attacker must be able to train their classification engine on all the
pages in the site.  Dynamic content that changes with authentication,
and just general additional data added to the page since
classification (think comments, posts, etc.), make this impossible.
None of the studies attempt to deal with this problem. The attacker's
advantage is not taken away, but it is ever so slightly reduced. There
are other things that make the closed world less closed: client
behavior.  Caching, ad blocking, plugin click-to-play, third-party
cookies enabled/disabled, ads, JavaScript disabled, and so on.  The
advantage is reduced a little bit again, although Miller attempts to
address the caching one.
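
To make the "train on everything" step concrete, here is a minimal
sketch (in Python, using scikit-learn) of the kind of closed-world
classifier these studies build. The packet-length histogram features,
the bucket size, and the tiny training set are all invented for
illustration, not taken from any particular paper:

    # Hypothetical sketch of a closed-world website-fingerprinting
    # classifier; the features and training data are made up.
    from sklearn.naive_bayes import MultinomialNB

    BUCKET = 64  # group observed ciphertext lengths into 64-byte buckets

    def features(packet_lengths, max_len=16384):
        """Turn a trace (list of record/packet lengths) into a bucketed
        length histogram."""
        hist = [0] * (max_len // BUCKET + 1)
        for n in packet_lengths:
            hist[min(n, max_len) // BUCKET] += 1
        return hist

    # The attacker's training crawl: (label, trace) pairs for every page
    # they believe is reachable behind the target IP -- the "closed world".
    training = [
        ("siteA/index", [1500, 1500, 620, 310]),
        ("siteA/about", [1500, 980, 310]),
        ("siteB/index", [1500, 1500, 1500, 1200, 90]),
    ]

    clf = MultinomialNB()
    clf.fit([features(t) for _, t in training],
            [label for label, _ in training])

    # A captured victim trace is then matched against that closed set; any
    # drift (dynamic content, caching, ad blockers) degrades the match.
    victim_trace = [1500, 1500, 640, 310]
    print(clf.predict([features(victim_trace)]))

The point of the caveats above is that the training crawl is never
actually complete, and the victim's real traffic drifts away from it.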

Next off: False Positives Matter. A Lot. I'm basically just going to
point you to the same section in Roger's blog post. Summed up and
applied to us, it says that if your goal is to determine whether a
user is visiting Site A on IP X instead of Site B, the false positives
keep adding up on each page load. You're not going to get a binary
answer; you're more likely to get an answer that looks like "Well, we
had 8 hits for SiteA, 10 for SiteB, 3 for SiteC".  Advantage to the
Defender.
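
A rough back-of-the-envelope calculation shows the effect (all of the
rates and page counts below are made-up numbers, just to illustrate
the shape of the problem):

    # Illustrative only: how per-page false positives pile up. The true
    # positive rate, false positive rate, and page counts are assumptions.
    tpr = 0.90            # chance a real Site A page load is flagged "Site A"
    fpr = 0.05            # chance a non-Site-A page load is flagged "Site A"
    site_a_loads = 10     # pages the user actually loads on Site A
    other_loads = 190     # other page loads on the same IP

    expected_true_hits = tpr * site_a_loads    # ~9
    expected_false_hits = fpr * other_loads    # ~9.5

    print(f"true hits:  {expected_true_hits:.1f}")
    print(f"false hits: {expected_false_hits:.1f}")
    # With these made-up numbers the attacker sees about as many false
    # "Site A" hits as real ones -- the noisy "8 vs 10 vs 3" tally
    # described above, not a binary answer.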

Finally, there are defenses a website concerned about this can
employ.  There are a number of different padding mechanisms for
defenders to deploy, collected from many papers and outlined in
Section 7 of the Miller paper. All of these defenses can be deployed
unilaterally by the server (no client changes needed), inside the TLS
protocol, given the simple ability to insert random padding that the
client ignores. I've seen this topic come up before, and I hope people
are not opposed to this optional feature being present.
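
As a concrete illustration of the kind of unilateral, server-side
defense meant here, a minimal sketch of length bucketing follows: the
server rounds each response up to a bucket boundary with padding the
client simply discards. The bucket sizes are arbitrary assumptions,
and this is just one simple scheme of that general flavor, not
necessarily one of the specific mechanisms from the paper:

    # Minimal sketch of server-side length bucketing. The bucket
    # boundaries are arbitrary assumptions; a real deployment would tune
    # them to its own content.
    BUCKETS = [1024, 4096, 16384, 65536]

    def padded_length(n: int) -> int:
        """Round a response length up to the next bucket boundary."""
        for b in BUCKETS:
            if n <= b:
                return b
        top = BUCKETS[-1]
        return ((n + top - 1) // top) * top   # very large: multiple of top

    def pad(response: bytes) -> bytes:
        """Append padding the peer ignores, hiding the true length."""
        return response + b"\x00" * (padded_length(len(response)) - len(response))

    print(padded_length(700))     # -> 1024
    print(padded_length(5000))    # -> 16384
    print(len(pad(b"x" * 700)))   # -> 1024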



So at this point I'd like to talk about the most recent paper, "I know
why you went to the Clinic" by Miller et al.  At a high level, they
attempt to identify which webPAGE you are viewing inside a known
webSITE. That is different from our goal, but we're not speaking
different languages. I'm going to liberally take excerpts from it that
in the real world will reduce their impressive claim of 89% accuracy.
 - They don't operate on whole sites; they took a subset of the site
(500 'labels', i.e. unique pages), followed redirections, and ran it
off that
 - They didn't operate on single loads; they visited 75 pages in a
session and collected each page 16 times.
 - They discover and remark that "caching significantly decreases the
number of unique packet sizes observed for samples of a given label"
and that "A reduction in the number of unique packet sizes reduces the
number of non-zero features and creates difficulty in distinguishing
samples." - said another way: caching makes it harder
 - They also find that visiting webPAGEs on a webSITE (as opposed to
different webSITEs) causes decreased traffic volume, which also makes
it harder.  This is our scenario.
 - Their accuracy went from 72% to 89% by using a Hidden Markov Model
(a rough sketch of that idea is included below). While this is not
bad, strictly speaking it assumes that the user follows the link
structure of the site strictly: no bookmarks.  They also don't factor
in the possibility of the 'Back' button (although I suspect most
websites that link A->B also link B->A, so it's probably accounted for
most of the time).
 - They used 500 labels for their subset; the 4 sites whose
redirections ballooned that up (to ~1000 labels) had the worst
accuracy. Logical, but nice to see confirmation that the larger the
site, the worse the accuracy.
 - They do not factor in browser differences, OS differences, or user
configuration of the browser
 - They present a defense that takes their own accuracy from 89% to
27%, with only 9% traffic overhead

Perhaps most unrealistic: they assume users browse with a single tab
and easily delineated page requests.  No mashing the refresh button,
no multiple tabs, no opening links in a background tab, no background
tab that's doing AJAX polling, and so on.
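
To make the Hidden Markov Model point above concrete, here is a rough
sketch of the idea (the pages, the link-graph probabilities, and the
per-load classifier scores are all invented): per-page guesses get
smoothed by a transition model that only follows the site's link
structure, which is exactly why bookmarks, extra tabs, and other jumps
outside that structure hurt it:

    # Rough sketch of smoothing per-page classifier guesses with a Markov
    # model of a site's link structure (Viterbi decoding). The pages,
    # probabilities, and scores are invented for illustration.
    PAGES = ["home", "pricing", "clinic"]

    START = {"home": 0.8, "pricing": 0.1, "clinic": 0.1}

    # transition[a][b]: probability of following a link from page a to b.
    # Bookmarks, new tabs, and 'Back' jumps violate this structure.
    transition = {
        "home":    {"home": 0.2, "pricing": 0.5, "clinic": 0.3},
        "pricing": {"home": 0.6, "pricing": 0.2, "clinic": 0.2},
        "clinic":  {"home": 0.7, "pricing": 0.2, "clinic": 0.1},
    }

    # Per-page-load scores from the traffic classifier alone.
    observations = [
        {"home": 0.7, "pricing": 0.2, "clinic": 0.1},
        {"home": 0.3, "pricing": 0.3, "clinic": 0.4},
        {"home": 0.1, "pricing": 0.2, "clinic": 0.7},
    ]

    def viterbi(obs):
        """Most likely page sequence given scores plus link structure."""
        paths = {p: (START[p] * obs[0][p], [p]) for p in PAGES}
        for scores in obs[1:]:
            new_paths = {}
            for b in PAGES:
                prob, path = max(
                    (paths[a][0] * transition[a][b] * scores[b], paths[a][1])
                    for a in PAGES
                )
                new_paths[b] = (prob, path + [b])
            paths = new_paths
        return max(paths.values())[1]

    print(viterbi(observations))  # -> ['home', 'pricing', 'clinic']

The model buys accuracy only to the extent the user cooperates with
the link graph, which is the caveat noted above.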



There are some very sexy presentations on HTTPS Traffic
Fingerprinting, probably the sexiest of which is watching you load
Google Maps tiles over HTTPS and detecting where you're zooming in.
But that presentation, and the papers, all assume a very unrealistic
operating environment.  That is not to denigrate them; I think they're
generally excellent work and foundations for further research.  But
you can't point to them and say the situation is hopeless.  At this
point, I'm actually pretty optimistic that attacks in the real world
will be quite difficult.

Trying to defeat an Encrypted Handshake using HTTPS traffic analysis
has a lot of things going against it:
1) The adversary's determination of Site A vs Site B is made
significantly more difficult by cumulative false positives
2) The greater the number of sites hosted at an address, the more
difficult their job becomes
3) The attacker has to do a considerable amount of work to train their
classifier for the specific browser preferences that match the user
they're attacking
4) And to factor in the current cache state of the user
5) A site can actively try to defend against this, and we have
indications such defenses would be very effective

To put a finer point on active defenses: I think a lot of people
assume that
would never happen, but Twitter is currently padding profile images to
resist traffic analysis. In the future, I expect more sites will start
to care, and more people will ask them to.

One of the things I think would be awesome is if a research project
were done on this exact problem.  I think that a CDN could provide
accurate numbers for the characteristics and number of sites hosted
on an IP address, which would be much better than arbitrarily chosen
numbers. However, negative results (that is, results that say "We
couldn't get good accuracy") often are not published, so we'd want to
see those too ;)

-tom