[privacydir] Fwd: Review of draft-ietf-ipfix-anon

Sean Turner <turners@ieca.com> Wed, 05 January 2011 14:03 UTC

Message-ID: <4D247AA3.1060500@ieca.com>
Date: Wed, 05 Jan 2011 09:05:23 -0500
From: Sean Turner <turners@ieca.com>
To: privacydir@ietf.org
Subject: [privacydir] Fwd: Review of draft-ietf-ipfix-anon

Here's Nick's review.

spt

-------- Original Message --------
Subject: Review of draft-ietf-ipfix-anon
Date: Thu, 30 Dec 2010 14:02:55 -0500
From: Nick Mathewson <nickm@freehaven.net>
To: Sean Turner <turners@ieca.com>

My expertise here is on anonymity and privacy enhancing technologies
in general, and not in IPFIX, so please take my observations with
the appropriate amount of skepticism.  Also, I've not had much prior
experience reviewing IETF drafts, so please forgive any suggestions
I make that are out-of-scope for this document.

A couple of minor notes on terminology:

   - The discussion and definition of 'anonymization' here may be the
     one most often used when referring to data flows, but there are
     larger bodies of work here that it might be a good idea to
     harmonize with.  The idea of deleting fields or aggregating
     records in a data set is more closely associated with the topic
     of inference control (see also work on "k-anonymity").  For
     these, the Inference Control subsection of Ross Anderson's
     "Security Engineering" is a reasonably good introduction.  The
     idea of anonymity in general seems more closely tied to notions
     of untraceability and unlinkability; for those see Pfitzmann and
     Hansen "A terminology for talking about privacy by data
     minimization".  It's no disaster (and certainly not
     unprecedented!) to have the meaning of "anonymize" be
     context-dependent, but it _is_ important to say what property
     you actually mean to achieve thereby.  See also the notes below
     on section 3.

   - The authors should decide whether they're going to use American
     or British spellings.  The draft uses the American spellings for
     categorize, organize, minimize, behavior, and so on, but
     unaccountably uses the British "-ise" in anonymise and
     pseudonymise.  In the research literature, and in the other
     relevant RFCs, "anonymize" seems to be more popular, but either
     spelling is fine so long as it's consistent.

Sec 1:

It might be wise to repeat here (or even in the abstract) the note
from the Security Considerations section that this draft is only
meant to explain how to interchange anonymized data, not to provide
any recommendations as to which anonymization techniques to use, or
even any guarantee that any particular technique achieves any
particular purpose.  Otherwise, it is easy to misread some parts of
section 4 as promising that particular techniques will prevent
particular attacks, which is not in fact the case for reasonable
threat models.

It would also be useful to note that this draft _only_ considers the
kind of inference control you can achieve through transformation of
individual flow records.  Aggregation-based anonymization is not
considered --and not even included in this draft's definition of
anonymization!-- even though it provides more robust privacy
results.

Sec 3: notes on goals

So it seems that the goal of anonymization, as stated here, is to
prevent IP flow data from being traced to the networks, hosts, or
users that participated in it.  But this is only one possible goal of
these techniques.

Other uses of the anonymization techniques in this draft include:

    - One-way untraceability.  Under some circumstances, it is fine
      to identify the originator/recipient of a given flow, but not
      both.  For example, it might be fine to identify users so long
      as the services they use can't be inferred, and it might be
      fine to identify servers so long as you can't tell who their
      users are.

    - Resistance to partner profiling.  Even if no particular flow can
      be linked to a particular entity, it might be undesirable for
      the set of flows as a whole to be useful for statistically
      inferring certain properties of networks, hosts, or users.  For
      example, even if an attacker can trace no specific flow to
      users Alice and Bob with confidence greater than 0.01%, if they
      could nevertheless infer that Alice and Bob communicate
      regularly with P=99%, Alice and Bob would reasonably consider
      their privacy to have been compromised.

      For a more rigorous treatment of an attack that achieves
      profiling without tracing specific interactions, see Danezis's
      Statistical Disclosure attack; a toy simulation of the general
      idea follows this list.

    - Non-observability.  It might be undesirable for a flow or set
      of flows to confirm that a particular entity was in fact
      present or absent at a given time.
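
As a concrete illustration of the profiling point above, here is a
minimal statistical-disclosure-style simulation in Python (all names,
sizes, and probabilities are invented for the example; this is a
sketch of the idea, not of Danezis's actual analysis):

    import random
    from collections import Counter

    random.seed(1)
    USERS, BATCH, ROUNDS = 100, 10, 2000
    sent_rounds, idle_rounds = Counter(), Counter()
    n_sent = n_idle = 0

    for _ in range(ROUNDS):
        alice_sends = random.random() < 0.5
        # A batch of recipients observed leaving the system: cover
        # traffic goes to uniformly random users; when Alice sends,
        # one message in the batch goes to Bob (user 7).
        batch = [random.randrange(USERS) for _ in range(BATCH - 1)]
        if alice_sends:
            batch.append(7)
            sent_rounds.update(batch); n_sent += 1
        else:
            batch.append(random.randrange(USERS))
            idle_rounds.update(batch); n_idle += 1

    # Bob's per-round rate is elevated only in rounds where Alice
    # sent, so he tops the list even though no individual message
    # was ever traced to Alice.
    scores = {u: sent_rounds[u] / n_sent - idle_rounds[u] / n_idle
              for u in range(USERS)}
    print(max(scores, key=scores.get))   # prints 7 (Bob)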

Some related desirable properties include:

    - Non-linkability.  It might be undesirable for two flows
      generated by the same entity to be linked to one another.
      Linkability between flows is a strong amplifier for traceability
      attacks: if through mischance, misdesign, or external
      knowledge, an adversary manages to trace a single flow to one
      of its entities, then linkability between flows means that
      _all_ of that entity's flows are now traced.

      Website Fingerprinting ("Fingerprinting Websites Using Traffic
      Analysis", Andrew Hintz, 2002) is an example of a more subtle
      attack enabled by flow linkability.

    - Correlation resistance.  It might be undesirable for
      anonymized flows processed by different IPFIX installations to
      be correlated to one another.  For example, suppose that the
      same flow is anonymized one way as it travels through network
      A, and another way as it travels through network B.  It might
      be the case that neither anonymized flow on its own has enough
      information to identify a user, but that both flows, taken
      together, can identify the user.  If this is so, and an attacker
      might see both anonymized flows, then it becomes critical to
      ensure that the adversary cannot easily learn that the two
      anonymized flows refer to the same flow.

I'm not proposing that every IPFIX user should want all of these
properties under all circumstances, but without them, the
untraceability properties become more fragile and much harder to
achieve.

It seems that elsewhere in the document, requirements _like_ these
are considered, though they're usually not explicit.  Instead, the
document only says that certain properties that are not themselves
identifiers "can be used to identify hosts and users", without, in
some cases, much consideration of how.

Sec 3: notes on threat modeling

(Perhaps this applies better to the Security Considerations section.
Either way, without a discussion of known attacks against entities'
privacy, it's hard to have a meaningful discussion of how privacy
can be achieved.)

Privacy, like security, requires us to consider threat models.  In
other words, we need to state our privacy requirements in terms of
an attacker's resources, and what counts as a successful attack.
The part of this draft that worries me most is that, when discussing
"untraceability", it does so with no actual explicit attacker in mind.

Because there isn't an explicit threat model or collection of threat
models, it's not really possible to say whether some of the attacks
and caveats below are really "valid" attacks against the draft's
anonymization methods, because "without a threat model, there are
no vulnerabilities--only surprising features."

For example, some places in the draft discuss "traffic injection"
attacks, implying an active attacker.  But other places in the draft
claim that anonymization techniques are effective even though they
_do not_ resist an active attacker.

Sec 4:

Throughout this section, it seems potentially misleading to say what
various anonymization techniques are "intended to defeat", and so
on.  A naive reader could take this to mean that a technique
_actually does_ defeat one of these attacks, or that it _actually
will_ provide a given degree of privacy, which I think is not what
the authors are trying to say.

On the other hand, if this draft _is_ trying to say which techniques
achieve what, then it needs to be much, much more specific about the
threat models and circumstances for which its statements are true.

sec 4.1.2:

"Reverse truncation is effective for making networks unidentifiable"
-- this is not so if the adversary can be assumed to know a little
bit about the network.  For example, assume we have removed all but
the last 8 bits of the IPv4 address, but we have left port numbers
unchanged.  An attacker reading the anonymized flow data can still
learn which hosts are running which services, and match this service
map (fuzzily) to known service maps on networks of interest.
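
A toy sketch of that attack in Python (the record layout and the
addresses are invented for illustration):

    # Anonymized records: (surviving last 8 bits of the server
    # address, untouched server port).
    anon_records = [(17, 22), (17, 80), (17, 443), (42, 25), (42, 993)]

    # The attacker's knowledge of a candidate network, e.g. from a
    # port scan: full addresses mapped to the services they run.
    known_services = {"192.0.2.17": {22, 80, 443},
                      "192.0.2.42": {25, 993}}

    # Build the per-host port fingerprint from the anonymized data...
    anon_map = {}
    for suffix, port in anon_records:
        anon_map.setdefault(suffix, set()).add(port)

    # ...and match it against the known map: a host's service set,
    # plus its surviving low-order address bits, often pins it down.
    for addr, ports in known_services.items():
        if anon_map.get(int(addr.rsplit(".", 1)[1])) == ports:
            print(addr, "re-identified despite reverse truncation")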

sec 4.1.3:

Surely we should say something about the security requirements of
the permutation function.  There is all the difference in the world
between, say, a block cipher and an xor with a known constant, but
this section doesn't actually make that distinction.  Below, in
section 5.4, the draft says that the permutation and its parameters
SHOULD NOT be exported, but that's not quite the same as saying that
the permutation SHOULD be hard to invert without knowing its
parameters.

Similarly, the recommendation to use a hash function can fail badly
if the hash function is known to the attacker: it is trivial for the
attacker to brute force all IPv4 addresses to deanonymize subjects
if a known hash is used.  HMAC with a secret key would be more
appropriate.
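
A minimal Python sketch of both halves of this point (the function
and key names are mine, not the draft's):

    import hashlib, hmac, struct

    def hashed(ipv4_int):
        # Unkeyed hash pseudonym: anyone who knows the algorithm can
        # invert it by exhaustive search over the 2^32 address space.
        return hashlib.sha256(struct.pack("!I", ipv4_int)).digest()[:8]

    def deanonymize(pseudonym):
        # The trivial brute force: try every IPv4 address.
        for candidate in range(2**32):
            if hashed(candidate) == pseudonym:
                return candidate

    SECRET_KEY = b"site-local key, rotated locally"  # never exported

    def keyed(ipv4_int):
        # HMAC with a secret key: the same search now requires the
        # key, not merely knowledge of the algorithm.
        return hmac.new(SECRET_KEY, struct.pack("!I", ipv4_int),
                        hashlib.sha256).digest()[:8]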

4.2: See notes from 4.1 above.  Brute-forcing a 48-bit MAC address
is harder than brute-forcing a 32-bit IPv4 address, but not out of
reach even for a hobbyist.
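
(Back-of-envelope: 2^48 is roughly 2.8 x 10^14 candidate MAC
addresses; at 10^9 hash operations per second, well within reach of
commodity hardware, exhaustion takes a few days, and the 2^32 IPv4
space falls in seconds.)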

4.3: There is existing research on the extent to which the beginning
and ending times of related flows can be used to link an anonymized
view of a flow to a non-anonymized view of the flow.  See Murdoch
and Zieliński's "Sampled Traffic Analysis by Internet-Exchange-Level
Adversaries".
[ http://petworkshop.org/2007/papers/PET2007_preproc_Sampled_traffic.pdf ]

4.3.2: An active attacker who can create recognizable flows can turn
an enumerated timestamp dataset into a precision-degraded dataset by
periodically injecting a recognizable flow.
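
A sketch of that calibration in Python (the numbers and the record
shape are invented):

    # Enumerated dataset: flows carry sequence numbers instead of
    # timestamps.  The attacker injected a recognizable flow every
    # 600 s and noted the sequence numbers those flows received.
    anchors = [(100, 0), (250, 600), (410, 1200)]   # (seq, known time)

    def approx_time(seq):
        # Linear interpolation between the two nearest injected
        # flows turns enumeration back into coarse timestamps.
        for (s0, t0), (s1, t1) in zip(anchors, anchors[1:]):
            if s0 <= seq <= s1:
                return t0 + (t1 - t0) * (seq - s0) / (s1 - s0)

    print(approx_time(175))   # a victim flow's approximate start time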

4.3.3: Adding a uniform random shift is remarkably fragile: if the
adversary can identify the correct time for even one flow, he can
learn the times for all other flows.  Worse, if datasets are
generated continuously, with each one starting right after the
previous one finishes, then the attacker who knows the shift for one
dataset can place bounds on the shifts for all close-in-time
datasets by induction.
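
In Python, the whole attack is one subtraction (all values here are
illustrative):

    # The publisher shifts every timestamp by one secret offset.
    SECRET_SHIFT = 73000                      # seconds, per dataset
    true_times = [1000, 5250, 9980]           # real flow start times
    anon_times = [t + SECRET_SHIFT for t in true_times]

    # The adversary recognizes a single flow (say, one they
    # generated themselves) and knows its true start time.
    recovered_shift = anon_times[1] - true_times[1]

    # Every other flow in the dataset is now deanonymized.
    assert [t - recovered_shift for t in anon_times] == true_times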

4.3 and 4.4, in general: There is a pretty extensive literature
about the extent to which perturbing timing and volume information
prevents correlation, linkability, and website fingerprinting.
Check out the traffic analysis section of freehaven.net/anonbib, and
also check out the literature on "stepping stone detection".

The results are unintuitive to many people; in general, to resist
correlation and linkability attacks, you need to use perturbations
of higher variance or bins of larger size than many implementors
would expect.
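
A toy demonstration of the fragility in Python (the distributions
and jitter level are arbitrary):

    import random

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    random.seed(0)
    # Inter-arrival times of one flow seen at two observation
    # points; the second view is "anonymized" with low-variance
    # jitter, as a naive implementor might do.
    original = [random.expovariate(1.0) for _ in range(500)]
    jittered = [t + random.gauss(0, 0.05) for t in original]
    decoy    = [random.expovariate(1.0) for _ in range(500)]

    print(pearson(original, jittered))   # ~0.999: still linkable
    print(pearson(original, decoy))      # ~0: unrelated flows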


5.3, 5.4:

**This point is critical**:

Statements to the effect of "Information about [the particular
anonymization technique used] SHOULD NOT be exported" are a total
violation of Kerckhoffs' principle: the security of a system should
depend only on the secrecy of key-like parameters, not on the
secrecy of its algorithms.

In practice, after all, any competent attacker will know which
permutation functions and binning functions are implemented by the
popular IPFIX vendors.  Any aspect of permutation/binning which the
attacker must not learn needs to be keyed with a secret key that can be
changed locally.
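
A sketch of the keyed alternative in Python (the construction is
mine, purely for illustration; the draft specifies no such scheme):

    import hashlib, hmac

    SECRET_KEY = b"site-local, rotatable"  # the only secret
    BIN_WIDTH = 300                        # seconds; may be public

    def keyed_bin(timestamp):
        # Kerckhoffs-compliant binning: the algorithm and bin width
        # can be published, but the bin boundaries are offset by a
        # value derived from the key, so an attacker cannot align
        # known events to bin edges without the key.
        offset = int.from_bytes(
            hmac.new(SECRET_KEY, b"bin-offset",
                     hashlib.sha256).digest()[:4], "big") % BIN_WIDTH
        return (timestamp + offset) // BIN_WIDTH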

Section 9:

A risk that could be worthwhile to mention: Frequently,
anonymized data will be treated by administrators as "not
privacy-sensitive" when in fact it should only be treated as "less
privacy-sensitive."  (For examples in other fields, see the results
concerning user reidentification from AOL's search terms, or Netflix
film queues.)  The anonymization techniques described here do indeed
make entities associated with flows harder to trace ... but there is
a risk that when they are applied, administrators will treat flow
data as "completely safe" when in fact it has only become "less
harmful if misused".