[privacydir] Fwd: Review of draft-ietf-ipfix-anon
Sean Turner <turners@ieca.com> Wed, 05 January 2011 14:03 UTC
Message-ID: <4D247AA3.1060500@ieca.com>
Date: Wed, 05 Jan 2011 09:05:23 -0500
From: Sean Turner <turners@ieca.com>
To: privacydir@ietf.org
Here's Nick's review.

spt

-------- Original Message --------
Subject: Review of draft-ietf-ipfix-anon
Date: Thu, 30 Dec 2010 14:02:55 -0500
From: Nick Mathewson <nickm@freehaven.net>
To: Sean Turner <turners@ieca.com>

My expertise here is on anonymity and privacy enhancing technologies in general, and not in IPFIX, so please take my observations with the appropriate amount of skepticism. Also, I've not had much prior experience reviewing IETF drafts, so please forgive any suggestions I make that are out-of-scope for this document.

A couple of minor notes on terminology:

- The discussion and definition of 'anonymization' here may be the one more usually used when referring to data flows, but there are larger fields at work here that it might be a good idea to harmonize with. The idea of deleting fields or aggregating records in a data set is more closely associated with the topic of inference control (see also work on "k-anonymity"). For these, the Inference Control subsection of Ross Anderson's "Security Engineering" is a reasonably good introduction. The idea of anonymity in general seems more closely tied to notions of untraceability and unlinkability; for those see Pfitzmann and Hansen "A terminology for talking about privacy by data minimization". It's no disaster (and certainly not unprecedented!) to have the meaning of "anonymize" be context-dependent, but it _is_ important to say what property you actually mean to achieve thereby. See also the notes below on section 3.

- The authors should decide whether they're going to use American or British spellings. The draft uses the American spellings for categorize, organize, minimize, behavior, and so on, but unaccountably uses the British "-ise" in anonymise and pseudonymise. In the research literature, and in the other relevant RFCs, "anonymize" seems to be more popular, but either spelling type is fine so long as it's consistent.
Sec 1:

It might be wise to repeat here (or even in the abstract) the note from the Security Considerations section that this draft is only meant to explain how to interchange anonymized data, not to provide any recommendations as to which anonymization techniques to use, or even any guarantee that any particular technique achieves any particular purpose. Otherwise, it is easy to misread some parts of section 4 as promising that particular techniques will prevent particular attacks, which is not in fact the case for reasonable threat models.

It would also be useful to note that this draft _only_ considers the kind of inference control you can achieve through transformation of individual flow records. Aggregation-based anonymization is not considered --and not even included in this draft's definition of anonymization!-- even though it provides more robust privacy results.

Sec 3: notes on goals

So it seems that the goal of anonymization, as stated here, is to prevent IP flow data from being traced to the networks, hosts, or users that participated in it. But this is only one possible goal of these techniques. Other uses of the anonymization techniques in this draft include:

- One-way untraceability. Under some circumstances, it is fine to identify the originator/recipient of a given flow, but not both. For example, it might be fine to identify users so long as the services they use can't be inferred, and it might be fine to identify servers so long as you can't tell who their users are.

- Resistance to partner profiling. Even if no particular flow can be linked to a particular entity, it might be undesirable for the set of flows as a whole to be useful for statistically inferring certain properties of networks, hosts, or users.
For example, even if an attacker can trace no specific flow to users Alice and Bob with confidence greater than 0.01%, if they could nevertheless infer that Alice and Bob communicate regularly with P=99%, Alice and Bob would reasonably consider their privacy to have been compromised. For a more rigorous treatment of an attack that achieves profiling without tracing specific interactions, see Danezis's Statistical Disclosure attack.

- Non-observability. It might be undesirable for a flow or set of flows to confirm that a particular entity was in fact present or absent at a given time.

Some related desirable properties include:

- Non-linkability. It might be undesirable for two flows generated by the same entity to be linked to one another. Linkability between flows is a strong amplifier for traceability attacks: if through mischance, misdesign, or external knowledge, an adversary manages to trace a single flow to one of its entities, then linkability between flows means that _all_ of that entity's flows are now traced. Website Fingerprinting ("Fingerprinting Websites Using Traffic Analysis", Andrew Hintz, 2002) is an example of a more subtle attack enabled by flow linkability.

- Correlation resistance. It might be undesirable for anonymized flows processed by different IPFIX installations to be correlated to one another. For example, suppose that the same flow is anonymized one way as it travels through network A, and another way as it travels through network B. It might be the case that neither anonymized flow on its own has enough information to identify a user, but that both flows, taken together, can identify the user. If this is so, and an attacker might see both anonymized flows, then it becomes critical to ensure that the adversary cannot easily learn that the two anonymized flows refer to the same flow.
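
The amplification effect described under non-linkability can be illustrated with a minimal sketch. Nothing here comes from the draft: the flow records, the pseudonyms "p1" and "p2", and the function names are all hypothetical, and the only assumption is that every flow from one entity carries the same pseudonymous source.

```python
from collections import defaultdict

# Hypothetical anonymized flow records:
# (pseudonymous source, destination port, start time).
flows = [
    ("p1", 443, 100),
    ("p2", 22, 105),
    ("p1", 25, 130),
    ("p1", 6667, 180),
    ("p2", 443, 200),
]

def link_by_pseudonym(records):
    """Group record indices by source pseudonym (only possible if flows are linkable)."""
    groups = defaultdict(list)
    for idx, (src, _port, _start) in enumerate(records):
        groups[src].append(idx)
    return groups

def amplify(records, traced_index, identity):
    """Given one traced flow, attribute every flow linked to it to the same identity."""
    pseudonym = records[traced_index][0]
    return {idx: identity for idx in link_by_pseudonym(records)[pseudonym]}

# Assumption: external knowledge ties flow 0 to "alice".
attributed = amplify(flows, 0, "alice")
print(attributed)  # {0: 'alice', 2: 'alice', 3: 'alice'}
```

Tracing a single record is enough to attribute every record that shares its pseudonym, which is why linkability makes any one tracing failure so costly.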
I'm not proposing that every IPFIX user should want all of these properties under all circumstances, but without them, the untraceability properties become more fragile and much harder to achieve.

It seems that elsewhere in the document, requirements _like_ these are considered, though they're usually not explicit. Instead, the document only says that certain properties that are not themselves identifiers "can be used to identify hosts and users" without much considering how in some cases.

Sec 3: notes on threat modeling

(Perhaps this applies better to the security consideration section. Either way, without a discussion of known attacks against entities' privacy, it's hard to have a meaningful discussion of how privacy can be achieved.)

Privacy, like security, requires us to consider threat models. In other words, we need to state our privacy requirements in terms of an attacker's resources, and what counts as a successful attack. The part of this draft that worries me most is that, when discussing "untraceability", it does so with no actual explicit attacker in mind.

Because there isn't an explicit threat model or collection of threat models, it's not really possible to say whether some of the attacks and caveats below are really "valid" attacks against their anonymization methods, because "without a threat model, there are no vulnerabilities--only surprising features."

For example, some places in the draft discuss "traffic injection" attacks, implying an active attacker. But other places in the draft claim that anonymization techniques are effective when they _do not_ resist an active attacker.

Sec 4:

Throughout this section, it seems potentially misleading to say what various anonymization techniques are "intended to defeat", and so on. A naive reader could take this to mean that a technique _actually does_ defeat one of these attacks, or that it _actually will_ provide a given degree of privacy, which I think is not what the authors are trying to say.
On the other hand, if this draft _is_ trying to say which techniques achieve what, then it needs to be much, much more specific about the threat models and circumstances for which its statements are true.

sec 4.1.2:

"Reverse truncation is effective for making networks unidentifiable" -- this is not so if the adversary can be assumed to know a little bit about the network. For example, assume we have removed all but the last 8 bits of the IPv4 address, but we have left port numbers unchanged. An attacker reading the anonymized flow data can still learn which hosts are running which services, and match this service map (fuzzily) to known service maps on networks of interest.

sec 4.1.3:

Surely we should say something about the security requirements of the permutation function. There is all the difference in the world between, say, a block cipher and an xor with a known constant, but this section doesn't actually make that distinction. Below, in section 5.4, the draft says that the permutation and its parameters SHOULD NOT be exported, but that's not quite the same as saying that the permutation SHOULD be hard to invert without knowing its parameters.

Similarly, the recommendation to use a hash function can fail badly if the hash function is known to the attacker: it is trivial for the attacker to brute force all IPv4 addresses to deanonymize subjects if a known hash is used. HMAC with a secret key would be more appropriate.

4.2:

See notes from 4.1 above. Brute-forcing a 48-bit MAC address is harder than brute-forcing a 32-bit IPv4 address, but not out of reach even for a hobbyist.

4.3:

There is existing research on the extent to which the beginning and ending times of related flows can be used to link an anonymized view of a flow to a non-anonymized view of the flow. See Murdoch and Zielinski's "Sampled Traffic Analysis by Internet-Exchange-Level Adversaries".
[ http://petworkshop.org/2007/papers/PET2007_preproc_Sampled_traffic.pdf ]

4.3.2:

An active attacker who can create recognizable flows can turn an enumerated timestamp dataset into a precision-degraded dataset by periodically injecting a recognizable flow.

4.3.3:

Adding a uniform random shift is remarkably fragile: if the adversary can identify the correct time for even one flow, he can learn the times for all other flows. Worse, if datasets are generated continuously, with each one starting right after the previous one finishes, then the attacker who knows the shift for one dataset can place bounds on the shifts for all close-in-time datasets by induction.

4.3 and 4.4, in general:

There is a pretty extensive literature about the extent to which perturbing timing and volume information prevents correlation, linkability, and website fingerprinting. Check out the traffic analysis section of freehaven.net/anonbib, and also check out the literature on "stepping stone detection". The results are unintuitive to many people; in general, to resist correlation and linkability attacks, you need to use perturbations of higher variance or bins of larger size than many implementors would expect.

5.3, 5.4:

**This point is critical**: Statements to the effect of "Information about [the particular anonymization technique used] SHOULD NOT be exported" are a total violation of Kerckhoffs' principle: the security of a system should depend only on the secrecy of key-like parameters, not on the secrecy of its algorithms. In practice, after all, any competent attacker will know which permutation functions and binning functions are implemented by the popular IPFIX vendors. Any aspect of permutation/binning which the attacker must not learn needs to be keyed with a secret key that can be changed locally.
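
To make the keying point concrete, here is a minimal sketch (not taken from the draft) contrasting an unkeyed hash with HMAC under a locally held secret. The function names, the key values, and the choice of SHA-256 are illustrative assumptions. The search space is deliberately limited to one /24 so the demonstration runs instantly; a real attacker would simply enumerate all 2^32 IPv4 addresses.

```python
import hashlib
import hmac
import ipaddress

def unkeyed_pseudonym(addr: str) -> str:
    """Pseudonymize with a public, unkeyed hash (the fragile approach)."""
    return hashlib.sha256(ipaddress.ip_address(addr).packed).hexdigest()

def keyed_pseudonym(addr: str, key: bytes) -> str:
    """Pseudonymize with HMAC-SHA256 under a locally held secret key."""
    packed = ipaddress.ip_address(addr).packed
    return hmac.new(key, packed, hashlib.sha256).hexdigest()

def brute_force(target: str, candidates, pseudonymize):
    """Return the candidate address whose pseudonym equals target, or None."""
    for cand in candidates:
        if pseudonymize(str(cand)) == target:
            return str(cand)
    return None

# Hypothetical search space: a single /24 from the documentation range.
candidates = list(ipaddress.ip_network("192.0.2.0/24"))
secret_key = b"illustrative-local-secret"  # assumption: known only to the exporter
victim = "192.0.2.77"

# In both cases the attacker knows the algorithm, but not the key.
recovered_unkeyed = brute_force(
    unkeyed_pseudonym(victim), candidates, unkeyed_pseudonym
)
recovered_keyed = brute_force(
    keyed_pseudonym(victim, secret_key),
    candidates,
    lambda a: keyed_pseudonym(a, b"attacker-guess"),
)

print(recovered_unkeyed)  # 192.0.2.77
print(recovered_keyed)    # None
```

The algorithm is public in both cases; only the key separates the attacker from the mapping, so the key is the one thing that has to stay secret and be changeable locally.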
Section 9: A risk that it could be worthwhile to mention: Frequently, anonymized data will be treated by administrators as "not privacy-sensitive" when in fact it should only be treated as "less privacy-sensitive." (For examples in other fields, see the results concerning user reidentification from AOL's search terms, or Netflix film queues.) The anonymization techniques described here do indeed make entities associated with flows harder to trace ... but there is a risk that when they are applied, administrators will treat flow data as "completely safe" when in fact it has only become "less harmful if misused".