[Int-area] Privacy thoughts re draft-boucadair-intarea-nat-reveal-analysis

Alissa Cooper <acooper@cdt.org> Thu, 08 December 2011 12:11 UTC

From: Alissa Cooper <acooper@cdt.org>
Date: Thu, 08 Dec 2011 12:10:46 +0000
Message-Id: <6D1019B7-4A38-48A0-917F-735BC63132ED@cdt.org>
To: int-area@ietf.org

I spent some time reviewing draft-boucadair-intarea-nat-reveal-analysis-04 today. These are my preliminary thoughts about the document with respect to privacy. Note that I'm not a network-layer expert, so in some cases I raise questions to which there may be obvious answers.

The primary way that I think the text could be improved would be to make it more specific about 4 properties of the various solutions:

1) Which identifiers are candidates for being included in the HOST_ID. 
Looking just at the first 5 solutions discussed in the text (since the last two are only given a cursory treatment anyway), there are at least 8 different kinds of identifiers mentioned:

full IPv4 address (IP Option, XFF, Proxy Protocol)
IPv6 prefix (IP Option, XFF)
any unique 16-bit value (IP-ID)
lower 16 bits of IPv4 address (TCP Option)
VLAN ID (TCP Option)
VRF ID (TCP Option)
subscriber ID (TCP Option)
INET + IPv4 address + TCP source port + TCP dest port (Proxy Protocol)

I realize that the general goal of all of these solutions -- to disambiguate hosts behind the same public IP -- is the same, but the implications of using these different identifiers are not always the same (more on that in my suggested text below). I also realize that the selection of which identifier to use may be carrier-specific or implementation-specific. Nonetheless, I think it would be helpful to be as precise as possible when discussing which identifiers are candidates for being included in each solution proposal.
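As a rough illustration of why the choice of identifier matters, here is a minimal Python sketch comparing two of the candidates above: the full IPv4 address and its lower 16 bits. The helper name lower_16_bits and the sample addresses are assumptions made for this example; they do not come from the draft or from any of the solution proposals.

```python
import ipaddress


def lower_16_bits(ipv4: str) -> int:
    """Return the low-order 16 bits of an IPv4 address as an integer.

    Hypothetical helper for illustration only; not defined in any draft.
    """
    return int(ipaddress.IPv4Address(ipv4)) & 0xFFFF


# Illustrative private addresses (assumed values).
same_slash16_a = lower_16_bits("10.0.1.5")   # 261
same_slash16_b = lower_16_bits("10.0.2.5")   # 517
other_slash16 = lower_16_bits("10.7.1.5")    # 261 again

# Within one /16 the truncated value still tells hosts apart ...
print(same_slash16_a != same_slash16_b)      # True
# ... but it repeats across /16s, so it is at most locally unique.
print(same_slash16_a == other_slash16)       # True
```

The truncated value disambiguates hosts only within a single /16, whereas the full address carries more identifying information than disambiguation strictly requires.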

2) Uniqueness of identifiers in HOST_ID
The document and other documents that it references talk in various places about disambiguation, uniqueness, and global uniqueness, but there is no consistent statement about each solution proposal as to whether the proposal may/should/must/will support identifiers at a certain level of uniqueness. It would be helpful to state this explicitly. I've made some suggestions about this in my suggested text below. Also, if possible, it would be good to include a recommendation that, where implementers have a choice, HOST_IDs be limited to providing local uniqueness rather than global uniqueness.

3) Refresh rate of HOST_ID
The reference to the volatility of HOST_ID information in 1.2 is good, but again I would suggest adding solution-specific text about this where each solution is discussed.

4) Interactions between multiple solutions
Section 3.2.2 makes brief mention of interference when multiple solutions are used, but such interaction also has privacy implications (e.g., if a TCP option exposes subscriber ID and XFF exposes IPv4 address). To the extent that combinations like this are being envisioned, they need a more thorough treatment.

Along these lines, in the sections below I've either suggested new text entirely (1.2) or inserted notes/questions in brackets where I think more text would be helpful.

1.2.  HOST_ID and Privacy

IP address sharing is motivated by a number of different factors. For years, many network operators have conserved the use of public IPv4 addresses by making use of customer premises equipment (CPE) that assigns a single public IPv4 address to all hosts within the customer's local area network and uses NAT to translate between locally unique private IPv4 addresses and the CPE's public address. With the exhaustion of IPv4 address space, address sharing between customers on a much larger scale is likely to become more prevalent.

While many individual users are unaware of and uninvolved in decisions about whether their unique IPv4 addresses get revealed when they send data via IP, some users realize privacy benefits associated with IP address sharing, and some may even take steps to ensure that NAT functionality sits between them and the public Internet. IP address sharing makes the actions of all users behind the NAT function unattributable to any single host, creating room for abuse but also providing some identity protection for non-abusive users who wish to transmit data with reduced risk of being uniquely identified.

The proposals considered in this document add a measure of uniqueness back to hosts that share a public IPv4 address. The extent of that uniqueness depends on which information is included in the HOST_ID and is discussed in each solution proposal section. 

Similarly, the volatility of the HOST_ID information depends on the particular solution proposal, and in some cases, the particular implementation. In some cases the HOST_ID may be recycled when the host reboots or obtains a new internal IP address, while in other cases the HOST_ID may be persistent. As with persistent IP addresses, persistent HOST_IDs facilitate user tracking over time.

As a general matter, the HOST_ID proposals do not seek to make hosts any more identifiable than they would be if they were using a public, non-shared IP address. However, depending on the solution proposal, the addition of HOST_ID information may allow a device to be fingerprinted more easily than it otherwise would be. Should multiple solutions be combined (e.g., TCP Option and XFF) that include different pieces of information in the HOST_ID, fingerprinting may become even easier. 

The trust placed in the information conveyed in the HOST_ID is likely to be the same as for current practices with source IP addresses. In that sense, a HOST_ID can be spoofed, just as an IP address can. [Note: Is this statement really true for HOST_ID solutions that rely on something other than an IP address, e.g., subscriber ID? What about when SAVI is in use? Also, what are the implications of spoofing for return reachability? If spoofing is being put forth as some sort of user-enabled protection mechanism, the actual implications of spoofing require further discussion.] Furthermore, users of network-based anonymity services (like Tor) may be capable of stripping HOST_ID information before it reaches its destination.

[Is it envisioned that the HOST_ID solutions will be used by mobile operators? If so, there is probably a bit more to be said here about a mobile device maintaining its HOST_ID even if its public IP changes.]

3.1.  Define an IP Option

3.1.1.  Description

  This proposal aims to define an IP option [RFC0791] to convey a "host
  identifier".  This identifier can be inserted by the address sharing
  function to uniquely distinguish a host among those sharing the same
  IP address.  The option can convey an IPv4 address, the prefix part
  of an IPv6 address, etc.

[This seems pretty unspecific. What are all the identifiers that are/could be used here?]

3.1.2.  Analysis

  Unlike the solution presented in Section 3.2, this proposal can apply
  for any transport protocol.  Nevertheless, it is widely known that
  routers (and other middle boxes) filter IP options.  IP packets with
  IP options can be dropped by some IP nodes.  Previous studies
  demonstrated that "IP Options are not an option" (Refer to
  [Not_An_Option], [Options]).

[Depending on the answer to my question posed in 3.1.1, there should be some discussion here of the differences in uniqueness and volatility of the different potential identifiers.]

3.2.  Define a TCP Option
3.2.2.  Analysis

[Looking at draft-wing-nat-reveal-option, it does a good job of discussing the max refresh rate of the TCP option, but doesn't discuss the min at all. I presume some implementations might use a persistent identifier if they're not sharing among more than 2^16 hosts. Is it practical to recommend against that (probably would make more sense to do so in draft-wing-nat-reveal-option, but seems like it needs to be discussed somewhere)? 

I don't know enough about the various kinds of IDs listed, but it seems inadvisable to use something globally unique when all you need is local uniqueness.]
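To make the local-uniqueness and refresh point concrete, here is a minimal Python sketch of an address sharing function that hands out random 16-bit identifiers that are unique only among the hosts behind one shared address, and that can be refreshed on demand. Everything in it (the class name LocalHostIdAllocator, the assign and refresh methods, the host keys) is hypothetical and chosen for illustration; it is not the behavior specified in draft-wing-nat-reveal-option.

```python
import secrets


class LocalHostIdAllocator:
    """Hypothetical allocator of 16-bit IDs, unique behind one shared address."""

    def __init__(self) -> None:
        self._by_host = {}     # internal host key -> current 16-bit ID
        self._in_use = set()   # IDs currently assigned

    def assign(self, host_key: str) -> int:
        """Return the host's current ID, picking a random unused one if needed."""
        if host_key in self._by_host:
            return self._by_host[host_key]
        if len(self._in_use) >= 2**16:
            raise RuntimeError("more than 2^16 hosts: 16 bits cannot disambiguate")
        while True:
            candidate = secrets.randbelow(2**16)
            if candidate not in self._in_use:
                break
        self._by_host[host_key] = candidate
        self._in_use.add(candidate)
        return candidate

    def refresh(self, host_key: str) -> int:
        """Release the host's ID and assign a new random one.

        The new value is drawn at random, so it can occasionally equal the old one.
        """
        old = self._by_host.pop(host_key, None)
        if old is not None:
            self._in_use.discard(old)
        return self.assign(host_key)


allocator = LocalHostIdAllocator()
id_a = allocator.assign("host-a")
id_b = allocator.assign("host-b")
id_a_new = allocator.refresh("host-a")
```

Because the identifier is random and carries no address, VLAN, VRF, or subscriber information, it disambiguates hosts without being meaningful outside the address sharing function, and calling refresh periodically bounds how long any one value can be used for tracking.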

o  Interference with current usages such as X-Forwarded-For (see
  Section 3.4) should be elaborated to specify the behavior of
  servers when both options are used; in particular specify which
  information to use: the content of the TCP option or what is
  conveyed in the application headers.

[If the use of both the TCP option and XFF together is a real possibility, it would be good to be able to recommend that they both contain subsets of the same information (e.g., full IP and lower 16 bits of IP).]
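A minimal Python sketch of the subset relationship suggested above, assuming XFF carries the full IPv4 address and the TCP option carries its lower 16 bits. The function name tcp_option_is_subset_of_xff and the sample values are hypothetical, and the sketch does not reflect server behavior defined in any of the drafts.

```python
import ipaddress


def tcp_option_is_subset_of_xff(xff_ipv4: str, tcp_option_value: int) -> bool:
    """Check that the TCP option value equals the low 16 bits of the XFF address.

    Hypothetical check for illustration: if it holds, the TCP option reveals
    nothing that the XFF header does not already reveal.
    """
    return (int(ipaddress.IPv4Address(xff_ipv4)) & 0xFFFF) == tcp_option_value


# 192.0.2.17 is a documentation address; its low 16 bits are 0x0211.
print(tcp_option_is_subset_of_xff("192.0.2.17", 0x0211))  # True
print(tcp_option_is_subset_of_xff("192.0.2.17", 0x0212))  # False
```

When the check holds, combining the two mechanisms exposes no more than the XFF header alone; when the two carry unrelated identifiers (e.g., subscriber ID and IPv4 address), the combination exposes strictly more.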

3.3.  Use the Identification Field of IP Header (IP-ID)
3.3.1.  Description

  IP-ID (Identification field of IP header) can be used to insert an
  information which uniquely distinguishes a host among those sharing
  the same IPv4 address.  An address sharing function can re-write the
  IP-ID field to insert a value unique to the host (16 bits are
  sufficient to uniquely disambiguate hosts sharing the same IP
  address).

[Is it possible to be more specific about what these bits are?]

3.4.  Inject Application Headers

[It seems like this solution raises some broader issues beyond privacy -- does it really make sense to promote a model where access to more and more resources via HTTP is gated on the presence of a non-standardized extension header, with exceptions made on the basis of a Wikipedia-based list of ISPs?]

Cheers,
Alissa
