Re: [DNSOP] Benjamin Kaduk's Discuss on draft-ietf-dnsop-dns-capture-format-08: (with DISCUSS and COMMENT)

Sara Dickinson <sara@sinodun.com> Wed, 21 November 2018 13:53 UTC

From: Sara Dickinson <sara@sinodun.com>
Message-Id: <CAD81299-8C6E-44EA-AFC0-D3A67E0057C3@sinodun.com>
Content-Type: multipart/alternative; boundary="Apple-Mail=_BAEC95B7-DB9A-4EAA-B95B-421AFC05A731"
Mime-Version: 1.0 (Mac OS X Mail 12.0 \(3445.100.39\))
Date: Wed, 21 Nov 2018 13:53:09 +0000
In-Reply-To: <154258729961.2478.12875770828573692533.idtracker@ietfa.amsl.com>
Cc: The IESG <iesg@ietf.org>, draft-ietf-dnsop-dns-capture-format@ietf.org, Tim Wicinski <tjw.ietf@gmail.com>, dnsop-chairs@ietf.org, dnsop@ietf.org
To: Benjamin Kaduk <kaduk@mit.edu>
References: <154258729961.2478.12875770828573692533.idtracker@ietfa.amsl.com>
X-Mailer: Apple Mail (2.3445.100.39)
X-BlackCat-Spam-Score: 4
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/k6mGhGtt7pjRYWGY6_FeGARIrQk>
Subject: Re: [DNSOP] Benjamin Kaduk's Discuss on draft-ietf-dnsop-dns-capture-format-08: (with DISCUSS and COMMENT)


> Begin forwarded message:
> 
> From: Benjamin Kaduk <kaduk@mit.edu>
> Subject: Benjamin Kaduk's Discuss on draft-ietf-dnsop-dns-capture-format-08: (with DISCUSS and COMMENT)
> Date: 19 November 2018 at 00:28:19 GMT
> To: "The IESG" <iesg@ietf.org>
> Cc: draft-ietf-dnsop-dns-capture-format@ietf.org, Tim Wicinski <tjw.ietf@gmail.com>, dnsop-chairs@ietf.org, tjw.ietf@gmail.com, dnsop@ietf.org
> Resent-From: <alias-bounces@ietf.org>
> Resent-To: jad@sinodun.com, jim@sinodun.com, sara@sinodun.com, terry.manderson@icann.org, john.bond@icann.org

Many thanks for the detailed review. 

> 
> ----------------------------------------------------------------------
> DISCUSS:
> ----------------------------------------------------------------------
> 
> It is pretty shocking to not see any discussion of the privacy
> considerations of storing data including client addresses (and ports)
> alongside DNS transactions, given how central DNS resolution is to user
> behavior on the web.  (Note that there are mentions of potentially
> anonymized data in Sections 6.2 and 6.2.3 which would presumably
> forward-reference the privacy considerations.)  Data normalization would
> probably also be mentioned in this section, since (e.g.) the case used for
> a query/response could be used in fingerprinting an implementation.

There has been extensive discussion of data storage risks and practices in two DPRIVE documents, so I’d suggest the following changes in the first instance to address this:

New Privacy Considerations section:
“Storage of DNS traffic by operators in PCAP and other formats is a long-standing and widespread practice. Section 2.5 of draft-bortzmeyer-dprive-rfc7626-bis is an analysis of the risks to Internet users of the storage of DNS traffic data in servers (recursive resolvers, authoritative and rogue servers).

Section 5.2 of draft-dickinson-dprive-bcp-op describes mitigations for those risks for data stored on recursive resolvers (but which could by extension apply to authoritative servers). These include data handling practices and methods for data minimisation, IP address pseudonymization and anonymization. Appendix B of that document presents an analysis of 7 published anonymization processes. In addition, RSSAC have recently published RSSAC040: “Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis” [1].

The above analyses consider full data capture (e.g. using PCAP) as a baseline for privacy considerations, and therefore this format specification introduces no new user privacy issues beyond those of full data capture. It does provide mechanisms to selectively record only certain fields at the time of data capture to improve user privacy, and to explicitly indicate that data is sampled and/or anonymised. It also provides flags to indicate if data normalisation has been performed; data normalisation increases user privacy by reducing the potential for fingerprinting individuals; however, a trade-off is potentially reducing the capacity to identify attack traffic via query name signatures. Operators should carefully consider their operational requirements and privacy policies and SHOULD capture at source the minimum user data required to meet their needs.”

[1] https://www.icann.org/en/system/files/files/rssac-040-07aug18-en.pdf


As noted, there are a few other places we can also highlight the privacy aspects:

Introduction:
OLD: “The PCAP [pcap] or PCAP-NG [pcapng] formats are typically used in practice for packet captures, but these file formats can contain a great deal of additional  information that is not directly pertinent to DNS traffic analysis  and thus unnecessarily increases the capture file size.”

NEW: “The PCAP [pcap] or PCAP-NG [pcapng] formats are typically used in practice for packet captures, but these file formats can contain a great deal of additional  information that is not directly pertinent to DNS traffic analysis  and thus unnecessarily increases the capture file size. Additionally, these tools and formats typically have no filter mechanism to selectively record only certain fields at capture time, requiring post-processing for anonymisation or pseudonymisation of data to protect user privacy.”

Section 4, bullet point 2:

OLD: “Different users will have different requirements
          for data to be available for analysis.  Users with minimal
          requirements should not have to pay the cost of recording full
          data, though this will limit the ability to perform certain
          kinds of data analysis and also to reconstruct packet
          captures.  For example, omitting the resource records from a
          Response will reduce the C-DNS file size; in principle
          responses can be synthesized if there is enough context.”

NEW: “Different operators will have different requirements
          for data to be available for analysis.  Operators with minimal
          requirements should not have to pay the cost of recording full
          data, though this will limit the ability to perform certain
          kinds of data analysis and also to reconstruct packet
          captures.  For example, omitting the resource records from a
          Response will reduce the C-DNS file size; in principle
          responses can be synthesized if there is enough context.
          Operators may have different policies for collecting user data
          and can choose to omit or anonymise certain fields at
          capture time, e.g. client address.”

And yes, in both Sections 6.2 and 6.2.3 we will add forward references to the Privacy Considerations section.
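
To illustrate the kind of capture-time minimisation described above, here is a minimal Python sketch of truncating a client address to a configured prefix length before it is stored. The function name truncate_address, and the choice to keep only the bytes that cover the prefix, are assumptions made for illustration and are not text from the draft.

```python
import ipaddress


def truncate_address(addr: str, prefix_len: int) -> bytes:
    """Return only the leading bytes of addr that cover prefix_len bits.

    Hypothetical helper: bits beyond the prefix are zeroed, then the
    result is cut to the smallest whole number of bytes.
    """
    ip = ipaddress.ip_address(addr)
    total_bits = ip.max_prefixlen  # 32 for IPv4, 128 for IPv6
    if not 0 <= prefix_len <= total_bits:
        raise ValueError("prefix length out of range for this address family")

    # Zero the host bits so nothing beyond the prefix is retained.
    host_bits = total_bits - prefix_len
    masked = (int(ip) >> host_bits) << host_bits

    # Keep only the bytes needed to hold the prefix.
    packed = masked.to_bytes(total_bits // 8, "big")
    return packed[: (prefix_len + 7) // 8]
```

For example, under these assumptions an IPv4 client address recorded with a 24-bit prefix is stored as three bytes, and an IPv6 address with a 48-bit prefix as six bytes, so the host part never reaches the capture file.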


> 
> I'm also concerned about the policy/procedure for allocating/extending the
> various bitfields and similar potential extension points in the data
> structures.  Section 8 covers the major/minor versioning semantics with
> respect to new map keys and new maps, but not addition of new bits within
> existing (uint) bitmaps.  Given the usage of the CDDL .bits constraint,
> it's not really clear that an IANA registry is the right tool to use, but I
> think some indication of the expected way to allocate new bits is in order,
> whether it's "a future standards-track document that updates this document"
> or otherwise.  (I've noted many, but not all, instances of such bitmaps in
> my COMMENT section.)

We are inclined to follow the lead of existing RFCs making use of CBOR, namely
* RFC8152 'CBOR Object Signing and Encryption' (July 2017)
* RFC8392 'CBOR Web Token (CWT)' (May 2018) and
* RFC8428 'Sensor Measurement Lists (SenML)' (Aug 2018)
and request that IANA create a C-DNS registry with
subregistries with keys for each of the different maps used in C-DNS.
New entries in these subregistries would follow Expert Review as defined
in RFC8126. This appears to be the emerging usual way of dealing with
CBOR map key values, particularly integer ones.

> 
> There are also a couple of fields whose semantics don't seem to be
> sufficiently well specified for a proposed-standard document, such as
> vlan-ids, generator-id, name-rdata, and ae-code.  (I understand that some
> of them are probably only going to have locally relevant semantics, but we
> should be explicit about when that's the case.)

Acknowledged; we’ll add references or clarifications for these (we will put details in a follow-up mail that will also address your comments below).

> 
> If I'm reading things correctly that the IP address type is inferred from
> the bytestring length, then I think we need to enforce a restriction on the
> address prefix length(s) to allow for that inference to be unambiguous
> (noting that we only have the *byte* length of the address fields at our
> disposal for disabmgituation, and not the more precise bit-length).

Ah, the first bit of qr-transport-flags is an IPv4/IPv6 flag, so the address type can be determined explicitly from that when the field is present. But of course there is a corner case we hadn’t considered, where that field isn’t present, so we’ll have to address that. Making that field mandatory if prefixes are used would be simplest.
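
To make the corner case concrete, here is a rough Python sketch of how a reader might work out the address family. It assumes that a stored address occupies just enough bytes to cover the configured prefix length, and that bit 0 of qr-transport-flags is set for IPv6 and clear for IPv4. The names infer_ip_version, stored_length and IPV6_FLAG are hypothetical, and the bit polarity is an assumption.

```python
from typing import Optional

IPV6_FLAG = 0x01  # assumption: bit 0 of qr-transport-flags, set means IPv6


def stored_length(prefix_len: int) -> int:
    """Bytes needed to hold an address truncated to prefix_len bits."""
    return (prefix_len + 7) // 8


def infer_ip_version(
    addr: bytes,
    ipv4_prefix: int = 32,
    ipv6_prefix: int = 128,
    transport_flags: Optional[int] = None,
) -> int:
    """Return 4 or 6, or raise ValueError if the family cannot be decided."""
    # If the flags field is present, it answers the question directly.
    if transport_flags is not None:
        return 6 if transport_flags & IPV6_FLAG else 4

    # Otherwise only the byte length is available for disambiguation.
    v4_len = stored_length(ipv4_prefix)
    v6_len = stored_length(ipv6_prefix)
    if v4_len == v6_len:
        raise ValueError("ambiguous: both prefixes give the same byte length")
    if len(addr) == v4_len:
        return 4
    if len(addr) == v6_len:
        return 6
    raise ValueError("address length matches neither configured prefix")
```

With these assumptions, a 4-byte value recorded with full IPv4 addresses and a 32-bit IPv6 prefix is rejected as ambiguous unless the flags are supplied, which is the case that making the field mandatory would remove.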


> 
> 
> ----------------------------------------------------------------------
> COMMENT:
> ----------------------------------------------------------------------
> 
> Section 2
> 
> Please consider using the RFC 8174 version of the BCP 14 boilerplate.
> 
> Section 3
> 
>   Because of these considerations, a major factor in the design of the
>   format is minimal storage size of the capture files.
> 
> maybe "storage and transmission"?
> 
> Section 6
> 
> In Figure 2, the Query name is marked as "(q)" (only present if there is a
> query), but the running text in Section 4 (bullet 1) says that the Question
> section from the response can be used as an identifying QNAME if there is a
> response with no corresponding query.  Am I misexpanding QNAME here, or is
> there a disagreement between these two parts of the text?  In particular, I
> do not see a part of Figure 2 that would correspond to a Question section
> in the response, given the various "(q)"/"(r)" markings.
> 
> Section 6.2.2
> 
>   Messages with OPCODES known to the recording application but not
>   listed in the Storage Parameters are discarded (regardless of whether
>   they are malformed or not).
> 
> (Do we need to say anything that the "discarded" is only w.r.t. the capture
> process, and not meant to imply that DNS queries would not get a normal
> response?)
> 
> Section 6.2.4
> 
> Please consider using IPv6 examples, per
> https://www.iab.org/2016/11/07/iab-statement-on-ipv6/ .
> 
> Section 7.2
> 
>   o  The column T gives the CBOR data type of the item.
> 
>      *  U - Unsigned integer
> 
>      *  I - Signed integer
> 
> This is venturing a bit far from my normal area of expertise, but my
> understanding is that CBOR native major types are only provided for
> unsigned integer and negative integer, with "signed integer" being an
> abstraction at a slightly higher layer that needs to be managed in the
> application.  Do we need to add any clarifying text here or will the
> meaning be clear to the reader?
> 
> Section 7.4
> 
> Should probably forward-reference section 8 for the format version numbers'
> semantics.
> 
> Section 7.4.1.1
> 
> We should we reference the IANA registries by name for any of these fields
> (e.g., opcodes, rr-types, etc.).  (Also in Section 7.5.3.1, etc.)
> 
> Are the storage flags going to be allocated in sequence by updating
> standards-track documents, or some other mechanism?  (Is a registry
> necessary?)
> 
> For the various address prefix fields, do we need to specify that the full
> addresses are stored when the corresponding prefix field is absent?
> 
> Section 7.4.1.1.1
> 
> Am I parsing the "query-response-hints" text correctly to say that a bit is
> set in the bitmap if the corresponding field is recorded (if present) by
> the collecting implementation?  The causality of "if the field is omitted
> the bit is unset" goes in a direction that is not what I expected.
> (Similarly for the other fields in this table.)
> 
> Section 7.4.2
> 
> Do we need a reference for "promiscuous mode"?
> 
> Just to check: in "server-addresses", I just infer the IP version from the
> length of the byte string?
> 
> Do we need to say more about where the vlan-ids identifiers are taken from?
> 
> Is the "generator-id" string intended to only be human readable?  Only
> within a specific (administrative) context?
> 
> Section 7.5.1
> 
> Does "earliest-time" include leap seconds?
> 
> Section 7.5.3
> 
> The "ip-address" description seems to imply that very short ipv6 prefix
> lengths could cause confusion as to the address type being indicated (e.g.,
> setting to 32 when no ipv4 prefix length is set, or setting to the same
> value as the ipv4 prefix length).  Do we need to restrict the ipv6 prefix
> lengths to being 33 or larger?
> 
> Are the "name-rdata" contents in wire format or presentation format?
> 
> Section 7.5.3.2
> 
> What's the allocation policy/procedure for the remaining
> qr-transport-flags transport values?  For additional bits in any/all of the
> flags fields listed here?
> 
> Something of a side note, what's the mnemonic for the "sig" in
> "qr-sig-flags"?  That is, what is it a signature of or over (it doesn't
> seem like it's a cryptographic signature, which may be what is confusing
> me)?
> 
> For "query-rcode"/"response-rcode", should there be a reference for "OPT",
> and/or for any of the EDNS stuff in here?  (The Terminology section only
> mentions using the naming from RFC 1035, that I can see.)
> 
> The "mm-transport-flags" here bear a striking resemblance to the
> "qr-transport-flags" from Section 7.5.3.2; should there be a shared
> registry for their contents?  (I guess the TransportFlags CDDL to some
> extent serves this function.)
> 
> Section 7.7
> 
> How is the value of the "ae-code" determined?
> 
> Appendix A
> 
> We could perhaps apply some constraints on (e.g.) the address-prefex length
> fields to be .le the relevant lengths.
> 
> Appendix C.6
> 
>                                           Using a strong compression,
>   block sizes over 10,000 query/response pairs would seem to offer
>   limited improvements.
> 
> nit: Using a strong compression scheme
> 
>