Re: [DNSOP] Benjamin Kaduk's Discuss on draft-ietf-dnsop-dns-capture-format-08: (with DISCUSS and COMMENT)

Benjamin Kaduk <kaduk@mit.edu> Sat, 24 November 2018 03:35 UTC

Date: Fri, 23 Nov 2018 21:35:30 -0600
From: Benjamin Kaduk <kaduk@mit.edu>
To: Sara Dickinson <sara@sinodun.com>
Cc: Tim Wicinski <tjw.ietf@gmail.com>, dnsop@ietf.org, dnsop-chairs@ietf.org, The IESG <iesg@ietf.org>, draft-ietf-dnsop-dns-capture-format@ietf.org
Message-ID: <20181124033529.GF68416@kduck.kaduk.org>
References: <154258729961.2478.12875770828573692533.idtracker@ietfa.amsl.com> <CAD81299-8C6E-44EA-AFC0-D3A67E0057C3@sinodun.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dnsop/nfIkXrK8eP7rUIyQuuZHcvQO43E>

On Wed, Nov 21, 2018 at 01:53:09PM +0000, Sara Dickinson wrote:
> 
> 
> > Begin forwarded message:
> > 
> > From: Benjamin Kaduk <kaduk@mit.edu>
> > Subject: Benjamin Kaduk's Discuss on draft-ietf-dnsop-dns-capture-format-08: (with DISCUSS and COMMENT)
> > Date: 19 November 2018 at 00:28:19 GMT
> > To: "The IESG" <iesg@ietf.org>
> > Cc: draft-ietf-dnsop-dns-capture-format@ietf.org, Tim Wicinski <tjw.ietf@gmail.com>, dnsop-chairs@ietf.org, tjw.ietf@gmail.com, dnsop@ietf.org
> > Resent-From: <alias-bounces@ietf.org>
> > Resent-To: jad@sinodun.com, jim@sinodun.com, sara@sinodun.com, terry.manderson@icann.org, john.bond@icann.org
> 
> Many thanks for the detailed review. 
> 
> > 
> > ----------------------------------------------------------------------
> > DISCUSS:
> > ----------------------------------------------------------------------
> > 
> > It is pretty shocking to not see any discussion of the privacy
> > considerations of storing data including client addresses (and ports)
> > alongside DNS transactions, given how central DNS resolution is to user
> > behavior on the web.  (Note that there are mentions of potentially
> > anonymized data in Sections 6.2 and 6.2.3 which would presumably
> > forward-reference the privacy considerations.)  Data normalization would
> > probably also be mentioned in this section, since (e.g.) the case used for
> > a query/response could be used in fingerprinting an implementation.
> 
> There has been extensive discussion of data storage risks and practices in two DPRIVE documents, so I’d suggest the following changes in the first instance to address this:

This is exactly the sort of thing I was hoping to see, thank you!  I have
just a couple tweaks to suggest, inline.

> New Privacy Considerations section:
> “Storage of DNS traffic by operators in PCAP and other formats is a long-standing and widespread practice. Section 2.5 of draft-bortzmeyer-dprive-rfc7626-bis is an analysis of the risks to Internet users of the storage of DNS traffic data in servers (recursive resolvers, authoritative and rogue servers).
> 
> Section 5.2 of draft-dickinson-dprive-bcp-op describes mitigations for those risks for data stored on recursive resolvers (but which could by extension apply to authoritative servers). These include data handling practices and methods for data minimisation, IP address pseudonymization and anonymization. Appendix B of that document presents an analysis of 7 published anonymization processes. In addition, RSSAC has recently published RSSAC040: "Recommendations on Anonymization Processes for Source IP Addresses Submitted for Future Analysis" [1].
> 
> The above analyses consider full data capture (e.g., using PCAP) as a
> baseline for privacy considerations and therefore this format
> specification introduces no new user privacy issues beyond those of full
> data capture. It does provide mechanisms to selectively record only

I would say "beyond those of full data capture (which are quite severe)".
That is, while the current state of affairs is a valid baseline for
comparison, that does not absolve us of responsibility for analyzing the
current state of affairs.  (To be clear,
draft-bortzmeyer-dprive-rfc7626-bis is a fine place for the bulk of that
analysis to live, but in this document we should not pretend that the
current state of affairs is a good situation to be in.)

> certain fields at the time of data capture to improve user privacy and to
> explicitly indicate that data is sampled and/or anonymised. It also
> provides flags to indicate if data normalisation has been performed; data
> normalisation increases user privacy by reducing the potential for
> fingerprinting individuals however a trade-off is potentially reducing

I think "however" would be offset by commas on both sides.

> the capacity to identify attack traffic via query name signatures.
> Operators should carefully consider their operational requirements and
> privacy policies and SHOULD capture at source the minimum user data
> required to meet their needs.”
> 
> [1] https://www.icann.org/en/system/files/files/rssac-040-07aug18-en.pdf
> 
> 
> As noted, there are a few other places we can also highlight the privacy aspects:
> 
> Introduction:
> OLD: “The PCAP [pcap] or PCAP-NG [pcapng] formats are typically used in practice for packet captures, but these file formats can contain a great deal of additional  information that is not directly pertinent to DNS traffic analysis  and thus unnecessarily increases the capture file size.”
> 
> NEW: “The PCAP [pcap] or PCAP-NG [pcapng] formats are typically used in practice for packet captures, but these file formats can contain a great deal of additional  information that is not directly pertinent to DNS traffic analysis  and thus unnecessarily increases the capture file size. Additionally, these tools and formats typically have no filter mechanism to selectively record only certain fields at capture time, requiring post-processing for anonymisation or pseudonymisation of data to protect user privacy.”
> 
> Section 4, bullet point 2:
> 
> OLD: “Different users will have different requirements
>           for data to be available for analysis.  Users with minimal
>           requirements should not have to pay the cost of recording full
>           data, though this will limit the ability to perform certain
>           kinds of data analysis and also to reconstruct packet
>           captures.  For example, omitting the resource records from a
>           Response will reduce the C-DNS file size; in principle
>           responses can be synthesized if there is enough context.”
> 
> NEW: “Different operators will have different requirements
>           for data to be available for analysis.  Operators with minimal
>           requirements should not have to pay the cost of recording full
>           data, though this will limit the ability to perform certain
>           kinds of data analysis and also to reconstruct packet
>           captures.  For example, omitting the resource records from a
>           Response will reduce the C-DNS file size; in principle
>           responses can be synthesized if there is enough context.
>           Operators may have different policies for collecting user data
>           and can choose to omit or anonymise certain fields at
>           capture time, e.g., client address.”
> 
> And yes, in both Sections 6.2 and 6.2.3 we will add forward references to the Privacy Considerations section.
> 
> 
> > 
> > I'm also concerned about the policy/procedure for allocating/extending the
> > various bitfields and similar potential extension points in the data
> > structures.  Section 8 covers the major/minor versioning semantics with
> > respect to new map keys and new maps, but not addition of new bits within
> > existing (uint) bitmaps.  Given the usage of the CDDL .bits constraint,
> > it's not really clear that an IANA registry is the right tool to use, but I
> > think some indication of the expected way to allocate new bits is in order,
> > whether it's "a future standards-track document that updates this document"
> > or otherwise.  (I've noted many, but not all, instances of such bitmaps in
> > my COMMENT section.)
> 
> We are inclined to follow the lead of existing RFCs making use of CBOR, namely
> * RFC8152 'CBOR Object Signing and Encryption' (July 2017)
> * RFC8392 'CBOR Web Token (CWT)' (May 2018) and
> * RFC8428 'Sensor Measurement Lists (SenML)' (Aug 2018)
> and request IANA create a C-DNS registry with
> subregistries with keys for each of the different maps used in C-DNS.
> New entries in these subregistries would follow Expert Review as defined
> in RFC8126. This appears to be the emerging usual way of dealing with
> CBOR map key values, particularly integer ones.

That sounds like a fine path forward, thanks.

> > 
> > There are also a couple of fields whose semantics don't seem to be
> > sufficiently well specified for a proposed-standard document, such as
> > vlan-ids, generator-id, name-rdata, and ae-code.  (I understand that some
> > of them are probably only going to have locally relevant semantics, but we
> > should be explicit about when that's the case.)
> 
> Acknowledged, we’ll add references or clarifications for these (we will put details in a follow-up mail that will also address your comments below).

Sounds good.

> > 
> > If I'm reading things correctly that the IP address type is inferred from
> > the bytestring length, then I think we need to enforce a restriction on the
> > address prefix length(s) to allow for that inference to be unambiguous
> > (noting that we only have the *byte* length of the address fields at our
> > disposal for disambiguation, and not the more precise bit-length).
> 
> Ah, the first bit of the qr-transport-flags contains an IPv4/IPv6 flag, so the address type can be determined explicitly from that if it is set. But of course there is a corner case we hadn’t considered, where that field isn’t present, so we’ll have to address that. Making that field mandatory if prefixes are used would be simplest.

I guess I had forgotten about that bit in the qr-transport-flags on my
first read.  Making it mandatory if prefix lengths are present ought to
work.
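
To make the corner case concrete, here is a minimal Python sketch of the
ambiguity. The function names are hypothetical, and it assumes that an
address recorded with a prefix length is stored in the smallest number of
whole bytes that holds the prefix (that storage rule is my assumption
here, not a quote from the draft):

```python
import ipaddress
from typing import Optional


def stored_bytes(address: str, prefix_len: Optional[int] = None) -> bytes:
    """Return the byte string a capture might store for an address.

    Assumption: with a prefix length set, only the leading
    ceil(prefix_len / 8) bytes are kept, with bits past the prefix zeroed.
    """
    if prefix_len is None:
        return ipaddress.ip_address(address).packed
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return network.network_address.packed[: (prefix_len + 7) // 8]


def address_family(stored: bytes, ipv6_flag: Optional[bool] = None) -> str:
    """Guess the address family of a stored byte string."""
    if ipv6_flag is not None:
        # An explicit flag (as in qr-transport-flags) settles the question.
        return "IPv6" if ipv6_flag else "IPv4"
    if len(stored) > 4:
        # No IPv4 address or prefix needs more than 4 bytes.
        return "IPv6"
    # 4 bytes or fewer: full or truncated IPv4, or an IPv6 prefix <= 32 bits.
    return "ambiguous"


full_v4 = stored_bytes("192.0.2.1")         # full IPv4 address
short_v6 = stored_bytes("2001:db8::1", 32)  # IPv6 truncated to a /32

print(len(full_v4), len(short_v6))               # 4 4
print(address_family(short_v6))                  # ambiguous
print(address_family(short_v6, ipv6_flag=True))  # IPv6
```

Without the flag, the two 4-byte strings cannot be told apart by length
alone, which is why requiring the flag whenever prefix lengths are present
removes the ambiguity.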

-Benjamin

> 
> > 
> > 
> > ----------------------------------------------------------------------
> > COMMENT:
> > ----------------------------------------------------------------------
> > 
> > Section 2
> > 
> > Please consider using the RFC 8174 version of the BCP 14 boilerplate.
> > 
> > Section 3
> > 
> >   Because of these considerations, a major factor in the design of the
> >   format is minimal storage size of the capture files.
> > 
> > maybe "storage and transmission"?
> > 
> > Section 6
> > 
> > In Figure 2, the Query name is marked as "(q)" (only present if there is a
> > query), but the running text in Section 4 (bullet 1) says that the Question
> > section from the response can be used as an identifying QNAME if there is a
> > response with no corresponding query.  Am I misexpanding QNAME here, or is
> > there a disagreement between these two parts of the text?  In particular, I
> > do not see a part of Figure 2 that would correspond to a Question section
> > in the response, given the various "(q)"/"(r)" markings.
> > 
> > Section 6.2.2
> > 
> >   Messages with OPCODES known to the recording application but not
> >   listed in the Storage Parameters are discarded (regardless of whether
> >   they are malformed or not).
> > 
> > (Do we need to say anything to clarify that "discarded" is only w.r.t. the capture
> > process, and not meant to imply that DNS queries would not get a normal
> > response?)
> > 
> > Section 6.2.4
> > 
> > Please consider using IPv6 examples, per
> > https://www.iab.org/2016/11/07/iab-statement-on-ipv6/ .
> > 
> > Section 7.2
> > 
> >   o  The column T gives the CBOR data type of the item.
> > 
> >      *  U - Unsigned integer
> > 
> >      *  I - Signed integer
> > 
> > This is venturing a bit far from my normal area of expertise, but my
> > understanding is that CBOR native major types are only provided for
> > unsigned integer and negative integer, with "signed integer" being an
> > abstraction at a slightly higher layer that needs to be managed in the
> > application.  Do we need to add any clarifying text here or will the
> > meaning be clear to the reader?
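
As an illustration of the distinction raised here, the following is a
minimal Python sketch of how a "signed integer" maps onto the two CBOR
major types (0 for unsigned integers, 1 for negative integers, where the
encoded argument is -1 - value). The function name is hypothetical and the
sketch covers only integers that fit in 64 bits:

```python
import struct


def cbor_encode_int(value: int) -> bytes:
    """Encode a Python int as a CBOR major type 0 or major type 1 item."""
    if value >= 0:
        major, arg = 0, value       # major type 0: unsigned integer
    else:
        major, arg = 1, -1 - value  # major type 1: negative integer
    if arg < 24:
        # Small arguments are carried directly in the initial byte.
        return bytes([(major << 5) | arg])
    # Otherwise the additional information selects a 1, 2, 4 or 8 byte argument.
    for info, fmt, limit in (
        (24, ">B", 1 << 8),
        (25, ">H", 1 << 16),
        (26, ">I", 1 << 32),
        (27, ">Q", 1 << 64),
    ):
        if arg < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, arg)
    raise ValueError("integer does not fit in 64 bits")


print(cbor_encode_int(1000).hex())   # 1903e8
print(cbor_encode_int(-1000).hex())  # 3903e7
```

The application-level "signed integer" is thus a choice between two wire
types made at encoding time, and a decoder has to accept either one.
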
> > 
> > Section 7.4
> > 
> > Should probably forward-reference section 8 for the format version numbers'
> > semantics.
> > 
> > Section 7.4.1.1
> > 
> > We should reference the IANA registries by name for any of these fields
> > (e.g., opcodes, rr-types, etc.).  (Also in Section 7.5.3.1, etc.)
> > 
> > Are the storage flags going to be allocated in sequence by updating
> > standards-track documents, or some other mechanism?  (Is a registry
> > necessary?)
> > 
> > For the various address prefix fields, do we need to specify that the full
> > addresses are stored when the corresponding prefix field is absent?
> > 
> > Section 7.4.1.1.1
> > 
> > Am I parsing the "query-response-hints" text correctly to say that a bit is
> > set in the bitmap if the corresponding field is recorded (if present) by
> > the collecting implementation?  The causality of "if the field is omitted
> > the bit is unset" goes in a direction that is not what I expected.
> > (Similarly for the other fields in this table.)
> > 
> > Section 7.4.2
> > 
> > Do we need a reference for "promiscuous mode"?
> > 
> > Just to check: in "server-addresses", I just infer the IP version from the
> > length of the byte string?
> > 
> > Do we need to say more about where the vlan-ids identifiers are taken from?
> > 
> > Is the "generator-id" string intended to only be human readable?  Only
> > within a specific (administrative) context?
> > 
> > Section 7.5.1
> > 
> > Does "earliest-time" include leap seconds?
> > 
> > Section 7.5.3
> > 
> > The "ip-address" description seems to imply that very short ipv6 prefix
> > lengths could cause confusion as to the address type being indicated (e.g.,
> > setting to 32 when no ipv4 prefix length is set, or setting to the same
> > value as the ipv4 prefix length).  Do we need to restrict the ipv6 prefix
> > lengths to being 33 or larger?
> > 
> > Are the "name-rdata" contents in wire format or presentation format?
> > 
> > Section 7.5.3.2
> > 
> > What's the allocation policy/procedure for the remaining
> > qr-transport-flags transport values?  For additional bits in any/all of the
> > flags fields listed here?
> > 
> > Something of a side note, what's the mnemonic for the "sig" in
> > "qr-sig-flags"?  That is, what is it a signature of or over (it doesn't
> > seem like it's a cryptographic signature, which may be what is confusing
> > me)?
> > 
> > For "query-rcode"/"response-rcode", should there be a reference for "OPT",
> > and/or for any of the EDNS stuff in here?  (The Terminology section only
> > mentions using the naming from RFC 1035, that I can see.)
> > 
> > The "mm-transport-flags" here bear a striking resemblance to the
> > "qr-transport-flags" from Section 7.5.3.2; should there be a shared
> > registry for their contents?  (I guess the TransportFlags CDDL to some
> > extent serves this function.)
> > 
> > Section 7.7
> > 
> > How is the value of the "ae-code" determined?
> > 
> > Appendix A
> > 
> > We could perhaps apply some constraints on (e.g.) the address-prefix length
> > fields to be .le the relevant lengths.
> > 
> > Appendix C.6
> > 
> >                                           Using a strong compression,
> >   block sizes over 10,000 query/response pairs would seem to offer
> >   limited improvements.
> > 
> > nit: Using a strong compression scheme
> > 
> >