[Dots] AD review of draft-ietf-dots-telemetry-16
Benjamin Kaduk <kaduk@mit.edu> Wed, 24 November 2021 22:41 UTC
Date: Wed, 24 Nov 2021 14:41:40 -0800
From: Benjamin Kaduk <kaduk@mit.edu>
To: draft-ietf-dots-telemetry.all@ietf.org
Cc: dots@ietf.org
Message-ID: <20211124224140.GH93060@kduck.mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/oP2MrvGEhXsct4qVed6B0uC_QfU>
Subject: [Dots] AD review of draft-ietf-dots-telemetry-16

Hi all,

Sorry to have been working on this for so long -- some health issues
intervened, and it is perhaps easier to let oneself be interrupted if
there is no hope of finishing in a single sitting :-/

[Note that I reviewed the -16 but the -17 is the current version, which
includes some editorial work I had sent in a PR. It looks like that will
not have changed many things I comment on.]

This review is pretty long (which I guess befits a long document). We
probably want to chunk up replies to it so that the emails/threads stay
manageable.

A couple meta-comments on the other reviews so far:

- the shepherd writeup only answers one of the three questions in point
  (1). It is likely that Murray will point this out in his ballot if not
  updated before then.

- it's a little surprising that the yangdoctor didn't ask for encoded
  examples of the RESTCONF (data channel) functionality. (For the
  sx:structure parts, I think we're already in good shape on examples.)
  That said, I don't think our use of RESTCONF is particularly novel, and
  am happy to proceed without such examples if desired.

High-level remarks before going into the detailed section-by-section
comments:

I'm happy to see the note at the end of §4.5 that the data channel is
available to optimize the data that needs to be exchanged over the signal
channel during attack-time, in some sense reiterating the original split
between signal and data channels (but see my note in the inline comments
about the "MAY"). But this is already rather far into the design
overview, let alone the document overall! I think it would be helpful to
have a paragraph or two in the top-level §4 to get in front of how the
telemetry mechanisms integrate into the signal+data channels, perhaps
something like:

% The DOTS protocol suite is divided into two logical channels: the signal
% channel and data channel. This division is due to the vastly different
% requirements placed upon the traffic they carry.
% The signal channel must remain available and usable even in the face of
% attack traffic that might, for example, saturate one direction of the
% links involved, rendering acknowledgment-based mechanisms unreliable and
% strongly incentivizing messages to be small enough to be contained in a
% single IP packet. In contrast, the data channel is available for
% high-bandwidth data transfer before or after an attack, using more
% conventional transport protocol techniques. It is generally preferable
% to perform advance configuration over the data channel, including
% configuring short aliases for static or nearly static data sets such as
% sets of network addresses/prefixes that might be subject to related
% attacks -- this helps to optimize the use of the signal channel for the
% small messages that truly require reliable delivery during an attack.
%
% Telemetry information has aspects that correspond to both operational
% modes: there is certainly a need to convey updated information about
% ongoing attack traffic and targets during an attack, so as to convey
% detailed information about mitigation status and inform updates to
% mitigation strategy in the face of adaptive attacks, but it is also
% useful to provide mitigation services with a picture of normal or
% "baseline" traffic towards potential targets, to aid in detecting when
% incoming traffic deviates from normal into being an attack. Likewise,
% one might populate a "database" of classifications of known types of
% attack so that a short alias can be used during attack time to describe
% an observed attack. This specification does make provision for use of
% the data channel for the latter function, but otherwise retains most
% telemetry functionality in the signal channel.
% This is partly out of necessity and partly out of expedience -- it is a
% functional requirement to convey information about ongoing attack
% traffic during an attack, and information about baseline traffic uses an
% essentially identical data structure that is naturally defined to sit
% next to the description of attack traffic. The related telemetry setup
% information used to parametrize actual traffic data is also sent over
% the signal channel, out of expediency. [I can't actually populate this
% part; maybe it's something about having the configuration values that
% define the percentile measurements be in-band with the percentile
% measurements themselves, and then a desire to not define two similar
% setup schemes over data+signal channels that justifies the other
% telemetry-config and pipe branches?]

The use of the "list telemetry" in the telemetry-setup message confused
me for a while, since it seems like the client can only send a zero- or
one-element list to the server (based on the 'tsid' for that direction
being part of the Uri-Path). Can you confirm that the list structure was
used just for the server-to-client direction combined with maximizing
data-structure reuse?

We should say somewhere that the examples use CBOR presentation format.
(This could be a note near the terminology section or accompanying each
example.)

I think we should probably include a brief refresher on percentile
calculation, in addition to the reference to RFC 2330. Right now we
basically just say that the peers negotiate parameters including
"percentile-related measurement parameters", which was not quite enough
to clue me in that 2330 was basically required reading to know what we
actually do. So this might look something like:

% DOTS telemetry uses several percentile values to provide a picture of a
% traffic distribution overall, as opposed to just a single snapshot of
% observed traffic at a single point in time.
% Modeling raw traffic flow data as a distribution and describing that
% distribution entails choosing a measurement period that the distribution
% will describe, and a number of sampling intervals, or "buckets", within
% that measurement period. Traffic within each bucket is treated as a
% single event (i.e., averaged), and then the distribution of buckets is
% used to describe the distribution of traffic over the measurement
% period. A distribution can be characterized by statistical measures such
% as mean, median, and standard deviation, and also by reporting the value
% of the distribution at various percentile levels of the data set in
% question, such as "quartiles" that correspond to 25th, 50th, and 75th
% percentile. More details about percentile values and their computation
% are found in Section 11.3 of [RFC2330]; DOTS telemetry uses up to three
% percentile values, plus the overall peak, to characterize traffic
% distributions. Which percentile thresholds are used for these "low",
% "medium", and "high" percentile values is configurable (with suitable
% defaults).

It seems like we are setting up somewhat conflicting goals between the
overall desire for compact signal-channel messages and our guidance to
"include in any request to update information related to [a thing] the
information of other [like things] (already communicated using a lower
['tsid'/'tmid'] value)" that results in larger individual messages
(albeit fewer of them). Do we want to make some statement about how this
guidance to coalesce can be ignored if needed to make individual
signal-channel messages fit in a single packet? (Hmm, I guess we do have
a note about "assumes that all link information can fit in one single
message" at least for the 'tsid'/link case, but the text from that note
only appears once.)
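
To make the suggested percentile text above more concrete, here is a minimal Python sketch of the bucket-then-percentile summary it describes. It is an illustration under stated assumptions, not what the draft specifies: the function names, the bucket size, and the 10/50/90 thresholds are invented for the example (the draft makes the thresholds configurable), the percentile rule is the "smallest value with at least p percent of the data at or below it" reading of an empirical distribution, and taking the peak over bucket averages is a simplification.

```python
import math
from typing import Dict, List, Sequence


def bucket_averages(samples: Sequence[float], bucket_size: int) -> List[float]:
    """Average consecutive raw samples into buckets of `bucket_size` samples."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets = []
    for start in range(0, len(samples), bucket_size):
        chunk = samples[start:start + bucket_size]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


def percentile(values: Sequence[float], p: float) -> float:
    """Smallest value v such that at least p percent of `values` are <= v."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted data
    return ordered[max(rank, 1) - 1]


def summarize(samples: Sequence[float], bucket_size: int,
              low: float = 10, mid: float = 50, high: float = 90) -> Dict[str, float]:
    """Low/mid/high percentile values plus peak; thresholds are assumed defaults."""
    buckets = bucket_averages(samples, bucket_size)
    return {
        "low": percentile(buckets, low),
        "mid": percentile(buckets, mid),
        "high": percentile(buckets, high),
        "peak": max(buckets),  # simplification: peak of bucket averages
    }


if __name__ == "__main__":
    # Twenty hypothetical per-second traffic samples, two samples per bucket.
    print(summarize(list(range(1, 21)), bucket_size=2))
```

The point of the sketch is that the measurement period, the bucket size, and the three thresholds all have to be known before the reported numbers mean anything, which is why a short refresher in the document would help.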

My editorial suggestions in
https://github.com/boucadair/draft-dots-telemetry/pull/9 included some
edits relating to whether a given target is "susceptible to" vs "subject
to" (or even "actively being subjected to") different types of attacks.
Med has diligently already merged that PR while I was sick, so I mostly
assume that these already got reviewed for correctness. Still, I want to
call out that there is a difference and I was not always 100% sure that I
picked the right one for each case.

It looks like we have to use backslash continuation markers for several
of the CoAP request target resources in the examples; do we want to
reference RFC 8792 folding and use its specific note text about line
wrapping?

There are a few places where we say something like "the DOTS client MUST
auto-scale so that the appropriate unit is used". (1) We don't
specifically say what the goal of this automatic behavior is (I guess,
use the "largest unit" that gives a value greater than one?), and (2) it
seems that we are in part relying on this behavior to ensure that the
client only specifies one measure in a given unit class. That is, I
didn't see anything else that would prevent a client from sending (say)
both Mbps and Gbps values, that could of course be in conflict with each
other -- if there's a potential for conflict we'd need to say how to
resolve the conflict.

Relatedly, there seems to *also* be potential for conflict in that we
allow both bit/second and Byte/second as unit-classes. What is to happen
if I say that something is both 2 bit/second and 2 Byte/second?

We have a lot of examples that use the same 'cuid' of
dz6pHjaADkaFTbjr0JGBpw, which is fine and representative of normal
operation. Do we want to curate the 'tsid' and 'tmid' values used by that
client?
I see some reuse, which I think is not prohibited (at least for 'tsid'),
but may merit checking, and the values are not monotonically increasing
throughout the course of the document, which may or may not be desirable.

(nit) we seem to have both "signalled" and "signaled" present, and should
standardize on our spelling. (It looks like the one-'l' form is more
common in the RFC archive so far.)

Section 1

   Distributed Denial of Service (DDoS) attacks have become more
   sophisticated. [...]

We might want to say something about the baseline for the comparison
("more sophisticated compared to what?").

   100 Mbps to 100s of Gbps or even Tbps. Attacks are commonly carried
   out leveraging botnets and attack reflectors for amplification attacks
   such as NTP (Network Time Protocol), DNS (Domain Name System), SNMP
   (Simple Network Management Protocol), or SSDP (Simple Service
   Discovery Protocol).

Is https://datatracker.ietf.org/doc/html/rfc4732#section-3.1 still
current enough to make a useful reference for amplification attacks?

   Nevertheless, when DOTS telemetry attributes are available to a DOTS
   agent, and absent any policy, it can signal the attributes in order to
   optimize the overall mitigation service provisioned using DOTS.

I'm not sure what type of policy we have in mind, here.

Section 2

   The reader should be familiar with the terms defined in [RFC8612].

It looks like RFC 8612 doesn't define "idle time" and "attack time", so
we may want to pull in a couple more documents as required reading or
define them ourselves.

Section 3.1

   If the DOTS server's mitigation resources have the capabilities to
   facilitate the DOTS telemetry, the DOTS server adapts its protection
   strategy and activates the required countermeasures immediately
   (automation enabled) for the sake of optimized attack mitigation
   decisions and actions.

I'm not entirely sure I understand this part.
It seems to be talking about telemetry between the DOTS server and the
associated mitigation resources, but the previous discussion has been
about telemetry between DOTS client and DOTS server (and I do not think
the mitigation resources associated with the DOTS server have been said
to be DOTS clients in any of the previous DOTS work).

Section 3.2

   DOTS telemetry can also be used to tune the DDoS mitigators with the
   correct state of an attack. During the last few years, DDoS attack

(nit) I don't think "with" is the right word -- "tune with <X>" implies
that X is used as a tool to effectuate the tuning, but I think what's
going on here is more like using the telemetry data as an input for
determining what values to use for the tuning parameters available on the
mitigation resources.

   Mitigation of attacks without having certain knowledge of normal
   traffic can be inaccurate at best. This is especially true for
   recursive signaling (see Section 3.2.3 in [I-D.ietf-dots-use-cases]).

RFC 8903 makes no mention of "recursive", "recursion", or any
hierarchical mitigation scenario (at least in my quick search). Even the
linked draft version (-25) doesn't have a section 3.2.3. I do remember
reading about this type of recursive or hierarchical scenario somewhere,
so I think we just need to locate the right reference to put here...I'm
just not sure what reference that is, offhand.

   In addition, the highly diverse types of use cases where DOTS clients
   are integrated also emphasize the need for knowledge of each DOTS
   client domain behavior. Consequently, common global thresholds for
   attack detection practically cannot be realized. [...]

(editorial?) When we say "the highly diverse types of use cases where
DOTS clients are integrated" that's essentially an unsupported claim, but
the way it's written we are presupposing that claim to be true without
directly stating it as an observation or assumption.
I think we might be better served by starting off with a declarative
statement like "DOTS clients can be integrated in a highly diverse set of
scenarios and use cases", and then moving on from that now-explicit
assumption/fact to conclude that any single global threshold will
mischaracterize the traffic for some of those diverse DOTS clients.

Section 4.3

   DOTS clients can also use CoAP Block1 Option in a PUT request (see
   Section 2.5 of [RFC7959]) to initiate large transfers, but these
   Block1 transfers will fail if the inbound "pipe" is running full, so
   consideration needs to be made to try to fit this PUT into a single
   transfer, or to separate out the PUT into several discrete PUTs where
   each of them fits into a single packet.

If I understand/recall correctly, the Block1 transfer is expected to fail
only in a statistical sense, since the client can't send more blocks
until it gets the positive reply from the server to continue to the next
block (on a block-by-block basis), and that reply from the server is
competing with the attack traffic for the inbound "pipe" and likely to
fail for at least one of the blocks. I think we should probably reword
slightly to say only "expected to fail" or "likely fail", since there is
some small chance of success; we may also want to give a brief reminder
about why it is expected to fail (e.g., "the transfer requires a message
from the server for each block, which would likely be lost in the
incoming flood").

Section 6

   Telemetry setup configuration is bound to a DOTS client domain. DOTS
   servers MUST NOT expect DOTS clients to send regular requests to
   refresh the telemetry setup configuration. Any available telemetry
   setup configuration has a validity timeout of the DOTS association
   with a DOTS client domain. [...]

The term "DOTS association" does not seem to have been used previously
(in RFC 9132 we discuss the "DOTS session" heavily, though).
Also, I don't remember any previous requirement to keep state on the
server for the duration of anything scoped to the entire client *domain*,
just individual DOTS sessions. We do mention detecting
conflicts/overlapping requests within the scope of a client domain, in
RFC 9132, but as far as I can tell that only holds when all the DOTS
sessions involved in the conflict are still active.

Perhaps related, this discussion (at least so far) is not clear to me
about what level of coordination/consistency is expected between clients
in the same domain. I believe that stock DOTS works fine with no such
coordination amongst clients within a domain, so if we are introducing
such a requirement we should say prominently that it's a change.

Section 6.1.1

   Upon receipt of such request, and assuming no error is encountered by
   processing the request, the DOTS server replies with a 2.05 (Content)
   response that conveys the current and telemetry parameters acceptable
   by the DOTS server. [...]

(editorial) Something seems off, here (around "current and telemetry
parameters acceptable"). Is it returning current configuration,
acceptable parameter values, or some combination thereof?

   |  |  +-- query-type*  query-type

Since this is a leaf-list of supported query types, should the list name
include the word "supported" or similar?

Section 6.1.2

   The PUT request with a higher numeric 'tsid' value overrides the DOTS
   telemetry configuration data installed by a PUT request with a lower
   numeric 'tsid' value. To avoid maintaining a long list of 'tsid'
   requests for requests carrying telemetry configuration data from a
   DOTS client, the lower numeric 'tsid' MUST be automatically deleted
   and no longer be available at the DOTS server.

The way this is phrased with "the lower" and "higher" (vs "highest")
assumes or implies that there is only one "lower" 'tsid' value, perhaps
leaving some ambiguity if there is actually more than one.
We might make a statement about how there is only at most one active
'tsid' per 'cuid'+'cdid' at a time other than during a config change (if
that's the intent), or alternatively to qualify that the requirement to
remove is only incurred in the event of a conflict (as is done for other
types of config, later).

   o  If the request is missing a mandatory attribute, does not include
      'cuid' or 'tsid' Uri-Path parameters, or contains one or more
      invalid or unknown parameters, 4.00 (Bad Request) MUST be returned
      in the response.

Just to confirm: this does not rule out the ability to define new
parameters in the future (for example, the client might learn of new ones
in the response to a GET request)?

Section 6.3

   *  The maximum number of requests allowed per second to the target.

Should we say anything about the requirement on the protocol in question
that "request" is a meaningful concept and observable by the mitigator
(analogous to what we have about "embryonic connections" earlier)?

   |  +-- baseline* [id]
   |     +-- id
   |     |       uint32
   |     +-- target-prefix*
   |     |       inet:ip-prefix

(nit) the formatting here is a bit surprising, to wrap the line between
leaf name and type. But if that's what pyang gives, we probably don't
want to mess with it...

   |  |  +-- partial-request-ps?  uint64
   |  |  +-- partial-request-client-ps?  uint64

I wonder whether the limit on "partial requests" should really be a rate
("per second") versus a point-in-time cap (i.e., "outstanding partial
requests"). It seems like a given request could in principle remain in
"partial" state for an extended period of time, and that having remained
in such a state for a second should not justify the client being able to
produce more partial requests...but the current formulation as a rate
seems to do so.

Section 6.3.1

   Two PUT requests from a DOTS client have overlapping targets if there
   is a common IP address, IP prefix, FQDN, URI, or alias-name.
   Also, two PUT requests from a DOTS client have overlapping targets if
   the addresses associated with the FQDN, URI, or alias are overlapping
   with each other or with 'target-prefix'.

There can be some subtlety here involving where the FQDN/URI/alias are
resolved into IP addresses from; we may want to say "from the perspective
of the server", "as observed by the server", or similar.

Section 7

   DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data with
   mitigation requests relying upon the target attribute. In

(nit??) Is this "target attribute" a telemetry attribute, or something
else (an attack target?)?

   When generating telemetry data to send to a peer, the DOTS agent MUST
   auto-scale so that appropriate unit(s) are used.

(editorial) This requirement is not terribly tightly connected to either
what comes after it in the text or what comes before it in the text. We
might consider moving it (earlier, maybe?) and adding a bit more
transition phrasing about how the agent may send telemetry data many
times and in different situations, so the right unit to use will vary
over time.

Section 7.1.1

   If the target is subjected to bandwidth consuming attack, the
   attributes representing the percentile values of the 'attack-id'
   attack traffic are included.

   If the target is subjected to resource consuming DDoS attacks, the
   same attributes defined for Section 7.1.4 are applicable for
   representing the attack.

One of these just says "the attributes representing" and the other
references attributes defined in a specific section; can we use the same
formulation/phrasing to talk about the two cases?

   This is an optional attribute.

Er, which one is optional? Just a few paragraphs earlier we said that at
least one attribute MUST be present (so as to be able to identify a
target).

Section 7.1.2

   The 'total-traffic' attribute (Figure 26) conveys the percentile
   values (including peak and current observed values) of total traffic
   observed during a DDoS attack. [...]

This attribute is under the "pre-or-ongoing-mitigation" hierarchy; is it
really right to only say "traffic observed during a DDoS attack" (i.e.,
implicitly excluding the "pre-attack" case)?

Section 7.1.3

   The 'total-attack-traffic' attribute (Figure 27) conveys the total
   attack traffic identified by the DOTS client domain's DDoS Mitigation
   System (or DDoS Detector). [...]

(editorial?) "Identified by [the client domain]" seems to imply that this
is only used in telemetry reports from client to server, and not in
updates flowing the other direction. Is that correct?

Section 7.1.4

   +-- total-attack-connection
   |  +-- low-percentile-l* [protocol]
   |  |  +-- protocol  uint8
   |  |  +-- connection?  yang:gauge64
   |  |  +-- embryonic?  yang:gauge64
   |  |  +-- connection-ps?  yang:gauge64
   |  |  +-- request-ps?  yang:gauge64
   |  |  +-- partial-request-ps?  yang:gauge64

I think I'm confused about what semantics this is trying to represent.
The "low-percentile-l" seems like it should be providing aggregate
information about some particular snapshot of traffic that we have
determined to be representative of the "low percentile" level of attack
traffic. That is, if we take the recorded attack traffic and divide it
into a bunch of bins on the time axis, we can order the respective bins
from "least noticeable attack" to "worst attack", and we pick the fifth
percentile (or whatever is configured) bin to represent the
"low-percentile" bucket. But now we're reporting a bunch of attributes of
that bin -- number of connections, embryonic connections, connections per
second, etc., even though the bins were ordered based on a single "worst"
metric (and I don't see us specify what metric we actually use to do that
ordering).
So even though this is the "low percentile" (e.g., fifth percentile) bin
of attack traffic (i.e., probably not a very bad attack), we can't say
specifically that the reported total attack connection count is fifth
percentile and the embryonic connection count is fifth percentile and the
requests-per-second is also fifth percentile. Which makes me confused as
to why the list is structured this way -- if we were just saying "for
each of connection, embryonic, etc., bin the samples over that metric and
compute the low/med/high percentile values for that single metric, and
this list holds those resulting values", it seems like a more natural
structure to group things would be to have a list of low/med/high for
each of the attributes/metrics.

Section 7.1.5

   vendor-id: Vendor ID is a security vendor's Enterprise Number as
      registered with IANA [Enterprise-Numbers]. It is a four-byte
      integer value.

   attack-id: Unique identifier assigned for the attack.

If the vendor-id is being used as a scoping value to let each vendor
assign attack-id values, then we should say so here, the first place we
really handle this part of the YANG tree. Even if not, we should probably
say more about why we care what the vendor-id is, and we should also say
who assigns the attack-id. (Maybe it's scoped to the DOTS server; I don't
know yet at this point in the document!)

Furthermore, the module structure where attack-id and vendor-id are
always required, but there is going to need to be the ability to assign a
new attack description at runtime, seems to force us to conflate
externally/statically assigned attack-ids and dynamically assigned ones
in the same number space. Should we give any guidance to vendors about
how to allocate IDs in a way that will not produce collisions between
predistributed/fixed attack IDs and new ones created at runtime?
Or is the thought that there will be no dynamic attack-ids since that
just corresponds to the attack-description string and a generic
description can be used if there is not a better one available?

Regardless, I think we should have more verbiage as to what the scope of
the attack-id is -- what information is it supposed to indicate or
replace?

   start-time: The time the attack started. The attack's start time is
      expressed in seconds relative to 1970-01-01T00:00Z in UTC time

(nit) "in UTC time" and "Z" seem redundant (which does not necessarily
imply that we should only use one of them...).

      (Section 3.4.2 of [RFC8949]). The CBOR encoding is modified so
      that the leading tag 1 (epoch-based date/time) MUST be omitted.

(nit) Maybe we should move the CBOR section reference to after we start
talking about CBOR encoding. (And we should do the same for 'end-time',
if we do anything.)

   top-talker: A list of top talkers among attack sources. The top
      talkers are represented using the 'source-prefix'.

Is there a good reference or short description for the notion of
top-talkers? I know it's a well-established term of art in many different
circles, but it's hard to be fully confident that all readers will be
familiar with it.

Also, should we say anything about how the number of talkers to include
is determined? (I assume it is best left to the discretion of the sender,
but we probably want to set a clear expectation that the list is not all
talkers or a complete list in any other way.)

      'spoofed-status' indicates whether a top talker is a spoofed IP
      address (e.g., reflection attacks) or not.

Is this something that can be unambiguously determined by all parties
that might be producing this telemetry data, or should we use a tri-state
(for "unknown") rather than a boolean to represent it?
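
As an aside on the 'start-time' encoding quoted above (seconds relative to 1970-01-01T00:00Z), a small Python sketch like the following can be used to sanity-check the timestamp values that the draft's examples carry. The helper name `epoch_to_utc` is made up for this illustration, and the two inputs are the example values that come up later in this review (the 'last-updated' value in Section 7.1.5 and the 'start-time' value in Section 7.2).

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds) -> str:
    """Render seconds since 1970-01-01T00:00Z as an ISO 8601 UTC timestamp.

    Accepts an int or a decimal string, since the examples show the
    values as quoted strings.
    """
    moment = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    for value in ("1576856561", "1957811234"):
        print(value, "->", epoch_to_utc(value))
```

This is only a checking aid; it says nothing about how the value is carried in CBOR, where the draft omits the leading tag 1.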

   In order to optimize the size of telemetry data conveyed over the DOTS
   signal channel, DOTS agents MAY use the DOTS data channel [RFC8783] to
   exchange vendor specific attack mapping details (that is, {vendor
   identifier, attack identifier} ==> attack description). As such, DOTS
   agents do not have to convey systematically an attack description in
   their telemetry messages over the DOTS signal channel.

It's a little surprising to me that this is only a MAY. Does it make
sense as a SHOULD (expected to usually happen, but leaving open the
flexibility when an observed attack does not match a previously
characterized attack description)?

Also, when reading this I assumed that the "attack description" being
mapped to could include things like the target port/protocol, whether it
was spoofed, etc. But reading on to the ietf-dots-mapping module, it
seems that this is intended to just be the single "attack-description"
string. If so, we might give some more indications of that, e.g., by
spelling it that way and having some prose about it being a "string", or
similar.

Also^2, from here to the end of the section might benefit from being
split off into a dedicated subsection on using the data channel to
pre-populate vendor and attack description information.

   tables are at different revisions. The DOTS client SHOULD transmit
   telemetry information using the vendor mapping(s) that it provided to
   the DOTS server and the DOTS server SHOULD use the vendor mappings(s)
   provided to the DOTS client when transmitting telemetry data to peer
   DOTS agent.

Having read a bit further on, I'm not actually sure where in the protocol
the client has provided vendor mappings to the server, and the server
provided vendor mappings to the client. It *might* be the GET and POST
from/to dots-data/ietf-dots-mapping:vendor-mapping, but we don't really
give a clear indication of that. I think we should try to be more precise
about what we mean by "provided".

(editorial) would this then be "SHOULD use any vendor mapping(s)"
(s/the/any/)?

   augment /data-channel:dots-data/data-channel:capabilities:
     +--ro vendor-mapping-enabled?  boolean {dots-telemetry}?

None of the other capabilities in this tree (as specified by RFC 8783)
include a name suffix like "-enabled". Should this just be
"vendor-mapping"?

   "vendor-id": 1234,
   "vendor-name": "mitigator-s",
   "last-updated": "1576856561",
   "attack-mapping": []

[this last updated is from December 2019; we could probably pick
something more current if we wanted, but it doesn't really matter.]

   The DOTS client reiterates the above procedure regularly (e.g., once a
   week) to update the DOTS server's vendor attack mapping details.

Do the updates have to preserve any existing assignments?

   If the DOTS client concludes that the DOTS server does not have any
   reference to the specific vendor attack mapping details, the DOTS
   client uses a POST request to install its vendor attack mapping
   details. [...]

This suggests that a client would not bother sending its own vendor
mapping if the server already has one, and at least to me further implies
that the client would use the server-provided mapping for the messages
that the client generates. This seems at odds with the earlier guidance
that the client should send using the vendor-mapping it provided to the
server, and the server should send using the vendor-mapping it provided
to the client. Which one is it?

   The DOTS server indicates the result of processing the POST request
   using the status-line. Concretely, "201 Created" status-line MUST be
   returned in the response if the DOTS server has accepted the vendor
   attack mapping details. If the request is missing a mandatory
   attribute or contains an invalid or unknown parameter, "400 Bad
   Request" status-line MUST be returned by the DOTS server in the
   response. [...]
I think (but did not specifically check) that these specific status code values are preexisting requirements of core RESTCONF (and HTTP), so we may not need to write new "MUST" keywords.

   If the request is received via a server-domain DOTS gateway, but the DOTS server does not maintain a 'cdid' for this 'cuid' while a 'cdid' is expected to be supplied, the DOTS server MUST reply with "403 Forbidden" status-line and the error-tag "access-denied". Upon receipt of this message, the DOTS client MUST register (Section 5.1 of [RFC8783]).

(nit) the analogous text in RFC 8783 itself refers only to "Section 5" (not 5.1).

   The DOTS client uses the PUT request to modify its vendor attack mapping details maintained by the DOTS server (e.g., add a new mapping).

As above, is this really just an example behavior or a hard requirement to preserve existing mappings?

Section 7.2

   "attack-detail": [
     {
       "vendor-id": 1234,

The vendor-ID is supposed to be an IANA-assigned PEN, and 1234 is assigned to Linkage Software Inc. We have other options, like 32473, "Example Enterprise Number for Documentation Use" (RFC 5612).

   "start-time": "1957811234",

This value represents a time in 2032; should we be using something a little more current?

Section 7.3

   This request MUST be maintained active by the DOTS server until a delete request is received from the same DOTS client to clear this pre-or-ongoing-mitigation telemetry.

It seems like we might want to allow some provision for a server to clean up state associated with a client that disappears entirely without cleaning itself up. For example, is the server allowed to discard state when it reboots, or does this requirement extend to writing the state to persistent storage?

   If more than one Uri-Query option is included in a request, these options are interpreted in the same way as when multiple target attributes are included in a message body.
I a little bit wonder if we could put a section reference here for "interpreted in the same way", since the reader may not remember what that way is, at least on the first reading.

   parameters. DOTS clients MUST NOT include a name in which the "*" character is included in a label other than the leftmost label.

Do we want to say what the server should do if it gets one anyway? (It's not really clear that we need to say anything, strictly speaking.)

   "vendor-id": 1234,
   "attack-id": 77,
   "start-time": "1957818434",

[same as above]

   A DOTS client that is not interested to receive pre-or-ongoing-mitigation telemetry data for a target MUST send a delete request similar to the one depicted in Figure 37.

I'm not sure that this strictly needs to be a "MUST"; it seems like we could say that such a client "sends a delete request" and leave it at that.

Section 8.1

   +-- attack-detail* [vendor-id attack-id]
   [...]
   +-- top-talker
   +-- talker* [source-prefix]
   [...]
   +-- total-attack-connection
   +-- low-percentile-c
   | +-- connection? yang:gauge64

I'm kind of surprised that there is no 'protocol' leaf in this subtree; it seems to usually appear in combination with these connection counts.

   | +-- peak-g? yang:gauge64
   | +-- current-g? yang:gauge64
   [...]
   | +-- peak-g? yang:gauge64
   | +-- current-g? yang:gauge64

(nit) the indentation seems off for type of these two 'current-g's.

   In order to signal telemetry data in a mitigation efficacy update, it is RECOMMENDED that the DOTS client has already established a DOTS telemetry setup session with the server in 'idle' time.

If I understand correctly, it seems that at least part of the need for having a preestablished telemetry session is that the mitigation efficacy update is identified only by the 'mid', and so we cannot specify a 'tmid' as part of such an efficacy update.
Thus, in order for the server to associate the telemetry data in the efficacy update with an ongoing telemetry status session, the 'tsid' must be associated with the 'mid' that is the subject of the efficacy update. If that is correct, then I think we should have some text here about how the efficacy update does not/cannot include the 'tmid', and so is associated with the 'mid' of the update.

Section 8.2

   As defined in [RFC8612], the actual mitigation activities can include several countermeasure mechanisms. The DOTS server signals the current operational status of relevant countermeasures. A list of attacks detected by each countermeasure MAY also be included. The

I see how in stock DOTS signal channel, the server signals whether an attack is fully/partially/not mitigated, but I don't see how the server indicates the specific relevant countermeasures that were used. Am I just missing something there? (Similarly, the list of attacks we returned may or may not be at the granularity of "detected by each countermeasure", depending on the answer to the previous part.)

Also (editorial), I think we should be more clear about when we switch from describing RFC 9132 behavior to describing the new functionality provided by this document.

   Figure 46 shows an example of an asynchronous notification of attack mitigation status from the DOTS server. This notification signals

Is the "new stuff" in Figure 46 just the subtree under :(server-to-client-only)? The 'total-attack-traffic' leaf seems to be the same content as in Figure 45, and maybe could be omitted?

   "mitigation-start": "1507818434",

(This date is from 2017; as above, we could use a more current one if we want.)

   "source-count": {
     "peak-g": "10000"

(Suspiciously round numbers like this are indications of fabricated data ... which is totally reasonable for an example like this! But if we wanted to be "more realistic" we could use a less-round number.)
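
For anyone who wants to double-check the dates attributed to the example timestamps in this review, a minimal sketch follows. It assumes the quoted strings are decimal seconds since the Unix epoch; the helper name epoch_to_utc is made up for illustration and is not part of the draft.

```python
from datetime import datetime, timezone


def epoch_to_utc(value: str) -> datetime:
    """Interpret a decimal string as seconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Values copied from the draft's examples as quoted in this review.
examples = {
    "last-updated": "1576856561",
    "start-time": "1957811234",
    "mitigation-start": "1507818434",
}

for name, value in examples.items():
    # Print each timestamp as an ISO 8601 UTC date-time.
    print(f"{name}: {epoch_to_utc(value).isoformat()}")
```

Picking a replacement value is then just a matter of calling int(datetime(...).timestamp()) on a more current date.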
   DOTS clients can filter out the asynchronous notifications from the DOTS server by indicating one or more Uri-Query options in its GET request. A Uri-Query option can include the following parameters: 'target-prefix', 'target-port', 'target-protocol', 'target-fqdn', 'target-uri', 'alias-name', and 'c' (content) (Section 4.2.4). The

Up in §7.3 the analogous list also included 'mid'. Is 'mid' appropriate here (or is it implicitly already included by the base signal channel mechanisms)? Hmm, but maybe 'c' is also already allowed by the base signal channel mechanism, so that theory is not great...

   If the target query does not match the target of the enclosed 'mid' as maintained by the DOTS server, the latter MUST respond with a 4.04 (Not Found) error Response Code. The DOTS server MUST NOT add a new observe entry if this query overlaps with an existing one.

How should the server respond if this query overlaps with an existing one?

Section 10.1

   This module uses types defined in [RFC6991] and [RFC8345].

Should we mention the data structure extension from RFC 8791 here, as well?

   typedef attack-severity {
     type enumeration {
       enum none {
         value 1;
         description
           "No effect on the DOTS client domain.";

It seems like the closest analogue to our attack-severity typedef in RFC 7970 is in §3.12.2, "BusinessImpact Class". But the phrasing used in RFC 7970 is slightly different than the descriptions we have here. Do we want to tweak our phrasings and/or clarify in the typedef description that we use a slightly modified formulation? (Indeed, we do not have an extension point, either, though 7970 does have an extension mechanism.) Also, a section reference within RFC 7970 might be helpful.

   enum kilopacket-ps {
     value 4;
     description
       "Kilo packets per second (kpps).";

We may get someone who asks if we are using 1000 or 1024 ("kibi"), but I'm inclined to leave it alone for now.
   leaf unit-status {
     type boolean;
     default true;
     description
       "Enable/disable the use of the measurement unit class.";

Does the "default true" imply that even unit classes not present in "list unit-config" are assumed to be supported by default?

   grouping connection {
     description
       "A set of data nodes which represent the attack characteristics.";
     [...]
     leaf embryonic {
       type yang:gauge64;
       description
         "The number of simultaneous embryonic connections to the target server.";

It's a little interesting that this is the only leaf in the grouping that we can't attach the word "attack" to the description of, but as far as I can tell we shouldn't make any change here. (I originally wrote a comment that we shouldn't use "attack" for any of them, since we have analogous nodes in the total-connection-capacity config entries, but those seem to not use this grouping.)

   list talker {
     key "source-prefix";
     description
       "IPv4 or IPv6 prefix identifying the attacker(s).";

My read of the YANG is that this is the description for "list talker", but the content matches the description given for the "source-prefix" leaf within the list. Should the list itself have a different description?

   grouping top-talker-aggregate {
   [...]
   grouping top-talker {

The diff between these two is pretty small -- just the top-level description and connection-all vs connection-protocol-all. Is there any value in introducing another grouping to hold the common elements?

   container max-config-values {
     description
       "Maximum acceptable configuration values.";
     uses telemetry-parameters;

I guess the convenience of having the reusable grouping probably outweighs the value of having YANG-level constraints that the 'max' values are >= the 'min' values (which we can set for the "telemetry-notify-interval" that is not part of "telemetry-parameters", but can't set here because we use the grouping).

   leaf telemetry-notify-interval {
     type uint32 {
       range "1 ..
       3600";

(Pedantically, this range would fit in a uint16, though maybe there is some other reason to prefer the 32-bit type.)

   case baseline {
     [...]
     list baseline {
       [...]
       leaf id {
         type uint32;
         must '. >= 1';
         description
           "An identifier that uniquely identifies a baseline entry communicated by a DOTS client.";

I a little bit wonder if we should have a little bit of prose up in §6.3 discussing the need for "id" and how it is used.

   leaf tmid {
     type uint32;
     description
       "An identifier to uniquely demux telemetry data sent using the same message.";

The description for "tsid" in the analogous setup was just "an identifier for the DOTS telemetry setup data". As far as I can tell, the usage of the two is analogous, so shouldn't the descriptions be analogous as well? In particular, I am not sure that "uniquely demux" is the only usage of this leaf, since in server-initiated telemetry messages, this leaf might be needed to indicate which telemetry entry is being described (right?).

   leaf-list mid-list {
     type uint32;
     description
       "Reference a list of associated mitigation requests.";

Should we reference RFC 9132 somehow to indicate that the base signal channel is how these "mid" values are assigned (and assigned meaning)?

Section 10.2

   description
     "Vendor ID is a security vendor's Enterprise Number.";

Should we say something about "IANA-assigned" and/or link to the corresponding registry?

Section 11, 12.1

A handful of the entries in the tables still have an "ietf-dots-telemetry:" prefix, and I don't understand why only some of the entries would need it but not others. In particular, there seems to be an entry for both "total-traffic" and "ietf-dots-telemetry:total-traffic", and I don't understand how they are different. (I did not check if there are other "duplicate"s like that.)
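
A check for other such "duplicate"s could be done mechanically. The following is only a sketch under stated assumptions: it presumes the key names from the tables have been extracted into a plain list, the function name prefixed_duplicates is hypothetical, and the sample list is illustrative input rather than the actual table contents (only the "total-traffic" pair is taken from the observation above).

```python
PREFIX = "ietf-dots-telemetry:"


def prefixed_duplicates(names):
    """Return the names that appear both with and without the module prefix."""
    bare = {n for n in names if not n.startswith(PREFIX)}
    # Strip the prefix from prefixed names and keep those also present bare.
    return sorted(
        {n[len(PREFIX):] for n in names
         if n.startswith(PREFIX) and n[len(PREFIX):] in bare}
    )


# Illustrative input only; not the real contents of the Section 11/12.1 tables.
sample = [
    "telemetry",
    "baseline",
    "total-traffic",
    "ietf-dots-telemetry:total-traffic",
]

print(prefixed_duplicates(sample))
```

Running this over the real list of key names would show whether "total-traffic" is the only entry registered in both forms.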
Section 11

   | telemetry | container |TBA2 | 5 map | Object |

I see both a "list telemetry" and a "case telemetry" in the YANG module, but I'm not sure which of them is supposed to correspond to the YANG type "container". https://datatracker.ietf.org/doc/html/rfc7950#section-7.9.2 does not really give me the impression that there is an implicit container of this name.

   | baseline | container |TBA49 | 5 map | Object |

Similarly, I see a 'baseline' that's a list, as the only content of a "case baseline" statement.

Section 12.1

   Note that 'lower-type' and 'upper-type' are also requested for assignment in the call-home I-D. Both I-Ds should be sync'ed as depending the one that will make it first to the IANA.

(I think we tweaked call-home already to account for the registration requests being from different ranges between the two documents, but mention it just in case I'm misremembering.)

Section 12.3

   URI: urn:ietf:params:xml:ns:yang:ietf-dots-mapping
   Registrant Contact: The IESG.
   XML: N/A; the requested URI is an XML namespace.

I wonder if this (module) name is perhaps somewhat more generic than the functionality it provides.

Section 13

Should we mention the security considerations for the blockwise transfer technologies?

Looking through the YANG tree, the only thing that really sticks out as potentially worth mentioning here is the server-originated-telemetry option. But even for that, I'm not sure that there's much to say.

   The DOTS telemetry information includes DOTS client network topology, DOTS client domain pipe capacity, normal traffic baseline and connections capacity, and threat and mitigation information. Such information is sensitive; it MUST be protected at rest by the DOTS server domain to prevent data leakage.
I'd consider adding a sentence or two here noting that even though this data is sensitive, sending it explicitly to the DOTS server does not introduce any new significant considerations (other than the need for protection at rest) because the DOTS server is already trusted to have access to that kind of information by being in the position to mitigate (and observe) attacks.

Section 16.1

A normative reference on draft-ietf-dots-signal-filter-control will cause a document cluster, delaying publication until that document is ready to be an RFC. It's not clear to me that such a dependency is needed in this case, since there is only one citation to that document and it seems more descriptive than imposing a strong dependency.

Similarly, we seem to cite RFC 7641 just as a statement of fact (clients can use CoAP OBSERVE), which may not require it to be classified as a normative reference.

Section 16.2

The current state of this document doesn't really include enough detail about percentiles and their calculation to be implementable without referring to RFC 2330, currently listed only as an informative reference. I would prefer to add more text to this document rather than promote 2330 to a normative reference, though I think we would need more text than I suggested above in order to achieve that effect.

We say we assume familiarity with RFC 8612 at least for terminology, which may indicate that it is best classified as a normative reference.

Thanks,

Ben