Re: [Dots] AD review of draft-ietf-dots-telemetry-16
Benjamin Kaduk <kaduk@mit.edu> Tue, 28 December 2021 15:40 UTC
Date: Tue, 28 Dec 2021 07:40:33 -0800
From: Benjamin Kaduk <kaduk@mit.edu>
To: tirumal reddy <kondtir@gmail.com>
Cc: draft-ietf-dots-telemetry.all@ietf.org, dots@ietf.org
Message-ID: <20211228154033.GU11486@mit.edu>
References: <20211124224140.GH93060@kduck.mit.edu> <CAFpG3gfBSarzE7aPm6JM0h1LJHt=DwfKXnaRUGosU6_=VY150A@mail.gmail.com>
In-Reply-To: <CAFpG3gfBSarzE7aPm6JM0h1LJHt=DwfKXnaRUGosU6_=VY150A@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/v-7Svr2oJZMKVBxcVB-7Eu7yVNE>

Hi Tiru,

Thanks for these (and thanks Med as well for the additional follow-up; I
don't think I have any replies to that part). A few comments inline...

On Tue, Nov 30, 2021 at 12:59:02PM +0530, tirumal reddy wrote:
> Hi Ben,
>
> Thanks for the detailed review. Please see inline for responses till
> Section 7.1.4.
>
> On Thu, 25 Nov 2021 at 04:12, Benjamin Kaduk <kaduk@mit.edu> wrote:
>
> > Hi all,
> >
> > Sorry to have been working on this for so long -- some health issues
> > intervened, and it is perhaps easier to let oneself be interrupted if
> > there is no hope of finishing in a single sitting :-/
> >
> > [Note that I reviewed the -16 but the -17 is the current version, which
> > includes some editorial work I had sent in a PR. It looks like that
> > will not have changed many things I comment on.]
> >
> > This review is pretty long (which I guess befits a long document). We
> > probably want to chunk up replies to it so that the emails/threads stay
> > manageable.
> >
> > A couple meta-comments on the other reviews so far:
> >
> > - the shepherd writeup only answers one of the three questions in point
> >   (1). It is likely that Murray will point this out in his ballot if
> >   not updated before then.
> >
> > - it's a little surprising that the yangdoctor didn't ask for encoded
> >   examples of the RESTCONF (data channel) functionality. (For the
> >   sx:structure parts, I think we're already in good shape on examples.)
> >   That said, I don't think our use of RESTCONF is particularly novel,
> >   and am happy to proceed without such examples if desired.
> >
> > High-level remarks before going in to the detailed section-by-section
> > comments:
> >
> > I'm happy to see the note at the end of §4.5 that the data channel is
> > available to optimize the data that needs to be exchanged over the signal
> > channel during attack-time, in some sense reiterating the original split
> > between signal and data channels (but see my note in the inline comments
> > about the "MAY"). But this is already rather far into the design overview,
> > let alone the document overall! I think it would be helpful to have a
> > paragraph or two in the toplevel §4 to get in front of how the telemetry
> > mechanisms integrate into the signal+data channels, perhaps something like:
> >
> > % The DOTS protocol suite is divided into two logical channels: the signal
> > % channel and data channel. This division is due to the vastly different
> > % requirements placed upon the traffic they carry. The signal channel must
> > % remain available and usable even in the face of attack traffic that might,
> > % for example, saturate one direction of the links involved, rendering
> > % acknowledgment-based mechanisms unreliable and strongly incentivizing
> > % messages to be small enough to be contained in a single IP packet. In
> > % contrast, the data channel is available for high-bandwidth data transfer
> > % before or after an attack, using more conventional transport protocol
> > % techniques. It is generally preferable to perform advance configuration
> > % over the data channel, including configuring short aliases for static or
> > % nearly static data sets such as sets of network addresses/prefixes that
> > % might be subject to related attacks -- this helps to optimize the use of
> > % the signal channel for the small messages that truly require reliable
> > % delivery during an attack.
> > %
> > % Telemetry information has aspects that correspond to both operational
> > % modes: there is certainly a need to convey updated information about
> > % ongoing attack traffic and targets during an attack, so as to convey
> > % detailed information about mitigation status and inform updates to
> > % mitigation strategy in the face of adaptive attacks, but it is also useful
> > % to provide mitigation services with a picture of normal or "baseline"
> > % traffic towards potential targets, to aid in detecting when incoming
> > % traffic deviates from normal into being an attack. Likewise, one might
> > % populate a "database" of classifications of known types of attack so that
> > % a short alias can be used during attack time to describe an observed
> > % attack. This specification does make provision for use of the data
> > % channel for the latter function, but otherwise retains most telemetry
> > % functionality in the signal channel. This is partly out of necessity and
> > % partly out of expedience -- it is a functional requirement to convey
> > % information about ongoing attack traffic during an attack, and information
> > % about baseline traffic uses an essentially identical datastructure that is
> > % naturally defined to sit next to the description of attack traffic. The
> > % related telemetry setup information used to parametrize actual traffic
> > % data is also sent over the signal channel, out of expediency. [I can't
> > % actually populate this part; maybe it's something about having the
> > % configuration values that define the percentile measurements be in-band
> > % with the percentile measurements themselves, and then a desire to not
> > % define two similar setup schemes over data+signal channels that justifies
> > % the other telemetry-config and pipe branches?]
> >
>
> Proposed text looks good, we will add the above text to section 4.
>
> > The use of the "list telemetry" in the telemetry-setup message confused me
> > for a while, since it seems like the client can only send a zero- or
> > one-element list to the server (based on the 'tsid' for that direction being
> > part of the Uri-Path). Can you confirm that the list structure was used
> > just for the server-to-client direction combined with maximizing
> > data-structure reuse?
> >
> > We should say somewhere that the examples use CBOR presentation format.
> > (This could be a note near the terminology section or accompanying each
> > example.)
> >
> > I think we should probably include a bit of a brief refresher on percentile
> > calculation, in addition to the reference to RFC 2330. Right now we
> > basically just say that the peers negotiate parameters including
> > "percentile-related measurement parameters" which was not quite enough to
> > clue me in that 2330 was basically required reading to know what we actually
> > do. So this might look something like:
> >
> > % DOTS telemetry uses several percentile values to provide a picture of a
> > % traffic distribution overall, as opposed to just a single snapshot of
> > % observed traffic at a single point in time. Modeling raw traffic flow
> > % data as a distribution and describing that distribution entails choosing a
> > % measurement period that the distribution will describe, and a number of
> > % sampling intervals, or "buckets", within that measurement period. Traffic
> > % within each bucket is treated as a single event (i.e., averaged), and then
> > % the distribution of buckets is used to describe the distribution of
> > % traffic over the measurement period.
> > % A distribution can be characterized
> > % by statistical measures such as mean, median, and standard deviation, and
> > % also by reporting the value of the distribution at various percentile
> > % levels of the data set in question, such as "quartiles" that correspond to
> > % 25th, 50th, and 75th percentile. More details about percentile values and
> > % their computation are found in Section 11.3 of [RFC2330]; DOTS telemetry
> > % uses up to three percentile values, plus the overall peak, to characterize
> > % traffic distributions. Which percentile thresholds are used for these
> > % "low", "medium", and "high" percentile values is configurable (with
> > % suitable defaults).
> >
>
> Looks good, we will update the draft.
>
> > It seems like we are setting up somewhat conflicting goals between the
> > overall desire for compact signal-channel messages and our guidance to
> > "include in any request to update information related to [a thing] the
> > information of other [like things] (already communicated using a lower
> > ['tsid'/'tmid'] value)" that results in larger individual messages (albeit
> > fewer of them). Do we want to make some statement about how this guidance
> > to coalesce can be ignored if needed to make individual signal-channel
> > messages fit in a single packet? (Hmm, I guess we do have a note about
> > "assumes that all link information can fit in one single message" at
> > least for the 'tsid'/link case, but the text from that note only appears
> > once.)
> >
> > My editorial suggestions in
> > https://github.com/boucadair/draft-dots-telemetry/pull/9 included some edits
> > relating to whether a given target is "susceptible to" vs "subject to" (or
> > even "actively being subjected to") different types of attacks. Med has
> > diligently already merged that PR while I was sick, so I mostly assume that
> > these already got reviewed for correctness.
> > Still, I want to call out that
> > there is a difference and I was not always 100% sure that I picked the right
> > one for each case.
> >
> > It looks like we have to use backslash continuation markers for several of
> > the CoAP request target resources in the examples; do we want to reference
> > RFC 8792 folding and use its specific note text about line wrapping?
> >
> > There are a few places where we say something like "the DOTS client MUST
> > auto-scale so that the appropriate unit is used". (1) We don't specifically
> > say what the goal of this automatic behavior is (I guess, use the "largest
> > unit" that gives a value greater than one?), and (2) it seems that we are in
> > part relying on this behavior to ensure that the client only specifies one
> > measure in a given unit class. That is, I didn't see anything else that
> > would prevent a client from sending (say) both Mbps and Gbps values, that
> > could of course be in conflict with each other -- if there's a potential for
> > conflict we'd need to say how to resolve the conflict.
> >
> > Relatedly, there seems to *also* be potential for conflict in that we allow
> > both bit/second and Byte/second as unit-classes. What is to happen if I say
> > that something is both 2 bit/second and 2 Byte/second?

Med's follow-up implies that we are to just treat that as "invalid
parameters". I'd prefer to be more explicit about that, e.g., with a
statement like "at most one of the bit/second and Byte/second unit-class
entries can be present; if such conflicting values are received the
combination is treated as specifying an illegal parameter and rejected
with 4.00 (Bad Request)".

> > We have a lot of examples that use the same 'cuid' of
> > dz6pHjaADkaFTbjr0JGBpw, which is fine and representative of normal
> > operation. Do we want to curate the 'tsid' and 'tmid' values used by that
> > client?
> > I see some reuse, which I think is not prohibited (at least for
> > 'tsid'), but may merit checking, and the values are not monotonically
> > increasing throughout the course of the document, which may or may not be
> > desirable.

Did you have any thoughts on this part?

> > (nit) we seem to have both "signalled" and "signaled" present, and
> > should standardize on our spelling. (It looks like the one-'l' form is
> > more common in the RFC archive so far.)
> >
> > Section 1
> >
> >    Distributed Denial of Service (DDoS) attacks have become more
> >    sophisticated. [...]
> >
> > We might want to say something about the baseline for the comparison
> > ("more sophisticated compared to what?").
> >
> >    100 Mbps to 100s of Gbps or even Tbps. Attacks are commonly
> >    carried out leveraging botnets and attack reflectors for
> >    amplification attacks such as NTP (Network Time Protocol), DNS
> >    (Domain Name System), SNMP (Simple Network Management Protocol),
> >    or SSDP (Simple Service Discovery Protocol).
> >
> > Is https://datatracker.ietf.org/doc/html/rfc4732#section-3.1 still
> > current enough to make a useful reference for amplification attacks?
> >
> >    Nevertheless, when DOTS
> >    telemetry attributes are available to a DOTS agent, and absent any
> >    policy, it can signal the attributes in order to optimize the overall
> >    mitigation service provisioned using DOTS.
> >
> > I'm not sure what type of policy we have in mind, here.
> >
> > Section 2
> >
> >    The reader should be familiar with the terms defined in [RFC8612].
> >
> > It looks like RFC 8612 doesn't define "idle time" and "attack time", so
> > we may want to pull in a couple more documents as required reading or define
> > them ourselves.
> >
>
> Sure, we will elaborate these terms with an example.
>
> > Section 3.1
> >
> >    If the DOTS server's mitigation resources have the capabilities to
> >    facilitate the DOTS telemetry, the DOTS server adapts its protection
> >    strategy and activates the required countermeasures immediately
> >    (automation enabled) for the sake of optimized attack mitigation
> >    decisions and actions.
> >
> > I'm not entirely sure I understand this part. It seems to be talking
> > about telemetry between the DOTS server and the associated mitigation
> > resources, but the previous discussion has been about telemetry between
> > DOTS client and DOTS server (and I do not think the mitigation resources
> > associated with the DOTS server have been said to be DOTS clients in any
> > of the previous DOTS work).
> >
> > Section 3.2
> >
> >    DOTS telemetry can also be used to tune the DDoS mitigators with the
> >    correct state of an attack. During the last few years, DDoS attack
> >
> > (nit) I don't think "with" is the right verb -- "tune with <X>" implies
> > that X is used as a tool to effectuate the tuning, but I think what's
> > going on here is more like using the telemetry data as an input for
> > determining what values to use for the tuning parameters available on
> > the mitigation resources.
> >
>
> Fixed.
>
> >    Mitigation of attacks without having certain knowledge of normal
> >    traffic can be inaccurate at best. This is especially true for
> >    recursive signaling (see Section 3.2.3 in [I-D.ietf-dots-use-cases]).
> >
> > RFC 8903 makes no mention of "recursive", "recursion", or any
> > hierarchical mitigation scenario (at least in my quick search). Even
> > the linked draft version (-25) doesn't have a section 3.2.3. I do
> > remember reading about this type of recursive or hierarchical scenario
> > somewhere, so I think we just need to locate the right reference to put
> > here...I'm just not sure what reference that is, offhand.
> >
>
> Good catch, recursive signaling is discussed in RFC8811.

Ah, and that's where the section reference points to. I probably should
have figured that out on my own, but thanks for fixing it regardless.

> >    In addition, the highly diverse types of use cases where DOTS clients
> >    are integrated also emphasize the need for knowledge of each DOTS
> >    client domain behavior. Consequently, common global thresholds for
> >    attack detection practically cannot be realized. [...]
> >
> > (editorial?) When we say "the highly diverse types of use cases where
> > DOTS clients are integrated" that's essentially an unsupported claim,
> > but the way it's written we are presupposing that claim to be true
> > without directly stating it as an observation or assumption. I think we
> > might be better served by starting off with a declarative statement like
> > "DOTS clients can be integrated in a highly diverse set of scenarios and
> > use cases", and then moving on from that now-explicit assumption/fact to
> > conclude that any single global threshold will mischaracterize the
> > traffic for some of those diverse DOTS clients.
> >
> > Section 4.3
> >
> >    DOTS clients can also use CoAP Block1 Option in a PUT request (see
> >    Section 2.5 of [RFC7959]) to initiate large transfers, but these
> >    Block1 transfers will fail if the inbound "pipe" is running full, so
> >    consideration needs to be made to try to fit this PUT into a single
> >    transfer, or to separate out the PUT into several discrete PUTs where
> >    each of them fits into a single packet.
> >
> > If I understand/recall correctly, the Block1 transfer is expected to
> > fail only in a statistical sense, since the client can't send more
> > blocks until it gets the positive reply from the server to continue to
> > the next block (on a block-by-block basis), and that reply from the
> > server is competing with the attack traffic for the inbound "pipe" and
> > likely to fail for at least one of the blocks.
> > I think we should
> > probably reword slightly to say only "expected to fail" or "likely
> > fail", since there is some small chance of success; we may also want to
> > give a brief reminder about why it is expected to fail (e.g., "the
> > transfer requires a message from the server for each block, which would
> > likely be lost in the incoming flood").
> >
>
> Thanks, we will fix the text.
>
> > Section 6
> >
> >    Telemetry setup configuration is bound to a DOTS client domain. DOTS
> >    servers MUST NOT expect DOTS clients to send regular requests to
> >    refresh the telemetry setup configuration. Any available telemetry
> >    setup configuration has a validity timeout of the DOTS association
> >    with a DOTS client domain. [...]
> >
> > The term "DOTS association" does not seem to have been used previously
> > (in RFC 9132 we discuss the "DOTS session" heavily, though). Also, I

Are we planning to keep the "DOTS association" phrasing?

> > don't remember any previous requirement to keep state on the server for
> > the duration of anything scoped to the entire client *domain*, just
> > individual DOTS sessions. We do mention detecting conflicts/overlapping
> > requests within the scope of a client domain, in RFC 9132, but as far as
> > I can tell that only holds when all the DOTS sessions involved in the
> > conflict are still active.
> >
> > Perhaps related, this discussion (at least so far) is not clear to me
> > about what level of coordination/consistency is expected between clients
> > in the same domain. I believe that stock DOTS works fine with no such
> > coordination amongst clients within a domain, so if we are introducing such
> > a requirement we should say prominently that it's a change.
> >
>
> Yes, telemetry configuration is not specific to the client. For example,
> the pipe capacity is specific to the site.

So what happens if there are two clients in a site and they are reporting
different values for pipe-capacity?
I think I'm just generally unsure what the expected division of work is
between multiple clients in a domain, so my questions involve a lot of
speculating and I don't have a good suggestion for changes to the text
yet.

> > Section 6.1.1
> >
> >    Upon receipt of such request, and assuming no error is encountered by
> >    processing the request, the DOTS server replies with a 2.05 (Content)
> >    response that conveys the current and telemetry parameters acceptable
> >    by the DOTS server. [...]
> >
> > NEW:
> >    Upon receipt of such request, and assuming no error is encountered by
> >    processing the request, the DOTS server replies with a 2.05 (Content)
> >    response that conveys the telemetry parameters acceptable by the DOTS
> >    server and the current baseline information maintained by the DOTS
> >    server.
> >
> > (editorial) Something seems off, here (around "current and telemetry
> > parameters acceptable"). Is it returning current configuration, acceptable
> > parameter values, or some combination thereof?
> >
> >    |  |  +-- query-type*   query-type
> >
> > Since this is a leaf-list of supported query types, should the list name
> > include the word "supported" or similar?
> >
> > Section 6.1.2
> >
> >    The PUT request with a higher numeric 'tsid' value overrides the DOTS
> >    telemetry configuration data installed by a PUT request with a lower
> >    numeric 'tsid' value. To avoid maintaining a long list of 'tsid'
> >    requests for requests carrying telemetry configuration data from a
> >    DOTS client, the lower numeric 'tsid' MUST be automatically deleted
> >    and no longer be available at the DOTS server.
> >
> > The way this is phrased with "the lower" and "higher" (vs "highest") assumes
> > or implies that there is only one "lower" 'tsid' value, perhaps leaving some
> > ambiguity if there is actually more than one.
> > We might make a statement
> > about how there is only at most one active 'tsid' per 'cuid'+'cdid' at a
> > time other than during a config change (if that's the intent), or
> > alternatively to qualify that the requirement to remove is only incurred in
> > the event of a conflict (as is done for other types of config, later).
> >
>
> We can have multiple tsids for the same client for pipe/baseline
> information. This is why we have:
>

Ok. Should we say something like "when telemetry configuration data is
received that has overlapping scope with existing telemetry
configuration, the PUT request with higher numeric 'tsid' value [...]"?

The rest of this looks good; thanks again for all the updates.

-Ben

> >    DOTS clients SHOULD minimize the number of active 'tsid's used for
> >    baseline information. In order to avoid maintaining a long list of
> >    'tsid's for baseline information, it is RECOMMENDED that DOTS clients
> >    include in a request to update information related to a given target,
> >    the information of other targets (already communicated using a lower
> >    'tsid' value) (assuming this fits within one single datagram). This
> >    update request will override these existing requests and hence
> >    optimize the number of 'tsid' requests per DOTS client.
> >
> >    o  If the request is missing a mandatory attribute, does not include
> >       'cuid' or 'tsid' Uri-Path parameters, or contains one or more
> >       invalid or unknown parameters, 4.00 (Bad Request) MUST be returned
> >       in the response.
> >
> > Just to confirm: this does not rule out the ability to define new
> > parameters in the future (for example, the client might learn of new
> > ones in the response to a GET request)?
> >
>
> Yes.
>
> > Section 6.3
> >
> >    *  The maximum number of requests allowed per second to the
> >       target.
> >
> > Should we say anything about the requirement on the protocol in question
> > that "request" is a meaningful concept and observable by the mitigator
> > (analogous to what we have about "embryonic connections" earlier)?
> >
>
> Okay, updated text to say: The maximum number of requests (e.g.,
> HTTP/DNS/SIP requests) allowed per second to the target.
>
> >    |  +-- baseline* [id]
> >    |     +-- id
> >    |     |       uint32
> >    |     +-- target-prefix*
> >    |     |       inet:ip-prefix
> >
> > (nit) the formatting here is a bit surprising, to wrap the line between
> > leaf name and type. But if that's what pyang gives, we probably don't
> > want to mess with it...
> >
> >    |     |  +-- partial-request-ps?          uint64
> >    |     |  +-- partial-request-client-ps?   uint64
> >
> > I wonder whether the limit on "partial requests" should really be a rate
> > ("per second") versus a point-in-time cap (i.e., "outstanding partial
> > requests"). It seems like a given request could in principle remain in
> > "partial" state for an extended period of time, and that having remained
> > in such a state for a second should not justify the client being able to
> > produce more partial requests...but the current formulation as a rate seems
> > to do so.
> >
>
> Good point, added partial requests pending per client.
>
> > Section 6.3.1
> >
> >    Two PUT requests from a DOTS client have overlapping targets if there
> >    is a common IP address, IP prefix, FQDN, URI, or alias-name. Also,
> >    two PUT requests from a DOTS client have overlapping targets if the
> >    addresses associated with the FQDN, URI, or alias are overlapping
> >    with each other or with 'target-prefix'.
> >
> > There can be some subtlety here involving where the FQDN/URI/alias are
> > resolved into IP addresses from; we may want to say "from the perspective of
> > the server", "as observed by the server", or similar.
> >
>
> Fixed.
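
As a rough illustration of the overlap rule quoted from Section 6.3.1, the
sketch below checks whether two sets of target prefixes share any address.
It is only a sketch under stated assumptions: the function names are
hypothetical, and it assumes any FQDN, URI, or alias has already been
resolved to IP prefixes from the server's perspective, which is the
subtlety raised in the comment above.

```python
import ipaddress
from typing import Iterable


def prefixes_overlap(a: str, b: str) -> bool:
    """Return True if two IP prefixes share at least one address."""
    net_a = ipaddress.ip_network(a, strict=False)
    net_b = ipaddress.ip_network(b, strict=False)
    # An IPv4 prefix and an IPv6 prefix never overlap.
    if net_a.version != net_b.version:
        return False
    return net_a.overlaps(net_b)


def targets_overlap(first: Iterable[str], second: Iterable[str]) -> bool:
    """Hypothetical helper: compare two already-resolved prefix lists.

    Name resolution (FQDN/URI/alias to prefixes) is assumed to have been
    done by the server beforehand and is not modelled here.
    """
    second_list = list(second)
    return any(prefixes_overlap(a, b) for a in first for b in second_list)
```

A server-side conflict check could call targets_overlap() on the resolved
prefixes of an incoming PUT and of each installed baseline entry before
deciding whether the higher 'tsid' should replace the lower one.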
>
> > Section 7
> >
> >    DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data with
> >    mitigation requests relying upon the target attribute. In
> >
> > (nit??) Is this "target attribute" a telemetry attribute, or something
> > else (an attack target?)?
> >
> > NEW:
> >    DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data to
> >    mitigation requests relying upon the resources under attack.
> >
> >    When generating telemetry data to send to a peer, the DOTS agent MUST
> >    auto-scale so that appropriate unit(s) are used.
> >
> > (editorial) This requirement is not terribly tightly connected to either
> > what comes after it in the text or what comes before it in the text. We
> > might consider moving it (earlier, maybe?) and adding a bit more
> > transition phrasing about how the agent may send telemetry data many
> > times and in different situations, so the right unit to use will vary
> > over time.
> >
> > We will add an example like DDoS attack enhancing the attack volume from
> > Gbps to Tbps to justify the need for auto-scaling the units.
> >
> > Section 7.1.1
> >
> >    If the target is subjected to bandwidth consuming attack, the
> >    attributes representing the percentile values of the 'attack-id'
> >    attack traffic are included.
> >
> >    If the target is subjected to resource consuming DDoS attacks, the
> >    same attributes defined for Section 7.1.4 are applicable for
> >    representing the attack.
> >
> > One of these just says "the attributes representing" and the other
> > references attributes defined in a specific section; can we use the same
> > formulation/phrasing to talk about the two cases?
> >
> >    This is an optional attribute.
> >
> > Er, which one is optional? Just a few paragraphs earlier we said that
> > at least one attribute MUST be present (so as to be able to identify a
> > target).
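
The auto-scaling behaviour discussed above could look roughly like the
following. This is a minimal sketch, not the draft's definition: it
assumes the goal guessed at earlier in this thread (pick the largest unit
that still gives a value of at least one), assumes decimal steps of 1000,
and uses hypothetical unit labels rather than the draft's enumeration
names.

```python
from typing import Tuple

# Hypothetical unit labels in ascending order; each step is assumed to be
# a factor of 1000. These are not the identifiers used in the YANG module.
UNITS = ["bps", "Kbps", "Mbps", "Gbps", "Tbps"]


def auto_scale(bits_per_second: int) -> Tuple[float, str]:
    """Scale a raw bit rate to the largest unit giving a value >= 1."""
    value = float(bits_per_second)
    index = 0
    # Move up one unit at a time while the value still has four or more
    # integer digits and a larger unit is available.
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return value, UNITS[index]
```

Because exactly one (value, unit) pair is produced per measurement, a
sender following this approach would not emit both an Mbps and a Gbps
value for the same attribute, which is the conflict case raised earlier.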
> >
>
> Added the following line:
> At least the 'target' attribute and another
> pre-or-ongoing-mitigation attribute MUST be present in the DOTS telemetry
> message.
>
> Removed the line "This is an optional attribute"
>
> > Section 7.1.2
> >
> >    The 'total-traffic' attribute (Figure 26) conveys the percentile
> >    values (including peak and current observed values) of total traffic
> >    observed during a DDoS attack. [...]
> >
> > This attribute is under the "pre-or-ongoing-mitigation" hierarchy; is it
> > really right to only say "traffic observed during a DDoS attack" (i.e.,
> > implicitly excluding the "pre-attack" case)?
> >
>
> Good catch, fixed.
>
> > Section 7.1.3
> >
> >    The 'total-attack-traffic' attribute (Figure 27) conveys the total
> >    attack traffic identified by the DOTS client domain's DDoS Mitigation
> >    System (or DDoS Detector). [...]
> >
> > (editorial?) "Identified by [the client domain]" seems to imply that
> > this is only used in telemetry reports from client to server, and not in
> > updates flowing the other direction. Is that correct?
> >
>
> Yes, telemetry is applicable in both directions and not specific to client
> to server only.
>
> > Section 7.1.4
> >
> >    +-- total-attack-connection
> >    |  +-- low-percentile-l* [protocol]
> >    |  |  +-- protocol              uint8
> >    |  |  +-- connection?           yang:gauge64
> >    |  |  +-- embryonic?            yang:gauge64
> >    |  |  +-- connection-ps?        yang:gauge64
> >    |  |  +-- request-ps?           yang:gauge64
> >    |  |  +-- partial-request-ps?   yang:gauge64
> >
> > I think I'm confused about what semantics this is trying to represent.
> > The "low-percentile-l" seems like it should be providing aggregate
> > information about some particular snapshot of traffic that we have
> > determined to be representative of the "low percentile" level of attack
> > traffic.
> > That is, if we take the recorded attack traffic and divide it
> > into a bunch of bins on the time axis, we can order the respective bins
> > from "least noticeable attack" to "worst attack", and we pick the fifth
> > percentile (or whatever is configured) bin to represent the
> > "low-percentile" bucket. But now we're reporting a bunch of attributes
> > of that bin -- number of connections, embryonic connections, connections
> > per second, etc., even though the bins were ordered based on a single
> > "worst" metric (and I don't see us specify what metric we actually use
> > to do that ordering). So even though this is the "low percentile" (e.g.,
> > fifth percentile) bin of attack traffic (i.e., probably not a very bad
> > attack), we can't say specifically that the reported total attack
> > connection count is fifth percentile and the embryonic connection count
> > is fifth percentile and the requests-per-second is also fifth
> > percentile. Which makes me confused at why the list is structured this
> > way -- if we were just saying "for each of connection, embryonic, etc., bin
> > the samples over that metric and compute the low/med/high percentage values
> > for that single metric, and this list holds those resulting values", it
> > seems like a more natural structure to group things would be to have a list
> > of low/med/high for each of the attributes/metrics.
> >
>
> Good point, we should have a list of low/med/high/peak/current for each of
> the connection attributes.
>
> Cheers,
> -Tiru
>
> > Section 7.1.5
> >
> >    vendor-id: Vendor ID is a security vendor's Enterprise Number as
> >       registered with IANA [Enterprise-Numbers]. It is a four-byte
> >       integer value.
> >
> >    attack-id: Unique identifier assigned for the attack.
> >
> > If the vendor-id is being used as a scoping value to let each vendor
> > assign attack-id values, then we should say so here, the first place we
> > really handle this part of the YANG tree.
Even if not, we should probably > > say more about why we care what the vendor-id is, and we should also say > > who > > assigns the attack-id. (Maybe it's scoped to the DOTS server; I don't know > > yet at this point in the document!) > > > > Furthermore, the module structure where attack-id and vendor-id are > > always required, but there is going to need to be the ability to assign > > a new attack description at runtime, seems to force us to conflate > > externally/statically assigned attack-ids and dynamically assigned ones > > in the same number space. Should we give any guidance to vendors about > > how to allocate IDs in a way that will not produce collisions between > > predistributed/fixed attack IDs and new ones created at runtime? Or is the > > thought that there will be no dynamic attack-ids since that just > > corresponds > > to the attack-description string and a generic description can be used if > > there is not a better one available? > > > > Regardless, I think we should have more verbiage as to what the scope of > > the > > attack-id is -- what information is it supposed to indicate or replace? > > > > start-time: The time the attack started. The attack's start time is > > expressed in seconds relative to 1970-01-01T00:00Z in UTC time > > > > (nit) "in UTC time" and "Z" seem redundant (which does not necessarily > > imply that we should only use one of them...). > > > > (Section 3.4.2 of [RFC8949]). The CBOR encoding is modified so > > that the leading tag 1 (epoch-based date/time) MUST be omitted. > > > > (nit) Maybe we should move the CBOR section reference to after we start > > talking about CBOR encoding. (And we should do the same for 'end-time', > > if we do anything.) > > > > top-talker: A list of top talkers among attack sources. The top > > talkers are represented using the 'source-prefix'. > > > > Is there a good reference or short description for the notion of > > top-talkers? 
I know it's a well-established term of art in many different > > circles, but it's hard to be fully confident that all readers will be > > familiar with it. > > > > Also, should we say anything about how the number of talkers to include > > is determined? (I assume it is best left to the discretion of the > > sender, but we probably want to set a clear expectation that the list is > > not all talkers or a complete list in any other way.) > > > > 'spoofed-status' indicates whether a top talker is a spoofed IP > > address (e.g., reflection attacks) or not. > > > > Is this something that can be unambiguously determined by all parties > > that might be producing this telemetry data, or should we use a > > tri-state (for "unknown") rather than a boolean to represent it? > > > > In order to optimize the size of telemetry data conveyed over the > > DOTS signal channel, DOTS agents MAY use the DOTS data channel > > [RFC8783] to exchange vendor specific attack mapping details (that > > is, {vendor identifier, attack identifier} ==> attack description). > > As such, DOTS agents do not have to convey systematically an attack > > description in their telemetry messages over the DOTS signal channel. > > > > It's a little surprising to me that this is only a MAY. Does it make > > sense as a SHOULD (expected to usually happen, but leaving open the > > flexibility when an observed attack does not match a previously > > characterized attack description)? > > > > Also, when reading this I assumed that the "attack description" being > > mapped to could include things like the target port/protocol, whether it > > was spoofed, etc. But reading on to the ietf-dots-mapping module, it > > seems that this is intended to just be the single "attack-description" > > string. If so, we might give some more indications of that, e.g., by > > spelling it that way and having some prose about it being a "string", or > > similar. 
> > > > Also^2, from here to the end of the section might benefit from being > > split off into a dedicated subsection on using the data channel to > > pre-populate vendor and attack description information > > > > tables are at different revisions. The DOTS client SHOULD transmit > > telemetry information using the vendor mapping(s) that it provided to > > the DOTS server and the DOTS server SHOULD use the vendor mappings(s) > > provided to the DOTS client when transmitting telemetry data to peer > > DOTS agent. > > > > Having read a bit further on, I'm not actually sure where in the > > protocol the client has provided vendor mappings to the server, and the > > server provided vendor mappings to the client. It *might* be the GET > > and POST from/to dots-data/ietf-dots-mapping:vendor-mapping, but we > > don't really give a clear indication of that. I think we should try to > > be more precise about what we mean by "provided". > > (editorial) would this then be "SHOULD use any vendor mapping(s)" > > (s/the/any/)? > > > > augment /data-channel:dots-data/data-channel:capabilities: > > +--ro vendor-mapping-enabled? boolean {dots-telemetry}? > > > > None of the other capabilities in this tree (as specified by RFC 8783) > > include a name suffix like "-enabled". Should this just be > > "vendor-mapping"? > > > > "vendor-id": 1234, > > "vendor-name": "mitigator-s", > > "last-updated": "1576856561", > > "attack-mapping": [] > > > > [this last updated is from December 2019; we could probably pick > > something more current if we wanted, but it doesn't really matter.] > > > > The DOTS client reiterates the above procedure regularly (e.g., once > > a week) to update the DOTS server's vendor attack mapping details. > > > > Do the updates have to preserve any existing assignments? 
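The dates attributed to the example timestamps in this review can be checked by decoding them as seconds since 1970-01-01T00:00Z. The following is a minimal sketch using only Python's standard library; the helper name epoch_to_utc and the EXAMPLES table are illustrative assumptions, not anything defined by the draft.

```python
from datetime import datetime, timezone

# Example values quoted in this review, string-encoded as in the draft's
# JSON examples (seconds since 1970-01-01T00:00Z).
EXAMPLES = {
    "last-updated": "1576856561",
    "start-time": "1957811234",
    "mitigation-start": "1507818434",
}


def epoch_to_utc(value: str) -> datetime:
    """Decode a string-encoded epoch timestamp into an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


if __name__ == "__main__":
    for name, value in EXAMPLES.items():
        print(f"{name}: {epoch_to_utc(value).isoformat()}")
```

Running it shows which calendar year each example falls in, which is all the review comments rely on.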
> > > > If the DOTS client concludes that the DOTS server does not have any > > reference to the specific vendor attack mapping details, the DOTS > > client uses a POST request to install its vendor attack mapping > > details. [...] > > > > This suggests that a client would not bother sending its own vendor > > mapping if the server already has one, and at least to me further > > implies that the client would use the server-provided mapping for the > > messages that the client generates. This seems at odds with the earlier > > guidance that the client should send using the vendor-mapping it > > provided to the server, and the server should send using the > > vendor-mapping it provided to the client. Which one is it? > > > > The DOTS server indicates the result of processing the POST request > > using the status-line. Concretely, "201 Created" status-line MUST be > > returned in the response if the DOTS server has accepted the vendor > > attack mapping details. If the request is missing a mandatory > > attribute or contains an invalid or unknown parameter, "400 Bad > > Request" status-line MUST be returned by the DOTS server in the > > response. [...] > > > > I think (but did not specifically check) that these specific status code > > values are preexisting requirements of core RESTCONF (and HTTP), so we > > may not need to write new "MUST" keywords. > > > > If the request is received via a server-domain DOTS gateway, but the > > DOTS server does not maintain a 'cdid' for this 'cuid' while a 'cdid' > > is expected to be supplied, the DOTS server MUST reply with "403 > > Forbidden" status-line and the error-tag "access-denied". Upon > > receipt of this message, the DOTS client MUST register (Section 5.1 > > of [RFC8783]). > > > > (nit) the analogous text in RFC 8783 itself refers only to "Section 5" > > (not 5.1). > > > > The DOTS client uses the PUT request to modify its vendor attack > > mapping details maintained by the DOTS server (e.g., add a new > > mapping). 
> > > > As above, is this really just an example behavior or a hard requirement > > to preserve existing mappings? > > > > Section 7.2 > > > > "attack-detail": [ > > { > > "vendor-id": 1234, > > > > The vendor-ID is supposed to be an IANA-assigned PEN, and 1234 is > > assigned to Linkage Software Inc. We have other options, like 32473, > > "Example Enterprise Number for Documentation Use" (RFC 5612). > > > > "start-time": "1957811234", > > > > This value represents a time in 2032; should we be using something a > > little more current? > > > > Section 7.3 > > > > This request MUST be maintained active by the DOTS server until a > > delete request is received from the same DOTS client to clear this > > pre-or-ongoing-mitigation telemetry. > > > > It seems like we might want to allow some provision for a server to > > clean up state associated with a client that disappears entirely without > > cleaning itself up. For example, is the server allowed to discard state > > when it reboots, or does this requirement extend to writing the state to > > persistent storage? > > > > If more than one Uri-Query option is included in a request, these > > options are interpreted in the same way as when multiple target > > attributes are included in a message body. > > > > I a little bit wonder if we could put a section reference here for > > "interpreted in the same way", since the reader may not remember what > > that way is, at least on the first reading. > > > > parameters. DOTS clients MUST NOT include a name in which the "*" > > character is included in a label other than the leftmost label. > > > > Do we want to say what the server should do if it gets one anyway? > > (It's not really clear that we need to say anything, strictly speaking.) 
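The wildcard rule quoted above (no "*" in any label other than the leftmost) is simple to check on the receiving side. The sketch below is only an illustration of that check; the function name is hypothetical, and what a server should do with a non-conforming name is exactly the open question, so the code only reports the result.

```python
def wildcard_only_in_leftmost_label(name: str) -> bool:
    """Return True if '*' appears in no label other than the leftmost one.

    Labels are the dot-separated components of the name. This checks only
    the placement rule quoted from the draft; it does not validate the name
    in any other way.
    """
    labels = name.split(".")
    return all("*" not in label for label in labels[1:])


if __name__ == "__main__":
    for candidate in ("*.example.com", "www.*.example.com", "example.com"):
        print(candidate, wildcard_only_in_leftmost_label(candidate))
```

A server could use such a predicate to decide whether to reject or ignore the offending query, whichever behavior the document ends up specifying (if any).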
> > > > "vendor-id": 1234, > > "attack-id": 77, > > "start-time": "1957818434", > > > > [same as above] > > > > A DOTS client that is not interested to receive pre-or-ongoing- > > mitigation telemetry data for a target MUST send a delete request > > similar to the one depicted in Figure 37. > > > > I'm not sure that this strictly needs to be a "MUST"; it seems like we > > could say that such a client "sends a delete request" and leave it at > > that. > > > > Section 8.1 > > > > +-- attack-detail* [vendor-id attack-id] > > [...] > > +-- top-talker > > +-- talker* [source-prefix] > > [...] > > +-- total-attack-connection > > +-- low-percentile-c > > | +-- connection? yang:gauge64 > > > > I'm kind of surprised that there is no 'protocol' leaf in this subtree; > > it seems to usually appear in combination with these connection counts. > > > > | +-- peak-g? yang:gauge64 > > | +-- current-g? yang:gauge64 > > [...] > > | +-- peak-g? yang:gauge64 > > | +-- current-g? yang:gauge64 > > > > (nit) the indentation seems off for type of these two 'current-g's. > > > > In order to signal telemetry data in a mitigation efficacy update, it > > is RECOMMENDED that the DOTS client has already established a DOTS > > telemetry setup session with the server in 'idle' time. > > > > If I understand correctly, it seems that at least part of the need for > > having a preestablished telemetry session is that the mitigation > > efficacy update is identified only by the 'mid', and so we cannot > > specify a 'tmid' as part of such an efficacy update. Thus, in order > > for the server to associate the telemetry data in the efficacy update > > with an ongoing telemetry status session, the 'tsid' must be associated > > with the 'mid' that is the subject of the efficacy update. > > If that is correct, then I think we should have some text here about how > > the efficacy update does not/cannot include the 'tmid', and so is > > associated with the 'mid' of the update. 
> > > > Section 8.2 > > > > As defined in [RFC8612], the actual mitigation activities can include > > several countermeasure mechanisms. The DOTS server signals the > > current operational status of relevant countermeasures. A list of > > attacks detected by each countermeasure MAY also be included. The > > > > I see how in stock DOTS signal channel, the server signals whether an > > attack is fully/partially/not mitigated, but I don't see how the server > > indicates the specific relevant countermeasures that were used. Am I > > just missing something there? (Similarly, the list of attacks we > > returned may or may not be at the granularity of "detected by each > > countermeasure", depending on the answer to the previous part.) > > > > Also (editorial), I think we should be more clear about when we switch > > from describing RFC 9132 behavior to describing the new functionality > > provided by this document. > > > > Figure 46 shows an example of an asynchronous notification of attack > > mitigation status from the DOTS server. This notification signals > > > > Is the "new stuff" in Figure 46 just the subtree under > > :(server-to-client-only)? The 'total-attack-traffic' leaf seems to be > > the same content as in Figure 45, and maybe could be omitted? > > > > "mitigation-start": "1507818434", > > > > (This date is from 2017; as above, we could use a more current one if we > > want.) > > > > "source-count": { > > "peak-g": "10000" > > > > (Suspiciously round numbers like this are indications of fabricated data > > ... which is totally reasonable for an example like this! But if we > > wanted to be "more realistic" we could use a less-round number.) > > > > DOTS clients can filter out the asynchronous notifications from the > > DOTS server by indicating one or more Uri-Query options in its GET > > request. 
A Uri-Query option can include the following parameters: > > 'target-prefix', 'target-port', 'target-protocol', 'target-fqdn', > > 'target-uri', 'alias-name', and 'c' (content) (Section 4.2.4). The > > > > Up in §7.3 the analogous list also included 'mid'. Is 'mid' appropriate > > here (or is it implicitly already included by the base signal channel > > mechanisms)? Hmm, but maybe 'c' is also already allowed by the base > > signal channel mechanism, so that theory is not great... > > > > If the target query does not match the target of the enclosed 'mid' > > as maintained by the DOTS server, the latter MUST respond with a 4.04 > > (Not Found) error Response Code. The DOTS server MUST NOT add a new > > observe entry if this query overlaps with an existing one. > > > > How should the server respond if this query overlaps with an existing > > one? > > > > Section 10.1 > > > > This module uses types defined in [RFC6991] and [RFC8345]. > > > > Should we mention the data structure extension from RFC 8791 here, as > > well? > > > > typedef attack-severity { > > type enumeration { > > enum none { > > value 1; > > description > > "No effect on the DOTS client domain."; > > > > It seems like the closest analogue to our attack-severity typedef in > > RFC 7970 is in §3.12.2, "BusinessImpact Class". But the phrasing used in > > RFC 7970 is slightly different than the descriptions we have here. > > Do we want to tweak our phrasings and/or clarify in the typedef > > description that we use a slightly modified formulation? (Indeed, we do > > not have an extension point, either, though 7970 does have an extension > > mechanism.) Also, a section reference within RFC 7970 might be helpful. > > > > enum kilopacket-ps { > > value 4; > > description > > "Kilo packets per second (kpps)."; > > > > We may get someone who asks if we are using 1000 or 1024 ("kibi"), but > > I'm inclined to leave it alone for now. 
> > > > leaf unit-status { > > type boolean; > > default true; > > description > > "Enable/disable the use of the measurement unit class."; > > > > Does the "default true" imply that even unit classes not present in > > "list unit-config" are assumed to be supported by default? > > > > grouping connection { > > description > > "A set of data nodes which represent the attack > > characteristics."; > > [...] > > leaf embryonic { > > type yang:gauge64; > > description > > "The number of simultaneous embryonic connections to > > the target server."; > > > > It's a little interesting that this is the only leaf in the grouping > > that we can't attach the word "attack" to the description of, but as far > > as I can tell we shouldn't make any change here. > > (I originally wrote a comment that we shouldn't use "attack" for any of > > them, since we have analogous nodes in the total-connection-capacity > > config entries, but those seem to not use this grouping.) > > > > list talker { > > key "source-prefix"; > > description > > "IPv4 or IPv6 prefix identifying the attacker(s)."; > > > > My read of the YANG is that this is the description for "list talker", > > but the content matches the description given for the "source-prefix" leaf > > within the list. Should the list itself have a different description? > > > > grouping top-talker-aggregate { > > [...] > > grouping top-talker { > > > > The diff between these two is pretty small -- just the top-level > > description and connection-all vs connection-protocol-all. Is there any > > value in introducing another grouping to hold the common elements? 
> > > > container max-config-values { > > description > > "Maximum acceptable configuration values."; > > uses telemetry-parameters; > > > > I guess the convenience of having the reusable grouping probably > > outweighs the value of having YANG-level constraints that the 'max' > > values are >= the 'min' values (which we can set for the > > "telemetry-notify-interval" that is not part of "telemetry-parameters", but > > can't set here because we use the grouping). > > > > leaf telemetry-notify-interval { > > type uint32 { > > range "1 .. 3600"; > > > > (Pedantically, this range would fit in a uint16, though maybe there is > > some other reason to prefer the 32-bit type.) > > > > case baseline { > > [...] > > list baseline { > > [...] > > leaf id { > > type uint32; > > must '. >= 1'; > > description > > "An identifier that uniquely identifies a baseline > > entry communicated by a DOTS client."; > > > > I a little bit wonder if we should have a little bit of prose up in §6.3 > > discussing the need for "id" and how it is used. > > > > leaf tmid { > > type uint32; > > description > > "An identifier to uniquely demux telemetry data sent > > using the same message."; > > > > The description for "tsid" in the analogous setup was just "an > > identifier for the DOTS telemetry setup data". As far as I can tell, > > the usage of the two is analogous, so shouldn't the descriptions be > > analogous as well? In particular, I am not sure that "uniquely demux" is > > the only usage of this leaf, since in server-initiated telemetry messages, > > this leaf might be needed to indicate which telemetry entry is being > > described (right?). > > > > leaf-list mid-list { > > type uint32; > > description > > "Reference a list of associated mitigation requests."; > > > > Should we reference RFC 9132 somehow to indicate that the base signal > > channel is how these "mid" values are assigned (and assigned meaning)? 
> > > > Section 10.2 > > > > description > > "Vendor ID is a security vendor's Enterprise Number."; > > > > Should we say something about "IANA-assigned" and/or link to the > > corresponding registry? > > > > Section 11, 12.1 > > > > A handful of the entries in the tables still have an > > "ietf-dots-telemetry:" prefix, and I don't understand why only some of > > the entries would need it but not others. > > > > In particular, there seems to be an entry for both "total-traffic" and > > "ietf-dots-telemetry:total-traffic", and I don't understand how they are > > different. (I did not check if there are other "duplicate"s like that.) > > > > Section 11 > > > > | telemetry | container |TBA2 | 5 map | Object | > > > > I see both a "list telemetry" and a "case telemetry" in the YANG module, > > but I'm not sure which of them is supposed to correspond to the YANG > > type "container". > > https://datatracker.ietf.org/doc/html/rfc7950#section-7.9.2 does not > > really give me the impression that there is an implicit container of > > this name. > > > > | baseline | container |TBA49 | 5 map | Object | > > > > Similarly, I see a 'baseline' that's a list, as the only content of a > > "case baseline" statement. > > > > Section 12.1 > > > > Note that 'lower-type' and 'upper-type' are also requested for > > assignment in the call-home I-D. Both I-Ds should be sync'ed as > > depending the one that will make it first to the IANA. > > > > (I think we tweaked call-home already to account for the registration > > requests being from different ranges between the two documents, but > > mention it just in case I'm misremembering.) > > > > Section 12.3 > > > > URI: urn:ietf:params:xml:ns:yang:ietf-dots-mapping > > Registrant Contact: The IESG. > > XML: N/A; the requested URI is an XML namespace. > > > > I wonder if this (module) name is perhaps somewhat more generic than the > > functionality it provides. 
> > > > Section 13 > > > > Should we mention the security considerations for the blockwise transfer > > technologies? > > > > Looking through the YANG tree, the only thing that really sticks out as > > potentially worth mentioning here is the server-originated-telemetry > > option. But even for that, I'm not sure that there's much to say. > > > > The DOTS telemetry information includes DOTS client network topology, > > DOTS client domain pipe capacity, normal traffic baseline and > > connections capacity, and threat and mitigation information. Such > > information is sensitive; it MUST be protected at rest by the DOTS > > server domain to prevent data leakage. > > > > I'd consider adding a sentence or two here noting that even though this > > data is sensitive, sending it explicitly to the DOTS server does not > > introduce any new significant considerations (other than the need for > > protection at rest) because the DOTS server is already trusted to have > > access to that kind of information by being in the position to mitigate > > (and observe) attacks. > > > > Section 16.1 > > > > A normative reference on draft-ietf-dots-signal-filter-control will > > cause a document cluster, delaying publication until that document is > > ready to be an RFC. It's not clear to me that such a dependency is > > needed in this case, since there is only one citation to that document > > and it seems more descriptive than imposing a strong dependency. > > > > Similarly, we seem to cite RFC 7641 just as a statement of fact (clients > > can use CoAP OBSERVE), which may not require it to be classified as a > > normative reference. > > > > Section 16.2 > > > > The current state of this document doesn't really include enough detail > > about percentiles and their calculation to be implementable without > > referring to RFC 2330, currently listed only as an informative reference. 
> > I would prefer to add more text to this document rather than promote > > 2330 to a normative reference, though I think we would need more text > > than I suggested above in order to achieve that effect. > > > > We say we assume familiarity with RFC 8612 at least for terminology, > > which may indicate that it is best classified as a normative reference. > > > > > > Thanks, > > > > Ben > > > >
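To make the Section 7.1.4 point about per-metric percentiles concrete, here is a minimal sketch that computes low/mid/high values independently for each connection metric. Everything in it is an assumption for illustration: the sample data is made up, the 10/50/90 levels are placeholders for whatever is configured, and nearest-rank is just one possible percentile definition, not necessarily the one the draft or RFC 2330 intends.

```python
import math


def nearest_rank(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n), 1-based."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def per_metric_percentiles(samples, levels):
    """Compute each percentile level separately for every metric.

    samples: list of dicts, one per time bin, all with the same metric keys.
    levels:  mapping of label -> percentile (e.g. {"low": 10}).
    """
    metrics = samples[0].keys()
    return {
        metric: {
            label: nearest_rank([s[metric] for s in samples], pct)
            for label, pct in levels.items()
        }
        for metric in metrics
    }


# Hypothetical per-bin measurements; note that the bin with the fewest
# connections (80) has the most embryonic connections (95).
samples = [
    {"connection": 120, "embryonic": 40},
    {"connection": 300, "embryonic": 10},
    {"connection": 80, "embryonic": 95},
    {"connection": 500, "embryonic": 5},
]
levels = {"low": 10, "mid": 50, "high": 90}
result = per_metric_percentiles(samples, levels)

if __name__ == "__main__":
    for metric, values in result.items():
        print(metric, values)
```

With this data, the "low" connection value and the "low" embryonic value come from different bins, which is why grouping low/mid/high under each metric reads more naturally than grouping all metrics under a single "low-percentile" entry.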