[Dots] AD review of draft-ietf-dots-telemetry-16

Benjamin Kaduk <kaduk@mit.edu> Wed, 24 November 2021 22:41 UTC
Date: Wed, 24 Nov 2021 14:41:40 -0800
From: Benjamin Kaduk <kaduk@mit.edu>
To: draft-ietf-dots-telemetry.all@ietf.org
Cc: dots@ietf.org
Message-ID: <20211124224140.GH93060@kduck.mit.edu>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/oP2MrvGEhXsct4qVed6B0uC_QfU>
Hi all,

Sorry to have been working on this for so long -- some health issues
intervened, and it is perhaps easier to let oneself be interrupted if
there is no hope of finishing in a single sitting :-/

[Note that I reviewed the -16 but the -17 is the current version, which
includes some editorial work I had sent in a PR.  It looks like that
will not have changed many things I comment on.]

This review is pretty long (which I guess befits a long document).  We
probably want to chunk up replies to it so that the emails/threads stay
manageable.

A couple meta-comments on the other reviews so far:

- the shepherd writeup only answers one of the three questions in point
  (1).  It is likely that Murray will point this out in his ballot if
  not updated before then.

- it's a little surprising that the yangdoctor didn't ask for encoded
  examples of the RESTCONF (data channel) functionality.  (For the
  sx:structure parts, I think we're already in good shape on examples.)  That
  said, I don't think our use of RESTCONF is particularly novel, and am happy
  to proceed without such examples if desired.

High-level remarks before going into the detailed section-by-section
comments:

I'm happy to see the note at the end of §4.5 that the data channel is
available to optimize the data that needs to be exchanged over the signal
channel during attack-time, in some sense reiterating the original split
between signal and data channels (but see my note in the inline comments
about the "MAY").  But this is already rather far into the design overview,
let alone the document overall!  I think it would be helpful to have a
paragraph or two in the toplevel §4 to get in front of how the telemetry
mechanisms integrate into the signal+data channels, perhaps something like:

% The DOTS protocol suite is divided into two logical channels: the signal
% channel and data channel.  This division is due to the vastly different
% requirements placed upon the traffic they carry.  The signal channel must
% remain available and usable even in the face of attack traffic that might,
% for example, saturate one direction of the links involved, rendering
% acknowledgment-based mechanisms unreliable and strongly incentivizing
% messages to be small enough to be contained in a single IP packet.  In
% contrast, the data channel is available for high-bandwidth data transfer
% before or after an attack, using more conventional transport protocol
% techniques.  It is generally preferable to perform advance configuration
% over the data channel, including configuring short aliases for static or
% nearly static data sets such as sets of network addresses/prefixes that
% might be subject to related attacks -- this helps to optimize the use of
% the signal channel for the small messages that truly require reliable
% delivery during an attack.
%
% Telemetry information has aspects that correspond to both operational
% modes: there is certainly a need to convey updated information about
% ongoing attack traffic and targets during an attack, so as to convey
% detailed information about mitigation status and inform updates to
% mitigation strategy in the face of adaptive attacks, but it is also useful
% to provide mitigation services with a picture of normal or "baseline"
% traffic towards potential targets, to aid in detecting when incoming
% traffic deviates from normal into being an attack.  Likewise, one might
% populate a "database" of classifications of known types of attack so that
% a short alias can be used during attack time to describe an observed
% attack.  This specification does make provision for use of the data
% channel for the latter function, but otherwise retains most telemetry
% functionality in the signal channel.  This is partly out of necessity and
% partly out of expedience -- it is a functional requirement to convey
% information about ongoing attack traffic during an attack, and information
% about baseline traffic uses an essentially identical datastructure that is
% naturally defined to sit next to the description of attack traffic.  The
% related telemetry setup information used to parametrize actual traffic
% data is also sent over the signal channel, out of expediency. [I can't
% actually populate this part; maybe it's something about having the
% configuration values that define the percentile measurements be in-band
% with the percentile measurements themselves, and then a desire to not
% define two similar setup schemes over data+signal channels that justifies
% the other telemetry-config and pipe branches?]

The use of the "list telemetry" in the telemetry-setup message confused me
for a while, since it seems like the client can only send a zero- or
one-element list to the server (based on the 'tsid' for that direction being
part of the Uri-Path).  Can you confirm that the list structure was used
just for the server-to-client direction combined with maximizing
data-structure reuse?

We should say somewhere that the examples use CBOR presentation format.
(This could be a note near the terminology section or accompanying each
example.)

I think we should probably include a bit of a brief refresher on percentile
calculation, in addition to the reference to RFC 2330.  Right now we
basically just say that the peers negotiate parameters including
"percentile-related measurement parameters" which was not quite enough to
clue me in that 2330 was basically required reading to know what we actually
do.  So this might look something like:

% DOTS telemetry uses several percentile values to provide a picture of a
% traffic distribution overall, as opposed to just a single snapshot of
% observed traffic at a single point in time.  Modeling raw traffic flow
% data as a distribution and describing that distribution entails choosing a
% measurement period that the distribution will describe, and a number of
% sampling intervals, or "buckets", within that measurement period.  Traffic
% within each bucket is treated as a single event (i.e., averaged), and then
% the distribution of buckets is used to describe the distribution of
% traffic over the measurement period.  A distribution can be characterized
% by statistical measures such as mean, median, and standard deviation, and
% also by reporting the value of the distribution at various percentile
% levels of the data set in question, such as "quartiles" that correspond to
% 25th, 50th, and 75th percentile.  More details about percentile values and
% their computation are found in Section 11.3 of [RFC2330]; DOTS telemetry
% uses up to three percentile values, plus the overall peak, to characterize
% traffic distributions.  Which percentile thresholds are used for these
% "low", "medium", and "high" percentile values is configurable (with
% suitable defaults).
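As a rough illustration of the bucketing described above, here is a minimal Python sketch.  It is not taken from the draft or from RFC 2330: the function names, the nearest-rank rule, the sample data, and the 10/50/90 thresholds are all assumptions chosen for the example.

```python
def nearest_rank(ordered, p):
    """Nearest-rank percentile of an already sorted list (p in 1..100)."""
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


def summarize(samples, bucket_size, low=10, medium=50, high=90):
    """Average samples into buckets, then describe the bucket distribution."""
    buckets = []
    for i in range(0, len(samples), bucket_size):
        chunk = samples[i:i + bucket_size]
        buckets.append(sum(chunk) / len(chunk))  # one averaged event per bucket
    ordered = sorted(buckets)
    return {
        "low": nearest_rank(ordered, low),
        "medium": nearest_rank(ordered, medium),
        "high": nearest_rank(ordered, high),
        "peak": ordered[-1],
    }


# Twenty hypothetical one-second samples (Mbps), averaged into ten buckets.
samples = [10, 12, 11, 13, 50, 54, 12, 10, 11, 11,
           90, 94, 12, 12, 10, 14, 13, 11, 12, 10]
report = summarize(samples, bucket_size=2)
```

The point of the sketch is only that the measurement period, the bucket size, and the percentile thresholds all have to be agreed on before the reported numbers mean anything.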


It seems like we are setting up somewhat conflicting goals between the
overall desire for compact signal-channel messages and our guidance to
"include in any request to update information related to [a thing] the
information of other [like things] (already communicated using a lower
['tsid'/'tmid'] value)" that results in larger individual messages (albeit
fewer of them).  Do we want to make some statement about how this guidance
to coalesce can be ignored if needed to make individual signal-channel
messages fit in a single packet?  (Hmm, I guess we do have a note about
"assumes that all link information can fit in one single message" at
least for the 'tsid'/link case, but the text from that note only appears
once.)

My editorial suggestions in
https://github.com/boucadair/draft-dots-telemetry/pull/9 included some edits
relating to whether a given target is "susceptible to" vs "subject to" (or
even "actively being subjected to") different types of attacks.  Med has
diligently already merged that PR while I was sick, so I mostly assume that
these already got reviewed for correctness.  Still, I want to call out that
there is a difference and I was not always 100% sure that I picked the right
one for each case.

It looks like we have to use backslash continuation markers for several of
the CoAP request target resources in the examples; do we want to reference
RFC 8792 folding and use its specific note text about line wrapping?

There are a few places where we say something like "the DOTS client MUST
auto-scale so that the appropriate unit is used".  (1) We don't specifically
say what the goal of this automatic behavior is (I guess, use the "largest
unit" that gives a value greater than one?), and (2) it seems that we are in
part relying on this behavior to ensure that the client only specifies one
measure in a given unit class.  That is, I didn't see anything else that
would prevent a client from sending (say) both Mbps and Gbps values, that
could of course be in conflict with each other -- if there's a potential for
conflict we'd need to say how to resolve the conflict.
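One possible reading of "auto-scale" (the "largest unit that gives a value of at least one" guess above) can be sketched as follows.  The unit labels and the rule itself are assumptions for illustration, not something the draft states.

```python
# Hypothetical unit ladder for one unit class (bits per second).
UNITS = [
    ("bps", 1),
    ("Kbps", 10**3),
    ("Mbps", 10**6),
    ("Gbps", 10**9),
    ("Tbps", 10**12),
]


def auto_scale(bits_per_second):
    """Pick the largest unit for which the scaled value is at least 1."""
    chosen_name, chosen_factor = UNITS[0]
    for name, factor in UNITS:
        if bits_per_second >= factor:
            chosen_name, chosen_factor = name, factor
    return chosen_name, bits_per_second / chosen_factor
```

Under this rule a sender always emits exactly one value per unit class, which would sidestep the conflicting Mbps/Gbps case, but only if the rule is actually written down.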

Relatedly, there seems to *also* be potential for conflict in that we allow
both bit/second and Byte/second as unit-classes.  What is to happen if I say
that something is both 2 bit/second and 2 Byte/second?
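A receiver that wanted to detect this kind of conflict could apply a consistency check along these lines.  This is a sketch only; the function name and the tolerance parameter are hypothetical, and the draft does not say whether such a check is expected.

```python
def rates_agree(bit_ps, byte_ps, tolerance=0.0):
    """Return True when a bit/s figure and a Byte/s figure describe the same rate.

    tolerance is a relative slack (0.0 demands exact agreement).
    """
    as_bits = 8 * byte_ps
    return abs(bit_ps - as_bits) <= tolerance * max(bit_ps, as_bits)
```

With exact matching, 2 bit/second and 2 Byte/second disagree, while 16 bit/second and 2 Byte/second agree.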

We have a lot of examples that use the same 'cuid' of
dz6pHjaADkaFTbjr0JGBpw, which is fine and representative of normal
operation.  Do we want to curate the 'tsid' and 'tmid' values used by that
client?  I see some reuse, which I think is not prohibited (at least for
'tsid'), but may merit checking, and the values are not monotonically
increasing throughout the course of the document, which may or may not be
desirable.

(nit) we seem to have both "signalled" and "signaled" present, and
should standardize on our spelling.  (It looks like the one-'l' form is
more common in the RFC archive so far.)

Section 1

   Distributed Denial of Service (DDoS) attacks have become more
   sophisticated.  [...]

We might want to say something about the baseline for the comparison
("more sophisticated compared to what?").

       100 Mbps to 100s of Gbps or even Tbps.  Attacks are commonly
       carried out leveraging botnets and attack reflectors for
       amplification attacks such as NTP (Network Time Protocol), DNS
       (Domain Name System), SNMP (Simple Network Management Protocol),
       or SSDP (Simple Service Discovery Protocol).

Is https://datatracker.ietf.org/doc/html/rfc4732#section-3.1 still
current enough to make a useful reference for amplification attacks?

                                          Nevertheless, when DOTS
   telemetry attributes are available to a DOTS agent, and absent any
   policy, it can signal the attributes in order to optimize the overall
   mitigation service provisioned using DOTS.

I'm not sure what type of policy we have in mind, here.

Section 2

   The reader should be familiar with the terms defined in [RFC8612].

It looks like RFC 8612 doesn't define "idle time" and "attack time", so
we may want to pull in a couple more documents as required reading or define
them ourselves.

Section 3.1

   If the DOTS server's mitigation resources have the capabilities to
   facilitate the DOTS telemetry, the DOTS server adapts its protection
   strategy and activates the required countermeasures immediately
   (automation enabled) for the sake of optimized attack mitigation
   decisions and actions.

I'm not entirely sure I understand this part.  It seems to be talking
about telemetry between the DOTS server and the associated mitigation
resources, but the previous discussion has been about telemetry between
DOTS client and DOTS server (and I do not think the mitigation resources
associated with the DOTS server have been said to be DOTS clients in any
of the previous DOTS work).

Section 3.2

   DOTS telemetry can also be used to tune the DDoS mitigators with the
   correct state of an attack.  During the last few years, DDoS attack

(nit) I don't think "with" is the right preposition -- "tune with <X>" implies
that X is used as a tool to effectuate the tuning, but I think what's
going on here is more like using the telemetry data as an input for
determining what values to use for the tuning parameters available on
the mitigation resources.

   Mitigation of attacks without having certain knowledge of normal
   traffic can be inaccurate at best.  This is especially true for
   recursive signaling (see Section 3.2.3 in [I-D.ietf-dots-use-cases]).

RFC 8903 makes no mention of "recursive", "recursion", or any
hierarchical mitigation scenario (at least in my quick search).   Even
the linked draft version (-25) doesn't have a section 3.2.3.  I do
remember reading about this type of recursive or hierarchical scenario
somewhere, so I think we just need to locate the right reference to put
here...I'm just not sure what reference that is, offhand.

   In addition, the highly diverse types of use cases where DOTS clients
   are integrated also emphasize the need for knowledge of each DOTS
   client domain behavior.  Consequently, common global thresholds for
   attack detection practically cannot be realized.  [...]

(editorial?) When we say "the highly diverse types of use cases where
DOTS clients are integrated" that's essentially an unsupported claim,
but the way it's written we are presupposing that claim to be true
without directly stating it as an observation or assumption.  I think we
might be better served by starting off with a declarative statement like
"DOTS clients can be integrated in a highly diverse set of scenarios and
use cases", and then moving on from that now-explicit assumption/fact to
conclude that any single global threshold will mischaracterize the
traffic for some of those diverse DOTS clients.

Section 4.3

   DOTS clients can also use CoAP Block1 Option in a PUT request (see
   Section 2.5 of [RFC7959]) to initiate large transfers, but these
   Block1 transfers will fail if the inbound "pipe" is running full, so
   consideration needs to be made to try to fit this PUT into a single
   transfer, or to separate out the PUT into several discrete PUTs where
   each of them fits into a single packet.

If I understand/recall correctly, the Block1 transfer is expected to
fail only in a statistical sense, since the client can't send more
blocks until it gets the positive reply from the server to continue to
the next block (on a block-by-block basis), and that reply from the
server is competing with the attack traffic for the inbound "pipe" and
likely to fail for at least one of the blocks.  I think we should
probably reword slightly to say only "expected to fail" or "likely
fail", since there is some small chance of success; we may also want to
give a brief reminder about why it is expected to fail (e.g., "the
transfer requires a message from the server for each block, which would
likely be lost in the incoming flood").
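To put a rough number on "statistical", here is a deliberately simplified model: each per-block reply from the server is lost independently with the same probability, and CoAP retransmissions are ignored.  Both simplifications are assumptions made for the sketch, so the result is pessimistic compared to a real exchange.

```python
def block1_success_probability(num_blocks, reply_loss_rate):
    """Chance that every per-block reply arrives.

    Assumes independent losses and no retransmission of lost messages.
    """
    return (1.0 - reply_loss_rate) ** num_blocks
```

With half of the inbound replies lost, a ten-block transfer completes less than one time in a thousand under this model, whereas a single-packet PUT still gets through half the time.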

Section 6

   Telemetry setup configuration is bound to a DOTS client domain.  DOTS
   servers MUST NOT expect DOTS clients to send regular requests to
   refresh the telemetry setup configuration.  Any available telemetry
   setup configuration has a validity timeout of the DOTS association
   with a DOTS client domain.  [...]

The term "DOTS association" does not seem to have been used previously
(in RFC 9132 we discuss the "DOTS session" heavily, though).  Also, I
don't remember any previous requirement to keep state on the server for
the duration of anything scoped to the entire client *domain*, just
individual DOTS sessions.  We do mention detecting conflicts/overlapping
requests within the scope of a client domain, in RFC 9132, but as far as
I can tell that only holds when all the DOTS sessions involved in the
conflict are still active.

Perhaps related, this discussion (at least so far) is not clear to me
about what level of coordination/consistency is expected between clients
in the same domain.  I believe that stock DOTS works fine with no such
coordination amongst clients within a domain, so if we are introducing such
a requirement we should say prominently that it's a change.

Section 6.1.1

   Upon receipt of such request, and assuming no error is encountered by
   processing the request, the DOTS server replies with a 2.05 (Content)
   response that conveys the current and telemetry parameters acceptable
   by the DOTS server.  [...]

(editorial) Something seems off, here (around "current and telemetry
parameters acceptable").  Is it returning current configuration, acceptable
parameter values, or some combination thereof?

          |  |     +-- query-type*            query-type

Since this is a leaf-list of supported query types, should the list name
include the word "supported" or similar?

Section 6.1.2

   The PUT request with a higher numeric 'tsid' value overrides the DOTS
   telemetry configuration data installed by a PUT request with a lower
   numeric 'tsid' value.  To avoid maintaining a long list of 'tsid'
   requests for requests carrying telemetry configuration data from a
   DOTS client, the lower numeric 'tsid' MUST be automatically deleted
   and no longer be available at the DOTS server.

The way this is phrased with "the lower" and "higher" (vs "highest") assumes
or implies that there is only one "lower" 'tsid' value, perhaps leaving some
ambiguity if there is actually more than one.  We might make a statement
about how there is only at most one active 'tsid' per 'cuid'+'cdid' at a
time other than during a config change (if that's the intent), or
alternatively to qualify that the requirement to remove is only incurred in
the event of a conflict (as is done for other types of config, later).
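The first reading (at most one active 'tsid' per 'cuid'+'cdid') would amount to something like the toy store below.  The class, its method names, and the choice to ignore a PUT carrying a lower 'tsid' are assumptions for illustration; the draft does not specify any of them.

```python
class TelemetrySetupStore:
    """Toy server-side store keeping at most one 'tsid' per (cuid, cdid)."""

    def __init__(self):
        self._active = {}  # (cuid, cdid) -> (tsid, config)

    def put(self, cuid, cdid, tsid, config):
        current = self._active.get((cuid, cdid))
        if current is not None and tsid < current[0]:
            return False  # lower tsid than the active one: ignored in this sketch
        # An equal tsid updates in place; a higher tsid replaces (and thereby
        # deletes) whatever was installed under the older value.
        self._active[(cuid, cdid)] = (tsid, config)
        return True

    def active_tsid(self, cuid, cdid):
        entry = self._active.get((cuid, cdid))
        return None if entry is None else entry[0]
```

If that is the intended model, there is never more than one "lower" value to delete, and the text could simply say so.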

   o  If the request is missing a mandatory attribute, does not include
      'cuid' or 'tsid' Uri-Path parameters, or contains one or more
      invalid or unknown parameters, 4.00 (Bad Request) MUST be returned
      in the response.

Just to confirm: this does not rule out the ability to define new
parameters in the future (for example, the client might learn of new
ones in the response to a GET request)?

Section 6.3

      *  The maximum number of requests allowed per second to the
         target.

Should we say anything about the requirement on the protocol in question
that "request" is a meaningful concept and observable by the mitigator
(analogous to what we have about "embryonic connections" earlier)?

          |           +-- baseline* [id]
          |              +-- id
          |              |       uint32
          |              +-- target-prefix*
          |              |       inet:ip-prefix

(nit) the formatting here is a bit surprising, to wrap the line between
leaf name and type.  But if that's what pyang gives, we probably don't
want to mess with it...

          |              |  +-- partial-request-ps?          uint64
          |              |  +-- partial-request-client-ps?   uint64

I wonder whether the limit on "partial requests" should really be a rate
("per second") versus a point-in-time cap (i.e., "outstanding partial
requests").  It seems like a given request could in principle remain in
"partial" state for an extended period of time, and that having remained
in such a state for a second should not justify the client being able to
produce more partial requests...but the current formulation as a rate seems
to do so.

Section 6.3.1

   Two PUT requests from a DOTS client have overlapping targets if there
   is a common IP address, IP prefix, FQDN, URI, or alias-name.  Also,
   two PUT requests from a DOTS client have overlapping targets if the
   addresses associated with the FQDN, URI, or alias are overlapping
   with each other or with 'target-prefix'.

There can be some subtlety here involving where the FQDN/URI/alias are
resolved into IP addresses from; we may want to say "from the perspective of
the server", "as observed by the server", or similar.
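Once names and aliases have been resolved (by whichever party the text ends up naming), the overlap test itself is simple.  A minimal sketch using Python's standard ipaddress module follows; the function name is hypothetical and the resolution step is deliberately left out.

```python
import ipaddress


def prefixes_overlap(prefixes_a, prefixes_b):
    """Return True if any prefix in one request overlaps any prefix in the other.

    Inputs are the addresses/prefixes as already resolved by the server.
    """
    nets_a = [ipaddress.ip_network(p) for p in prefixes_a]
    nets_b = [ipaddress.ip_network(p) for p in prefixes_b]
    return any(
        a.version == b.version and a.overlaps(b)
        for a in nets_a
        for b in nets_b
    )
```

Two requests that name the same FQDN could come out as overlapping or not depending on who does the resolution and when, which is why the perspective matters.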

Section 7

   DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data with
   mitigation requests relying upon the target attribute.  In

(nit??) Is this "target attribute" a telemetry attribute, or something
else (an attack target?)?

   When generating telemetry data to send to a peer, the DOTS agent MUST
   auto-scale so that appropriate unit(s) are used.

(editorial) This requirement is not terribly tightly connected to either
what comes after it in the text or what comes before it in the text.  We
might consider moving it (earlier, maybe?) and adding a bit more
transition phrasing about how the agent may send telemetry data many
times and in different situations, so the right unit to use will vary
over time.

Section 7.1.1

   If the target is subjected to bandwidth consuming attack, the
   attributes representing the percentile values of the 'attack-id'
   attack traffic are included.

   If the target is subjected to resource consuming DDoS attacks, the
   same attributes defined for Section 7.1.4 are applicable for
   representing the attack.

One of these just says "the attributes representing" and the other
references attributes defined in a specific section; can we use the same
formulation/phrasing to talk about the two cases?

   This is an optional attribute.

Er, which one is optional?  Just a few paragraphs earlier we said that
at least one attribute MUST be present (so as to be able to identify a
target).

Section 7.1.2

   The 'total-traffic' attribute (Figure 26) conveys the percentile
   values (including peak and current observed values) of total traffic
   observed during a DDoS attack.  [...]

This attribute is under the "pre-or-ongoing-mitigation" hierarchy; is it
really right to only say "traffic observed during a DDoS attack" (i.e.,
implicitly excluding the "pre-attack" case)?

Section 7.1.3

   The 'total-attack-traffic' attribute (Figure 27) conveys the total
   attack traffic identified by the DOTS client domain's DDoS Mitigation
   System (or DDoS Detector).  [...]

(editorial?) "Identified by [the client domain]" seems to imply that
this is only used in telemetry reports from client to server, and not in
updates flowing the other direction.  Is that correct?

Section 7.1.4

                +-- total-attack-connection
                |  +-- low-percentile-l* [protocol]
                |  |  +-- protocol              uint8
                |  |  +-- connection?           yang:gauge64
                |  |  +-- embryonic?            yang:gauge64
                |  |  +-- connection-ps?        yang:gauge64
                |  |  +-- request-ps?           yang:gauge64
                |  |  +-- partial-request-ps?   yang:gauge64

I think I'm confused about what semantics this is trying to represent.
The "low-percentile-l" seems like it should be providing aggregate
information about some particular snapshot of traffic that we have
determined to be representative of the "low percentile" level of attack
traffic.  That is, if we take the recorded attack traffic and divide it
into a bunch of bins on the time axis, we can order the respective bins
from "least noticeable attack" to "worst attack", and we pick the fifth
percentile (or whatever is configured) bin to represent the
"low-percentile" bucket.  But now we're reporting a bunch of attributes
of that bin -- number of connections, embryonic connections, connections
per second, etc., even though the bins were ordered based on a single
"worst" metric (and I don't see us specify what metric we actually use
to do that ordering).  So even though this is the "low percentile" (e.g.,
fifth percentile) bin of attack traffic (i.e., probably not a very bad
attack), we can't say specifically that the reported total attack
connection count is fifth percentile and the embryonic connection count
is fifth percentile and the requests-per-second is also fifth
percentile.  Which makes me confused at why the list is structured this
way -- if we were just saying "for each of connection, embryonic, etc., bin
the samples over that metric and compute the low/med/high percentile values
for that single metric, and this list holds those resulting values", it
seems like a more natural structure to group things would be to have a list
of low/med/high for each of the attributes/metrics.
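The difference between the two readings shows up even on a tiny made-up data set.  In the sketch below the bin values, the 20th-percentile threshold, and the nearest-rank rule are all assumptions chosen to make the point; none of them come from the draft.

```python
def nearest_rank(values, p):
    """Nearest-rank percentile (p in 1..100) of an unsorted list."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical per-bin observations: (connections, embryonic connections).
bins = [(100, 9), (200, 1), (300, 5), (400, 7), (500, 3)]

# Reading 1: compute the percentile of each metric independently.
per_metric_low = (
    nearest_rank([c for c, _ in bins], 20),
    nearest_rank([e for _, e in bins], 20),
)

# Reading 2: order whole bins by one metric, report the bin at that rank.
ranked_by_connections = sorted(bins, key=lambda b: b[0])
rank = max(1, (20 * len(bins) + 99) // 100)
single_bin_low = ranked_by_connections[rank - 1]
```

Here the per-metric reading reports (100, 1), which is not any bin that was actually observed, while the single-bin reading reports (100, 9), whose embryonic count is the highest in the set rather than a low percentile.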

Section 7.1.5

   vendor-id:  Vendor ID is a security vendor's Enterprise Number as
      registered with IANA [Enterprise-Numbers].  It is a four-byte
      integer value.

   attack-id:  Unique identifier assigned for the attack.

If the vendor-id is being used as a scoping value to let each vendor
assign attack-id values, then we should say so here, the first place we
really handle this part of the YANG tree.  Even if not, we should probably
say more about why we care what the vendor-id is, and we should also say who
assigns the attack-id.  (Maybe it's scoped to the DOTS server; I don't know
yet at this point in the document!)

Furthermore, the module structure where attack-id and vendor-id are
always required, but there is going to need to be the ability to assign
a new attack description at runtime, seems to force us to conflate
externally/statically assigned attack-ids and dynamically assigned ones
in the same number space.  Should we give any guidance to vendors about
how to allocate IDs in a way that will not produce collisions between
predistributed/fixed attack IDs and new ones created at runtime?  Or is the
thought that there will be no dynamic attack-ids since that just corresponds
to the attack-description string and a generic description can be used if
there is not a better one available?

Regardless, I think we should have more verbiage as to what the scope of the
attack-id is -- what information is it supposed to indicate or replace?

   start-time:  The time the attack started.  The attack's start time is
      expressed in seconds relative to 1970-01-01T00:00Z in UTC time

(nit) "in UTC time" and "Z" seem redundant (which does not necessarily
imply that we should only use one of them...).

      (Section 3.4.2 of [RFC8949]).  The CBOR encoding is modified so
      that the leading tag 1 (epoch-based date/time) MUST be omitted.

(nit) Maybe we should move the CBOR section reference to after we start
talking about CBOR encoding.  (And we should do the same for 'end-time',
if we do anything.)

   top-talker:  A list of top talkers among attack sources.  The top
      talkers are represented using the 'source-prefix'.

Is there a good reference or short description for the notion of
top-talkers?  I know it's a well-established term of art in many different
circles, but it's hard to be fully confident that all readers will be
familiar with it.

Also, should we say anything about how the number of talkers to include
is determined?  (I assume it is best left to the discretion of the
sender, but we probably want to set a clear expectation that the list is
not all talkers or a complete list in any other way.)

      'spoofed-status' indicates whether a top talker is a spoofed IP
      address (e.g., reflection attacks) or not.

Is this something that can be unambiguously determined by all parties
that might be producing this telemetry data, or should we use a
tri-state (for "unknown") rather than a boolean to represent it?

   In order to optimize the size of telemetry data conveyed over the
   DOTS signal channel, DOTS agents MAY use the DOTS data channel
   [RFC8783] to exchange vendor specific attack mapping details (that
   is, {vendor identifier, attack identifier} ==> attack description).
   As such, DOTS agents do not have to convey systematically an attack
   description in their telemetry messages over the DOTS signal channel.

It's a little surprising to me that this is only a MAY.  Does it make
sense as a SHOULD (expected to usually happen, but leaving open the
flexibility when an observed attack does not match a previously
characterized attack description)?

Also, when reading this I assumed that the "attack description" being
mapped to could include things like the target port/protocol, whether it
was spoofed, etc.  But reading on to the ietf-dots-mapping module, it
seems that this is intended to just be the single "attack-description"
string.  If so, we might give some more indications of that, e.g., by
spelling it that way and having some prose about it being a "string", or
similar.
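Under that reading, the mapping is nothing more than a lookup table from the identifier pair to a string.  A sketch, with made-up attack-id values and descriptions (32473 is the documentation enterprise number from RFC 5612):

```python
# Hypothetical table: {vendor identifier, attack identifier} ==> description.
ATTACK_MAPPING = {
    (32473, 1): "Example attack description A",
    (32473, 2): "Example attack description B",
}


def describe(vendor_id, attack_id):
    """Return the attack-description string, or None if the pair is unknown."""
    return ATTACK_MAPPING.get((vendor_id, attack_id))
```

Nothing about ports, protocols, or spoofing lives in the table under this interpretation; those would still have to travel in the telemetry message itself.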

Also^2, from here to the end of the section might benefit from being
split off into a dedicated subsection on using the data channel to
pre-populate vendor and attack description information.

   tables are at different revisions.  The DOTS client SHOULD transmit
   telemetry information using the vendor mapping(s) that it provided to
   the DOTS server and the DOTS server SHOULD use the vendor mappings(s)
   provided to the DOTS client when transmitting telemetry data to peer
   DOTS agent.

Having read a bit further on, I'm not actually sure where in the
protocol the client has provided vendor mappings to the server, and the
server provided vendor mappings to the client.  It *might* be the GET
and POST from/to dots-data/ietf-dots-mapping:vendor-mapping, but we
don't really give a clear indication of that.  I think we should try to
be more precise about what we mean by "provided".

(editorial) would this then be "SHOULD use any vendor mapping(s)"
(s/the/any/)?

     augment /data-channel:dots-data/data-channel:capabilities:
       +--ro vendor-mapping-enabled?   boolean {dots-telemetry}?

None of the other capabilities in this tree (as specified by RFC 8783)
include a name suffix like "-enabled".  Should this just be
"vendor-mapping"?

           "vendor-id": 1234,
           "vendor-name": "mitigator-s",
           "last-updated": "1576856561",
           "attack-mapping": []

[this last updated is from December 2019; we could probably pick
something more current if we wanted, but it doesn't really matter.]

   The DOTS client reiterates the above procedure regularly (e.g., once
   a week) to update the DOTS server's vendor attack mapping details.

Do the updates have to preserve any existing assignments?

   If the DOTS client concludes that the DOTS server does not have any
   reference to the specific vendor attack mapping details, the DOTS
   client uses a POST request to install its vendor attack mapping
   details.  [...]

This suggests that a client would not bother sending its own vendor
mapping if the server already has one, and at least to me further
implies that the client would use the server-provided mapping for the
messages that the client generates.  This seems at odds with the earlier
guidance that the client should send using the vendor-mapping it
provided to the server, and the server should send using the
vendor-mapping it provided to the client.  Which one is it?

   The DOTS server indicates the result of processing the POST request
   using the status-line.  Concretely, "201 Created" status-line MUST be
   returned in the response if the DOTS server has accepted the vendor
   attack mapping details.  If the request is missing a mandatory
   attribute or contains an invalid or unknown parameter, "400 Bad
   Request" status-line MUST be returned by the DOTS server in the
   response.  [...]

I think (but did not specifically check) that these specific status code
values are preexisting requirements of core RESTCONF (and HTTP), so we
may not need to write new "MUST" keywords.

   If the request is received via a server-domain DOTS gateway, but the
   DOTS server does not maintain a 'cdid' for this 'cuid' while a 'cdid'
   is expected to be supplied, the DOTS server MUST reply with "403
   Forbidden" status-line and the error-tag "access-denied".  Upon
   receipt of this message, the DOTS client MUST register (Section 5.1
   of [RFC8783]).

(nit) the analogous text in RFC 8783 itself refers only to "Section 5"
(not 5.1).

   The DOTS client uses the PUT request to modify its vendor attack
   mapping details maintained by the DOTS server (e.g., add a new
   mapping).

As above, is this really just an example behavior or a hard requirement
to preserve existing mappings?

Section 7.2

           "attack-detail": [
             {
               "vendor-id": 1234,

The vendor-ID is supposed to be an IANA-assigned PEN, and 1234 is
assigned to Linkage Software Inc.  We have other options, like 32473,
"Example Enterprise Number for Documentation Use" (RFC 5612).

               "start-time": "1957811234",

This value represents a time in 2032; should we be using something a
little more current?
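
For anyone who wants to double-check the example timestamps, here is a
minimal sketch.  It assumes the quoted strings are seconds since the Unix
epoch, interpreted as UTC; the helper name epoch_to_utc is purely
illustrative and not anything from the draft.

```python
from datetime import datetime, timezone


def epoch_to_utc(value: str) -> datetime:
    """Interpret a decimal string as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Example values quoted from the draft's figures.
examples = [
    ("last-updated", "1576856561"),
    ("start-time", "1957811234"),
    ("mitigation-start", "1507818434"),
]

for label, value in examples:
    # Print each field name next to its ISO 8601 rendering.
    print(label, epoch_to_utc(value).isoformat())
```

The same helper can be run the other way around when picking replacement
values, by choosing a date and checking that it round-trips.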

Section 7.3

   This request MUST be maintained active by the DOTS server until a
   delete request is received from the same DOTS client to clear this
   pre-or-ongoing-mitigation telemetry.

It seems like we might want to allow some provision for a server to
clean up state associated with a client that disappears entirely without
cleaning itself up.  For example, is the server allowed to discard state
when it reboots, or does this requirement extend to writing the state to
persistent storage?

      If more than one Uri-Query option is included in a request, these
      options are interpreted in the same way as when multiple target
      attributes are included in a message body.

I wonder a little bit whether we could put a section reference here for
"interpreted in the same way", since the reader may not remember what
that way is, at least on the first reading.

      parameters.  DOTS clients MUST NOT include a name in which the "*"
      character is included in a label other than the leftmost label.

Do we want to say what the server should do if it gets one anyway?
(It's not really clear that we need to say anything, strictly speaking.)
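
If we did want a server-side check, it could be as small as the sketch
below.  This is only an illustration of the quoted rule under my reading
of it (a "*" may appear in the leftmost label only); the function name is
hypothetical, and what the server does on failure is exactly the open
question above.

```python
def wildcard_only_in_leftmost_label(name: str) -> bool:
    """Return True if "*" appears in no label other than the leftmost one."""
    labels = name.split(".")
    # Skip the leftmost label; every remaining label must be free of "*".
    return all("*" not in label for label in labels[1:])


print(wildcard_only_in_leftmost_label("*.example.com"))
print(wildcard_only_in_leftmost_label("www.*.example.com"))
```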

               "vendor-id": 1234,
               "attack-id": 77,
               "start-time": "1957818434",

[same as above]

   A DOTS client that is not interested to receive pre-or-ongoing-
   mitigation telemetry data for a target MUST send a delete request
   similar to the one depicted in Figure 37.

I'm not sure that this strictly needs to be a "MUST"; it seems like we
could say that such a client "sends a delete request" and leave it at
that.

Section 8.1

       +-- attack-detail* [vendor-id attack-id]
       [...]
          +-- top-talker
             +-- talker* [source-prefix]
             [...]
                +-- total-attack-connection
                   +-- low-percentile-c
                   |  +-- connection?           yang:gauge64

I'm kind of surprised that there is no 'protocol' leaf in this subtree;
it seems to usually appear in combination with these connection counts.

          |  +-- peak-g?              yang:gauge64
          |  +-- current-g?              yang:gauge64
[...]
                |  +-- peak-g?              yang:gauge64
                |  +-- current-g?              yang:gauge64

(nit) the indentation seems off for the type of these two 'current-g's.

   In order to signal telemetry data in a mitigation efficacy update, it
   is RECOMMENDED that the DOTS client has already established a DOTS
   telemetry setup session with the server in 'idle' time.

If I understand correctly, it seems that at least part of the need for
having a preestablished telemetry session is that the mitigation
efficacy update is identified only by the 'mid', and so we cannot
specify a 'tmid' as part of such an efficacy update.  Thus, in order
for the server to associate the telemetry data in the efficacy update
with an ongoing telemetry status session, the 'tsid' must be associated
with the 'mid' that is the subject of the efficacy update.
If that is correct, then I think we should have some text here about how
the efficacy update does not/cannot include the 'tmid', and so is
associated with the 'mid' of the update.

Section 8.2

   As defined in [RFC8612], the actual mitigation activities can include
   several countermeasure mechanisms.  The DOTS server signals the
   current operational status of relevant countermeasures.  A list of
   attacks detected by each countermeasure MAY also be included.  The

I see how in stock DOTS signal channel, the server signals whether an
attack is fully/partially/not mitigated, but I don't see how the server
indicates the specific relevant countermeasures that were used.  Am I
just missing something there?  (Similarly, the list of attacks we
returned may or may not be at the granularity of "detected by each
countermeasure", depending on the answer to the previous part.)

Also (editorial), I think we should be more clear about when we switch
from describing RFC 9132 behavior to describing the new functionality
provided by this document.

   Figure 46 shows an example of an asynchronous notification of attack
   mitigation status from the DOTS server.  This notification signals

Is the "new stuff" in Figure 46 just the subtree under
:(server-to-client-only)?  The 'total-attack-traffic' leaf seems to be
the same content as in Figure 45, and maybe could be omitted?

           "mitigation-start": "1507818434",

(This date is from 2017; as above, we could use a more current one if we
want.)

               "source-count": {
                 "peak-g": "10000"

(Suspiciously round numbers like this are indications of fabricated data
... which is totally reasonable for an example like this!  But if we
wanted to be "more realistic" we could use a less-round number.)

   DOTS clients can filter out the asynchronous notifications from the
   DOTS server by indicating one or more Uri-Query options in its GET
   request.  A Uri-Query option can include the following parameters:
   'target-prefix', 'target-port', 'target-protocol', 'target-fqdn',
   'target-uri', 'alias-name', and 'c' (content) (Section 4.2.4).  The

Up in §7.3 the analogous list also included 'mid'.  Is 'mid' appropriate
here (or is it implicitly already included by the base signal channel
mechanisms)?  Hmm, but maybe 'c' is also already allowed by the base
signal channel mechanism, so that theory is not great...

   If the target query does not match the target of the enclosed 'mid'
   as maintained by the DOTS server, the latter MUST respond with a 4.04
   (Not Found) error Response Code.  The DOTS server MUST NOT add a new
   observe entry if this query overlaps with an existing one.

How should the server respond if this query overlaps with an existing
one?
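
To make the question concrete, here is a rough sketch of what an overlap
test might look like if "overlaps" is taken to mean overlapping
'target-prefix' values.  That interpretation is my assumption; the quoted
text does not define overlap, and the other query parameters are ignored
here.  The function name is hypothetical.

```python
import ipaddress


def overlaps_existing(new_prefix: str, existing_prefixes: list[str]) -> bool:
    """Return True if new_prefix overlaps any already-observed prefix."""
    new_net = ipaddress.ip_network(new_prefix)
    for prefix in existing_prefixes:
        net = ipaddress.ip_network(prefix)
        # Only compare prefixes of the same IP version.
        if net.version == new_net.version and net.overlaps(new_net):
            return True
    return False


observed = ["192.0.2.0/24", "2001:db8::/32"]
print(overlaps_existing("192.0.2.128/25", observed))
print(overlaps_existing("198.51.100.0/24", observed))
```

Whatever the test is, the draft would still need to say which response
code the server returns when it fires.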

Section 10.1

   This module uses types defined in [RFC6991] and [RFC8345].

Should we mention the data structure extension from RFC 8791 here, as
well?

  typedef attack-severity {
    type enumeration {
      enum none {
        value 1;
        description
          "No effect on the DOTS client domain.";

It seems like the closest analogue to our attack-severity typedef in
RFC 7970 is in §3.12.2, "BusinessImpact Class". But the phrasing used in
RFC 7970 is slightly different than the descriptions we have here.
Do we want to tweak our phrasings and/or clarify in the typedef
description that we use a slightly modified formulation?  (Indeed, we do
not have an extension point, either, though 7970 does have an extension
mechanism.)  Also, a section reference within RFC 7970 might be helpful.

      enum kilopacket-ps {
        value 4;
        description
          "Kilo packets per second (kpps).";

We may get someone who asks if we are using 1000 or 1024 ("kibi"), but
I'm inclined to leave it alone for now.

      leaf unit-status {
        type boolean;
        default true;
        description
          "Enable/disable the use of the measurement unit class.";

Does the "default true" imply that even unit classes not present in
"list unit-config" are assumed to be supported by default?

  grouping connection {
    description
      "A set of data nodes which represent the attack
       characteristics.";
    [...]
    leaf embryonic {
      type yang:gauge64;
      description
        "The number of simultaneous embryonic connections to
         the target server.";

It's a little interesting that this is the only leaf in the grouping
that we can't attach the word "attack" to the description of, but as far
as I can tell we shouldn't make any change here.
(I originally wrote a comment that we shouldn't use "attack" for any of
them, since we have analogous nodes in the total-connection-capacity
config entries, but those seem to not use this grouping.)

    list talker {
      key "source-prefix";
      description
        "IPv4 or IPv6 prefix identifying the attacker(s).";

My read of the YANG is that this is the description for "list talker",
but the content matches the description given for the "source-prefix" leaf
within the list.  Should the list itself have a different description?

  grouping top-talker-aggregate {
  [...]
  grouping top-talker {

The diff between these two is pretty small -- just the top-level
description and connection-all vs connection-protocol-all.  Is there any
value in introducing another grouping to hold the common elements?

            container max-config-values {
              description
                "Maximum acceptable configuration values.";
              uses telemetry-parameters;

I guess the convenience of having the reusable grouping probably
outweighs the value of having YANG-level constraints that the 'max'
values are >= the 'min' values (which we can set for the
"telemetry-notify-interval" that is not part of "telemetry-parameters", but
can't set here because we use the grouping).

              leaf telemetry-notify-interval {
                type uint32 {
                  range "1 .. 3600";

(Pedantically, this range would fit in a uint16, though maybe there is
some other reason to prefer the 32-bit type.)

            case baseline {
              [...]
              list baseline {
                [...]
                leaf id {
                  type uint32;
                  must '. >= 1';
                  description
                    "An identifier that uniquely identifies a baseline
                     entry communicated by a DOTS client.";

I wonder a little whether we should have a bit of prose up in §6.3
discussing the need for "id" and how it is used.

              leaf tmid {
                type uint32;
                description
                  "An identifier to uniquely demux telemetry data sent
                   using the same message.";

The description for "tsid" in the analogous setup was just "an
identifier for the DOTS telemetry setup data".  As far as I can tell,
the usage of the two is analogous, so shouldn't the descriptions be
analogous as well?  In particular, I am not sure that "uniquely demux" is
the only usage of this leaf, since in server-initiated telemetry messages,
this leaf might be needed to indicate which telemetry entry is being
described (right?).

            leaf-list mid-list {
              type uint32;
              description
                "Reference a list of associated mitigation requests.";

Should we reference RFC 9132 somehow to indicate that the base signal
channel is how these "mid" values are assigned (and assigned meaning)?

Section 10.2

         description
           "Vendor ID is a security vendor's Enterprise Number.";

Should we say something about "IANA-assigned" and/or link to the
corresponding registry?

Section 11, 12.1

A handful of the entries in the tables still have an
"ietf-dots-telemetry:" prefix, and I don't understand why only some of
the entries would need it but not others.

In particular, there seems to be an entry for both "total-traffic" and
"ietf-dots-telemetry:total-traffic", and I don't understand how they are
different.  (I did not check if there are other "duplicate"s like that.)

Section 11

  | telemetry            | container   |TBA2  | 5 map         | Object |

I see both a "list telemetry" and a "case telemetry" in the YANG module,
but I'm not sure which of them is supposed to correspond to the YANG
type "container".
https://datatracker.ietf.org/doc/html/rfc7950#section-7.9.2 does not
really give me the impression that there is an implicit container of
this name.

  | baseline             | container   |TBA49 | 5 map         | Object |

Similarly, I see a 'baseline' that's a list, as the only content of a
"case baseline" statement.

Section 12.1

   Note that 'lower-type' and 'upper-type' are also requested for
   assignment in the call-home I-D.  Both I-Ds should be sync'ed as
   depending the one that will make it first to the IANA.

(I think we tweaked call-home already to account for the registration
requests being from different ranges between the two documents, but
mention it just in case I'm misremembering.)

Section 12.3

            URI: urn:ietf:params:xml:ns:yang:ietf-dots-mapping
            Registrant Contact: The IESG.
            XML: N/A; the requested URI is an XML namespace.

I wonder if this (module) name is perhaps somewhat more generic than the
functionality it provides.

Section 13

Should we mention the security considerations for the blockwise transfer
technologies?

Looking through the YANG tree, the only thing that really sticks out as
potentially worth mentioning here is the server-originated-telemetry
option.  But even for that, I'm not sure that there's much to say.

   The DOTS telemetry information includes DOTS client network topology,
   DOTS client domain pipe capacity, normal traffic baseline and
   connections capacity, and threat and mitigation information.  Such
   information is sensitive; it MUST be protected at rest by the DOTS
   server domain to prevent data leakage.

I'd consider adding a sentence or two here noting that even though this
data is sensitive, sending it explicitly to the DOTS server does not
introduce any new significant considerations (other than the need for
protection at rest) because the DOTS server is already trusted to have
access to that kind of information by being in the position to mitigate
(and observe) attacks.

Section 16.1

A normative reference on draft-ietf-dots-signal-filter-control will
cause a document cluster, delaying publication until that document is
ready to be an RFC.  It's not clear to me that such a dependency is
needed in this case, since there is only one citation to that document
and it seems more descriptive than imposing a strong dependency.

Similarly, we seem to cite RFC 7641 just as a statement of fact (clients
can use CoAP OBSERVE), which may not require it to be classified as a
normative reference.

Section 16.2

The current state of this document doesn't really include enough detail
about percentiles and their calculation to be implementable without
referring to RFC 2330, currently listed only as an informative reference.
I would prefer to add more text to this document rather than promote
2330 to a normative reference, though I think we would need more text
than I suggested above in order to achieve that effect.
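
As a rough indication of how much would need to be said, here is a
minimal sketch of the percentile definition as I understand it from
RFC 2330 (Section 11.3): the percentile is the smallest value x for which
the empirical distribution function F(x), the fraction of measurements
that are <= x, is at least the given percentage.  The function name and
the input validation are my own, not from either document.

```python
def percentile(samples: list[float], pct: float) -> float:
    """Smallest x among samples such that F(x) >= pct/100."""
    if not samples:
        raise ValueError("need at least one measurement")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    n = len(ordered)
    for i, x in enumerate(ordered):
        # (i + 1) / n is the fraction of measurements <= x at this position;
        # compare without dividing to avoid rounding surprises.
        if (i + 1) * 100 >= pct * n:
            return x
    return ordered[-1]


measurements = [-2, 7, 7, 4, 18, -5]
print(percentile(measurements, 50))
```

With those six measurements the 50th percentile comes out as 4, since
three of the six values are <= 4 and no smaller value reaches 50%.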

We say we assume familiarity with RFC 8612 at least for terminology,
which may indicate that it is best classified as a normative reference.


Thanks,

Ben