Re: [Dots] AD review of draft-ietf-dots-telemetry-16

Benjamin Kaduk <kaduk@mit.edu> Tue, 28 December 2021 15:40 UTC
Date: Tue, 28 Dec 2021 07:40:33 -0800
From: Benjamin Kaduk <kaduk@mit.edu>
To: tirumal reddy <kondtir@gmail.com>
Cc: draft-ietf-dots-telemetry.all@ietf.org, dots@ietf.org
Message-ID: <20211228154033.GU11486@mit.edu>
References: <20211124224140.GH93060@kduck.mit.edu> <CAFpG3gfBSarzE7aPm6JM0h1LJHt=DwfKXnaRUGosU6_=VY150A@mail.gmail.com>
In-Reply-To: <CAFpG3gfBSarzE7aPm6JM0h1LJHt=DwfKXnaRUGosU6_=VY150A@mail.gmail.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/v-7Svr2oJZMKVBxcVB-7Eu7yVNE>
Subject: Re: [Dots] AD review of draft-ietf-dots-telemetry-16
Hi Tiru,

Thanks for these (and thanks Med as well for the additional follow-up; I
don't think I have any replies to that part).
A few comments inline...

On Tue, Nov 30, 2021 at 12:59:02PM +0530, tirumal reddy wrote:
> Hi Ben,
> 
> Thanks for the detailed review. Please see inline for responses till
> Section 7.1.4.
> 
> On Thu, 25 Nov 2021 at 04:12, Benjamin Kaduk <kaduk@mit.edu> wrote:
> 
> > Hi all,
> >
> > Sorry to have been working on this for so long -- some health issues
> > intervened, and it is perhaps easier to let oneself be interrupted if
> > there is no hope of finishing in a single sitting :-/
> >
> > [Note that I reviewed the -16 but the -17 is the current version, which
> > includes some editorial work I had sent in a PR.  It looks like that
> > will not have changed many things I comment on.]
> >
> > This review is pretty long (which I guess befits a long document).  We
> > probably want to chunk up replies to it so that the emails/threads stay
> > manageable.
> >
> > A couple meta-comments on the other reviews so far:
> >
> > - the shepherd writeup only answers one of the three questions in point
> >   (1).  It is likely that Murray will point this out in his ballot if
> >   not updated before then.
> >
> > - it's a little surprising that the yangdoctor didn't ask for encoded
> >   examples of the RESTCONF (data channel) functionality.  (For the
> >   sx:structure parts, I think we're already in good shape on examples.)
> >   That said, I don't think our use of RESTCONF is particularly novel, and
> >   am happy to proceed without such examples if desired.
> >
> > High-level remarks before going into the detailed section-by-section
> > comments:
> >
> > I'm happy to see the note at the end of §4.5 that the data channel is
> > available to optimize the data that needs to be exchanged over the signal
> > channel during attack-time, in some sense reiterating the original split
> > between signal and data channels (but see my note in the inline comments
> > about the "MAY").  But this is already rather far into the design overview,
> > let alone the document overall!  I think it would be helpful to have a
> > paragraph or two in the toplevel §4 to get in front of how the telemetry
> > mechanisms integrate into the signal+data channels, perhaps something like:
> >
> > % The DOTS protocol suite is divided into two logical channels: the signal
> > % channel and data channel.  This division is due to the vastly different
> > % requirements placed upon the traffic they carry.  The signal channel must
> > % remain available and usable even in the face of attack traffic that
> > % might, for example, saturate one direction of the links involved, rendering
> > % acknowledgment-based mechanisms unreliable and strongly incentivizing
> > % messages to be small enough to be contained in a single IP packet.  In
> > % contrast, the data channel is available for high-bandwidth data transfer
> > % before or after an attack, using more conventional transport protocol
> > % techniques.  It is generally preferable to perform advance configuration
> > % over the data channel, including configuring short aliases for static or
> > % nearly static data sets such as sets of network addresses/prefixes that
> > % might be subject to related attacks -- this helps to optimize the use of
> > % the signal channel for the small messages that truly require reliable
> > % delivery during an attack.
> > %
> > % Telemetry information has aspects that correspond to both operational
> > % modes: there is certainly a need to convey updated information about
> > % ongoing attack traffic and targets during an attack, so as to convey
> > % detailed information about mitigation status and inform updates to
> > % mitigation strategy in the face of adaptive attacks, but it is also useful
> > % to provide mitigation services with a picture of normal or "baseline"
> > % traffic towards potential targets, to aid in detecting when incoming
> > % traffic deviates from normal into being an attack.  Likewise, one might
> > % populate a "database" of classifications of known types of attack so that
> > % a short alias can be used during attack time to describe an observed
> > % attack.  This specification does make provision for use of the data
> > % channel for the latter function, but otherwise retains most telemetry
> > % functionality in the signal channel.  This is partly out of necessity and
> > % partly out of expedience -- it is a functional requirement to convey
> > % information about ongoing attack traffic during an attack, and information
> > % about baseline traffic uses an essentially identical data structure that is
> > % naturally defined to sit next to the description of attack traffic.  The
> > % related telemetry setup information used to parametrize actual traffic
> > % data is also sent over the signal channel, out of expediency. [I can't
> > % actually populate this part; maybe it's something about having the
> > % configuration values that define the percentile measurements be in-band
> > % with the percentile measurements themselves, and then a desire to not
> > % define two similar setup schemes over data+signal channels that justifies
> > % the other telemetry-config and pipe branches?]
> >
> 
> Proposed text looks good, we will add the above text to section 4.
> 
> 
> >
> > The use of the "list telemetry" in the telemetry-setup message confused me
> > for a while, since it seems like the client can only send a zero- or
> > one-element list to the server (based on the 'tsid' for that direction being
> > part of the Uri-Path).  Can you confirm that the list structure was used
> > just for the server-to-client direction combined with maximizing
> > data-structure reuse?
> >
> > We should say somewhere that the examples use CBOR presentation format.
> > (This could be a note near the terminology section or accompanying each
> > example.)
> >
> > I think we should probably include a bit of a brief refresher on percentile
> > calculation, in addition to the reference to RFC 2330.  Right now we
> > basically just say that the peers negotiate parameters including
> > "percentile-related measurement parameters", which was not quite enough to
> > clue me in that 2330 was basically required reading to know what we actually
> > do.  So this might look something like:
> >
> > % DOTS telemetry uses several percentile values to provide a picture of a
> > % traffic distribution overall, as opposed to just a single snapshot of
> > % observed traffic at a single point in time.  Modeling raw traffic flow
> > % data as a distribution and describing that distribution entails choosing a
> > % measurement period that the distribution will describe, and a number of
> > % sampling intervals, or "buckets", within that measurement period.  Traffic
> > % within each bucket is treated as a single event (i.e., averaged), and then
> > % the distribution of buckets is used to describe the distribution of
> > % traffic over the measurement period.  A distribution can be characterized
> > % by statistical measures such as mean, median, and standard deviation, and
> > % also by reporting the value of the distribution at various percentile
> > % levels of the data set in question, such as "quartiles" that correspond to
> > % 25th, 50th, and 75th percentile.  More details about percentile values and
> > % their computation are found in Section 11.3 of [RFC2330]; DOTS telemetry
> > % uses up to three percentile values, plus the overall peak, to characterize
> > % traffic distributions.  Which percentile thresholds are used for these
> > % "low", "medium", and "high" percentile values is configurable (with
> > % suitable defaults).
> >
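A rough Python sketch of the bucketing and percentile computation described in the proposed text above. The function names, the bucket size, and the 10/50/90 thresholds are illustrative assumptions and are not taken from the draft. The percentile helper uses a nearest-rank rule (the smallest value with at least the requested share of values at or below it), which is intended to follow Section 11.3 of RFC 2330 but should be checked against it.

```python
import math

def bucket_averages(samples, bucket_size):
    """Average consecutive samples so that each bucket counts as one event."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        chunk = samples[start:start + bucket_size]
        buckets.append(sum(chunk) / len(chunk))
    return buckets

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with >= pct% of values at or below it."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples, bucket_size, low=10, mid=50, high=90):
    """Describe one measurement period by three percentiles plus the peak.

    The low/mid/high defaults are placeholders for whatever thresholds
    the peers have configured.
    """
    buckets = bucket_averages(samples, bucket_size)
    return {
        "low": percentile(buckets, low),
        "mid": percentile(buckets, mid),
        "high": percentile(buckets, high),
        "peak": max(buckets),
    }
```

The peak is taken over the bucket averages here, not over the raw samples; whether that matches the intended semantics is something the draft text would need to settle.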
> 
> Looks good, we will update the draft.
> 
> 
> >
> >
> > It seems like we are setting up somewhat conflicting goals between the
> > overall desire for compact signal-channel messages and our guidance to
> > "include in any request to update information related to [a thing] the
> > information of other [like things] (already communicated using a lower
> > ['tsid'/'tmid'] value)" that results in larger individual messages (albeit
> > fewer of them).  Do we want to make some statement about how this guidance
> > to coalesce can be ignored if needed to make individual signal-channel
> > messages fit in a single packet?  (Hmm, I guess we do have a note about
> > "assumes that all link information can fit in one single message" at
> > least for the 'tsid'/link case, but the text from that note only appears
> > once.)
> >
> > My editorial suggestions in
> > https://github.com/boucadair/draft-dots-telemetry/pull/9 included some edits
> > relating to whether a given target is "susceptible to" vs "subject to" (or
> > even "actively being subjected to") different types of attacks.  Med has
> > diligently already merged that PR while I was sick, so I mostly assume that
> > these already got reviewed for correctness.  Still, I want to call out that
> > there is a difference and I was not always 100% sure that I picked the right
> > one for each case.
> >
> > It looks like we have to use backslash continuation markers for several of
> > the CoAP request target resources in the examples; do we want to reference
> > RFC 8792 folding and use its specific note text about line wrapping?
> >
> > There are a few places where we say something like "the DOTS client MUST
> > auto-scale so that the appropriate unit is used".  (1) We don't specifically
> > say what the goal of this automatic behavior is (I guess, use the "largest
> > unit" that gives a value greater than one?), and (2) it seems that we are in
> > part relying on this behavior to ensure that the client only specifies one
> > measure in a given unit class.  That is, I didn't see anything else that
> > would prevent a client from sending (say) both Mbps and Gbps values, that
> > could of course be in conflict with each other -- if there's a potential for
> > conflict we'd need to say how to resolve the conflict.
> >
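One possible reading of the auto-scale goal guessed at above ("largest unit that gives a value of at least one") can be sketched as follows. The unit labels and the decimal scaling factors are assumptions made for illustration; the draft's actual unit enumeration may differ.

```python
# Hypothetical unit table, largest first; labels are not taken from the draft.
UNITS = [
    ("Tbps", 10**12),
    ("Gbps", 10**9),
    ("Mbps", 10**6),
    ("Kbps", 10**3),
    ("bps", 1),
]

def auto_scale(bits_per_second):
    """Return (value, unit) using the largest unit whose scaled value is >= 1."""
    for name, factor in UNITS:
        if bits_per_second >= factor:
            return bits_per_second / factor, name
    # Values below 1 bps (including zero) stay in the base unit.
    return bits_per_second, "bps"
```

Because the function returns exactly one (value, unit) pair, a sender built this way cannot emit both an Mbps and a Gbps value for the same measurement, which is the property the text seems to rely on.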
> > Relatedly, there seems to *also* be potential for conflict in that we allow
> > both bit/second and Byte/second as unit-classes.  What is to happen if I say
> > that something is both 2 bit/second and 2 Byte/second?

Med's follow-up implies that we are to just treat that as "invalid
parameters".  I'd prefer to be more explicit about that, e.g., with a
statement like "at most one of the bit/second and Byte/second unit-class
entries can be present; if such conflicting values are received the
combination is treated as specifying an illegal parameter and rejected with
4.00 (Bad Request)".
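A minimal sketch of the check that such a statement would imply on the receiving side. The dictionary layout, the 'unit-class' key, and the class labels are hypothetical stand-ins for the decoded message; only the 4.00 (Bad Request) outcome comes from the text above.

```python
BAD_REQUEST = "4.00"

# Hypothetical labels for the two mutually exclusive unit classes.
CONFLICTING_CLASSES = {"bit-ps", "byte-ps"}

def check_unit_classes(entries):
    """Return BAD_REQUEST if both bit/second and Byte/second entries are present.

    `entries` is assumed to be a list of dicts, each with a 'unit-class' key.
    Returns None when the combination is acceptable.
    """
    present = {entry["unit-class"] for entry in entries} & CONFLICTING_CLASSES
    if len(present) > 1:
        return BAD_REQUEST
    return None
```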

> > We have a lot of examples that use the same 'cuid' of
> > dz6pHjaADkaFTbjr0JGBpw, which is fine and representative of normal
> > operation.  Do we want to curate the 'tsid' and 'tmid' values used by that
> > client?  I see some reuse, which I think is not prohibited (at least for
> > 'tsid'), but may merit checking, and the values are not monotonically
> > increasing throughout the course of the document, which may or may not be
> > desirable.

Did you have any thoughts on this part?

> > (nit) we seem to have both "signalled" and "signaled" present, and
> > should standardize on our spelling.  (It looks like the one-'l' form is
> > more common in the RFC archive so far.)
> >
> > Section 1
> >
> >    Distributed Denial of Service (DDoS) attacks have become more
> >    sophisticated.  [...]
> >
> > We might want to say something about the baseline for the comparison
> > ("more sophisticated compared to what?").
> >
> >        100 Mbps to 100s of Gbps or even Tbps.  Attacks are commonly
> >        carried out leveraging botnets and attack reflectors for
> >        amplification attacks such as NTP (Network Time Protocol), DNS
> >        (Domain Name System), SNMP (Simple Network Management Protocol),
> >        or SSDP (Simple Service Discovery Protocol).
> >
> > Is https://datatracker.ietf.org/doc/html/rfc4732#section-3.1 still
> > current enough to make a useful reference for amplification attacks?
> >
> >                                           Nevertheless, when DOTS
> >    telemetry attributes are available to a DOTS agent, and absent any
> >    policy, it can signal the attributes in order to optimize the overall
> >    mitigation service provisioned using DOTS.
> >
> > I'm not sure what type of policy we have in mind, here.
> >
> > Section 2
> >
> >    The reader should be familiar with the terms defined in [RFC8612].
> >
> > It looks like RFC 8612 doesn't define "idle time" and "attack time", so
> > we may want to pull in a couple more documents as required reading or define
> > them ourselves.
> >
> 
> Sure, we will elaborate these terms with an example.
> 
> 
> >
> > Section 3.1
> >
> >    If the DOTS server's mitigation resources have the capabilities to
> >    facilitate the DOTS telemetry, the DOTS server adapts its protection
> >    strategy and activates the required countermeasures immediately
> >    (automation enabled) for the sake of optimized attack mitigation
> >    decisions and actions.
> >
> > I'm not entirely sure I understand this part.  It seems to be talking
> > about telemetry between the DOTS server and the associated mitigation
> > resources, but the previous discussion has been about telemetry between
> > DOTS client and DOTS server (and I do not think the mitigation resources
> > associated with the DOTS server have been said to be DOTS clients in any
> > of the previous DOTS work).
> >
> > Section 3.2
> >
> >    DOTS telemetry can also be used to tune the DDoS mitigators with the
> >    correct state of an attack.  During the last few years, DDoS attack
> >
> > (nit) I don't think "with" is the right preposition -- "tune with <X>" implies
> > that X is used as a tool to effectuate the tuning, but I think what's
> > going on here is more like using the telemetry data as an input for
> > determining what values to use for the tuning parameters available on
> > the mitigation resources.
> >
> 
> Fixed.
> 
> 
> >
> >    Mitigation of attacks without having certain knowledge of normal
> >    traffic can be inaccurate at best.  This is especially true for
> >    recursive signaling (see Section 3.2.3 in [I-D.ietf-dots-use-cases]).
> >
> > RFC 8903 makes no mention of "recursive", "recursion", or any
> > hierarchical mitigation scenario (at least in my quick search).   Even
> > the linked draft version (-25) doesn't have a section 3.2.3.  I do
> > remember reading about this type of recursive or hierarchical scenario
> > somewhere, so I think we just need to locate the right reference to put
> > here...I'm just not sure what reference that is, offhand.
> >
> 
> Good catch, recursive signaling is discussed in RFC8811.

Ah, and that's where the section reference points to.  I probably should
have figured that out on my own, but thanks for fixing it regardless.

> >
> >    In addition, the highly diverse types of use cases where DOTS clients
> >    are integrated also emphasize the need for knowledge of each DOTS
> >    client domain behavior.  Consequently, common global thresholds for
> >    attack detection practically cannot be realized.  [...]
> >
> > (editorial?) When we say "the highly diverse types of use cases where
> > DOTS clients are integrated" that's essentially an unsupported claim,
> > but the way it's written we are presupposing that claim to be true
> > without directly stating it as an observation or assumption.  I think we
> > might be better served by starting off with a declarative statement like
> > "DOTS clients can be integrated in a highly diverse set of scenarios and
> > use cases", and then moving on from that now-explicit assumption/fact to
> > conclude that any single global threshold will mischaracterize the
> > traffic for some of those diverse DOTS clients.
> >
> > Section 4.3
> >
> >    DOTS clients can also use CoAP Block1 Option in a PUT request (see
> >    Section 2.5 of [RFC7959]) to initiate large transfers, but these
> >    Block1 transfers will fail if the inbound "pipe" is running full, so
> >    consideration needs to be made to try to fit this PUT into a single
> >    transfer, or to separate out the PUT into several discrete PUTs where
> >    each of them fits into a single packet.
> >
> > If I understand/recall correctly, the Block1 transfer is expected to
> > fail only in a statistical sense, since the client can't send more
> > blocks until it gets the positive reply from the server to continue to
> > the next block (on a block-by-block basis), and that reply from the
> > server is competing with the attack traffic for the inbound "pipe" and
> > likely to fail for at least one of the blocks.  I think we should
> > probably reword slightly to say only "expected to fail" or "likely
> > fail", since there is some small chance of success; we may also want to
> > give a brief reminder about why it is expected to fail (e.g., "the
> > transfer requires a message from the server for each block, which would
> > likely be lost in the incoming flood").
> >
> 
> Thanks, we will fix the text.
> 
> 
> >
> > Section 6
> >
> >    Telemetry setup configuration is bound to a DOTS client domain.  DOTS
> >    servers MUST NOT expect DOTS clients to send regular requests to
> >    refresh the telemetry setup configuration.  Any available telemetry
> >    setup configuration has a validity timeout of the DOTS association
> >    with a DOTS client domain.  [...]
> >
> > The term "DOTS association" does not seem to have been used previously
> > (in RFC 9132 we discuss the "DOTS session" heavily, though).  Also, I

Are we planning to keep the "DOTS association" phrasing?

> > don't remember any previous requirement to keep state on the server for
> > the duration of anything scoped to the entire client *domain*, just
> > individual DOTS sessions.  We do mention detecting conflicts/overlapping
> > requests within the scope of a client domain, in RFC 9132, but as far as
> > I can tell that only holds when all the DOTS sessions involved in the
> > conflict are still active.
> >
> > Perhaps related, this discussion (at least so far) is not clear to me
> > about what level of coordination/consistency is expected between clients
> > in the same domain.  I believe that stock DOTS works fine with no such
> > coordination amongst clients within a domain, so if we are introducing such
> > a requirement we should say prominently that it's a change.
> >
> 
> Yes, telemetry configuration is not specific to the client. For example,
> the pipe capacity is specific to the site.

So what happens if there are two clients in a site and they are reporting
different values for pipe-capacity?

I think I'm just generally unsure what the expected division of work is
between multiple clients in a domain, so my questions involve a lot of
speculating and I don't have a good suggestion for changes to the text yet.

> 
> >
> > Section 6.1.1
> >
> >    Upon receipt of such request, and assuming no error is encountered by
> >    processing the request, the DOTS server replies with a 2.05 (Content)
> >    response that conveys the current and telemetry parameters acceptable
> >    by the DOTS server.  [...]
> >
> 
> NEW:
> 
> Upon receipt of such request, and assuming no error is encountered by
> processing the request, the DOTS server replies with a 2.05 (Content)
> response that conveys the telemetry parameters acceptable by the DOTS
> server and the current baseline information maintained by the DOTS server.
> 
> 
> >
> > (editorial) Something seems off, here (around "current and telemetry
> > parameters acceptable").  Is it returning current configuration, acceptable
> > parameter values, or some combination thereof?
> >
> >           |  |     +-- query-type*            query-type
> >
> > Since this is a leaf-list of supported query types, should the list name
> > include the word "supported" or similar?
> >
> > Section 6.1.2
> >
> >    The PUT request with a higher numeric 'tsid' value overrides the DOTS
> >    telemetry configuration data installed by a PUT request with a lower
> >    numeric 'tsid' value.  To avoid maintaining a long list of 'tsid'
> >    requests for requests carrying telemetry configuration data from a
> >    DOTS client, the lower numeric 'tsid' MUST be automatically deleted
> >    and no longer be available at the DOTS server.
> >
> > The way this is phrased with "the lower" and "higher" (vs "highest") assumes
> > or implies that there is only one "lower" 'tsid' value, perhaps leaving some
> > ambiguity if there is actually more than one.  We might make a statement
> > about how there is only at most one active 'tsid' per 'cuid'+'cdid' at a
> > time other than during a config change (if that's the intent), or
> > alternatively to qualify that the requirement to remove is only incurred in
> > the event of a conflict (as is done for other types of config, later).
> >
> 
> We can have multiple tsids for the same client for pipe/baseline
> information. This is why we have:
> 

Ok.  Should we say something like "when telemetry configuration data is
received that has overlapping scope with existing telemetry configuration,
the PUT request with higher numeric 'tsid' value [...]"?
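A sketch of the server-side behavior that wording would describe: a PUT removes only those lower-numbered 'tsid' entries whose scope overlaps with the new one, and leaves the rest alone. The class name, the method names, and the use of plain string sets as "scope" are simplifications made up for illustration. The sketch also does not address what should happen when a PUT arrives with a lower 'tsid' than an overlapping existing entry.

```python
class TelemetrySetupStore:
    """Toy per-client store mapping tsid -> set of scope keys (hypothetical)."""

    def __init__(self):
        self._configs = {}

    def put(self, tsid, scope):
        """Install a configuration and return the list of tsids it replaced."""
        scope = set(scope)
        # Only lower-numbered entries with overlapping scope are removed.
        replaced = sorted(
            old for old, old_scope in self._configs.items()
            if old < tsid and old_scope & scope
        )
        for old in replaced:
            del self._configs[old]
        self._configs[tsid] = scope
        return replaced

    def active_tsids(self):
        return sorted(self._configs)
```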

The rest of this looks good; thanks again for all the updates.

-Ben

> 
>    DOTS clients SHOULD minimize the number of active 'tsid's used for
>    baseline information.  In order to avoid maintaining a long list of
>    'tsid's for baseline information, it is RECOMMENDED that DOTS clients
>    include in a request to update information related to a given target,
>    the information of other targets (already communicated using a lower
>    'tsid' value) (assuming this fits within one single datagram).  This
>    update request will override these existing requests and hence
>    optimize the number of 'tsid' requests per DOTS client.
> 
> 
> >
> >    o  If the request is missing a mandatory attribute, does not include
> >       'cuid' or 'tsid' Uri-Path parameters, or contains one or more
> >       invalid or unknown parameters, 4.00 (Bad Request) MUST be returned
> >       in the response.
> >
> > Just to confirm: this does not rule out the ability to define new
> > parameters in the future (for example, the client might learn of new
> > ones in the response to a GET request)?
> >
> 
> Yes.
> 
> 
> >
> > Section 6.3
> >
> >       *  The maximum number of requests allowed per second to the
> >          target.
> >
> > Should we say anything about the requirement on the protocol in question
> > that "request" is a meaningful concept and observable by the mitigator
> > (analogous to what we have about "embryonic connections" earlier)?
> >
> 
> Okay, updated text to say: The maximum number of requests (e.g.,
> HTTP/DNS/SIP requests) allowed per second to the target.
> 
> 
> >
> >           |           +-- baseline* [id]
> >           |              +-- id
> >           |              |       uint32
> >           |              +-- target-prefix*
> >           |              |       inet:ip-prefix
> >
> > (nit) the formatting here is a bit surprising, to wrap the line between
> > leaf name and type.  But if that's what pyang gives, we probably don't
> > want to mess with it...
> >
> >           |              |  +-- partial-request-ps?          uint64
> >           |              |  +-- partial-request-client-ps?   uint64
> >
> > I wonder whether the limit on "partial requests" should really be a rate
> > ("per second") versus a point-in-time cap (i.e., "outstanding partial
> > requests").  It seems like a given request could in principle remain in
> > "partial" state for an extended period of time, and that having remained
> > in such a state for a second should not justify the client being able to
> > produce more partial requests...but the current formulation as a rate seems
> > to do so.
> >
> 
> Good point, added partial requests pending per client.
> 
> 
> >
> > Section 6.3.1
> >
> >    Two PUT requests from a DOTS client have overlapping targets if there
> >    is a common IP address, IP prefix, FQDN, URI, or alias-name.  Also,
> >    two PUT requests from a DOTS client have overlapping targets if the
> >    addresses associated with the FQDN, URI, or alias are overlapping
> >    with each other or with 'target-prefix'.
> >
> > There can be some subtlety here involving where the FQDN/URI/alias are
> > resolved into IP addresses from; we may want to say "from the perspective of
> > the server", "as observed by the server", or similar.
> >
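For the prefix part of the overlap rule, a small sketch using Python's standard ipaddress module. It assumes that any FQDN, URI, or alias has already been resolved to prefixes as seen by the server; that resolution step, and the function name, are not from the draft.

```python
import ipaddress

def targets_overlap(prefixes_a, prefixes_b):
    """Return True if any prefix in one request overlaps any prefix in the other.

    Both arguments are iterables of prefix strings such as '192.0.2.0/24'.
    Prefixes of different IP versions are never considered overlapping.
    """
    nets_a = [ipaddress.ip_network(p) for p in prefixes_a]
    nets_b = [ipaddress.ip_network(p) for p in prefixes_b]
    for a in nets_a:
        for b in nets_b:
            if a.version == b.version and a.overlaps(b):
                return True
    return False
```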
> 
> Fixed.
> 
> 
> >
> > Section 7
> >
> >    DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data with
> >    mitigation requests relying upon the target attribute.  In
> >
> > (nit??) Is this "target attribute" a telemetry attribute, or something
> > else (an attack target?)?
> >
> 
> NEW:
> 
>    DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data to
>    mitigation requests relying upon the resources under attack.
> 
> 
> 
> >
> >    When generating telemetry data to send to a peer, the DOTS agent MUST
> >    auto-scale so that appropriate unit(s) are used.
> >
> > (editorial) This requirement is not terribly tightly connected to either
> > what comes after it in the text or what comes before it in the text.  We
> > might consider moving it (earlier, maybe?) and adding a bit more
> > transition phrasing about how the agent may send telemetry data many
> > times and in different situations, so the right unit to use will vary
> > over time.
> >
> 
> We will add an example, such as a DDoS attack growing in volume from Gbps
> to Tbps, to justify the need for auto-scaling the units.
> 
> 
> 
> >
> > Section 7.1.1
> >
> >    If the target is subjected to bandwidth consuming attack, the
> >    attributes representing the percentile values of the 'attack-id'
> >    attack traffic are included.
> >
> >    If the target is subjected to resource consuming DDoS attacks, the
> >    same attributes defined for Section 7.1.4 are applicable for
> >    representing the attack.
> >
> > One of these just says "the attributes representing" and the other
> > references attributes defined in a specific section; can we use the same
> > formulation/phrasing to talk about the two cases?
> >
> >    This is an optional attribute.
> >
> > Er, which one is optional?  Just a few paragraphs earlier we said that
> > at least one attribute MUST be present (so as to be able to identify a
> > target).
> >
> 
> Added the following line:
> At least the 'target' attribute and another
> pre-or-ongoing-mitigation attribute MUST be present in the DOTS telemetry
> message.
> 
> Removed the line "This is an optional attribute"
> 
> 
> >
> > Section 7.1.2
> >
> >    The 'total-traffic' attribute (Figure 26) conveys the percentile
> >    values (including peak and current observed values) of total traffic
> >    observed during a DDoS attack.  [...]
> >
> > This attribute is under the "pre-or-ongoing-mitigation" hierarchy; is it
> > really right to only say "traffic observed during a DDoS attack" (i.e.,
> > implicitly excluding the "pre-attack" case)?
> >
> 
> Good catch, fixed.
> 
> 
> >
> > Section 7.1.3
> >
> >    The 'total-attack-traffic' attribute (Figure 27) conveys the total
> >    attack traffic identified by the DOTS client domain's DDoS Mitigation
> >    System (or DDoS Detector).  [...]
> >
> > (editorial?) "Identified by [the client domain]" seems to imply that
> > this is only used in telemetry reports from client to server, and not in
> > updates flowing the other direction.  Is that correct?
> >
> 
> Yes, telemetry is applicable in both directions and not specific to client
> to server only.
> 
> 
> >
> > Section 7.1.4
> >
> >                 +-- total-attack-connection
> >                 |  +-- low-percentile-l* [protocol]
> >                 |  |  +-- protocol              uint8
> >                 |  |  +-- connection?           yang:gauge64
> >                 |  |  +-- embryonic?            yang:gauge64
> >                 |  |  +-- connection-ps?        yang:gauge64
> >                 |  |  +-- request-ps?           yang:gauge64
> >                 |  |  +-- partial-request-ps?   yang:gauge64
> >
> > I think I'm confused about what semantics this is trying to represent.
> > The "low-percentile-l" seems like it should be providing aggregate
> > information about some particular snapshot of traffic that we have
> > determined to be representative of the "low percentile" level of attack
> > traffic.  That is, if we take the recorded attack traffic and divide it
> > into a bunch of bins on the time axis, we can order the respective bins
> > from "least noticeable attack" to "worst attack", and we pick the fifth
> > percentile (or whatever is configured) bin to represent the
> > "low-percentile" bucket.  But now we're reporting a bunch of attributes
> > of that bin -- number of connections, embryonic connections, connections
> > per second, etc., even though the bins were ordered based on a single
> > "worst" metric (and I don't see us specify what metric we actually use
> > to do that ordering).  So even though this is the "low percentile" (e.g.,
> > fifth percentile) bin of attack traffic (i.e., probably not a very bad
> > attack), we can't say specifically that the reported total attack
> > connection count is fifth percentile and the embryonic connection count
> > is fifth percentile and the requests-per-second is also fifth
> > percentile.  Which makes me confused at why the list is structured this
> > way -- if we were just saying "for each of connection, embryonic, etc., bin
> > the samples over that metric and compute the low/med/high percentile values
> > for that single metric, and this list holds those resulting values", it
> > seems like a more natural structure to group things would be to have a list
> > of low/med/high for each of the attributes/metrics.
> >
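The alternative structure suggested above (percentiles computed independently for each metric) can be sketched like this. The metric names mirror the tree, but the bin layout, the function names, and the threshold values are assumptions for illustration, and the nearest-rank percentile rule is one possible choice.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def per_metric_summary(bins, levels):
    """Compute percentile levels separately for every metric.

    `bins` is a list of dicts (one per time bin) mapping metric name to value;
    `levels` maps a label such as 'low' to a percentile threshold.
    """
    metrics = sorted({name for time_bin in bins for name in time_bin})
    summary = {}
    for metric in metrics:
        values = [time_bin[metric] for time_bin in bins if metric in time_bin]
        summary[metric] = {
            label: percentile(values, pct) for label, pct in levels.items()
        }
        summary[metric]["peak"] = max(values)
    return summary
```

With this grouping, the "low" value reported for 'connection' and the "low" value reported for 'embryonic' may come from different time bins, so no single ordering metric is needed.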
> 
> Good point, we should have a list of low/med/high/peak/current for each of
> the connection attributes.
> 
> Cheers,
> -Tiru
> 
> 
> > Section 7.1.5
> >
> >    vendor-id:  Vendor ID is a security vendor's Enterprise Number as
> >       registered with IANA [Enterprise-Numbers].  It is a four-byte
> >       integer value.
> >
> >    attack-id:  Unique identifier assigned for the attack.
> >
> > If the vendor-id is being used as a scoping value to let each vendor
> > assign attack-id values, then we should say so here, the first place we
> > really handle this part of the YANG tree.  Even if not, we should probably
> > say more about why we care what the vendor-id is, and we should also say who
> > assigns the attack-id.  (Maybe it's scoped to the DOTS server; I don't know
> > yet at this point in the document!)
> >
> > Furthermore, the module structure where attack-id and vendor-id are
> > always required, but there is going to need to be the ability to assign
> > a new attack description at runtime, seems to force us to conflate
> > externally/statically assigned attack-ids and dynamically assigned ones
> > in the same number space.  Should we give any guidance to vendors about
> > how to allocate IDs in a way that will not produce collisions between
> > predistributed/fixed attack IDs and new ones created at runtime?  Or is the
> > thought that there will be no dynamic attack-ids since that just
> > corresponds to the attack-description string and a generic description can
> > be used if there is not a better one available?
> >
> > Regardless, I think we should have more verbiage as to what the scope of
> > the attack-id is -- what information is it supposed to indicate or replace?
> >
> >    start-time:  The time the attack started.  The attack's start time is
> >       expressed in seconds relative to 1970-01-01T00:00Z in UTC time
> >
> > (nit) "in UTC time" and "Z" seem redundant (which does not necessarily
> > imply that we should only use one of them...).
> >
> >       (Section 3.4.2 of [RFC8949]).  The CBOR encoding is modified so
> >       that the leading tag 1 (epoch-based date/time) MUST be omitted.
> >
> > (nit) Maybe we should move the CBOR section reference to after we start
> > talking about CBOR encoding.  (And we should do the same for 'end-time',
> > if we do anything.)
> >
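
For concreteness, a hand-rolled sketch of what omitting the leading tag 1
means on the wire.  The encoder below covers only CBOR unsigned integers
(major type 0) as laid out in RFC 8949; `encode_uint` is an illustrative
helper, not an API from any CBOR library, and the timestamp is the one used
in the draft's examples.

```python
def encode_uint(value):
    """Encode a non-negative integer as a CBOR unsigned integer (major type 0)."""
    if value < 0:
        raise ValueError("only unsigned integers are handled here")
    if value < 24:
        return bytes([value])  # value fits in the initial byte
    # Additional-information values 24..27 select a 1/2/4/8-byte argument.
    for additional, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if value < 1 << (8 * size):
            return bytes([additional]) + value.to_bytes(size, "big")
    raise ValueError("value does not fit in 64 bits")

TAG_EPOCH = b"\xc1"  # major type 6 with tag number 1 (epoch-based date/time)

start_time = 1957811234
untagged = encode_uint(start_time)  # the form the draft text asks for
tagged = TAG_EPOCH + untagged       # generic tagged epoch date/time, for comparison
```

The two forms differ only by the single leading tag byte.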
> >    top-talker:  A list of top talkers among attack sources.  The top
> >       talkers are represented using the 'source-prefix'.
> >
> > Is there a good reference or short description for the notion of
> > top-talkers?  I know it's a well-established term of art in many different
> > circles, but it's hard to be fully confident that all readers will be
> > familiar with it.
> >
> > Also, should we say anything about how the number of talkers to include
> > is determined?  (I assume it is best left to the discretion of the
> > sender, but we probably want to set a clear expectation that the list is
> > not all talkers or a complete list in any other way.)
> >
> >       'spoofed-status' indicates whether a top talker is a spoofed IP
> >       address (e.g., reflection attacks) or not.
> >
> > Is this something that can be unambiguously determined by all parties
> > that might be producing this telemetry data, or should we use a
> > tri-state (for "unknown") rather than a boolean to represent it?
> >
> >    In order to optimize the size of telemetry data conveyed over the
> >    DOTS signal channel, DOTS agents MAY use the DOTS data channel
> >    [RFC8783] to exchange vendor specific attack mapping details (that
> >    is, {vendor identifier, attack identifier} ==> attack description).
> >    As such, DOTS agents do not have to convey systematically an attack
> >    description in their telemetry messages over the DOTS signal channel.
> >
> > It's a little surprising to me that this is only a MAY.  Does it make
> > sense as a SHOULD (expected to usually happen, but leaving open the
> > flexibility when an observed attack does not match a previously
> > characterized attack description)?
> >
> > Also, when reading this I assumed that the "attack description" being
> > mapped to could include things like the target port/protocol, whether it
> > was spoofed, etc.  But reading on to the ietf-dots-mapping module, it
> > seems that this is intended to just be the single "attack-description"
> > string.  If so, we might give some more indications of that, e.g., by
> > spelling it that way and having some prose about it being a "string", or
> > similar.
> >
> > Also^2, from here to the end of the section might benefit from being
> > split off into a dedicated subsection on using the data channel to
> > pre-populate vendor and attack description information.
> >
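
A rough sketch of the {vendor identifier, attack identifier} ==> attack
description lookup as I read it, with the description being just a string.
Everything concrete here is hypothetical: the table contents, the attack-id
values, the "unknown attack" fallback, and the choice to let a description
carried in the telemetry message win over the table.  32473 is the
documentation Enterprise Number from RFC 5612.

```python
from typing import Dict, Optional, Tuple

AttackKey = Tuple[int, int]  # (vendor-id, attack-id)

# Hypothetical table, as it might look after being pre-populated over the
# data channel.  The attack-ids and description strings are made up.
MAPPING: Dict[AttackKey, str] = {
    (32473, 77): "SYN flood",
    (32473, 92): "DNS amplification",
}

def describe(vendor_id: int, attack_id: int,
             inline_description: Optional[str] = None) -> str:
    """Resolve an attack description string.

    Assumed precedence: a description sent in the telemetry message itself,
    then the pre-populated table, then a generic fallback.
    """
    if inline_description is not None:
        return inline_description
    return MAPPING.get((vendor_id, attack_id), "unknown attack")
```

Note that the key is the pair, so the attack-id is only meaningful within
the scope of its vendor-id in this sketch.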
> >    tables are at different revisions.  The DOTS client SHOULD transmit
> >    telemetry information using the vendor mapping(s) that it provided to
> >    the DOTS server and the DOTS server SHOULD use the vendor mappings(s)
> >    provided to the DOTS client when transmitting telemetry data to peer
> >    DOTS agent.
> >
> > Having read a bit further on, I'm not actually sure where in the
> > protocol the client has provided vendor mappings to the server, and the
> > server provided vendor mappings to the client.  It *might* be the GET
> > and POST from/to dots-data/ietf-dots-mapping:vendor-mapping, but we
> > don't really give a clear indication of that.  I think we should try to
> > be more precise about what we mean by "provided".
> > (editorial) would this then be "SHOULD use any vendor mapping(s)"
> > (s/the/any/)?
> >
> >      augment /data-channel:dots-data/data-channel:capabilities:
> >        +--ro vendor-mapping-enabled?   boolean {dots-telemetry}?
> >
> > None of the other capabilities in this tree (as specified by RFC 8783)
> > include a name suffix like "-enabled".  Should this just be
> > "vendor-mapping"?
> >
> >            "vendor-id": 1234,
> >            "vendor-name": "mitigator-s",
> >            "last-updated": "1576856561",
> >            "attack-mapping": []
> >
> > [this last updated is from December 2019; we could probably pick
> > something more current if we wanted, but it doesn't really matter.]
> >
> >    The DOTS client reiterates the above procedure regularly (e.g., once
> >    a week) to update the DOTS server's vendor attack mapping details.
> >
> > Do the updates have to preserve any existing assignments?
> >
> >    If the DOTS client concludes that the DOTS server does not have any
> >    reference to the specific vendor attack mapping details, the DOTS
> >    client uses a POST request to install its vendor attack mapping
> >    details.  [...]
> >
> > This suggests that a client would not bother sending its own vendor
> > mapping if the server already has one, and at least to me further
> > implies that the client would use the server-provided mapping for the
> > messages that the client generates.  This seems at odds with the earlier
> > guidance that the client should send using the vendor-mapping it
> > provided to the server, and the server should send using the
> > vendor-mapping it provided to the client.  Which one is it?
> >
> >    The DOTS server indicates the result of processing the POST request
> >    using the status-line.  Concretely, "201 Created" status-line MUST be
> >    returned in the response if the DOTS server has accepted the vendor
> >    attack mapping details.  If the request is missing a mandatory
> >    attribute or contains an invalid or unknown parameter, "400 Bad
> >    Request" status-line MUST be returned by the DOTS server in the
> >    response.  [...]
> >
> > I think (but did not specifically check) that these specific status code
> > values are preexisting requirements of core RESTCONF (and HTTP), so we
> > may not need to write new "MUST" keywords.
> >
> >    If the request is received via a server-domain DOTS gateway, but the
> >    DOTS server does not maintain a 'cdid' for this 'cuid' while a 'cdid'
> >    is expected to be supplied, the DOTS server MUST reply with "403
> >    Forbidden" status-line and the error-tag "access-denied".  Upon
> >    receipt of this message, the DOTS client MUST register (Section 5.1
> >    of [RFC8783]).
> >
> > (nit) the analogous text in RFC 8783 itself refers only to "Section 5"
> > (not 5.1).
> >
> >    The DOTS client uses the PUT request to modify its vendor attack
> >    mapping details maintained by the DOTS server (e.g., add a new
> >    mapping).
> >
> > As above, is this really just an example behavior or a hard requirement
> > to preserve existing mappings?
> >
> > Section 7.2
> >
> >            "attack-detail": [
> >              {
> >                "vendor-id": 1234,
> >
> > The vendor-ID is supposed to be an IANA-assigned PEN, and 1234 is
> > assigned to Linkage Software Inc.  We have other options, like 32473,
> > "Example Enterprise Number for Documentation Use" (RFC 5612).
> >
> >                "start-time": "1957811234",
> >
> > This value represents a time in 2032; should we be using something a
> > little more current?
> >
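
The example timestamps can be checked mechanically.  A small sketch using
only the Python standard library; the three values are the ones quoted from
the draft's examples in this review, and `epoch_to_utc` is just an
illustrative helper name.

```python
from datetime import datetime, timezone

def epoch_to_utc(value):
    """Convert seconds since 1970-01-01T00:00Z (string or int) to an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)

# Values as they appear (string-encoded) in the draft's JSON examples.
examples = {
    "last-updated": "1576856561",
    "start-time": "1957811234",
    "mitigation-start": "1507818434",
}

for name, value in examples.items():
    print(name, epoch_to_utc(value).isoformat())
```

Running the same helper over a candidate replacement value is a quick way to
confirm it lands near the intended date.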
> > Section 7.3
> >
> >    This request MUST be maintained active by the DOTS server until a
> >    delete request is received from the same DOTS client to clear this
> >    pre-or-ongoing-mitigation telemetry.
> >
> > It seems like we might want to allow some provision for a server to
> > clean up state associated with a client that disappears entirely without
> > cleaning itself up.  For example, is the server allowed to discard state
> > when it reboots, or does this requirement extend to writing the state to
> > persistent storage?
> >
> >       If more than one Uri-Query option is included in a request, these
> >       options are interpreted in the same way as when multiple target
> >       attributes are included in a message body.
> >
> > I a little bit wonder if we could put a section reference here for
> > "interpreted in the same way", since the reader may not remember what
> > that way is, at least on the first reading.
> >
> >       parameters.  DOTS clients MUST NOT include a name in which the "*"
> >       character is included in a label other than the leftmost label.
> >
> > Do we want to say what the server should do if it gets one anyway?
> > (It's not really clear that we need to say anything, strictly speaking.)
> >
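
If a receiver did want to detect this case, the check itself is small.  A
sketch under the assumption that names are plain dot-separated labels;
`wildcard_ok` is a hypothetical helper, and what the server does with a
failing name is exactly the part the text leaves open.

```python
def wildcard_ok(name: str) -> bool:
    """Return True if '*' appears in no label other than the leftmost one."""
    labels = name.split(".")
    # Only labels after the first are forbidden from containing '*'.
    return all("*" not in label for label in labels[1:])
```

For example, "*.example.com" passes and "www.*.example.com" does not.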
> >                "vendor-id": 1234,
> >                "attack-id": 77,
> >                "start-time": "1957818434",
> >
> > [same as above]
> >
> >    A DOTS client that is not interested to receive pre-or-ongoing-
> >    mitigation telemetry data for a target MUST send a delete request
> >    similar to the one depicted in Figure 37.
> >
> > I'm not sure that this strictly needs to be a "MUST"; it seems like we
> > could say that such a client "sends a delete request" and leave it at
> > that.
> >
> > Section 8.1
> >
> >        +-- attack-detail* [vendor-id attack-id]
> >        [...]
> >           +-- top-talker
> >              +-- talker* [source-prefix]
> >              [...]
> >                 +-- total-attack-connection
> >                    +-- low-percentile-c
> >                    |  +-- connection?           yang:gauge64
> >
> > I'm kind of surprised that there is no 'protocol' leaf in this subtree;
> > it seems to usually appear in combination with these connection counts.
> >
> >           |  +-- peak-g?              yang:gauge64
> >           |  +-- current-g?              yang:gauge64
> > [...]
> >                 |  +-- peak-g?              yang:gauge64
> >                 |  +-- current-g?              yang:gauge64
> >
> > (nit) the indentation seems off for the type of these two 'current-g's.
> >
> >    In order to signal telemetry data in a mitigation efficacy update, it
> >    is RECOMMENDED that the DOTS client has already established a DOTS
> >    telemetry setup session with the server in 'idle' time.
> >
> > If I understand correctly, it seems that at least part of the need for
> > having a preestablished telemetry session is that the mitigation
> > efficacy update is identified only by the 'mid', and so we cannot
> > specify a 'tmid' as part of such an efficacy update.  Thus, in order
> > for the server to associate the telemetry data in the efficacy update
> > with an ongoing telemetry status session, the 'tsid' must be associated
> > with the 'mid' that is the subject of the efficacy update.
> > If that is correct, then I think we should have some text here about how
> > the efficacy update does not/cannot include the 'tmid', and so is
> > associated with the 'mid' of the update.
> >
> > Section 8.2
> >
> >    As defined in [RFC8612], the actual mitigation activities can include
> >    several countermeasure mechanisms.  The DOTS server signals the
> >    current operational status of relevant countermeasures.  A list of
> >    attacks detected by each countermeasure MAY also be included.  The
> >
> > I see how in stock DOTS signal channel, the server signals whether an
> > attack is fully/partially/not mitigated, but I don't see how the server
> > indicates the specific relevant countermeasures that were used.  Am I
> > just missing something there?  (Similarly, the list of attacks we
> > returned may or may not be at the granularity of "detected by each
> > countermeasure", depending on the answer to the previous part.)
> >
> > Also (editorial), I think we should be more clear about when we switch
> > from describing RFC 9132 behavior to describing the new functionality
> > provided by this document.
> >
> >    Figure 46 shows an example of an asynchronous notification of attack
> >    mitigation status from the DOTS server.  This notification signals
> >
> > Is the "new stuff" in Figure 46 just the subtree under
> > :(server-to-client-only)?  The 'total-attack-traffic' leaf seems to be
> > the same content as in Figure 45, and maybe could be omitted?
> >
> >            "mitigation-start": "1507818434",
> >
> > (This date is from 2017; as above, we could use a more current one if we
> > want.)
> >
> >                "source-count": {
> >                  "peak-g": "10000"
> >
> > (Suspiciously round numbers like this are indications of fabricated data
> > ... which is totally reasonable for an example like this!  But if we
> > wanted to be "more realistic" we could use a less-round number.)
> >
> >    DOTS clients can filter out the asynchronous notifications from the
> >    DOTS server by indicating one or more Uri-Query options in its GET
> >    request.  A Uri-Query option can include the following parameters:
> >    'target-prefix', 'target-port', 'target-protocol', 'target-fqdn',
> >    'target-uri', 'alias-name', and 'c' (content) (Section 4.2.4).  The
> >
> > Up in §7.3 the analogous list also included 'mid'.  Is 'mid' appropriate
> > here (or is it implicitly already included by the base signal channel
> > mechanisms)?  Hmm, but maybe 'c' is also already allowed by the base
> > signal channel mechanism, so that theory is not great...
> >
> >    If the target query does not match the target of the enclosed 'mid'
> >    as maintained by the DOTS server, the latter MUST respond with a 4.04
> >    (Not Found) error Response Code.  The DOTS server MUST NOT add a new
> >    observe entry if this query overlaps with an existing one.
> >
> > How should the server respond if this query overlaps with an existing
> > one?
> >
> > Section 10.1
> >
> >    This module uses types defined in [RFC6991] and [RFC8345].
> >
> > Should we mention the data structure extension from RFC 8791 here, as
> > well?
> >
> >   typedef attack-severity {
> >     type enumeration {
> >       enum none {
> >         value 1;
> >         description
> >           "No effect on the DOTS client domain.";
> >
> > It seems like the closest analogue to our attack-severity typedef in
> > RFC 7970 is in §3.12.2, "BusinessImpact Class". But the phrasing used in
> > RFC 7970 is slightly different than the descriptions we have here.
> > Do we want to tweak our phrasings and/or clarify in the typedef
> > description that we use a slightly modified formulation?  (Indeed, we do
> > not have an extension point, either, though 7970 does have an extension
> > mechanism.)  Also, a section reference within RFC 7970 might be helpful.
> >
> >       enum kilopacket-ps {
> >         value 4;
> >         description
> >           "Kilo packets per second (kpps).";
> >
> > We may get someone who asks if we are using 1000 or 1024 ("kibi"), but
> > I'm inclined to leave it alone for now.
> >
> >       leaf unit-status {
> >         type boolean;
> >         default true;
> >         description
> >           "Enable/disable the use of the measurement unit class.";
> >
> > Does the "default true" imply that even unit classes not present in
> > "list unit-config" are assumed to be supported by default?
> >
> >   grouping connection {
> >     description
> >       "A set of data nodes which represent the attack
> >        characteristics.";
> >     [...]
> >     leaf embryonic {
> >       type yang:gauge64;
> >       description
> >         "The number of simultaneous embryonic connections to
> >          the target server.";
> >
> > It's a little interesting that this is the only leaf in the grouping
> > that we can't attach the word "attack" to the description of, but as far
> > as I can tell we shouldn't make any change here.
> > (I originally wrote a comment that we shouldn't use "attack" for any of
> > them, since we have analogous nodes in the total-connection-capacity
> > config entries, but those seem to not use this grouping.)
> >
> >     list talker {
> >       key "source-prefix";
> >       description
> >         "IPv4 or IPv6 prefix identifying the attacker(s).";
> >
> > My read of the YANG is that this is the description for "list talker",
> > but the content matches the description given for the "source-prefix" leaf
> > within the list.  Should the list itself have a different description?
> >
> >   grouping top-talker-aggregate {
> >   [...]
> >   grouping top-talker {
> >
> > The diff between these two is pretty small -- just the top-level
> > description and connection-all vs connection-protocol-all.  Is there any
> > value in introducing another grouping to hold the common elements?
> >
> >             container max-config-values {
> >               description
> >                 "Maximum acceptable configuration values.";
> >               uses telemetry-parameters;
> >
> > I guess the convenience of having the reusable grouping probably
> > outweighs the value of having YANG-level constraints that the 'max'
> > values are >= the 'min' values (which we can set for the
> > "telemetry-notify-interval" that is not part of "telemetry-parameters", but
> > can't set here because we use the grouping).
> >
> >               leaf telemetry-notify-interval {
> >                 type uint32 {
> >                   range "1 .. 3600";
> >
> > (Pedantically, this range would fit in a uint16, though maybe there is
> > some other reason to prefer the 32-bit type.)
> >
> >             case baseline {
> >               [...]
> >               list baseline {
> >                 [...]
> >                 leaf id {
> >                   type uint32;
> >                   must '. >= 1';
> >                   description
> >                     "An identifier that uniquely identifies a baseline
> >                      entry communicated by a DOTS client.";
> >
> > I a little bit wonder if we should have a little bit of prose up in §6.3
> > discussing the need for "id" and how it is used.
> >
> >               leaf tmid {
> >                 type uint32;
> >                 description
> >                   "An identifier to uniquely demux telemetry data sent
> >                    using the same message.";
> >
> > The description for "tsid" in the analogous setup was just "an
> > identifier for the DOTS telemetry setup data".  As far as I can tell,
> > the usage of the two is analogous, so shouldn't the descriptions be
> > analogous as well?  In particular, I am not sure that "uniquely demux" is
> > the only usage of this leaf, since in server-initiated telemetry messages,
> > this leaf might be needed to indicate which telemetry entry is being
> > described (right?).
> >
> >             leaf-list mid-list {
> >               type uint32;
> >               description
> >                 "Reference a list of associated mitigation requests.";
> >
> > Should we reference RFC 9132 somehow to indicate that the base signal
> > channel is how these "mid" values are assigned (and assigned meaning)?
> >
> > Section 10.2
> >
> >          description
> >            "Vendor ID is a security vendor's Enterprise Number.";
> >
> > Should we say something about "IANA-assigned" and/or link to the
> > corresponding registry?
> >
> > Section 11, 12.1
> >
> > A handful of the entries in the tables still have an
> > "ietf-dots-telemetry:" prefix, and I don't understand why only some of
> > the entries would need it but not others.
> >
> > In particular, there seems to be an entry for both "total-traffic" and
> > "ietf-dots-telemetry:total-traffic", and I don't understand how they are
> > different.  (I did not check if there are other "duplicate"s like that.)
> >
> > Section 11
> >
> >   | telemetry            | container   |TBA2  | 5 map         | Object |
> >
> > I see both a "list telemetry" and a "case telemetry" in the YANG module,
> > but I'm not sure which of them is supposed to correspond to the YANG
> > type "container".
> > https://datatracker.ietf.org/doc/html/rfc7950#section-7.9.2 does not
> > really give me the impression that there is an implicit container of
> > this name.
> >
> >   | baseline             | container   |TBA49 | 5 map         | Object |
> >
> > Similarly, I see a 'baseline' that's a list, as the only content of a
> > "case baseline" statement.
> >
> > Section 12.1
> >
> >    Note that 'lower-type' and 'upper-type' are also requested for
> >    assignment in the call-home I-D.  Both I-Ds should be sync'ed as
> >    depending the one that will make it first to the IANA.
> >
> > (I think we tweaked call-home already to account for the registration
> > requests being from different ranges between the two documents, but
> > mention it just in case I'm misremembering.)
> >
> > Section 12.3
> >
> >             URI: urn:ietf:params:xml:ns:yang:ietf-dots-mapping
> >             Registrant Contact: The IESG.
> >             XML: N/A; the requested URI is an XML namespace.
> >
> > I wonder if this (module) name is perhaps somewhat more generic than the
> > functionality it provides.
> >
> > Section 13
> >
> > Should we mention the security considerations for the blockwise transfer
> > technologies?
> >
> > Looking through the YANG tree, the only thing that really sticks out as
> > potentially worth mentioning here is the server-originated-telemetry
> > option.  But even for that, I'm not sure that there's much to say.
> >
> >    The DOTS telemetry information includes DOTS client network topology,
> >    DOTS client domain pipe capacity, normal traffic baseline and
> >    connections capacity, and threat and mitigation information.  Such
> >    information is sensitive; it MUST be protected at rest by the DOTS
> >    server domain to prevent data leakage.
> >
> > I'd consider adding a sentence or two here noting that even though this
> > data is sensitive, sending it explicitly to the DOTS server does not
> > introduce any new significant considerations (other than the need for
> > protection at rest) because the DOTS server is already trusted to have
> > access to that kind of information by being in the position to mitigate
> > (and observe) attacks.
> >
> > Section 16.1
> >
> > A normative reference on draft-ietf-dots-signal-filter-control will
> > cause a document cluster, delaying publication until that document is
> > ready to be an RFC.  It's not clear to me that such a dependency is
> > needed in this case, since there is only one citation to that document
> > and it seems more descriptive than imposing a strong dependency.
> >
> > Similarly, we seem to cite RFC 7641 just as a statement of fact (clients
> > can use CoAP OBSERVE), which may not require it to be classified as a
> > normative reference.
> >
> > Section 16.2
> >
> > The current state of this document doesn't really include enough detail
> > about percentiles and their calculation to be implementable without
> > referring to RFC 2330, currently listed only as an informative reference.
> > I would prefer to add more text to this document rather than promote
> > 2330 to a normative reference, though I think we would need more text
> > than I suggested above in order to achieve that effect.
> >
> > We say we assume familiarity with RFC 8612 at least for terminology,
> > which may indicate that it is best classified as a normative reference.
> >
> >
> > Thanks,
> >
> > Ben
> >
> >