Re: [Dots] AD review of draft-ietf-dots-telemetry-16 (Sections 1- 7.1.4)

mohamed.boucadair@orange.com Wed, 01 December 2021 09:03 UTC

From: mohamed.boucadair@orange.com
To: tirumal reddy <kondtir@gmail.com>, Benjamin Kaduk <kaduk@mit.edu>
CC: "draft-ietf-dots-telemetry.all@ietf.org" <draft-ietf-dots-telemetry.all@ietf.org>, "dots@ietf.org" <dots@ietf.org>
Date: Wed, 01 Dec 2021 09:03:47 +0000
Message-ID: <24993_1638349429_61A73A75_24993_73_1_787AE7BB302AE849A7480A190F8B93303545D099@OPEXCAUBMA2.corporate.adroot.infra.ftgroup>
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/nubTNz2zxhiuXmmUAG8eJext1_8>

Hi Ben,

Please see inline some more replies in addition to those already provided by Tiru. The changes can be tracked on GitHub.

Cheers,
Med

From: Dots <dots-bounces@ietf.org> On Behalf Of tirumal reddy
Sent: Tuesday, 30 November 2021 08:29
To: Benjamin Kaduk <kaduk@mit.edu>
Cc: draft-ietf-dots-telemetry.all@ietf.org; dots@ietf.org
Subject: Re: [Dots] AD review of draft-ietf-dots-telemetry-16

Hi Ben,

Thanks for the detailed review. Please see inline for responses up to Section 7.1.4.

On Thu, 25 Nov 2021 at 04:12, Benjamin Kaduk <kaduk@mit.edu> wrote:
Hi all,

Sorry to have been working on this for so long -- some health issues
intervened, and it is perhaps easier to let oneself be interrupted if
there is no hope of finishing in a single sitting :-/

[Note that I reviewed the -16 but the -17 is the current version, which
includes some editorial work I had sent in a PR. It looks like that
will not have changed many things I comment on.]

This review is pretty long (which I guess befits a long document). We
probably want to chunk up replies to it so that the emails/threads stay
manageable.

A couple meta-comments on the other reviews so far:

- the shepherd writeup only answers one of the three questions in point
(1). It is likely that Murray will point this out in his ballot if
not updated before then.

- it's a little surprising that the yangdoctor didn't ask for encoded
examples of the RESTCONF (data channel) functionality. (For the
sx:structure parts, I think we're already in good shape on examples.) That
said, I don't think our use of RESTCONF is particularly novel, and am happy
to proceed without such examples if desired.

[Med] We do already have examples for that part as well, e.g., Fig 33/34

High-level remarks before going in to the detailed section-by-section
comments:

I'm happy to see the note at the end of §4.5 that the data channel is
available to optimize the data that needs to be exchanged over the signal
channel during attack-time, in some sense reiterating the original split
between signal and data channels (but see my note in the inline comments
about the "MAY"). But this is already rather far into the design overview,
let alone the document overall! I think it would be helpful to have a
paragraph or two in the toplevel §4 to get in front of how the telemetry
mechanisms integrate into the signal+data channels, perhaps something like:

[snip]

The use of the "list telemetry" in the telemetry-setup message confused me
for a while, since it seems like the client can only send a zero- or
one-element list to the server (based on the 'tsid' for that direction being
part of the Uri-Path).

[Med] Yes.

Can you confirm that the list structure was used
just for the server-to-client direction combined with maximizing
data-structure reuse?

[Med] It is modelled as a list for both but only one item is allowed in a request. The restriction is explicitly covered in the text:

“'tsid' MUST NOT appear in the PUT request message body.”

The server can send a list; the tsid key will appear in the message body.

Please note that this was designed in the same way we handled the mitigation scopes in 9132.

We should say somewhere that the examples use CBOR presentation format.
(This could be a note near the terminology section or accompanying each
example.)

[Med] We do already have this text:

Telemetry messages exchanged between DOTS agents are serialized using
Concise Binary Object Representation (CBOR) [RFC8949]. CBOR-encoded
payloads are used to carry signal-channel-specific payload messages
which convey request parameters and response information such as
errors.

Added this NEW to 4.6:

JSON encoding of YANG-modeled data is used to illustrate the various
telemetry operations.

[snip]
It seems like we are setting up somewhat conflicting goals between the
overall desire for compact signal-channel messages and our guidance to
"include in any request to update information related to [a thing] the
information of other [like things] (already communicated using a lower
['tsid'/'tmid'] value" that results in larger individual messages (albeit
fewer of them). Do we want to make some statement about how this guidance
to coalesce can be ignored if needed to make individual signal-channel
messages fit in a single packet? (Hmm, I guess we do have a note about
"assumes that all link information can fit in one single message" at
least for the 'tsid'/link case, but the text from that note only appears
once.)

[Med] We do have two such statements in the draft. The second one is in 6.3.1: “assuming this fits within one single datagram”.

We don’t include any further guideline as the root requirement is to “minimize the number of active”.

My editorial suggestions in
https://github.com/boucadair/draft-dots-telemetry/pull/9 included some edits
relating to whether a given target is "susceptible to" vs "subject to" (or
even "actively being subjected to") different types of attacks. Med has
diligently already merged that PR while I was sick, so I mostly assume that
these already got reviewed for correctness. Still, I want to call out that
there is a difference and I was not always 100% sure that I picked the right
one for each case.

It looks like we have to use backslash continuation markers for several of
the CoAP request target resources in the examples; do we want to reference
RFC 8792 folding and use its specific note text about line wrapping?

[Med] I use RFC 8792 in other specs, but I hesitated to use it for telemetry, for consistency with the convention we used in RFC 8783:

Within the
examples, many protocol header lines and message-body text are split
into multiple lines for display purposes only. When a line ends with
backslash ('\') as the last character, the line is wrapped for
display purposes. It is to be considered to be joined to the next
line by deleting the backslash, the following line break, and the
leading whitespace of the next line.
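
As an illustration of the convention quoted above, here is a minimal sketch of the unfolding rule: delete the trailing backslash, the line break after it, and the leading whitespace of the next line. The function name `unfold` and the 'tsid' value are hypothetical and chosen for the example only; the request path is likewise only illustrative.

```python
def unfold(text: str) -> str:
    """Undo display-only wrapping: a line ending in a backslash is joined
    to the next line after dropping the backslash, the line break, and the
    next line's leading whitespace. Trailing newlines are stripped."""
    pieces = []
    joining = False  # True when the previous line ended with a backslash
    for line in text.split("\n"):
        if joining:
            line = line.lstrip(" \t")
        if line.endswith("\\"):
            pieces.append(line[:-1])
            joining = True
        else:
            pieces.append(line + "\n")
            joining = False
    return "".join(pieces).rstrip("\n")


# Illustrative wrapped request target (values are assumptions).
wrapped = (
    "PUT coaps://example.com/.well-known/dots/tm-setup/\\\n"
    "    cuid=dz6pHjaADkaFTbjr0JGBpw/\\\n"
    "    tsid=123"
)
print(unfold(wrapped))
```

This implements only the rule in the quoted text; it is not an implementation of the RFC 8792 folding algorithms, which carry their own header note and edge-case handling.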

There are a few places where we say something like "the DOTS client MUST
auto-scale so that the appropriate unit is used". (1) We don't specifically
say what the goal of this automatic behavior is (I guess, use the "largest
unit" that gives a value greater than one?), and (2) it seems that we are in
part relying on this behavior to ensure that the client only specifies one
measure in a given unit class.

[Med] Fixed.

That is, I didn't see anything else that
would prevent a client from sending (say) both Mbps and Gbps values

[Med] We do have:

“Only one unit per unit class is used owing to
unit auto-scaling.”

, that
could of course be in conflict with each other -- if there's a potential for
conflict we'd need to say how to resolve the conflict.

Relatedly, there seems to *also* be potential for conflict in that we allow
both bit/second and Byte/second as unit-classes. What is to happen if I say
that something is both 2 bit/second and 2 Byte/second?

[Med] This falls under:

o If the request is missing a mandatory attribute, does not include
'cuid' or 'tsid' Uri-Path parameters, or contains one or more
invalid or unknown parameters, 4.00 (Bad Request) MUST be returned
in the response.

We have a lot of examples that use the same 'cuid' of
dz6pHjaADkaFTbjr0JGBpw, which is fine and representative of normal
operation. Do we want to curate the 'tsid' and 'tmid' values used by that
client? I see some reuse, which I think is not prohibited (at least for
'tsid'), but may merit checking, and the values are not monotonically
increasing throughout the course of the document, which may or may not be
desirable.

(nit) we seem to have both "signalled" and "signaled" present, and
should standardize on our spelling. (It looks like the one-'l' form is
more common in the RFC archive so far.)

[Med] Fixed

Section 1

Distributed Denial of Service (DDoS) attacks have become more
sophisticated. [...]

We might want to say something about the baseline for the comparison
("more sophisticated compared to what?").

100 Mbps to 100s of Gbps or even Tbps. Attacks are commonly
carried out leveraging botnets and attack reflectors for
amplification attacks such as NTP (Network Time Protocol), DNS
(Domain Name System), SNMP (Simple Network Management Protocol),
or SSDP (Simple Service Discovery Protocol).

Is https://datatracker.ietf.org/doc/html/rfc4732#section-3.1 still
current enough to make a useful reference for amplification attacks?

[Med] Sure. We cite it in RFC 9132 as well.

Nevertheless, when DOTS
telemetry attributes are available to a DOTS agent, and absent any
policy, it can signal the attributes in order to optimize the overall
mitigation service provisioned using DOTS.

I'm not sure what type of policy we have in mind, here.

[Med] An administrator of a DOTS client domain may grant access to telemetry data, or to specific telemetry data (e.g., pipe configuration), to a subset of the domain's clients. Such a policy can be configured out of band (e.g., during service subscription). Added a NEW note.

Section 2

The reader should be familiar with the terms defined in [RFC8612].

It looks like RFC 8612 doesn't define "idle time" and "attack time", so
we may want to pull in a couple more documents as required reading or define
them ourselves.

Sure, we will elaborate these terms with an example.

Section 3.1

If the DOTS server's mitigation resources have the capabilities to
facilitate the DOTS telemetry, the DOTS server adapts its protection
strategy and activates the required countermeasures immediately
(automation enabled) for the sake of optimized attack mitigation
decisions and actions.

I'm not entirely sure I understand this part. It seems to be talking
about telemetry between the DOTS server and the associated mitigation
resources, but the previous discussion has been about telemetry between
DOTS client and DOTS server (and I do not think the mitigation resources
associated with the DOTS server have been said to be DOTS clients in any
of the previous DOTS work).

[Med] Added this NEW: “The interface from the DOTS server to the mitigator to signal the telemetry data is out of scope.”

Section 3.2

DOTS telemetry can also be used to tune the DDoS mitigators with the
correct state of an attack. During the last few years, DDoS attack

(nit) I don't think "with" is the right verb -- "tune with <X>" implies
that X is used as a tool to effectuate the tuning, but I think what's
going on here is more like using the telemetry data as an input for
determining what values to use for the tuning parameters available on
the mitigation resources.

Fixed.

Mitigation of attacks without having certain knowledge of normal
traffic can be inaccurate at best. This is especially true for
recursive signaling (see Section 3.2.3 in [I-D.ietf-dots-use-cases]).

RFC 8903 makes no mention of "recursive", "recursion", or any
hierarchical mitigation scenario (at least in my quick search). Even
the linked draft version (-25) doesn't have a section 3.2.3. I do
remember reading about this type of recursive or hierarchical scenario
somewhere, so I think we just need to locate the right reference to put
here...I'm just not sure what reference that is, offhand.

Good catch, recursive signaling is discussed in RFC8811.

In addition, the highly diverse types of use cases where DOTS clients
are integrated also emphasize the need for knowledge of each DOTS
client domain behavior. Consequently, common global thresholds for
attack detection practically cannot be realized. [...]

(editorial?) When we say "the highly diverse types of use cases where
DOTS clients are integrated" that's essentially an unsupported claim,
but the way it's written we are presupposing that claim to be true
without directly stating it as an observation or assumption. I think we
might be better served by starting off with a declarative statement like
"DOTS clients can be integrated in a highly diverse set of scenarios and
use cases", and then moving on from that now-explicit assumption/fact to
conclude that any single global threshold will mischaracterize the
traffic for some of those diverse DOTS clients.

[Med] Fixed.

Section 4.3

DOTS clients can also use CoAP Block1 Option in a PUT request (see
Section 2.5 of [RFC7959]) to initiate large transfers, but these
Block1 transfers will fail if the inbound "pipe" is running full, so
consideration needs to be made to try to fit this PUT into a single
transfer, or to separate out the PUT into several discrete PUTs where
each of them fits into a single packet.

If I understand/recall correctly, the Block1 transfer is expected to
fail only in a statistical sense, since the client can't send more
blocks until it gets the positive reply from the server to continue to
the next block (on a block-by-block basis), and that reply from the
server is competing with the attack traffic for the inbound "pipe" and
likely to fail for at least one of the blocks. I think we should
probably reword slightly to say only "expected to fail" or "likely
fail", since there is some small chance of success; we may also want to
give a brief reminder about why it is expected to fail (e.g., "the
transfer requires a message from the server for each block, which would
likely be lost in the incoming flood").

Thanks, we will fix the text.
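
To make the "fails only in a statistical sense" point concrete, here is a rough sketch. It assumes, purely for illustration, that the server's per-block reply is lost independently with the same probability for every block and that no retransmission is attempted; neither assumption comes from the draft, and the function name is hypothetical.

```python
def block1_success_probability(num_blocks: int, reply_loss: float) -> float:
    """Chance that every per-block reply from the server gets through,
    under the simplifying assumption of independent, identical loss and
    no retransmissions."""
    if not 0.0 <= reply_loss <= 1.0:
        raise ValueError("reply_loss must be a probability in [0, 1]")
    return (1.0 - reply_loss) ** num_blocks


# With half of the inbound replies lost, a 4-block transfer rarely completes.
print(block1_success_probability(4, 0.5))
```

Under these assumptions the chance of success shrinks geometrically with the number of blocks but never reaches zero for a loss rate below 1, which is why "likely to fail" is more accurate than "will fail".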

Section 6

Telemetry setup configuration is bound to a DOTS client domain. DOTS
servers MUST NOT expect DOTS clients to send regular requests to
refresh the telemetry setup configuration. Any available telemetry
setup configuration has a validity timeout of the DOTS association
with a DOTS client domain. [...]

The term "DOTS association" does not seem to have been used previously
(in RFC 9132 we discuss the "DOTS session" heavily, though). Also, I
don't remember any previous requirement to keep state on the server for
the duration of anything scoped to the entire client *domain*, just
individual DOTS sessions. We do mention detecting conflicts/overlapping
requests within the scope of a client domain, in RFC 9132, but as far as
I can tell that only holds when all the DOTS sessions involved in the
conflict are still active.

Perhaps related, this discussion (at least so far) is not clear to me
about what level of coordination/consistency is expected between clients
in the same domain. I believe that stock DOTS works fine with no such
coordination amongst clients within a domain, so if we are introducing such
a requirement we should say prominently that it's a change.

Yes, telemetry configuration is not specific to the client. For example, the pipe capacity is specific to the site.

Section 6.1.1

Upon receipt of such request, and assuming no error is encountered by
processing the request, the DOTS server replies with a 2.05 (Content)
response that conveys the current and telemetry parameters acceptable
by the DOTS server. [...]

NEW:

Upon receipt of such request, and assuming no error is encountered by
processing the request, the DOTS server replies with a 2.05 (Content)
response that conveys the telemetry parameters acceptable by the DOTS server
and the current baseline information maintained by the DOTS server.

(editorial) Something seems off, here (around "current and telemetry
parameters acceptable"). Is it returning current configuration, acceptable
parameter values, or some combination thereof?

| | +-- query-type* query-type

Since this is a leaf-list of supported query types, should the list name
include the word "supported" or similar?

[Med] Prefixed with “supported-" for consistency with unit class naming. Thanks.

Section 6.1.2

The PUT request with a higher numeric 'tsid' value overrides the DOTS
telemetry configuration data installed by a PUT request with a lower
numeric 'tsid' value. To avoid maintaining a long list of 'tsid'
requests for requests carrying telemetry configuration data from a
DOTS client, the lower numeric 'tsid' MUST be automatically deleted
and no longer be available at the DOTS server.

The way this is phrased with "the lower" and "higher" (vs "highest") assumes
or implies that there is only one "lower" 'tsid' value, perhaps leaving some
ambiguity if there is actually more than one. We might make a statement
about how there is only at most one active 'tsid' per 'cuid'+'cdid' at a
time other than during a config change (if that's the intent), or
alternatively to qualify that the requirement to remove is only incurred in
the event of a conflict (as is done for other types of config, later).

We can have multiple tsids for the same client for pipe/baseline information. This is why we have:

DOTS clients SHOULD minimize the number of active 'tsid's used for
baseline information. In order to avoid maintaining a long list of
'tsid's for baseline information, it is RECOMMENDED that DOTS clients
include in a request to update information related to a given target,
the information of other targets (already communicated using a lower
'tsid' value) (assuming this fits within one single datagram). This
update request will override these existing requests and hence
optimize the number of 'tsid' requests per DOTS client.

Just to confirm: this does not rule out the ability to define new
parameters in the future (for example, the client might learn of new
ones in the response to a GET request)?

Yes.

Section 6.3

* The maximum number of requests allowed per second to the
target.

Should we say anything about the requirement on the protocol in question
that "request" is a meaningful concept and observable by the mitigator
(analogous to what we have about "embryonic connections" earlier)?

Okay, updated text to say: The maximum number of requests (e.g., HTTP/DNS/SIP requests) allowed per second to the target.

(nit) the formatting here is a bit surprising, to wrap the line between
leaf name and type. But if that's what pyang gives, we probably don't
want to mess with it...

[Med] This is generated by pyang when “--tree-line-length 69" argument is used.

| | +-- partial-request-ps? uint64
| | +-- partial-request-client-ps? uint64

I wonder whether the limit on "partial requests" should really be a rate
("per second") versus a point-in-time cap (i.e., "outstanding partial
requests"). It seems like a given request could in principle remain in
"partial" state for an extended period of time, and that having remained
in such a state for a second should not justify the client being able to
produce more partial requests...but the current formulation as a rate seems
to do so.

Good point, added partial requests pending per client.

Section 6.3.1

Two PUT requests from a DOTS client have overlapping targets if there
is a common IP address, IP prefix, FQDN, URI, or alias-name. Also,
two PUT requests from a DOTS client have overlapping targets if the
addresses associated with the FQDN, URI, or alias are overlapping
with each other or with 'target-prefix'.

There can be some subtlety here involving where the FQDN/URI/alias are
resolved into IP addresses from; we may want to say "from the perspective of
the server", "as observed by the server", or similar.

Fixed.
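
As a sketch of the overlap check discussed above, with names resolved from the server's perspective: the helper names and the `resolve` callback below are hypothetical, and the static lookup table merely stands in for whatever resolution the DOTS server actually performs. Only prefixes and FQDNs are covered; URIs and aliases would need the same expansion step.

```python
import ipaddress


def expand(target, resolve):
    """Turn a target into IP networks. `resolve` maps an FQDN to the
    addresses the server sees for it (an assumption of this sketch)."""
    prefixes = list(target.get("target-prefix", []))
    for name in target.get("target-fqdn", []):
        prefixes.extend(resolve(name))
    return [ipaddress.ip_network(p, strict=False) for p in prefixes]


def targets_overlap(a, b, resolve):
    """True if any network of target `a` overlaps any network of `b`."""
    return any(
        x.version == y.version and x.overlaps(y)
        for x in expand(a, resolve)
        for y in expand(b, resolve)
    )


# Server-side view of name resolution (illustrative data only).
table = {"www.example.com": ["2001:db8:6401::1"]}
resolve = lambda name: table.get(name, [])

by_prefix = {"target-prefix": ["2001:db8:6401::/96"]}
by_name = {"target-fqdn": ["www.example.com"]}
other = {"target-prefix": ["192.0.2.0/24"]}

print(targets_overlap(by_prefix, by_name, resolve))
print(targets_overlap(by_prefix, other, resolve))
```

Because the result depends on what `resolve` returns, two parties resolving the same name differently could disagree about whether two requests overlap, which is the subtlety the comment raises.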

Section 7

DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data with
mitigation requests relying upon the target attribute. In

(nit??) Is this "target attribute" a telemetry attribute, or something
else (an attack target?)?

NEW:

DOTS agents SHOULD bind pre-or-ongoing-mitigation telemetry data to
mitigation requests relying upon the resources under attack.

When generating telemetry data to send to a peer, the DOTS agent MUST
auto-scale so that appropriate unit(s) are used.

(editorial) This requirement is not terribly tightly connected to either
what comes after it in the text or what comes before it in the text. We
might consider moving it (earlier, maybe?) and adding a bit more
transition phrasing about how the agent may send telemetry data many
times and in different situations, so the right unit to use will vary
over time.

We will add an example, such as a DDoS attack whose volume grows from Gbps to Tbps, to justify the need
for auto-scaling the units.
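
One possible reading of auto-scaling, following the guess earlier in this thread ("the largest unit that gives a value greater than one"), is sketched below. The unit labels, the decimal factor of 1000, and the function name are assumptions for illustration and are not the unit names defined in the YANG module.

```python
# Illustrative unit labels for one unit class, smallest to largest.
UNITS = ["bps", "Kbps", "Mbps", "Gbps", "Tbps"]


def auto_scale(bits_per_second: int) -> tuple[float, str]:
    """Pick the largest unit in which the value is still at least 1,
    so that a single (value, unit) pair is reported per unit class."""
    value = float(bits_per_second)
    index = 0
    while value >= 1000 and index < len(UNITS) - 1:
        value /= 1000
        index += 1
    return value, UNITS[index]


# The same attack reported as its volume grows over time.
print(auto_scale(2_500_000_000))
print(auto_scale(3_000_000_000_000))
```

Choosing the unit from the value itself is what guarantees a single unit per class: the sender never has a reason to emit both an Mbps and a Gbps figure for the same measurement.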

Section 7.1.1

If the target is subjected to bandwidth consuming attack, the
attributes representing the percentile values of the 'attack-id'
attack traffic are included.

If the target is subjected to resource consuming DDoS attacks, the
same attributes defined for Section 7.1.4 are applicable for
representing the attack.

One of these just says "the attributes representing" and the other
references attributes defined in a specific section; can we use the same
formulation/phrasing to talk about the two cases?

This is an optional attribute.

Er, which one is optional? Just a few paragraphs earlier we said that
at least one attribute MUST be present (so as to be able to identify a
target).

Added the following line:
At least the 'target' attribute and another pre-or-ongoing-mitigation attribute MUST be present in the DOTS telemetry message.

Removed the line "This is an optional attribute"

Section 7.1.2

The 'total-traffic' attribute (Figure 26) conveys the percentile
values (including peak and current observed values) of total traffic
observed during a DDoS attack. [...]

This attribute is under the "pre-or-ongoing-mitigation" hierarchy; is it
really right to only say "traffic observed during a DDoS attack" (i.e.,
implicitly excluding the "pre-attack" case)?

Good catch, fixed.

Section 7.1.3

The 'total-attack-traffic' attribute (Figure 27) conveys the total
attack traffic identified by the DOTS client domain's DDoS Mitigation
System (or DDoS Detector). [...]

(editorial?) "Identified by [the client domain]" seems to imply that
this is only used in telemetry reports from client to server, and not in
updates flowing the other direction. Is that correct?

Yes, telemetry is applicable in both directions and not specific to client to server only.

Section 7.1.4

I think I'm confused about what semantics this is trying to represent.
The "low-percentile-l" seems like it should be providing aggregate
information about some particular snapshot of traffic that we have
determined to be representative of the "low percentile" level of attack
traffic. That is, if we take the recorded attack traffic and divide it
into a bunch of bins on the time axis, we can order the respective bins
from "least noticeable attack" to "worst attack", and we pick the fifth
percentile (or whatever is configured) bin to represent the
"low-percentile" bucket. But now we're reporting a bunch of attributes
of that bin -- number of connections, embryonic connections, connections
per second, etc., even though the bins were ordered based on a single
"worst" metric (and I don't see us specify what metric we actually use
to do that ordering). So even though this is the "low percentile" (e.g.,
fifth percentile) bin of attack traffic (i.e., probably not a very bad
attack), we can't say specifically that the reported total attack
connection count is fifth percentile and the embryonic connection count
is fifth percentile and the requests-per-second is also fifth
percentile. Which makes me confused at why the list is structured this
way -- if we were just saying "for each of connection, embryonic, etc., bin
the samples over that metric and compute the low/med/high percentage values
for that single metric, and this list holds those resulting values", it
seems like a more natural structure to group things would be to have a list
of low/med/high for each of the attributes/metrics.

Good point, we should have a list of low/med/high/peak/current for each of the connection attributes.
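
A small sketch of the per-metric structure agreed on above: each metric is ranked over its own samples, so each reported low/mid/high value is a true percentile of that metric. The metric names, the 10/50/90 percentile levels, the nearest-rank method, and the sample data are all assumptions made for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def per_metric_summary(bins, levels=(("low", 10), ("mid", 50), ("high", 90))):
    """For each metric, compute percentiles over that metric's own samples
    rather than reporting all metrics of one 'low percentile' time bin."""
    summary = {}
    for metric in bins[0]:
        values = [b[metric] for b in bins]
        summary[metric] = {name: percentile(values, p) for name, p in levels}
        summary[metric]["peak"] = max(values)
    return summary


# Time bins where the quietest bin by connections has the most embryonic ones.
bins = [
    {"connections": 10, "embryonic": 9},
    {"connections": 20, "embryonic": 1},
    {"connections": 30, "embryonic": 5},
    {"connections": 40, "embryonic": 3},
    {"connections": 50, "embryonic": 7},
]
print(per_metric_summary(bins))
```

In this sample, the bin with the fewest connections also has the highest embryonic count, so reporting every attribute of a single "low percentile" bin would mislabel that embryonic value as low.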

Cheers,
-Tiru

