Re: [Dots] Comments for draft-reddy-dots-telemetry

"Panwei (William)" <william.panwei@huawei.com> Thu, 22 August 2019 12:43 UTC

From: "Panwei (William)" <william.panwei@huawei.com>
To: "Konda, Tirumaleswar Reddy" <TirumaleswarReddy_Konda@McAfee.com>
CC: dots <dots@ietf.org>, "draft-reddy-dots-telemetry@ietf.org" <draft-reddy-dots-telemetry@ietf.org>
Date: Thu, 22 Aug 2019 12:43:37 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/dots/epdh9aWkdobZwnqpRfsAvWoVJ2Y>

Hi Tiru,

Please see inline.

Regards & Thanks!
潘伟 Wei Pan
华为技术有限公司 Huawei Technologies Co., Ltd.

From: Dots [mailto:dots-bounces@ietf.org] On Behalf Of Konda, Tirumaleswar Reddy
Sent: Tuesday, August 20, 2019 6:15 PM
To: Panwei (William) <william.panwei@huawei.com>; draft-reddy-dots-telemetry@ietf.org
Cc: dots <dots@ietf.org>; MeiLing Chen <chenmeiling@chinamobile.com>
Subject: Re: [Dots] Comments for draft-reddy-dots-telemetry

Hi Wei,

Thanks for the detailed review. Please see inline [TR]

From: Dots <dots-bounces@ietf.org> On Behalf Of Panwei (William)
Sent: Thursday, August 15, 2019 12:44 PM
To: draft-reddy-dots-telemetry@ietf.org
Cc: dots <dots@ietf.org>; MeiLing Chen <chenmeiling@chinamobile.com>
Subject: [Dots] Comments for draft-reddy-dots-telemetry


1. I agree that DOTS telemetry should have a dedicated URI.

[TR] Yes, will add path-suffix '/telemetry'

2. In Section 4.1.1, I’m generally OK with the idea of the ‘Total Traffic Normal Baseline’.

  2.1. I couldn’t figure out what the low, mid and high percentiles actually stand for, or how to interpret them. I think the average bandwidth, the peak bandwidth and the usual bandwidth range (from minimum bandwidth to maximum bandwidth) are useful for reference.

[TR] Percentile can be used for statistical analysis and is better than average, see https://www.elastic.co/blog/averages-can-dangerous-use-percentile and https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
[Wei] Thanks for sharing this, I have a better understanding of percentiles now.
As I understand it: for example, take 100 samples at regular intervals and sort the sample values in ascending order; then the 10th value is the 10th percentile, the 50th value is the 50th percentile, and the 90th value is the 90th percentile.
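A minimal Python sketch of the sampling procedure just described, assuming the nearest-rank definition of a percentile (the draft does not say which definition it intends). The function name and the sample data are illustrative only.

```python
def nearest_rank_percentile(samples, p):
    """Return the p-th percentile (integer p, 1..100) by the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(p / 100 * n): the 1-based position in the sorted list.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical data: 100 bandwidth samples (Mbps) taken at regular intervals.
# (i * 37) % 100 visits every value 0..99 once, so this is 1..100 in mixed order.
samples_mbps = [((i * 37) % 100) + 1 for i in range(100)]

low, mid, high = (nearest_rank_percentile(samples_mbps, p) for p in (10, 50, 90))
print(low, mid, high)  # prints: 10 50 90
```

With exactly 100 samples the p-th percentile is simply the p-th sorted value, which matches the description above; with other sample counts the nearest-rank rule picks the closest position.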
But I still have some questions:
1. The percentile looks more useful, but does that mean the average value is totally useless?
2. The text in the draft is very brief, so I didn’t get the reason why time-range information is not required. Is there an existing standard for traffic baselines, covering what information the baseline contains, how to create it, and how to understand or use it? Can you give me a reference or elaborate on these?
3. There are 100 percentile values, from the 1st to the 100th. What’s the reason for choosing only the 10th, 50th and 90th? Would more values be more helpful? Can all 100 values be sent?
Would it be better to make this attribute a list, for example as below, to let the DOTS client decide which percentiles to send?
+--rw total-traffic-normal-baseline
   +--rw bandwidth-percentile* [percentile-position]
      +--rw percentile-position     uint8 // range is from 1 to 100
      +--rw percentile-value-pps    uint64
      +--rw percentile-value-bps    uint64
4. If the DOTS server wants to know some percentiles that the DOTS client didn’t send, can the DOTS server actively ask the DOTS client to send a specific percentile?
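To make the list idea in question 3 concrete, here is a rough Python sketch of how an implementation might hold the proposed ‘bandwidth-percentile’ list, keyed by ‘percentile-position’. The class and function names and all values are hypothetical; only the three leaf names come from the tree above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandwidthPercentile:
    percentile_position: int   # list key, expected range 1..100
    percentile_value_pps: int  # packets per second (uint64 in the tree)
    percentile_value_bps: int  # bits per second (uint64 in the tree)

    def __post_init__(self):
        if not 1 <= self.percentile_position <= 100:
            raise ValueError("percentile-position must be in 1..100")
        for value in (self.percentile_value_pps, self.percentile_value_bps):
            if not 0 <= value < 2**64:
                raise ValueError("value does not fit in uint64")


def build_baseline(entries):
    """Index entries by percentile-position, rejecting duplicate keys."""
    baseline = {}
    for entry in entries:
        if entry.percentile_position in baseline:
            raise ValueError(
                f"duplicate percentile-position {entry.percentile_position}")
        baseline[entry.percentile_position] = entry
    return baseline


# Hypothetical values: the DOTS client chooses which percentiles to send.
baseline = build_baseline([
    BandwidthPercentile(10, 20_000, 150_000_000),
    BandwidthPercentile(50, 45_000, 400_000_000),
    BandwidthPercentile(90, 80_000, 750_000_000),
    BandwidthPercentile(100, 120_000, 990_000_000),
])

print(sorted(baseline))  # positions the client chose to send
print(baseline.get(95))  # None: a position the server would have to ask for
```

A lookup that returns nothing, as for position 95 here, is the situation question 4 asks about.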

There may be some confusion between the peak bandwidth and the maximum bandwidth. But IMO they are different: the peak bandwidth is a burst high value that only lasts for a short while, e.g., a few seconds or minutes, and the maximum bandwidth is a sustained high value that can last for a long time, e.g., a few hours.

[TR] I thought peak and maximum bandwidth were the same. Please point me to a reference distinguishing between peak and maximum bandwidth.
[Wei] If the percentiles are sent as a list in the way I described above, I think a separate peak bandwidth is no longer needed: the 100th percentile is the peak value.

  2.2. The statistics can be quite different in different time ranges; for example, the bandwidth may be much higher from 6:00 pm to 12:00 am than from 6:00 am to 12:00 pm. So for accuracy, it’s better to calculate the baseline separately for different time ranges.

[TR] I don’t think time ranges are required to create a baseline for statistical analysis. The bandwidth may also vary depending on the day of the week, holidays, specific events (e.g. games) and flash crowd scenarios (https://www.radware.com/resources/ddos_mitigation_layers.aspx)
[Wei] I’m really confused. Can you elaborate on the baseline?
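For illustration only, the per-time-range baseline suggested in 2.2 could be sketched in Python as below. This is not something the draft specifies: the 6-hour buckets, the nearest-rank percentile, the function names and the sample values are all assumptions.

```python
from collections import defaultdict
from datetime import datetime


def bucket_for(ts, hours_per_bucket=6):
    """Map a timestamp to its (start_hour, end_hour) time range."""
    start = (ts.hour // hours_per_bucket) * hours_per_bucket
    return (start, start + hours_per_bucket)


def percentile(values, p):
    """Nearest-rank percentile for integer p in 1..100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def baseline_by_time_range(samples, positions=(10, 50, 90)):
    """Group (timestamp, bps) samples by time range and compute percentiles."""
    grouped = defaultdict(list)
    for ts, bps in samples:
        grouped[bucket_for(ts)].append(bps)
    return {
        bucket: {p: percentile(values, p) for p in positions}
        for bucket, values in grouped.items()
    }


# Hypothetical samples: quiet morning traffic, busy evening traffic.
samples = [
    (datetime(2019, 8, 20, 7, 0), 100),
    (datetime(2019, 8, 20, 9, 0), 120),
    (datetime(2019, 8, 20, 11, 0), 140),
    (datetime(2019, 8, 20, 19, 0), 600),
    (datetime(2019, 8, 20, 21, 0), 800),
    (datetime(2019, 8, 20, 23, 0), 700),
]

for bucket, values in sorted(baseline_by_time_range(samples).items()):
    print(bucket, values)
```

A single baseline over all six samples would mix the two regimes, which is the accuracy concern raised in 2.2. Day of week, holidays and events, as mentioned in the reply, would need further grouping keys.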

3. In Sections 4.1.3 and 4.1.4, about the ‘Total Attack Traffic’ and ‘Total Traffic’, I think they should reflect the current attack traffic bandwidth and the current traffic bandwidth, so is a single current value enough? Are the low percentile, mid percentile, high percentile and peak values still needed?

[TR] Yes, percentile is required to analyze the attack pattern.
[Wei] Aren’t the ‘Total Attack Traffic’ and ‘Total Traffic’ instantaneous values? Why sample these two attributes and compute statistics over them?

4. In Section 4.1.5, about the ‘Attack Details’:
  4.1. I don’t like the ‘vendor-id’ and ‘attack-id’ unless they are optional. The ‘attack-id’ is maintained by each vendor, and the number of ‘vendor-id’ and ‘attack-id’ combinations can be enormous, so it’s a burden for implementations to understand and map these elements. This applies especially to the DOTS client, which needs to implement the ‘Attack Details’ attribute as described in Section 4.3.1.

  4.2. Here the ‘attack-name’ uses a textual representation to express the attack type. Meiling also designed a mechanism to express the attack type in her draft (draft-chen-dots-attack-informations-02). Meiling’s mechanism is more complicated in its rule definitions, but it may be easier to implement and understand. The textual representation here seems very easy because it has no rule definitions, but it needs Natural Language Processing techniques. Are these NLP techniques easy to implement? With no unified rules, will different implementations produce different analysis results?

[TR] You may want to look into the discussion https://mailarchive.ietf.org/arch/msg/dots/uyq-AB4me7qZ2apuaw8b3J6JDnA
[Wei] I didn’t see a conclusion.

  4.3. For the ‘attack-severity’, I feel this element is too subjective. What are the criteria for distinguishing among ‘emergency’, ‘critical’ and ‘alert’?

[TR] Yes, it is subjective and only a hint.
[Wei] So I don’t think this is needed.

5. In Section 4.2, about the ‘Mitigation Efficacy DOTS Telemetry Attributes’: besides the ‘Total Attack Traffic’, I think ‘Total Traffic’ and ‘Total Pipe Capability’ can also be included here. In some cases the DOTS client can’t distinguish attack traffic from total traffic, so it will not be able to send the ‘Total Attack Traffic’, but it can send the current ‘Total Traffic’ and ‘Total Pipe Capability’ to indicate the mitigation efficacy. This is also mentioned in Meiling’s draft.

[TR] If the traffic is scrubbed by the DDoS mitigation provider, the DOTS server already knows the ‘Total Traffic’. ‘Total pipe capability’ is a pre-mitigation attribute. The pipe capacity won’t change during a DDoS attack.
[Wei] Is there a strong binding between the pre-mitigation telemetry and the mitigation efficacy telemetry? Must the DOTS client send the pre-mitigation telemetry before the mitigation efficacy telemetry?
In my understanding, they are separate, so when considering only the mitigation efficacy telemetry, the ‘Total Traffic’ and the ‘Total Pipe Capability’ can also be used for measuring efficacy.
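One possible reading of this, sketched in Python purely as an assumption: a receiver could compare the current ‘Total Traffic’ against the ‘Total Pipe Capability’ and treat a nearly full pipe as a sign that mitigation is not yet effective. The function names and the 0.9 threshold are hypothetical and do not come from either draft.

```python
def pipe_utilization(total_traffic_bps, total_pipe_capability_bps):
    """Fraction of the pipe currently in use (may exceed 1.0 if oversubscribed)."""
    if total_pipe_capability_bps <= 0:
        raise ValueError("pipe capability must be positive")
    if total_traffic_bps < 0:
        raise ValueError("traffic must not be negative")
    return total_traffic_bps / total_pipe_capability_bps


def looks_saturated(total_traffic_bps, total_pipe_capability_bps, threshold=0.9):
    """Hypothetical heuristic: the pipe is considered saturated at >= threshold."""
    return pipe_utilization(total_traffic_bps, total_pipe_capability_bps) >= threshold


# Hypothetical readings on a 1 Gbps pipe.
print(looks_saturated(950_000_000, 1_000_000_000))
print(looks_saturated(400_000_000, 1_000_000_000))
```

This needs no knowledge of which traffic is attack traffic, which is the case described in point 5.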

6. The telemetry attributes are divided into three categories: Pre-mitigation, Mitigation Efficacy, Mitigation Status. I think these categories are reasonable and clear. But I found that the attributes are basically all related to bandwidth. Bandwidth is useful for volume-based DDoS attacks, but for resource-based DDoS attacks, other attributes are needed.

[TR] Good point, will update draft.
[Wei] OK

  6.1. To assess resource-based DDoS attacks, session statistics will be helpful. These statistics can be collected along different dimensions: the number of sessions per protocol (e.g., TCP/UDP/ICMP), the number of sessions per source IP, the number of source IPs, etc.

[TR] If it is resource-based DDoS, what is the use of the number of sessions per source IP and the number of source IPs?
[Wei] I will do more study to find some useful attributes and related examples.
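As a rough sketch of the three dimensions listed in 6.1, the Python below counts sessions per protocol, sessions per source IP, and distinct source IPs. The record layout, key names and addresses (taken from documentation ranges) are hypothetical.

```python
from collections import Counter

# Hypothetical session records: (protocol, source IP).
sessions = [
    ("TCP", "192.0.2.1"),
    ("TCP", "192.0.2.1"),
    ("UDP", "192.0.2.2"),
    ("TCP", "198.51.100.7"),
    ("ICMP", "192.0.2.2"),
]


def session_statistics(records):
    """Summarize sessions along the dimensions mentioned in 6.1."""
    per_protocol = Counter(protocol for protocol, _ in records)
    per_source_ip = Counter(source_ip for _, source_ip in records)
    return {
        "sessions_per_protocol": dict(per_protocol),
        "sessions_per_source_ip": dict(per_source_ip),
        "source_ip_count": len(per_source_ip),
    }


stats = session_statistics(sessions)
print(stats["sessions_per_protocol"])
print(stats["sessions_per_source_ip"])
print(stats["source_ip_count"])
```

Whether these counters are actually useful for resource-based attacks is the open question in this exchange; the sketch only shows what would be counted.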

  6.2. These session statistics can be added to the ‘Total Traffic Normal Baseline’, and also to ‘Total traffic’ and ‘Total Capability’. The YANG module tree reflecting my understanding is attached at the end for reference.

[TR] What type of session statistics are you referring to?
[Wei] I mean all the attributes related to sessions.

  6.3. Some other information which can help identify an attack can also be considered and included. For example, in some attacks the attackers establish many sessions with a very long lifetime, so the statistics of session lifetime may help.

[TR] Please point me to DDoS attacks using sessions with very long lifetime.
[Wei] I will do more study to find some useful attributes and related examples.

7. Discussion:
I’d like to raise a discussion here. I tried to consider the telemetry from the aspects of ‘why’, ‘what’, ‘who’, ‘where’, ‘when’ and ‘how’. ‘Why we need telemetry’ is described in Section 3 and ‘What are the telemetry attributes’ is described in Section 4. The remaining ‘who’, ‘where’, ‘when’ and ‘how’ I summarize as ‘how will we use this telemetry’, i.e., in which scenario which role will send which telemetry attributes over which channel. This is not described yet. So do we need to describe ‘how will we use this telemetry’?

[TR] The use of the telemetry is implementation specific. For example, a DOTS server can use the telemetry for statistical analysis or deep learning, or to notify the DOTS server security operations teams.
[Wei] But without use cases, how can we show which attributes are needed and which are not?
