[Nmlrg] Review for draft-jiang-nmlrg-traffic-machine-learning-00.txt

Albert Cabellos <albert.cabellos@gmail.com> Sat, 16 July 2016 14:11 UTC

From: Albert Cabellos <albert.cabellos@gmail.com>
Date: Sat, 16 Jul 2016 16:11:23 +0200
Message-ID: <CAGE_QewtGRL58K-XLrFOE9a-vMjJEV8v5sthMQ3OeHdzAOKK8A@mail.gmail.com>
To: draft-jiang-nmlrg-traffic-machine-learning@ietf.org, nmlrg@irtf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/nmlrg/bTITpyP38HiD4xsnaKXvIrqLTaI>

Hi all,

Please find below my review of
draft-jiang-nmlrg-traffic-machine-learning-00.txt.
Thanks for putting together this very relevant draft.

Albert


1.  Introduction


[snip]

>    It is natural to utilize powerful machine learning technology to
>    analyze the large mount of data regarding network traffic, to
>    understand the network's status, such as performance, failures,
>    security, etc.  It is a big advantage that machines can measure and
>    analyse the network traffic, then report the results and predictions
>    to humans for further decision.  The machines could handle vast
>    amounts of data which is almost impossible for humans to deal with,
>    in close to real time.  Even more, if the speed and accuracy of the
>    prediction is high enough, it is possible that the subsequent action
>    based on the prediction result could form a closed control loop to
>    achieve autonomic management.  However, the maturity of latter might
>    be far in the future.  Today, the traditional control programs still
>    look more reliable than machine learning based control mechanisms.


Personally, I don't see the closed control loop as being that far in the
future. For example, ML-based application detection (using flow analysis)
can be used to automatically choose appropriate routes for the flows.
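
To make the point concrete, here is a minimal sketch of such a loop. It is
not taken from the draft: the toy classifier, the application classes, the
route names and the confidence threshold are all hypothetical placeholders,
and a real deployment would call a trained model where classify() is.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical mapping from detected application class to a route (assumption).
ROUTE_BY_APP = {"video": "low-latency-path", "bulk": "high-capacity-path"}
DEFAULT_ROUTE = "default-path"


@dataclass
class Flow:
    mean_packet_size: float    # bytes
    mean_inter_arrival: float  # seconds


def classify(flow: Flow) -> Tuple[str, float]:
    """Stand-in for an ML-based flow classifier; returns (app, confidence).

    This toy rule only exists so that the loop below is runnable.
    """
    if flow.mean_packet_size > 1000 and flow.mean_inter_arrival < 0.01:
        return "bulk", 0.9
    if flow.mean_packet_size > 500:
        return "video", 0.7
    return "unknown", 0.3


def choose_route(flow: Flow, min_confidence: float = 0.6) -> str:
    """Closed-loop step: the prediction feeds directly into route selection."""
    app, confidence = classify(flow)
    if confidence < min_confidence:
        return DEFAULT_ROUTE  # low confidence: keep the traditional behaviour
    return ROUTE_BY_APP.get(app, DEFAULT_ROUTE)
```

Falling back to a default route when confidence is low is one possible way
to limit the impact of wrong predictions; it is a design choice of this
sketch, not a recommendation from the draft.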

>
>    This document firstly analyzes the data of the network traffic from
>    various perspectives; and also discusses several important practical
>    considerations, including the training data source, data storage and
>    the learning system architecture.  It then introduce a set of use
>    cases, which have been shown to work well although there is large
>    scope for improvements, including ML-based traffic classification,
>    traffic management, interface failure prediction, etc.


This is a very relevant objective. I'd also like to see general
conclusions, lessons learnt, and perhaps a general architecture that
accommodates all the use cases. What do these use cases have in common?
What are the common challenges? Are there traffic features common to all
the use cases? Can we list such traffic features?

2.  Terminology


[snip]


>    Traffic Flow  A sequence of packets from a source computer to a
>       destination [RFC6437].  It is the unit of network traffic.



To me, the unit of network traffic is the packet. I haven't read RFC 6437,
so maybe I am missing something.
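
The distinction can be sketched as follows: packets are the individual
records, and flows are an aggregation built on top of them. The key used
here (source, destination, flow label) and the Packet fields are
assumptions made for illustration only; they are not a statement of what
the draft or RFC 6437 requires.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str
    flow_label: int
    size: int  # bytes


def group_into_flows(packets):
    """Aggregate individual packets into flows keyed by (src, dst, label)."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[(pkt.src, pkt.dst, pkt.flow_label)].append(pkt)
    return dict(flows)
```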

[snip]

3.1.  Data of the Network Traffic


[snip]

>    User content  User contents are the payload of packets, which might
>       be obtained by DPI (Deep Packet Inspection) within the transit
>       network if the packets are unencrypted, or they could be analyzed
>       by the source or destination nodes.


Why ‘user content’? Why not ‘content’? There are initiatives for DPI over
encrypted traffic:

http://iot.stanford.edu/pubs/sherry-blindbox-sigcomm15.pdf

3.3.  Architecture Considerations


>    Offline & online learning
>
>       *  Co-located mode: training (offline, based on historic data) and
>          prediction (online, based on real-time data) are both done
>          within the same entity.  The entity could be a central
>          repository or a specific node.
>
>       *  De-coupled mode: training is done in the central repository,
>          and prediction is made by the routers/switches/firewalls or
>          other devices that directly process the network traffic.
>
>
>    Central learning & distributed learning  Central learning means the
>       learning process is done at a single entity, which is either a
>       central repository or a node.  Distributed learning refer to
>       ensemble learning that multiple entities do the learning
>       simultaneously and ensemble the results together to sort out a
>       final results.  Since network devices are naturally distributed,
>       it could be foreseen that ensemble learning is a good approach for
>       a certain of use cases.


This last paragraph is unclear to me: what is the relation between this
and offline/online learning? I think it could be simplified by stating
that learning can be done offline and applied online, or done directly
online, either on a network device or in a centralized entity.
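
A minimal sketch of the "learn offline, apply online" split, assuming a
simple nearest-centroid classifier (my choice for brevity, not something
the draft prescribes). The feature vectors and labels are synthetic. The
only thing that travels between the two steps is the model, so the steps
can run in the same entity or in different ones.

```python
from collections import defaultdict
from math import dist


def train_offline(samples):
    """Offline step: compute one centroid per label from historic data.

    `samples` is a list of (feature_vector, label) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict_online(model, features):
    """Online step: label one live observation with the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], features))


# Synthetic historic data: (mean packet size, mean inter-arrival time in ms).
historic = [([1400.0, 1.0], "bulk"), ([1300.0, 2.0], "bulk"),
            ([200.0, 20.0], "interactive"), ([100.0, 30.0], "interactive")]

model = train_offline(historic)               # e.g. in a central repository
label = predict_online(model, [1200.0, 3.0])  # e.g. on a router or switch
```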

3.4.  Closed Control Loop
>
>    The prediction made by machine learning mechanism could be directly
>    used on manipulating the network traffic, or other relevant actions,
>    such as changing the device configuration, etc.
>
>    However, as the introduction section said, this kind of utilization
>    might be suitable only for a small set of the use cases, due to the
>    limited accuracy of machine learning technologies.  Besides, some
>    critical usages simply cannot tolerate any false decision.


See my comment above. This paragraph suggests that the closed control loop
is a very long-term application. Why? ML-based closed control loops are
currently applied in many fields (e.g., computer vision, self-driving
cars). Some people may argue that such applications are more complex
and/or critical than network traffic analysis.

[snip]

4.1.  HTTPS Traffic Classification


[snip]

>    As a concrete example, Google, Facebook or Amazon are service
>    providers while maps, drive, gmail are services of Google.  To
>    identify them when they are accessed by a user, IP addresses and DNS
>    (Domain Name System) names based identification is not reliable as
>    the users can relies on intermediates to respectively serve as proxy
>    or resolve DNS requests.  The SNI (Server Name Indication) [RFC5246]
>    is an extension of HTTPS which is indicated by the user when
>    initiating the TLS handshake (Client Hello).  SNI actually contains
>    the hostname to which the request is addressed.  Such an hostname is
>    significative of the service and service provider name.  However, SNI
>    is an optional field and can be easily forged to circumvent HTTPS
>    filtering without impacting service use [bypasssni].  More advanced
>    mechanisms are hence necessary to improve the robustness of
>    identification even in the case of non collaborative users.


I suggest being vendor-agnostic in the examples; naming specific vendors
does not improve the draft.

[snip]

>
>
>      HTTPS Connection
>            +
>            |(1)
>    +-------v------+
>    |TLS Connection|
>    |Reconstruction|
>    +-------+------+
>            |(2)
>    +-------v------+    (3')                    (4')
>    |  Features    +-------------+----------------------------+
>    |  Extraction  |             |                            |
>    +-------+------+     +-------v---------+             +----v----+
>            |            |Service Provider +------------->Services |
>            |(3)         |L1 model         |   Load      |L2 model |
>            |            +-------^---------+   services  +----^----+
>    +-------v------+             |             model X        |
>    |SNI Labelling |             +----------------------------+
>    +-------+------+                         |(5)
>            |            +-----------------------------------------+
>            +------------>              Training and               |
>                    (4)  |              Models building            |
>                         +-----------------------------------------+
>
>    Two-levels HTTPS traffic classification
>
>    In figure above, step(1) consists in reconstructing the HTTPS
>    connection and retrieving packets on top of which the following
>    metrics are observed (2):
>
>    o  Inter Arrival Time
>
>    o  Packet size
>
>    o  Encrypted data size: this feature has the advantage to be strongly
>       related to the service accessed instead of the packet size which
>       is biased by other lower layer headers
>
>    Based on these values, aggregated features are computed: average,
>    minimum, maximum, 25th percentile, median, 75th percentile.
>
>
Do the authors see value in listing all the traffic features in an annex?
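
For reference, the aggregated features quoted above could be computed
roughly as follows. This is a sketch under my own assumptions: the function
names, the feature naming scheme and the "inclusive" percentile method are
illustrative choices, not the authors' implementation.

```python
from statistics import mean, quantiles


def aggregate(values):
    """Summary statistics listed in the draft for one per-packet metric.

    Needs at least two values, since quantiles() cannot work on fewer.
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"avg": mean(values), "min": min(values), "max": max(values),
            "p25": q1, "median": med, "p75": q3}


def connection_features(timestamps, packet_sizes, encrypted_sizes):
    """Build a flat feature dict for one reconstructed TLS connection."""
    # Inter-arrival time: gap between consecutive packet timestamps.
    inter_arrival = [b - a for a, b in zip(timestamps, timestamps[1:])]
    metrics = {"iat": inter_arrival, "pkt_size": packet_sizes,
               "enc_size": encrypted_sizes}
    return {f"{name}_{stat}": value
            for name, values in metrics.items()
            for stat, value in aggregate(values).items()}
```

With three metrics and six statistics each, this yields 18 features per
connection, which is the kind of list that would fit in an annex.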


[snip]


4.2.  Malicious Domains: Automatic Detection with DNS Traffic Analysis


[snip]


>    As a result, in an automated fashion, a large variety of suspicious
>    domains can be detected, including phishing, malware, but also other
>    types, such as fake pharmaceutical shops as well as counterfeit
>    sneakers.  In this particular case, the responsible registrars are
>    notified in this pilot study about these websites.  Ultimately, it
>    allows these websites to be taken down, minimizing the potential
>    number of victims.



Can this use case be elaborated a little further? Which features are
used? How is ML applied, and with which algorithm? Where does the training
set come from?


4.3.  Machine-learning based Policy Derivation and Evaluation in
      Broadband Networks



[snip]

>    It is evident that machine learning can have significant importance
>    in policy derivation and evaluation in broadband networks, especially
>    towards in 5G infrastructures which will be complex, heterogeneous
>    and need to accommodate multi-services ranging from mobile broadband
>    to massive machine type, mission critical and vehicular
>    communications.


This use case relates to a very relevant area for ML (5G), but I don't see
how ML is being applied. Is ML generating knowledge (unsupervised) to offer
recommendations to the network administrators? Is ML used to estimate the
performance of a given policy?

4.4.  Traffic Anomaly Detection in the Router


[snip]

>    Besides wavelet analysis, there might be more techniques to explore,
>    such as correlation analysis of traffic anomaly events among multiple
>    devices.


In some cases, beyond correlation over time, ML can be applied to
correlate traffic with external information (weather, calendar, etc.) that
may play an important role in the traffic profile.
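
As a very small illustration of correlating traffic with an external
signal, the sketch below computes a plain Pearson correlation between
daily traffic volume and a weekend indicator. The numbers are synthetic and
made up for this example, and Pearson correlation is only a simple stand-in
for whatever ML technique would actually be used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic example: daily traffic volume (arbitrary units) for two weeks,
# and a calendar feature marking weekends (1) versus weekdays (0).
daily_volume = [90, 95, 92, 94, 91, 60, 55, 93, 96, 90, 92, 94, 58, 57]
is_weekend = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1]

r = pearson(daily_volume, is_weekend)
```

On this synthetic data the coefficient is strongly negative, i.e. the
calendar feature alone explains most of the dips, which is the kind of
context an anomaly detector could use to avoid flagging every weekend.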

4.5.  Applications of Machine Learning to Flow Monitoring


>    A commercial cloud-based flow monitoring service from Network
>    Polygraph [polygraph] has used Machine Learning analysis as a cost-
>    effective alternative to DPI for traffic classification, which
>    identifies the application responsible for each network traffic flow.


I suggest making this use case vendor-neutral.

>    The target objective is to progressively reduce the dependence on DPI
>    technologies, which are expensive, difficult to deploy, not scalable,
>    and not robust against encryption, in favor of flow-based machine
>    learning approaches that are more cost-effective and can be easily
>    offered as a cloud service.  In this direction, some research
>    challenges include the classification of web services and CDN traffic
>    from flow-based measurements, and the combination of multiple ground
>    truths obtained from vantage points in different networks.
>

Do the authors see value in listing all the traffic features used in an
annex?

5.  Security Considerations
>
>    This document is focused on applying machine learning in network,
>    including of course applying machine learning in network security, on
>    higher-layer concepts.  Therefore, it does not itself create any new
>    security issues.


I second Brian’s comment: a brief discussion of privacy concerns for these
very specific use cases would be very valuable.
