Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens

David Meyer <dmm@1-4-5.net> Wed, 29 June 2016 15:35 UTC

From: David Meyer <dmm@1-4-5.net>
Date: Wed, 29 Jun 2016 08:35:37 -0700
Message-ID: <CAHiKxWhMPjKY24zucg-UND0ARWHbjnOyYv=Tia-yk-Vq6Ta_SA@mail.gmail.com>
To: Sheng Jiang <jiangsheng@huawei.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/nmlrg/JXzVLU6oO4naytteFVUufgw0fg0>
Cc: "nmlrg@irtf.org" <nmlrg@irtf.org>

On Wed, Jun 29, 2016 at 1:10 AM, Sheng Jiang <jiangsheng@huawei.com> wrote:

> Hi, David,
>
>
>
> Thanks so much for your email. I share the same concern you have.
> Actually, this Monday at NMLRG #3 in Athens, we did discuss this issue
> regarding the potential standardized dataset. And we are planning to
> have an open discussion regarding the requirements and validation
> qualification of the potential common dataset at the coming NMLRG #4,
> Berlin, IETF 96.
>

Looking forward to it.


>
>
> However, up to now I have several concerns. Firstly, before we could
> discuss such a dataset, we need to narrow our target scenarios to a few
> very specific use cases, given that there are so many various scenarios in
> the network area and they are so different from each other and require
> different datasets.
>

I'm hoping that not every network is a one-off. Rather, what we seek is
to learn the fundamental properties of networks that can be the
basis for reasoning about what networks do. For example, we should be able
to learn something from work done on transfer learning [0] and
representation learning [1]. In the image recognition space there are even
startups that provide pretrained models such as AlexNet, GoogLeNet, etc.,
trained on ImageNet (and others), that you can further train/customize for
your particular application. In any event, if every network requires
something totally different then we haven't yet understood what is
fundamental to the structure of our networks.
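To make the transfer-learning idea concrete, here is a minimal sketch (the data, the "trunk" weights, and all names are invented for illustration; a real pretrained trunk would come from training on ImageNet or similar): a frozen feature extractor is reused as-is, and only a small task-specific head is retrained on new data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor (e.g., the trunk of an
# AlexNet/GoogLeNet-style model): a frozen nonlinear map from raw inputs
# to a feature space. Here the weights are random purely for brevity.
W_frozen = rng.normal(size=(20, 8)) / np.sqrt(20)

def features(x):
    # Frozen "pretrained" layers: raw input -> feature vector (ReLU).
    return np.maximum(x @ W_frozen, 0.0)

# A small task-specific dataset (synthetic binary labels).
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Train only the new logistic-regression "head"; the trunk stays frozen.
F = features(X)
w_head, b_head = np.zeros(8), 0.0
for _ in range(500):  # plain batch gradient descent
    p = 1.0 / (1.0 + np.exp(-(F @ w_head + b_head)))
    w_head -= 0.5 * (F.T @ (p - y)) / len(y)
    b_head -= 0.5 * np.mean(p - y)

acc = np.mean(((1.0 / (1.0 + np.exp(-(F @ w_head + b_head)))) > 0.5) == y)
print("training accuracy of the retrained head:", acc)
```

The point of the sketch is only the division of labor: the frozen trunk carries whatever general structure was learned on the large source task, and only a tiny head is fit to the new task.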

> Secondly, as we have talked with operators, the real data is an important
> asset for them. So, it would be almost impossible for them to share it.
> That leaves two possibilities for how to get a dataset: a) generating a
> dataset using simulation;
>

This likely won't work. For example, if you do this and then try to train a
DNN on that data set, the NN will learn how the simulator works. This is
unlikely to be the outcome that we'd like (we already know how the
simulator works).

> b) try to get some partial dataset from operators that has had their
> sensitive data removed. Both ways are problematic, with the risk that a
> lot of effort may be wasted in the wrong areas.

Agree that this is problematic.

> Thirdly, the availability of data may be very different in different
> network environments. It depends on the measurement/cognitive functions
> that have been deployed or implemented.

What is a "cognitive function"?

In any event, you have outlined just a few of the problems with creating
"standardized" data sets for our use. I have quoted "standardized" because
I'm beginning to think it's the wrong word (it invokes "standards", e.g.,
IETF, <X>..., and that is not what is meant here). I'm not sure what the
right term is.


>
> Although having the above concerns, I think we should: a) research on a
> certain dataset. It would be very useful to prove that machine learning
> could perform better or provide more flexibility and adaptability;
>

Not sure what you mean here. By "certain dataset" you mean what?

> b) continue our discuss on various network use cases that could benefit
> from applying machine learning, in order to serve wider network communities.
>

Use cases are good, however machine learning isn't something we can just
apply to a use case because we've defined the use case. There are really at
least three parts of this that need to be considered as part of a solution:
(i) the use case (as you mention), (ii) the algorithm(s) that are going to
provide the regression/classification desired in (i), and (iii) the data
sets.

(ii) requires an understanding of the algorithms that can be applied to
produce the inferences (regressions, ...) we are interested in. What
algorithms we can use depends on the data we can get (iii).

That said, one of the things we've learned over the past decade or so is
that learning itself is a very general process. We see this in how powerful
something as simple as SGD/backprop is (see e.g., [2]). We can also use
this to our advantage and re-purpose algorithms originally developed for
other tasks. A simple example of this is LDA (see [3,4]). Here the original
idea was to model topics in a corpus of documents, where the topics
themselves (along with the per-document topic proportions) were the
latent/hidden variables. One way to re-purpose LDA for applications like
ours (essentially classification/anomaly detection tasks) is to view each
to/from pair of IP addresses as a "document" (in the language of LDA) and
then words are constructed from features of the to/from traffic (e.g.,
destination ports, etc.). See [5] for a decent example of how you might
think about this
kind of "statistical clustering" using the LDA Probabilistic Graphical
Model approach. So there is hope, but there is a lot of work left to be
done on both the theory and practice sides of this (which as I mentioned
earlier are, at least currently, essentially the same thing).
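As a sketch of that flows-as-documents construction (all flow records and field names below are invented toy examples; real input would come from NetFlow/IPFIX or packet captures), each to/from IP pair becomes a "document" whose "words" are features of its traffic:

```python
from collections import Counter, defaultdict

# Toy flow records (all values invented): (src_ip, dst_ip, dst_port, proto).
flows = [
    ("10.0.0.1", "10.0.0.9", 80,  "tcp"),
    ("10.0.0.1", "10.0.0.9", 443, "tcp"),
    ("10.0.0.1", "10.0.0.9", 80,  "tcp"),
    ("10.0.0.2", "10.0.0.9", 53,  "udp"),
    ("10.0.0.2", "10.0.0.9", 53,  "udp"),
]

# Each to/from IP pair is a "document"; features of its traffic are "words".
docs = defaultdict(Counter)
for src, dst, port, proto in flows:
    docs[(src, dst)]["port:%d" % port] += 1
    docs[(src, dst)]["proto:%s" % proto] += 1

for pair, words in sorted(docs.items()):
    print(pair, dict(words))
```

The per-document word counts this produces are exactly the bag-of-words input LDA expects; the inferred "topics" then play the role of traffic-behavior clusters.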

>
>
> Another important point I would like to discuss is that you mentioned "our
> goal is to provide accurate, repeatable, and explainable results." Ideally,
> we do want this. It is our traditional way for logic-based programming.
>
What is "logic-based programming"? Do you mean logic programming, e.g.,
Prolog? If so, I understand logic programming well (I did a lot of work on
it back in the late 80s/early 90s; see
http://dl.acm.org/citation.cfm?id=87991). But even so, I don't understand
the connection, so can you be more explicit?


> However, I am not sure that is a doable target with machine learning. Many
> machine learning algorithms are based on statistical theory, which may
> not be that accurate.
>
Sorry, I don't understand this. Things like Bayesian inference (e.g., LDA)
are inherently about quantifying uncertainty. In any event, can you be more
specific? A learning algorithm can "accurately" approximate something like
a posterior distribution; it's just that the output is exactly that, a
posterior distribution. So there is a difference between "accuracy" (which
isn't really defined here) and the stochastic nature of the results we get
from ML.
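The simplest possible illustration of "the answer is a distribution, not a point" is the conjugate Beta-Bernoulli model (a toy example, nothing network-specific; the observation sequence is made up):

```python
# Conjugate Beta-Bernoulli update: observe binary outcomes, and the result
# is a full posterior over the unknown rate theta, not a single number.
alpha, beta = 1.0, 1.0                     # uniform Beta(1, 1) prior
observations = [1, 0, 1, 1, 0, 1, 1, 1]   # toy data

for x in observations:
    alpha += x          # count of successes
    beta += 1 - x       # count of failures

# Posterior is Beta(alpha, beta); its mean and variance summarize both the
# estimate and how uncertain we still are about it.
post_mean = alpha / (alpha + beta)
post_var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
print("posterior: Beta(%g, %g), mean %g" % (alpha, beta, post_mean))
```

More data shrinks `post_var`, which is the sense in which the inference "quantifies uncertainty" rather than being inaccurate.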

OTOH, if you mean accuracy of our estimated distributions, then there are
obviously ways this can go wrong. For example, even an algorithm as simple
as k-means can go wrong, since the optimization objective (distortion
function) is non-convex and therefore sensitive to initial conditions (in
particular, where you start; if you are unlucky you can get stuck in a
local minimum). See e.g., [8]. BTW, k-means was called out in one of the
decks below, but as I mentioned, there was not much discussion there of the
important topics we are discussing here. There is also approximate
inference (for example, MCMC or variational approaches), but these
approaches also have provable quantitative error bounds
(http://www.1-4-5.net/~dmm/ml/ps.pdf talks a bit about how variational
approaches, essentially trading inference for optimization where the
inference (integral) is intractable, find a tight lower bound using
Jensen's inequality).
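A tiny experiment makes the initialization point concrete (pure NumPy; the 1-D toy data is chosen so the distortion surface has more than one basin):

```python
import numpy as np

def kmeans(X, centers, iters=20):
    # Plain Lloyd's algorithm on 1-D data; returns centers and distortion.
    centers = centers.astype(float).copy()
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.abs(X[:, None] - centers[None, :])
        assign = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(len(centers)):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean()
    d = np.abs(X[:, None] - centers[None, :])
    distortion = (d.min(axis=1) ** 2).sum()
    return centers, distortion

# Three well-separated 1-D clusters.
X = np.array([0.0, 0.1, 0.2, 6.0, 6.1, 6.2, 10.0, 10.1, 10.2])

# A good initialization (one center per true cluster) ...
_, good = kmeans(X, np.array([0.1, 6.1, 10.1]))
# ... versus an unlucky one (two centers inside the first cluster).
_, bad = kmeans(X, np.array([0.0, 0.2, 10.0]))

print("distortion, good init:", good)   # near zero
print("distortion, bad init: ", bad)    # stuck in a poor local minimum
```

Both runs converge, and both are perfectly repeatable; only the unlucky one converges to a much worse local minimum of the (non-convex) distortion objective.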

> Some machine learning algorithms are black boxes or have very
> complicated internal computing processes. So, the results of these
> algorithms may be hard to explain.
>
This is definitely true to some extent today with DNNs, but that is
changing rapidly. Results can be repeatable, but explaining exactly what
units (artificial neurons) far from the input are doing can be difficult.
This too is an area of active research and development (see [6] from
last year's ICML, or the work I pointed to on GANs in my last note, as
progress in this area). Other approaches such as probabilistic graphical
models or signal processing approaches are perhaps more easily explained,
but I don't think this is the level of explanation that is going to be
useful to people deploying and operating networks, so we'll have to find
different ways of explaining what things like DNNs do in a way that is
consumable by operators, etc.

All this by way of saying that we are gaining a better understanding of how
DNNs work every day (literally), and as a result I'm optimistic that we can
build technology that will be accurate (again, with stochastic results),
repeatable, and explainable.

I'll just note here that if you have a model with literally billions of
parameters being trained by SGD, it's not that surprising that humans have
a hard time understanding what the computation is. As a concrete example,
DNNs can solve the so-called "selectivity-invariance" problem [7], but we
humans don't really know how they do it.


> Also, the process of machine learning may be hard to control or
> intervene in.
>

Again, not sure what you mean here. Please give a bit more detail.


> Repeatable, yes, we have to have this feature for wide adoption.
> However, repeatability may have to come with some level of tolerance for
> inaccuracy.
>

All of this is inherently probabilistic. A DNN, for example, can accurately
output the posterior distribution over some hidden variables of interest.
LDA finds the latent "topics" and word distributions, both of which are
multinomials, so again we're essentially getting a posterior over the
hidden variables. Even k-means is just the EM (Expectation Maximization)
algorithm applied to a particular mixture-of-Gaussians model. The same can
be said of almost every learning algorithm you might mention. So
"inaccurate" isn't the correct word here (unless you mean whatever you
trained didn't predict whatever you wanted to predict, in which case
something else is wrong). The DNN gives you a stochastic result, which may
be different from what many network people are used to. But it's not
inaccurate, it's just stochastic.
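To make "the output is a posterior" concrete: a classifier's softmax output is a proper probability distribution over classes. A minimal sketch (the logits are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: logits -> posterior over classes.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Made-up logits from some classifier's final layer for one input.
logits = np.array([2.0, 1.0, 0.1])
posterior = softmax(logits)

print("posterior:", posterior)
print("sums to one:", posterior.sum())
```

The "answer" is the whole distribution; thresholding it to a single class label is a separate (lossy) decision step, which is exactly the distinction between stochastic output and inaccuracy.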

Thx,

--dmm



[0] http://people.csail.mit.edu/mcollins/papers/CVPR_2008.pdf
[1] https://arxiv.org/abs/1206.5538
[2] http://deliprao.com/archives/153
[3] http://www.1-4-5.net/~dmm/ml/lda_intro.pdf
[4] http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
[5] http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5949386
[6] http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf
[7] https://cbmm.mit.edu/sites/default/files/publications/Anselmi_Invariance_CBMM_memo_29.pdf
[8] http://cs229.stanford.edu/notes/cs229-notes7a.pdf

> Best regards,
>
>
>
> Sheng
>
>
>
> ------------------------------
>
> *From:* David Meyer [dmm@1-4-5.net]
> *Sent:* 28 June 2016 23:30
> *To:* Sheng Jiang
> *Cc:* nmlrg@irtf.org
> *Subject:* Re: [Nmlrg] links to each slide of presentations//RE: slides
> of NMLRG #3, June 27th, Athens
>
> Sheng,
>
> Thanks for the pointers. One thing I notice is that we don't have much on
> what the characteristics of network data might be, and as such, what kinds
> of existing learning algorithms might be suitable and where additional
> development might be required. For example, is the collected data IID? Is
> it time series? Is the underlying distribution stationary? If not, how does
> that constrain algorithms we might use or develop? For example, how does a
> given algorithm deal with concept drift/internal covariate shift (in the
> case of DNNs; see e.g., batch normalization [0])? There are many other such
> questions, such as is the data categorical (e.g., ports, IP addresses) or
> is it continuous/discrete (e.g., counters). And if the data in question is
> categorical, what is the cardinality of the categories (this will inform
> how such data can be encoded); in the case of IP addresses we can't really
> one-hot encode addresses because their cardinality is too large (2^32 or
> 2^128); this has implications for how we build classifiers (in particular,
> for softmax layers in DNNs of various kinds).
>
> Related to the above is the question of features. What are good features
> for networking? Where do they come from? Are they domain specific? Can we
> learn features the way a DNNs does in the network space? Can we use
> autoencoders to discover such features? Or can we use GANs to train DNNs
> for network classification tasks in an unsupervised manner? Are there
> other, non-ad-hoc (well founded) methods we can use, or is every use case a
> one-off (one would hope not).
>
> We can carry the same kinds of analyses over to the algorithms applied.
> For example, while something like k-means is an effective way to get a
> feeling for how continuous/discrete data hangs together, if our data is
> categorical, statistical clustering approaches such as LDA might provide a
> more well-founded approach (of course, as with most Bayesian techniques,
> the question of approximate inference arises, since in most interesting
> cases the integral that we need to solve, namely the marginal probability
> of the data, isn't tractable, so we need to resort to MCMC or more likely
> variational inference). And what about the use of SGD/batch normalization
> etc. with DNNs, and perhaps more importantly, can we use network data to
> train DNN policy networks for reinforcement learning like we saw in deep
> Q-learning and AlphaGo?
>
> These comments are all by way of saying that we don't have a solid
> theoretical understanding (yet) of how techniques that have been so
> successful in other domains (e.g., DNNs for perceptual tasks) generalize to
> networking use cases. We will need this understanding if our goal is to
> provide accurate, repeatable, and explainable results.
>
> In order to accomplish all of this we need, as I have been saying, not
> only a good understanding of how these algorithms work but also
> standardized data sets and associated benchmarks so we can tell if we are
> making progress (or even if our techniques work). Analogies here include
> MNIST and ImageNet and their associated benchmarks, among others. As
> mentioned, standardized data sets are key to making progress in the ML for
> networking space (otherwise how do you know your technique works and/or
> improves on other techniques?). One might assume that these data sets
> would need to be labeled (as supervised learning is where most of the
> progress is being made these days), but not necessarily; Generative
> Adversarial Networks (GANs) have emerged as a new way to train DNNs in an
> unsupervised manner (this is moving very rapidly; see e.g.,
> https://openai.com/blog/generative-models/).
>
> The summary here is that the "distance" between theory and practice in ML
> is effectively zero right now due to the incredible rate of progress in the
> field; this means we need  to understand both sides of the theory/practice
> coin in order to be effective. None of the slide decks provide much
> background on what the proposed algorithms are, how they work, or why they
> should be expected to work on network data.
>
> Finally, if you are interested in LDA or other algorithms there are a few
> short explanatory pieces I have written for my team on
> http://www.1-4-5.net/~dmm/ml (works in progress).
>
> Thanks,
>
> Dave
>
> [0] https://arxiv.org/pdf/1502.03167.pdf
>
> On Tue, Jun 28, 2016 at 7:03 AM, Sheng Jiang <jiangsheng@huawei.com>
> wrote:
>
>> Oops... The proceedings page for interim meetings seems not to be as
>> intelligent as the proceedings pages of IETF meetings. Our proceedings
>> page does not automatically show the slides. I have sent an email to the
>> IETF secretariat to ask them to fix it. Meanwhile, in this email, here
>> are the links for each presentation:
>>
>> Chair Slides
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-0.pdf
>>
>> Introduction to Network Machine Learning & NMLRG
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-1.pdf
>>
>> Data Collection and Analysis At High Security Lab
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-2.pdf
>>
>> Use Cases of Applying Machine Learning Mechanism with Network Traffic
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-3.pdf
>>
>> Mobile network state characterization and prediction
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-4.pdf
>>
>> Learning how to route
>>
>> https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-5.pdf
>>
>> Regards,
>>
>> Sheng
>> ________________________________________
>> From: nmlrg [nmlrg-bounces@irtf.org] on behalf of Sheng Jiang [
>> jiangsheng@huawei.com]
>> Sent: 28 June 2016 21:03
>> To: nmlrg@irtf.org
>> Subject: [Nmlrg] slides of NMLRG #3, June 27th, Athens
>>
>> Hi, nmlrg,
>>
>> All slides that were presented at our NMLRG #3 meeting, June 27th,
>> 2016, Athens, Greece, co-located with EuCNC 2016, have been uploaded.
>> They can be accessed through the link below:
>>
>> https://www.ietf.org/proceedings/interim/2016/06/27/nmlrg/proceedings.html
>>
>> Best regards,
>>
>> Sheng
>> _______________________________________________
>> nmlrg mailing list
>> nmlrg@irtf.org
>> https://www.irtf.org/mailman/listinfo/nmlrg
>>
>
>