Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens

"Liubing (Leo)" <leo.liubing@huawei.com> Thu, 30 June 2016 19:16 UTC

From: "Liubing (Leo)" <leo.liubing@huawei.com>
To: David Meyer <dmm@1-4-5.net>
Thread-Topic: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens
Thread-Index: AQHR0UXZfny1k5v9oUOv37+HaylYt5/+e2IAgAGH4UyAAAvZgIACTvMQ
Date: Thu, 30 Jun 2016 19:13:09 +0000
Message-ID: <8AE0F17B87264D4CAC7DE0AA6C406F45C2DB62D6@nkgeml514-mbx.china.huawei.com>
References: <5D36713D8A4E7348A7E10DF7437A4B927CA893E9@NKGEML515-MBX.china.huawei.com> <CAHiKxWhfdukjRVnSKnbLwMwamiqZYoCD350BoNqO4njqkQoU1g@mail.gmail.com> <5D36713D8A4E7348A7E10DF7437A4B927CA897D4@NKGEML515-MBX.china.huawei.com> <CAHiKxWhMPjKY24zucg-UND0ARWHbjnOyYv=Tia-yk-Vq6Ta_SA@mail.gmail.com>
In-Reply-To: <CAHiKxWhMPjKY24zucg-UND0ARWHbjnOyYv=Tia-yk-Vq6Ta_SA@mail.gmail.com>
Accept-Language: en-US, zh-CN
Content-Language: zh-CN
Archived-At: <https://mailarchive.ietf.org/arch/msg/nmlrg/mymYAJ3Vyg7IgEfdx7gYGzy-_gs>
Cc: "nmlrg@irtf.org" <nmlrg@irtf.org>
Subject: Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens
X-BeenThere: nmlrg@irtf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: Network Machine Learning Research Group <nmlrg.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nmlrg/>
List-Post: <mailto:nmlrg@irtf.org>
List-Help: <mailto:nmlrg-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Jun 2016 19:16:40 -0000

Hi David,

Pardon me for jumping into the thread. A couple of comments inline.

From: nmlrg [mailto:nmlrg-bounces@irtf.org] On Behalf Of David Meyer
Sent: 29 June 2016 18:36
To: Sheng Jiang
Cc: nmlrg@irtf.org
Subject: Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens



On Wed, Jun 29, 2016 at 1:10 AM, Sheng Jiang <jiangsheng@huawei.com> wrote:

Hi, David,



Thanks so much for your email. I share the same concern you have. Actually, this Monday at our NMLRG #3 in Athens, we did discuss this issue of a potential standardized dataset. And we are planning to have an open discussion on the requirements and validation qualifications of a potential common dataset at the coming NMLRG #4, Berlin, IETF 96.

Looking forward to it.




However, up to now, I have several concerns. Firstly, before we can discuss such a dataset, we need to narrow our target scenarios down to a few very specific use cases, given that there are so many different scenarios in the network area; they differ greatly from each other and require different datasets.

I'm hoping that not every network is a one-off. Rather, what we seek is to learn the fundamental properties of networks that can be the basis for reasoning about what networks do. For example, we should be able to learn something from the work done on transfer learning [0] and representation learning [1]. In the image-recognition space there are even startups that provide pretrained AlexNet, GoogLeNet, etc. trained on ImageNet (and others) that you can further train/customize for your particular application. In any event, if every network requires something totally different, then we haven't yet understood what is fundamental to the structure of our networks.
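The pretrain-then-customize idea can be sketched in a few lines: keep a frozen "pretrained" feature extractor and retrain only a small task-specific head on the new data. Everything below (weights, data, labels) is invented for illustration; a real setup would load pretrained AlexNet/GoogLeNet weights rather than a toy matrix.

```python
import math

# "Pretrained" feature extractor: frozen weights, standing in for the
# lower layers of a model trained on some large source task.
FROZEN_W = [[0.7, -0.2], [0.1, 0.9]]

def extract_features(x):
    # Fixed transform; in real transfer learning these weights come
    # from the pretrained model and are not updated.
    return [sum(w * xi for w, xi in zip(row, x)) for row in FROZEN_W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny hypothetical target-task dataset: label 1 when the first
# coordinate dominates.
data = [([2.0, 0.1], 1), ([1.5, 0.2], 1), ([0.1, 2.0], 0), ([0.2, 1.8], 0)]

# Train only the small "head" (logistic regression) on frozen features.
w_head, b_head, lr = [0.0, 0.0], 0.0, 0.5

def loss():
    total = 0.0
    for x, y in data:
        f = extract_features(x)
        p = sigmoid(sum(wi * fi for wi, fi in zip(w_head, f)) + b_head)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

before = loss()
for _ in range(200):
    for x, y in data:
        f = extract_features(x)
        p = sigmoid(sum(wi * fi for wi, fi in zip(w_head, f)) + b_head)
        g = p - y  # gradient of cross-entropy w.r.t. the logit
        for i in range(2):
            w_head[i] -= lr * g * f[i]
        b_head -= lr * g
after = loss()

print(before > after)  # the head adapts while the extractor stays fixed
```

The design point is the split itself: the expensive, general representation is reused, and only the cheap task-specific part is fit to the new domain.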

[Bing] From a technical perspective, I think you're raising an inspiring idea, and probably the future direction of this research field. OTOH, I also feel this might be too far away; where should we start?

Secondly, as we have talked with operators, the real data is an important asset for them, so it would be almost impossible for them to share it. That leaves two possibilities for how to get a dataset: a) generating a dataset using simulation;

This likely won't work. For example, if you do this and then try to train a DNN on that dataset, the NN will learn how the simulator works. That is unlikely to be the outcome we'd like (we already know how the simulator works).

[Bing] I think ML can do two things: 1) mine some unknown knowledge; 2) do something that is already known but very hard to solve with programmed rules.
I think the simulation data could fit into 2).

b) try to get some partial dataset from operators, with their sensitive data removed. Both ways are problematic, with the risk that a lot of effort may be wasted in the wrong areas.

Agree that this is problematic.

Thirdly, the availability of data may vary greatly across different network environments. It depends on the measurement/cognitive functions that have been deployed or implemented.

What is a "cognitive function"?

In any event, you have outlined just a few of the problems with creating "standardized" datasets for our use. I have quoted "standardized" because I'm beginning to think it's the wrong word (it invokes "standards", e.g., IETF, <X>..., and that is not what is meant here). I'm not sure what the right term is, but "standardized" seems wrong.




Although I have the above concerns, I think we should: a) do research on certain datasets. It would be very useful to prove that machine learning can perform better or provide more flexibility and adaptability;

Not sure what you mean here. What do you mean by "certain dataset"?

b) continue our discussion of the various network use cases that could benefit from applying machine learning, in order to serve wider network communities.

Use cases are good; however, machine learning isn't something we can just apply to a use case because we've defined the use case. There are at least three parts that need to be considered as part of a solution: (i) the use case (as you mention), (ii) the algorithm(s) that are going to provide the regression/classification desired in (i), and (iii) the datasets.

(ii) requires an understanding of the algorithms that can be applied to produce the inferences (regressions, ...) we are interested in. Which algorithms we can use depends on the data we can get (iii).

That said, one of the things we've learned over the past decade or so is that learning itself is a very general process. We see this in how powerful something as simple as SGD/backprop is (see e.g., [2]). We can also use this to our advantage and re-purpose algorithms originally developed for other tasks. A simple example of this is LDA (see [3,4]). Here the original idea was to model topics in a corpus of documents, where the topics themselves (along with the per-document topic proportions) were the latent/hidden variables. One way to re-purpose LDA for applications like ours (essentially classification/anomaly-detection tasks) is to view each to/from pair of IP addresses as a "document" (in the language of LDA) and to construct words from features of the to/from traffic (e.g., destination ports, etc.). See [5] for a decent example of how you might think about this kind of "statistical clustering" using the LDA Probabilistic Graphical Model approach. So there is hope, but there is a lot of work left to be done on both the theory and practice sides of this (which, as I mentioned earlier, are, at least currently, essentially the same thing).
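The document-construction step described above can be sketched concretely. The flow records, ports, and protocols below are hypothetical; a real pipeline would hand the resulting per-"document" word counts to an LDA implementation to infer the latent traffic "topics".

```python
from collections import Counter, defaultdict

# Hypothetical flow records: (src_ip, dst_ip, dst_port, protocol)
flows = [
    ("10.0.0.1", "10.0.0.9", 443, "tcp"),
    ("10.0.0.1", "10.0.0.9", 443, "tcp"),
    ("10.0.0.1", "10.0.0.9", 53,  "udp"),
    ("10.0.0.2", "10.0.0.9", 22,  "tcp"),
    ("10.0.0.2", "10.0.0.9", 22,  "tcp"),
]

# Each to/from pair of addresses is a "document"; each flow contributes
# "words" built from its features (here: destination port and protocol).
documents = defaultdict(Counter)
for src, dst, port, proto in flows:
    doc_id = (src, dst)
    documents[doc_id][f"port:{port}"] += 1
    documents[doc_id][f"proto:{proto}"] += 1

# These bag-of-words counts are exactly what LDA consumes.
for doc_id, words in sorted(documents.items()):
    print(doc_id, dict(words))
```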

[Bing] I agree that the algorithms are basically general across different applications. However, besides algorithms, another important part is the "learning path": e.g., selecting features, identifying the target problem as classification/regression/clustering etc., and possibly multiple learning stages (e.g., clustering first, then classification). This kind of "learning path" is, in my view, case-by-case, and seems very difficult to generalize with current technologies.
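One way to picture such a two-stage learning path (clustering first, then classification) is the sketch below. The 1-D traffic feature, the labels, and the crude two-centroid clustering are all made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical 1-D traffic feature (e.g. flows/minute) with labels.
samples = [(2, "idle"), (3, "idle"), (4, "idle"),
           (40, "busy"), (42, "busy"), (45, "attack")]

# Stage 1 (unsupervised): a crude two-centroid clustering.
values = [v for v, _ in samples]
c_low, c_high = min(values), max(values)
for _ in range(10):  # Lloyd-style updates
    low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
    high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
    c_low, c_high = mean(low), mean(high)

def cluster(v):
    return 0 if abs(v - c_low) <= abs(v - c_high) else 1

# Stage 2 (supervised): a per-cluster majority-vote "classifier",
# trained using the stage-1 cluster id as its only feature.
majority = {}
for cid in (0, 1):
    labels = [lab for v, lab in samples if cluster(v) == cid]
    majority[cid] = Counter(labels).most_common(1)[0][0]

print(majority[cluster(5)])   # small value falls in the "idle" cluster
print(majority[cluster(41)])  # large value falls in the "busy" cluster
```

The point of the sketch is Bing's: the choice to cluster first and then classify is a pipeline decision that sits outside any single algorithm.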


Best regards,
Bing



Another important point I would like to discuss is that you mentioned "our goal is to provide accurate, repeatable, and explainable results." Ideally, we do want this. It is our traditional expectation from logic-based programming.

What is "logic-based programming"? Do you mean logic programming, e.g., Prolog? If so, I understand logic programming (I did a lot of work on it back in the late 80s/early 90s; see http://dl.acm.org/citation.cfm?id=87991). But even so, I don't understand the connection, so can you be more explicit?


However, I am not sure that is a doable target with machine learning. Many machine learning algorithms are based on statistical theory, which may not be that accurate.

Sorry, I don't understand this. Things like Bayesian inference (e.g., LDA) are inherently about quantifying uncertainty. In any event, can you be more specific? A learning algorithm can "accurately" approximate something like a posterior distribution; it's just that the output is exactly that, a posterior distribution. So there is a difference between "accuracy" (which isn't really defined here) and the stochastic nature of the results we get from ML.

OTOH, if you mean the accuracy of our estimated distributions, then there are obviously ways this can go wrong. For example, even an algorithm as simple as k-means can go wrong, since the optimization objective (the distortion function) is non-convex, so it is sensitive to initial conditions (in particular, where you start; if you are unlucky you can get stuck in a local minimum due to the non-convex nature of the objective). See e.g. [8]. BTW, k-means was called out in one of the decks below, but as I mentioned, there wasn't much discussion there of the important topics we are discussing here. There is also approximate inference (for example, MCMC or variational approaches), but these approaches have provable quantitative error bounds (http://www.1-4-5.net/~dmm/ml/ps.pdf talks a bit about how variational approaches, essentially trading inference for optimization where the inference (integral) is intractable, find a tight lower bound using Jensen's inequality).
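The k-means sensitivity to initialization is easy to demonstrate deterministically: the same Lloyd iteration, started from two different centroid sets on contrived 1-D data, converges to local minima with different distortions.

```python
def lloyd(points, centers, iters=50):
    """Plain Lloyd's algorithm in 1-D; returns (centers, distortion)."""
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep a center in place if its cluster ever empties out.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    distortion = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, distortion

data = [-4, -3, 3, 4, 5]

# Same data, same algorithm, two different initializations:
_, d_bad = lloyd(data, [-4, -3, 4])      # two centers wasted on one cluster
_, d_good = lloyd(data, [-3.5, 3.5, 5])  # well-spread starting centers

print(d_bad, d_good)  # the bad start is stuck in a worse local minimum
```

This is why practical k-means implementations run multiple random restarts (or use careful seeding) and keep the lowest-distortion result.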

Some machine learning algorithms are black boxes or have very complicated internal computing processes, so the results of these algorithms may be hard to explain.

This is definitely true to some extent today with DNNs, but it is changing rapidly. Results can be repeatable, but explaining exactly what units (artificial neurons) far from the input are doing can be difficult. This too, though, is an area of active research and development (see [6] from last year's ICML, or the work I pointed to on GANs in my last note, as progress in this area). Other approaches, such as probabilistic graphical models or signal-processing approaches, are perhaps more easily explained, but I don't think this is the level of explanation that is going to be useful to people deploying and operating networks, so we'll have to find different ways of explaining what things like DNNs do in a way that is consumable by operators, etc.

All this by way of saying that we are gaining a better understanding of how DNNs work every day (literally), and as a result I'm optimistic that we can build technology that will be accurate (again, with stochastic results), repeatable, and explainable.

I'll just note here that if you have a model with literally billions of parameters being trained by SGD, it's not that surprising that humans have a hard time understanding what the computation is. As a concrete example, DNNs can solve the so-called "selectivity-invariance" problem [7], but we humans don't really know how we do it ourselves, so this is another example.


Also, the process of machine learning may be hard to control or intervene in.

Again, not sure what you mean here. Please give a bit more detail.


Repeatable: yes, we have to have this feature for wide adoption. However, repeatability may have to come with some level of tolerance for inaccuracy.

All of this is inherently probabilistic. A DNN, for example, can accurately output the posterior distribution over some hidden variables of interest. LDA finds the latent "topics" and word distributions, both of which are multinomials, so again we're essentially getting a posterior over the hidden variables. Even k-means is just the EM (Expectation Maximization) algorithm applied to a particular Naive Bayes model. The same can be said of almost every learning algorithm you might mention. So "inaccurate" isn't the correct word here (unless you mean that whatever you trained didn't predict what you wanted it to predict, in which case something else is wrong). The DNN gives you a stochastic result, which may be different from what many network people are used to. But it's not inaccurate, it's just stochastic.
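The "stochastic, not inaccurate" point shows up in the final layer of any classifier: a softmax turns logits into a posterior over classes, i.e., a full distribution rather than a single hard answer. The logits and class names below are hypothetical.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from a traffic classifier's final layer, for
# classes (normal, scan, exfiltration).
posterior = softmax([3.2, 0.4, -1.1])

# The output is a posterior over classes: a distribution, not a hard
# label. The model's uncertainty is right there in the numbers.
print([round(p, 3) for p in posterior])
print(sum(posterior))  # sums to 1 (up to floating-point error)
```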

Thx,

--dmm



[0] http://people.csail.mit.edu/mcollins/papers/CVPR_2008.pdf
[1]  https://arxiv.org/abs/1206.5538
[2] http://deliprao.com/archives/153
[3] http://www.1-4-5.net/~dmm/ml/lda_intro.pdf
[4] http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
[5] http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5949386
[6] http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf
[7] https://cbmm.mit.edu/sites/default/files/publications/Anselmi_Invariance_CBMM_memo_29.pdf
[8] http://cs229.stanford.edu/notes/cs229-notes7a.pdf

Best regards,



Sheng



________________________________
From: David Meyer [dmm@1-4-5.net]
Sent: 28 June 2016 23:30
To: Sheng Jiang
Cc: nmlrg@irtf.org
Subject: Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens
Sheng,

Thanks for the pointers. One thing I notice is that we don't have much on what the characteristics of network data might be, and as such, what kinds of existing learning algorithms might be suitable and where additional development might be required. For example, is the collected data IID? Is it time series? Is the underlying distribution stationary? If not, how does that constrain algorithms we might use or develop? For example, how does a given algorithm deal with concept drift/internal covariate shift (in the case of DNNs; see e.g., batch normalization [0]). There are many other such questions, such as is the data categorical (e.g., ports, IP addresses) or is it continuous/discrete (e.g., counters). And if the data in question is categorical, what is the cardinality of the categories (this will inform how such data can be encoded); in the case of IP addresses we can't really one-hot encode addresses because their cardinality is too large (2^32 or 2^128); this has implications for how we build classifiers (in particular, for softmax layers in DNNs of various kinds).
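On the cardinality point, one common workaround is the hashing trick: map each address into a fixed number of buckets instead of a 2^32-wide (or 2^128-wide) one-hot vector. The bucket count and addresses below are arbitrary; this is a sketch of the encoding, not a recommendation of a particular hash.

```python
import hashlib

N_BUCKETS = 1024  # fixed feature width, instead of 2**32 one-hot columns

def hash_bucket(ip):
    # Stable hash -> bucket index; collisions are the price paid for
    # the fixed dimensionality.
    digest = hashlib.sha1(ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_BUCKETS

def encode(ip):
    # Sparse "one-hot over buckets" representation of an address.
    vec = [0] * N_BUCKETS
    vec[hash_bucket(ip)] = 1
    return vec

v1 = encode("192.0.2.1")
v2 = encode("192.0.2.1")

print(v1 == v2)          # deterministic: same address, same encoding
print(len(v1), sum(v1))  # fixed width, exactly one active bucket
```

An embedding layer learned jointly with the model is the other standard answer to the same cardinality problem.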

Related to the above is the question of features. What are good features for networking? Where do they come from? Are they domain-specific? Can we learn features the way a DNN does, but in the network space? Can we use autoencoders to discover such features? Or can we use GANs to train DNNs for network classification tasks in an unsupervised manner? Are there other, non-ad-hoc (well-founded) methods we can use, or is every use case a one-off (one would hope not)?

We can carry the same kind of analysis over to the algorithms applied. For example, while something like k-means is an effective way to get a feeling for how continuous/discrete data hangs together, if our data is categorical, statistical clustering approaches such as LDA might provide a more well-founded approach (of course, as with most Bayesian techniques, the question of approximate inference arises, since in most interesting cases the integral we need to solve, namely the marginal probability of the data, isn't tractable, so we need to resort to MCMC or, more likely, variational inference). And what about the use of SGD/batch normalization etc. with DNNs, and, perhaps more importantly, can we use network data to train DNN policy networks for reinforcement learning, as we saw in deep Q-learning and AlphaGo?

These comments are all by way of saying that we don't have a solid theoretical understanding (yet) of how techniques that have been so successful in other domains (e.g., DNNs for perceptual tasks) generalize to networking use cases. We will need this understanding if our goal is to provide accurate, repeatable, and explainable results.

In order to accomplish all of this we need, as I have been saying, not only a good understanding of how these algorithms work but also standardized datasets and associated benchmarks, so we can tell whether we are making progress (or even whether our techniques work at all). Analogies here include MNIST and ImageNet and their associated benchmarks, among others. As mentioned, standardized datasets are key to making progress in the ML-for-networking space (otherwise how do you know your technique works and/or improves on another technique?). One might assume that these datasets would need to be labeled (as supervised learning is where most of the progress is being made these days), but not necessarily: Generative Adversarial Networks (GANs) have emerged as a new way to train DNNs in an unsupervised manner (this is moving very rapidly; see e.g., https://openai.com/blog/generative-models/).

The summary here is that the "distance" between theory and practice in ML is effectively zero right now, due to the incredible rate of progress in the field; this means we need to understand both sides of the theory/practice coin in order to be effective. None of the slide decks provide much background on what the proposed algorithms are, how they work, or why they should be expected to work on network data.

Finally, if you are interested in LDA or other algorithms there are a few short explanatory pieces I have written for my team on http://www.1-4-5.net/~dmm/ml (works in progress).

Thanks,

Dave

[0] https://arxiv.org/pdf/1502.03167.pdf

On Tue, Jun 28, 2016 at 7:03 AM, Sheng Jiang <jiangsheng@huawei.com> wrote:
Oops... The proceedings page for interim meetings seems not to be as intelligent as the proceedings pages for IETF meetings: our proceedings page does not automatically show the slides. I have sent an email to the IETF secretariat asking them to fix it. Meanwhile, here are the links to each presentation:

Chair Slides
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-0.pdf

Introduction to Network Machine Learning & NMLRG
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-1.pdf

Data Collection and Analysis At High Security Lab
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-2.pdf

Use Cases of Applying Machine Learning Mechanism with Network Traffic
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-3.pdf

Mobile network state characterization and prediction
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-4.pdf

Learning how to route
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-5.pdf

Regards,

Sheng
________________________________________
From: nmlrg [nmlrg-bounces@irtf.org] on behalf of Sheng Jiang [jiangsheng@huawei.com]
Sent: 28 June 2016 21:03
To: nmlrg@irtf.org
Subject: [Nmlrg] slides of NMLRG #3, June 27th, Athens

Hi, nmlrg,

All the slides presented at our NMLRG #3 meeting, June 27th, 2016, Athens, Greece (co-located with EuCNC 2016) have been uploaded. They can be accessed through the link below:

https://www.ietf.org/proceedings/interim/2016/06/27/nmlrg/proceedings.html

Best regards,

Sheng
_______________________________________________
nmlrg mailing list
nmlrg@irtf.org
https://www.irtf.org/mailman/listinfo/nmlrg