Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens

Sheng Jiang <jiangsheng@huawei.com> Thu, 30 June 2016 20:20 UTC

From: Sheng Jiang <jiangsheng@huawei.com>
To: David Meyer <dmm@1-4-5.net>
Date: Thu, 30 Jun 2016 20:16:42 +0000
Cc: "nmlrg@irtf.org" <nmlrg@irtf.org>
Subject: Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens

Hi, David,



Please see my replies inline, marked with <Sheng/>. Sorry for the slow response; I have been traveling.



However, up to now I have several concerns. Firstly, before we can discuss such a dataset, we need to narrow our target scenarios down to a few very specific use cases, given that there are so many different scenarios in the network area, each very different from the others and requiring a different dataset.

I'm hoping that every network is not a one-off. Rather, what we seek is to learn the fundamental properties of networks that can be the basis for reasoning about what networks do. For example, we should be able to learn something from the work done on transfer learning [0] and representation learning [1]. In the image recognition space there are even startups that provide pretrained AlexNet, GoogLeNet, etc. trained on ImageNet (and others) that you can further train/customize for your particular application. In any event, if every network requires something totally different, then we haven't yet understood what is fundamental to the structure of our networks.
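
[Editor's note: to make the "pretrained, then customize" idea concrete, here is a minimal transfer-learning sketch, assuming Keras is available; InceptionV3 stands in for the GoogLeNet-style ImageNet models mentioned above, and the 5-class head is hypothetical.]

    # A minimal transfer-learning sketch, assuming Keras; InceptionV3
    # stands in for the pretrained ImageNet models mentioned above.
    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    base = InceptionV3(weights="imagenet", include_top=False)
    for layer in base.layers:
        layer.trainable = False          # keep the pretrained features frozen

    x = GlobalAveragePooling2D()(base.output)
    out = Dense(5, activation="softmax")(x)   # 5 hypothetical new classes

    model = Model(base.input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    # model.fit(images, labels)  # fine-tune on your own (hypothetical) data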

<Sheng>If I understand you correctly, you mean we should do some generic learning in order to better understand networks fundamentally, with results that would then help solve specific tasks in networks. That is different from what we have done so far, which is applying ML to solve specific network tasks. I agree this proposed research direction is important, and I am very interested in it, although it looks ambitious and there is no guarantee of success so far.</Sheng>


Secondly, when we talked with operators, they told us that real data is an important asset for them, so it would be almost impossible for them to share it. That leaves two possibilities for obtaining a dataset: a) generating a dataset using simulation;

This likely won't work. For example, if you do this and then try to train a DNN on that dataset, the network will learn how the simulator works. This is unlikely to be the outcome we'd like (we already know how the simulator works).

b) trying to get partial datasets from operators that have had the sensitive data removed. Both ways are problematic, with the risk that a lot of effort may be wasted in the wrong areas.

Agree that this is problematic.

<Sheng>Here, we seem to have reached agreement that we do NOT have a good way to obtain a useful dataset, which is not good at all. Does that mean we have to wait for operators to work on this by themselves? Any suggestions on how we could get quality real datasets?</Sheng>

Thirdly, the availability of data may be very different in different network environments. It depends on the measurement/cognitive functions that have been deployed or implemented.

What is a "cognitive function"?

<Sheng>Here, I use "cognitive function" to mean a device or network function that collects existing status or configuration data, as opposed to active measurement.</Sheng>

In any event, you have outlined just a few of the problems with creating "standardized" data sets for our use. I have quoted "standardized" because I'm beginning to think it's the wrong word (it invokes "standards", e.g., IETF, <X>..., and that is not what is meant here). I'm not sure what the right term is, but "standardized" seems wrong.




Despite the above concerns, I think we should: a) do research on certain datasets; it would be very useful to prove that machine learning can perform better or provide more flexibility and adaptability;

Not sure what you mean here. What do you mean by "certain dataset"?

<Sheng>We did discuss a potential standardized dataset, didn't we?</Sheng>


b) continue our discussion of various network use cases that could benefit from applying machine learning, in order to serve wider network communities.

Use cases are good; however, machine learning isn't something we can just apply to a use case simply because we've defined the use case. There are at least three parts that need to be considered as part of a solution: (i) the use case (as you mention), (ii) the algorithm(s) that are going to provide the regression/classification desired in (i), and (iii) the data sets.

(ii) requires understanding of the algorithms that can be applied to produce the inferences (regressions, ...) we are interested in. Which algorithms we can use depends on the data we can get (iii).

<Sheng>What we have done up to now is, for a specific use case, to design an algorithm or a combination of algorithms over a chosen dataset (which may be dynamic data from the network environment) to complete a specific task.</Sheng>

That said, one of the things we've learned over the past decade or so is that learning itself is a very general process. We see this in how powerful something as simple as SGD/backprop is (see e.g., [2]). We can also use this to our advantage and re-purpose algorithms originally developed for other tasks. A simple example of this is LDA (see [3,4]). Here the original idea was to model topics in a corpus of documents, where the topics themselves (along with the per-document topic proportions) were the latent/hidden variables. One way to re-purpose LDA for applications like ours (essentially classification/anomaly detection tasks) is to view each to/from pair of IP addresses as a "document" (in the language of LDA), with words constructed from features of the to/from traffic (e.g., ports, etc.). See [5] for a decent example of how you might think about this kind of "statistical clustering" using the LDA Probabilistic Graphical Model approach. So there is hope, but there is a lot of work left to be done on both the theory and practice sides of this (which, as I mentioned earlier, are, at least currently, essentially the same thing).
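
[Editor's note: a minimal sketch of this re-purposing, assuming scikit-learn is available; the flow records below are hypothetical. Each src/dst IP pair becomes a "document" whose "words" are traffic features, and LDA's per-pair topic mixtures are then the latent behaviors.]

    # A minimal sketch of LDA-style "statistical clustering" of traffic,
    # assuming scikit-learn; the flow records below are hypothetical.
    from collections import defaultdict
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical flow records: (src, dst, dst_port, protocol)
    flows = [
        ("10.0.0.1", "10.0.0.9", 443, "tcp"),
        ("10.0.0.1", "10.0.0.9", 443, "tcp"),
        ("10.0.0.2", "10.0.0.9", 53,  "udp"),
        ("10.0.0.2", "10.0.0.9", 53,  "udp"),
        ("10.0.0.3", "10.0.0.7", 22,  "tcp"),
    ]

    # Each src/dst pair becomes a "document"; its "words" are features.
    docs = defaultdict(list)
    for src, dst, port, proto in flows:
        docs[(src, dst)].append("port_%d" % port)
        docs[(src, dst)].append("proto_%s" % proto)

    corpus = [" ".join(words) for words in docs.values()]
    counts = CountVectorizer().fit_transform(corpus)

    # Fit a small topic model; each "topic" is a latent traffic behavior.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topic_mix = lda.fit_transform(counts)   # per-pair topic proportions
    print(topic_mix)                        # unusual mixes suggest anomalies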

<Sheng>Agreed that learning and learning algorithms are general. But what we did was only to use them in specific ways to solve specific tasks/use cases. As I agreed earlier, generalized learning on networks could be an important direction, and I am interested in it. Let's see how far we can go. On the other hand, I have to emphasize that general learning must be helpful for solving specific tasks; in other words, specific tasks should become easier to solve given the results of general learning. Otherwise, the general learning is not meaningful.</Sheng>


Another important point I would like to discuss: you mentioned that "our goal is to provide accurate, repeatable, and explainable results." Ideally, we do want this. It is what we traditionally get from logic-based programming.

What is "logic-based programming"?  Do you mean logic programming, e.g., Prolog? If so I understand logic programming (I did a lot of work on logic programming back in the late 80s/early 90s, see http://dl.acm.org/citation.cfm?id=87991). But even so, I don't understand the connection, so can you be more explicit?

<Sheng>Here, "logic-based programming" I referred to all traditional programming that realize human pre-designed logic into computing programs. It is comparing to ML-based programming.</Sheng>


However, I am not sure that is an achievable target with machine learning. Many machine learning algorithms are based on statistical theory, which may not be that accurate.

Sorry, I don't understand this. Things like Bayesian inference (e.g., LDA) are inherently about quantifying uncertainty. In any event, can you be more specific? A learning algorithm can "accurately" approximate something like a posterior distribution; it's just that the output is exactly that, a posterior distribution. So there is a difference between "accuracy" (which isn't really defined here) and the stochastic nature of the results we get from ML.

OTOH, if you mean accuracy of our estimated distributions, then there are obviously ways this can go wrong. For example, even an algorithm as simple as k-means can go wrong, since the optimization objective (distortion function) is non-convex, so it is sensitive to initial conditions (in particular, where you start; if you are unlucky you can get stuck in a local minimum due to the non-convex nature of the objective). See e.g., [8]. BTW, k-means was called out in one of the decks below but, as I mentioned, without much discussion of the important topics we are discussing here. There is also approximate inference (for example, MCMC or variational approaches), but these approaches have provable quantitative error bounds (http://www.1-4-5.net/~dmm/ml/ps.pdf talks a bit about how variational approaches, essentially trading inference for optimization where the inference (integral) is intractable, find a tight lower bound using Jensen's inequality).
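
[Editor's note: the sensitivity to initialization is easy to demonstrate; a minimal sketch, assuming scikit-learn, with synthetic data. Running k-means from several single random starts yields different distortions, i.e., different local minima.]

    # A minimal sketch showing k-means landing in different local minima,
    # assuming scikit-learn; the data here is synthetic.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=5, random_state=0)

    # n_init=1 disables restarts, so each run keeps whatever local
    # minimum its single random initialization converges to.
    for seed in range(5):
        km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
        print(seed, round(km.inertia_, 1))   # distortion varies by seed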

<Sheng>The reason that statistics-based ML algorithms can work is that they choose to focus only on high-probability events and ignore corner cases. But when meaningful corner cases happen, or when we want to target the corner cases, machine learning may not work accurately. Again, "accuracy" itself has not been properly defined, so we may be talking about very different things. On the other hand, I also agree that accuracy is not the important topic here. ML can reach a high level of accuracy, which is sufficient for the majority of cases. But ML may never reach full accuracy, while a traditional program can be 100% certain of the result under given circumstances.</Sheng>


Some machine learning algorithms are black boxes or have very complicated internal computing processes, so their results may be hard to explain.

This is definitely true to some extent today with DNNs, but that is changing rapidly. Results can be repeatable, but explaining exactly what units (artificial neurons) far from the input are doing can be difficult. This too is an area of active research and development (see [6] from last year's ICML, or the work I pointed to on GANs in my last note, as progress in this area). Other approaches such as probabilistic graphical models or signal processing approaches are perhaps more easily explained, but I don't think this is the level of explanation that is going to be useful to people deploying and operating networks, so we'll have to find different ways of explaining what things like DNNs do in a way that is consumable by operators, etc.

<Sheng>I share the same opinion here. Even if we could really explain how DNNs work, in some very complicated way, our customers would not have enough patience to listen, and they would still not trust ML. We, and actually the whole AI/ML community, need to find other ways to explain and prove that things like DNNs are trustworthy. Only then can they be used for more serious or autonomic tasks.</Sheng>

All this by way of saying that we are gaining a better understanding of how DNNs work every day (literally), and as a result I'm optimistic that we can build technology that will be accurate (again, with stochastic results), repeatable, and explainable.

<Sheng>I hope I can share your optimism. Actually, I do believe we will get there one day, but I am not sure how long it will take.</Sheng>

I'll just note here that if you have a model with literally billions of parameters being trained by SGD, it's not that surprising that humans have a hard time understanding what the computation is. As a concrete example, DNNs can solve the so-called "selectivity-invariance" problem [7], but we humans don't really know how to, so this is another example.


Also, the process of machine learning may be hard to control or intervene in.

Again, not sure what you mean here. Please give a bit more detail.

<Sheng>Here, I am actually referring to DNNs too. We don't know how to intervene in a DNN at run time.</Sheng>


Repeatable, yes; we have to have this property for ML to be widely adopted. However, repeatability may have to come with some level of tolerance for inaccuracy.

All of this is inherently probabilistic. A DNN, for example, can accurately output the posterior distribution over some hidden variables of interest. LDA finds the latent "topics" and word distributions, both of which are multinomials, so again we're essentially getting a posterior over the hidden variables. Even k-means is just the (hard-assignment) EM (Expectation Maximization) algorithm applied to a particular Naive Bayes-style mixture model. The same can be said of almost every learning algorithm you might mention. So "inaccurate" isn't the correct word here (unless you mean that whatever you trained didn't predict what you wanted it to predict, in which case something else is wrong). The DNN gives you a stochastic result, which may be different from what many network people are used to. But it's not inaccurate, it's just stochastic.
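
[Editor's note: to illustrate the "posterior over hidden variables" point, a minimal sketch assuming scikit-learn, with synthetic data: it contrasts the hard labels k-means produces with the soft posterior responsibilities a mixture model returns.]

    # A minimal sketch contrasting hard cluster labels with posterior
    # responsibilities, assuming scikit-learn; the data is synthetic.
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

    hard = KMeans(n_clusters=3, random_state=0).fit_predict(X)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    soft = gmm.predict_proba(X)   # posterior p(component | x) per point

    print(hard[:3])   # definite labels, e.g. [2 0 1]
    print(soft[:3])   # probability vectors, e.g. [[0.98 0.01 0.01] ...]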

<Sheng>Inaccurate or stochastic, neither is something many network people can accept. In their eyes, the network is definite; every parameter is definite. In fact, though, if you analyze every network parameter, its value may be definite, but all of them tolerate some level of inaccuracy as long as their design purposes are met. That is the fundamental reason we can use ML to solve network tasks.

Thanks so much. Regards,

Sheng </Sheng>



Thx,

--dmm



[0] http://people.csail.mit.edu/mcollins/papers/CVPR_2008.pdf
[1]  https://arxiv.org/abs/1206.5538
[2] http://deliprao.com/archives/153
[3] http://www.1-4-5.net/~dmm/ml/lda_intro.pdf
[4] http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
[5] http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5949386
[6] http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf
[7] https://cbmm.mit.edu/sites/default/files/publications/Anselmi_Invariance_CBMM_memo_29.pdf
[8] http://cs229.stanford.edu/notes/cs229-notes7a.pdf

Best regards,



Sheng



________________________________

From: David Meyer [dmm@1-4-5.net<mailto:dmm@1-4-5.net>]
Sent: 28 June 2016 23:30
To: Sheng Jiang
Cc: nmlrg@irtf.org<mailto:nmlrg@irtf.org>
Subject: Re: [Nmlrg] links to each slide of presentations//RE: slides of NMLRG #3, June 27th, Athens

Sheng,

Thanks for the pointers. One thing I notice is that we don't have much on what the characteristics of network data might be, and as such, what kinds of existing learning algorithms might be suitable and where additional development might be required. For example, is the collected data IID? Is it time series? Is the underlying distribution stationary? If not, how does that constrain algorithms we might use or develop? For example, how does a given algorithm deal with concept drift/internal covariate shift (in the case of DNNs; see e.g., batch normalization [0]). There are many other such questions, such as is the data categorical (e.g., ports, IP addresses) or is it continuous/discrete (e.g., counters). And if the data in question is categorical, what is the cardinality of the categories (this will inform how such data can be encoded); in the case of IP addresses we can't really one-hot encode addresses because their cardinality is too large (2^32 or 2^128); this has implications for how we build classifiers (in particular, for softmax layers in DNNs of various kinds).
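
[Editor's note: one standard workaround for that cardinality problem is the hashing trick: hash each categorical value into a fixed-width sparse vector instead of one-hot encoding the full 2^32/2^128 space. A minimal sketch, assuming scikit-learn; the flow dicts are made up.]

    # A minimal sketch of the hashing trick for high-cardinality
    # categorical features such as IP addresses, assuming scikit-learn.
    from sklearn.feature_extraction import FeatureHasher

    # Hypothetical flow records as feature dicts.
    flows = [
        {"src": "10.0.0.1", "dst": "192.0.2.7", "dst_port": "443"},
        {"src": "10.0.0.2", "dst": "192.0.2.7", "dst_port": "53"},
    ]

    # 2**16 buckets instead of a 2**32-wide one-hot encoding; hash
    # collisions are the price paid for the fixed width.
    hasher = FeatureHasher(n_features=2**16, input_type="dict")
    X = hasher.transform(flows)   # sparse matrix usable by any classifier
    print(X.shape)                # (2, 65536)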

Related to the above is the question of features: what are good features for networking? Where do they come from? Are they domain specific? Can we learn features the way a DNN does in the network space? Can we use autoencoders to discover such features? Or can we use GANs to train DNNs for network classification tasks in an unsupervised manner? Are there other, non-ad-hoc (well-founded) methods we can use, or is every use case a one-off (one would hope not)?
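
[Editor's note: a minimal sketch of the autoencoder idea, assuming Keras; X is a hypothetical matrix of normalized flow features. The features are compressed through a narrow hidden layer, and the bottleneck activations become the learned features.]

    # A minimal autoencoder sketch for learning features from network
    # data, assuming Keras; X stands in for real normalized flow features.
    import numpy as np
    from tensorflow.keras.layers import Input, Dense
    from tensorflow.keras.models import Model

    X = np.random.rand(1000, 20)   # hypothetical (n_samples, 20) features

    inp = Input(shape=(20,))
    code = Dense(4, activation="relu")(inp)        # bottleneck
    out = Dense(20, activation="sigmoid")(code)    # reconstruction

    autoencoder = Model(inp, out)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

    # The encoder alone yields 4-dimensional learned features.
    encoder = Model(inp, code)
    features = encoder.predict(X)
    print(features.shape)          # (1000, 4)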

We can apply the same kinds of analyses to the algorithms. For example, while something like k-means is an effective way to get a feeling for how continuous/discrete data hangs together, if our data is categorical then statistical clustering approaches such as LDA might provide a more well-founded approach (of course, as with most Bayesian techniques, the question of approximate inference arises, since in most interesting cases the integral we need to solve, namely the marginal probability of the data, isn't tractable, so we need to resort to MCMC or, more likely, variational inference). And what about the use of SGD/batch normalization etc. with DNNs, and, perhaps more importantly, can we use network data to train DNN policy networks for reinforcement learning, as we saw with deep Q-learning and AlphaGo?

These comments are all by way of saying that we don't have a solid theoretical understanding (yet) of how techniques that have been so successful in other domains (e.g., DNNs for perceptual tasks) generalize to networking use cases. We will need this understanding if our goal is to provide accurate, repeatable, and explainable results.

In order to accomplish all of this we need, as I have been saying, not only a good understanding of how these algorithms work but also standardized data sets and associated benchmarks so we can tell whether we are making progress (or even whether our techniques work). Analogies here include MNIST and ImageNet and their associated benchmarks, among others. As mentioned, standardized data sets are key to making progress in the ML-for-networking space (otherwise how do you know your technique works and/or improves on other techniques?). One might assume that these data sets would need to be labeled (as supervised learning is where most of the progress is being made these days), but not necessarily; Generative Adversarial Networks (GANs) have emerged as a new way to train DNNs in an unsupervised manner (this is moving very rapidly; see e.g., https://openai.com/blog/generative-models/).

The summary here is that the "distance" between theory and practice in ML is effectively zero right now due to the incredible rate of progress in the field; this means we need  to understand both sides of the theory/practice coin in order to be effective. None of the slide decks provide much background on what the proposed algorithms are, how they work, or why they should be expected to work on network data.

Finally, if you are interested in LDA or other algorithms there are a few short explanatory pieces I have written for my team on http://www.1-4-5.net/~dmm/ml (works in progress).

Thanks,

Dave

[0] https://arxiv.org/pdf/1502.03167.pdf

On Tue, Jun 28, 2016 at 7:03 AM, Sheng Jiang <jiangsheng@huawei.com<mailto:jiangsheng@huawei.com>> wrote:
Oops... The proceedings page for the interim meeting does not seem to be as intelligent as the proceedings pages of IETF meetings; ours does not automatically show the slides. I have sent an email to the IETF secretariat to ask them to fix it. Meanwhile, here are the links to each presentation:

Chair Slides
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-0.pdf

Introduction to Network Machine Learning & NMLRG
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-1.pdf

Data Collection and Analysis At High Security Lab
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-2.pdf

Use Cases of Applying Machine Learning Mechanism with Network Traffic
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-3.pdf

Mobile network state characterization and prediction
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-4.pdf

Learning how to route
https://www.ietf.org/proceedings/interim-2016-nmlrg-01/slides/slides-interim-2016-nmlrg-01-5.pdf

Regards,

Sheng
________________________________________
From: nmlrg [nmlrg-bounces@irtf.org<mailto:nmlrg-bounces@irtf.org>] on behalf of Sheng Jiang [jiangsheng@huawei.com<mailto:jiangsheng@huawei.com>]
Sent: 28 June 2016 21:03
To: nmlrg@irtf.org<mailto:nmlrg@irtf.org>
Subject: [Nmlrg] slides of NMLRG #3, June 27th, Athens

Hi, nmlrg,

All slides presented at our NMLRG #3 meeting (June 27th, 2016, Athens, Greece, co-located with EUCNC 2016) have been uploaded. They can be accessed through the link below:

https://www.ietf.org/proceedings/interim/2016/06/27/nmlrg/proceedings.html

Best regards,

Sheng
_______________________________________________
nmlrg mailing list
nmlrg@irtf.org<mailto:nmlrg@irtf.org>
https://www.irtf.org/mailman/listinfo/nmlrg