Re: [Nmlrg] Machine Learning in network - solicitation for use cases

"Liubing (Leo)" <leo.liubing@huawei.com> Fri, 18 September 2015 03:49 UTC

Return-Path: <leo.liubing@huawei.com>
X-Original-To: nmlrg@ietfa.amsl.com
Delivered-To: nmlrg@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id AD6091A1B71 for <nmlrg@ietfa.amsl.com>; Thu, 17 Sep 2015 20:49:23 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -2.312
X-Spam-Level:
X-Spam-Status: No, score=-2.312 tagged_above=-999 required=5 tests=[BAYES_20=-0.001, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jvJKD5dSAldZ for <nmlrg@ietfa.amsl.com>; Thu, 17 Sep 2015 20:49:21 -0700 (PDT)
Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [119.145.14.66]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 351AB1A1B62 for <nmlrg@irtf.org>; Thu, 17 Sep 2015 20:49:19 -0700 (PDT)
Received: from 172.24.1.51 (EHLO nkgeml407-hub.china.huawei.com) ([172.24.1.51]) by szxrg03-dlp.huawei.com (MOS 4.4.3-GA FastPath queued) with ESMTP id BNM19111; Fri, 18 Sep 2015 11:48:59 +0800 (CST)
Received: from NKGEML506-MBX.china.huawei.com ([169.254.3.238]) by nkgeml407-hub.china.huawei.com ([10.98.56.38]) with mapi id 14.03.0235.001; Fri, 18 Sep 2015 11:48:46 +0800
From: "Liubing (Leo)" <leo.liubing@huawei.com>
To: Sebastian Abt <sabt@sabt.net>
Thread-Topic: [Nmlrg] Machine Learning in network - solicitation for use cases
Thread-Index: AQHQ48AegVsK0UL1fk+2fNxpqlcTQp4u5gnE///RqACAALXdgIAAbfMAgAAYC4CAAYcXgIAAi5ug//+MkwCAAJ0JAIAOg96AgADyEVA=
Date: Fri, 18 Sep 2015 03:48:45 +0000
Message-ID: <8AE0F17B87264D4CAC7DE0AA6C406F45C22A99F2@nkgeml506-mbx.china.huawei.com>
References: <D20A251E.25E52%dacheng.zdc@alibaba-inc.com> <5D36713D8A4E7348A7E10DF7437A4B927BB2B192@nkgeml512-mbx.china.huawei.com> <D20B2C03.25EC7%dacheng.zdc@alibaba-inc.com> <5D36713D8A4E7348A7E10DF7437A4B927BB2D062@nkgeml512-mbx.china.huawei.com> <D211D160.26495%dacheng.zdc@alibaba-inc.com> <D211D7F2.2651C%dacheng.zdc@alibaba-inc.com> <5D36713D8A4E7348A7E10DF7437A4B927BB2D300@nkgeml512-mbx.china.huawei.com> <55EC9987.9030002@gmail.com> <5D36713D8A4E7348A7E10DF7437A4B927BB2D65D@nkgeml512-mbx.china.huawei.com> <55ED09ED.3090406@gmail.com> <5D36713D8A4E7348A7E10DF7437A4B927BB2DD75@nkgeml512-mbx.china.huawei.com> <8AE0F17B87264D4CAC7DE0AA6C406F45C227BE52@nkgeml506-mbx.china.huawei.com> <55EE6648.4040804@gmail.com> <8AE0F17B87264D4CAC7DE0AA6C406F45C227CF25@nkgeml506-mbx.china.huawei.com> <011F781F-9409-44D6-A006-C899A39053A1@sabt.net>
In-Reply-To: <011F781F-9409-44D6-A006-C899A39053A1@sabt.net>
Accept-Language: en-US, zh-CN
Content-Language: zh-CN
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.111.98.117]
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0
X-CFilter-Loop: Reflected
X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A020205.55FB89AC.00EF, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0, ip=169.254.3.238, so=2013-05-26 15:14:31, dmn=2013-03-21 17:37:32
X-Mirapoint-Loop-Id: e8009c78e989d24a87131338d9869bba
Archived-At: <http://mailarchive.ietf.org/arch/msg/nmlrg/BjJ2x1jkxrPebPoc7LMcuzaZX6c>
Cc: "nmlrg@irtf.org" <nmlrg@irtf.org>, Dacheng Zhang <dacheng.zdc@alibaba-inc.com>, Sheng Jiang <jiangsheng@huawei.com>
Subject: Re: [Nmlrg] Machine Learning in network - solicitation for use cases
X-BeenThere: nmlrg@irtf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: Network Machine Learning Research Group <nmlrg.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nmlrg/>
List-Post: <mailto:nmlrg@irtf.org>
List-Help: <mailto:nmlrg-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Sep 2015 03:49:23 -0000

Hi Sebastian,

Thanks for your further explanation. 
Please see inline.

> >> On 08/09/2015 16:00, Liubing (Leo) wrote:
> >>
> >> ...
> >>> But I'm curious about what is the item that could be labeled as "This
> >>> is not an attack" or "You missed an attack". E.g., is the item a
> >>> packet, a stream, or any other kind of N-tuple thing?
> >>
> >> The two cases are rather different.
> >>
> >> 1. The system signals "attack in progress" to the NOC. The operators
> >> have a look and decide that there is no attack, it is just some unusual
> >> traffic. (Example: you are live-streaming the Olympic Games. Two
> >> seconds after the end of the 100 metres final, there is an enormous
> >> burst of traffic. The machine learning system signals an attack,
> >> because it was not trained on the data set from the previous Olympic
> >> Games.)
> >>
> >> In this case the NOC operators urgently tell the algorithm it is wrong.
> >> It needs to learn that the signature of a sudden burst just after the
> >> end of an event is less likely to be an attack than a sudden burst at
> >> another time.
> >>
> >> 2. Someone invents a new kind of DDoS attack, which is therefore not
> >> in the historical training data. The system doesn't identify it.
> >> In this case, the NOC operators tell the algorithm "Attack started at
> >> <time>."
> >
> > [Bing] It feels like the learning objects are mostly traffic burst
> > events? When a traffic burst happens, the machine judges whether it's a
> > DDoS or not. Then the training data might be a bunch of traffic burst
> > events marked as normal or abnormal. And the challenge should be how to
> > pick a set of features out of a burst event for the machine learning
> > program to discover the pattern of normal/abnormal classification.
> >
> > This is just my hypothetical case, could be all wrong.
> 
> As you say, the point really is the representation you choose, i.e. the
> features your system uses.  If you encode traffic as bps and pps, then bursts
> won’t tell you anything.  Both in the case of a flash crowd and during a
> volumetric DDoS attack these features would increase.  You can predict a
> little bit more if you also incorporate bps/pps, i.e. bits-per-packet, as in
> reality DDoS attacks typically cause a shift there, but this is all pretty coarse.

[Bing] Indeed. The trick/art is in feature selection.
However, feature selection is usually done manually, by someone who understands both the application and machine learning well. We have always wondered: is there any possibility that the machine could select features by itself, dynamically, according to some general/universal method?
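On that question, one common starting point is to let the machine rank candidate features by a univariate separation score rather than hand-picking them. Here is a minimal pure-Python sketch using a Fisher-style score; the feature names (bps, bits-per-packet) and the toy samples are purely illustrative assumptions, not from any real system:

```python
# Hypothetical sketch: rank candidate traffic features automatically by a
# Fisher-style class-separation score, instead of selecting them by hand.
from statistics import mean, pvariance

def fisher_score(values, labels):
    """(mean_attack - mean_normal)^2 / (var_attack + var_normal)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    num = (mean(pos) - mean(neg)) ** 2
    den = pvariance(pos) + pvariance(neg)
    return num / den if den else float("inf")

def rank_features(samples, labels):
    """samples: list of dicts {feature_name: value}; labels: 0/1 per sample.
    Returns feature names, best-separating first."""
    names = samples[0].keys()
    scores = {n: fisher_score([s[n] for s in samples], labels) for n in names}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (assumed): bits-per-packet separates the classes, raw bps does
# not, because a flash crowd and a volumetric attack both raise bps.
samples = [
    {"bps": 100, "bits_per_packet": 1400},  # normal
    {"bps": 900, "bits_per_packet": 1350},  # normal (flash crowd)
    {"bps": 120, "bits_per_packet": 64},    # attack
    {"bps": 950, "bits_per_packet": 70},    # attack
]
labels = [0, 0, 1, 1]
print(rank_features(samples, labels))  # → ['bits_per_packet', 'bps']
```

This only scores each feature in isolation, of course; it says nothing about feature combinations, which is part of why the manual "art" is hard to automate.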

> And what might be a DDoS attack for a customer might not be considered an
> attack for the carrier - it may even go unnoticed by the carrier as it is too
> low-volume when compared to its usual backbone traffic levels.   What
> this example should tell: for reliable attack detection choosing features that
> are independent of traffic volume is important.  Looking at volume/bursts
> can only be indicative.

[Bing] Yes, it would be one-sided to treat DDoS purely as "burst" events, especially from a carrier's perspective.

> >> This automatically becomes high quality training data for the
> >> algorithm: the signature of the new traffic at that time is 100%
> >> certain to be an attack.
> >
> >> I think the hard part is extracting useful signatures from the traffic
> >> stream in real time; the learning/training part is fairly standard.
> > [Bing] When you said the signature was "100% certain", my perception is
> > that it is something like the typical virus detection approach, which
> > does an exact match on a particular piece of code that identifies a
> > virus.
> > If my perception is correct, I think the signatures are some special
> > combinations of the features (as mentioned above) where the
> > classification pattern has 100% confidence. Then I think there is no
> > need to extract the signatures in real time, because the features are
> > all pre-defined.
> 
> In both examples the difficulty is backwards-mapping from features to
> packets/flows/etc.  Typically, your features are derived from a set of
> packets/flows/etc. collected over a specific period of time.

[Bing] A clarification question: when you said "your features are derived from a set of packets/flows/etc.", were you referring to features selected manually at the system design stage, rather than selected dynamically/automatically by the machine at run time?
 
> So, you have to bin the packets/flows/etc. in a sensible way (i.e., not only time)
> before actually extracting feature vectors in order to be able to draw
> conclusions on very specific packets/flows - which is especially required for
> automatic response.
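The binning step Sebastian describes might look roughly like the sketch below: group flow records into (time-window, destination) bins before computing per-bin feature vectors, so that a prediction on a bin can be traced back to the specific flows inside it. The record fields, the window size, and the chosen features are all assumptions for illustration, not from any particular ADS:

```python
# Illustrative sketch: bin flows by (time window, destination), then derive
# a feature vector per bin, keeping the flow list so predictions on a bin
# can be mapped back to concrete flows for automatic response.
from collections import defaultdict

WINDOW = 60  # seconds per bin (assumed)

def bin_flows(flows):
    """flows: iterable of dicts with 'ts', 'dst', 'bytes', 'packets'."""
    bins = defaultdict(list)
    for f in flows:
        key = (int(f["ts"]) // WINDOW, f["dst"])
        bins[key].append(f)
    return bins

def feature_vector(flow_list):
    """Per-bin features: aggregate bps, pps and mean bits-per-packet."""
    total_bytes = sum(f["bytes"] for f in flow_list)
    total_pkts = sum(f["packets"] for f in flow_list)
    return {
        "bps": total_bytes * 8 / WINDOW,
        "pps": total_pkts / WINDOW,
        "bits_per_packet": total_bytes * 8 / total_pkts,
    }

# Toy flow records (assumed format, loosely NetFlow-like):
flows = [
    {"ts": 5,  "dst": "10.0.0.1", "bytes": 1500, "packets": 1},
    {"ts": 30, "dst": "10.0.0.1", "bytes": 320,  "packets": 4},
    {"ts": 70, "dst": "10.0.0.2", "bytes": 9000, "packets": 6},
]
for key, fl in bin_flows(flows).items():
    print(key, feature_vector(fl), "->", len(fl), "flows traceable")
```

Binning by time alone is the naive choice; binning by destination (or other keys) is one way to make the backwards mapping from a flagged feature vector to actionable flows less coarse.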

> BTW: the same applies to the "high quality training data" mentioned by
> Brian.  While I agree that manually labelled training data is the gold
> standard, in the case sketched here an operator would only see the
> prediction your ADS makes on the feature vectors, which may be based on a
> bunch of packets/flows/… which might contain both legitimate and
> illegitimate activity.  So, being able to map labels down to packets or
> flows will be a time consuming manual process.

[Bing] If I understand you correctly, you basically mean that DDoS detection needs to be done at an aggregate level (e.g. over a bunch of flows/packets within a time period), while the response needs to act on individual objects (e.g. some specific flows). This really is a puzzle.

However, if the goal of the machine is just to alert on a DDoS event, then a human only needs to label the packet/flow "bins" you mentioned and save a large number of labeled "bin" examples, which would make labeling much easier. But of course, a system that can only "alert" on DDoS makes much less sense than one that can act on specific packets/flows.
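A minimal sketch of that labeling trade-off, with all names assumed: the operator reports only an attack interval ("attack started/ended at some time"), and labels are applied to whole bins rather than to individual flows. This is cheap to produce, but the label then covers every flow in the bin, legitimate or not, which is exactly the mapping problem above.

```python
# Hypothetical bin-level labeling: mark a (window, dst) bin as attack (1)
# if its time window overlaps the operator-reported attack interval.
WINDOW = 60  # seconds per bin (assumed)

def label_bins(bins, attack_start, attack_end):
    """bins: dict keyed by (window_index, dst). Returns 0/1 label per bin."""
    labels = {}
    for (window, dst) in bins:
        t0, t1 = window * WINDOW, (window + 1) * WINDOW
        # Interval-overlap test against the reported attack period.
        labels[(window, dst)] = int(t0 < attack_end and t1 > attack_start)
    return labels
```

Such labels are enough to train an alert-only classifier on bin feature vectors, but refining them down to per-flow ground truth would still require the manual work Sebastian describes.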

Best regards,
Bing

> sebastian