Re: [Nmlrg] new ways of dealing with sparse categorical data like IP addresses (Google Wide & Deep)

Sheng Jiang <jiangsheng@huawei.com> Fri, 01 July 2016 18:29 UTC

Return-Path: <jiangsheng@huawei.com>
X-Original-To: nmlrg@ietfa.amsl.com
Delivered-To: nmlrg@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 65CF512D5A5 for <nmlrg@ietfa.amsl.com>; Fri, 1 Jul 2016 11:29:50 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -5.646
X-Spam-Level:
X-Spam-Status: No, score=-5.646 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H4=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RP_MATCHES_RCVD=-1.426, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id fypdj1MVfYho for <nmlrg@ietfa.amsl.com>; Fri, 1 Jul 2016 11:29:47 -0700 (PDT)
Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id B2C8312D5A2 for <nmlrg@irtf.org>; Fri, 1 Jul 2016 11:29:43 -0700 (PDT)
Received: from 172.24.1.36 (EHLO nkgeml414-hub.china.huawei.com) ([172.24.1.36]) by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id DJQ88835; Sat, 02 Jul 2016 02:26:19 +0800 (CST)
Received: from NKGEML515-MBX.china.huawei.com ([fe80::a54a:89d2:c471:ff]) by nkgeml414-hub.china.huawei.com ([10.98.56.75]) with mapi id 14.03.0235.001; Sat, 2 Jul 2016 02:26:14 +0800
From: Sheng Jiang <jiangsheng@huawei.com>
To: David Meyer <dmm@1-4-5.net>, "nmlrg@irtf.org" <nmlrg@irtf.org>
Thread-Topic: [Nmlrg] new ways of dealing with sparse categorical data like IP addresses (Google Wide & Deep)
Thread-Index: AQHR0jbbC8ZP2gDp8UaVg4AogmAngqACUbXn
Date: Fri, 01 Jul 2016 18:26:13 +0000
Message-ID: <5D36713D8A4E7348A7E10DF7437A4B927CA8A31A@NKGEML515-MBX.china.huawei.com>
References: <CAHiKxWiqk8XAp-XAhmnp3K9vmNOW+YHJ_pjCpFdE_86NdoepYQ@mail.gmail.com>
In-Reply-To: <CAHiKxWiqk8XAp-XAhmnp3K9vmNOW+YHJ_pjCpFdE_86NdoepYQ@mail.gmail.com>
Accept-Language: en-GB, zh-CN, en-US
Content-Language: en-GB
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
x-originating-ip: [10.47.71.50]
Content-Type: multipart/alternative; boundary="_000_5D36713D8A4E7348A7E10DF7437A4B927CA8A31ANKGEML515MBXchi_"
MIME-Version: 1.0
X-CFilter-Loop: Reflected
X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A090202.5776B5CC.000C, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0, ip=0.0.0.0, so=2013-06-18 04:22:30, dmn=2013-03-21 17:37:32
X-Mirapoint-Loop-Id: a9126dc00d156842f283a600fff3395a
Archived-At: <https://mailarchive.ietf.org/arch/msg/nmlrg/CK7_NoU9s5AXRB6_xoOckzgldi0>
Subject: Re: [Nmlrg] new ways of dealing with sparse categorical data like IP addresses (Google Wide & Deep)
X-BeenThere: nmlrg@irtf.org
X-Mailman-Version: 2.1.17
Precedence: list
List-Id: Network Machine Learning Research Group <nmlrg.irtf.org>
List-Unsubscribe: <https://www.irtf.org/mailman/options/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/nmlrg/>
List-Post: <mailto:nmlrg@irtf.org>
List-Help: <mailto:nmlrg-request@irtf.org?subject=help>
List-Subscribe: <https://www.irtf.org/mailman/listinfo/nmlrg>, <mailto:nmlrg-request@irtf.org?subject=subscribe>
X-List-Received-Date: Fri, 01 Jul 2016 18:29:50 -0000

Hi, David,



Thanks for this information. It looks interesting and seems worth trying. However, I cannot claim I understand it properly. In particular, a few of the explanations actually confused me further: a wide linear model (for memorization), a deep neural network (for generalization), sparse inputs (categorical features with a large number of possible feature values), and your mention of IPv{4,6} addresses as a sparse categorical feature. I suspect much of the terminology in this field has not yet been properly specified or standardized, so there are many chances for us to misunderstand each other.
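To make the "sparse categorical" terminology concrete: an IPv4 address is categorical with 2^32 possible values, so a one-hot encoding is astronomically wide and almost entirely zero. A common workaround is the hashing trick, which maps each value into a fixed number of buckets. The sketch below is purely illustrative; the bucket count and function names are my own assumptions, not anything from the blog post.

```python
# Illustrative sketch: treating an IP address as a sparse categorical
# feature via the hashing trick. NUM_BUCKETS is an assumed value;
# real systems tune it against collision rate and model size.
import ipaddress

NUM_BUCKETS = 10_000

def ip_to_bucket(ip: str) -> int:
    """Map an IPv4/IPv6 address string to one of NUM_BUCKETS feature indices."""
    return int(ipaddress.ip_address(ip)) % NUM_BUCKETS

def one_hot(ip: str) -> dict:
    """A 'sparse input': only one of NUM_BUCKETS positions is non-zero,
    so we store just that index instead of the full vector."""
    return {ip_to_bucket(ip): 1.0}

print(one_hot("192.0.2.1"))  # -> {5985: 1.0}
```

The wide part of Wide & Deep would consume indices like these directly (memorizing specific co-occurrences), while the deep part would look each index up in a learned dense embedding table (generalizing across addresses).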



Meanwhile, one of my takeaways from the blog post is that combining more than one learning/analysis approach for the same task is not new. Multiple approaches could be run simultaneously; if they reach the same result, they corroborate each other, and we can have more confidence in the prediction. I am sure there will be cases where the results differ. Then what? I am not sure, but it would be interesting to keep working on this.
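One simple reading of the "run several approaches and compare" idea is sketched below. The names and the disagreement policy (defer rather than resolve automatically) are my own assumptions for illustration, not something the thread or the blog post prescribes.

```python
# Sketch: run several learners on the same input; agreement raises
# confidence, disagreement is deferred (e.g. to a human operator).
def combined_predict(models, x):
    """models: callables model(x) -> label. Returns (label, confident)."""
    predictions = [m(x) for m in models]
    if all(p == predictions[0] for p in predictions):
        return predictions[0], True   # models corroborate each other
    return None, False                # disagreement: defer the decision

# Two toy classifiers with different thresholds (purely illustrative).
wide = lambda x: "attack" if x > 0.5 else "benign"
deep = lambda x: "attack" if x > 0.7 else "benign"

print(combined_predict([wide, deep], 0.9))  # -> ('attack', True)
print(combined_predict([wide, deep], 0.6))  # -> (None, False)
```

Note this only answers the easy half of the question; what to do when results differ (vote, weight by past accuracy, escalate) remains open, as the note above says.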



Regards,



Sheng

________________________________
From: nmlrg [nmlrg-bounces@irtf.org] on behalf of David Meyer [dmm@1-4-5.net]
Sent: 30 June 2016 2:48
To: nmlrg@irtf.org
Subject: [Nmlrg] new ways of dealing with sparse categorical data like IP addresses (Google Wide & Deep)

Check out Wide & Deep: https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html

Among other things, Wide & Deep is

"useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems."

In the network space we have many such sparse categorical features, such as IPv{4,6} addresses. Wide & Deep might give us a few hints about how we might approach this problem.

--dmm