[Pearg] new reidentification result using GANs...

Joseph Lorenzo Hall <joe@cdt.org> Wed, 24 July 2019 14:38 UTC

From: Joseph Lorenzo Hall <joe@cdt.org>
Date: Wed, 24 Jul 2019 10:38:28 -0400
Message-ID: <CABtrr-UNtcCsXar_+Zpc7T11xd_scBh1r9n55kKMLMCdh5OqCw@mail.gmail.com>
To: pearg@irtf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/pearg/eOl06edINFDAD7KsgrUKnbhFOoo>
Subject: [Pearg] new reidentification result using GANs...

https://www.nature.com/articles/s41467-019-10933-3
(PDF: https://www.nature.com/articles/s41467-019-10933-3.pdf )

# Estimating the success of re-identifications in incomplete datasets using generative models

Luc Rocher, Julien M. Hendrickx & Yves-Alexandre de Montjoye

Abstract: While rich medical, behavioral, and socio-demographic data are
key to modern data-driven research, their collection and use raise
legitimate privacy concerns. Anonymizing datasets through de-identification
and sampling before sharing them has been the main tool used to address
those concerns. We here propose a generative copula-based method that can
accurately estimate the likelihood of a specific person to be correctly
re-identified, even in a heavily incomplete dataset. On 210 populations,
our method obtains AUC scores for predicting individual uniqueness ranging
from 0.84 to 0.97, with low false-discovery rate. Using our model, we find
that 99.98% of Americans would be correctly re-identified in any dataset
using 15 demographic attributes. Our results suggest that even heavily
sampled anonymized datasets are unlikely to satisfy the modern standards
for anonymization set forth by GDPR and seriously challenge the technical
and legal adequacy of the de-identification release-and-forget model.
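
To make the notion of "uniqueness" in the abstract concrete, here is a
minimal Python sketch. It is not the authors' copula-based method. It only
shows two simpler building blocks: counting how many records in a dataset
are unique on a chosen set of attributes, and the probability that nobody
else in a population of n shares an attribute combination of probability p,
under the simplifying assumption that the other n-1 people are independent
draws. The function names, attribute names, and toy records are all
hypothetical.

```python
from collections import Counter


def empirical_uniqueness(records, attributes):
    """Fraction of records whose values on `attributes` occur exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0


def uniqueness_likelihood(p, n):
    """Probability that none of the other n-1 individuals has a combination
    of probability p (assumption: the others are independent draws)."""
    return (1.0 - p) ** (n - 1)


# Hypothetical toy data, for illustration only.
records = [
    {"zip": "20005", "birth_year": 1980, "gender": "F"},
    {"zip": "20005", "birth_year": 1980, "gender": "F"},
    {"zip": "20005", "birth_year": 1975, "gender": "M"},
    {"zip": "10001", "birth_year": 1990, "gender": "F"},
]

one_attr = empirical_uniqueness(records, ["zip"])
three_attrs = empirical_uniqueness(records, ["zip", "birth_year", "gender"])
print(one_attr, three_attrs)
```

Even in this toy example, the share of unique records grows as more
attributes are combined, which is the effect the abstract quantifies at
population scale.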

-- 
Joseph Lorenzo Hall
Chief Technologist, Center for Democracy & Technology [https://www.cdt.org]
1401 K ST NW STE 200, Washington DC 20005-3497
e: joe@cdt.org, p: 202.407.8825, pgp: https://josephhall.org/gpg-key
Fingerprint: 3CA2 8D7B 9F6D DBD3 4B10  1607 5F86 6987 40A9 A871