[Coin] Some comments on draft-mcbride-edge-data-discovery-overview-01

"David R. Oran" <daveoran@orandom.net> Mon, 18 March 2019 17:50 UTC

From: "David R. Oran" <daveoran@orandom.net>
To: "Schooler, Eve M" <eve.m.schooler@intel.com>, michael.mcbride@huawei.com, cjbc@it.uc3m.es
Cc: coin@irtf.org
Date: Mon, 18 Mar 2019 13:49:47 -0400
Archived-At: <https://mailarchive.ietf.org/arch/msg/coin/j-qcaMcg298h4YCIBrL5MpJutNQ>

Thanks for an interesting draft - I hadn’t read the -00 version so 
this is the first chance I’ve had to think about it and offer some 
suggestions.

I appreciate the importance of the topic and think it’s worth doing 
research and some protocol design around. The document is well written 
and easy to follow - so thanks for that too. I have some detailed 
comments further along in this note, but I’d like to raise one pretty 
general conceptual difficulty I have with the problem statement.

It is not clear to me that “edge data discovery” has a crisp 
definition or boundary that distinguishes it from the other things that 
are needed to make edge computing work. There is actually a continuum 
from “searching for data”, through “discovering data” to 
“accessing data” and making a sharp division among these things in 
order to craft protocols may not be ideal for either generating valuable 
research or engineering protocols. Conceptually, the document proposes 
that I sharply separate:

- Searching for data: “I don’t know exactly what I want to access, 
only a possibly partial description thereof”

- “Discovering” data: which in the context of this draft seems to 
mean “I know exactly what I want but not where it currently lives”.

- Accessing data: “I know exactly what I want, please fetch it” - which 
may or may not require me to know where it lives beforehand (cf. the 
ICN/NDN discussion towards the end of the document).
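To make the three-way split concrete, here is a toy sketch of how I read 
the distinction (all names below are mine, nothing is from the draft):

```python
from dataclasses import dataclass, field

@dataclass
class DataItem:
    name: str                               # exact identifier
    location: str                           # where it currently lives
    tags: set = field(default_factory=set)  # partial description

class EdgeCatalog:
    def __init__(self, items):
        self._items = {i.name: i for i in items}

    def search(self, tag):
        """Searching: a partial description in, candidate names out."""
        return sorted(i.name for i in self._items.values() if tag in i.tags)

    def discover(self, name):
        """Discovering: an exact name in, its current location out."""
        return self._items[name].location

    def access(self, name):
        """Accessing: an exact name in, the data itself out
        (location resolved internally, invisible to the caller)."""
        item = self._items[name]
        return f"bytes-from:{item.location}/{item.name}"
```

The point of the sketch is that discover() is exactly the gap between 
search() and access() - and a design question is whether it deserves to 
be a separate protocol at all.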

It may be that a specific data discovery protocol is the right way to 
bridge the gap between searching and access. Perhaps that’s how the 
document could help get a better handle on the problem, but the material 
needs to be more convincing than what this version supplies.

At any rate, here are a few detailed comments, which I’ve done by 
snipping the relevant parts of the draft and inserting my suggestions:

——————————

                     Overview of Edge Data Discovery
              draft-mcbride-edge-data-discovery-overview-01

Abstract

    This document describes the problem of distributed data discovery in
    edge computing.  Increasing numbers of IoT devices and sensors are
    generating a torrent of data that originates at the very edges of the
    network and that flows upstream, if it flows at all.  Sometimes that
    data must be processed or transformed (transcoded, subsampled,
    compressed, analyzed, annotated, combined, aggregated, etc.) on edge
    equipment, particularly in places where multiple high bandwidth
    streams converge and where resources are limited.  Support for edge
    data analysis is critical to make local, low-latency decisions (e.g.,
    regarding predictive maintenance, the dispatch of emergency services,

<DO> I get the “local” part of this and in fact that to me is the 
distinguishing characteristic - it allows you to not depend on 
connectivity to the cloud to operate an application (possibly in a 
degraded mode), or to minimize communication resource usage. What I 
don’t get, at least in most cases, is the “low latency” angle. In 
particular here, why would predictive maintenance require low latency? 
Or even emergency services, where the actual physical dispatch takes 
two or more orders of magnitude longer than any communication latency 
to the cloud.

    identity, authorization, etc.).  In addition, (transformed) data may
    be cached, copied and/or stored at multiple locations in the network
    en route to its final destination.  Although the data might originate
    at the edge, for example in factories, automobiles, video cameras,
    wind farms, etc., as more and more distributed data is created,
    processed and stored, it becomes increasingly dispersed throughout
    the network.  There needs to be a standard way to find it.  New and
    existing protocols will need to be identified/developed/enhanced for
    distributed data discovery at the network edge and beyond.

[snip]

1.  Introduction

    Edge computing is an architectural shift that migrates Cloud
    functionality (compute, storage, networking, control, data
    management, etc.) out of the back-end data center to be more
    proximate to the IoT data being generated and analyzed at the edges

<DO> why just IoT data - any data, right?

    of the network.  Edge computing provides local compute, storage and
    connectivity services, often required for latency- and bandwidth-
    sensitive applications.  Thus, Edge Computing plays a key role in
    verticals such as Energy, Manufacturing, Automotive, Video Analytics,
    Retail, Gaming, Healthcare, Mining, Buildings and Smart Cities.

<DO> here’s an “intelligence test” question about this list. Which 
of these things is not like the others? :-) Which doesn’t fit? Ok, 
enough funnies - it’s “Video Analytics”: the others on this list 
are in fact verticals, but video analytics is an application, not a 
vertical.

1.1.  Edge Data

    Edge computing is motivated at least in part by the sheer volume of
    data that is being created by IoT devices (sensors, cameras, lights,
    vehicles, drones, wearables, etc.) at the very network edge and that
    flows upstream, in a direction for which the network was not
    originally provisioned.  In fact, in dense IoT deployments (e.g.,

<DO> The problem is deeper than provisioning - a lot of important access 
network technologies are inherently asymmetric: DOCSIS, Cellular 
wireless, etc. No amount of provisioning will make data upload over the 
uplinks cheap compared to download.

    many video cameras are streaming high definition video), where
    multiple data flows collect or converge at edge nodes, data is likely
    to need transformation (transcoded, subsampled, compressed, analyzed,
    annotated, combined, aggregated, etc.) to fit over the next hop link,
    or even to fit in memory or storage.  Note also that the act of
    performing compute on the data creates yet another new data stream!

<DO> the original data streams are needed sometimes but not always. 
It’s important to call out these cases separately, because it affects 
both how you do forensics, and how you express data provenance.

    In addition, data may be cached, copied and/or stored at multiple
    locations in the network en route to its final destination.  With an
    increasing percentage of devices connecting to the Internet being
    mobile, support for in-the-network caching and replication is
    critical for continuous data availability, not to mention efficient
    network and battery usage for endpoint devices.

<DO> I would not throw caching and replication together in one sentence 
- they in fact address quite different needs.

[snip]

    Businesses, such as industrial companies, are starting to understand
    how valuable the data is that they've kept in silos.  Once this data
    is made accessible on edge computing platforms, they may be able to
    monetize the value of the data.  But this will happen only if data

<DO> I don’t see how this follows. Monetization is either attractive 
or not depending on the data ownership model; what about edge changes 
the equation compared to centralized?

    can be discovered and searched among heterogeneous equipment in a
    standard way.  Discovering the data that is most useful to a given
    market segment will be extremely useful in building business
    revenues.  Having a mechanism to provide this granular discovery is
    the problem that needs solving either with existing, or new,
    protocols.

[snip]

1.4.  Terminology

    o  Edge: The edge encompasses all entities not in the back-end cloud.
       The device edge is the boundary between digital and physical
       entities in the last mile network.

<DO> not sure I follow this sentence or why it’s relevant.

       Sensors, gateways, compute
       nodes are included.  The infrastructure edge includes equipment on
       the network operator side of the last mile network including cell
       towers, edge data centers, cable headends, etc.  See Figure 1 for

<DO> I would include POPs in this list

       other possible tiers of edge clouds between the device edge and
       the back-end cloud data center.

    o  Edge Computing: Distributed computation that is performed near the
       edge, where nearness is determined by the system requirements.
       This includes high performance compute, storage and network
       equipment on either the device or infrastructure edge.

<DO> I wonder if including the devices is a good idea here. I can see 
justifications for blurring the distinction between the edge and the 
end devices, but also some pretty strong justifications for treating 
them completely separately. Some examples:
- from a security standpoint, the device generating the original data is 
the natural custodian and the principal authenticating that data.
- from a privacy standpoint, devices holding data can form a natural 
privacy boundary, as opposed to infrastructure.
- In many (but of course not all) cases, mobility complexities are 
confined to devices since the infrastructure does not move rapidly


    o  Edge Data Discovery: The process of finding required data from
       edge entities, i.e., from databases, file systems, device memory
       that might be physically distributed in the network, and
       consolidating it or providing access to it logically as if it were
       a single unified source, perhaps through its namespace, that can
       be evaluated or searched.
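<DO> as a side note, the “single unified source” phrasing suggests a 
mount/federation model. A toy sketch of what I understand the 
definition to mean (all names invented, not from the draft):

```python
class UnifiedNamespace:
    """Several physically separate stores exposed as one logical namespace."""

    def __init__(self):
        self._sources = {}           # prefix -> {name: data}

    def mount(self, prefix, store):
        """Attach one physical source (database, file system, device)."""
        self._sources[prefix] = store

    def lookup(self, path):
        """Resolve a unified name back to the source that holds it."""
        prefix, _, rest = path.partition("/")
        return self._sources[prefix].get(rest)

    def names(self):
        """Enumerate the consolidated namespace for search/evaluation."""
        return sorted(f"{p}/{n}" for p, s in self._sources.items() for n in s)
```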


    o  NDN: Named Data Networking.  NDN routes data by name (vs address),
       caches content natively in the network, and employs data-centric
       security.  Data discovery may require that data be associated with
       a name or names, a series of descriptive attributes, and/or a
       unique identifier.

<DO> Probably ought to cite CCNx as well (since it’s about to become 
an experimental RFC), and ICN approaches in general
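To illustrate the name-vs-address point for readers less familiar with 
ICN, here is a toy content store that answers requests by name prefix. 
This is emphatically not the real NDN/CCNx protocol machinery, just the 
flavor of it:

```python
class ContentStore:
    """Toy in-network cache: requests carry names, not addresses."""

    def __init__(self):
        self._objects = {}           # hierarchical name -> data

    def put(self, name, data):
        self._objects[name] = data

    def satisfy(self, interest_name):
        # Answer with any cached object whose name equals the interest
        # or extends it as a prefix (e.g. with a segment/version suffix).
        # Any node along the path holding a match can respond.
        for name in sorted(self._objects):
            if name == interest_name or name.startswith(interest_name + "/"):
                return name, self._objects[name]
        return None
```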

[snip]

2.1.  A Cloud-Edge Continuum

    Although Edge Computing data typically originates at edge devices,
    there is nothing that precludes edge data from being created anywhere
    in the cloud-to-edge computing continuum (Figure 1).  New edge data
    may result as a byproduct of computation being performed on the data
    stream anywhere along its path in the network.  For example,
    infrastructure edges may create new edge data when multiple data
    streams converge upon this aggregation point and require
    transformation to fit within the available resources.  Edge data also

<DO> There might be other reasons than fitting within resources. For 
example
- smoothing of raw measurements to eliminate high-frequency noise
- obfuscation of data for privacy

    may be sent to the back-end cloud as needed.  Discovering data which
    has been sent to the cloud is out of scope of this document, the
    assumption being that the cloud boundary is one that does not expose
    or publish the availability of its data.

<DO> Well cloud stuff might be out of scope, but saying that the cloud 
boundary doesn’t expose or publish data is just not quite right.  Or 
that the way you discover data across a multi-cloud by definition has to 
be different from how you do it for the edge seems similarly off the 
mark.

[snip]

    Initially our focus is on discovery of edge data that resides at the
    Device Edge and the Infrastructure Edge.

<DO> I still wonder if discovery of data on devices is somehow different 
because they are “authoritative” for the data they have. There also 
may be a peer-to-peer protocol angle here, which takes a particular (and 
possibly very different) view of discovery.

2.2.  Types of Edge Data

    Besides sensor and measurement data accumulating throughout the edge
    computing infrastructure, edge data may also take the form of
    streaming data (from a camera),

<DO> the boundary between sensor/measurement data and streaming data is 
blurry and probably not too relevant for the problems being talked about 
here. The further distinctions below are relevant, as all of the stuff 
above is time-series (with or without a fixed clock) while the stuff 
below is not.

    meta data (about the data), control
    data (regarding an event that was triggered), and/or an executable
    that embodies a function, service, or any other piece of code or
    algorithm.  Edge data also could be created after multiple streams
    converge at the edge node and are processed, transformed, or
    aggregated together in some manner.

    SFC Data and meta-data discovery

    Service function chaining (SFC) allows the instantiation of an
    ordered set of service functions and subsequent "steering" of traffic
    through them.  Service functions provide a specific treatment of
    received packets, therefore they need to be known so they can be used

<DO> need to be known to whom?

    in a given service composition via SFC.  So far, how the SFs are
    discovered and composed has been out of the scope of discussions in
    IETF.  While there are some mechanisms that can be used and/or
    extended to provide this functionality, work needs to be done.  An
    example of this can be found in [I-D.bernardos-sfc-discovery].

    In an SFC environment deployed at the edge, the discovery protocol
    may also need to make available the following meta-data information
    per SF:

    o  Service Function Type, identifying the category of SF provided.

    o  SFC-aware: Yes/No.  Indicates if the SF is SFC-aware.

<DO> What exactly does this mean? is it functional or a particular 
protocol supported? The latter seems more useful.

    o  Route Distinguisher (RD): IP address indicating the location of
       the SF(I).

<DO> Huh?

    o  Pricing/costs details.

    o  Migration capabilities of the SF: whether a given function can be
       moved to another provider (potentially including information about
       compatible providers topologically close).

    o  Mobility of the device hosting the SF, with e.g. the following
       sub-options:

          Level: no, low, high; or a corresponding scale (e.g., 1 to 10).

          Current geographical area (e.g., GPS coordinates, post code).

          Target moving area (e.g., GPS coordinates, post code).

    o  Power source of the device hosting the SF, with e.g. the following
       sub-options:

          Battery: Yes/No.  If Yes, the following sub-options could be
          defined:

          Capacity of the battery (e.g., mmWh).

          Charge status (e.g., %).

          Lifetime (e.g., minutes).

<DO> all of the above list seems pretty arbitrary and not specific to 
Service Functions.
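If the list does stay, it would at least read more clearly as a 
structured record. My reading of it, with field names that are mine 
rather than the draft’s:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Battery:
    capacity_mwh: float      # the draft's "mmWh" is presumably mWh
    charge_pct: float        # charge status
    lifetime_min: float      # remaining lifetime in minutes

@dataclass
class SFMetadata:
    sf_type: str                        # category of SF provided
    sfc_aware: bool                     # see the question above
    locator: str                        # IP address of the SF(I)
    cost: float                         # pricing/cost details
    migratable: bool                    # can move to another provider
    mobility_level: int                 # e.g. 0 (fixed) .. 10 (highly mobile)
    geo_area: str = ""                  # GPS coordinates or post code
    battery: Optional[Battery] = None   # None => not battery powered
```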

  [snip]

3.  Scenarios for Discovering Edge Data Resources

    Mainly two types of situations need to be covered:

    1.  A set of data resources appears (e.g., a mobile node hosting data
        joins a network) and they want to be discovered by an existing
        but possibly virtualized and/or ephemeral data directory
        infrastructure.

    2.  A device wants to discover data resources available at or near
        its current location.  As some of these resources may be mobile,
        the available set of edge data may vary over time.

<DO> what about discovering where in the edge infrastructure to upload 
data to, as opposed to just where to find it once it’s put there?
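For concreteness, the two scenarios (and the upload-placement question 
above) map onto a toy directory like this (all method names are invented 
for illustration):

```python
class DataDirectory:
    """Toy (possibly ephemeral/virtualized) data directory."""

    def __init__(self):
        self._entries = {}               # data name -> hosting location

    def announce(self, name, location):
        """Scenario 1: an appearing resource registers with the directory."""
        self._entries[name] = location

    def withdraw(self, name):
        """Mobile resources leave, so the available set varies over time."""
        self._entries.pop(name, None)

    def discover(self, location):
        """Scenario 2: a device asks what is available at/near a location."""
        return sorted(n for n, loc in self._entries.items() if loc == location)
```

A third verb - something like place(data) returning where to upload - 
is what the scenario list is missing.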

4.  Edge Data Discovery

    How can we discover data on the edge and make use of it?  There are
    proprietary implementations that collect data from various databases
    and consolidate it for evaluation.  We need a standard protocol set
    for doing this data discovery, on the device or infrastructure edge,
    in order to meet the requirements of many use cases.  We will have
    terabytes of data on the edge and need a way to identify its
    existence and find the desired data.  A user needs to search
    for specific data in a data set and evaluate it using their
    own tools.

<DO> This seems a very different problem than anything addressed in this 
draft.

    The tools are outside the scope of this document, but the
    discovery of that data is in scope.


7.  Security Considerations

    Security considerations will be a critical component of edge data
    discovery particularly as intelligence is moved to the extreme edge
    where data is to be extracted.

<DO> This is content-free. There’s lots to say here. Hopefully future 
versions will fill this out. This is also the place to discuss privacy, 
which probably has more edge-specific wrinkles than security 
considerations do.