Re: [Coin] Some comments on draft-mcbride-edge-data-discovery-overview-01

"Schooler, Eve M" <eve.m.schooler@intel.com> Sat, 23 March 2019 23:18 UTC

From: "Schooler, Eve M" <eve.m.schooler@intel.com>
To: "daveoran@orandom.net" <daveoran@orandom.net>, "michael.mcbride@huawei.com" <michael.mcbride@huawei.com>, "cjbc@it.uc3m.es" <cjbc@it.uc3m.es>, Dirk Kutscher <ietf@dkutscher.net>
CC: "coin@irtf.org" <coin@irtf.org>, "Schooler, Eve M" <eve.m.schooler@intel.com>
Date: Sat, 23 Mar 2019 23:18:35 +0000
Archived-At: <https://mailarchive.ietf.org/arch/msg/coin/UK_elSsv-lXHvH-7MvkDLuW_RlE>

Hi Dave,
Thanks for your thoughtful comments. My responses are in-line below.
E.

From: David R. Oran [mailto:daveoran@orandom.net]
Sent: Monday, March 18, 2019 10:50 AM
To: Schooler, Eve M <eve.m.schooler@intel.com>; michael.mcbride@huawei.com; cjbc@it.uc3m.es
Cc: coin@irtf.org
Subject: Some comments on draft-mcbride-edge-data-discovery-overview-01


Thanks for an interesting draft - I hadn’t read the -00 version so this is the first chance I’ve had to think about it and offer some suggestions.

I appreciate the importance of the topic and think it’s worth doing research and some protocol design around. The document is well written and easy to follow - so thanks for that too. I have some detailed comments further along in this note, but I’d like to raise one pretty general conceptual difficulty I have with the problem statement.

It is not clear to me that “edge data discovery” has a crisp definition or boundary that distinguishes it from the other things that are needed to make edge computing work. There is actually a continuum from “searching for data”, through “discovering data” to “accessing data” and making a sharp division among these things in order to craft protocols may not be ideal for either generating valuable research or engineering protocols. Conceptually, the document proposes that I sharply separate:

- Searching for data: "I don't know exactly what I want to access, only a possibly partial description thereof."

- "Discovering" data: which in the context of this draft seems to mean "I know exactly what I want but not where it currently lives."

- Accessing data: "I know exactly what I want, please fetch it" - which may or may not require me to know where it lives beforehand (cf. the ICN/NDN discussion towards the end of the document).

[Eve] Point taken. The document is meant to raise the issue that if we are going to try to tackle compute-in-the-network, we must also tackle the data associated with that compute – both the marshalling of data at the outset and the placement of the resultant data afterwards. It would make better sense then to call this draft Edge Data (vs strictly Edge Data Discovery) and speak to each of the data concerns you raise. In fact, there is a data lifecycle that should be comprehended, which also includes placement, migration, expiration, all of which happen after the computation and which are facets of the Edge data management problem.

It may be that a specific data discovery protocol is the right way to bridge the gap between searching and access. Perhaps that’s how the document could help get a better handle on the problem, but the material needs to be more convincing than what this version supplies.

[Eve] Data discovery anchors this early I-D because, from an IETF standpoint, there are discovery services that exist in several other WGs/RGs (DNSSD, CoAP, T2T WISHI discussions re the W3C Thing Directory), and it would seem natural to scrutinize those directories/registries where support for Data might be added. That said, the more we debate the topic, the more convinced I am that Data lifecycle management should really be the focus of the draft, and the role of the I-D would be to lay out the problem statement - throughout the lifecycle of the Data.
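
[Eve] To make that concrete - and purely as a strawman, with every attribute name invented for this discussion rather than taken from any existing registry - here is roughly what it might look like to describe an edge data item as the kind of entry one of those directories could hold if Data support were added:

    # Strawman only: an edge data item rendered as a link-format-style
    # directory entry.  The attribute names ("data-format", "data-origin",
    # "data-expires") are invented here, not existing registry attributes.
    from dataclasses import dataclass, field

    @dataclass
    class EdgeDataEntry:
        uri: str                 # where the data can be fetched
        resource_type: str       # e.g., "edge.data.video"
        data_format: str         # e.g., "video/h264"
        origin: str              # producing device or edge node
        extra: dict = field(default_factory=dict)

        def to_link_string(self) -> str:
            # Render as a single directory record (loosely link-format-like).
            attrs = {"rt": self.resource_type,
                     "data-format": self.data_format,
                     "data-origin": self.origin,
                     **self.extra}
            return f"<{self.uri}>;" + ";".join(f'{k}="{v}"' for k, v in attrs.items())

    entry = EdgeDataEntry(
        uri="coap://edge-gw-7/data/cam3/seg42",
        resource_type="edge.data.video",
        data_format="video/h264",
        origin="factory7/cam3",
        extra={"data-expires": "2019-03-25T00:00Z"})
    print(entry.to_link_string())

The open question for each of the existing directories is which of these attributes they could absorb natively and which would need new registration semantics.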

At any rate, here are a few detailed comments, which I’ve done by snipping the relevant parts of the draft and inserting my suggestions:

——————————

                Overview of Edge Data Discovery

         draft-mcbride-edge-data-discovery-overview-01

Abstract

This document describes the problem of distributed data discovery in
edge computing. Increasing numbers of IoT devices and sensors are
generating a torrent of data that originates at the very edges of the
network and that flows upstream, if it flows at all. Sometimes that
data must be processed or transformed (transcoded, subsampled,
compressed, analyzed, annotated, combined, aggregated, etc.) on edge
equipment, particularly in places where multiple high bandwidth
streams converge and where resources are limited. Support for edge
data analysis is critical to make local, low-latency decisions (e.g.,
regarding predictive maintenance, the dispatch of emergency services,

<DO> I get the "local" part of this and in fact that, to me, is the distinguishing characteristic - it allows you not to depend on connectivity to the cloud to operate an application (possibly in a degraded mode), or to minimize communication resource usage. What I don't get, at least in most cases, is the "low latency" angle. In particular here, why would predictive maintenance require low latency? Or even emergency services, where the actual physical dispatch takes two or more orders of magnitude longer than any communication latency to the cloud.

[Eve] The text could easily be updated to read "local and/or low-latency", to emphasize that local is on equal footing with low-latency. However, I do not agree that, just because the dispatch of services takes many orders of magnitude longer than the event detection, the distribution of the detection event shouldn't care about low latency. In an emergency scenario where safety is of utmost importance, the detection of an event that warrants emergency services might trigger more than just the dispatch of emergency workers or repair people to the location. It may also trigger local entities (e.g., nearby traffic lights, a conveyor belt, a chip assembly process, an autonomous vehicle/robot, etc.) to halt or to change course to avoid exacerbating the problem. You are correct that predictive maintenance might not have the same level of urgency IF the arc of the maintenance degradation is long or the miscalibration of components is tolerable in terms of its wear on or cost to other components. We therefore should describe more fully the conditions under which predictive maintenance does not abide by those rules and so belongs in this list, or else move it somewhere that does not require the more nuanced explanation.

identity, authorization, etc.). In addition, (transformed) data may
be cached, copied and/or stored at multiple locations in the network
on route to its final destination. Although the data might originate
at the edge, for example in factories, automobiles, video cameras,
wind farms, etc., as more and more distributed data is created,
processed and stored, it becomes increasingly dispersed throughout
the network. There needs to be a standard way to find it. New and
existing protocols will need to be identified/developed/enhanced for
distributed data discovery at the network edge and beyond.

[snip]

1.     Introduction

Edge computing is an architectural shift that migrates Cloud
functionality (compute, storage, networking, control, data
management, etc.) out of the back-end data center to be more
proximate to the IoT data being generated and analyzed at the edges

<DO> why just IoT data - any data, right?

[Eve] Any data. We say IoT data only because IoT data and Edge data have, in many contexts, come to mean the same thing. We also say IoT data because it clearly differentiates the Data as not having originated in the Cloud. As a result of not being created in the Cloud, there are attendant problems with the management of the data.

of the network. Edge computing provides local compute, storage and
connectivity services, often required for latency- and bandwidth-
sensitive applications. Thus, Edge Computing plays a key role in
verticals such as Energy, Manufacturing, Automotive, Video Analytics,
Retail, Gaming, Healthcare, Mining, Buildings and Smart Cities.

<DO> here's an "intelligence test" question about this list. Which of these things is not like the others? :-) Which doesn't fit? OK, enough funnies - it's "Video Analytics". The others on this list are in fact verticals, but video analytics is an application, not a vertical.

[Eve] ☺. No intelligence here! Sometimes Video Analytics is used as a more “polite” term for Video Surveillance, which is considered its own Vertical. This is an easy edit.

1.1. Edge Data

Edge computing is motivated at least in part by the sheer volume of
data that is being created by IoT devices (sensors, cameras, lights,
vehicles, drones, wearables, etc.) at the very network edge and that
flows upstream, in a direction for which the network was not
originally provisioned. In fact, in dense IoT deployments (e.g.,

<DO> The problem is deeper than provisioning - a lot of important access network technologies are inherently asymmetric: DOCSIS, Cellular wireless, etc. No amount of provisioning will make data upload over the uplinks cheap compared to download.

[Eve] I propose “provisioned” --> “designed”. Provisioned is too loaded a word.

many video cameras are streaming high definition video), where
multiple data flows collect or converge at edge nodes, data is likely
to need transformation (transcoded, subsampled, compressed, analyzed,
annotated, combined, aggregated, etc.) to fit over the next hop link,
or even to fit in memory or storage. Note also that the act of
performing compute on the data creates yet another new data stream!

<DO> the original data streams are needed sometimes but not always. It’s important to call out these cases separately, because it affects both how you do forensics, and how you express data provenance.

[Eve] When you say “these cases” you mean the handling of original and new data? Part of my fascination with ICN is that it quite naturally supports the relationship between original and new data; it could preserve data lineage through the naming and the preservation of old and new data through native caching.
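
[Eve] To make the lineage idea a bit more concrete - this is only a toy sketch of name manipulation, not anything specific to an ICN implementation - a derived data item could carry its parent's name, so that old and new data stay linked:

    # Toy sketch: lineage-preserving names for derived data.  The name layout
    # (".../derived/<operation>") is invented here purely for illustration.
    def derive_name(parent_name: str, operation: str) -> str:
        """Name a new data item produced by running <operation> on the parent."""
        return f"{parent_name}/derived/{operation}"

    original   = "/factory7/cam3/video/seg42"
    transcoded = derive_name(original, "transcode-720p")
    summary    = derive_name(transcoded, "motion-summary")

    # Lineage is recoverable by walking the name back toward the original:
    # /factory7/cam3/video/seg42/derived/transcode-720p/derived/motion-summary
    print(summary)

Native caching would then keep both the parent and the derived item retrievable under their own names.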

In addition, data may be cached, copied and/or stored at multiple
locations in the network on route to its final destination. With an
increasing percentage of devices connecting to the Internet being
mobile, support for in-the-network caching and replication is
critical for continuous data availability, not to mention efficient
network and battery usage for endpoint devices.

<DO> I would not throw caching and replication together in one sentence - they in fact address quite different needs.

[Eve] Your comment raises the bigger question: what *are* the functions needed to support data creation of a more distributed nature, i.e., data arising in network contexts that are less managed than data centers, self-managed, or unmanaged. Basically, where is data to "go" after it is created, either as a new stream or as a repurposed stream that has had compute performed on it?

[snip]

Businesses, such as industrial companies, are starting to understand
how valuable the data is that they've kept in silos. Once this data
is made accessible on edge computing platforms, they may be able to
monetize the value of the data. But this will happen only if data

<DO> I don't see how this follows. Monetization is either attractive or not depending on the data ownership model; what about edge changes the equation compared to centralized?

[Eve] I agree that this text seems like a vestige from the earlier version. While data monetization is an interesting topic, it is not really part of the Background and thus shouldn’t be in this section.

can be discovered and searched among heterogeneous equipment in a
standard way. Discovering the data that is most useful to a given
market segment will be extremely useful in building business
revenues. Having a mechanism to provide this granular discovery is
the problem that needs solving either with existing, or new,
protocols.

[snip]

1.4. Terminology

o Edge: The edge encompasses all entities not in the back-end cloud.
The device edge is the boundary between digital and physical
entities in the last mile network.

<DO> not sure I follow this sentence or why it’s relevant.

[Eve] Agreed.

Sensors, gateways, compute nodes are included. The infrastructure edge includes equipment on the network operator side of the last mile network including cell towers, edge data centers, cable headends, etc. See Figure 1 for

<DO> I would include POPs in this list

[Eve] Agreed.

other possible tiers of edge clouds between the device edge and the back-end cloud data center.

o Edge Computing: Distributed computation that is performed near the
edge, where nearness is determined by the system requirements.
This includes high performance compute, storage and network
equipment on either the device or infrastructure edge.

<DO> I wonder if including the devices is a good idea here. I can see justifications for blurring the distinction between the edge and the end devices, but also some pretty strong justifications for treating them completely separately. Some examples:
- from a security standpoint, the device generating the original data is the natural custodian and the principal authenticating that data.
- from a privacy standpoint, devices holding data can form a natural privacy boundary, as opposed to infrastructure.
- In many (but of course not all) cases, mobility complexities are confined to devices since the infrastructure does not move rapidly

[Eve] These are interesting points you make. Edge computing originally was seen as the creation of a more proximate data center. However, there is now more of a dialog around how Edge computing has led to the proliferation of Edges. My intuition re this trend is that Edge computing in the extreme means that everything can offer itself as an Edge cloud / data center. In other words, every network element may have the capacity to offer (some or all of the) functions that would otherwise live in the back-end data center. Those functions include networking, compute, storage, control, and data management. The functions you call out - security, privacy, mobility - are all heavily influenced by the control point(s) for these Edges. So I wonder, if we look at your question slightly differently, whether we might get closer to an answer or at least a strong(er) opinion: do other kinds of Edges (the ones that are not a single device but federate multiple elements) need to comprehend security, privacy and mobility, and if so, in what ways do they solve the problem differently from the device Edge? Let's debate this further either here on the mailing list or in the upcoming COIN meeting this week... because I don't have a fully formed opinion yet.

o Edge Data Discovery: The process of finding required data from
edge entities, i.e., from databases, files systems, device memory
that might be physically distributed in the network, and
consolidating it or providing access to it logically as if it were
a single unified source, perhaps through its namespace, that can
be evaluated or searched.

o NDN: Named Data Networking. NDN routes data by name (vs address),
caches content natively in the network, and employs data-centric
security. Data discovery may require that data be associated with
a name or names, a series of descriptive attributes, and/or a
unique identifier.

<DO> Probably ought to cite CCNx as well (since it’s about to become an experimental RFC), and ICN approaches in general

[Eve] Yes, this section should be generalized more to talk about ICN / data-centric networks, and include a broader view of the available code bases and approaches (NDN, CCNx, hICN, MobilityFirst, etc).

[snip]

2.1. A Cloud-Edge Continuum

Although Edge Computing data typically originates at edge devices,
there is nothing that precludes edge data from being created anywhere
in the cloud-to-edge computing continuum (Figure 1). New edge data
may result as a byproduct of computation being performed on the data
stream anywhere along its path in the network. For example,
infrastructure edges may create new edge data when multiple data
streams converge upon this aggregation point and require
transformation to fit within the available resources. Edge data also

<DO> There might be other reasons than fitting within resources. For example
- smoothing of raw measurements to eliminate high-frequency noise
- obfuscation of data for privacy

[Eve]: Sure, we can cite these as additional examples.

may be sent to the back-end cloud as needed. Discovering data which
has been sent to the cloud is out of scope of this document, the
assumption being that the cloud boundary is one that does not expose
or publish the availability of its data.

<DO> Well cloud stuff might be out of scope, but saying that the cloud boundary doesn’t expose or publish data is just not quite right. Or that the way you discover data across a multi-cloud by definition has to be different from how you do it for the edge seems similarly off the mark.

[Eve] Your comments underscore the need to view the cloud-to-edge continuum as just that. Not all clouds will sequester their data. Not all edges will publish their data. The key point is that the mechanism(s) we offer to provide various data management functions shouldn't care which flavor of cloud/edge we interact with.

[snip]

Initially our focus is on discovery of edge data that resides at the
Device Edge and the Infrastructure Edge.

<DO> I still wonder if discovery of data on devices is somehow different because they are “authoritative” for the data they have. There also may be a peer-to-peer protocol angle here, which takes a particular (and possibly very different) view of discovery.

[Eve] I like the idea of differentiating entities with or without "authoritative" knowledge of the data. What would you consider caches, for example the content stores in an ICN? Because compute-in-the-network can happen almost anywhere, compute may happen on a device at the very leaves of the network or on a server in the middle of the network. Whether a leaf device or a middle-of-the-network server, both produce a new stream as a byproduct, and both are examples of authoritative sources. Authority might be something to track in the data's meta-data attributes. It seems related to the notion of lineage, which I also think is worth tracking in some manner.

[Eve] I also like the idea that discovery may look quite different if it is approached in a peer-to-peer fashion. P2P could be used not only by devices, but also by different peer edges/clouds that want to share their data repositories with each other. I think that ICN or other pub-sub-like mechanisms would be helpful in (possibly ephemeral) cloud-to-cloud sharing. For example, if ICN were to share information about the data already being held in its routing-layer caches (while preserving the data's access control policies), ICN could be an important and early mechanism toward an edge data discovery/management system.
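
[Eve] As a strawman for that cloud-to-cloud case - names and message shapes below are entirely made up, not a proposed wire format - the sharing might amount to peer edges advertising compact summaries of the data names they currently hold, which a neighbor can then match against a request before forwarding it:

    # Strawman: peer edges advertise a summary of cached data-name prefixes
    # (subject to access policy); a neighbor matches a request against the
    # summaries it has received to decide where the data might be found.
    import fnmatch

    class PeerEdge:
        def __init__(self, peer_id):
            self.peer_id = peer_id
            self.cached_names = set()        # names held in the local content store
            self.neighbor_summaries = {}     # peer_id -> advertised name patterns

        def advertise(self):
            # Summarize cached names as coarse prefixes rather than exact names.
            return {n.rsplit("/", 1)[0] + "/*" for n in self.cached_names}

        def receive_summary(self, peer_id, summary):
            self.neighbor_summaries[peer_id] = summary

        def where_might_i_find(self, name):
            # Peers whose advertised prefixes match the requested name.
            return [pid for pid, summary in self.neighbor_summaries.items()
                    if any(fnmatch.fnmatch(name, pat) for pat in summary)]

    a, b = PeerEdge("edge-A"), PeerEdge("edge-B")
    b.cached_names.add("/factory7/cam3/video/seg42/derived/transcode-720p")
    a.receive_summary("edge-B", b.advertise())
    print(a.where_might_i_find("/factory7/cam3/video/seg42/derived/transcode-720p"))
    # -> ['edge-B']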

2.2. Types of Edge Data

Besides sensor and measurement data accumulating throughout the edge
computing infrastructure, edge data may also take the form of
streaming data (from a camera),

<DO> the boundary between sensor/measurement data and streaming data is blurry and probably not too relevant for the problems being talked about here. The further distinctions below are relevant, as all of the stuff above is time-series (with or without a fixed clock) while the stuff below is not.

[Eve] The point we try to make here is to get the reader to understand that Data is not just classical constrained-IoT-device low-data-rate sensor measurements. Maybe we simply say that and contrast it with high(er)-frequency and high(er)-volume streaming data (as might be generated by a camera, etc.) as the next kind of data that might be encountered at the Edge, and then all the other types we list below. Regardless, I'll try to tease apart these more traditional classifications….

meta data (about the data), control
data (regarding an event that was triggered), and/or an executable
that embodies a function, service, or any other piece of code or
algorithm. Edge data also could be created after multiple streams
converge at the edge node and are processed, transformed, or
aggregated together in some manner.

SFC Data and meta-data discovery

Service function chaining (SFC) allows the instantiation of an
ordered set of service functions and subsequent "steering" of traffic
through them. Service functions provide a specific treatment of
received packets, therefore they need to be known so they can be used

<DO> need to be known to whom?

[Eve] This section on SFC Data and meta-data discovery should have appeared as a sub-sub-section, as an example of meta-data. However, it needs further work to be better integrated into the narrative.

in a given service composition via SFC. So far, how the SFs are
discovered and composed has been out of the scope of discussions in
IETF. While there are some mechanisms that can be used and/or
extended to provide this functionality, work needs to be done. An
example of this can be found in [I-D.bernardos-sfc-discovery].

In an SFC environment deployed at the edge, the discovery protocol
may also need to make available the following meta-data information
per SF:

o Service Function Type, identifying the category of SF provided.

o SFC-aware: Yes/No. Indicates if the SF is SFC-aware.

<DO> What exactly does this mean? is it functional or a particular protocol supported? The latter seems more useful.

[Eve] See my previous comment above.

o Route Distinguisher (RD): IP address indicating the location of
the SF(I).

<DO> Huh?

[Eve] See my previous comment above.

o Pricing/costs details.

o Migration capabilities of the SF: whether a given function can be
moved to another provider (potentially including information about
compatible providers topologically close).

o Mobility of the device hosting the SF, with e.g. the following
sub-options:

     Level: no, low, high; or a corresponding scale (e.g., 1 to 10).

     Current geographical area (e.g., GPS coordinates, post code).

     Target moving area (e.g., GPS coordinates, post code).

o Power source of the device hosting the SF, with e.g. the following
sub-options:

     Battery: Yes/No.  If Yes, the following sub-options could be defined:

     Capacity of the battery (e.g., mmWh).

     Charge status (e.g., %).

     Lifetime (e.g., minutes).

<DO> all of the above list seems pretty arbitrary and not specific to Service Functions.

[Eve] See my previous comment above.
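
[Eve] One more thought to help that rework along: purely as an illustration (the field names simply mirror the bullets above and are not a proposed schema), the per-SF meta-data could be bundled into a single discoverable record, which would also make it easier to argue about which fields are SF-specific and which are generic node properties:

    # Rough sketch only: the per-SF meta-data bullets above bundled into one
    # record a discovery protocol could carry.  Field names are illustrative.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class BatteryInfo:
        capacity_mwh: Optional[int] = None      # capacity of the battery
        charge_percent: Optional[int] = None    # charge status
        lifetime_minutes: Optional[int] = None  # remaining lifetime

    @dataclass
    class SFMetadata:
        sf_type: str                          # Service Function Type
        sfc_aware: bool                       # whether the SF is SFC-aware
        locator: str                          # IP address / location of the SF(I)
        pricing: Optional[str] = None         # pricing/costs details
        migratable_to: list = field(default_factory=list)
        mobility_level: str = "no"            # no / low / high
        current_area: Optional[str] = None    # e.g., GPS coordinates or post code
        target_area: Optional[str] = None
        battery: Optional[BatteryInfo] = None

    record = SFMetadata(sf_type="transcoder", sfc_aware=True,
                        locator="192.0.2.10",
                        battery=BatteryInfo(capacity_mwh=5000, charge_percent=80))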

[snip]

3.     Scenarios for Discovering Edge Data Resources

Mainly two types of situations need to be covered:

     *   A set of data resources appears (e.g., a mobile node hosting data joins a network) and they want to be discovered by an existing but possibly virtualized and/or ephemeral data directory infrastructure.

     *   A device wants to discover data resources available at or near its current location. As some of these resources may be mobile, the available set of edge data may vary over time.

<DO> what about discovering where in the edge infrastructure to upload data to, as opposed to just where to find it once it’s put there?

[Eve] I fully concur. As stated in my earlier response, the full-on Edge data lifecycle needs inclusion. We mention some of it in the Intro, but it needs more detailed attention in the next version.
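
[Eve] Purely for illustration - no wire format or API is implied - the two situations above could be thought of as two calls against the same (possibly virtualized and ephemeral) data directory: an announcement from newly arrived resources, and a location-scoped lookup from a device:

    # Illustrative only: the two scenarios as calls against a data directory.
    class DataDirectory:
        def __init__(self):
            self.entries = []   # list of dicts describing data resources

        def announce(self, name, holder, location, ttl_s=300):
            # Scenario 1: a (possibly mobile) node joins and announces its data.
            self.entries.append({"name": name, "holder": holder,
                                 "location": location, "ttl_s": ttl_s})

        def lookup_near(self, location):
            # Scenario 2: a device asks what data is available at/near it.
            # "Near" is naively modelled as an exact location-tag match here.
            return [e for e in self.entries if e["location"] == location]

    d = DataDirectory()
    d.announce("/factory7/cam3/video/seg42", holder="mobile-node-12",
               location="cell-tower-9")
    print(d.lookup_near("cell-tower-9"))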

4.     Edge Data Discovery

How can we discover data on the edge and make use of it? There are
proprietary implementations that collect data from various databases
and consolidate it for evaluation. We need a standard protocol set
for doing this data discovery, on the device or infrastructure edge,
in order to meet the requirements of many use cases. We will have
terabytes of data on the edge and need a way to identify its
existence and find the desired data. A user requires the need to
search for specific data in a data set and evaluate it using their
own tools.

<DO> This seems a very different problem than anything addressed in this draft.

[Eve] To what does “This” refer? The idea of creating a virtual database or a virtual shim, to federate and unify data that is located in physically distinct databases/memory/file systems, is exactly what we want to enable.
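
[Eve] A minimal sketch of that shim idea, with the backend interface invented just for this note, would be a catalog that fans a query out to physically distinct sources and presents one merged result set:

    # Minimal sketch of a "virtual shim": one logical catalog over physically
    # distinct data sources (a database, a file system, device memory, ...).
    # The backend interface is invented for discussion, not an existing API.
    class FederatedCatalog:
        def __init__(self, backends):
            self.backends = backends   # objects exposing .search(query) -> list

        def search(self, query):
            results = []
            for b in self.backends:
                try:
                    results.extend(b.search(query))
                except Exception:
                    # An unreachable edge node should not break the whole query.
                    continue
            return results

    class InMemoryBackend:
        def __init__(self, names):
            self.names = names
        def search(self, query):
            return [n for n in self.names if query in n]

    catalog = FederatedCatalog([
        InMemoryBackend(["/factory7/cam3/video/seg42"]),
        InMemoryBackend(["/windfarm2/turbine5/vibration/2019-03-23"]),
    ])
    print(catalog.search("vibration"))

The interesting engineering questions then become the common query/descriptor format and how the shim copes with sources that appear and disappear.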

The tools are outside the scope of this document, but the
discovery of that data is in scope.

7.     Security Considerations

Security considerations will be a critical component of edge data
discovery particularly as intelligence is moved to the extreme edge
where data is to be extracted.

<DO> This is content-free. There's lots to say here. Hopefully future versions will fill this out. This is also the place to discuss privacy, which probably has more edge-specific wrinkles than security considerations do.

[Eve] I fully agree. Onwards to the next version! Thanks again for your very helpful inputs.