[Coin] Some comments on draft-mcbride-edge-data-discovery-overview-01
"David R. Oran" <daveoran@orandom.net> Mon, 18 March 2019 17:50 UTC
From: "David R. Oran" <daveoran@orandom.net>
To: "Schooler, Eve M" <eve.m.schooler@intel.com>, michael.mcbride@huawei.com, cjbc@it.uc3m.es
Cc: coin@irtf.org
Date: Mon, 18 Mar 2019 13:49:47 -0400
Thanks for an interesting draft - I hadn’t read the -00 version so this is the first chance I’ve had to think about it and offer some suggestions. I appreciate the importance of the topic and think it’s worth doing research and some protocol design around. The document is well written and easy to follow - so thanks for that too.

I have some detailed comments further along in this note, but I’d like to raise one pretty general conceptual difficulty I have with the problem statement. It is not clear to me that “edge data discovery” has a crisp definition or boundary that distinguishes it from the other things that are needed to make edge computing work. There is actually a continuum from “searching for data”, through “discovering data”, to “accessing data”, and making a sharp division among these things in order to craft protocols may not be ideal for either generating valuable research or engineering protocols. Conceptually, the document proposes that I sharply separate:

- Searching for data: “I don’t know exactly what I want to access, only a possibly partial description thereof.”

- “Discovering” data: which in the context of this draft seems to mean “I know exactly what I want, but not where it currently lives.”

- Accessing data: “I know exactly what I want, please fetch it” - which may or may not require me to know where it lives beforehand (cf. the ICN/NDN discussion towards the end of the document).

It may be that a specific data discovery protocol is the right way to bridge the gap between searching and access. Perhaps that’s how the document could help get a better handle on the problem, but the material needs to be more convincing than what this version supplies.
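To illustrate what I mean - purely a sketch of my own, not anything the draft proposes - here is how the three operations might look if they were separate primitives of a hypothetical edge catalog. Every class, field, and name below is invented:

```python
# Hypothetical sketch of the search / discover / access continuum.
# All names and fields here are invented for illustration.
from dataclasses import dataclass, field


@dataclass
class DataItem:
    name: str          # exact name ("I know exactly what I want")
    location: str      # where the data currently lives
    attributes: dict = field(default_factory=dict)  # partial descriptions


class EdgeCatalog:
    """Toy directory that keeps the three operations distinct."""

    def __init__(self):
        self._items = []

    def publish(self, item):
        self._items.append(item)

    def search(self, **attrs):
        # "Searching": I only have a possibly partial description.
        return [i.name for i in self._items
                if all(i.attributes.get(k) == v for k, v in attrs.items())]

    def discover(self, name):
        # "Discovering" (in the draft's sense): exact name -> current location.
        for i in self._items:
            if i.name == name:
                return i.location
        return None

    def access(self, name):
        # "Accessing": fetch the object; a stand-in for a real transfer.
        location = self.discover(name)
        if location is None:
            raise KeyError(name)
        return "<contents of %s fetched from %s>" % (name, location)


catalog = EdgeCatalog()
catalog.publish(DataItem(name="/factory7/cam3/clip1",
                         location="edge-node-12",
                         attributes={"type": "video", "site": "factory7"}))

names = catalog.search(site="factory7")   # partial description -> names
where = catalog.discover(names[0])        # exact name -> location
```

The interesting question for the draft, put in these toy terms, is whether discover() deserves its own protocol or is just the degenerate case of search() with an exact name.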
At any rate, here are a few detailed comments, which I’ve done by snipping the relevant parts of the draft and inserting my suggestions:

——————————

Overview of Edge Data Discovery
draft-mcbride-edge-data-discovery-overview-01

Abstract

This document describes the problem of distributed data discovery in edge computing. Increasing numbers of IoT devices and sensors are generating a torrent of data that originates at the very edges of the network and that flows upstream, if it flows at all. Sometimes that data must be processed or transformed (transcoded, subsampled, compressed, analyzed, annotated, combined, aggregated, etc.) on edge equipment, particularly in places where multiple high bandwidth streams converge and where resources are limited. Support for edge data analysis is critical to make local, low-latency decisions (e.g., regarding predictive maintenance, the dispatch of emergency services,

<DO> I get the “local” part of this, and in fact that to me is the distinguishing characteristic - it allows you to not depend on connectivity to the cloud to operate an application (possibly in a degraded mode), or to minimize communication resource usage. What I don’t get, at least in most cases, is the “low latency” angle. In particular here, why would predictive maintenance require low latency? Or even emergency services, where the actual physical dispatch takes two orders of magnitude longer or more than any communication latency to the cloud.

identity, authorization, etc.). In addition, (transformed) data may be cached, copied and/or stored at multiple locations in the network on route to its final destination. Although the data might originate at the edge, for example in factories, automobiles, video cameras, wind farms, etc., as more and more distributed data is created, processed and stored, it becomes increasingly dispersed throughout the network. There needs to be a standard way to find it.
New and existing protocols will need to be identified/developed/enhanced for distributed data discovery at the network edge and beyond.

[snip]

1. Introduction

Edge computing is an architectural shift that migrates Cloud functionality (compute, storage, networking, control, data management, etc.) out of the back-end data center to be more proximate to the IoT data being generated and analyzed at the edges

<DO> why just IoT data - any data, right?

of the network. Edge computing provides local compute, storage and connectivity services, often required for latency- and bandwidth- sensitive applications. Thus, Edge Computing plays a key role in verticals such as Energy, Manufacturing, Automotive, Video Analytics, Retail, Gaming, Healthcare, Mining, Buildings and Smart Cities.

<DO> here’s an “intelligence test” question about this list: which of these things is not like the others? :-) Which doesn’t fit? Ok, enough funnies - it’s “Video Analytics”. The others on this list are in fact verticals, but video analytics is an application, not a vertical.

1.1. Edge Data

Edge computing is motivated at least in part by the sheer volume of data that is being created by IoT devices (sensors, cameras, lights, vehicles, drones, wearables, etc.) at the very network edge and that flows upstream, in a direction for which the network was not originally provisioned. In fact, in dense IoT deployments (e.g.,

<DO> The problem is deeper than provisioning - a lot of important access network technologies are inherently asymmetric: DOCSIS, cellular wireless, etc. No amount of provisioning will make data upload over the uplinks cheap compared to download.

many video cameras are streaming high definition video), where multiple data flows collect or converge at edge nodes, data is likely to need transformation (transcoded, subsampled, compressed, analyzed, annotated, combined, aggregated, etc.) to fit over the next hop link, or even to fit in memory or storage.
Note also that the act of performing compute on the data creates yet another new data stream!

<DO> the original data streams are needed sometimes but not always. It’s important to call out these cases separately, because it affects both how you do forensics, and how you express data provenance.

In addition, data may be cached, copied and/or stored at multiple locations in the network on route to its final destination. With an increasing percentage of devices connecting to the Internet being mobile, support for in-the-network caching and replication is critical for continuous data availability, not to mention efficient network and battery usage for endpoint devices.

<DO> I would not throw caching and replication together in one sentence - they in fact address quite different needs.

[snip]

Businesses, such as industrial companies, are starting to understand how valuable the data is that they've kept in silos. Once this data is made accessible on edge computing platforms, they may be able to monetize the value of the data. But this will happen only if data

<DO> I don’t see how this follows. Monetization is either attractive or not depending on the data ownership model; what about edge changes the equation compared to centralized?

can be discovered and searched among heterogeneous equipment in a standard way. Discovering the data, that its most useful to a given market segment, will be extremely useful in building business revenues. Having a mechanism to provide this granular discovery is the problem that needs solving either with existing, or new, protocols.

[snip]

1.4. Terminology

o Edge: The edge encompasses all entities not in the back-end cloud. The device edge is the boundary between digital and physical entities in the last mile network.

<DO> not sure I follow this sentence or why it’s relevant.

Sensors, gateways, compute nodes are included.
The infrastructure edge includes equipment on the network operator side of the last mile network including cell towers, edge data centers, cable headends, etc. See Figure 1 for

<DO> I would include POPs in this list

other possible tiers of edge clouds between the device edge and the back-end cloud data center.

o Edge Computing: Distributed computation that is performed near the edge, where nearness is determined by the system requirements. This includes high performance compute, storage and network equipment on either the device or infrastructure edge.

<DO> I wonder if including the devices is a good idea here. I can see justifications for blurring the distinction between the edge and the end devices, but also some pretty strong justifications for treating them completely separately. Some examples:
- from a security standpoint, the device generating the original data is the natural custodian and the principal authenticating that data.
- from a privacy standpoint, devices holding data can form a natural privacy boundary, as opposed to infrastructure.
- in many (but of course not all) cases, mobility complexities are confined to devices, since the infrastructure does not move rapidly.

o Edge Data Discovery: The process of finding required data from edge entities, i.e., from databases, files systems, device memory that might be physically distributed in the network, and consolidating it or providing access to it logically as if it were a single unified source, perhaps through its namespace, that can be evaluated or searched.

o NDN: Named Data Networking. NDN routes data by name (vs address), caches content natively in the network, and employs data-centric security. Data discovery may require that data be associated with a name or names, a series of descriptive attributes, and/or a unique identifier.

<DO> Probably ought to cite CCNx as well (since it’s about to become an experimental RFC), and ICN approaches in general.

[snip]

2.1. A Cloud-Edge Continuum

Although Edge Computing data typically originates at edge devices, there is nothing that precludes edge data from being created anywhere in the cloud-to-edge computing continuum (Figure 1). New edge data may result as a byproduct of computation being performed on the data stream anywhere along its path in the network. For example, infrastructure edges may create new edge data when multiple data streams converge upon this aggregation point and require transformation to fit within the available resources. Edge data also

<DO> There might be other reasons than fitting within resources. For example:
- smoothing of raw measurements to eliminate high-frequency noise
- obfuscation of data for privacy

may be sent to the back-end cloud as needed. Discovering data which has be sent to the cloud is out of scope of this document, the assumption being that the cloud boundary is one that does not expose or publish the availability of its data.

<DO> Well, cloud stuff might be out of scope, but saying that the cloud boundary doesn’t expose or publish data is just not quite right. Or that the way you discover data across a multi-cloud by definition has to be different from how you do it for the edge seems similarly off the mark.

[snip]

Initially our focus is on discovery of edge data that resides at the Device Edge and the Infrastructure Edge.

<DO> I still wonder if discovery of data on devices is somehow different because they are “authoritative” for the data they have. There also may be a peer-to-peer protocol angle here, which takes a particular (and possibly very different) view of discovery.

2.2. Types of Edge Data

Besides sensor and measurement data accumulating throughout the edge computing infrastructure, edge data may also take the form of streaming data (from a camera),

<DO> the boundary between sensor/measurement data and streaming data is blurry and probably not too relevant for the problems being talked about here.
The further distinctions below are relevant, as all of the stuff above is time-series (with or without a fixed clock) while the stuff below is not.

meta data (about the data), control data (regarding an event that was triggered), and/or an executable that embodies a function, service, or any other piece of code or algorithm. Edge data also could be created after multiple streams converge at the edge node and are processed, transformed, or aggregated together in some manner.

SFC Data and meta-data discovery

Service function chaining (SFC) allows the instantiation of an ordered set of service functions and subsequent "steering" of traffic through them. Service functions provide a specific treatment of received packets, therefore they need to be known so they can be used

<DO> need to be known to whom?

in a given service composition via SFC. So far, how the SFs are discovered and composed has been out of the scope of discussions in IETF. While there are some mechanisms that can be used and/or extended to provide this functionality, work needs to be done. An example of this can be found in [I-D.bernardos-sfc-discovery]. In an SFC environment deployed at the edge, the discovery protocol may also need to make available the following meta-data information per SF:

o Service Function Type, identifying the category of SF provided.

o SFC-aware: Yes/No. Indicates if the SF is SFC-aware.

<DO> What exactly does this mean? Is it functional or a particular protocol supported? The latter seems more useful.

o Route Distinguisher (RD): IP address indicating the location of the SF(I).

<DO> Huh?

o Pricing/costs details.

o Migration capabilities of the SF: whether a given function can be moved to another provider (potentially including information about compatible providers topologically close).

o Mobility of the device hosting the SF, with e.g. the following sub-options: Level: no, low, high; or a corresponding scale (e.g., 1 to 10).
Current geographical area (e.g., GPS coordinates, post code). Target moving area (e.g., GPS coordinates, post code).

o Power source of the device hosting the SF, with e.g. the following sub-options: Battery: Yes/No. If Yes, the following sub-options could be defined: Capacity of the battery (e.g., mmWh). Charge status (e.g., %). Lifetime (e.g., minutes).

<DO> all of the above list seems pretty arbitrary and not specific to Service Functions.

[snip]

3. Scenarios for Discovering Edge Data Resources

Mainly two types of situations need to be covered:

1. A set of data resources appears (e.g., a mobile node hosting data joins a network) and they want to be discovered by an existing but possibly virtualized and/or ephemeral data directory infrastructure.

2. A device wants to discover data resources available at or near its current location. As some of these resources may be mobile, the available set of edge data may vary over time.

<DO> what about discovering where in the edge infrastructure to upload data to, as opposed to just where to find it once it’s put there?

4. Edge Data Discovery

How can we discover data on the edge and make use of it? There are proprietary implementations that collect data from various databases and consolidate it for evaluation. We need a standard protocol set for doing this data discovery, on the device or infrastructure edge, in order to meet the requirements of many use cases. We will have terabytes of data on the edge and need a way to identify its existence and find the desired data. A user requires the need to search for specific data in a data set and evaluate it using their own tools.

<DO> This seems a very different problem than anything addressed in this draft.

The tools are outside the scope of this document, but the discovery of that data is in scope.

7. Security Considerations

Security considerations will be a critical component of edge data discovery particularly as intelligence is moved to the extreme edge where data is to be extracted.

<DO> This is content free. There’s lots to say here - hopefully future versions will fill this out. This is also the place to discuss privacy, which probably has more edge-specific wrinkles than security considerations do.
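One closing footnote on the SFC meta-data list quoted above: to see whether the attribute list hangs together, I tried writing it down as a record. This is purely my own sketch - every field name, type, and value is invented, and nothing here is specified by the draft:

```python
# Purely illustrative sketch of an advertised per-SF meta-data record,
# following the attribute list quoted from the draft. All field names,
# types, and values are my invention.
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class Battery:
    capacity_mwh: int   # capacity of the battery
    charge_pct: int     # charge status
    lifetime_min: int   # remaining lifetime


@dataclass
class SFMetadata:
    sf_type: str                 # Service Function Type (category of SF)
    sfc_aware: bool              # SFC-aware: yes/no
    route_distinguisher: str     # RD: IP address locating the SF(I)
    price: str                   # pricing/costs details
    migratable_to: list = field(default_factory=list)  # compatible providers
    mobility_level: int = 1      # e.g. scale of 1 (static) to 10 (mobile)
    current_area: str = ""       # GPS coordinates or post code
    battery: Optional[Battery] = None  # None when mains-powered


advert = SFMetadata(
    sf_type="firewall",
    sfc_aware=True,
    route_distinguisher="192.0.2.17",
    price="0.05 USD/hour",
    migratable_to=["provider-a"],
    mobility_level=1,
    current_area="SW1A 1AA",
)
record = asdict(advert)   # what a discovery response might carry
```

Writing it out this way reinforces my comment above: apart from the SF type, SFC-awareness, and the RD, most of these fields describe the hosting device rather than the Service Function itself.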