[OPSAWG] Roman Danyliw's Discuss on draft-ietf-opsawg-ntf-11: (with DISCUSS and COMMENT)

Roman Danyliw via Datatracker <noreply@ietf.org> Wed, 01 December 2021 02:09 UTC

MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: Roman Danyliw via Datatracker <noreply@ietf.org>
To: The IESG <iesg@ietf.org>
Cc: draft-ietf-opsawg-ntf@ietf.org, opsawg-chairs@ietf.org, opsawg@ietf.org, ludwig@clemm.org, ludwig@clemm.org
Auto-Submitted: auto-generated
Precedence: bulk
Reply-To: Roman Danyliw <rdd@cert.org>
Message-ID: <163832458350.9944.17802735924974626797@ietfa.amsl.com>
Date: Tue, 30 Nov 2021 18:09:44 -0800
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/LB-7Wfm6YUOog4HeMeArtVCLfT4>
Subject: [OPSAWG] Roman Danyliw's Discuss on draft-ietf-opsawg-ntf-11: (with DISCUSS and COMMENT)

Roman Danyliw has entered the following ballot position for
draft-ietf-opsawg-ntf-11: Discuss

When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)


Please refer to https://www.ietf.org/blog/handling-iesg-ballot-positions/
for more information about how to handle DISCUSS and COMMENT positions.


The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-opsawg-ntf/



----------------------------------------------------------------------
DISCUSS:
----------------------------------------------------------------------

Thank you for being responsive to the SECDIR review thread to improve the
security considerations text.  Specifically,
https://mailarchive.ietf.org/arch/msg/secdir/GUvFWXP7n9IjXW8xlIdMS5ZE5u0/.

Even after these edits, there are a few straightforward ambiguities to clear up.

(a) Section 2.  “When a network's endpoints do not represent individual users
(e.g. in industrial, datacenter, and infrastructure contexts), network
operations can often benefit from large-scale data collection without breaching
user privacy.”

Is network telemetry architecture being restricted to such a limited
applicability?  To quote the original SECDIR thread, is this saying “The
Network Telemetry Framework is not applicable to networks whose endpoints
represent individual users, such as general-purpose access networks”?  If so,
I’d recommend being that explicit.

(b) Section 2.1.  “To preserve user privacy, the user packet content should not
be collected.” This is a great principle, but extremely nuanced and potentially
complicated to implement.  Is this saying (using the words of this framework),
“To preserve the privacy of end-users, no user packet content should be
collected.  Specifically, the data objects generated, exported, and collected
by the Network Telemetry Framework should not include any packet payload from
traffic associated with end-users systems”?

(c) Section 2.5.  Please use stronger and consistent language.

OLD
Disclaimer: large-scale network data collection is a major threat to
user privacy [RFC7258].  The network telemetry framework presented in
this document should not be applied to collect and retain individual
user data or any data that can identify end users without consent.
Any data collection or retention using the framework must be tightly
limited to protect user privacy.

NEW
Large-scale network data collection is a major threat to user privacy and may
be indistinguishable from pervasive monitoring [RFC7258].  The network
telemetry framework presented in this document must not be applied to
generating, exporting, collecting, analyzing or retaining individual user data
or any data that can identify end users or characterize their behavior without
consent.

The principles described in (a), (b) and (c) seem sufficiently important that
they shouldn’t be scattered across the document.  Please either make an
applicability statement section early in the document or a dedicated privacy
consideration section.


----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

(Apologies if any of the section numbers below are wrong.  I conducted most of
my review on -10 and then -11 was published, which renumbered the document.)

Thanks to Alexey Melnikov for the SECDIR review.

I'm a bit confused by the framing of this document.  It seems to me to be
suggesting that “OAM” is tied to a series of static technologies and
practices, and a set of new practices called “network telemetry” are needed.  I
don’t disagree with the idea that network management practices need to evolve,
and that the “networks of the future” will look different than today.  Relying
on BCP 161 (RFC 6291), I took OAM to mean an evolving set of practices and
technology.  Using Section 3 of BCP 161, O + A + M seemed like a contextual set
of operations that would be done now and still required in networks of the
future.  The document acknowledges that there is some ambiguity in “network
telemetry”.  I think it needs to equally acknowledge that the same is true of
OAM, and that RFC7276 is not OAM.  In the aggregate, I don’t think the text
realizes the clarity that it set out to provide by defining “key
characteristics of network telemetry which set a clear distinction from the
conventional network OAM and show that some conventional OAM technologies can
be considered a subset of the network telemetry technologies.”.  To be clear,
I’m not raising an objection to many of the properties linked to network
telemetry.  Instead, I think the clarity of message is getting diluted because
a very particular distinction is trying to be made (OAM vs. network telemetry)
and it isn’t clear.  See below for specifics.

** Section 1.
… using a wide variety of techniques including machine learning, data analysis,
and correlation.

ML, data analysis and correlation are unlike things.  ML is a particular AI
technique, data analysis is a generic description of an activity, and is
correlation intended to be a statistical technique?

** Section 1
   Network telemetry extends beyond the historical network Operations,
   Administration, and Management (OAM) techniques and expects to
   support better flexibility, scalability, accuracy, coverage, and
   performance.

This seems hypothetical, depending on the definition of which technologies are
considered in scope of network telemetry and OAM.

** Section 2.

Today one can access advanced big data analytics capability through a
   plethora of commercial and open source platforms (e.g., Apache
   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
   learning).  Thanks to the advance of computing and storage
   technologies, network big data analytics gives network operators an
   opportunity to gain network insights and move towards network
   autonomy.
In trying to contextualize this observation, where is this capability relative
to Figure 1?  In general, I would recommend using this reference architecture
when assessing the ecosystem.

** Section 2.

However, while the data processing capability is improved and
   applications are hungry for more data ...

What does it mean to be “hungry for more data”, and which applications are?  Is
a reference possible here?

** Section 2.  Editorial.  s/concerned in the context/relevant in this document/

** Section 2.1
Less but higher quality data are often better
   than lots of low quality data.

This seems like a broad generalization that doesn’t consider the application
and the cost of acquisition or processing.

** Section 2.2.

The ultimate goal is to achieve the
      ideal security with no, or only minimal, human intervention.

What is “ideal” security?

** Section 2.2.
While machine learning technologies can be used for
      root cause analysis, it up to the network to sense and provide the
      relevant diagnostic data which are either actively fed into, or
      passively retrieved by, machine learning applications.

This text is asymmetric with the other bullets since they don’t discuss
specific techniques.  Personally, it also seems odd to include this text as
there are other ways to do root cause analysis beyond ML (to include other AI
approaches).

** Section 2.3
   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network.  Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting.
...
   These challenges were addressed by newer standards and techniques
   (e.g., IPFIX/Netflow, PSAMP, IOAM, and YANG-Push) and more are
   emerging.  These standards and techniques need to be recognized and
   accommodated in a new framework.

This section is an exemplar of the disconnect I noted in the definitions of
OAM.  The first paragraph presents a narrow view of currently used (albeit
older) network monitoring technologies (SNMP, CLI, Syslog).  However, in the
closing paragraph, the text names more modern technologies I would also
consider OAM, and these technologies could meet some of the challenges
mentioned in this section.  Furthermore, some of these “newer standards” are
framed as things that need to be “recognized”.  This is puzzling because my
understanding was that technologies like IPFIX/Netflow have been very widely
deployed for quite some time now.  What’s the new framework needed?

** Section 2.4
Network telemetry covers the conventional network OAM and
   has a wider scope.

Can the text be more specific about the way in which network telemetry is
wider?  I thought OAM was rather ambiguous.

** Section 2.4
Hence, the network telemetry can directly
   trigger the automated network operation, while in contrast some
   conventional OAM tools are designed and used to help human operators
   to monitor and diagnose the networks and guide manual network
   operations.

I’m not sure if this is a fair generalization.  Even “older technologies” like
SNMP currently trigger automated responses based on the values they return.

** Section 2.4.  Per “data fusion,” in which part of Figure 1 is this
happening?

** Section 2.5.

Network data analytics and machine-learning technologies are applied
   for network operation automation, relying on abundant and coherent
   data from networks.

-- What is the difference between a network data analytics system and ML
technologies?  Isn’t analytics a superset of ML?

-- What is coherent data?

** Section 2.5.
In detail, such a framework would benefit application
   development for the following reasons:

It might be helpful to level set what an application is in this context.  Is
this the “network operations application” of Figure 1?

** Section 2.5
All the use cases and
      applications are better to be supported uniformly and coherently
      under a single intelligent agent

-- Editorial.  There is a missing word which leads to this sentence not parsing.

-- What’s the basis for asserting that a “single intelligent agent” is the 
best approach?

-- Maybe the issue is one of semantics: what is an “intelligent agent” in this
context?

** Section 2.5.

Network visibility presents multiple viewpoints

and

Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

Are these generalizations expected to be true across the broad use cases?

** Figure 2.  For the management plane, the data model module has MIB and
syslog listed, but the data encodings are listed as GPB, JSON and XML.  These
data models and encodings don’t line up (i.e., MIBs and syslog typically don’t
rely on GPB, JSON or XML).

** Section 3.1.  Where do network security applications such as WAFs, IDS/IPS/
NGF, DLP, web-proxies, and pDNS fit into this taxonomy?

** Section 3.1.* These sections inconsistently describe properties/requirements
for an architectural element and the challenges of (but no solutions or
requirements for) a given element.  As a result, I had trouble understanding
what an implementer should take away about these components.  It would have
been clearer if the different modules had common and module-specific
requirements.

** Section 3.1.1.  Per the requirements of “Convenient Data Subscription”,
“Structured Data”, etc. why wouldn’t those be desirable requirements for all
four of the modules?

** Section 3.1.3.  Providing “timely data” and “structured data” seem like
restatements of Section 4.1.1’s “structure data” and “high speed transport”.
Is this a common requirement?

** Section 3.1.3.  Why wouldn’t it be desirable for all of the modules to
support the incremental deployment noted here?

** Section 3.2.
   *  Data Query, Analysis, and Storage: This component works at the
      application layer.

I need a bit of topological orientation.  What would the application layer of,
say, a “forwarding plane” or “external data” be?  What are the other layers?

** Section 5.  Recommend explicitly saying that this document doesn’t define
specific technologies to shift the responsibility of specific considerations.

OLD
   Security considerations for networks that use telemetry methods may
   include:

NEW

This document proposes a conceptual architecture for collecting, transporting,
and analyzing a wide variety of data sources in support of network
applications.  The protocols, data formats, and configurations chosen to
implement this framework will dictate the specific Security Considerations. 
These considerations may include:

** Section 5.

OLD
   *  Telemetry data stores, storage encryption and methods of access;

NEW
   *  Telemetry data stores, storage encryption, methods of access, and
   retention practices.
