[OPSAWG] Tsvart last call review of draft-ietf-opsawg-ntf-09

Michael Scharf via Datatracker <noreply@ietf.org> Sun, 31 October 2021 23:24 UTC

Reviewer: Michael Scharf
Review result: Ready with Issues

This document has been reviewed as part of the transport area review team's
ongoing effort to review key IETF documents. These comments were written
primarily for the transport area directors, but are copied to the document's
authors and WG to allow them to address any issues raised and also to the IETF
discussion list for information.

When done at the time of IETF Last Call, the authors should consider this
review as part of the last-call comments they receive. Please always CC
tsv-art@ietf.org if you reply to or forward this review.

This informational document describes an architectural framework for network
telemetry and the main components of corresponding systems.

It has two issues related to TSV topics:

First, the document lacks a discussion of the importance of congestion control
for telemetry traffic as well as corresponding references, e.g., to RFC 8085.
High-volume telemetry traffic can overload a network unless proper
counter-measures are in place (i.e., at minimum "circuit breakers"). It doesn't
seem appropriate to entirely ignore that issue.

Second, the language around the ambiguous term "transport" and the references
to Internet transport protocols must be improved to be consistent with IETF
standards.

Below are some examples of sections in which these issues are evident.

Section 3.4

   It is worth noting that a network telemetry system should not be
   intrusive to normal network operations by avoiding the pitfall of the
   "observer effect".  That is, it should not change the network
   behavior and affect the forwarding performance.  Otherwise, the whole
   purpose of network telemetry is compromised.

=> This statement should be extended to be very explicit about the risk of
causing network congestion by high-volume telemetry traffic unless proper
isolation or traffic engineering techniques are in place, or congestion control
mechanisms ensure that telemetry traffic backs off if it exceeds the network
capacity. RFC 8085 is a relevant BCP in this space. As a side note, RFC 8085
discusses other relevant challenges as well, but the issues caused by
potentially inelastic high-volume telemetry traffic seem particularly relevant
for ensuring network stability when telemetry solutions get deployed.
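
To illustrate what "backs off" could mean for an exporter, here is a minimal
sketch of an additive-increase/multiplicative-decrease (AIMD) controller for a
telemetry export rate. It is not taken from the draft or from RFC 8085; the
class name, the parameters, and the assumption that the collector reports loss
back to the exporter are all hypothetical.

```python
class AimdExportRate:
    """Hypothetical AIMD controller for a telemetry export rate (records/s)."""

    def __init__(self, initial_rate=100.0, min_rate=1.0, max_rate=10_000.0,
                 increase=10.0, decrease_factor=0.5):
        self.rate = initial_rate
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.increase = increase                # additive step when no loss is seen
        self.decrease_factor = decrease_factor  # multiplicative cut when loss is seen

    def on_feedback(self, loss_detected):
        """Update the rate from one (assumed) collector feedback report."""
        if loss_detected:
            # Back off quickly when the path shows signs of congestion.
            self.rate = max(self.min_rate, self.rate * self.decrease_factor)
        else:
            # Probe slowly for spare capacity.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```

A real exporter would more likely rely on a congestion-controlled transport
than reimplement this logic; the sketch only shows the expected behavior.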

4.1.  Top Level Modules

   +---------+--------------+--------------+---------------+-----------+
   | Module  | Management   | Control      | Forwarding    | External  |
   |         | Plane        | Plane        | Plane         | Data      |
   +---------+--------------+--------------+---------------+-----------+
   |Object   | config. &    | control      | flow & packet | terminal, |
   |         | operation    | protocol &   | QoS, traffic  | social &  |
   |         | state        | signaling,   | stat., buffer | environ-  |
   |         |              | RIB          | & queue stat.,| mental    |
   |         |              |              | ACL, FIB      |           |
   +---------+--------------+--------------+---------------+-----------+
   |Export   | main control | main control | fwding chip   | various   |
   |Location | CPU          | CPU,         | or linecard   |           |
   |         |              | linecard CPU | CPU; main     |           |
   |         |              | or forwarding| control CPU   |           |
   |         |              | chip         | unlikely      |           |
   +---------+--------------+--------------+---------------+-----------+
   |Data     | YANG, MIB,   | YANG,        | template,     | YANG,     |
   |Model    | syslog       | custom       | YANG,         | custom    |
   |         |              |              | custom        |           |
   +---------+--------------+--------------+---------------+-----------+
   |Data     | GPB, JSON,   | GPB, JSON,   | plain         | GPB, JSON |
   |Encoding | XML          | XML, plain   |               | XML, plain|
   +---------+--------------+--------------+---------------+-----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror,| gRPC      |
   |         |              | IPFIX, mirror| gRPC, NETFLOW |           |
   +---------+--------------+--------------+---------------+-----------+
   |Transport| HTTP, TCP    | HTTP, TCP,   | UDP           | HTTP,TCP  |
   |         |              | UDP          |               | UDP       |
   +---------+--------------+--------------+---------------+-----------+

=> This table needs to be corrected.

1/ At least the "transport" entry in the "forwarding plane" column seems
incorrect: it lists only UDP, but the IETF has standardized IPFIX over TCP,
UDP, and SCTP.

2/ The label "transport" in the last row should be replaced by another term
(maybe "data transport"?). In the TCP/IP protocol stack, HTTP is an application
protocol, not a transport protocol like TCP and UDP. The row label should
therefore use a term that cannot be confused with the name of a layer in the
TCP/IP protocol stack.

3/ The label "protocol" in the second-to-last row is also misleading, since all
entries in the "transport" row are protocols as well. "Application protocol"
may be one option; others may exist as well.
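
The layering distinction can be made concrete with a small lookup structure.
This is only an illustrative sketch: the dictionary and function names are
hypothetical, and the entries cover just the mappings I am sure of (IPFIX over
SCTP, TCP, and UDP per RFC 7011; gRPC over HTTP/2 over TCP; NETCONF over SSH or
TLS over TCP).

```python
# Illustrative mapping: application protocol -> (intermediate layers, transports).
# The names below are hypothetical and not taken from the draft.
TELEMETRY_PROTOCOL_STACK = {
    "IPFIX": ([], ["SCTP", "TCP", "UDP"]),        # RFC 7011
    "gRPC": (["HTTP/2"], ["TCP"]),
    "NETCONF": (["SSH", "TLS"], ["TCP"]),         # RFC 6242, RFC 7589
}


def transports_for(protocol):
    """Return the sorted transport protocols an application protocol can use."""
    _intermediate, transports = TELEMETRY_PROTOCOL_STACK[protocol]
    return sorted(transports)


def is_transport(name):
    """True only for names that appear at the transport layer in the mapping."""
    return any(name in transports
               for _layers, transports in TELEMETRY_PROTOCOL_STACK.values())
```

With such a split, HTTP/2 appears as an intermediate layer and never as a
transport, which is the separation the table rows should reflect.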

4.1.1.  Management Plane Telemetry

   *  High Speed Data Transport: In order to keep up with the velocity
      of information, a server needs to be able to send large amounts of
      data at high frequency.  Compact encoding formats or data
      compression schemes are needed to reduce the quantity of data and
      improve the data transport efficiency.  The subscription mode, by
      replacing the query mode, reduces the interactions between clients
      and servers and helps to improve the server's efficiency.

=> The server is not the only bottleneck. This section needs to discuss the
network as a potential bottleneck as well, and explain that a telemetry
solution must protect the network from congestion by congestion control
mechanisms or at least circuit breakers. RFC 8085 is a relevant BCP in this
space.
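
As a rough sketch of the "at least circuit breakers" option, the following
fragment trips after a configurable number of consecutive measurement intervals
with excessive loss and then blocks further export until it is reset. The class,
the thresholds, and the assumption that sent and received counts are available
per interval are hypothetical; RFC 8084 describes the actual design
considerations for transport circuit breakers.

```python
class TelemetryCircuitBreaker:
    """Hypothetical loss-based circuit breaker for a telemetry export stream."""

    def __init__(self, loss_threshold=0.1, trip_after=3):
        self.loss_threshold = loss_threshold  # tolerated loss ratio per interval
        self.trip_after = trip_after          # consecutive bad intervals before tripping
        self.bad_intervals = 0
        self.tripped = False

    def report_interval(self, sent, received):
        """Record one measurement interval; return True if the breaker is tripped."""
        if self.tripped or sent == 0:
            return self.tripped
        loss = 1.0 - received / sent
        if loss > self.loss_threshold:
            self.bad_intervals += 1
        else:
            self.bad_intervals = 0
        if self.bad_intervals >= self.trip_after:
            self.tripped = True
        return self.tripped

    def may_send(self):
        """The exporter should check this before sending further telemetry."""
        return not self.tripped

    def reset(self):
        """Re-arm the breaker, e.g. after operator intervention."""
        self.bad_intervals = 0
        self.tripped = False
```

A breaker like this is a last-resort safeguard and does not replace congestion
control; it only bounds the damage a persistently overloading flow can cause.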

4.1.2.  Control Plane Telemetry

=> This section lacks a discussion of the risk of congestion caused by
telemetry protocols without congestion control (e.g., using UDP, possibly
without circuit breakers).

4.1.3.  Forwarding Plane Telemetry

   *  The data plane devices must provide timely data with the minimum
      possible delay.  Long processing, transport, storage, and analysis
      delay can impact the effectiveness of the control loop and even
      render the data useless.

=> As in the previous section, this wording entirely ignores the impact of
potential network capacity shortages and congestion. A reference to RFC 8085
and a corresponding discussion of how to meet its requirements are missing.

4.1.4.  External Data Telemetry

=> As communication with "external" entities outside the boundary of a
provider network may be realized over the Internet, the risk of congestion and
the need for proper counter-measures are even more relevant in this section
than in the previous sections.