[OPSAWG] Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11

Bob Briscoe via Datatracker <noreply@ietf.org> Mon, 26 September 2022 10:53 UTC

MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
From: Bob Briscoe via Datatracker <noreply@ietf.org>
To: tsv-art@ietf.org
Cc: draft-ietf-opsawg-yang-vpn-service-pm.all@ietf.org, last-call@ietf.org, opsawg@ietf.org
Auto-Submitted: auto-generated
Precedence: bulk
Message-ID: <166418960145.62421.7829563511218012388@ietfa.amsl.com>
Reply-To: Bob Briscoe <ietf@bobbriscoe.net>
Date: Mon, 26 Sep 2022 03:53:21 -0700
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/ZVsMmOc7EsKNQN9H5LfUCnXewk4>
Subject: [OPSAWG] Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11

Reviewer: Bob Briscoe
Review result: On the Right Track

This document has been reviewed as part of the transport area review team's
ongoing effort to review key IETF documents. These comments were written
primarily for the transport area directors, but are copied to the document's
authors and WG to allow them to address any issues raised and also to the IETF
discussion list for information.

When done at the time of IETF Last Call, the authors should consider this
review as part of the last-call comments they receive. Please always CC
tsv-art@ietf.org if you reply to or forward this review.

==============================================================================

This draft is a useful contribution. I was good to see that the ability to
monitor delay percentiles had been included. This review was written against
draft-10, but then checked against draft-11 via the diff.

I'm afraid I have 14 comments, some of which are fairly major (e.g. #3, #4 &
#12). It should be borne in mind that this implies that everything else in the
draft is just fine :) However, all reviews tend to look fairly negative,
because they necessarily focus on points of concern and disagreement.

==1. Granularity of Units==

      Measurement interval ("measurement-interval"):  Specifies the
      performance measurement interval, in seconds.

Making the minimum granularity 1 s seems too coarse. 1 s is quite a common
interval for certain metrics (e.g. utilization), so it seems wrong to also make
it the minimum. However, I wouldn't fight hard if you disagree.

     typedef percentile {
       type decimal64 {
         fraction-digits 2;

2 decimals doesn't seem enough for percentiles. I suggest 3 at least, so that
five-9s percentiles can be specified (for instance, at a packet rate of 100,000
pkt/s, a 99.999%-ile delay statistic would imply 1 pkt/s is above the
percentile).

Is there a reason why the default percentiles are 10% and 90%? I think these
defaults would be rarely used. If this is just an arbitrary choice, 1% and 99%
might be better choices.

==2. Recursion==

I don't think this draft precludes a VPN over a VPN over a physical network
(e.g. a CVLAN over an SVLAN). However, it doesn't mention the possibility
either. An example with multiple layers of VPNs would be useful.

==3. Is the definition of TP adequate to determine different types of loss?==

I could not work out whether this YANG model would enable an operator to
identify whether losses are due to:
  * tail loss (buffer full),
  * selective early discard (AQM),
  * or discards due to transmission errors.
I read RFC8345 which was cited as the reference for the definitions of link and
TP. However, the definition of TP was 'a physical or logical port or, more
generally, an interface', which is not specific enough to determine exactly
where the TP of a physical or logical link is. Is a TP before or after the
output buffer? Before or after the input buffer? Before or after packet error
checking?

Also, is the TP of a VPN before or after encap. Is it before or after decap?
Whether per-class-id PM statistics include a packet or not will be highly
dependent on the answer to this question. Because encap and decap can alter the
class of a packet if the operator is using the pipe model where the DSCP is not
copied to and from the outer on encap and decap [RFC2983].

==4. Per traffic descriptor PM statistics?==

"§4.4 Link & TP PM Augmentation" includes  the following

       |     +--ro one-way-pm-statistics-per-class* [class-id]

Is there a reference for how class-id can be used? The description says:

                   "The class-id is used to identify the
                    class of service. This identifier is internal
                    to the administration.";

but that seems unworkable. The administration of a VPN is often separate from
the administration of the network it runs over. So how does one administration
know what the other means by each class-id? The whole point of OAM
standardization is to ensure that network administration doesn't need to be
complemented by ad hoc manual techniques, which are prone to human error.

Ideally, rather than class-id being just an opaque string, it would be
associated with a generic traffic descriptor (filter) that could also use
wildcards, for example:
    dst_addr==unicast
    next_header!=udp            # in IPv6, equivalent to protocol ID in IPv4
    DSCP==0b000???
    ECN==0b?1
Is there anything like this in YANG already?
If not, it surely needs to be created for this PM draft.
I presume it would have to encompass Ethernet, MPLS etc, not just IP.

==5. Why only unicast and non-unicast?==

       +--ro inbound-unicast?            yang:counter64
       +--ro inbound-non-unicast?        yang:counter64

It seems rather arbitrary to pick this particular traffic filter in the TP part
of the YANG tree in Fig 7. First, it begs the question why no distinction can
be made between anycast and multicast. But, more generally, it begs the
question why more general traffic filters are not possible, as described in my
previous point (which would allow unicast and non-unicast to be specified as
just one of many other possible filters).

==6. ECN marking==

An operator might be just as interested in ECN marking as loss (Congestion
Experienced in the IP-ECN field [RFC3168], or the equivalent in MPLS
[RFC5129]). But it does not appear in the model.

Also, an operator might be interested in loss statistics for ECN-capable
packets separately from non-ECN-capable, for instance, to detect overload when
loss of ECN packets is introduced [RFC7567; §4.2.1]. Again, stats per generic
traffic descriptor, as described above, would support this without needing a
specific set of statistics.

==7. Overload events==

An operator is likely to want alarm notifications when loss or ECN marking
exceeds a configurable threshold.

==8. Utilization per class-id?==

The inbound-octets or outbound-octets counters in Fig 7 allow utilization to be
determined for a TP. But why is there not (AFAICT) an equivalent ability to
monitor the utilization of a link per class-id?

==9. Example test-case==

FWIW, while reviewing, I had in mind some performance monitoring requirements
in one of my own recent drafts. I am not suggesting you cite them, but you
might want to use them as a test case; to see whether this YANG model could
define all the required statistics:
https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-aqm-dualq-coupled-25#section-2.5.2

==10. Fragmentation statistics?==

Given the encap process for a VPN can lead to packet-too-big errors or
fragmentation, wouldn't it be useful to monitor a PTB error count and/or a
count of fragments created during encap? Or perhaps this is more fault
diagnostics than performance monitoring (nonetheless, it impacts performance).

==11. Extensibility, backward and forward compatibility==

There is no discussion of extensibility in the draft. What are the implications
(if any) of adding more metrics or new topology features in future? Considering
cases like: * old device and new mgmt system; * new device and old mgmt system;
* some devices in a network support updated elements of the model and others
don't.

This might just require citing an RFC about what is meant to happen when two
different versions of a YANG model interact, e.g. default ignore?

==12. Security Considerations specific to this YANG model==

§6 gives useful generic security advice for protocols used to access the
content of this model (NETCONF, RESTCONF). And it states the (fairly obvious)
implications of not protecting the main subsets of content with write or read
access.

However, it doesn't state how protecting each main subset impacts on access to
the other subsets, For instance:
  * Is read access to VPN PM statistics possible without read access to VPN
  topology? * Do some write operations on VPN topology require read access to
  the underlying network topology? * Is read access to VPN topology possible
  without read access to underlying network topology? * Is read access to
  underlying network PM statistics advisable in order to diagnose VPN
  performance issues? * etc.

On this last point, it is common for an overlay to conceal occasional errors in
lower layers (e.g. HARQ conceals lower layer losses from higher layers; link
protection conceals lower layer link failures from higher layers). However,
once lower layer errors exceed a certain point, it becomes impossible to
conceal them, resulting in potential catastrophic failure. If a higher layer is
not monitoring underlying errors that are being concealed, it will not know to
initiate appropriate remedial actions (which might include dual-homing so it
can switch to an alternative network operator).

I learned insights like this by reading the outputs of the Resilinets project [
https://resilinets.org/ ], which is well worth a look if you haven't already.

==13. Normative references==

There seem to be an excessively large number of normative references. I suggest
they are checked. I didn't do this systematically myself, except I happened to
notice the following seem to be informatively cited, not normatively:

   [RFC9182] [ITU-T-Y-1731] These are cited as example statistics that _can_ be
   collected, which seems informative. [RFC6241] [RFC8040] NETCONF and RESTCONF
   are given as example ways of accessing this YANG model, which seems
   informative - if another protocol were used, it would not alter this spec.
   [RFC6242] [RFC8446] Altho' SSH is mandatory to implement if NETCONF is used
   and TLS is mandatory to implement if RESTCONF is used, neither NETCONF nor
   RESTCONF is mandatory to implement /for this YANG model/. etc.

==14. Nits==

§1. Intro

   The module can be used to
   monitor and manage network performance on the topology level or the
   service topology between VPN sites, in particular.

Delete 'in particular' (don't know what purpose it's serving here)?

§3. Model Usage

   Before using the model, the controller needs to establish complete
   topology visibility of the network and VPN.

Requiring /complete/ visibility seems brittle and unnecessary. E.g. if there's
a partial failure of the management plane, there's no reason to stop monitoring
performance of the parts that are visible.

s/Protocol(STAMP)/Protocol (STAMP)/

   The performance monitoring information
   reflecting the quality of the network or VPN service (e.g., end-to-
   end network performance data between source node and destination node
   in the network or between VPN sites)

Pls make it clear that 'end-to-end' is being used in the Bell-head sense (not
the Net-head sense as in 'the end-to-end principle' or 'end-to-end protocol'
which means 'application end-point to application end-point').

   The measurement and report intervals that are associated with these
   performance data usually depend on the configuration of the specific
   measurement method or collection method or various combinations.
   This document defines a network-wide measurement interval to align
   measurement requirements for networks or VPN services.

The second sentence makes it sound like there has to be one interval
network-wide. This seems inconsistent with the first sentence, that says
different intervals are required depending on the specifics of each metric.

§3.2 Collecting Data On Demand

s/Data On-demand/Data On Demand/
There is no hyphen in 'Data On Demand', unlike 'on-demand data', which is a
compound adjective. [ https://www.grammarbook.com/punctuation/hyphens.asp ]

   To obtain a snapshot of a large amount of performance data... For
   example, a specified "link-id" of a VPN can be used as a filter in a
   RESTCONF GET request to retrieve per-link VPN PM data.

Surely one link isn't an example of a _large_ amount of data.

Figure 3. The mappings using strings of ':'s in the ASCII art didn't work for
me. I genuinely thought that VN1, VN3, N1 and N2 were all intended to be mapped
to each other in some weird way. Only when I read the text did I work out that
these are two mappings that cross without intersecting. A gap either side of
the intersection for one of the lines might work better.

§4.1 Layering Relationship

   Apart from the association between the VPN topology and the underlay
   topology, VPN Network PM YANG module can also provide the performance
   status of the underlay network and VPN services.  For example, the
   module can provide...

Performance status is the primary purpose of the PM module, so it's odd to
describe it as if it's incidental to the topology associations ('can also
provide'). You might mean that the performance statements are all optional so
theoretically none needs to exist. If that's what you were trying to say, it
needs saying explicitly. The second 'can provide' is fine because it's an
example.

§4.3 Node Level PM Augmentation

   For VPN service performance monitoring, the module defines one
   attribute:
   "role":

The role attribute is solely about topology, so it seems wrong to say it's for
performance monitoring.

§ 4.4. Link & TP Level PM Augmentation

      ... Lists p
      erformance measurement statistics...
      ...Indicat
      es the abstract link...
(Inappropriate line breaks)

§6 Security Considerations

s/It is thus important to control read access/It thus might be important to
control read access/ (Rationale: for consistency with the 'may be' in the
previous sentence.) Perhaps the point should also be made that there is a
trade-off between confidentiality and monitoring performance properly (see
earlier point about catastrophic failures due to higher layers concealing lower
layer performance problems).

Each time the word 'access' is used, pls specify which type. For example:
* In the first three bullets s/Unauthorized access/Unauthorized write access/
* In the last three bullets s/Unauthorized access/Unauthorized read access/

Also, it would have been preferable to write repetitive bullets like these as a
table that covers all the cases. Otherwise, all the repetition just makes it
hard to identify what the differences between each bullet are. Eg.

    Unauthorized access to the following subtrees could have the following
    impacts: +--------+----------------------+------------------+ | Access |   
      Node            | Potential impact |
    +--------+----------------------+------------------+ |
    /nw:networks/nw:network/nw:network-types         | | write  | service type 
           | render  invalid  | | write  | VPN identifier       | render 
    invalid  | | write  | VPN service topology | render  invalid  | | write  |
    any of the above     | disable VPN PM   |
    +--------+----------------------+------------------+ |
    /nw:networks/nw:network/nw:node                  | | etc....

[OPSAWG] Tsvart last call review of draft-ietf-op… Bob Briscoe via Datatracker
Re: [OPSAWG] Tsvart last call review of draft-iet… Wubo (lana)