Re: [OPSAWG] Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11

"Wubo (lana)" <lana.wubo@huawei.com> Wed, 28 September 2022 08:10 UTC

From: "Wubo (lana)" <lana.wubo@huawei.com>
To: Bob Briscoe <ietf@bobbriscoe.net>, "tsv-art@ietf.org" <tsv-art@ietf.org>
CC: "draft-ietf-opsawg-yang-vpn-service-pm.all@ietf.org" <draft-ietf-opsawg-yang-vpn-service-pm.all@ietf.org>, "last-call@ietf.org" <last-call@ietf.org>, "opsawg@ietf.org" <opsawg@ietf.org>
Thread-Topic: Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11
Thread-Index: AQHY0ZY7RDEj2e/rJUCjmuJFnTBu3q30bCJg
Date: Wed, 28 Sep 2022 08:10:02 +0000
Message-ID: <e3e8cbe0bb8a4703a218277b89753022@huawei.com>
References: <166418960145.62421.7829563511218012388@ietfa.amsl.com>
In-Reply-To: <166418960145.62421.7829563511218012388@ietfa.amsl.com>
Accept-Language: zh-CN, en-US
Content-Language: zh-CN
Content-Type: multipart/alternative; boundary="_000_e3e8cbe0bb8a4703a218277b89753022huaweicom_"
MIME-Version: 1.0
Archived-At: <https://mailarchive.ietf.org/arch/msg/opsawg/bGqJ7V2EvVzJv7H-Xi2Y-jIOsKw>
Subject: Re: [OPSAWG] Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11
Precedence: list

Hi Bob,



Thank you for the detailed review and helpful comments. Please find the replies inline.



Thanks,

Bo



-----Original Message-----
From: Bob Briscoe via Datatracker [mailto:noreply@ietf.org]
Sent: Monday, September 26, 2022 6:53 PM
To: tsv-art@ietf.org
Cc: draft-ietf-opsawg-yang-vpn-service-pm.all@ietf.org; last-call@ietf.org; opsawg@ietf.org
Subject: Tsvart last call review of draft-ietf-opsawg-yang-vpn-service-pm-11



Reviewer: Bob Briscoe

Review result: On the Right Track



This document has been reviewed as part of the transport area review team's ongoing effort to review key IETF documents. These comments were written primarily for the transport area directors, but are copied to the document's authors and WG to allow them to address any issues raised and also to the IETF discussion list for information.



When done at the time of IETF Last Call, the authors should consider this review as part of the last-call comments they receive. Please always CC tsv-art@ietf.org<mailto:tsv-art@ietf.org> if you reply to or forward this review.



==============================================================================



This draft is a useful contribution. I was good to see that the ability to monitor delay percentiles had been included. This review was written against draft-10, but then checked against draft-11 via the diff.



I'm afraid I have 14 comments, some of which are fairly major (e.g. #3, #4 & #12). It should be borne in mind that this implies that everything else in the draft is just fine :) However, all reviews tend to look fairly negative, because they necessarily focus on points of concern and disagreement.



==1. Granularity of Units==



      Measurement interval ("measurement-interval"):  Specifies the

      performance measurement interval, in seconds.



Making the minimum granularity 1 s seems too coarse. 1 s is quite a common interval for certain metrics (e.g. utilization), so it seems wrong to also make it the minimum. However, I wouldn't fight hard if you disagree.

[Bo Wu] In the YANG model, the default interval is 60s. This Indicates the period of time that PM measurement tasks can be performed and results are collected, which we think is a reasonable unit.



     typedef percentile {

       type decimal64 {

         fraction-digits 2;



2 decimals doesn't seem enough for percentiles. I suggest 3 at least, so that five-9s percentiles can be specified (for instance, at a packet rate of 100,000 pkt/s, a 99.999%-ile delay statistic would imply 1 pkt/s is above the percentile).

[Bo Wu] OK with this one.



Is there a reason why the default percentiles are 10% and 90%? I think these defaults would be rarely used. If this is just an arbitrary choice, 1% and 99% might be better choices.

[Bo Wu] These are configurable and can be changed to accommodate specific contexts. The current defaults are used to echo what was used in other RFCs such RFC9244.



==2. Recursion==



I don't think this draft precludes a VPN over a VPN over a physical network (e.g. a CVLAN over an SVLAN). However, it doesn't mention the possibility either. An example with multiple layers of VPNs would be useful.

[Bo Wu] In the draft, the current text does not make any assumption on the underlay. Per RFC8345, a network topology is a hierarchical structure, which could be VPN service topology, or L1, L2, L3, OSPF topology, and two networks can use “supporting-network” to show the interconnection. Therefore, we think this draft allow this and not sure this is representative enough to zoom into it.

https://www.rfc-editor.org/rfc/rfc8345.html



==3. Is the definition of TP adequate to determine different types of loss?==



I could not work out whether this YANG model would enable an operator to identify whether losses are due to:

  * tail loss (buffer full),

  * selective early discard (AQM),

  * or discards due to transmission errors.

I read RFC8345 which was cited as the reference for the definitions of link and TP. However, the definition of TP was 'a physical or logical port or, more generally, an interface', which is not specific enough to determine exactly where the TP of a physical or logical link is. Is a TP before or after the output buffer? Before or after the input buffer? Before or after packet error checking?



Also, is the TP of a VPN before or after encap. Is it before or after decap?

Whether per-class-id PM statistics include a packet or not will be highly dependent on the answer to this question. Because encap and decap can alter the class of a packet if the operator is using the pipe model where the DSCP is not copied to and from the outer on encap and decap [RFC2983].

[Bo Wu] Agree the current text is not clear enough.



There could be three types of underlay networks for VPN services: L3 topology, L2 topology, and physical topology, or a combination of these topologies. As per RFC 8345, the network type of one particular network instance can support the three types simultaneously.



For TP counters, we currently reuse the YANG counters in RFC 8343 "ietf-interfaces", which supports physical interfaces as well as logical interfaces.

https://www.rfc-editor.org/rfc/rfc8343#section-3.



If the underlay network is an L3 topology, then the TP is L3 termination point, which could be a physical interface or an L2 logical interface.

For the TP of VPN, it could be also physical interface or an L2 logical interface.



==4. Per traffic descriptor PM statistics?==



"§4.4 Link & TP PM Augmentation" includes  the following



       |     +--ro one-way-pm-statistics-per-class* [class-id]



Is there a reference for how class-id can be used? The description says:



                   "The class-id is used to identify the

                    class of service. This identifier is internal

                    to the administration.";



but that seems unworkable. The administration of a VPN is often separate from the administration of the network it runs over. So how does one administration know what the other means by each class-id? The whole point of OAM standardization is to ensure that network administration doesn't need to be complemented by ad hoc manual techniques, which are prone to human error.



Ideally, rather than class-id being just an opaque string, it would be associated with a generic traffic descriptor (filter) that could also use wildcards, for example:

    dst_addr==unicast

    next_header!=udp            # in IPv6, equivalent to protocol ID in IPv4

    DSCP==0b000???

    ECN==0b?1

Is there anything like this in YANG already?

If not, it surely needs to be created for this PM draft.

I presume it would have to encompass Ethernet, MPLS etc, not just IP.

[Bo Wu] Yes. The class-id is defined in L3NM and L2NM.For l3vpn, the class-id is associated with filter like you suggested and l2vpn with pcp, etc.. I added the references of two drafts for your information.

https://datatracker.ietf.org/doc/html/rfc9182 L3NM

https://datatracker.ietf.org/doc/html/rfc9291 L2NM



==5. Why only unicast and non-unicast?==



       +--ro inbound-unicast?            yang:counter64

       +--ro inbound-non-unicast?        yang:counter64



It seems rather arbitrary to pick this particular traffic filter in the TP part of the YANG tree in Fig 7. First, it begs the question why no distinction can be made between anycast and multicast. But, more generally, it begs the question why more general traffic filters are not possible, as described in my previous point (which would allow unicast and non-unicast to be specified as just one of many other possible filters).

[Bo Wu] This definition takes L2VPN into account, e.g. rfc7432 BUM. Thus, we would like to maintain this leaf.



==6. ECN marking==



An operator might be just as interested in ECN marking as loss (Congestion Experienced in the IP-ECN field [RFC3168], or the equivalent in MPLS [RFC5129]). But it does not appear in the model.



Also, an operator might be interested in loss statistics for ECN-capable packets separately from non-ECN-capable, for instance, to detect overload when loss of ECN packets is introduced [RFC7567; §4.2.1]. Again, stats per generic traffic descriptor, as described above, would support this without needing a specific set of statistics.

[Bo Wu] We think loss stats have been covered in the TP and the link. If ECN-specific counters needs to be added, I think this module can be extended for specific implementation. The current draft is based on the RFC 8343 “ietf-interface” YANG and the general mechanisms, e.g.. TWAMP, STAMP, BGP-LS, etc.



==7. Overload events==



An operator is likely to want alarm notifications when loss or ECN marking exceeds a configurable threshold.

[Bo Wu] Agree ECN marking is useful, which could be extended for L3VPN specific.



==8. Utilization per class-id?==



The inbound-octets or outbound-octets counters in Fig 7 allow utilization to be determined for a TP. But why is there not (AFAICT) an equivalent ability to monitor the utilization of a link per class-id?

[Bo Wu] Per RFC8345, the link is unidirectional between a source node and destination node. So it is not same to get the counters on the port/interface.



==9. Example test-case==



FWIW, while reviewing, I had in mind some performance monitoring requirements in one of my own recent drafts. I am not suggesting you cite them, but you might want to use them as a test case; to see whether this YANG model could define all the required statistics:

https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-aqm-dualq-coupled-25#section-2.5.2

[Bo Wu] Thanks for sharing this. We agree the PM module can be helpful.



==10. Fragmentation statistics?==



Given the encap process for a VPN can lead to packet-too-big errors or fragmentation, wouldn't it be useful to monitor a PTB error count and/or a count of fragments created during encap? Or perhaps this is more fault diagnostics than performance monitoring (nonetheless, it impacts performance).

[Bo Wu] Yes.We agree the Fragmentation statistics can be useful, but the module depends on ”ietf-interface” to collect the data, which has no such counter. And we also think fragment counters are visible at the CE side, not PE.



==11. Extensibility, backward and forward compatibility==



There is no discussion of extensibility in the draft. What are the implications (if any) of adding more metrics or new topology features in future? Considering cases like: * old device and new mgmt system; * new device and old mgmt system;

* some devices in a network support updated elements of the model and others don't.



This might just require citing an RFC about what is meant to happen when two different versions of a YANG model interact, e.g. default ignore?

[Bo Wu] Yes, we can add some extensibility consideration, e.g.

This module can be further extended to include L3-specific details, e.g., adding ECN statistics to support performance-sensitive applications.



==12. Security Considerations specific to this YANG model==



§6 gives useful generic security advice for protocols used to access the content of this model (NETCONF, RESTCONF). And it states the (fairly obvious) implications of not protecting the main subsets of content with write or read access.



However, it doesn't state how protecting each main subset impacts on access to the other subsets, For instance:

  * Is read access to VPN PM statistics possible without read access to VPN

  topology? * Do some write operations on VPN topology require read access to

  the underlying network topology? * Is read access to VPN topology possible

  without read access to underlying network topology? * Is read access to

  underlying network PM statistics advisable in order to diagnose VPN

  performance issues? * etc.



On this last point, it is common for an overlay to conceal occasional errors in lower layers (e.g. HARQ conceals lower layer losses from higher layers; link protection conceals lower layer link failures from higher layers). However, once lower layer errors exceed a certain point, it becomes impossible to conceal them, resulting in potential catastrophic failure. If a higher layer is not monitoring underlying errors that are being concealed, it will not know to initiate appropriate remedial actions (which might include dual-homing so it can switch to an alternative network operator).



I learned insights like this by reading the outputs of the Resilinets project [ https://resilinets.org/ ], which is well worth a look if you haven't already.

[Bo Wu] Thanks for sharing this. We agree that the OSS and the orchestrator applications need more information to assure services. And this model is defined as a network model, not a customer service model per RFC8309. The model is for service provider' OSS or orchestrator applications. Therefore the access to the underlay network topology and VPN topology is no difference here.



==13. Normative references==



There seem to be an excessively large number of normative references. I suggest they are checked. I didn't do this systematically myself, except I happened to notice the following seem to be informatively cited, not normatively:



   [RFC9182]

[Bo Wu] Already cited as informative.

[ITU-T-Y-1731]

[Bo Wu] We can change this to informative.

These are cited as example statistics that _can_ be

   collected, which seems informative. [RFC6241] [RFC8040] NETCONF and RESTCONF

   are given as example ways of accessing this YANG model, which seems

   informative - if another protocol were used, it would not alter this spec.

   [RFC6242] [RFC8446] Altho' SSH is mandatory to implement if NETCONF is used

   and TLS is mandatory to implement if RESTCONF is used, neither NETCONF nor

   RESTCONF is mandatory to implement /for this YANG model/. etc.

[Bo Wu] These are cited as normative because of the YANG Guidelines. For example, https://trac.ietf.org/trac/ops/wiki/yang-security-guidelines says explicitly: “Note: [RFC 8446], [RFC6241], [RFC6242], [RFC8341], and [RFC8040] must be "normative references".”



==14. Nits==



§1. Intro



   The module can be used to

   monitor and manage network performance on the topology level or the

   service topology between VPN sites, in particular.



Delete 'in particular' (don't know what purpose it's serving here)?

[Bo Wu] Thanks for catching. Fixed.



§3. Model Usage



   Before using the model, the controller needs to establish complete

   topology visibility of the network and VPN.



Requiring /complete/ visibility seems brittle and unnecessary. E.g. if there's a partial failure of the management plane, there's no reason to stop monitoring performance of the parts that are visible.

[Bo Wu] Agree. Will remove complete.



s/Protocol(STAMP)/Protocol (STAMP)/

[Bo Wu] Fixed.



   The performance monitoring information

   reflecting the quality of the network or VPN service (e.g., end-to-

   end network performance data between source node and destination node

   in the network or between VPN sites)



Pls make it clear that 'end-to-end' is being used in the Bell-head sense (not the Net-head sense as in 'the end-to-end principle' or 'end-to-end protocol'

which means 'application end-point to application end-point').

[Bo Wu] Agree. Will remove “end-to-end”.



   The measurement and report intervals that are associated with these

   performance data usually depend on the configuration of the specific

   measurement method or collection method or various combinations.

   This document defines a network-wide measurement interval to align

   measurement requirements for networks or VPN services.



The second sentence makes it sound like there has to be one interval network-wide. This seems inconsistent with the first sentence, that says different intervals are required depending on the specifics of each metric.

[Bo Wu] Agree. How about we correct the second one:

“This document defines network-wide measurement intervals to align

   measurement requirements for networks or VPN services.”



§3.2 Collecting Data On Demand



s/Data On-demand/Data On Demand/

There is no hyphen in 'Data On Demand', unlike 'on-demand data', which is a compound adjective. [ https://www.grammarbook.com/punctuation/hyphens.asp ]

[Bo Wu] Thanks for catching this. Fixed.



   To obtain a snapshot of a large amount of performance data... For

   example, a specified "link-id" of a VPN can be used as a filter in a

   RESTCONF GET request to retrieve per-link VPN PM data.



Surely one link isn't an example of a _large_ amount of data.

[Bo Wu] We will remove “a _large_ amount of data”. Thanks.



Figure 3. The mappings using strings of ':'s in the ASCII art didn't work for me. I genuinely thought that VN1, VN3, N1 and N2 were all intended to be mapped to each other in some weird way. Only when I read the text did I work out that these are two mappings that cross without intersecting. A gap either side of the intersection for one of the lines might work better.

[Bo Wu] I will try to make this better.



§4.1 Layering Relationship



   Apart from the association between the VPN topology and the underlay

   topology, VPN Network PM YANG module can also provide the performance

   status of the underlay network and VPN services.  For example, the

   module can provide...



Performance status is the primary purpose of the PM module, so it's odd to describe it as if it's incidental to the topology associations ('can also provide'). You might mean that the performance statements are all optional so theoretically none needs to exist. If that's what you were trying to say, it needs saying explicitly. The second 'can provide' is fine because it's an example.



[Bo Wu] How about we make such corrections:

Based on the association between the VPN topology and the underlying topologies, VPN Network PM YANG module extends the performance status of the underlay networks and VPN services. For example, the module can provide link PM statistics and port statistics of an underlay network, e.g. L1, L2, L3, OSPF, IS-IS, etc..



§4.3 Node Level PM Augmentation



   For VPN service performance monitoring, the module defines one

   attribute:

   "role":



The role attribute is solely about topology, so it seems wrong to say it's for performance monitoring.

[Bo Wu] Agree. Will change to: “For VPN service topology, the module defines one attribute:”



§ 4.4. Link & TP Level PM Augmentation



      ... Lists p

      erformance measurement statistics...

      ...Indicat

      es the abstract link...

(Inappropriate line breaks)



[Bo Wu] Will fix.



§6 Security Considerations



s/It is thus important to control read access/It thus might be important to control read access/ (Rationale: for consistency with the 'may be' in the previous sentence.) Perhaps the point should also be made that there is a trade-off between confidentiality and monitoring performance properly (see earlier point about catastrophic failures due to higher layers concealing lower layer performance problems).



Each time the word 'access' is used, pls specify which type. For example:

* In the first three bullets s/Unauthorized access/Unauthorized write access/

* In the last three bullets s/Unauthorized access/Unauthorized read access/



Also, it would have been preferable to write repetitive bullets like these as a table that covers all the cases. Otherwise, all the repetition just makes it hard to identify what the differences between each bullet are. Eg.



    Unauthorized access to the following subtrees could have the following

    impacts: +--------+----------------------+------------------+ | Access |

      Node            | Potential impact |

    +--------+----------------------+------------------+ |

    /nw:networks/nw:network/nw:network-types         | | write  | service type

           | render  invalid  | | write  | VPN identifier       | render

    invalid  | | write  | VPN service topology | render  invalid  | | write  |

    any of the above     | disable VPN PM   |

    +--------+----------------------+------------------+ |

    /nw:networks/nw:network/nw:node                  | | etc....



[Bo Wu] Thanks for suggestion. We will change as you suggested.

[OPSAWG] Tsvart last call review of draft-ietf-op… Bob Briscoe via Datatracker
Re: [OPSAWG] Tsvart last call review of draft-iet… Wubo (lana)