Re: [ippm] Some thoughts on draft-mhmcsfh-ippm-pam

Adrian Farrel <adrian@olddog.co.uk> Thu, 25 August 2022 07:56 UTC

Date: Thu, 25 Aug 2022 08:56:23 +0100
From: Adrian Farrel <adrian@olddog.co.uk>
To: Greg Mirsky <gregimirsky@gmail.com>
Cc: draft-mhmcsfh-ippm-pam@ietf.org, IETF IPPM WG <ippm@ietf.org>
Message-ID: <1736368444.77775.1661414183151@www.getmymail.co.uk>
In-Reply-To: <CA+RyBmXXTZk+LobuRzMCGs5hPOULi85py4GGCDk91a793qDgzw@mail.gmail.com>
References: <0cb301d8a4c9$2eba1b40$8c2e51c0$@olddog.co.uk> <CA+RyBmW9iqpPz0Xxcgki7v3TbMG1=ydcGv4D=ytwCwpwMt9U=g@mail.gmail.com> <070701d8b27f$1b5add00$52109700$@olddog.co.uk> <CA+RyBmXXTZk+LobuRzMCGs5hPOULi85py4GGCDk91a793qDgzw@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Importance: Normal
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/PXnhc8yuZ6I9riRdVTNkZSwwr7o>
Subject: Re: [ippm] Some thoughts on draft-mhmcsfh-ippm-pam
Precedence: list

Greg,

This looks to have addressed all of my comments. Thanks.

Adrian

On 24/08/2022 23:51 Greg Mirsky <gregimirsky@gmail.com> wrote:

Hi Adrian,
thank you for your feedback and clarifications. Please find my follow-up notes under the GIM2>> tag. Attached is the new working version of the draft and diff highlighting the updates.

I'm grateful for your comments and look forward to more discussions.

Regards,

Greg

On Wed, Aug 17, 2022 at 2:20 PM Adrian Farrel <adrian@olddog.co.uk> wrote:

Hi again Greg,

If you think I supplied enough text to be named as a Contributor, that’s fine. Otherwise, I’d be happy with just an acknowledgement.

GIM2>> Welcome aboard!

More details below.

Cheers,

Adrian

From: Greg Mirsky <gregimirsky@gmail.com>
Sent: 16 August 2022 22:58
To: Adrian Farrel <adrian@olddog.co.uk>
Cc: draft-mhmcsfh-ippm-pam@ietf.org; IETF IPPM WG <ippm@ietf.org>
Subject: Re: Some thoughts on draft-mhmcsfh-ippm-pam

Hi Adrian,

tons of thanks for your kind words supporting our work. Your comments and proposed updates are greatly appreciated. We've discussed them and prepared updates that are in-lined below under the GIM>> tag. Attached, please find the diff highlighting updates and the new working version.

Adrian, we greatly appreciate your insightful comments and practical proposals and would be honored if you agree to join this work as a contributor.

Regards,

Greg

On Sun, Jul 31, 2022 at 3:35 AM Adrian Farrel <adrian@olddog.co.uk> wrote:

Hi authors and IPPM working group,

Greg asked me if I would have a look at this draft, and I am happy to do
so because we have an increased "need" to deliver SLOs for advanced uses
including IETF Network Slices.

It seems reasonable, to me, to start this work by defining the metrics
(as this document does) before we get into how to record or distribute
them.

Cheers,
Adrian

==

Passive voice

I would like you to inject some more precision into the text (starting
with the Abstract) so that the reader can know who assesses whether the
service is being delivered in compliance with its specified quality.

GIM>> We propose the following update clarifying that PAM can be used by providers and/or users of a Network Slice service:

OLD TEXT:

Specifically, PAM can be used to assess whether a service is provided

in compliance with its specified quality, i.e., in accordance with

its defined SLOs.

NEW TEXT:

Specifically, PAM

can be used by providers and/or users of the Network Slice service to

assess whether the service is provided in compliance with its

specified quality, i.e., in accordance with its defined SLOs.

[AF] Right, this clarifies your intent. Thanks.

[AF] It opens up the question (for me), however, about whether there is a distinction between measurements performed by the user of a service and measurements performed by the provider of a service. I suspect this is practical as much as philosophical, but I don’t understand enough about the mechanisms to know whether it matters.

GIM2>> That is a very interesting question, thank you. I think that there should exist a view of the service that both operator and user can share. An operator might have more information for the root-cause investigation, problem localization, and troubleshooting. Perhaps we can get to more details in a future version.

There is a chicken/egg situation here, I think. If you define SLOs that
cannot be reported against because adequate metrics do not exist, then
you have surely done something wrong! Conversely, if you define metrics
that are interesting, but are not the sort of thing used in the
expression of an SLO, then you may be limiting the value of the work.

So what?

Well, I think that, notwithstanding that you *do* discuss SLOs, you
might make it clearer that your metrics are "useful for the definition
and monitoring of SLOs."

GIM>> A very useful clarification, thank you for your suggestion. Proposed update:

OLD TEXT:

These metrics, referred to as Precision Availability Metrics (PAM),

can be used to assess the service levels that are being delivered.

NEW TEXT:

These metrics, referred to as Precision Availability Metrics (PAM),

are useful for defining and monitoring of SLOs.

[AF] wfm

Indeed, 3.1, in its discussion of violation intervals, seems to be
looking at the violation of SLOs rather than simply reporting metrics.

Of course, it is also true that an operator may want to measure the
aspects of the network behaviour to make judgements beyond the delivery
of the SLOs. And a customer might want to measure the behaviour of the
service to determine its suitability for use by applications even when
the SLOs are being met.

---

Your paragraph on the meaning of "precision" at the top of page 4 is
timely. I know that it seems pedantic, but the precision with respect
to the delivery of an SLO only applies as the value of the metric
approaches that specified in the SLO. We do not care (should not care?)
about the precision of the metric when the value of the metric is a
long way removed from that specified in the SLO (for better or worse).

GIM>> That is a very interesting question. It seems unlikely that a different, more accurate measurement method would be used when a value of a performance metric is in the zone of the specified threshold (optimal or critical). On the other hand, it might be helpful for a monitoring system to adjust its monitoring rate depending on how close the metric value is to a threshold. I think that is one of the open questions for further discussion. To highlight it, we propose this update:

OLD TEXT:

It should be noted that "precision" refers to what is

being assessed, not to the mechanism used to measure it; in other

words, it does not refer to the precision of the mechanism with which

actual service levels are measured. The specification and

implementation of methods that provide for accurate measurements is a

separate topic independent of the definition of the metrics in which

the results of such measurements would be expressed.

NEW TEXT:

It should be noted that precision refers to what is

being assessed, not the mechanism used to measure it; in other words,

it does not refer to the precision of the mechanism with which actual

service levels are measured. Furthermore, the precision, with

respect to the delivery of an SLO, only applies when the metric value

approaches the specified threshold levels in the SLO. The

specification and implementation of methods that provide for accurate

measurements is a separate topic independent of the definition of the

metrics in which the results of such measurements would be expressed.

[AF] That’s good. And I like that an implementation might reduce the measurement frequency when things are very good or very bad since measurement (per Heisenberg?) may tend to degrade the performance of the thing being measured.

In 2.1, I wonder whether you want to make some statement about SLEs
being out of scope.

GIM>> Thank you for the suggestion, agreed. Here's the update:

NEW TEXT:

Service Level Expectations, as defined in Section 4.1 of

[I-D.ietf-teas-ietf-network-slices], are outside the scope of this

document.

[AF] Fine.

[AF] I might add “…because it is in the nature of SLEs that they define parts of the SLA that are not easily measured.”

GIM2>> Accepted, thank you.

In 3.1, your use of "degraded" is doubtful.

A reduction from "exceptionally good" to "very, very good" is a
degradation, but not one we care about with an SLO that says "good
enough".

So perhaps
OLD
* VI is a time interval during which at least one of the performance
parameters degraded compared to its pre-defined optimal level
threshold.

* SVI is a time interval during which at least one the performance
parameters degraded compared to its pre-defined critical
threshold.
NEW
* VI is a time interval during which at least one of the performance
parameters degraded below its pre-defined optimal level threshold.

* SVI is a time interval during which at least one of the performance
parameters degraded below its pre-defined critical threshold.
END

GIM>> We agree and accept the proposed text update, thank you.
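The revised VI/SVI/VFI definitions can be read as a simple per-interval classification. The following is a minimal sketch of that reading, not the draft's procedure. The function name, the parameter names, the threshold values, and the assumption that every parameter is "lower is better" (so that "degraded below its threshold" means the measured value exceeds it) are all illustrative.

```python
from enum import Enum


class Interval(Enum):
    VFI = "violation-free interval"
    VI = "violated interval"
    SVI = "severely violated interval"


def classify_interval(measured, optimal, critical):
    """Classify one time interval from per-parameter measurements.

    All three arguments are dicts keyed by parameter name.
    Assumption: every parameter is "lower is better" (e.g. delay, loss),
    so a parameter is degraded when its measured value exceeds a threshold.
    """
    # Any parameter worse than its critical threshold makes the interval an SVI.
    if any(measured[p] > critical[p] for p in measured):
        return Interval.SVI
    # Otherwise, any parameter worse than its optimal threshold makes it a VI.
    if any(measured[p] > optimal[p] for p in measured):
        return Interval.VI
    # All parameters are at or better than their optimal levels.
    return Interval.VFI


# Hypothetical thresholds, chosen only for illustration.
optimal = {"delay_ms": 10.0, "loss": 0.001}
critical = {"delay_ms": 50.0, "loss": 0.01}

good = classify_interval({"delay_ms": 8.0, "loss": 0.0}, optimal, critical)
slow = classify_interval({"delay_ms": 12.0, "loss": 0.0}, optimal, critical)
lossy = classify_interval({"delay_ms": 12.0, "loss": 0.05}, optimal, critical)
```

Under these assumptions, a drop from "exceptionally good" to "very, very good" stays a VFI as long as no parameter crosses its optimal threshold, which is the distinction the NEW wording is after.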

In 3.1

* Consequently, VFI is a time interval during which all performance
objectives are at or better than their respective pre-defined
optimal levels. In such a case, the service is in compliance with
its specification.

The last sentence here is debatable! It is true that the service will
be in compliance with its specification during the VFI, but the implied
converse is not true. That is, a service still may be in compliance with
its specification during a VI (and compliance might depend on the VI
count or ratio). Indeed, the service could be in compliance even during
an SVI.

GIM>> Agree and removed the last sentence.

The last sentence of 3.1 could use a pointer to the definition of ratios

GIM>> Added a forward reference to Section 3.2.

Although it is a bit obvious, is it worth noting (for the benefit of
IPPM and to avoid BMWG) that these metrics are necessarily in-service
metrics?

GIM>> Do you refer to the metrics introduced in Section 3.1 or to all metrics defined in the draft? I think it is the latter. If you agree, I'd add that characterization as a general statement before Section 3.1. WDYT?

[AF] Yes, sorry, I was sloppy and made the comment against 3.1 when it does, as you say, apply to the whole document.

In 3.2 you have count of packets but not count of bytes. Is that OK?

GIM>> We've discussed your question. It is not obvious how to count violated bytes (octets) and/or severely violated bytes (octets). Simply the number of octets in a violated packet? Another question for further discussion, thank you.

[AF] Yes, I think I meant that (in general) all of the bytes in a violated packet are violated bytes. The issue being (of course?) that if packets alternate small and large, and if the large packets all violate, one measure might imply 50% violation while the other might exceed 90%. “For future discussion” is a fine way forward.
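The small/large example above can be checked with a few lines of Python. This is only an illustrative sketch: the 64-byte and 1500-byte packet sizes and the function name `violation_ratios` are assumptions and do not come from the draft. It treats every byte of a violated packet as a violated byte, as suggested above.

```python
def violation_ratios(packets):
    """Return (packet_ratio, byte_ratio) of violated traffic.

    packets: non-empty iterable of (size_in_bytes, violated) tuples.
    Every byte of a violated packet is counted as a violated byte.
    """
    packets = list(packets)
    if not packets:
        raise ValueError("at least one packet is required")
    total_bytes = sum(size for size, _ in packets)
    bad_packets = sum(1 for _, violated in packets if violated)
    bad_bytes = sum(size for size, violated in packets if violated)
    return bad_packets / len(packets), bad_bytes / total_bytes


# Alternating small and large packets; only the large ones violate.
stream = [(64, False), (1500, True)] * 100
pkt_ratio, byte_ratio = violation_ratios(stream)
print(f"violated packets: {pkt_ratio:.1%}, violated bytes: {byte_ratio:.1%}")
```

With these assumed sizes the packet count reports 50% violation while the byte count reports roughly 96%, so the two measures can tell quite different stories about the same traffic.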

3.2 talks about "EIs". Do you mean "VIs" or do you need to introduce a
definition?

GIM>> Thank you for catching my editorial sloppiness. Fixed.

In 3.2

Determining the condition in which the path is currently with respect
to availability/unavailability is helpful.

The use of "path" is worrying in this context. Can you say "service"? Or
at least "connectivity"?

GIM>> Would the following update be acceptable:

OLD TEXT:

Determining the condition in which the path is currently with respect

to availability/unavailability is helpful.

NEW TEXT:

Determining the condition in which the monitored service is currently

with respect to availability/unavailability is helpful.

[AF] Yes, although to fix the English…

NEW NEW

Determining the current condition of the monitored service

with respect to availability/unavailability is helpful.

GIM2>> Thank you.

Then in 3.3 you have

VI, SVI, and VFI characterize the communication between two nodes
relative to the level of required and acceptable performance and when
the performance level degrades below an acceptable level.

This puts the SLO specification very much into the context of a P2P
communication. I think you need to justify this somewhere with a
discussion of how a service is decomposed into connectivity constructs
and how the SLOs are applied to each of these even if they are stated as
applying to the whole service. This is particularly important when you
look at the VIR and SVIR for a service that comprises multiple
connectivity constructs only one of which is under performing.

GIM>> A very accurate observation, thank you. We propose a new paragraph in Section 3.3 (it becomes the second paragraph):

NEW TEXT:

It is worth noting that a service might include a set of connectivity

constructs. An SLO might apply to all the constructs, or some

constructs are assigned different SLO values or even different sets

of SLOs. It is worth noting that a composite service might include a

set of connectivity constructs. An SLO might apply to all the

constructs, or some constructs are assigned different sets of SLOs.

For the purpose of PAM, each connectivity construct that composes the

service can be monitored for its own SLO conformance as a sub-

service. The composition of PAMs of these sub-services can be viewed

as the PAM of the composite service. The composition of PAMs of

these sub-services can be viewed as the PAM of the composite service.

[AF] Seems you duplicated the first couple of sentences, here.

[AF] Otherwise, this looks good.

GIM2>> Indeed, thank you for catching that.

3.2 has...

switching
between periods requires ten consecutive intervals, shorter
conditions may not be adequately reflected.

No clue here as to where this requirement of ten comes from, nor what
the process of switching means. 3.3 gives some clues as to what is
going on, so a forward pointer would help. But even 3.3 doesn't explain
why 10.

GIM>> Indeed, ten intervals is only an example, and we update the text to reflect that:

NEW TEXT:

But because

the transition between service availability/unavailability periods is

based on a pre-defined number of consecutive intervals, e.g., ten,

shorter conditions may not be adequately reflected.

[AF] OK
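To see why shorter conditions may not be reflected, the transition rule can be sketched as a small state machine. The sketch makes assumptions the thread does not confirm: that unavailability is entered after n consecutive SVIs, that availability is restored after n consecutive non-SVI intervals, and that the state flips at the end of the n-th interval rather than retroactively. The function name is hypothetical.

```python
def availability_states(intervals, n=10):
    """Return the availability state (True = available) after each interval.

    intervals: sequence of booleans, True if that interval is an SVI.
    n: assumed number of consecutive intervals needed to switch state.
    """
    available = True
    run = 0  # length of the current run of intervals contradicting the state
    states = []
    for is_svi in intervals:
        # An SVI contradicts "available"; a non-SVI contradicts "unavailable".
        contradicts = is_svi if available else not is_svi
        run = run + 1 if contradicts else 0
        if run >= n:
            available = not available
            run = 0
        states.append(available)
    return states


# Nine consecutive SVIs never trigger a transition when n is ten.
short_burst = availability_states([True] * 9 + [False] * 5)

# Ten consecutive SVIs do, and ten clean intervals switch back.
long_burst = availability_states([True] * 10 + [False] * 10)
```

In the first case the service is reported as available throughout, even though nine intervals in a row were severely violated, which is exactly the kind of shorter condition the NEW TEXT warns about.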

Is the definition of VIR really what you want? It seems odd that the
existence of an SVI reduces the VIR. You could define...
* violated interval ratio (VIR) is the ratio of the combined number
of VIs and SVIs to the total number of time unit intervals in a
time of the availability periods during a fixed measurement
interval.

GIM>> Agreed and gratefully accepted.
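The effect of the accepted definition can be illustrated with a short sketch. The function names and interval labels are hypothetical, and the "VI-only" variant is only a guess at what the earlier text implied, included for contrast.

```python
def violated_interval_ratio(labels):
    """VIR as proposed above: (VIs + SVIs) / total intervals.

    labels: non-empty sequence of "VFI", "VI" or "SVI", one per time-unit
    interval within the availability periods of the measurement interval.
    """
    if not labels:
        raise ValueError("at least one interval is required")
    violated = sum(1 for label in labels if label in ("VI", "SVI"))
    return violated / len(labels)


def vi_only_ratio(labels):
    """Contrast case (assumed earlier reading): count VIs only."""
    if not labels:
        raise ValueError("at least one interval is required")
    return sum(1 for label in labels if label == "VI") / len(labels)


before = ["VFI"] * 6 + ["VI"] * 3 + ["SVI"] * 1
# Same period, but one VI has worsened into an SVI.
after = ["VFI"] * 6 + ["VI"] * 2 + ["SVI"] * 2

combined = (violated_interval_ratio(before), violated_interval_ratio(after))
vi_only = (vi_only_ratio(before), vi_only_ratio(after))
```

With the combined count the ratio stays at 0.4 when a VI worsens into an SVI, whereas the VI-only count falls from 0.3 to 0.2, making a worse period look better. That is the oddity the comment points out.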

---

4.

For example, an SLA might state
that any given SLO applies only to a certain percentage of packets

Is this really...

For example, an SLA might state
that any given SLO applies to at least a certain percentage of
packets

GIM>> You are correct. Updated the text with your suggestion.

---

4.

s/To support statistical services/To support statistical SLOs/

GIM>> Thank you, accepted.

---

In 4 you have

The definition of histogram metrics is for further study.

I wasn't clear whether you intend that to be in a future version of this
document or in a separate document. Section 6 helps. Maybe include a
forward pointer?

GIM>> Added the reference:

NEW TEXT:

The definition of histogram metrics is for further study

(see Section 6).
