Re: [bmwg] I-D Action: draft-ietf-bmwg-b2b-frame-00.txt

Hi Al.

(Note: this comments on the -01 version of the draft.)

As mentioned in the meeting, I had one more idea.
We could focus more on the maximum number of successful b2b frames.
The motivation is similar to why we avoid other sources of packet loss
when determining the throughput (as an input to the buffer time calculation).
The other sources can lower the average number of b2b frames,
but the maximum number should better match the input throughput.

I think the average and standard deviation can still be useful
for some users, but it is less clear how they match the input throughput.
Maybe average of throughput measurements (including those affected by other sources)
should be used as an input to calculate "average" buffer time?

Anyway, my earlier comments still apply, see below.

Vratko.

-----Original Message-----
From: bmwg <bmwg-bounces@ietf.org> On Behalf Of Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco)
Sent: Wednesday, 2019-November-20 18:22
To: MORTON, ALFRED C (AL) <acm@research.att.com>
Cc: bmwg@ietf.org
Subject: Re: [bmwg] I-D Action: draft-ietf-bmwg-b2b-frame-00.txt

I see a new draft version is published,
so this is a good time for another round of review.
(Should the e-mail subject be changed?)

Brace for another long e-mail,
as I feel particularly nitpicky today.

Firstly, comments on new content,
found by looking at the difference [2].

* In section 3, the first list, item 4
(talking about buffer time estimates):

> (measured according to Section 26.4 of [RFC2544])

Neither RFC 2544 nor RFC 1242 mentions buffer time (nor size) estimation.
They only mention measurement of the number of back-to-back frames,
which is only an input to the buffer time estimate calculation.
Inserting "based on" somewhere into the sentence will fix that.

> tends to increase the "implied" estimate

I agree with the word implied being in quotes.
It is a colloquial term used by the group doing the analysis.

The draft mentions (indirectly) two models.
One neglecting header processing,
and one assuming the header processing speed is constant
(thus equal to the measured Throughput).
As the naming for the two estimates is also under review,
I will call the two models "the first order model"
and "the second order model". The second order model
is the one taking the processing speed into account;
the first order model works only when the processing speed
is negligible compared to the offered load.

Now it is clearer what "increase" is being described:
it is the difference between the two model predictions,
namely the time correction term.
Still, the sentence is wrong.
The "implied" estimate does not increase.
Merely, our best estimate is increased by switching to a better model.
It would be better to talk about "correcting" an estimate
(instead of increasing it).

* Section 2, a few lines before the second list:

> The simplified model

It is the second order model.
We know it is not realistic enough (the draft mentions
unpredictable processing interrupts, not modeled),
but the word "simplified" means "created from a more convoluted model
by applying some simplifying assumption".
The second order model has not been simplified
from any other model.
I suggest picking two names for the two models
and stick to them (instead of varying adjectives).

* Section 3, second list, item 2:

> The packet header processing function (HeaderProc) operates at
> approximately the "Measured Throughput"

I think it would be better to split this into two assumptions.
The first assumption is related to the model (header processing speed is constant).
The second assumption is related to parameter estimation
(Measured Throughput to be used as the value for the processing speed).

The word "approximately" would then appear neither in the model definition
nor in the test description.
It would just describe the fact that the model is not 100% realistic.
That is why the average processing speed can differ from Measured Throughput;
we just have to assume it is still the best approximation we have.

Secondly, general comments.

It occurred to me that the term "buffer time" is used
without there being an explicit definition for it.
An implicit definition is along the lines of
"the time between traffic start and the first packet drop",
but that depends not only on buffer size,
but also on the offered load and the processing speed.

Section 3 (just after the second list) mentions the combination
of processing speed zero (interrupted) and presumably
offered load at Max Theoretical Frame Rate.
The Actual Buffer Time (just before section 7)
then presumably means buffer time for zero processing speed
and Actual Frame Rate being offered.
But other combinations are also useful.
For example, Implied DUT Buffer Time
already happens to be the correct buffer time for the combination
of offered load at Max Theoretical Frame Rate
and processing speed at 100% of Measured Throughput.
People can be interested in other combinations as well,
for example offered load at 110% of Measured Throughput
(while processing speed stays at 100% of Measured Throughput);
or processing speed at 90% of Measured Throughput (e.g. noisy neighbor)
while the offered load stays at 100% of Measured Throughput.

The draft could include one formula for all the combinations.
As the buffer time depends on intended load,
but buffer size (in number of frames) does not,
it would be nice if the Report required the buffer size estimate,
with some buffer times optional.

Another general comment: Some explanations
would be nicer if we used the term Buffer Filling Speed.
In the first order model, buffer filling speed is equal to the offered load;
in the second order model it is the offered load
minus the processing speed.
In both models, the buffer time is the buffer size
divided by the buffer filling speed for the particular load.
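
A minimal sketch of that single formula, for illustration only. The function
names and the sample numbers (a 1000-frame buffer, Measured Throughput of
10 fps, Max Theoretical Frame Rate of 100 fps) are assumptions and do not
come from the draft.

```python
def buffer_filling_speed(offered_load_fps, processing_speed_fps=0.0):
    """Net rate (frames/s) at which the buffer fills.

    processing_speed_fps=0.0 gives the first order model (or a blocked
    processor); a non-zero value gives the second order model.
    """
    return offered_load_fps - processing_speed_fps


def buffer_time(buffer_size_frames, offered_load_fps, processing_speed_fps=0.0):
    """Seconds until the buffer overflows; infinity if it never fills."""
    speed = buffer_filling_speed(offered_load_fps, processing_speed_fps)
    if speed <= 0:
        return float("inf")
    return buffer_size_frames / speed


# Hypothetical values, chosen only to exercise the combinations listed above.
BUFFER_SIZE = 1000.0   # frames
THROUGHPUT = 10.0      # fps, stands in for Measured Throughput
MAX_RATE = 100.0       # fps, stands in for Max Theoretical Frame Rate

combinations = {
    "max load, processing blocked": (MAX_RATE, 0.0),
    "max load, processing at 100% of throughput": (MAX_RATE, THROUGHPUT),
    "load at 110% of throughput, processing at 100%": (11.0, THROUGHPUT),
    "load at 100% of throughput, processing at 90%": (THROUGHPUT, 9.0),
}

for name, (load, proc) in combinations.items():
    print(f"{name}: {buffer_time(BUFFER_SIZE, load, proc):.4f} s")
```

With a function like this, a report would only need to state the buffer size
estimate; any buffer time of interest follows from the chosen offered load
and processing speed.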

Final general comment: I still do not like
the name Implied (when used for buffer times).
The quantities are only implied by the first order model.
People using the second order model (with non-zero processing speed)
are not implying those values.
Maybe it is a good idea to start with naming the two buffer times
(both for Max Theoretical Frame Rate offered,
with processing speed at either 0% or 100% of Measured Throughput),
and only then decide the names for the corresponding models.

Thirdly, comments on points from previous e-mails:

>>> Corrected DUT Buffer Time
>> 
>> This is a big one.
>> The correct name for this quantity could be "DUT Buffer Time 
>> Correction".
>
> If we were talking about a 'correction factor' I would agree,
> but we have two DUT buffer times distinguished by adjectives,
> which seems more straightforward to me.

No, I was not talking about any factor.
I was talking about a difference.

Let me use some artificial numbers.
Say we have a system with Measured Throughput of 10 fps,
Max Theoretical Frame Rate of 100 fps,
and buffer size of 1000 frames.
According to the second order model, the buffer filling speed
during the b2b measurement is 90 fps, so the buffer gets full
after 11.1111 seconds. Ideally, we would measure
1111.11 b2b frames before loss happens.
The formula for Implied DUT Buffer Time would give us back
the 11.1111 seconds value.
But plugging that into the Corrected DUT Buffer Time formula
would give us just 1.11111 seconds.
No reasonable load gives this short a buffer time.
But, subtracting 1.1111 s from 11.1111 s gives us 10.0 s,
which is the buffer time with max offered load and blocked processor.
Thus, 1.1111 seconds is the correction to be subtracted,
not the final corrected result.
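
The arithmetic above can be replayed with a short sketch. It assumes that
the Implied DUT Buffer Time is the b2b frame count divided by the Max
Theoretical Frame Rate, and that the Corrected DUT Buffer Time formula
multiplies the implied time by Measured Throughput / Max Theoretical Frame
Rate (which is consistent with the 1.1111 s figure). The function and
variable names are illustrative only.

```python
def worked_example(measured_throughput=10.0, max_frame_rate=100.0,
                   buffer_size=1000.0):
    """Replay the artificial-number example under the second order model."""
    # Buffer fills at offered load minus processing speed (90 fps here).
    filling_speed = max_frame_rate - measured_throughput
    time_to_fill = buffer_size / filling_speed          # about 11.1111 s
    # Ideal b2b frame count: frames offered until the buffer is full.
    b2b_frames = time_to_fill * max_frame_rate          # about 1111.11 frames

    # Assumed draft formulas (see the note above the code).
    implied_time = b2b_frames / max_frame_rate          # about 11.1111 s
    correction = implied_time * measured_throughput / max_frame_rate  # about 1.1111 s

    # Subtracting the correction gives the buffer time for max offered load
    # with a blocked processor, which equals buffer_size / max_frame_rate.
    blocked_time = implied_time - correction            # about 10.0 s
    return {
        "b2b_frames": b2b_frames,
        "implied_time": implied_time,
        "correction": correction,
        "blocked_time": blocked_time,
        "buffer_size_estimate": blocked_time * max_frame_rate,
    }


for key, value in worked_example().items():
    print(f"{key}: {value:.4f}")
```

The last entry shows that the buffer size (in frames) is recovered only
after the subtraction, not from the 1.1111 s value on its own.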

Vratko.

[2] https://tools.ietf.org//rfcdiff?url1=draft-ietf-bmwg-b2b-frame-00&url2=draft-ietf-bmwg-b2b-frame-01

-----Original Message-----
From: MORTON, ALFRED C (AL) <acm@research.att.com> 
Sent: Thursday, November 7, 2019 1:58 AM
To: Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco) <vrpolak@cisco.com>
Cc: bmwg@ietf.org
Subject: RE: [bmwg] I-D Action: draft-ietf-bmwg-b2b-frame-00.txt

Hi Vratko, Thanks for your review!

I just now re-discovered your comments, and have addressed them in the working version of the draft (where I had implemented changes to address Maciek's comments!). I will submit the new draft when submission opens up again - sorry for the omission!

Al
(as a participant/author)

> -----Original Message-----
> From: Vratko Polak -X (vrpolak - PANTHEON TECHNOLOGIES at Cisco) 
> [mailto:vrpolak@cisco.com]
> Sent: Thursday, July 4, 2019 1:42 PM
> To: MORTON, ALFRED C (AL) <acm@research.att.com>
> Cc: bmwg@ietf.org
> Subject: RE: [bmwg] I-D Action: draft-ietf-bmwg-b2b-frame-00.txt
> 
> Hi Al.
> 
> Sorry for being late with the review.
> I wanted to think some things through, and then did not find a big 
> enough chunk of time to write everything down.
> 
> First, narrow comments.
> 
> * Second line of 1. Introduction:
> 
> > in[RFC2544], supported by the terms and definitions in [RFC1242], 
> > and
> 
> Missing space after the first "in".
> 
> * Point 4 in chapter 3 seems to be referring to two different estimates:
> 
> > It was found that the actual buffer time in the DUT could be 
> > estimated using results from the Throughput tests conducted 
> > according to Section 26.1 of [RFC2544], because it appears that the 
> > DUT's frame processing rate may tend to increase the estimate.
> 
> Which estimate is increased, relative to what?
> Is it "implied estimate" tending to be larger than "corrected estimate"?
> 
[acm]
Thanks for pointing out this ambiguity - I have re-worded this long sentence for clarity.

> * Within 5.4:
> 
> > (assuming a simple model of the DUT
> > composed of a buffer and a forwarding function)
> 
> A reasonable simplification, but could be mentioned in Motivation, so 
> 5.4 does not need a sentence in parentheses.
[acm]
I used this as an opportunity to remind the reader with a reference to Section 3.

> 
> * Still within 5.4:
> 
> > Corrected DUT Buffer Time
> 
> This is a big one.
> The correct name for this quantity could be "DUT Buffer Time 
> Correction".
[acm]
If we were talking about a 'correction factor' I would agree, but we have two DUT buffer times distinguished by adjectives, which seems more straightforward to me.

> 
> * still 5.4:
> 
> > and the Buffer size is more accurately estimated by excluding them.
> 
> Can be turned into a formula, where
> "Corrected DUT Buffer Time" can be used in a more natural meaning.
> 
>   Corrected DUT Buffer Time = Implied DUT Buffer Time - DUT Buffer Time Correction
[acm]
You are defining a new term here (?), DUT Buffer Time Correction (factor), which could be the unit-less term:

    Measured Throughput
  --------------------------
  Max Theoretical Frame Rate

But we only use this term once, and it contains terms we've defined using quantities easily understood, so that the meaning and limitations of this factor are clear (when numerator and denominator are equal, there's no correction needed).

> 
> * In 6. table:
> 
> > Min,Max,StdDev
> 
> Not sure if it is clear these refer to B2B length statistics.
> Include information that these quantities are to be expressed in frames.
[acm]
Ok

> 
> 
> And now some broader comments.
> 
> * Suspended processor in production:
> 
> > useful to estimate whether frame losses will occur if DUT forwarding 
> > is temporarily suspended in a production deployment
> 
> This is the main benefit, and Introduction already mentions 
> "compensating for disruptions in the software packet processor".
> 
> But I think it would be good to classify the different scenarios leading 
> to frame loss, which could explain both buffer times (implied and 
> corrected) and how to use them outside back-to-back load.
> 
> In a back-to-back test the processor is processing while the buffer is 
> being filled by maximal traffic.
> The time it takes to fill is the "implied buffer time".
> 
> In a hypothetical scenario where processor is suspended and buffer is 
> being filled by maximal traffic, the time to fill the buffer is 
> shorter, the "corrected buffer time".
> 
> A scenario occurring in practice has suspended processor, but the 
> buffer is being filled more slowly, say at throughput rate.
> Deployers wishing to predict the time for the buffer to fill up can 
> use this formula:
>   Real Buffer Time = Corrected Buffer Time * B2B Frame Rate / Real Frame Rate
> That is why reporting corrected (instead of implied) buffer 
> time is useful.
[acm]
Yes, vCPU operation can be suspended, and processors can appear suspended while they handle higher priority processes that interrupt the data plane operations for a significant amount of time.
But orchestrating these suspensions/interrupts for measurements isn't practical.  

Nevertheless, the calculation above is useful, I think, so I added it below the table of results, but I think the correction should be based on the Measured Throughput we used before (B2B would be = Max Theoretical).
We have the corrected time using Measured Throughput and add time to that value.

> 
> Also, it would be nice to name the scenarios and rename the buffer times.
> For example, "running buffer time" and "suspended buffer time"
> (instead of "implied" and "corrected" respectively).
[acm]
I see where you are going, but the term "suspended buffer time"
seems counterintuitive. The packet forwarding is running or suspended, not the buffer, so we need a longer term like "suspended forwarding buffer time", which, if we measured correctly, is just the accurate estimate of "buffer time".

> 
> * B2B processor rate:
> 
> For the computation of the corrected buffer time to be correct, real 
> processor frame processing rate (average during B2B test) should be 
> used instead of Measured Throughput in the 5.4 formula.
> 
> The process rate is not easy to measure directly, especially if the 
> immediate rate varies over the duration of B2B traffic.
> I agree that Throughput is a reasonable approximation, but there may 
> be other quantities (e.g. FRMOL [1]), that are either a better 
> approximation, or at least easier to measure.
> 
> Not sure how much attention other such quantities should get in the 
> draft, as Throughput has the advantage of avoiding some frame sizes.
[acm]
Right, that's a big advantage.
I have some sympathy for using other approximations/measurements of the "real" forwarding rate. But FRMOL may not be the best alternative, because overload behavior could be degenerative.
The definition of FRMOL goes on to say:

   Discussion:

      Forwarding rate at maximum offered load may be less than the
      maximum rate at which a device can be observed to successfully
      forward traffic.  This will be the case when the ability of a
      device to forward frames degenerates when offered traffic at
      maximum load.

> 
> * DUT vs SUT:
> 
> This is related to the final items of 4. Prerequisites.
> 
> > Therefore, sources of packet loss
> > that are un-related to consistent evaluation of buffer size SHOULD 
> > be identified and removed or mitigated.
> 
> Do we have a separate document discussing differences between testing 
> DUT and SUT? We should have.
> 
> Usually I prefer testing SUT (meaning no extra mitigations), but in 
> this case, for the analysis of the three aforementioned scenarios to 
> work correctly, we need to make reasonably sure the processor is not 
> going to get suspended during B2B test.
[acm]
In BMWG's literature, a SUT is composed of multiple DUTs in an expected arrangement, or typical of a planned deployment.

> 
> Also, I agree that an average result of Binary Search with Loss 
> Verification gives a more realistic process rate estimate than an 
> average result without loss verification.
[acm]
Good.  Thanks again for your comments!
> 
> Vratko.
> 
> [1] https://tools.ietf.org/html/rfc2285#page-16