Re: [bmwg] Mean vs Median

Hi Marius,

On your first comment below, yes, we need to move beyond the 
RFC2544 latency of a single packet in new work (while keeping the 
intent of earlier work in mind).
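
To put rough numbers on that (a back-of-the-envelope sketch only; the
20 ms spacing, the 60 s window, and the 1000-sample target are
illustrative assumptions on my part, not values from any draft):

```python
# Illustrative arithmetic only; all inputs below are assumed values.

def packets_in_window(spacing_ms: int, window_s: int) -> int:
    """Number of evenly spaced packets sent during a measurement window."""
    return (window_s * 1000) // spacing_ms

def total_hours(trials: int, trial_duration_s: int) -> float:
    """Wall-clock time, in hours, for back-to-back trials of fixed duration."""
    return trials * trial_duration_s / 3600

# One tagged packet every 20 ms (VoIP-like spacing) over a 60 s window.
tagged_per_trial = packets_in_window(spacing_ms=20, window_s=60)

# One tagged frame per 120 s trial, repeated until 1000 samples are collected.
hours_single_tag = total_hours(trials=1000, trial_duration_s=120)

print(tagged_per_trial, round(hours_single_tag, 1))
```

With these inputs, a single 60 s window yields 3000 delay samples,
while collecting 1000 samples at one tagged frame per trial takes
about 33 hours.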

On your second point, relaying Scott's comment about hard and fast rules:
I guess I would tend toward reporting the median for any distribution
(unlike the mean, it is not skewed by outliers), but I still believe
that one statistic is not enough: a metric of variation is also needed.
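
As a small illustration of why one number is not enough (a sketch with
made-up latency values; the summarize() helper and the choice of IQR
and 99th percentile as variation metrics are my assumptions for the
example, not a proposal):

```python
import statistics

def summarize(latencies_us):
    """Return central and variation statistics for latency samples."""
    ordered = sorted(latencies_us)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    pct = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": pct[98],
        "iqr": pct[74] - pct[24],  # 75th minus 25th percentile
    }

# Hypothetical sample: 980 packets at 100 us plus 20 slow packets at 5000 us.
samples = [100.0] * 980 + [5000.0] * 20
stats = summarize(samples)

for name, value in stats.items():
    print(f"{name}: {value}")
```

Here the median (100) ignores the slow packets, the mean (198) is
pulled up by them, and the IQR is 0, so only the high percentile
(5000) reveals the tail.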

Al
(as participant)

> -----Original Message-----
> From: bmwg [mailto:bmwg-bounces@ietf.org] On Behalf Of Marius Georgescu
> Sent: Tuesday, November 10, 2015 12:25 AM
> To: bmwg@ietf.org
> Subject: Re: [bmwg] Mean vs Median
> 
> Hello Al,
> 
> Thank you very much for joining the discussion.
> Please find my comments inline.
> 
> > On Nov 10, 2015, at 10:40, MORTON, ALFRED C (AL) <acmorton@att.com>
> > wrote:
> >
> > Hi Marius, Paul, and all who have contributed so far.
> >
> > a quick reply/differing opinion below.
> >
> >> -----Original Message-----
> >> From: bmwg [mailto:bmwg-bounces@ietf.org] On Behalf Of Marius
> >> Georgescu
> > ...
> >>> On Nov 10, 2015, at 02:40, Paul Emmerich <emmericp@net.in.tum.de>
> >>> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On 03.11.15 09:45, Stenio Fernandes wrote:
> >>>> a word of caution here... a number of phenomena in computer
> >>>> networks follows a heavy-tailed probability distribution function,
> >>>> which means that there is a non-negligible probability that a
> >>>> random variable will take huge values. these values might be
> >>>> erroneously considered as outliers.
> >>>
> >>> This is a really important point. I have benchmarked software where
> >>> the 99th percentile of the latency is twice the average/median and
> >>> the 99.9th percentile ten times the average/median.
> >>
> >> Can you give us more context (test setup; physical/virtualized
> >> tester/DUT; one tester/sender_receiver tester ... ) on these
> >> measurements?
> > [ACM]
> > My understanding (and I've seen some results, but I've had trouble
> > re-locating them) is that both outliers and bimodal distributions are
> > more common in the world of virtual DUTs than they were in the
> > physical/past. Not only does this affect analysis, but the threshold
> > waiting time for packet arrival must be chosen carefully to even
> > measure such outliers.
> >>
> >>> This is an important performance characteristic for
> >>> latency-sensitive applications that isn't captured by taking just 20
> >>> measurements. So I'd really like to see a standard that calls for
> >>> thousands of latency measurements to capture this properly.
> >>>
> >>
> >> I think we should keep practicality in mind here. If we follow
> >> RFC2544.latency measurement, the frame stream has to be 2 min long.
> >> 2000 min ~ 33h of testing for just one test sounds unreasonable to
> >> me. I would agree to have a lower bound for the sample size as
> >> RFC2544 actually recommends (n > 20).
> > [ACM]
> > Latency (delay) and delay variation need many single delay
> > measurements to be meaningful. One way to view the variation is for a
> > single flow of packets with spacing that might come from an
> > application, say 20ms spacing for VoIP. Collecting a few thousand
> > such packets should not take so long.
> [MG]
> I think there is one thing that needs clarification. The procedure in
> RFC2544 says:
> 
> “The stream SHOULD be at least 120 seconds in duration. An identifying
> tag SHOULD be included in one frame after 60 seconds with the type of
> tag being implementation dependent.”
> 
> I never got to ask Scott this, but would it make sense to tag more than
> one frame? If we tag all the frames in the stream, we would have
> (depending on the throughput) thousands of measurements in 60 seconds.
> 
> >
> >>
> >>> You can also get interesting insights into a black-box device by
> >>> looking at histograms/probability density functions. For example,
> >>> you can figure out if the device processes packets in batches,
> >>> estimate the batch size, figure out at which rates interrupt
> >>> moderation algorithms change etc. (This is, of course, not really a
> >>> performance metric, just an interesting insight.)
> >>>
> >>
> >> I agree this is an interesting insight. It can also be the base for a
> >> decision between summarizing functions. However, in the light of
> >> consistency and simplicity of the methodology, I think we would need
> >> to recommend one function. We could do that depending on the
> >> metric/DUT characteristics, previous testing behavior …
> > [ACM]
> > I agree the right summary statistics can only be chosen after an
> > examination of the raw distribution for a particular scenario.  If
> > bimodal, the central statistics of the sample could be meaningless.
> > Without this examination, I don't think one recommendation can always
> > be right.
> >
> [MG] I agree that analyzing the probability distribution is the best way
> to choose, but maybe not the most consistent procedure. I had a short
> discussion about this with Scott before the BMWG meeting. In my
> understanding of his feedback, the consistency in recommending a
> summarizing function is more important. In other words, if we leave room
> for interpreting the data, the results might be more misleading than if
> we choose the “wrong” summarizing function.
> 
> Best regards,
> Marius