Re: [codec] Audio tests: Further steps

On Tue, Apr 23, 2013 at 05:34:55PM -0400, Paul Coverdale wrote:
> I don't know why you're pouring scorn on this exercise, Ron. It seems to me
> that it is a bona-fide attempt to understand the strengths and weaknesses of
> the Opus codec in a controlled, unbiased manner, what a characterisation
> test should do.

As I said already, I think we can probably find _something_ interesting in
just about any test that someone wants to run, and personally, given that
the encoder is still undergoing active improvement, I suspect that tests
which do go out of their way to find genuine current weaknesses of Opus
in some edge cases will possibly be the most useful of all for suggesting
areas where immediate improvement may be found.

Openly finding and fixing those things really has been one of the
greatest strengths of this process, and the results compared to a "closed
development, final shootout" methodology have already spoken convincingly
for themselves.  So there's no reason to stop doing that now.
That's fine.  And Good.

A test that found nowhere Opus might improve would be far less helpful, and
would only show that the test was blind to already known realities.

But I also think it's very important for any statistical test to be very
clear about what hypothesis it is trying to test, if people are going to
attach conclusions to it and claim it to be Science.

Starting with hand-picked samples, deliberately biased to be the hardest
known *for a particular coding method*, pretty much instantly excludes
"fair comparison" from being a meaningful hypothesis.  Putting a lump of
lead on the corner of one die and concluding "hey this one rolls all 6's"
isn't science, it's applied math, or a carny game.  At best.
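
To make the loaded-die point concrete, here is a minimal sketch.  It is
not part of the proposed test plan, and everything in it is an assumption
chosen for illustration: two hypothetical codecs, a 1-5 style score scale,
per-sample scores drawn independently from the same distribution, and the
corpus and selection sizes.  It only shows what hand-picking the hardest
samples for one codec does to an otherwise even comparison.

```python
import random


def simulate(n_corpus=10_000, n_picked=20, seed=0):
    """Compare two statistically identical codecs on a biased sample set.

    All numbers here are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)

    # Assumption: both codecs score each sample from the SAME distribution,
    # so over the whole corpus neither is better than the other.
    scores_a = [rng.gauss(4.0, 0.5) for _ in range(n_corpus)]
    scores_b = [rng.gauss(4.0, 0.5) for _ in range(n_corpus)]

    def mean(xs):
        return sum(xs) / len(xs)

    # Hand-pick the samples that are hardest *for codec A only*.
    hardest_for_a = sorted(range(n_corpus), key=lambda i: scores_a[i])[:n_picked]

    return {
        "corpus_a": mean(scores_a),
        "corpus_b": mean(scores_b),
        "picked_a": mean([scores_a[i] for i in hardest_for_a]),
        "picked_b": mean([scores_b[i] for i in hardest_for_a]),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Over the whole corpus the two means agree closely, while on the
hand-picked subset codec A looks far worse, purely because of how the
subset was chosen.  Real codecs will not have independent per-sample
scores, so treat this as a picture of the selection effect, not a model
of Opus or AAC-eLD.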

That doesn't mean you can't make _any_ hypothesis about what the results
may be testing.  But it's important to state that hypothesis clearly in
advance, if you want to genuinely analyse the obvious flaws in the proposed
method before you start looking for evidence of pixies in the ink blots.

What does this test aim to demonstrate if it's obviously not a "fair
comparison" as it is currently proposed?  How do you plan to calibrate
and normalise the relative "killerness" of the selected biased samples
in a way that would make any conclusion beyond "Hard sample is Hard"
be even remotely meaningful as a "comparison", unfair or otherwise?

Knowing new hard samples would be great.  Comparing apples and oranges,
less so.  The oranges always win.

  Rigorously peer reviewed,
  Ron

> >-----Original Message-----
> >From: codec-bounces@ietf.org [mailto:codec-bounces@ietf.org] On Behalf
> >Of Ron
> >Sent: Tuesday, April 23, 2013 3:31 PM
> >To: codec@ietf.org
> >Cc: cs.wg2.qualinet@listes.epfl.ch
> >Subject: Re: [codec] Audio tests: Further steps
> >
> >On Tue, Apr 23, 2013 at 09:50:16AM +0200, Christian Hoene wrote:
> >> Hi,
> >>
> >> currently, the codec comparison tests are running. Because of the
> >> request of many codec developers, we plan to extend those tests: We
> >> might add audio tests in which the content is varied to a large
> >> extent. For that, we need samples that cannot be compressed well by
> >> Opus or AAC-eLD. For me, it is easy to get those difficult samples for
> >> Opus. It is much more challenging to get those for AAC-eLD. Thus, if
> >> somebody has the time to study the weaknesses of AAC-eLD, please
> >> forward me the samples.
> >
> >Uhm, so ...  while I'm certain that the codec developers will be
> >delighted if you can point out any new killer samples that they aren't
> >yet aware of (since significant work has already been done to improve
> >the encoder for the known ones, and that work is still ongoing) -- I'm
> >also pretty certain that going out of your way to deliberately select
> >such samples immediately disqualifies this from being characterised as a
> >"comparison test", or at least claiming that it's even remotely
> >representative of what people will observe over a general corpus of
> >their own audio, given the degree to which such samples really are
> >outliers.
> >
> >> I cannot start fair tests if I do not have challenging samples for
> >> both codecs.
> >
> >While such a test might have some novelty value to show "here are some
> >non-exhaustive results for the worst samples that we could find in a few
> >days of searching", I'm pretty sure words like "fair" and "scientific
> >rigour" don't really belong in the same sentence.  Not least when
> >you also say "we have the established list for one codec, but the known
> >killers for the other are at present entirely unknown to us".
> >
> >If you want to spend your time doing that, that's fine, and the results
> >may well be 'interesting'.  But mischaracterising them as a "comparison"
> >test would just be somewhere on the spectrum from "mildly amusing" to "a
> >sad day for Modern Science".
> >
> >It's your reputation though, and I can't tell you how to spend it.
> >But you might want to think this through a little better if you are
> >going to paint this with the brush of Being Science.
> >
> >This isn't the cosmetics industry, other people can measure these things
> >too, and will continue to for some time to come.
> >
> > Cheers,
> > Ron
> >
> >
> >_______________________________________________
> >codec mailing list
> >codec@ietf.org
> >https://www.ietf.org/mailman/listinfo/codec