Re: [bmwg] RAI-ART review of draft-ietf-bmwg-sip-bench-term-04 and draft-ietf-bmwg-sip-bench-meth-04

Carol Davids <davids@iit.edu> Mon, 16 July 2012 19:09 UTC

To: "Worley, Dale R (Dale)" <dworley@avaya.com>, "rai@ietf.org" <rai@ietf.org>, Al Morton <acmorton@att.com>, "draft-ietf-bmwg-sip-bench-term@tools.ietf.org" <draft-ietf-bmwg-sip-bench-term@tools.ietf.org>, "bmwg@ietf.org" <bmwg@ietf.org>
Cc: Mary Barnes <mary.ietf.barnes@gmail.com>
Message-id: <CC29C656.21417%davids@iit.edu>
In-reply-to: <CD5674C3CD99574EBA7432465FC13C1B22726A0A88@DC-US1MBEX4.global.avaya.com>

Dale,

Thanks very much for your careful review of the drafts.  Please see
below for our responses and descriptions of what we have done to address
the issues.  We highlight your comments under "RAI-ART REVIEW COMMENT:"
and indicate our responses under "RESPONSE: CD/VG:".

We also appreciate your pointing out the need for a final editorial review
and have begun that work using the detail you provided.  We plan to post
the edited version in time for IETF 85 in November.

Best,

Carol 


Carol Davids
Email: davids@iit.edu
Skype: caroldavids1
..................................................................

RAI-ART REVIEW COMMENT:
I. Technical issues
A. Media
The drafts seem to be undecided as to whether they want to benchmark
media processing or not.  At one point, it says that the benchmarking
is entirely in the signaling plane.  At another point, it is specified
how to set the number of media streams per INVITE dialog.  I suspect
the conflict is that there are devices the WG wants to benchmark which
do handle the media and their media handling affects performance
significantly, but the WG is unsure how to parametrize media
benchmarking.

RESPONSE: CD/VG: 
The media parameters need to be specified because, if the DUT/SUT
processes media, the media processing will impact SIP performance,
using cycles that could otherwise have been used for SIP-related
processing.  We do not measure the performance of the audio and/or
video, but we need to document the conditions under which the test was
conducted if we are to be able to compare test results.


We are measuring signaling-plane throughput, but this throughput will
vary depending upon the conditions of test.  One important condition of
test is media processing: is the DUT processing media streams, and if
so, how many, and what parameters define them?  We note the presence
and character of the media streams being processed by the device
without measuring the quality of the resulting media.

We define the media according to the packet size, type of codec, and
number of streams.  These parameters describe the conditions under
which the test is performed, but we do not measure the quality or other
characteristics of the resulting media.  We expect that the more work
we ask the device under test to do, the lower its SIP throughput will
be.
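
To illustrate why recording the detailed encoding side-steps the
packet-size ambiguity, here is a back-of-the-envelope sketch (ours, not
the draft's) that derives the packet size at each encapsulation layer,
plus the packet rate, from just the codec and packetization time; the
G.711 figures are standard:

    # Hypothetical illustration: derive per-stream packet size and rate
    # from codec parameters, rather than reporting a bare "packet size".
    CODEC_PAYLOAD_BYTES_PER_MS = {"G.711": 8}  # 64 kbit/s -> 8 bytes/ms

    def media_stream_profile(codec="G.711", ptime_ms=20):
        payload = CODEC_PAYLOAD_BYTES_PER_MS[codec] * ptime_ms  # 160 B
        rtp = payload + 12            # + RTP header
        udp = rtp + 8                 # + UDP header
        ip = udp + 20                 # + IPv4 header
        pps = 1000 // ptime_ms        # 50 packets/s at 20 ms ptime
        return {"payload_B": payload, "rtp_B": rtp, "udp_B": udp,
                "ip_B": ip, "packets_per_s": pps,
                "ip_kbps": ip * pps * 8 / 1000}  # 80.0 kbps at the IP layer

    print(media_stream_profile())

The same stream is "160 bytes" at the codec payload, "200 bytes" at the
IP layer; reporting codec and ptime removes the ambiguity.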



RAI-ART REVIEW COMMENT:


The simplest situation, of course, is if the DUT is a proxy and the
media bypasses it entirely.  Beyond that, even in the simplest media
relaying situations, significant processing costs can arise.  Even if
the DUT isn't transcoding, it may have to rewrite the RTP codec
numbers (as it may have modified the SDP passing through it).  In more
complex situations, the DUT may have to transcode the codecs.

Some consideration has to be made of how to specify exactly what the
media loads are.  SDP already has ways to specify the exact encoding
of media streams, including the number of samples per packet, which
implies the packet size and packet rate.  These drafts only allow
specifying the packet size as a benchmark parameter, but packet size
is notoriously ambiguous -- which layers of encapsulation of the
encoded media are counted?  But specifying the detailed encoding
side-steps that question.

RESPONSE: CD/VG:
We will add a note that the testing organization may
choose to specify more characteristics of the associated media and may keep
records of comparative results.  Our goal
in this draft is to describe a method for measuring SIP throughput of the
device under test.  Organizations are
encouraged to identify boundary conditions that they deem important and to
perform tests under as many variants of these boundary conditions as they
wish.




RAI-ART REVIEW COMMENT:

The draft wants to consider the duration of a media stream to be
separately adjustable from the duration of the containing dialog, but
the draft explicitly places out of scope the re-INVITE which is
necessary to accomplish that realistically (that is, with the
signaling matching the presented media packets).


RESPONSE: CD/VG:
The parameters of test are recorded in the test setup report.  The
session attempt rate and the total number of sessions to be attempted are
identified in this report.  These two numbers determine the total length
of the test.  The duration of a session is also identified in this report,
but the name we assigned that parameter is "Media session hold time."  We
will change the name of the parameter to "Session hold time." When the
INVITE-initiated session includes media, then the session hold time
represents the duration of the media session.  Whether or not we include
media, the session ends when the BYE message is received by the emulated
agent. We do not include the duration of a stream among the parameters of
test. We do allow there to be multiple media streams per session, but the
session is ended by a BYE.

Re-INVITEs are a different question.  They do not affect the duration of
the session.  But it is true that they consume processing cycles. We will
discuss this further and appreciate your identifying the issue.

We will also add a test parameter that describes the time between
successive call attempts by the emulated agent. We recommend setting this
parameter to 0 since setting it at a higher value will make the
testing-to-failure take longer.
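
To make the relationships among these reported parameters concrete,
here is a back-of-the-envelope sketch (the numbers are ours and purely
illustrative; it assumes the recommended inter-attempt gap of 0):

    # Hypothetical sketch of how the reported test parameters relate.
    attempt_rate = 100        # session attempts per second (reported)
    total_attempts = 60000    # total sessions to be attempted (reported)
    hold_time_s = 30          # "Session hold time" (reported)

    test_length_s = total_attempts / attempt_rate        # 600 s
    # Little's law: average number of sessions open at once in
    # steady state, which bounds the DUT's concurrent-session load.
    concurrent_sessions = attempt_rate * hold_time_s     # 3000

    print(test_length_s, concurrent_sessions)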



RAI-ART REVIEW COMMENT:


B. INVITE dialog structure
The drafts seem to want to consider the establishment (or failure
thereof) of an INVITE dialog to be instantaneous, after which the
dialog continues for a chosen length of time, and then vanishes
instantly.  Little or no consideration is given to the various
scenarios of call establishment, including the most common case:
INVITE sent -- 183 response -- significant delay -- 200 response.
Dialog teardown is not conceptualized as being a processing step that
involves significant cost and may fail:  "Session disconnect is not
considered in the scope of this work item."

This lack of consideration is enhanced in the forking cases, as the
variety of scenarios (and their durations) increases.

In addition, the drafts only consider forking which is done within the
DUT, whereas it will be common in practice for forking to be done
downstream of the DUT, presenting the DUT with a stream of 1xx
responses from multiple endpoints, with a 2xx after an extended delay.

Also, in regard to signaling benchmarking, INVITEs that ultimately
fail are likely to be as costly as INVITEs that succeed, but there
doesn't seem to be a defined parameter "fraction of attempted calls
which succeed" (which controls the callee EAs).


RESPONSE: CD/VG:

RE DELAY: 
The delay in responding with a 200 OK after getting an INVITE is not
specified in the current methodology document; the response is assumed
to be immediate.  Our intent is to stress the DUT as quickly as
possible.  Introducing a delay serves to increase the time before the
DUT produces its first stress-induced error.  A high interval (> 32 s)
may cause the DUT to enter a stable state and not be subject to stress.
For this reason, we intentionally chose not to introduce delays before
issuing a 200 OK.

Note that many user agents automatically introduce delays by first
sending a 180 Ringing, etc.  Any additional artificial delay, while
easy to introduce, would be one more tuning parameter subject to
differing interpretations.
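
(For readers wondering about the 32 s figure: we believe it lines up
with SIP Timer B, the INVITE transaction timeout of 64*T1 with the RFC
3261 default T1 of 500 ms; a 200 OK delayed beyond that point would
simply time the transaction out rather than stress the DUT.  A trivial
check, ours and not the draft's:

    # SIP INVITE transaction timeout (Timer B), RFC 3261 defaults.
    T1 = 0.5            # seconds, default round-trip-time estimate
    TIMER_B = 64 * T1   # 32.0 seconds
    print(TIMER_B)

)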


RE FORKING: 
Performing the forking in the DUT was a design choice: we wanted to
model, at the DUT, the complexity and delay caused by forking N
branches downstream and collating the responses.  In other words, our
assumption is that the DUT is a proxy (or a B2BUA) and is doing the
forking and response collating.  The case in which the tester arranges
for forking to happen downstream of the DUT is automatically covered
when the device acting as the DUT is a user agent client and the next
downstream SIP entity forks the request and presents the responses to
the DUT.


RE THE COST OF FAILURES:
Regarding the contention that "INVITEs that ultimately fail are likely
to be as costly as INVITEs that succeed, but there doesn't seem to be a
defined parameter 'fraction of attempted calls which succeed' (which
controls the callee EAs)" --- it seems to us that the Session
Establishment Performance benchmark (please see Section 5.2) covers
this.


RAI-ART REVIEW COMMENT:


C. Loop detection
All discussions of loop detection need to be based on the revised loop
detection requirements in RFC 5393.


RESPONSE: CD/VG:
We will update the methodology document to ensure that loop detection
is based on the revised loop detection requirements in RFC 5393.  This
is a good catch!
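
For context, the mechanism we will align with is the Max-Breadth header
field that RFC 5393 adds for parallel forking: roughly, a forking proxy
divides the incoming Max-Breadth among its concurrent branches, giving
each at least 1, and must fork sequentially or reject with 440
(Max-Breadth Exceeded) if it cannot.  A rough sketch of that division,
as we read the RFC (the function is ours, for illustration only):

    # Sketch of the RFC 5393 Max-Breadth rule for parallel forking:
    # the incoming Max-Breadth value is divided among the concurrent
    # branches, each branch receiving at least 1, so the tree of
    # parallel branches stays bounded.
    def split_max_breadth(incoming: int, n_branches: int) -> list[int]:
        if n_branches > incoming:
            # Cannot fork this widely in parallel: fork sequentially
            # instead, or answer 440 (Max-Breadth Exceeded).
            raise ValueError("440 Max-Breadth Exceeded")
        share, extra = divmod(incoming, n_branches)
        return [share + (1 if i < extra else 0) for i in range(n_branches)]

    print(split_max_breadth(60, 4))   # [15, 15, 15, 15]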



RAI-ART REVIEW COMMENT:

D. Authentication
In some SIP operations, authentication is commonly done.  This can
have various effects on the message flows that need to be taken into
account in the benchmarks.

For instance, a registrar may require that the registering UA
authenticate itself.  Commonly, the UA sends a REGISTER request, which
is rejected with 401 because it contains a nonce that is too old.  The
UA then immediately sends another REGISTER with the nonce provided in
the 401 response, and that request receives a 200 response.

In this scenario, the number of effective REGISTER requests is half of
the total REGISTER requests, leading to an apparent attempt failure
rate of 50%, even though the middlebox is doing the Right Thing 100%
of the time.  This suggests that the definition of "attempt failure"
needs to be updated so that a 4xx response "passed upstream" by the
DUT is not counted as an attempt failure.

In other scenarios, the DUT itself might be expected to enforce SIP
authentication, which would require a somewhat different definition of
attempt failure, and would be expected to have lower throughput.

So some thought needs to be given as to whether these scenarios are
to be benchmarked, and to document how authentication is to be handled
in whatever benchmarks are defined.

RESPONSE: CD/VG:


We agree with your analysis, and in fact, the terminology document
already recognizes this through its scope (please see Section 2.1 of
the terminology document).  It says:

    o  REGISTER and INVITE requests may be challenged or remain
       unchallenged for authentication purpose as this may impact the
       performance benchmarks.  Any observable performance degradation
       due to authentication is of interest to the SIP community.
       Whether or not the REGISTER and INVITE requests are challenged is
       a condition of test and will be recorded and reported.

However, in the methodology document, we do not include any further
guidance for this.  As you point out, there is a need for some
guidance.  To remedy this, we propose to add the following in Section
5.1 of the methodology document:

    Authentication option = ___________________________________
      (on|off; if on, please see Note-2 below).
    Number of responses of the following type:
      401:       _____________  (if authentication turned on; N/A
                                 otherwise)
      407:       _____________  (if authentication turned on; N/A
                                 otherwise)
      2xx-class: _____________
      1xx-class: _____________
      Others:    _____________
This information will allow the tester to analyze how many 401/407
responses were received and to adjust the metrics in Sections 5.2 and
5.3 accordingly.
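
For illustration, one way the tester might make that adjustment, so
that a challenge followed by a successful retry is not scored as an
attempt failure (the function and field names are ours, not the
draft's):

    # Illustrative only: adjust registration-attempt accounting so a
    # 401/407 challenge followed by a retry with credentials is not
    # counted as an independent, failed attempt.
    def effective_attempt_stats(total_requests, challenges_401_407,
                                successes_2xx):
        # Each challenge consumes one request that is then retried
        # with credentials, so it is not a separate attempt.
        effective_attempts = total_requests - challenges_401_407
        failures = effective_attempts - successes_2xx
        return effective_attempts, failures / effective_attempts

    # Dale's scenario: 1000 REGISTERs, 500 challenged then retried,
    # 500 ultimately succeed -- 0% failure, not an apparent 50%.
    print(effective_attempt_stats(1000, 500, 500))  # (500, 0.0)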










