[bmwg] Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt> (Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices) to Informational RFC

Carol Davids <davids@iit.edu> Tue, 04 March 2014 05:01 UTC


Robert, All,

Below are replies to the comments provided by Robert Sparks on January 24, 2014, both the general comments and those specific to the Terminology document. We will reply to the comments related to the Methodology document shortly. These comments were very helpful to us as we wrote version 09 of the documents.

Robert's comments are identified by double asterisks (**). Our responses are identified by triple asterisks (***).
The General comments are identified by us as G: Item 1, G: Item 2, etc.  The
ones specific to a particular section are identified by the section to which
they relate, preceded by a T in the case of comments related to the
Terminology document and by an M in the case of comments related to the
Methodology document.

Regards,

Carol Davids, Vijay Gurbani, Scott Poretsky

Reviews of draft-ietf-bmwg-sip-bench-term-08 and
draft-ietf-bmwg-sip-bench-meth-08
 
**G: Summary: These drafts are not ready for publication as RFCs.
***Response: G: Summary: We have edited the documents in light of the comments provided by Robert and other reviewers, and also in light of experience running the tests in a lab environment with the collaboration of a vendor of a commercial product. We changed the titles of the documents to reflect their scope more accurately; we reduced the number of benchmarks and the number of tests. We reduced the number of distinct test architectures to two and moved the illustrations of the two architectures to the Methodology document for ease of use. Details on these and other changes are inline below.
 
**G: Item 1: First, some of the text in these documents shows signs of being old, and the working group may have been staring at them so long that they've become hard to see. The terminology document says "The issue of overload in SIP networks is currently a topic of discussion in the SIPPING WG." (SIPPING was closed in 2009). The methodology document suggests a "flooding" rate that is orders of magnitude below what simple devices achieve at the moment. That these survived working group last call indicates a different type of WG review may be needed to groom other bugs out of the documents.
 
***Response, G: Item 1:
We removed comments and tests related to flooding from the documents.
 
**G: Item 2: Who is asking for these benchmarks, and are they (still) participating in the group? The measurements defined here are very simplistic and will provide limited insight into the relative performance of two elements in a real deployment. The documents should be clear about their limitations, and it would be good to know that the community asking for these benchmarks is getting tools that will actually be useful to them. The crux of these two documents is in the last paragraph of the introduction to the methodology doc: "Finally, the overall value of these tests is to serve as a comparison function between multiple SIP implementations". The documents punt on providing any comparison guidance, but even if we assume someone can figure that out, do these benchmarks provide something actually useful for inputs?
 
***Response, G: Item 2:
Yes, they are valuable to the community.
1. A major SBC vendor used these documents and the paid services of two students to perform the tests described therein to learn the values of the benchmarks, which were subsequently published for external release.
2. Regarding the measurements being simplistic: they were intentionally designed to be simplistic, because the goal of the BMWG Working Group is not to reproduce real-world traffic in the lab. To quote the BMWG charter: "To better distinguish the BMWG from other measurement initiatives in the IETF, the scope of the BMWG is limited to the characterization of implementations of various internetworking technologies using controlled stimuli in a laboratory environment. Said differently, the BMWG does not attempt to produce benchmarks for live, operational networks."
3. Regarding the documents not providing any comparison guidance: again, that is intentional. The documents were designed such that testing two different implementations will result in two different reports that can then be compared by operations personnel. It is not the job of the document itself to provide comparison guidance. The metrics generated by the methods in these documents define a frontier beyond which "there be dragons."
4. In summary, we believe that these documents are useful and that they have been used by vendors in the community.
 
 
**G: Item 3: It would be good to explain how these documents relate to RFC 6076.
 
***Response, G: Item 3:
The authors have been in contact for several years and agreed that there is little overlap. RFC 6076 relates to the end-to-end performance of a service on a network. These drafts, on the other hand, refer to lab tests of a device.
 
 
**G: Item 4: The terminology tries to refine the definition of session, but the definition provided, "The combination of signaling and media messages and processes that support a SIP-based service", doesn't answer what's in one session vs another. Trying to generically define session has been hard and several working groups have struggled with it (see INSIPID for a current version of that conversation). This document doesn't _need_ a generic definition of session - it only needs to define the set of messages that it is measuring. It would be much clearer to say "for the purposes of this document, a session is the set of SIP messages associated with an INVITE-initiated dialog and any Associated Media, or a series of related SIP MESSAGE requests". (And looking at the benchmarks, you aren't leveraging related MESSAGE requests - they all appear to be completely independent.) Introducing the concepts of Invite-initiated sessions and non-invite-initiated sessions doesn't actually help define the metrics. When you get to the metrics, you can speak concretely in terms of a series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a short introduction pointing folks with PSTN backgrounds relating these to "Session Attempts", will be clearer.

To be clear, I strongly suggest a fundamental restructuring of the document to describe the benchmarks in terms of dialogs and transactions, and remove the IS and NS concepts completely.
 
***Response, G: Item 4: Re-definition of a session:
We believe that the 3D depiction of the session is useful. As we state in the document, the definition is for the purpose of this document only. The reason we created it was to be able to refer to all the different cases: ones in which we have an INVITE-initiated session with media, in which case all three of the components are non-null; the case of an INVITE-initiated session without media, in which case the media and control components are null and only the Sig component is non-null; and the case of non-INVITE-initiated sessions, such as REGISTER and MESSAGE, in which case, again, the only non-null component is the Sig component. We will, in the next revision of the document, refer to the diagram and its nomenclature in our descriptions of the metrics and the test cases. Each test case describes the set of SIP messages and the order in which they should be sent. For this reason we do not need to define a session as this or that set of SIP requests.
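
As an illustration of the three-component view described above, here is a minimal sketch in Python (our own illustration, not text from the drafts; only the component names Sig, Media, and Control come from the discussion, everything else is assumed) of the null/non-null cases:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Session:
        """A session as a 3-tuple of components; None means that
        component is null for this kind of session."""
        sig: str                       # signaling, e.g. "INVITE" or "REGISTER"
        media: Optional[str] = None    # media description, e.g. "RTP/G.711"
        control: Optional[str] = None  # media control, e.g. "RTCP"

    # INVITE-initiated session with media: all three components are non-null
    is_with_media = Session(sig="INVITE", media="RTP/G.711", control="RTCP")

    # INVITE-initiated session without media: only the Sig component is non-null
    is_without_media = Session(sig="INVITE")

    # Non-INVITE-initiated session (e.g. REGISTER): only the Sig component is non-null
    ns_register = Session(sig="REGISTER")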
 
**G: Item 5: The INVITE related tests assume no provisional responses, leaving out the effect on a device's memory when the state machines it is maintaining transition to the proceeding state. Further, by not including provisionals, and building the tests to search for Timer B firing, the tests ensure there will be multiple retransmissions of the INVITE (when using UDP) that the device being tested has to handle. The traffic an element has to handle, and likely the memory it will consume, will be very different with even a single 100 Trying, which is the more usual case in deployed networks. The document should be clear _why_ it chose the test model it did and left out metrics that took having a provisional response into account. Similarly, you are leaving out the delayed-offer INVITE transactions used by 3pcc, and it should be more obvious that you are doing so.
 
Likewise, the media oriented tests take a very basic approach to simulating media. It should be explicitly stated that you are simulating the effects of a codec like G.711 and that you are assuming an element would only be forwarding packets and has to do no transcoding work. It's not clear from the documents whether the EA is generating actual media or dummy packets. If it's actual media, the test parameters that assume constant sized packets at a constant rate will not work well for video (and I suspect endpoints, like B2BUAs, will terminate your call early if you send them garbage).
 
The sections on a series of INVITEs are fairly clear that you mean each of them to have different dialog identifiers. I don't see any discussion of varying the To: URI. If you don't, what's going to keep a gateway or B2BUA from rejecting all but the first with something like Busy? Similarly, I'm not finding where you talk about how many AoRs you are registering against in the registration tests. I think, as written, someone could write this where all the REGISTERs affected only one AoR.
 
***Response, G: Item 5: Why not define all the metrics in terms of dialogs and transactions?
These documents describe black-box testing. The evidence of the existence of transactions is that the session was set up. In the case of a REGISTER request, for example, we see the 200 OK to the REGISTER and know there was a successful session.
 
**G: Item 6: Stress Testing:
The methodology document calls Stress Testing out of scope, but the very nature of the Benchmarking algorithm is a stress test. You are iteratively pushing to see at what point something fails, _exactly_ by finding the rate of attempted sessions per second that the thing under test would consider too high.
 
*** Response, G: Item 6:
These are benchmark tests, designed to find the highest rate at which the system can handle session attempts with no failures of the application itself. The tests stop at the point where a single application error is observed. Stress testing would continue to run, with an ever-increasing number of errors at the application layer, at ever higher rates, until such time as the platform upon which the application runs fails catastrophically, by, for example, rebooting, or stopping operation entirely and failing to reboot.
 
 -    -    -    -    -    -    -    -    -    -    -    -    -    -    -
-    -    -    -    -    -   -
TERMINOLOGY:
Now to specific issues in document order, starting with the terminology
document (nits are separate and at the end):
 
** T (for Terminology document): The title and abstract are misleading - this is not general benchmarking for SIP performance. You have a narrow set of tests, gathering metrics on a small subset of the protocol machinery. Please (as RFC 6076 did) look for a title that matches the scope of the document. For instance, someone testing a SIP Events server would be ill-served with the benchmarks defined here.
 
*** Response: T: The documents have been renamed as follows:
Methodology for Benchmarking Session Initiation Protocol (SIP) Devices: Basic session setup and registration
Terminology for Benchmarking Session Initiation Protocol (SIP) Devices: Basic session setup and registration
 
 
** T, section 1: RFC 5393 should be a normative reference. You probably also need to pull in RFCs 4320 and 6026 in general - they affect the state machines you are measuring.
 
*** Response, T, section 1: Agreed. We have pulled in RFC 5393, RFC 4320, and RFC 6026.
 
 
** T, 3.1.1: As noted above, this definition of session is not useful. It doesn't provide any distinction between two different sessions. I strongly disagree that SIP reserves "session" to describe services analogous to telephone calls on a switched network - please provide a reference. SIP INVITE transactions can pend forever - it is only the limited subset of the use of the transactions (where you don't use a provisional response) that keeps this communication "brief". In the normal case, an INVITE and its final response can be separated by an arbitrary amount of time. Instead of trying to tweak this text, I suggest replacing all of it with simpler, more direct descriptions of the sequence of messages you are using for the benchmarks you are defining here.
 
***Response, T, 3.1.1: Same as response to Item 4: Re-definition of a session:
We believe that the 3D depiction of the session is useful. As we state in the document, the definition is for the purpose of this document only. The reason we created it was to be able to refer to all the different cases: ones in which we have an INVITE-initiated session with media, in which case all three of the components are non-null; the case of an INVITE-initiated session without media, in which case the media and control components are null and only the Sig component is non-null; and the case of non-INVITE-initiated sessions, such as REGISTER and MESSAGE, in which case, again, the only non-null component is the Sig component. Each test case describes the set of SIP messages and the order in which they should be sent. For this reason we do not need to define a session as this or that set of SIP requests.
 
 
**T, 3.1.1: How is this vector notion (and graph) useful for this document? I don't see that it's actually used anywhere in the documents. Similarly, the arrays don't appear to be actually used (though you reference them from some definitions) - what would be lost from the document if you simply removed all this text?
 
***Response, T, 3.1.1: It is not necessary to refer to the diagram after the initial explanation. We do in fact refer to the components of the session in the methodology document.
 
-    -     -     -     -     -    -
 
**T, 3.1.5, Discussion, last sentence: Why is it important to say "For UA-type of network devices such as gateways, it is expected that the UA will be driven into overload based on the volume of media streams it is processing."? It's not clear that's true for all such devices. How is saying anything here useful?
 
***Response: T, 3.1.5: We do not consider gateways anymore, so we have removed this from T, 3.1.5.
 
 
**T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session Attempt. Why not just say INVITE? You aren't actually measuring "session attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.
***Response: T, 3.1.6: Agreed. The definition was modified to say, "A SIP INVITE or REGISTER request sent by the EA that has not received a final response."
 
 
**T, 3.1.7: It needs to be explicit that these benchmarks are not accounting for/allowing early dialogs.
***Response: T, 3.1.7: Agreed. We added a sentence to that effect.
 
 
**T, 3.1.8: The words "early media" appear here for the first time. Given the way the benchmarks are defined, does it make sense to discuss early media in these documents at all (beyond noting you do not account for it)? If so, there needs to be much more clarity. (By the way, this Discussion will be much easier to write in terms of dialogs.)
***Response: T, 3.1.8: We now refer to early pre-call media, following what RFC 3261 does in Section 20.11 when it first talks about early media.
 
 
**T, 3.1.9, Discussion point 2: What does "the media session is established" mean? If you leave this written as a generic definition, then is this when an MSRP connection has been made? If you simplify it to the simple media model currently in the document, does it mean an RTP packet has been sent? Or does it have to be received? For the purposes of the benchmarks defined here, it doesn't seem to matter, so why have this as part of the discussion anyway?
***Response: T, 3.1.9: We did not find that phrase in T, 3.1.9, but we did find a SUBSCRIBE given as an example of an NS session and changed that to a REGISTER.
 
 
**T, 3.1.9, Definition: A series of CANCELs meets this definition.
***Response: We have clarified the fact that we only consider the REGISTER request as an NS. The CANCELs are out of scope.
 
 
**T, 3.1.10 Discussion: This doesn't talk about 3xx responses, and they aren't covered elsewhere in the document.
***Response: T, 3.1.10 Discussion: The 3xx has been added to the list as well. Only the 2xx is considered to be a success.
 
 
**T, 3.1.11 Discussion: Isn't the MUST in this section methodology? Why is it in this document and not -meth-?
***Response: T, 3.1.11 Discussion: Section 3.1.11 was removed in version -09.
 
**T, 3.1.11 Discussion, next to last sentence: "measured by the number of distinct Call-IDs" means you are not supporting forking, or you would not count answers from more than one leg of the fork as different sessions, like you should. Or are you intending that there would never be an answer from more than one leg of a fork? If so, the documents need to be clearer about the methodology and what's actually being measured.
***Response: T, 3.1.11 Discussion: Section 3.1.11 was removed in version -09.
 
 
**T, 3.2.2 Definition: There's something wrong with this definition. For example, proxies do not create sessions (or dialogs). Did you mean "forwards messages between"?
***Response: T, 3.2.2 Definition: Wording was changed to, "Device in the test topology that facilitates the creation of sessions between EAs."
 
 
**T, 3.2.2 Discussion: This is definition by enumeration since it uses a MUST, and is exclusive of any future things that might sit in the middle. If that's what you want, make this the definition. The MAY seems contradictory unless you are saying a B2BUA or SBC is just a specialized User Agent Server. If so, please say it that way.
***Response: T, 3.2.2 Discussion: The text now reads as follows: "The DUT is an RFC 3261-compatible network intermediary such as ..."
 
 
**T, 3.2.3: This seems out of place or under-explored. You don't appear to actually _use_ this definition in the documents. You declare these things in scope, but the only consequence is the line in this section about not lowering performance benchmarks when present. Consider making that part of the methodology of a benchmark and removing this section. If you think it's essential, please revisit the definition - you may want to generalize it into _anything_ that sits on the path and may affect SIP processing times (otherwise, what's special about this either being SIP Aware, or being a Firewall)?
***Response: T, 3.2.3: References to firewalls, both stateful and otherwise, have been removed.
 
**T, 3.2.5 Definition: This definition just obfuscates things. Point to 3261's definition instead. How is TCP a measurement unit? Does the general terminology template include "enumeration" as a type? Do you really want to limit this enumeration to the set of currently defined transports? Will you never run these benchmarks for SIP over WebSockets?

***Response: T, 3.2.5 Definition: The set of transports now includes WebSockets (RFC 7118).
 
 
**T, 3.3.2 Discussion: Again, there needs to be clarity about what it means to "create" a media session. This description differentiates attempt vs success, so what is it exactly that makes a media session attempt successful? When you say number of media sessions, do you mean number of m lines or total number of INVITEs that have SDP with m lines?
***Response: T, 3.3.2 Discussion: This term was removed.
 
** T, 3.3.3: This would be much clearer written in terms of transactions and dialogs (you are already diving into transaction state machine details). This is a place where the document needs to point out that it is not providing benchmarks relevant to environments where provisionals are allowed to happen and INVITE transactions are allowed to pend.
***Response: T, 3.3.3: This is about whether or not the attempt to set up a call has succeeded. It is about how we define success and failure. It is about how long you wait before you declare a failure. This section defines a parameter, measured in units of time, that represents the amount of time your EA client will wait for a response from the EA server, after the elapse of which the EA will declare a failure to establish a call. Remember, this is lab testing, not end-to-end testing. We are not concerned with whether or not the call is ever set up after some errors have occurred. We are testing to failure. The failure to establish the session before X seconds have passed is a failure within the context of this test.
 
The edited version reads as follows:
3.3.3.  Establishment Threshold Time
   Definition:
      Configuration of the EA that represents the amount of time that an
      EA client will wait for a response from the EA server before
      declaring a Session Attempt Failure.
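
A minimal sketch of what this parameter implies for the EA client's behavior (our own illustration, not text from the draft; the function names and the response object's status attribute are assumptions):

    import time

    def attempt_session(send_request, wait_for_final_response,
                        establishment_threshold_time: float) -> bool:
        """Send one INVITE or REGISTER and report success, or a Session
        Attempt Failure if no final 2xx arrives within the threshold."""
        send_request()                    # EA client sends the request
        deadline = time.monotonic() + establishment_threshold_time
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False              # threshold elapsed: Session Attempt Failure
            response = wait_for_final_response(timeout=remaining)
            if response is not None:      # a final response arrived
                return 200 <= response.status < 300  # only a 2xx counts as success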
 
**T, 3.3.4: How does this model (a single session duration separate from the media session hold time) produce useful benchmarks? Are you using it to allow media to go beyond the termination of a call? If not, then you have media only for the first part of a call? What real world thing does this reflect? Alternatively, what part of the device or system being benchmarked does this provide insight into?

***Response: T, 3.3.4: The term "Media Session Hold Time" was removed.
 
**T, 3.3.5: The document needs to be honest about the limits of this simple model of media. It doesn't account for codecs that do not have constant packet sizes. The benchmarks that use the model don't capture the differences based on content of the media being sent - a B2BUA or gateway may well behave differently if it is transcoding or doing content processing (such as DTMF detection) than it will if it is just shoveling packets without looking at them.
***Response: T, 3.3.5: The following changes were made to the definition:

   Definition:
      Configuration on the EA for a fixed number of frames or samples to
      be sent in each RTP packet of the media session.

   Discussion:
      For a single benchmark test, media sessions use a defined number
      of samples or frames per RTP packet.  If two SBCs, for example,
      used the same codec but one put more frames into the RTP packets
      than the other, this might cause variation in performance
      benchmark measurements.

   Measurement Units:
      An integer number of frames or samples, depending on whether
      hybrid or sample-based codecs are used, respectively.

   Issues:
      None.

   See Also:

In addition, a new parameter, "Codec Type", was added as follows:

   Definition:
      The name of the codec used to generate the media session.

   Discussion:
      For a single benchmark test, all sessions use the same size packet
      for media streams.  The size of packets can cause variation in
      performance benchmark measurements.

   Measurement Units:
      This is an alphanumeric name assigned to uniquely identify the
      codec.

   Issues:
      None.

   See Also:

In addition, this parameter was added to the Test Setup Report in M5.1.
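
To illustrate how the two parameters interact, here is a small worked sketch (our own illustration, not text from the draft; the G.711 and G.729 figures are the standard ones: 8000 one-byte samples per second for G.711, 10-byte frames of 10 ms audio for G.729):

    # RTP payload size and packetization interval from Codec Type and the
    # frames/samples-per-packet parameter described above.
    CODECS = {
        # name: (unit, bytes_per_unit, ms_per_unit)
        "G.711": ("sample", 1, 0.125),  # sample-based: 8000 samples/s, 1 byte each
        "G.729": ("frame", 10, 10.0),   # hybrid: 10-byte frames carrying 10 ms
    }

    def rtp_payload(codec: str, units_per_packet: int):
        """Return (payload_bytes, packet_interval_ms) for one RTP packet."""
        _, bytes_per_unit, ms_per_unit = CODECS[codec]
        return units_per_packet * bytes_per_unit, units_per_packet * ms_per_unit

    print(rtp_payload("G.711", 160))  # (160, 20.0): 160-byte payload every 20 ms
    print(rtp_payload("G.729", 2))    # (20, 20.0): 20-byte payload every 20 ms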
 
**T, 3.3.6: Again, the model here is that any two media packets present the same load to the thing under test. That's not true for transcoding, mixing, or analysis (such as for DTMF detection). It's not clear that if you have two streams, each stream has its own "constant rate". You call out having one audio and one video stream - how do you configure different rates for them?
***Response: T, 3.3.6: This definition has been deleted.
 
 
**T, 3.3.7: This document points to the methodology document for indicating whether streams are bi-directional or uni-directional. I can't find where the methodology document talks about this (the string 'direction' does not occur in that document).
***Response: T, 3.3.7: This definition has been deleted.
 
**T, 3.3.8: This text is old - it was probably written pre-RFC5056. If you fork, loop detection is not optional. This, and the methodology document, should be updated to take that into account.
***Response: T, 3.3.8: This text has been removed. It relates to loop detection, which is no longer considered in version 09.
 
**T, 3.3.9: Clarify if more than one leg of a fork can be answered successfully and update 3.1.11 accordingly. Talk about how this affects the success benchmarks (how will the other legs getting failure responses affect the scores?)
***Response: T, 3.3.9: This text has been removed. It relates to forking, which is no longer considered in version 09.

**T, 3.3.9, Measurement units: There is confusion here. The unit is probably "endpoints". This section talks about two things: that, and type of forking. How is "type of forking" a unit, and are these templates supposed to allow more than one unit for a term?
***Response: T, 3.3.9: This text has been removed. It relates to forking, which is no longer considered in version 09.
 
 
**T, 3.4.2, Definition: It's not clear what "successfully completed" means. Did you mean "successfully established"? This is a place where speaking in terms of dialogs and transactions rather than sessions will be much clearer.
***Response: T, 3.4.2, Definition: The SER was re-defined as follows:
3.4.1.  Session Establishment Rate
 
   Definition:
      The maximum value of the Session Attempt Rate that the DUT can
      handle for an extended, pre-defined, period with zero failures.
 
   Discussion:
      This benchmark is obtained with zero failures, in which 100% of the
      sessions attempted by the Emulated Agent are successfully
      completed by the DUT.  The session attempt rate provisioned on the
      EA is raised and lowered as described in the algorithm in the
      accompanying methodology document, until a traffic load at the
      given attempt rate over the sustained period of time identified by
      T in the algorithm completes without any failed session attempts.
      Sessions may be IS or NS or a mix of both and will be defined in
      the particular test.
 
   Measurement Units:
      sessions per second (sps)
 
   Issues:
      None.
 
   See Also:
      Invite-Initiated Sessions
      Non-Invite-Initiated Sessions
      Session Attempt Rate
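
The "raised and lowered" search referred to above is specified in the accompanying methodology document; purely to illustrate the idea, a binary-search-style sweep over the provisioned attempt rate might look like the following sketch (our own assumption, not the algorithm as written in the methodology document):

    def session_establishment_rate(run_trial, r_min=1, r_max=10000, step=1):
        """Find the highest attempt rate (sessions per second) at which a
        trial over the sustained period T completes with zero failed
        session attempts.  run_trial(rate) -> True iff no attempt failed."""
        best = 0
        low, high = r_min, r_max
        while high - low >= step:
            rate = (low + high) // 2
            if run_trial(rate):      # zero failures: try a higher rate
                best = rate
                low = rate + step
            else:                    # at least one failure: lower the rate
                high = rate
        return best

The reported value is the rate at which the full sustained period T completes without any failed session attempts, as the definition above requires.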
 
 
**T, 3.4.3: This benchmark metric is underdefined. I'll focus on that in the context of the methodology document (where the docs come closer to defining it). This definition includes a variable T but doesn't explain it - you have to read the methodology to know what T is all about. You might just say "for the duration of the test" or whatever is actually correct.
***Response: T, 3.4.3: This was a reference to Session Capacity, a concept that has been removed from version 09.
 
**T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity". Why? The argument you give in the next sentence just says the media session hold time has to be at least as long as the session duration. If they were equal, and finite, the test result does not change. What's the utility of the infinity concept here?
***Response: T, 3.4.3, Discussion: This was a reference to Session Capacity, a concept that has been removed from version 09.
 
**T, 3.4.4: "until it stops responding". Any non-200 response is still a response, and if something sends a 503 or 4xx with a Retry-After (which is likely when it's truly saturating) you've hit the condition you are trying to find. The notion that the Overload Capacity is measurable by not getting any responses at all is questionable. This discussion has a lot of methodology in it - why isn't that (only) in the methodology document?
***Response: T, 3.4.4: This related to Session Overload Capacity, a concept that has been removed from version 09.
 
**T, 3.4.5: A normal, fully correct system that challenged requests and performed flawlessly would have a 0.5 Session Establishment Performance score. Is that what you intended? The SHOULD in this section looks like methodology. Why is this a SHOULD and not a MUST (the document should be clearer about why sessions remaining established is important)? Or wait - is this what Note 2 in section 5.1 of the methodology document (which talks about reporting formats) is supposed to change? If so, that needs to be moved to the actual methodology and made _much_ clearer.
***Response: T, 3.4.5: This section related to Session Establishment Performance, a concept that has been removed from version 09.
 
**T, 3.4.6: You talk of the first non-INVITE in an NS. How are you distinguishing subsequent non-INVITEs in this NS from requests in some other NS? Are you using dialog identifiers or something else? Why do you expect that to matter (why is the notion of a sequence of related non-INVITEs useful from a benchmarking perspective - there isn't state kept in intermediaries because of them - what will make this metric distinguishable from a metric that just focuses on the transactions)?
***Response: T, 3.4.6: This section related to Session Attempt Delay, a concept that was removed from version 09.
 
**T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or some other end-to-end non-INVITE? I suspect it's because you are wanting to focus on a simple non-INVITE transaction (which is why you are leaving out SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear that's why you chose it. You should also talk about whether the payloads of all of the MESSAGE requests are the same size and whether that size is a parameter to the benchmark. (You'll likely get very different behavior from a MESSAGE that fragments.)
***Response: T, 3.4.7: This section related to the IM Rate. We removed IM from the scope of these documents in version 09, due to the fact that there are many ways to deliver such services and specifying one or the other to be tested would not be useful.
 
 
**T, 3.4.7: The definition says "messages completed" but the discussion talks about "definition of success". Does success mean an IM transaction completed successfully? If so, the definition of success for a UAC has a problem. As written, it describes a binary outcome for the whole test, not how to determine the success of an individual transaction - how do you get from what it describes to a rate?
***Response: T, 3.4.7: IM is outside the scope of the documents in version 09.
 
**T, Appendix A: The document should better motivate why this is here. Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are silent on them? The discussion says you are _selecting_ a Session Attempts Arrival Rate distribution. It would be clearer to say you are selecting the distribution of messages sent from the EA. It's not clear how this particular metric will benefit from different sending distributions.
***Response: T, Appendix A: Appendix A has been removed.
-     -    -     -     -    -     -     -    -     -     -    -     -     -
-     -     -

Comments related to the Methodology document will be sent later.

Carol Davids
Professor & Director, RTC Lab
Illinois Institute of Technology
Office: 630-682-6024
Mobile: 630-292-9417
Email: davids@iit.edu
Skype: caroldavids1
Web: rtc-lab.itm.iit.edu