Re: [bmwg] Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt> (Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices) to Informational RFC

Carol Davids <davids@iit.edu> Fri, 25 January 2013 04:56 UTC

Date: Thu, 24 Jan 2013 22:56:07 -0600
From: Carol Davids <davids@iit.edu>
To: Robert Sparks <rjsparks@nostrum.com>, ietf@ietf.org, draft-ietf-bmwg-sip-bench-term@tools.ietf.org, draft-ietf-bmwg-sip-bench-meth@tools.ietf.org, bmwg@ietf.org
Message-ID: <CD2769FF.31306%davids@iit.edu>
In-Reply-To: <51015BB3.7030406@nostrum.com>
Subject: Re: [bmwg] Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt> (Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices) to Informational RFC

Robert,

Thank you for your careful review and your detailed questions and comments.
We appreciate the time and effort.

Yours,

Carol

Carol Davids
Professor & Director, RTC Lab
Illinois Institute of Technology
Office: 630-682-6024
Mobile: 630-292-9417
Email: davids@iit.edu
Skype: caroldavids1
Web: rtc-lab.itm.iit.edu





From:  Robert Sparks <rjsparks@nostrum.com>
Date:  Thursday, January 24, 2013 10:05 AM
To:  <ietf@ietf.org>, <draft-ietf-bmwg-sip-bench-term@tools.ietf.org>,
<draft-ietf-bmwg-sip-bench-meth@tools.ietf.org>, "bmwg.ietf.org"
<bmwg@ietf.org>
Subject:  Re: Last Call: <draft-ietf-bmwg-sip-bench-term-08.txt>
(Terminology for Benchmarking Session Initiation Protocol (SIP) Networking
Devices) to Informational RFC
Resent-From:  <draft-alias-bounces@tools.ietf.org>
Resent-To:  Carol Davids <davids@iit.edu>, Scott Poretsky
<sporetsky@allot.com>, <vkg@bell-labs.com>
Resent-Date:  Thu, 24 Jan 2013 10:05:36 -0600 (CST)

    
 
  Reviews of draft-ietf-bmwg-sip-bench-term-08 and
draft-ietf-bmwg-sip-bench-meth-08
 
 Summary: These drafts are not ready for publication as RFCs.
 
 First, some of the text in these documents shows signs of being old, and
the
 working group may have been staring at them so long that they've become
hard to
 see.  The terminology document says "The issue of overload in SIP networks
is
 currently a topic of discussion in the SIPPING WG." (SIPPING was closed in
 2009). The methodology document suggests a "flooding" rate that is orders
of
 magnitude below what simple devices achieve at the moment. That these
survived
 working group last call indicates a different type of WG review may be
needed
 to groom other bugs out of the documents.
 
 Who is asking for these benchmarks, and are they (still) participating in
the
 group?  The measurements defined here are very simplistic and will provide
 limited insight into the relative performance of two elements in a real
 deployment. The documents should be clear about their limitations, and it
would
 be good to know that the community asking for these benchmarks is getting
tools
 that will actually be useful to them. The crux of these two documents is in
the
 last paragraph of the introduction to the methodology doc: "Finally, the
 overall value of these tests is to serve as a comparison function between
 multiple SIP implementations".  The documents punt on providing any
comparison
 guidance, but even if we assume someone can figure that out, do these
 benchmarks provide something actually useful for inputs?
 
 It would be good to explain how these documents relate to RFC6076.
 
 The terminology tries to refine the definition of session, but the
definition
 provided, "The combination of signaling and media messages and processes
that
 support a SIP-based service" doesn't answer what's in one session vs
another.
 Trying to generically define session has been hard and several working
groups
 have struggled with it (see INSIPID for a current version of that
 conversation). This document doesn't _need_ a generic definition of session
-
 it only needs to define the set of messages that it is measuring. It would
be
 much clearer to say "for the purposes of this document, a session is the set
 of SIP messages associated with an INVITE-initiated dialog and any
Associated
 Media, or a series of related SIP MESSAGE requests". (And looking at the
 benchmarks, you aren't leveraging related MESSAGE requests - they all
appear to
 be completely independent). Introducing the concepts of Invite-initiated
 sessions and non-invite-initiated sessions doesn't actually help define the
 metrics. When you get to the metrics, you can speak concretely in terms of
a
 series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a
short
 introduction helping folks with PSTN backgrounds relate these to
"Session
 Attempts" will be clearer.
 
 To be clear, I strongly suggest a fundamental restructuring of the document
to
 describe the benchmarks in terms of dialogs and transactions, and remove
the IS
 and NS concepts completely.
 
 The INVITE related tests assume no provisional responses, leaving out the
 effect on a device's memory when the state machines it is maintaining
transition
 to the proceeding state. Further, by not including provisionals, and
building
 the tests to search for Timer B firing, the tests ensure there will be
multiple
 retransmissions of the INVITE (when using UDP) that the device being tested
has
 to handle. The traffic an element has to handle and likely the memory it
will
 consume will be very different with even a single 100 Trying, which is the
more
 usual case in deployed networks. The document should be clear _why_ it
chose
 the test model it did and left out metrics that took having a provisional
 response into account. Similarly, you are leaving out the delayed-offer
INVITE
 transactions used by 3pcc and it should be more obvious that you are doing
so.
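
 For concreteness, a rough sketch (assuming the RFC 3261 default timers,
 T1 = 500 ms and Timer B = 64*T1) of how many copies of each INVITE the EA
 ends up sending over UDP when no response at all arrives before Timer B:

     # Sketch, assuming RFC 3261 default timers: count transmissions of an
     # INVITE over UDP when no provisional or final response ever arrives.
     T1 = 0.5                 # seconds (RFC 3261 default)
     TIMER_B = 64 * T1        # 32 s, INVITE transaction timeout

     def invite_transmissions(t1=T1, timer_b=TIMER_B):
         sends, elapsed, interval = 1, 0.0, t1
         while elapsed + interval < timer_b:
             elapsed += interval
             sends += 1       # Timer A retransmission; interval doubles each time
             interval *= 2
         return sends

     print(invite_transmissions())   # 7: the original INVITE plus 6 retransmissions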
 
 Likewise, the media oriented tests take a very basic approach to simulating
 media. It should be explicitly stated that you are simulating the effects
of a
 codec like G.711 and that you are assuming an element would only be
forwarding
 packets and has to do no transcoding work. It's not clear from the
documents
 whether the EA is generating actual media or dummy packets. If it's actual
 media, the test parameters that assume constant sized packets at a constant
 rate will not work well for video (and I suspect endpoints, like B2BUAs,
will
 terminate your call early if you send them garbage).
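
 As a point of reference, a quick sanity check of the per-stream load the
 constant-size, constant-rate model implies, assuming G.711 with 20 ms
 packetization (standard G.711/RTP arithmetic, not numbers from the drafts):

     # Per-stream load under the constant-packet model, assuming G.711
     # (64 kbit/s) with 20 ms packetization carried over IPv4/UDP/RTP.
     ptime_ms      = 20
     payload_bytes = 8000 * ptime_ms // 1000    # 160 bytes of G.711 per packet
     pps           = 1000 // ptime_ms           # 50 packets per second
     overhead      = 20 + 8 + 12                # IPv4 + UDP + RTP headers
     kbps          = (payload_bytes + overhead) * 8 * pps / 1000
     print(pps, payload_bytes, kbps)            # 50 pps, 160 B, 80.0 kbit/s

 A video stream satisfies none of those assumptions, which is the point above.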
 
 The sections on a series of INVITEs are fairly clear that you mean each of
them
 to have different dialog identifiers.  I don't see any discussion of
varying
 the To: URI. If you don't, what's going to keep a gateway or B2BUA from
 rejecting all but the first with something like Busy? Similarly, I'm not
 finding where you talk about how many AoRs you are registering against in
the
 registration tests. I think, as written, someone could write this where all
the
 REGISTERs affected only one AoR.
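
 One way the methodology could close that hole is to require the EA to vary
 the target per session attempt and the AoR per registration; a hypothetical
 sketch (the URI pattern here is invented for illustration, not taken from
 the drafts):

     # Hypothetical: give each session attempt and each registration its own
     # target, so a B2BUA cannot reject "duplicates" and the registrar sees
     # many distinct AoRs.
     def invite_target(i, domain="example.com"):
         return f"sip:callee{i}@{domain}"

     def register_aor(i, domain="example.com"):
         return f"sip:ua{i}@{domain}"

     print(invite_target(1), register_aor(1))
     # sip:callee1@example.com sip:ua1@example.com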
 
 The methodology document calls Stress Testing out of scope, but the very
nature
 of the Benchmarking algorithm is a stress test. You are iteratively pushing
to
 see at what point something fails, _exactly_ by finding the rate of
attempted
 sessions per second that the thing under test would consider too high.
 
 Now to specific issues in document order, starting with the terminology
 document (nits are separate and at the end):
 
 * T (for Terminology document): The title and abstract are misleading -
this is
 not general benchmarking for SIP performance. You have a narrow set of
 tests, gathering metrics on a small subset of the protocol machinery.
 Please (as RFC 6076 did) look for a title that matches the scope of the
 document. For instance, someone testing a SIP Events server would be
ill-served
 with the benchmarks defined here.
 
 * T, section 1: RFC5393 should be a normative reference. You probably also
need
 to pull in RFCs 4320 and 6026 in general - they affect the state machines
you
 are measuring.
 
 * T, 3.1.1: As noted above, this definition of session is not useful. It
 doesn't provide any distinction between two different sessions. I strongly
 disagree that SIP reserves "session" to describe services analogous to
 telephone calls on a switched network - please provide a reference. SIP
INVITE
 transactions can pend forever - it is only the limited subset of the use of
 the transactions (where you don't use a provisional response) that keeps
this
 communication "brief". In the normal case, an INVITE an its final response
can
 be separated by an arbitrary amount of time. Instead of trying to tweak
this
 text, I suggest replacing all of it with simpler, more direct descriptions
of
 the sequence of messages you are using for the benchmarks you are defining
 here.
 
 *T, 3.1.1: How is this vector notion (and graph) useful for this document?
I
 don't see that it's actually used anywhere in the documents. Similarly, the
 arrays don't appear to be actually used (though you reference them from
some
 definitions) - What would be lost from the document if you simply removed
all
 this text?
 
 *T, 3.1.5, Discussion, last sentence: Why is it important to say "For
UA-type
 of network devices such as gateways, it is expected that the UA will be
driven
 into overload based on the volume of media streams it is processing." It's
not
 clear that's true for all such devices. How is saying anything here useful?
 
 *T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session
 Attempt. Why not just say INVITE? You aren't actually measuring "session
 attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.
 
 *T, 3.1.7: It needs to be explicit that these benchmarks are not accounting
 for/allowing early dialogs.
 
 *T, 3.1.8: The words "early media" appear here for the first time. Given
the
 way the benchmarks are defined, does it make sense to discuss early media
in
 these documents at all (beyond noting you do not account for it)? If so,
 there needs to be much more clarity. (By the way, this Discussion will be
 much easier to write in terms of dialogs).
 
 *T, 3.1.9, Discussion point 2: What does "the media session is established"
 mean? If you leave this written as a generic definition, then is this when
an
 MSRP connection has been made? If you simplify it to the simple media model
 currently in the document, does it mean an RTP packet has been sent? Or
does it
 have to be received? For the purposes of the benchmarks defined here, it
 doesn't seem to matter, so why have this as part of the discussion anyway?
 
 *T, 3.1.9, Definition: A series of CANCELs meets this definition.
 
 *T, 3.1.10 Discussion: This doesn't talk about 3xx responses, and they
aren't
 covered elsewhere in the document.
 
 *T, 3.1.11 Discussion: Isn't the MUST in this section methodology? Why is
it in
 this document and not -meth-?
 
 *T, 3.1.11 Discussion, next to last sentence: "measured by the number of
 distinct Call-IDs" means you are not supporting forking, or you would not
count
 answers from more than one leg of the fork as different sessions, like you
 should. Or are you intending that there would never be an answer from more
than
 one leg of a fork? If so, the documents need to be clearer about the
 methodology and what's actually being measured.
 
 *T, 3.2.2 Definition: There's something wrong with this definition. For
 example, proxies do not create sessions (or dialogs). Did you mean
"forwards
 messages between"?
 
 *T, 3.2.2 Discussion: This is definition by enumeration since it uses a
MUST,
 and is exclusive of any future things that might sit in the middle. If
that's
 what you want, make this the definition. The MAY seems contradictory unless
you
 are saying a B2BUA or SBC is just a specialized User Agent Server. If so,
 please say it that way.
 
 *T, 3.2.3: This seems out of place or under-explored.  You don't appear to
 actually _use_ this definition in the documents. You declare these things in
 scope, but the only consequence is the line in this section about not
 lowering performance benchmarks when present. Consider making that part of
the
 methodology of a benchmark and removing this section. If you think it's
 essential, please revisit the definition - you may want to generalize it
into
 _anything_ that sits on the path and may affect SIP processing times
 (otherwise, what's special about this either being SIP Aware, or being a
 Firewall)?
 
 *T, 3.2.5 Definition: This definition just obfuscates things. Point to
3261's
 definition instead. How is TCP a measurement unit? Does the general
terminology
 template include "enumeration" as a type? Do you really want to limit this
 enumeration to the set of currently defined transports? Will you never run
 these benchmarks for SIP over websockets?
 
 *T, 3.3.2 Discussion: Again, there needs to be clarity about what it means
to
 "create" a media session. This description differentiates attempt vs
success,
 so what is it exactly that makes a media session attempt successful? When
you
 say number of media sessions, do you mean number of M lines or total number
of
 INVITEs that have SDP with m lines?
 
 *T, 3.3.3: This would be much clearer written in terms of transactions and
dialogs
 (you are already diving into transaction state machine details). This is a
 place where the document needs to point out that it is not providing
benchmarks
 relevant to environments where provisionals are allowed to happen and
INVITE
 transactions are allowed to pend.
 
 *T, 3.3.4: How does this model (A single session duration separate from the
 media session hold time) produce useful benchmarks? Are you using it to
allow
 media to go beyond the termination of a call? If not, then you have media
only
 for the first part of a call? What real world thing does this reflect?
 Alternatively, what part of the device or system being benchmarked does
this
 provide insight to?
 
 *T, 3.3.5: The document needs to be honest about the limits of this simple
 model of media. It doesn't account for codecs that do not have constant
packet
 sizes. The benchmarks that use the model don't capture the differences
based on
 content of the media being sent - a B2BUA or gateway may well behave
 differently if it is transcoding or doing content processing (such as DTMF
 detection) than it will if it is just shoveling packets without looking at
 them.
 
 *T, 3.3.6: Again, the model here is that any two media packets present the
same
 load to the thing under test. That's not true for transcoding, mixing, or
 analysis (such as DTMF detection). It's not clear whether, if you have two
 streams, each stream has its own "constant rate". You call out having one
audio
 and one video stream - how do you configure different rates for them?
 
 *T, 3.3.7: This document points to the methodology document for indicating
 whether streams are bi-directional or uni-directional. I can't find where
the
 methodology document talks about this (the string 'direction' does not
 occur in that document).
 
 *T, 3.3.8: This text is old - it was probably written pre-RFC5393. If you
fork,
 loop detection is not optional. This, and the methodology document should
be
 updated to take that into account.
 
 *T, 3.3.9: Clarify if more than one leg of a fork can be answered
successfully
 and update 3.1.11 accordingly. Talk about how this affects the success
 benchmarks (how will the other legs getting failure responses affect the
 scores?)
 
 *T, 3.3.9, Measurement units: There is confusion here. The unit is probably
 "endpoints". This section talks about two things, that, and type of
forking.
 How is "type of forking" a unit, and are these templates supposed to allow
more
 than one unit for a term?
 
 *T, 3.4.2, Definition: It's not clear what "successfully completed" means.
Did
 you mean "successfully established"? This is a place where speaking in
terms of
 dialogs and transactions rather than sessions will be much clearer.
 
 *T, 3.4.3: This benchmark metric is underdefined. I'll focus on that in the
 context of the methodology document (where the docs come closer to defining
it).
 This definition includes a variable T but doesn't explain it - you have to
read
 the methodology to know what T is all about. You might just say "for the
 duration of the test" or whatever is actually correct.
 
 *T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity".
Why?
 The argument you give in the next sentence just says the media session hold
 time has to be at least as long as the session duration. If they were
equal,
 and finite, the test result does not change. What's the utility of the
infinity
 concept here?
 
 *T, 3.4.4: "until it stops responding". Any non-200 response is still a
 response, and if something sends a 503 or 4xx with a retry-after (which is
 likely when it's truly saturating) you've hit the condition you are trying
to
 find. The notion that the Overload Capacity is measurable by not getting
any
 responses at all is questionable.  This discussion has a lot of methodology
in
 it - why isn't that (only) in the methodology document?
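
 If it helps, here is a hypothetical sketch of the kind of check I would
 expect instead, treating explicit rejection as the overload signal rather
 than silence (my suggestion, not what the drafts specify):

     # Hypothetical overload check: a 503, or any failure response carrying
     # Retry-After, signals saturation; waiting for responses to stop does not.
     def indicates_overload(status_code, headers):
         if status_code == 503:
             return True
         return status_code >= 400 and "Retry-After" in headers

     print(indicates_overload(503, {}))                    # True
     print(indicates_overload(486, {"Retry-After": "5"}))  # True
     print(indicates_overload(180, {}))                    # False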
 
 *T, 3.4.5: A normal, fully correct system that challenged requests and
 performed flawlessly would have a .5 Session Establishment Performance
score.
 Is that what you intended? The SHOULD in this section looks like
methodology.
 Why is this a SHOULD and not a MUST (the document should be clearer about
why
 sessions remaining established is important)? Or wait - is this what Note 2
in
 section 5.1 of the methodology document (which talks about reporting
formats)
 is supposed to change? If so, that needs to be moved to the actual
methodology
 and made _much_ clearer.
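
 To spell out the arithmetic behind that .5, assuming every INVITE is
 challenged once and both the challenged INVITE and the re-attempt with
 credentials count as Session Attempts:

     # Worked example of the 0.5 score under the assumptions above.
     N = 1000                     # sessions the EA tries to establish
     attempts    = 2 * N          # original INVITE + credentialed retry, both counted
     established = N              # every session ultimately succeeds
     print(established / attempts)   # 0.5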
 
 *T, 3.4.6: You talk of the first non-INVITE in an NS. How are you
 distinguishing subsequent non-INVITES in this NS from requests in some
other
 NS? Are you using dialog identifiers or something else? Why do you expect
that
 to matter (why is the notion of a sequence of related non-INVITEs useful
from a
 benchmarking perspective - there isn't state kept in intermediaries because
of
 them - what will make this metric distinguishable from a metric that just
 focuses on the transactions?)
 
 *T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or
 some other end-to-end non-INVITE? I suspect it's because you are wanting to
 focus on a simple non-INVITE transaction (which is why you are leaving out
 SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear
 that's why you chose it. You should also talk about whether the payloads of
all
 of the MESSAGE requests are the same size and whether that size is a
parameter
 to the benchmark. (You'll likely get very different behavior from a MESSAGE
 that fragments.) 
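
 For scale, RFC 3261 section 18.1.1 forces a switch to a congestion-controlled
 transport once a request exceeds 1300 bytes and the path MTU is unknown, so
 the payload size directly changes what is being measured. A rough sketch,
 with the header size being my own ballpark assumption:

     # Sketch: does a MESSAGE of a given body size stay on UDP? The 600-byte
     # header estimate is an assumption for illustration, not from the drafts.
     LIMIT = 1300                 # RFC 3261 sec. 18.1.1 threshold
     HEADER_ESTIMATE = 600

     def must_switch_transport(body_bytes, header_bytes=HEADER_ESTIMATE):
         return header_bytes + body_bytes > LIMIT

     print(must_switch_transport(200))    # False - small IM stays on UDP
     print(must_switch_transport(1200))   # True  - large IM moves to TCP or fragments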
 
 *T, 3.4.7: The definition says "messages completed" but the discussion
talks
 about "definition of success". Does success mean an IM transaction
completed
 successfully?  If so, the definition of success for a UAC has a problem. As
 written, it describes a binary outcome for the whole test, not how to
determine
 the success of an individual transaction - how do you get from what it
 describes to a rate?
 
 *T, Appendix A: The document should better motivate why this is here.
 Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are
 silent on them?  The discussion says you are _selecting_ a Session Attempts
 Arrival Rate distribution. It would be clearer to say you are selecting the
 distribution of messages sent from the EA. It's not clear how this
particular
 metric will benefit from different sending distributions.
 
 Now the Methodology document (comments prefixed with an M):
 
 *M, Introduction: Can the document say why the subset of functionality
 benchmarked here was chosen over other subsets? Why was SUBSCRIBE/NOTIFY or
 INFO not included (or INVITEs with MSRP or even simple early media, etc.)?
 
 *M, Introduction paragraph 4: This points to section 4 and section 2 of the
 terminology document for configuration options. Section 4 is the IANA
 considerations section (which has no options). What did you mean to point
to?
 
 *M, Introduction paragraph 4, last sentence: This seems out of place - why
is
 it in the introduction and not in a section on that specific methodology?
 
 *M, 4.1: It's not clear here, or in the methodology sections whether the
tests
 allow the transport to change as you go across an intermediary. Do you
intend
 to be able to benchmark a proxy that has TCP on one side and UDP on the
other?
 
 *M, 4.2: This is another spot where pointing to the Updates to 3261 that
change
 the transaction state machines is important.
 
 *M, 4.4: Did you really mean RTSP? Maybe you meant MSRP or something else?
RTSP
 is not, itself, a media protocol.
 
 *M, 4.9: There's something wrong with this sentence: "This test is run for
an
 extended period of time, which is referred to as infinity, and which is,
 itself, a parameter of the test labeled T in the pseudo-code". What value
is
 there in giving some finite parameter T the name "infinity"?
 
 *M, 4.9: Where did 100 (as an initial value for s) come from? Modern
devices
 process at many orders of magnitude higher rates than that. Do you want to
 provide guidance instead of an absolute number here?
 
 *M 4.9: In the pseudo-code, you often say "the largest value". It would
help to
 say the largest value of _what_.
 
 *M 4.9: What is the "steady_state" function that is called in the
 pseudo-code?
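
 For what it's worth, here is my reading of what the section 4.9 pseudo-code
 appears to be doing, reconstructed only to show where the undefined pieces
 (the initial value, "the largest value", steady_state) matter. This is an
 interpretation, not the drafts' algorithm:

     # Interpretation only: raise the attempt rate while the DUT keeps up,
     # back off and narrow the step after a failure, and report the largest
     # rate that was sustained. dut_handles() stands in for send_traffic()
     # plus whatever steady_state() is supposed to check.
     def find_session_rate(dut_handles, start=100, step=100, max_rate=10**6):
         best, rate = 0, start    # start=100 mirrors the initial value questioned above
         while step >= 1 and rate <= max_rate:
             if dut_handles(rate):
                 best = rate      # "the largest value" of the sustained rate so far
                 rate += step
             else:
                 step //= 2       # halve the step and retry just above the best rate
                 rate = best + step
         return best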
 
 *M 6.3: Expected Results: The EA will have different performance
 characteristics if you have them sending media or not. That could cause
this
 metric to be different from session establishment without media.
 
 *M 6.5: This section should call out that loop detection is not optional
when
 forking. The Expected Results description is almost tautological - could it
 instead say how having this measurement is _useful_ to those consuming this
 benchmark?
 
 *M 6.8, Procedure: Why is "May need to run for each transport of interest."
in
 a section titled "Session Establishment Rate with TLS Encrypted SIP"?
 
 *M 6.10: This document doesn't define Flooding. What do you mean? How is
this
 different than "Stress test" as called out in section 4.8? Where does 500
come
 from?  (Again, I suspect that's a very old value - and you should be
providing
 guidance rather than an absolute number). But it's not clear how this isn't
 just the session establishment rate test starting with a bigger
number.
 What is it actually trying to report on that's different from the session
 establishment rate test, and how is the result useful?
 
 *M 6.11: Is each registration going to a different AoR? (They must be, or
the
 re-registration test makes no sense.) You might talk about configuring the
 registrar and the EA so they know what to use.
 
 *M 6.12, Expected Results: Where do you get the idea that re-registration
 should be faster than initial registration? How is knowing the difference
(or
 even that there is a difference) between this and the registration metric
 likely to be useful to the consumer?
 
 *M 6.14: Session Capacity, as defined in the terminology doc, is a count of
 sessions, not a rate. This section treats it as a rate and says it can be
 interpreted as "throughput". I'm struggling to see what it actually is
 measuring. The way your algorithm is defined in section 4.9, I find s
before I
 use T. Lets say I've got a box where the value of s that's found is 10000,
and
 I've got enough memory that I can deal with several large values of T. If I
run
 this test with T of 50000, my benchmark result is 500,000,000. If I run
with a
 T of 100000, my benchmark result is 1,000,000,000. How are those numbers
 telling me _anything_ about session capacity? That the _real_ session
capacity
 is at least that much? Is there some part of this methodology that has me
hunt
 for a maximal value of T? Unless I've missed something, this metric needs
more
 clarification to not be completely misleading. Maybe instead of "Session
 Capacity" you should simply be reporting "Simultaneous Sessions Measured"
 
 *M 8: "and various other drafts" is not helpful - if you know of other
 important documents to point to, point to them.
 
 Nits:
 
 T : The definition of Stateful Proxy and Stateless Proxy copied the words
 "defined by this specification" from RFC3261. This literal copy introduces
 confusion. Can you make it more visually obvious you are quoting? And even
if
 you do, could you replace "by this specification" with "by [RFC3261]"?
 
 T, Introduction, 2nd paragraph, last sentence: This rules out stateless
 proxies.
 
 T, Section 3: In the places where this template is used, you are careful to
say
 None under Issues when there aren't any, but not so careful to say None
under
 See Also when there isn't anything. Leaving them blank makes some
transitions
 hard to read - they read like you are saying see also (whatever the next
 section heading is).
 
 T, 3.1.6, Discussion: s/tie interval/time interval/
 
 M, Introduction, paragraph 2: You say "any [RFC3261] conforming device",
but
 you've ruled endpoint UAs out in other parts of the documents.
 
 M 4.9: You have comments explaining send_traffic the _second_ time you use
it.
 They would be better positioned at the first use.
 
 M 5.2: This is the first place the concept of re-Registration is mentioned.
A
 forward pointer to what you mean, or an introduction before you get to this
 format would be clearer.
 
 
 On 1/16/13 3:48 PM, The IESG wrote:
 
 
>  
> The IESG has received a request from the Benchmarking Methodology WG
> (bmwg) to consider the following document:
> - 'Terminology for Benchmarking Session Initiation Protocol (SIP)
>    Networking Devices'
>   <draft-ietf-bmwg-sip-bench-term-08.txt> as Informational RFC
> 
> The IESG plans to make a decision in the next few weeks, and solicits
> final comments on this action. Please send substantive comments to the
> ietf@ietf.org mailing lists by 2013-01-30. Exceptionally, comments may be
> sent to iesg@ietf.org instead. In either case, please retain the
> beginning of the Subject line to allow automated sorting.
> 
> Abstract
> 
> 
>    This document provides a terminology for benchmarking the SIP
>    performance of networking devices.  The term performance in this
>    context means the capacity of the device- or system-under-test to
>    process SIP messages.  Terms are included for test components, test
>    setup parameters, and performance benchmark metrics for black-box
>    benchmarking of SIP networking devices.  The performance benchmark
>    metrics are obtained for the SIP signaling plane only.  The terms are
>    intended for use in a companion methodology document for
>    characterizing the performance of a SIP networking device under a
>    variety of conditions.  The intent of the two documents is to enable
>    a comparison of the capacity of SIP networking devices.  Test setup
>    parameters and a methodology document are necessary because SIP
>    allows a wide range of configuration and operational conditions that
>    can influence performance benchmark measurements.  A standard
>    terminology and methodology will ensure that benchmarks have
>    consistent definition and were obtained following the same
>    procedures.
> 
> 
> 
> 
> The file can be obtained via
> http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/
> 
> IESG discussion can be tracked via
> http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/ballot/
> 
> 
> No IPR declarations have been submitted directly on this I-D.
> 
> 
>