[bmwg] RAI-ART review of draft-ietf-bmwg-sip-bench-term-04 and draft-ietf-bmwg-sip-bench-meth-04

"Worley, Dale R (Dale)" <dworley@avaya.com> Fri, 27 April 2012 22:01 UTC

From: "Worley, Dale R (Dale)" <dworley@avaya.com>
To: "rai@ietf.org" <rai@ietf.org>, Al Morton <acmorton@att.com>, "draft-ietf-bmwg-sip-bench-term@tools.ietf.org" <draft-ietf-bmwg-sip-bench-term@tools.ietf.org>, "bmwg@ietf.org" <bmwg@ietf.org>
Date: Fri, 27 Apr 2012 17:56:42 -0400
Message-ID: <CD5674C3CD99574EBA7432465FC13C1B22726A0A88@DC-US1MBEX4.global.avaya.com>
Cc: Mary Barnes <mary.ietf.barnes@gmail.com>
Subject: [bmwg] RAI-ART review of draft-ietf-bmwg-sip-bench-term-04 and draft-ietf-bmwg-sip-bench-meth-04

I am the assigned RAI-ART reviewer for this draft.  For background on
RAI-ART, please see the FAQ at
<http://wiki.tools.ietf.org/area/rai/trac/wiki/RaiArtfaq>.  Please
resolve these comments along with any other comments you may receive.

This draft is on the right track but has open issues, described in the
review.

Please note that I am quite familiar with SIP but have little
familiarity with benchmarking.  Thus, some of the points I make might
have implicit answers in the usual practice of benchmarking.  But
since one audience of these documents is the SIP community, such
implicit answers should probably be made explicit.

I. Technical issues

A. Media

The drafts seem to be undecided as to whether they want to benchmark
media processing or not.  At one point they say that the benchmarking
is entirely in the signaling plane; at another they specify how to set
the number of media streams per INVITE dialog.  I suspect
the conflict is that there are devices the WG wants to benchmark which
do handle the media and their media handling affects performance
significantly, but the WG is unsure how to parametrize media
benchmarking.

The simplest situation, of course, is if the DUT is a proxy and the
media bypasses it entirely.  Beyond that, even in the simplest media
relaying situations, significant processing costs can arise.  Even if
the DUT isn't transcoding, it may have to rewrite the RTP codec
numbers (as it may have modified the SDP passing through it).  In more
complex situations, the DUT may have to transcode the codecs.
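
As a hypothetical illustration (these SDP fragments are mine, not taken
from the drafts), a B2BUA that renumbers a dynamic payload type while
relaying an offer must then rewrite that payload type in every RTP
packet of the corresponding stream, even though it never decodes the
media:

   Offer as sent by the caller:
      m=audio 49170 RTP/AVP 0 96
      a=rtpmap:96 telephone-event/8000

   Offer as forwarded by the DUT (dynamic payload type renumbered):
      m=audio 20000 RTP/AVP 0 101
      a=rtpmap:101 telephone-event/8000

That per-packet cost belongs in the benchmark if media handling is in
scope at all.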

Some consideration has to be given to how to specify exactly what the
media loads are.  SDP already has ways to specify the exact encoding
of media streams, including the number of samples per packet, which
implies the packet size and packet rate.  These drafts only allow
specifying the packet size as a benchmark parameter, but packet size
is notoriously ambiguous -- which layers of encapsulation of the
encoded media are counted?  But specifying the detailed encoding
side-steps that question.
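
For example, a minimal SDP media description (values chosen purely for
illustration) already determines both the packet rate and the payload
size:

   m=audio 49170 RTP/AVP 0
   a=rtpmap:0 PCMU/8000
   a=ptime:20

G.711 at 20 ms per packet gives 50 packets per second with a 160-byte
payload; the on-the-wire size then depends only on which encapsulation
layers (RTP, UDP, IP, link) the benchmark chooses to count.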

The draft wants to consider the duration of a media stream to be
separately adjustable from the duration of the containing dialog, but
it explicitly places out of scope the re-INVITE that is
necessary to accomplish that realistically (that is, with the
signaling matching the presented media packets).

B. INVITE dialog structure

The drafts seem to want to consider the establishment (or failure
thereof) of an INVITE dialog to be instantaneous, after which the
dialog continues for a chosen length of time, and then vanishes
instantly.  Little or no consideration is given to the various
scenarios of call establishment, including the most common case:
INVITE sent -- 183 response -- significant delay -- 200 response.
Dialog teardown is not conceptualized as being a processing step that
involves significant cost and may fail:  "Session disconnect is not
considered in the scope of this work item."
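
A sketch of that common case (my own illustration; the provisional
responses and their timing will vary):

   Caller EA           DUT            Callee EA
      |---INVITE-------->|---INVITE-------->|
      |<--100 Trying-----|                  |
      |<--183 Progress---|<--183 Progress---|
      |     ... significant delay ...       |
      |<--200 OK---------|<--200 OK---------|
      |---ACK----------->|---ACK----------->|
      |     ... dialog lifetime ...         |
      |---BYE----------->|---BYE----------->|
      |<--200 OK---------|<--200 OK---------|

Every leg of this flow, including the BYE transaction at the end, is
work the DUT must perform and a point at which the attempt can fail.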

This lack of consideration is compounded in the forking cases, as the
variety of scenarios (and their durations) increases.

In addition, the drafts only consider forking which is done within the
DUT, whereas it will be common in practice for forking to be done
downstream of the DUT, presenting the DUT with a stream of 1xx
responses from multiple endpoints, with a 2xx after an extended delay.
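
Sketched (again my own illustration), with a forking proxy downstream
of the DUT and two callees A and B:

   Caller EA          DUT         Forking proxy        Callees
      |---INVITE------->|---INVITE------->|---INVITE-------> A, B
      |<--180 (tag A)---|<--180 (tag A)---|<--180----------- A
      |<--180 (tag B)---|<--180 (tag B)---|<--180----------- B
      |       ... extended delay ...      |
      |<--200 (tag A)---|<--200 (tag A)---|<--200----------- A
      |---ACK---------->|---ACK---------->|---ACK----------> A

Here the DUT sees provisional responses carrying several different
to-tags for a single INVITE it forwarded, without ever having forked
anything itself.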

Also, with regard to signaling benchmarking, INVITEs that ultimately
fail are likely to be as costly as INVITEs that succeed, but there
doesn't seem to be a defined parameter "fraction of attempted calls
which succeed" (which would control the behavior of the callee EAs).

C. Loop detection

All discussion of loop detection needs to be based on the revised
loop-detection requirements in RFC 5393.

D. Authentication

In some SIP operations, authentication is commonly performed.  This can
affect the message flows in various ways that need to be taken into
account in the benchmarks.

For instance, a registrar may require that the registering UA
authenticate itself.  Commonly, the UA sends a REGISTER request, which
is rejected with a 401 because it contains a nonce that is too old.  The
UA then immediately sends another REGISTER with the nonce provided in
the 401 response, and that request receives a 200 response.
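
Sketched (digest details elided; values illustrative):

   UA                              Registrar
      |---REGISTER------------------->|   (nonce too old)
      |<--401 Unauthorized------------|   (WWW-Authenticate: fresh nonce)
      |---REGISTER------------------->|   (Authorization: fresh nonce)
      |<--200 OK----------------------|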

In this scenario, the number of effective REGISTER requests is half of
the total REGISTER requests, leading to an apparent attempt failure
rate of 50%, even though the middlebox is doing the Right Thing 100%
of the time.  This suggests that the definition of "attempt failure"
needs to be updated so that a 4xx response "passed upstream" by the
DUT is not counted as an attempt failure.

In other scenarios, the DUT itself might be expected to enforce SIP
authentication, which would require a somewhat different definition of
attempt failure, and would be expected to have lower throughput.

So some thought needs to be given to whether these scenarios are to be
benchmarked, and the documents should state how authentication is to be
handled in whatever benchmarks are defined.

II. Editorial issues

The drafts appear to me to be, well, drafts.  That is, they generally
contain the intended technical content, but the exposition is not
complete or clear, and in places the reader has to guess at the exact
meaning.  In places the sentences are not correct English, references
are incomplete or incorrect, and different parts of the documents
contradict one another.

The documents need a proper final revision, with each paragraph edited
carefully to maximize its clarity and to verify that all parts of the
documents are consistent with each other.

Examples of editorial problems are:

draft-ietf-bmwg-sip-bench-term-04

section 1

   The term Throughput is defined in RFC2544 [RFC2544].

The definition is in RFC 1242; RFC 2544 refers to RFC 1242.

   This document uses existing terminology defined in other BMWG work.
   Examples include, but are not limited to:

      Device under test (DUT) (c.f., Section 3.1.1 RFC 2285 [RFC2285]).
      System under test (SUT) (c.f., Section 3.1.2, RFC 2285 [RFC2285]).

In what way would the reader determine the relevant "other BMWG work"?
This reference needs to be made definite in some way.

      The behavior of a stateful proxy is further defined in Section
      16.

This sentence was copied directly from RFC 3261, in whose section 16
the referenced definition exists.  However, *this* document has no
section 16.

section 2.2

The figures are inconsistent in how they label the "Tester" boxes as
"EA" or not.  Is this difference meaningful?

Figures 9 and 10 show "SUT" embracing the entire test setup, including
the "Tester" boxes, whereas RFC 2285 section 3.1.2 says that SUT does
not include the tester components of the setup.

section 3.1.1

The various components of a "session" (usually "dialog" in SIP
terminology) are given quasi-mathematical names that look peculiar to
me.  Worse, they're not completely correct: since each RTP stream has
a corresponding RTCP stream, "session[x].medc" should be
"session[x].medc[y]".  (Is there any benefit of introducing this
symbolism?)

section 3.1.5

This section defines "overload" and then says "The distinction between
an overload condition and other failure scenarios is outside the scope
of this document which is blackbox testing."  If the distinction is
outside the scope, why is there a definition here?

section 3.1.6

The definition of "Session Attempt" seems to be incorrect.  Of course,
a session attempt is each sending of an INVITE/SUBSCRIBE/MESSAGE,
whether or not it is ultimately successful.  But the definition as written
makes "session attempt" a time-varying property, which is true only
until a response is received by the EA.

section 3.1.8

      An IS is identified by the Call-ID, To-tag, and From-tag of the
      SIP message that establishes the session.

As written, this is incorrect, as the to-tag is present only in the
response(s) to the INVITE.
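
For illustration (the header values here are hypothetical), the to-tag
appears only once the callee answers:

   INVITE sip:bob@example.com SIP/2.0
   From: <sip:alice@example.com>;tag=1928301774
   To: <sip:bob@example.com>
   Call-ID: a84b4c76e66710

   SIP/2.0 200 OK
   From: <sip:alice@example.com>;tag=1928301774
   To: <sip:bob@example.com>;tag=314159
   Call-ID: a84b4c76e66710

So the triple identifies an IS only after the dialog-establishing
response has been received.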

      2.  If a media session is described in the SDP body of the
          signaling message, then the media session is established by
          the end of Establishment Threshold Time (c.f.  Section 3.3.3).

This sentence is correct as far as it goes, but there is no clear
description (that I can tell) of when a media session is
"established".  Indeed, it's not clear to me what a proper way to set
up the media is -- in real SIP systems, a proxy or SBC can start
receiving early media from a callee before it has received *any* SIP
responses from the callee, and the middlebox can have some difficulty
matching the RTP to a dialog being set up.  So the exact details of
when to start the media flows are needed to make the benchmarking
process reproducible.

section 3.2.3

This section defines "SIP-Aware Stateful Firewall" as 

      Device in test topology that provides Denial-of-Service (DoS)
      Protection to the Signaling and Media Planes for the EAs and
      Signaling Server

But a device can be a SIP-Aware Stateful Firewall (in the ordinary
sense of the words) without providing DoS protection.

section 3.3.2

"IS Media Attempt Rate" is defined as

      Configuration on the EA for number of ISs with Associated Media to
      be established at the DUT per continuous one- second time
      intervals.

However, "established" should be "attempted".

section 3.3.3

      Configuration of the EA for representing the amount of time that
      an EA will wait before declaring a Session Attempt Failure.

The discussion makes clear that there may be different Establishment
Threshold Times for IS and NS, but these times are written about in
most places as if they were the same.

draft-ietf-bmwg-sip-bench-meth-04

section 2

Refers to "Presence Rate", but draft-ietf-bmwg-sip-bench-term-04 says
that presence is out of scope.

section 4

What is the relationship between the data items described in this
draft's "Benchmarking Considerations" and
draft-ietf-bmwg-sip-bench-term-04's "Test Setup Parameters"?

section 4.2

I'm having trouble understanding the various terms that include the
word "server", including "Signaling Server", "The Server", "SIP
Server".  Are these intended to have the same meaning?  Are they
intended to cover the whole range of SIP "middleboxes" (including,
e.g., proxies)?  Section 2 says:

   The DUT is a SIP Server, which
   may be any [RFC3261] conforming device.  The SUT can be any device or
   group of devices containing RFC 3261 conforming functionality along
   with Firewall and/or NAT functionality.

If read literally, that would *exclude* pure proxies, since they don't
have firewall or NAT functionality.

section 4.3

References "Figures 4 and 5", which don't exist.  Does it mean to
reference draft-ietf-bmwg-sip-bench-term-04?

section 4.9

The formatting of the pseudocode is inconsistent.  The variable "c"
should be boolean rather than 0/1.

section 6.6

      1.  If the DUT is being benchmarked as a proxy or B2BUA, and
          forking is supported in the DUT, then configure the DUT in the
          test topology shown in Figure 5 in [I-D.sip-bench-term].  If
          the DUT does not support forking, then this step can be
          skipped.
      2.  Configure a SUT according to the test topology shown in Figure
          8 of [I-D.sip-bench-term].

It's not clear to me how one can configure the DUT/SUT according to
both figures 5 and 8.  And neither figure shows two callee EAs.

The text suggests that DUTs that do not support forking can be tested,
even though this is a test specifically of performance when the DUT is
doing forking.

Dale