Document: draft-ietf-bmwg-sdn-controller-benchmark-meth-08
Reviewer: Stewart Bryant
Review Date: 2018-04-16
IETF LC End Date: 2018-02-02
IESG Telechat date: 2018-04-19

Summary: This is a well written, comprehensive test set for SDN controllers. Two minor points remain from my previous review that I would draw to the attention of the responsible AD.

Major issues: None

Minor issues:

I found the large amount of text on OpenFlow that appears out of the blue in the appendix somewhat strange, since the test suite is controller protocol agnostic. I understand that this is included by way of illustrative example. It might be useful to the reader to make a statement to this effect.

[Authors] Regarding your #1 comment, we will add a note to mention that 'OpenFlow protocol is used as an example to illustrate the methodologies defined in this document'. Hope this works.

Nits/editorial comments:

   The test traffic generators TP1 and TP2 SHOULD be connected to the
   first and the last leaf Network Device.

SB> I am sure I know what first and last mean, but the meaning should be called out.

[Authors] We will clarify the Test Traffic Generator (TP1/TP2) connectivity in the setup. We will update the text such that 'TP1 SHOULD be connected to Network Device 1 and TP2 SHOULD be connected to Network Device n'.

-------------------------------------------------------------------------------------------------------------------

Spencer Dawkins' No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

I have a few questions, at the No Objection level ... do the right thing, of course.

I apologize for attempting to play amateur statistician, but it seems to me that this text

   4.7. Test Repeatability

   To increase the confidence in measured result, it is recommended
   that each test SHOULD be repeated a minimum of 10 times.

is recommending a heuristic, when I'd think that you'd want to repeat a test until the results seem to be converging on some measure of central tendency, given some acceptable margin of error, and this text

   Procedure:
   1. Establish the network connections between controller and network
      nodes.
   2. Query the controller for the discovered network topology
      information and compare it with the deployed network topology
      information.
   3. If the comparison is successful, increase the number of nodes by 1
      and repeat the trial. If the comparison is unsuccessful, decrease
      the number of nodes by 1 and repeat the trial.
   4. Continue the trial until the comparison of step 3 is successful.
   5. Record the number of nodes for the last trial (Ns) where the
      topology comparison was successful.

seems to beg for a binary search, especially if you're testing whether a controller can support a large number of nodes ...

[Authors] I would like to clarify that the above procedure is for a single test trial. We recommend repeating the procedure at least 10 times for better accuracy of results.
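
A minimal sketch of the trade-off Spencer raises, for illustration only (Python; discovery_succeeds(n) is a hypothetical stand-in for steps 1-2 of the quoted procedure, i.e. deploy n nodes, query the controller, and compare the discovered topology with the deployed one):

   def largest_discovered_linear(start, discovery_succeeds):
       """One-node-at-a-time walk, roughly as in the quoted procedure."""
       n = start
       while discovery_succeeds(n):
           n += 1
       return n - 1   # Ns: last node count where the comparison succeeded

   def largest_discovered_binary(lo, hi, discovery_succeeds):
       """Spencer's binary-search suggestion: search between a known-good
       count lo and a known-failing count hi, taking O(log(hi - lo)) trials
       instead of O(hi - lo)."""
       while hi - lo > 1:
           mid = (lo + hi) // 2
           if discovery_succeeds(mid):
               lo = mid
           else:
               hi = mid
       return lo

Either search would still be repeated (at least 10 times, per the authors' recommendation) to build confidence in the resulting Ns.
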
This text

   Reference Test Setup: The test SHOULD use one of the test setups
   described in section 3.1 or section 3.2 of this document in
   combination with Appendix A.

or some variation is repeated about 16 times, and I'm not understanding why this is using BCP 14 language, and if BCP 14 language is the right thing to do, I'm not understanding why it's always SHOULD. I get the part that this will help compare results, if two researchers are running the same tests. Is there more to the requirement than that?

[Authors] Our intention is to help compare results, if two testers are running the same tests. We do not have any other requirements than this.

In this text,

   Procedure:
   1. Perform the listed tests and launch a DoS attack towards
      controller while the trial is running.

   Note: DoS attacks can be launched on one of the following interfaces.
   a. Northbound (e.g., Query for flow entries continuously on
      northbound interface)
   b. Management (e.g., Ping requests to controller's management
      interface)
   c. Southbound (e.g., TCP SYN messages on southbound interface)

is there a canonical description of "DoS attack" that researchers should be using, in order to compare results? These are just examples, right?

[Authors] You are correct. The Note section gives some examples of how to simulate DoS attacks.

Is the choice of

   [OpenFlow Switch Specification] ONF, "OpenFlow Switch Specification"
   Version 1.4.0 (Wire Protocol 0x05), October 14, 2013.

intentional? I'm googling that the current version of OpenFlow is 1.5.1, from 2015.

[Authors] This is intentional, as all our examples are derived based on this version of the specification.

-------------------------------------------------------------------------------------------------------------------

Eric Rescorla's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

Rich version of this review at: https://mozphab-ietf.devsvcdev.mozaws.net/D3948

COMMENTS

   4.7. Test Repeatability

   To increase the confidence in measured result, it is recommended
   that each test SHOULD be repeated a minimum of 10 times.

Nit: you might be happier with "RECOMMENDED that each test be repeated ...". Also, where does 10 come from? Generally, the number of trials you need depends on the variance of each trial.

[Authors] The RECOMMENDED number 10 was arrived at based on our experience during the benchmarking. I will discuss with the other authors about changing SHOULD to RECOMMENDED.

[Eric] SHOULD and RECOMMENDED have the same normative force in 2119. It's just editorial, and I thought it would read better.

[Authors] Updated SHOULD to RECOMMENDED in the revised version.

   Test Reporting

   Each test has a reporting format that contains some global and
   identical reporting components, and some individual components that
   are specific to individual tests. The following test configuration
   parameters and controller settings parameters MUST be reflected in

This is an odd MUST, as it's not required for interop.

[Authors] The intent of specifying MUST is to capture the relevant test parameters to enable apple-to-apple comparison of test results across two testers/test runs.

   5. Stop the trial when the discovered topology information matches
      the deployed network topology, or when the discovered topology
      information return the same details for 3 consecutive queries.
   6. Record the time last discovery message (Tmn) sent to controller
      from the forwarding plane test emulator interface (I1) when the
      trial completed successfully. (e.g., the topology matches).

How large is the TD usually? How much does 3 seconds compare to that?

[Authors] The test duration varies depending on the size of the test topology. For a smaller topology (3 - 10 nodes) the TD was within a minute. So we kept the query interval at 3 seconds to accommodate smaller and larger topologies.

[Eric] So, 3 seconds is a pretty big fraction of that. It introduces non-trivial random (I think) error. As for n-1, I *think* it's the right one here, but I'm not sure. It's what you use for "sample variance" typically. Have you talked to a statistician?

[Authors] The 3-second query is used as a stop criterion for the test. The measurement is based on the time at which the controller receives the last discovery message. We have also reworded Step 4 to mention 3 seconds as a 'RECOMMENDED value' (as below):

   "Query the controller every t seconds (RECOMMENDED value for t is 3)
   to obtain the discovered network topology information through the
   northbound interface or the management interface and compare it with
   the deployed network topology information."
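
As an illustration of the stop criterion described above, a minimal sketch (Python; query_topology() and deployed_topology are hypothetical stand-ins for the northbound/management query and the known deployed topology, and max_wait is an added safety bound that is not in the draft; the reported discovery time itself is still taken from the last discovery message observed at the forwarding-plane emulator, not from this loop):

   import time

   def wait_for_discovery(query_topology, deployed_topology, t=3, max_wait=900):
       """Poll every t seconds (RECOMMENDED value for t is 3); stop when the
       discovered topology matches the deployed one, or when the same details
       are returned for 3 consecutive queries."""
       previous, repeats = None, 0
       deadline = time.monotonic() + max_wait
       while time.monotonic() < deadline:
           discovered = query_topology()
           if discovered == deployed_topology:
               return True            # trial completed successfully
           repeats = (repeats + 1) if discovered == previous else 1
           if repeats >= 3:
               return False           # converged on a non-matching answer
           previous = discovered
           time.sleep(t)
       return False
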
                                              SUM[SQUAREOF(Tri-TDm)]
   Topology Discovery Time Variance (TDv)     ----------------------
                                                  Total Trials - 1

You probably don't need to specify individual formulas for mean and variance. However, you probably do want to explain why you are using the n-1 sample variance formula.

[Authors] We have added both formulas based on the feedback received on the mailing list. We are using n-1, as it is a commonly used variance measure. Do we need an explanation here, or is providing a reference sufficient?

[Eric] Well, my point was that you could specify mean and variance in one place and not repeat them over and over. Well, n-1 is typically used for sample variance. This is something a little different.

[Authors] You are right. We used the n-1 sample variance to correct the bias in the estimation of the population variance.

   Measurement:
                                                    (R1-T1) + (R2-T2)..(Rn-Tn)
      Asynchronous Message Processing Time Tr1 = ------------------------------
                                                               Nrx

Incidentally, this formula is the same as \sum_i{R_i} - \sum_i{T_i}

[Authors] Good suggestion, we have incorporated this in the new version.

   messages transmitted to the controller. If this test is repeated
   with varying number of nodes with same topology, the results SHOULD
   be reported in the form of a graph. The X coordinate SHOULD be the
   Number of nodes (N), the Y coordinate SHOULD be the average
   Asynchronous Message Processing Time.

This is an odd metric, because an implementation which handled overload by dropping every other message would look better than one which handled overload by queuing.

[acm] If processing time were the only number reported, you're right. Although the early generation of controller benchmarking tools overlooked the important combinations of metrics, the Reporting Format adds the success/loss message performance:

   The report should capture the following information in addition to
   the configuration parameters captured in section 5.
   - Successful messages exchanged (Nrx)
   - Percentage of unsuccessful messages exchanged, computed using the
     formula (1 - Nrx/Ntx) * 100, where Ntx is the total number of
     messages transmitted to the controller.

BUT, it would be better if SHOULD or RECOMMENDED terms were used, to cover the case you identified.

[Authors] Updated as SHOULD in the revised draft.
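
Tying the quoted formulas together, a minimal sketch (Python; the timestamp lists and message counts are hypothetical inputs, and statistics.variance() already implements the n-1 sample variance discussed above):

   from statistics import mean, variance   # variance() uses the n-1 (sample) form

   def async_processing_metrics(tx_times, rx_times, ntx):
       """One trial: Tr = SUM(Ri - Ti) / Nrx over the Nrx answered messages,
       plus the loss percentage (1 - Nrx/Ntx) * 100 from the reporting format."""
       nrx = len(rx_times)
       tr = (sum(rx_times) - sum(tx_times)) / nrx   # same value as Eric's sum identity
       loss_pct = (1 - nrx / ntx) * 100
       return tr, loss_pct

   def across_trials(per_trial_values):
       """Mean and n-1 sample variance across repeated trials (e.g., TDm/TDv,
       or the mean/variance of Tr over 10 trials)."""
       return mean(per_trial_values), variance(per_trial_values)
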
-------------------------------------------------------------------------------------------------------------------

Mirja Kühlewind's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

Editorial comments:

1) sdn-controller-benchmark-term should probably rather be referred to in the intro (instead of the abstract).

[Authors] We have moved this sentence to the Introduction.

2) Is the test setup needed in both docs (this and sdn-controller-benchmark-term) or would a reference to sdn-controller-benchmark-term maybe be sufficient?

[Authors] We have added the test setup in both drafts as per the feedback received on the mailing list.

3) Appendix A.1 should probably also be moved to sdn-controller-benchmark-term.

[Authors] We have removed Appendix section A.1, as the same test topology is illustrated in the test setup.

-------------------------------------------------------------------------------------------------------------------

Alissa Cooper's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

Regarding this text:

   "The test SHOULD use one of the test setups described in section 3.1
   or section 3.2 of this document in combination with Appendix A."

Appendix A is titled "Example Test Topology." If it's really an example, then it seems like it should not be normatively required. So either the appendix needs to be re-named, or the normative language needs to be removed. And if it is normatively required, why is it in an appendix? The document would also benefit from describing what the exception cases to the SHOULD are (I guess if the tester doesn't care about having comparable results with other tests?).

[Authors] We will remove this section and the corresponding references in the draft, as this is already captured in the test setup.

-------------------------------------------------------------------------------------------------------------------

Adam Roach's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

I again share Martin's concerns about the use of the word "standard" in this document's abstract and introduction.

[Authors] Reworded the text in the abstract section.

-------------------------------------------------------------------------------------------------------------------

Benjamin Kaduk's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

In the Abstract:

   This document defines the methodologies for benchmarking control
   plane performance of SDN controllers.

Why "the" methodologies? That seems more authoritative than is appropriate in an Informational document.

[Authors] Agreed. We have rephrased the sentence as 'This document defines methodologies for benchmarking control plane performance of SDN controllers'.

Why do we need the test setup diagrams in both the terminology draft and this one? It seems like there is some excess redundancy, here.

[Authors] We agree, but this was done based on the feedback from the mailing list.

In Section 4.1, how can we even have a topology with just one network device? This "at least 1" seems too low. Similarly, how would TP1 and TP2 *not* be connected to the same node if there is only one device?

[Authors] You are right. This may not be a case with a single node in the topology. We have fixed that assumption.

Thank you for adding consideration of key distribution in Section 4.4, as noted by the secdir review. But insisting on having key distribution done prior to testing gives the impression that keys are distributed once and updated never, which has questionable security properties. Perhaps there is value in doing some testing while rekeying is in progress?

[Authors] The intention of this draft is to benchmark a controller after bringing it to a stable state. So we do not recommend doing benchmarking during transitions, to avoid inconsistency in the observed results.

I agree with others that the statistical methodology is not clearly justified, such as the sample size of 10 in Section 4.7 (with no consideration for sample relative variance), use of sample vs. population variance, etc.

[Authors] We are using sample variance for all test calculations, as it is widely used for finding the variance of a sample from the mean. As we only use sample variance, we felt it was not necessary to clarify this further within the scope of the document.
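
For reference on the sample-versus-population point above (background only, not text from the draft), with x_1..x_n the n per-trial measurements and x-bar their mean:

   % Population variance divides by n; the sample variance divides by n-1
   % (Bessel's correction), giving an unbiased estimate when the trials are
   % treated as a sample of a larger population of possible runs.
   \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
   \qquad
   s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
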
It seems like the measurements being described sometimes start the timer at an event at a network element and other times start the timer when a message enters the SDN controller itself (similarly for outgoing messages), which seems to include a different treatment of propagation delays in the network, for different tests. Assuming these differences were made by conscious choice, it might be nice to describe why the network propagation is/is not included for any given measurement.

[Authors] We have captured information related to this in Section 4.5.

It looks like the term "Nrxn" is introduced implicitly and the reader is supposed to infer that the 'n' represents a counter, with Nrx1 corresponding to the first measurement, Nrx2 the second, etc. It's probably worth mentioning this explicitly, for all fields that are measured on a per-trial/counter basis.

[Authors] We noticed a forward reference to Nrxn in Section 5.1.3 (Step 5). We have addressed it in the revised draft.

I'm not sure that the end condition for the test in Section 5.2.2 makes sense.

[Authors] We have added additional steps to 5.2.2 to capture the end condition.

It seems like the test in Section 5.2.3 should not allow flexibility in "unique source and/or destination address" and rather should specify exactly what happens.

[Authors] As suggested, we have removed the flexibility.

In Section 5.3.1, only considering 2% of asynchronous messages as invalid implies a preconception about what might be the reason for such invalid messages, but that assumption might not hold in the case of an active attack, which may be somewhat different from the pure DoS scenario considered in the following section.

[Authors] You are right. But this test helps to understand the system behaviour when network devices malfunction, not during DoS attacks.

Section 5.4.1 says "with incremental sequence number and source address" -- are both the sequence number and source address incrementing for each packet sent?

[Authors] Yes.

This could be more clear. It also is a little jarring to refer to "test traffic generator TP2" when TP2 is just receiving traffic and not generating it.

[Authors] Both the sequence number and the source address are increased. We will remove the term test traffic generator and simply mention TP2.
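
A small illustrative sketch of that Section 5.4.1 traffic pattern as clarified above (Python; the base address and packet count are hypothetical, and the actual packet encapsulation and sending are out of scope here):

   import ipaddress

   def flow_parameters(base_src="10.0.0.1", count=100):
       """Per-packet parameters: the sequence number and the source address
       both increment for each packet, so every packet maps to a new flow
       towards the controller."""
       base = ipaddress.IPv4Address(base_src)
       return [(seq, str(base + seq)) for seq in range(count)]

   # e.g. flow_parameters(count=3) -> [(0, '10.0.0.1'), (1, '10.0.0.2'), (2, '10.0.0.3')]
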
Appendix B.3 indicates that plain TCP or TLS can be used for communications between switch and controller. It seems like this would be a highly relevant test parameter to report with the results for the tests described in this document, since TLS would introduce additional overhead to be quantified!

[Authors] In Section 4.7, we have recommended reporting this parameter in the test report.

The figure in Section B.4.5 leaves me a little confused as to what is being measured, if the SDN Application is depicted as just spontaneously installing a flow at some time vaguely related to traffic generation but not dependent on or triggered by the traffic generation.

[Authors] You are correct. We have updated the sequence diagram to inject the flow at the beginning of the test to avoid confusion.

-------------------------------------------------------------------------------------------------------------------

Ignas Bagdonas' No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

The document seems to assume the OpenFlow dataplane abstraction model, which is one of the possible models; the practical applicability of such a model to anything beyond experimental deployments is a completely separate question outside of the scope of this document. The methodology tends to apply to a broader set of central control based systems, and not only to the data plane operations; therefore the document seems to be setting out at least something practically usable for benchmarking of such central control systems. Possibly the document could mention the assumptions made about the overall model to which the defined methodology applies.

[acm] That's a good wording suggestion; it certainly captures the working group consensus to prepare the methods independent from OpenFlow, for wider applicability. However, we also received feedback from Stewart Bryant seeking *more* message specificity (appended below, see *), which is not really possible for the general methods. Please try to strike a balance between these comments in discussion today, if possible!

A nit: s/Khasanov Boris/Boris Khasanov, unless Boris himself would insist otherwise.

-------------------------------------------------------------------------------------------------------------------

Suresh Krishnan's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

I share Ignas's concern about this being too tightly associated with the OpenFlow model.

[Authors] We have added text to state that OpenFlow is just used as an illustrative example in this document.

* Section 4.1

   The test cases SHOULD use Leaf-Spine topology with at least 1
   Network Device in the topology for benchmarking.

How is it even possible to have a leaf-spine topology with one Network Device?

[Authors] We have fixed the text to use at least 2 network devices in the test topology in the revised draft.

-------------------------------------------------------------------------------------------------------------------

Martin Vigoureux's No Objection on draft-ietf-bmwg-sdn-controller-benchmark-meth-08: (with COMMENT)

Hello,

I have the same question/comment as on the companion document: I wonder about the use of the term "standard" in the abstract in view of the intended status of the document (Informational). Could the use of this word confuse the reader?

[Authors] We have rephrased the abstract to avoid the usage of the term 'standard'.