Re: [bmwg] WG Last Call: draft-ietf-bmwg-ngfw-performance-05

Hi Brian (and hi bmwg people).

Part one: replies to previous comments.

>> After holidays, I will write higher quality comments, one e-mail per area.

Yeah, that did not happen. :(
Here we are again. Another deadline, another quick and dirty (but still quite long) e-mail.

>> 1. Test Bed Considerations. 

> In our opinion, we should keep this section in the draft.
> It just creates an awareness of pre-test to the readers.

Alright. If it is just for awareness,
it is ok to keep it vague and short.

>> 2. Sentence with "safety margin of 10%". Unclear.

> We suggest removing it.

Deal.

>> 3. Is it "test bed" or "testbed"?

> We will describe this in the next version.

Ok.

>> 4. Sustain phase follows after ramp-up phase immediately

> 1. ask test tool vendors if there is any way to add pause between two phases

I do not think it is a good idea.
Any pause between the two phases can affect the KPIs,
especially in the first 2-second interval.

> I don't think the in-flight traffic will impact accuracy of the results.  

The impact should not be big compared to the allowed tolerances.
But it can make some KPIs look strange: receiving more packets than were sent, or
closing more connections than were opened (in the same 2s period or over the 300s period).

Basically, just add a short sentence explaining that some packets/transactions
being in flight is the expected state, and that the number of in-flight things
can vary between the 2s periods.

Depending on the test equipment, even reading KPIs sequentially can introduce some skew.
If you read the number of packets sent (by the test equipment) first,
then by the time you read the number of packets received, all those sent packets
may have already arrived, and then some (showing a negative number of packets in flight).
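
To illustrate the skew, here is a minimal sketch in Python. It is a toy model,
not a description of any real test equipment: the constant send rate, the fixed
delay, the gap between the two counter reads, and all the names are assumptions
made up for this example.

```python
# Toy model, not real test equipment: all names and numbers are assumptions.
# Time is in integer microseconds to keep the arithmetic exact.
PACKETS_PER_US = 1    # assumed constant send rate (1 packet per microsecond)
DELAY_US = 500        # assumed fixed delay between sending and receiving
READ_GAP_US = 2000    # assumed time between reading the two counters


def sent_at(t_us):
    """Cumulative packets sent by the test equipment up to time t_us."""
    return max(t_us, 0) * PACKETS_PER_US


def received_at(t_us):
    """Cumulative packets received up to time t_us (no loss in this model)."""
    return sent_at(t_us - DELAY_US)


def in_flight_reading(t_us, read_gap_us):
    """Read the sent counter at t_us, the received counter read_gap_us later."""
    return sent_at(t_us) - received_at(t_us + read_gap_us)


simultaneous = in_flight_reading(2_000_000, 0)
sequential = in_flight_reading(2_000_000, READ_GAP_US)
print(simultaneous, sequential)
```

With these assumed numbers the simultaneous read shows 500 packets in flight,
while the sequential read shows -1500, even though nothing is wrong with the SUT.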

Now that I think about it, if the SUT buffers too much data,
and this volume varies wildly between 2s periods,
you may see validation criteria failing just because of that.
Not sure how probable that is, so not sure whether the draft should mention it.

>> 5. Validation criteria. 

I think I have a better understanding now.
Initial Throughput acts like a lower bound of a binary search
(if it does not meet validation criteria the test fails),
Target Throughput acts like an upper bound
(we tolerate SUT failing validation criteria at this load),
and we are searching for the highest load still meeting the validation criteria.
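
The reading above can be sketched as a plain bisection. This is only my
interpretation, not the procedure from the draft: the function name, the
`meets_criteria` callback, the `precision` parameter, and the assumption that
the SUT passes at every load below the highest passing one are all hypothetical.

```python
def search_throughput(initial, target, meets_criteria, precision):
    """Return the highest load found that still meets the validation criteria.

    Assumes (hypothetically) that the criteria are met for every load below
    some unknown limit and not met above it.
    """
    if not meets_criteria(initial):
        # Lower bound does not pass: the whole test fails.
        raise RuntimeError("validation criteria not met at initial throughput")
    if meets_criteria(target):
        # Upper bound passes: nothing to search for.
        return target
    low, high = initial, target  # invariant: low passes, high fails
    while high - low > precision:
        mid = (low + high) / 2
        if meets_criteria(mid):
            low = mid
        else:
            high = mid
    return low


# Made-up SUT that meets the criteria up to 7.3 (arbitrary load units).
result = search_throughput(1.0, 10.0, lambda load: load <= 7.3, 0.1)
print(result)
```

In this made-up example the search ends within 0.1 below the assumed limit of 7.3.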

> There will always be continuous minor spikes.

I will wait for the next draft version to see how exactly that is handled.

>> 6. It seems the same word "throughput" is used to mean different quantities
>> depending on context.

> We will work on this in the next draft.

Ok, will comment when I see it.

>> 7. SUT state affecting performance. 

> SUT MUST be stateful

Looks like we agree on "any such state enters a stationary state during ramp-up,
so in sustain phase the performance is stable".

> per TCP connection and the session will be closed once the transactions are
> completed. SUT will then remove the session entries from its session table.

If it detects the TCP connection is closed. It may not, due to packet loss?
I am not that familiar with TCP (comparing state tracking on client, server, and DUT).
But what about UDP (or QUIC or other exotic transports)?
The draft seems to be mentioning only TCP,
but is it a hard requirement for all traffic profiles?

More detailed comments in a future (hypothetical) e-mail.

>> 8. Stateless or stateful traffic generation.

> Note: Stateful traffic generators MUST be used for all benchmarking tests

Makes sense.

Part two: new comments.

On table formatting.
In the table with descriptions of security features,
some rows have a space between pipe and start of description line,
but some do not.
Similar with space between end of description line and pipe.
Table 4 is consistent across rows, but spacing is different across columns.

On latency, the "time from" part. Section 6.1:
The distinction between TTLB and TTFB is clear on the "time to" part,
but the "time from" part is not clear enough to me.
"sending the SYN packet from the client" (for TTFB)
sounds like start of request, but
"sending a GET request from Client" could be start or end.
Compare to the detailed distinction in [4].

On KPI averaging intervals.
Section 7.1.4.2 states:
"The frequency of KPI metric measurements SHOULD be 2 seconds."
Other 7.x.4.2. sections contain similar sentences.
But KPI descriptions in section 6.1 refer to
"average {quantity} per second (with)in the sustaining period".
It is not clear to me which time interval we are averaging over.
It could be:
A. Time interval since the previous KPI measurement (2 seconds ago)
till the current KPI measurement (now).
B. Time interval since start of sustain phase till now.
C. The whole 300s sustain phase (only known after the last KPI measurement).

Sounds like a minor distinction, but it is important for validation criteria.
An iteration that is valid under option C can contain invalid 2-second intervals
(and under option B it depends on how soon the invalid intervals happen).
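
A small sketch of the three options, to show how the same counters can pass or
fail. Everything here is assumed for illustration: the target rate, the 5%
tolerance, the function names, and a 10s phase standing in for the 300s one.
None of it is taken from the draft.

```python
# Illustration only: the period, target rate and tolerance are assumed values,
# and the function names are hypothetical (they are not from the draft).
PERIOD = 2          # seconds between KPI measurements
TARGET = 100.0      # assumed target rate (units per second)
TOLERANCE = 0.05    # assumed allowed relative deviation


def option_a(counters):
    """Average over each interval between consecutive measurements."""
    return [(b - a) / PERIOD for a, b in zip(counters, counters[1:])]


def option_b(counters):
    """Average from the start of the sustain phase up to each measurement."""
    start = counters[0]
    return [(c - start) / (PERIOD * i) for i, c in enumerate(counters) if i > 0]


def option_c(counters):
    """Single average over the whole sustain phase."""
    return (counters[-1] - counters[0]) / (PERIOD * (len(counters) - 1))


def is_valid(rate):
    """True if the rate is within the assumed tolerance of the target."""
    return abs(rate - TARGET) <= TOLERANCE * TARGET


# Cumulative counter read every 2 seconds (a 10s phase instead of 300s).
counters = [0, 200, 400, 560, 800, 1000]
valid_a = [is_valid(r) for r in option_a(counters)]
valid_b = [is_valid(r) for r in option_b(counters)]
valid_c = is_valid(option_c(counters))
print(valid_a, valid_b, valid_c)
```

Here one slow interval followed by one fast interval fails twice under option A,
once under option B, and not at all under option C.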

On reported details.
Point 2.F in section 6 tells the report to contain information
about the traffic mix used. It should be detailed enough.

Here it becomes part three: vague and maybe off-topic comments.

There is a sliding scale of strictness for BMWG documents.
If a document is very strict, there is a risk of being too restrictive
and hampering further progress. But the public can be reasonably sure
that if two benchmarking labs test the same SUT, they should report the same results.
Or at least equivalent ones, if the SUT performance is not very deterministic.
On the other end of the spectrum, if a document is vague enough,
different labs can apply different improvements to the testing procedure,
allowing them to get equivalent results in shorter time
or with fewer resources. The downside is that the public has trouble
believing that results from two benchmarking labs will be close enough.

Usually, I think the preferred compromise is to be vague in details
we believe do not affect the results,
be strict in details that are "industry standard" already,
and for most other things give freedom coupled with
a MUST on documenting the chosen details.
Ideally, such documented details from one benchmarking lab
are detailed enough that other benchmarking labs can implement them
and get equivalent results. Or explain they cannot implement them,
because their test equipment has too different a set of features and limitations.

I think this draft gives freedom in many places,
without requiring documentation. Or the documentation requirement
is not strict enough for other benchmarking labs to believe
they can replicate the results (without guessing the missing details).

Finally, there is another sliding scale, this time for tests.
Some tests want to be simple and synthetic, to give results as accurate as possible.
Other tests want to be realistic (thus not simple),
to give users a good idea of what could happen in production.
Synthetic tests are useful for DUT developers:
their regression testing becomes more sensitive,
and it is easier to investigate where the regression is.
Realistic tests are useful for sales, as customers are rarely interested
in DUT behavior in obviously unrealistic cases.

I think your draft prioritizes realism,
which is why it is hard to describe all the details precisely enough.
But maybe my view is wrong, and your focus is somewhere else?

Vratko.

[4] https://tools.ietf.org/html/rfc1242#section-3.8

-----Original Message-----
From: bmwg <bmwg-bounces@ietf.org> On Behalf Of bmonkman@netsecopen.org
Sent: Thursday, 2020-December-24 17:03
To: 'Vratko Polak -X (vrpolak - PANTHEON TECH SRO at Cisco)' <vrpolak=40cisco.com@dmarc.ietf.org>; bmwg@ietf.org
Cc: 'MORTON, ALFRED C (AL)' <acm@research.att.com>; 'Bala Balarajah' <bala@netsecopen.org>
Subject: Re: [bmwg] WG Last Call: draft-ietf-bmwg-ngfw-performance-05

Vratko et al,

See comments from the authors inline below - preceded by [authors].

Brian

-----Original Message-----
From: bmwg <bmwg-bounces@ietf.org> On Behalf Of Vratko Polak -X (vrpolak -
PANTHEON TECH SRO at Cisco)
Sent: December 23, 2020 10:31 AM
To: bmwg@ietf.org
Cc: MORTON, ALFRED C (AL) <acm@research.att.com>
Subject: Re: [bmwg] WG Last Call: draft-ietf-bmwg-ngfw-performance-05

> Please read and express your opinion on whether or not this 
> Internet-Draft should be forwarded to the Area Directors for 
> publication as an Informational RFC.

The current draft is a large document, and I will have multiple comments.
I expect some of them will be addressed by creating -06 version, so my
opinion is -05 should not be forwarded for publication.

> Send your comments to this list or to co-chairs at 
> bmwg-chairs@ietf.com

The issue is, I do not have all the comments ready yet.
In general, I need to spend some effort when turning my nebulous ideas into
coherent sentences (mostly because only when writing the sentences I realize
the topic is even more complicated than I thought at first).

Also, specifically for BMWG, I want my comments to be more complete than
usual.
Not just "I do not like/understand this sentence", but give a new sentence
and a short explanation why the new sentence is better.
I have two reasons for aiming for high quality comments.
First, I imagine many people are reading this list.
That means, if I write a lazy superficial comment, I save my time, but
readers will spend more time trying to reconstruct my meaning.
(Similar to how in software development, code is written once but read many
times.) Second reason is high latency on this mailing list.
Usually, by the time the author reacts to the comments, the reviewer has
switched their attention to other tasks, so it is better when the first
comment does not need any subsequent clarifications from the reviewer.

> allow for holidays and other competing topics

I reserved some time before holidays, originally for improving MLRsearch,
but NGFW is closer to publishing so it takes precedence.

My plan is to start with giving a few low-quality comments, mainly to hint
what areas I want to see improved.
After holidays, I will write higher quality comments, one e-mail per area.
This e-mail contains the low-quality comments (in decreasing order of
brevity).

1. Test Bed Considerations. Useful, but maybe should be expanded into a
separate draft.
(Mainly expanding on "testbed reference pre-tests", and what to do if they
fail but we still want some results.)

[authors] The section "Test Bed Considerations" just gives a recommendation
(even though we haven't used the capitalized keyword "RECOMMEND"). The section
describes the importance of the pre-test, and it also gives an idea about
pre-test. The Test labs or any user can decide themselves, if the pre-test
is needed for their test.  However, based on our discussions with test labs,
they usually perform such a pre-test. In our opinion, we should keep this
section in the draft. It just creates an awareness of pre-test to the
readers.

2. Sentence with "safety margin of 10%". Unclear.
If you want to add or subtract, name both the quantity before and after the
operation, so in later references it is clear which quantity is referenced.
Also, why 10% and not something else (e.g. 5%)?

[authors] You are right. Either we need to change the wording or remove the
whole sentence. We suggest removing it.

3. Is it "test bed" or "testbed"?
I assume it means "SUT" plus "test equipment" together, but it should be
clarified.

[authors] Based on Oxford and Cambridge, it should be "test bed". We will
solve the inconsistency issue in the next version. A test bed should also
include test equipment. We will describe this in the next version.

4. Sustain phase follows after ramp-up phase immediately, without any pause,
right? Then there is in-flight traffic at sustain phase start and end,
making it hard to get precise counters.

[authors] We don't think we can add a pause between the ramp-up and sustain
phases. Since the measurement frequency is 2 seconds and the total
sustain phase is 300s, I don't think the in-flight traffic will impact
accuracy of the results. However, we have two suggestions here:
1. ask test tool vendors if there is any way to add pause between two phases
2. we can describe in the draft that the measurement should occur between X
sec (e.g. 2 sec) after ramp-up begins and X sec before ramp-up ends.
If it doesn't appear to be possible to build in a pause, we would go with
option 2.

5. Validation criteria. The draft contains terms "target throughput" and
"initial throughput", but also phrases like "the maximum and average
achievable throughput within the validation criteria".
I am not even sure if validation criteria apply to a trial (e.g. telemetry
suggests test equipment behavior was not stable enough) or a whole search
(e.g. maximum achievable throughput is below acceptance threshold).

[authors] Section 6.1 describes the average throughput. Due to the
behavior of stateful traffic (TCP) and also test tool behavior, getting a
100% linear (stable) throughput is not easy. There will always be continuous
minor spikes. That's why we chose to measure the average values.
We will remove the wording "maximum ..." in the next version. Also, we will
clarify that throughput always means average throughput. For example, "target
throughput" means "average target throughput".

6. It seems the same word "throughput" is used to mean different quantities
depending on context.
Close examination suggests it probably means forwarding rate [0] except the
offered load [1] is not given explicitly (and maybe is not even constant).
When I see "throughput" I think [2] (max offered load with no loss), which
does not work as generally the draft allows some loss.
Also, some terms (e.g. "http throughput") do not refer to packets, but other
"transactions".

[authors] The throughput measurement defined in [2] doesn't fit L7
stateful traffic. For example, TCP retransmissions do not always indicate packet
loss. Due to the test complexity and test tool behavior we have to allow
some transaction failures. Therefore, we needed a different
definition for the throughput KPI. Section 6.1 describes that the KPI
measures the average Layer 2 throughput. But you are right; the term "http
throughput" can be considered as L7 throughput or goodput. We will work on
this in the next draft.

7. SUT state affecting performance. The draft does not mention any, so I
think it assumes "stateless" SUT.
An example of "stateful" SUT is NAT, where opening sessions has lower
performance than forwarding on already opened sessions.
Or maybe it is assumed any such state enters a stationary state during
ramp-up, so in sustain phase the performance is stable (e.g. NAT sessions
may be timing out, but in a stable rate).

[authors] SUT MUST be stateful, and it must do Stateful inspection. It
doesn't mean that the SUT must do NAT if it is in stateful mode. NAT is just
another feature which can or can't be enabled and this is based on the
customer scenario.
The traffic profile has limited (e.g. 10 for throughput test) transactions
per TCP connection and the session will be closed once the transactions are
completed. SUT will then remove the session entries from its session table.
This means new stateful sessions will always be opened and
established during the sustain period as well. Apart from this, we can
consider whether we want to add NAT as an optional feature in the feature
table (table 2).

8. Stateless or stateful traffic generation. Here stateless means
predetermined packets are sent at predetermined times.
Stateful means time or content of next-to-send packet depends on time or
content of previously received packets.
Draft section 7.1 looks like stateless traffic to me (think IMIX [3]), while
others look like stateful (you cannot count http transaction rate from lossy
stateless traffic).
In general, stateful traffic is more resource intensive for test equipment,
so it is harder to achieve high enough offered load.
Also, stateful traffic generation is more sensitive to packet loss and
latency of SUT.

[authors] This is not IMIX [3]. IMIX [3] is defined in terms of variable packet
sizes. But here in the draft, we define the traffic mix based on different
applications and their object sizes. For example, an application mix can be
HTTPS, HTTPS, DNS (UDP), VOIP (TCP and UDP), etc. In this example we
have a mix of stateful and stateless traffic, and each application has
different object sizes. One object can have multiple packets with different
sizes. The packet sizes depend on multiple factors, namely: TCP
behavior, MTU size, and total object size.
Note: Stateful traffic generators MUST be used for all benchmarking tests
and we used/are using stateful traffic generators for the NSO certification
program.

Vratko.

[0] https://tools.ietf.org/html/rfc2285#section-3.6.1
[1] https://tools.ietf.org/html/rfc2285#section-3.5.2
[2] https://tools.ietf.org/html/rfc2544#section-26.1
[3] https://tools.ietf.org/html/rfc6985

-----Original Message-----
From: bmwg <bmwg-bounces@ietf.org> On Behalf Of MORTON, ALFRED C (AL)
Sent: Friday, 2020-December-18 19:16
To: bmwg@ietf.org
Subject: [bmwg] WG Last Call: draft-ietf-bmwg-ngfw-performance-05

Hi BMWG,

We will start a WG Last Call for

Benchmarking Methodology for Network Security Device Performance

https://tools.ietf.org/html/draft-ietf-bmwg-ngfw-performance-05

The WGLC will close on 22 January, 2021, allow for holidays and other
competing topics (IOW, plenty of time!)

Please read and express your opinion on whether or not this Internet-Draft
should be forwarded to the Area Directors for publication as an
Informational RFC.  Send your comments to this list or to co-chairs at
bmwg-chairs@ietf.com

for the co-chairs,
Al

_______________________________________________
bmwg mailing list
bmwg@ietf.org
https://www.ietf.org/mailman/listinfo/bmwg