Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt
Gabor LENCSE <lencse@hit.bme.hu> Thu, 17 November 2022 13:13 UTC
From: Gabor LENCSE <lencse@hit.bme.hu>
To: "bmwg@ietf.org" <bmwg@ietf.org>
Date: Thu, 17 Nov 2022 22:12:20 +0900
Subject: Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt
Dear Eduard,

Thank you very much for your review!

On 11/17/2022 4:02 AM, Vasilenko Eduard wrote:
[...]
> I am still sure about the big dependency between "packets per second" and "new sessions per second".

I definitely agree with it.

> But it would be utterly difficult to specify a profile for mixed traffic - it would be very different between fixed and mobile subscribers.
> Hence, let us test them separately and assume a linear influence on each other: if pps is at 60%, then cps may probably be up to 40% of the tested maximum.
> Does it make sense to specify it in the document?

To be more precise, the situation of a stateful NAT64 or NAT44 gateway is even more complicated. Several different things happen, including:

1) A new connection is established.
2) A packet is transmitted in the upload direction.
3) A packet is transmitted in the download direction.
4) A connection is torn down. (Either by closing a TCP connection or by the timeout of the TCP or UDP "connection".)

To have conditions similar to the traffic of the Internet, all of the above should be considered.

Some further comments:

As for iptables, connection tear down is much more costly (in terms of CPU power) than connection establishment. According to my measurements, in the case of 100M connections, iptables could establish 2.237M connections per second, but it could terminate only 345k connections per second. Please see Table 4 and Table 5 in:
https://datatracker.ietf.org/doc/html/draft-lencse-v6ops-transition-scalability-04

I tested the throughput of the Jool, iptables+tayga, and OpenBSD PF stateful NAT64 solutions using unidirectional traffic in the upload and download directions, and I found the results different. The actual quantities of upload and download traffic can differ very much in the case of a home user! Yet RFC 2544 / RFC 5180 / RFC 8219 require testing throughput with bidirectional traffic. We also kept it in our draft and added testing with unidirectional traffic as OPTIONAL.
Should we perhaps make testing with unidirectional traffic REQUIRED?

After all these considerations: *What should we recommend and why?*

Perhaps an appropriate mix of 1, 2, 3, and 4 could be the desired load for benchmarking a stateful NATxy gateway... But we see a lot of hindrances, including:

- Following the long-established tradition (also required by RFC 2544 / RFC 5180 / RFC 8219), we use UDP for testing. In UDP, there is no such thing as "termination of a connection". Of course, the gateway still handles UDP "connections", but they can be "terminated" only by timeout. And that makes it very hard (perhaps impossible) to use, let us say, 10% connection tear down.
- We could still use a mix of, let us say, 10% of the packets resulting in new connections and 90% of the packets belonging to existing connections. (The ratios could be changed.) However, adding 10% new connections may significantly increase the number of connections during a 60s long throughput test, because we are not able to terminate the same number of old connections. (And the number of connections highly influences the performance of a stateful NATxy gateway; please see the connection scalability results in the above-mentioned draft.)

So currently we cannot propose a better solution than measuring separately:

- connection setup performance (maximum connection establishment rate)
- packet forwarding performance measured with a constant number of connections (throughput with bidirectional traffic; optionally: throughput with unidirectional traffic)
- connection termination performance (connection tear down rate)

Can you propose a repeatable measurement that uses an appropriate mix of 1-4?

> It is cheating to test ports, not the processing engine. Of course, a lightly loaded engine would show much better results.
> Unfortunately, to get meaningful results it is important to overload the whole engine.
> In the case of a hardware device, it is one NPU on the processing line card.
> Vendors share (under NDA) performance figures for it. It is always below 100GE but above 10GE. Hence, many ports may be needed in the 10GE case.
> In the case of a virtual appliance, it is probably a VM with one vCPU of Intel or AMD. One vCPU would not be capable of overloading a 10GE port.
> Typically, the number of sessions is so big (in a Telco environment) that engines scale linearly (both software-based and hardware-based). The test for many engines (CPU cores) is not more useful, but it is more difficult to implement.

I do not really understand the point of the above. It seems to be about a special hardware device that has several cards with perhaps multiple ports... ?

In our simple model, we have a DUT with only two ports, e.g., in the case of NAT64 it looks as follows:

              +--------------------------------------+
    2001:2::2 |Initiator                    Responder| 198.19.0.2
+-------------|                Tester                |<------------+
| IPv6 address|                         [state table]| IPv4 address|
|             +--------------------------------------+             |
|                                                                  |
|             +--------------------------------------+             |
|   2001:2::1 |                 DUT:                 | 198.19.0.1  |
+------------>|        Stateful NAT64 gateway        |-------------+
  IPv6 address|     [connection tracking table]      | IPv4 address
              +--------------------------------------+

     Figure 2: Test setup for benchmarking stateful NAT64 gateways

> IMHO: the discussion of what we are trying to overload is mandatory.

What do you mean by "overload"? If I understand it correctly, we do overload during both the maximum connection establishment rate test and the throughput test in order to find the maximum lossless rate using a binary search.

> How big a session table should be created in phase 1? It may be 10, 10k, or 1M. How to decide?

Yes, this is a very good question. IMHO, if we want to imitate the situation of a commercial device used by an ISP, the number of sessions should begin around 1M and go up until our hardware is unable to handle it. I used this approach, and the upper limit was 800M and 1600M for iptables and Jool, respectively.
(Please see the connection scalability results in the above-mentioned draft in Table 4 and Table 9.)

> Or asking the same from a different direction: how many packets should be sent over every session?

In my measurements, this number was determined by the number of connections and the achieved throughput. For example, if you look at Table 4 and choose the first column (1.56M connections), then the median throughput is 5.326M frames per second. This means that about 319.56M packets were forwarded during the 60s long throughput test, thus about 205 packets belonged to a single connection. However, if you choose the last column (800M connections), then the median throughput is 3.689M frames per second. This means that about 221.34M packets were forwarded during the 60s long throughput test, thus on average fewer than one packet belonged to a single connection. If you do a similar calculation with Jool, you will get an order of magnitude lower values due to its lower throughput.

> Section 2:
> I do not understand why pseudorandom port numbers for every packet were assumed in the basic design.

The aim of using a pseudorandom enumeration of all port number combinations that are possible with the given source and destination port number ranges is twofold:

1) To achieve that all test frames result in a new connection during the preliminary test phase. -- This is needed to measure connection establishment performance.
2) To fill up the connection tracking table of the DUT and the state table of the Tester as soon as possible. -- This is important if the preliminary test phase is performed in preparation for a real test phase.

> It is an unrealistic assumption that somebody would mistakenly generate random packets instead of sessions.
> I could not believe such a mistake. People have been testing stateful devices (FW, LB, NAT) for ages.
> Hence, it does not make sense to warrant against it.
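For illustration, the pseudorandom enumeration of all port number combinations described above could be sketched as follows (a minimal Python sketch of the idea only, not siitperf's actual implementation):

```python
import itertools
import random

def enumerate_four_tuples(src_ports, dst_ports, seed=42):
    """Pre-compute a pseudorandom enumeration of ALL source/destination
    port combinations, so that every frame of the preliminary test phase
    creates a new connection in the DUT (aim 1) and the connection
    tracking table fills up as fast as possible (aim 2)."""
    combos = list(itertools.product(src_ports, dst_ports))
    random.Random(seed).shuffle(combos)  # fixed seed keeps tests repeatable
    return combos

# 100 source ports x 10 destination ports -> 1000 unique connections
combos = enumerate_four_tuples(range(10000, 10100), range(80, 90))
assert len(combos) == 1000 and len(set(combos)) == 1000
```

Because every combination appears exactly once, the number of connections is simply the product of the two port-range sizes (e.g., 40,000 source ports and 10 destination ports give 400,000 connections).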
> Could you rephrase this section: "Of course, it's known that the load for a stateful device should be flow based where many packets from both directions are simulating one session (many packets should look like 1 session)".
> Hardware-based testers like Spirent support it.

I feel that you expect to use a given number of sessions, and the packets should belong to those sessions. Am I right? We do exactly that. But first we establish the connections for all the required sessions in the preliminary test phase, and then we generate packets in the real test phase that belong to those sessions.

> It is just a question of how many packets are in one session and how many sessions.

Yes, it is a good question. There can be multiple approaches:

- If the tests are commissioned by a network operator, then the numbers should be tailored to the statistics of the network of the operator.
- If the tests are performed by an academic researcher (like me), then -- I think -- wide ranges should be examined: starting from a low realistic number up to the hardware limits, as I did in the above-mentioned draft, to provide a general, wide-angle picture of the implementation. :-)

What do you think?

> New session generation could be pseudorandom (primarily on the source port).

Yes, I agree, but RFC 4814 requires pseudorandomness for both SOURCE and DESTINATION port numbers:
https://www.rfc-editor.org/rfc/rfc4814#section-4.5
Please see my considerations in Section 2.3 of
http://www.hit.bme.hu/~lencse/publications/ECC-2022-SFNATxy-Tester-published.pdf

> Section 3: a little inconsistency:
> I guess that a "state table" should be created on the Initiator too (you mentioned only the Responder).

In fact, something like a "state table" is actually created in the Initiator of siitperf for performance considerations. The pseudorandom enumeration of all possible port number combinations happens before the preliminary test phase, and the combinations are stored in an array.
They are just read from there linearly during the preliminary test phase. (As the IP addresses are fixed, one can say that the "state table" is there.) But we do not need it after the preliminary test phase. The Initiator simply uses pseudorandom port numbers in the real test phase: as all possible combinations were enumerated in the preliminary test phase, no new combinations may occur.

> The "connection tracking table" is claimed to be "unknown for the Tester" because it is inside the "NATxy gateway". But then why has it been mentioned in "Initiator"? It is unknown, right?

Do you mean the following text?

   * Initiator: The port of the Tester that may initiate a connection
     through the stateful DUT in the client to server direction.
     Theoretically, it can use any source and destination port numbers
     from the ranges recommended by [RFC4814
     <https://datatracker.ietf.org/doc/html/rfc4814>]: if the used four
     tuple does not belong to an existing connection, the DUT will
     register a new connection into its connection tracking table.

The connection tracking table is mentioned as an explanation of what happens. Yes, the content of the connection tracking table is unknown to both the Initiator and the Responder.

> Section 4.1: if destination port numbers are so condensed around a few numbers in the real Internet, then why "from a few to several hundreds or thousands as needed" in the test?

YES, THIS IS A VERY MUCH VALID QUESTION!

My answer is historical: at the time of designing siitperf (the only stateful NAT64 / NAT44 Tester that supports the draft), I did not want to make too much work for myself, and I used a single fixed IP address pair. Thus, currently the only way to achieve hundreds of millions of connections with siitperf is to increase the destination port number range. Let us see some example numbers.
If we have 40,000 source port numbers, then:

- 10 destination port numbers result in 400,000 connections
- 100 destination port numbers result in 4M connections
- 1000 destination port numbers result in 40M connections

Regarding this design decision of using a single IP address pair, Section 4.4 of our draft currently says:

   1. A single source address and destination address pair is used for
      all tests.

We make this assumption for simplicity. Of course, we are aware that [RFC2544 <https://datatracker.ietf.org/doc/html/rfc2544>] requires testing also with 256 different destination networks. We have discussed it with my co-author, Keiichi Shima, and we believe that:

- On the one hand, the usage of multiple destination NETWORKS is not interesting here, because we do not do router testing.
- On the other hand, the usage of multiple IP ADDRESSES may be appropriate here, especially because of our experience with OpenBSD. Please see slides 11 and 12 of my IETF 115 BMWG presentation:
https://datatracker.ietf.org/meeting/115/materials/slides-115-bmwg-benchmarking-methodology-for-stateful-natxy-gateways-using-rfc-4814-pseudorandom-port-numbers

It was our question whether "*Shall we add the usage of multiple IP addresses as a requirement?*", and _we still have this question open._ Using multiple IP addresses may be useful not only to generate entropy for the hash function to support RSS, but also to generate a high number of network flows using a low(er) number of destination port numbers. What do you think?

However, then we must also convince the authors of RFC 4814 (and the BMWG members) that narrowing down the destination port number range is not a violation of the following requirement in Section 4.5 of RFC 4814:

   In addition, it may be desirable to pick pseudorandom values from a
   selected pool of numbers. Many services identify themselves through
   use of reserved destination port numbers between 1 and 49151
   inclusive.
   Unless specific port numbers are required, it is RECOMMENDED to pick
   randomly distributed destination port numbers between these lower
   and upper boundaries.

> It is probably not important to mention that one MAP subscriber has a very limited number of source ports, because the "Border Relay" (our DUT) would have very many subscribers.

Yes, when testing a MAP BR, it is fair to assume many MAP CEs. But the MAP BR is a stateless device, and thus it is out of scope for our draft. However, the MAP CE is stateful, so it may be in scope.

> Section 4.4: Why do we manipulate ports but keep only one IP address? Real scenarios manipulate IP addresses very much (tens of thousands of subscribers are possible for one NAT gateway). Is it possible to recommend at least source address randomization too? (Destination addresses would be very limited in the wild Internet.)

Please see my answers above: I am not against using multiple IP addresses. However, if the Initiator uses only multiple SOURCE IP addresses, then they will appear only as multiple source port numbers after the stateful NATxy gateway has translated the packet. Therefore, they will not be visible when the Responder generates traffic based on the received four tuples. If the Initiator uses multiple DESTINATION IP addresses, they will "survive" the stateful NATxy translation and will also appear in the traffic generated by the Responder.

>> Important warning: in normal (non-NAT) router testing, the port
>> number selection algorithm, whether it is pseudo-random or enumerated
>> in increasing (or decreasing) order does not affect final results.

> In reality, it affects the results. Depending on the router configuration, the router may use port numbers for hash-based load balancing. It would affect the load distribution over many links.

I think that the order should not matter if the hash function is good enough.

> Section 4.5:
>> In practice, we RECOMMEND the usage of binary search.
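A minimal sketch of such a binary search for the maximum lossless rate (an illustration only; `trial` is a hypothetical callback that performs one trial at the given frame rate and reports whether it was lossless):

```python
def max_lossless_rate(trial, r_min=0, r_max=1_000_000, precision=100):
    """Binary search for the throughput: the highest frame rate
    (in frames per second) at which the trial reports zero frame loss."""
    best = r_min
    while r_max - r_min > precision:
        rate = (r_min + r_max) // 2
        if trial(rate):          # lossless: continue in the upper half
            best = r_min = rate
        else:                    # frames lost: continue in the lower half
            r_max = rate
    return best

# Toy DUT model that starts losing frames above 345,000 fps:
rate = max_lossless_rate(lambda r: r <= 345_000)
assert 344_900 <= rate <= 345_000
```

The default bounds are placeholders; in practice the initial upper bound would be the maximum frame rate of the medium, which is exactly the starting point discussed below.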
> We could listen to the vendor's claims and choose the initial point more smartly.

I am not sure if I understand what you mean. Do you mean that, compared to using 0 and the maximum frame rate for the media as the initial lower bound and the initial upper bound, respectively, one can have a better starting point for the binary search?

> Section 4.5.1:
>> We RECOMMEND median as the summarizing function of the
>> results complemented with the first percentile and the 99th
>> percentile as indices of the dispersion of the results.
>> connections/s 1st perc. (req.)
>> connections/s 99th perc. (req.)
> When I was a student, I was told (around 1987) that the "standard deviation" is better to use for filtering test data.
> The theory is here: https://en.wikipedia.org/?title=3-sigma&redirect=no
> In the case of Microsoft Excel, it is something like this:
> Min=AVERAGE(Array)-STDEV(Array)
> Max=AVERAGE(Array)+STDEV(Array)
> Look here: https://www.ablebits.com/office-addins-blog/calculate-standard-deviation-excel/

Yes, they are very good if the results follow a NORMAL DISTRIBUTION. However, I believe that we may not assume a normal distribution here, because, according to my experience, the distribution of the results of the multiple experiments is not even symmetric. Please consider that, due to some unexpected event, some frames may be lost, and thus there are outliers among the results towards zero. However, there should not be significant outliers in the other direction in a well-working system. (There is no "accidental" high performance.) We chose the median as the summarizing function because it is less sensitive to outliers than the average.

Section 7.2 of RFC 8219 redefines the Latency measurement of RFC 2544 to provide better quality results, and it recommends:

   To account for the variation, the 1st and 99th percentiles of the 20
   iterations MAY be reported in two separated columns.

We followed that approach.
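To illustrate why the median with the 1st and 99th percentiles is preferred here over mean and standard deviation, consider a result set with one loss-induced outlier toward zero (a hypothetical sketch using Python's standard library):

```python
import statistics

def summarize(results):
    """Summarize a series of benchmark results: median plus the 1st and
    99th percentiles as indices of dispersion."""
    q = statistics.quantiles(results, n=100, method='inclusive')
    return statistics.median(results), q[0], q[-1]

# 19 consistent throughput results (Mfps) and one outlier toward zero:
runs = [5.3] * 19 + [0.6]
median, p01, p99 = summarize(runs)
assert median == 5.3                  # the median is robust to the outlier
assert statistics.mean(runs) < 5.1    # the mean is dragged down by it
assert p01 < median <= p99            # the percentiles expose the dispersion
```

With an asymmetric distribution like this, mean minus one standard deviation would misrepresent the typical performance, while the median stays at the value the system actually delivers.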
> Section 4.6
>> * The Initiator counts the received frames, and if all N frames
>> have arrived, then the R frame rate of the maximum connection
>> establishment rate measurement (performed in the preliminary test
>> phase) is raised for the next iteration, otherwise lowered (as
>> well as in the case if test frames were missing in the preliminary
>> test phase).
> I do not remember such a capability on hardware testers (it was many years ago). But hardware testers are mandatory to test an engine that is close to 100Gbps.
> We need somebody from Spirent to judge: is it possible?

It can be easily implemented. It is just that all the elements of the state table of the Responder are to be used for packet generation. The "cheapest" solution is to use them in linear order. (It can be done with siitperf.) A more sophisticated implementation could use them in pseudorandom order. (I have not yet implemented that, but it is only a matter of time and work.)

> Section 4.8: How are you going to delete the connection table of a hardware device? The CLI may return the prompt even before the job is finished. The platform may be asynchronous.
> Moreover, because of the:
>> We are aware that the performance of removing the entire content of
>> the connection tracking table at one time may be different from
>> removing all the entries one by one.
> the value of such a test is very questionable.

Yes, I share all your concerns. However, up to now, I have received consistent results when I used different numbers of connections. You can find them in Table 5 and Table 10 of my before-mentioned draft for iptables and Jool. Since then, I have tested it also for OpenBSD PF.

> Alternatively, it is possible to establish connections very fast (at the same rate that was tested as a maximum) and then continue to send traffic over the first and the last sessions. Sessions would expire according to their creation time. It would be possible to monitor that the tear down time is not worse than the connection establishment time.
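Returning to the Responder's packet generation discussed above: selecting four tuples from the state table in linear or pseudorandom order could be sketched like this (an illustration only, not siitperf's code; the state table entries stand for the four tuples learned in the preliminary test phase):

```python
import random

def responder_tuples(state_table, count, pseudorandom=False, seed=1):
    """Pick `count` entries for the Responder's test frames from the
    state table: either cycling through it linearly (the "cheapest"
    solution) or drawing entries in pseudorandom order."""
    if pseudorandom:
        rng = random.Random(seed)
        return [state_table[rng.randrange(len(state_table))]
                for _ in range(count)]
    return [state_table[i % len(state_table)] for i in range(count)]

table = [(10000 + i, 80) for i in range(4)]     # a tiny state table
assert responder_tuples(table, 6) == [(10000, 80), (10001, 80),
                                      (10002, 80), (10003, 80),
                                      (10000, 80), (10001, 80)]
```

Either way, every generated frame belongs to a connection that already exists in the DUT's connection tracking table, which is exactly what the real test phase requires.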
> I worked for a vendor selling NAT/FW for a decade. I never discussed tear down performance with customers/partners/vendor R&D. I propose deleting it completely.

Well, connection tear down performance was also overlooked in RFC 8219, of which I am a co-author. However, after my experience with iptables, I cannot overlook it any more, because its connection tear down performance is much lower than its connection establishment performance (as I mentioned above). Note: their proportion may depend on the ratio of the "hashsize" and "nf_conntrack_max" parameters.

> Section 4.9:
>> DUT was exhausted and it stopped responding
>> if the DUT collapses
> It should be qualified as a test failure.

I definitely agree with it. In theory, it is easy to do so. But in practice, we need some mechanism to detect it, reboot the server, wait until it is ready, and continue the binary search with the lower half of the interval.

> It is a severe bug when the device is "not responding". The device should drop packets after any resources are exhausted but remain available for management and monitoring.

I agree with you that it should be so. But unfortunately, I experienced the opposite.

Once again, thank you very much for all your work reading and commenting on our draft!!!

Best regards,

Gábor

> Eduard
> -----Original Message-----
> From: bmwg [mailto:bmwg-bounces@ietf.org] On Behalf Of internet-drafts@ietf.org
> Sent: Thursday, October 20, 2022 5:14 AM
> To: i-d-announce@ietf.org
> Cc: bmwg@ietf.org
> Subject: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt
>
>
> A New Internet-Draft is available from the on-line Internet-Drafts directories.
> This draft is a work item of the Benchmarking Methodology WG of the IETF.
>
>         Title           : Benchmarking Methodology for Stateful NATxy Gateways using RFC 4814 Pseudorandom Port Numbers
>         Authors         : Gabor Lencse
>                           Keiichi Shima
>         Filename        : draft-ietf-bmwg-benchmarking-stateful-01.txt
>         Pages           : 25
>         Date            : 2022-10-19
>
> Abstract:
>    RFC 2544 has defined a benchmarking methodology for network
>    interconnect devices. RFC 5180 addressed IPv6 specificities and it
>    also provided a technology update, but excluded IPv6 transition
>    technologies. RFC 8219 addressed IPv6 transition technologies,
>    including stateful NAT64. However, none of them discussed how to
>    apply RFC 4814 pseudorandom port numbers to any stateful NATxy
>    (NAT44, NAT64, NAT66) technologies. We discuss why using
>    pseudorandom port numbers with stateful NATxy gateways is a
>    difficult problem. We recommend a solution limiting the port number
>    ranges and using two phases: the preliminary test phase and the
>    real test phase. We show how the classic performance measurement
>    procedures (e.g. throughput, frame loss rate, latency, etc.) can be
>    carried out. We also define new performance metrics and measurement
>    procedures for maximum connection establishment rate, connection
>    tear down rate and connection tracking table capacity measurements.
>
> The IETF datatracker status page for this draft is:
> https://datatracker.ietf.org/doc/draft-ietf-bmwg-benchmarking-stateful/
>
> There is also an htmlized version available at:
> https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful-01
>
> A diff from the previous version is available at:
> https://www.ietf.org/rfcdiff?url2=draft-ietf-bmwg-benchmarking-stateful-01
>
> Internet-Drafts are also available by rsync at rsync.ietf.org::internet-drafts
>
> _______________________________________________
> bmwg mailing list
> bmwg@ietf.org
> https://www.ietf.org/mailman/listinfo/bmwg