Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt

Vasilenko Eduard <vasilenko.eduard@huawei.com> Mon, 21 November 2022 08:12 UTC

From: Vasilenko Eduard <vasilenko.eduard@huawei.com>
To: Keiichi SHIMA <shima@wide.ad.jp>
CC: Gabor LENCSE <lencse@hit.bme.hu>, "bmwg@ietf.org" <bmwg@ietf.org>
Date: Mon, 21 Nov 2022 08:12:09 +0000
Subject: Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt

Hi Keiichi,
It is a good point that BMWG typically tests just one parameter at a time. Otherwise, a huge number of test permutations may become a reality.

To me, it looks important to stay in the region of linear proportionality (if possible). It makes it possible to combine test results to estimate something of value to the business.

For example, if a box has X FIB routes for IPv4, it typically has X/4 FIB routes for IPv6. It is then possible to estimate a mixed environment (a combination of Y IPv6 and Z IPv4 routes) by creating a proportion where an IPv6 route occupies 4x the FIB resources: X = Z + 4Y.
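For illustration, the same proportion as a tiny script (the capacity and route counts are assumed numbers, not measurements):

    # Hypothetical illustration of the linear FIB proportion X = Z + 4*Y:
    # X is the IPv4-equivalent FIB capacity, Z the number of IPv4 routes,
    # Y the number of IPv6 routes (each assumed to cost 4 entries).
    X = 1_000_000      # assumed IPv4 FIB capacity of the box
    Y = 100_000        # planned number of IPv6 routes
    Z = X - 4 * Y      # IPv4 routes that still fit
    print(f"With {Y:,} IPv6 routes, {Z:,} IPv4 routes still fit")   # 600,000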

I am just trying to find an orthogonal basis for tests.

If the basic test for a stateful device were “maximum session connection and teardown rate per second under a stable load”, then this number could be used in other calculations that matter to the business.
Unfortunately, for stateful devices, the correlation between cps (connection + teardown) and pps is very tight; I believe it is linear.
Hence, if we have the maximum pps and the maximum cps (connection + teardown), we would be able to estimate some real business situations (by linear proportion).
If we instead had a separate “maximum connection rate” and “maximum teardown rate”, the network owner would have more difficulty estimating the performance in his environment. It is still possible to create a linear proportion from 3 parameters, it would just be more difficult, and the error would be bigger because there could be some dependency between session connection and teardown.

I am not asking to bloat the tests.
I am asking for 2 tests (connection only, and connection + teardown) instead of the other 2 tests (connection, and teardown),
because it gives more value to the business.

But the “connection + teardown” test already looks like a mixed test. That formally breaks BMWG's practice of testing only one parameter at a time.
Hence, it is up to the group's consensus. Maybe I am wrong here and we should stay in the purely theoretical zone.

PS: not all my comments below are related to this problem.

Eduard
From: Keiichi SHIMA [mailto:shima@wide.ad.jp]
Sent: Monday, November 21, 2022 9:33 AM
To: Vasilenko Eduard <vasilenko.eduard@huawei.com>
Cc: Keiichi SHIMA <shima@wide.ad.jp>; Gabor LENCSE <lencse@hit.bme.hu>; bmwg@ietf.org
Subject: Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt

Hello Vasilenko Eduard

I am Keiichi SHIMA, co-author of the draft.

In my understanding, you are trying to define a methodology to measure the performance of a translator box under traffic patterns (and device configurations) similar to the real traffic patterns / device configurations. But our draft (draft-ietf-bmwg-benchmarking-stateful) focuses on the same methodology defined in RFC 8219 (and RFC 5180, RFC 2544), with some enhancements to support pseudorandom port numbers (suggested in RFC 4814) and some additional metrics (e.g. connection establishment rate). So we basically follow the same benchmarking methods defined in RFC 8219 (and the former RFCs).

Actually, RFC 8219 does not consider the traffic patterns you mentioned and does not cover device configurations in which multiple ports are connected to one routing (or translating) engine. It may be a new discussion point to enhance RFC 8219 (and/or the former RFCs), probably in a separate Internet-Draft.

Regards,
---
Keiichi SHIMA (島 慶一)
WIDE project <shima@wide.ad.jp>
          PGP: 9D95 8544 A5CE D530 9230  DF1C BB6E ABE1 D91F EDFC







On Nov 18, 2022, at 0:38, Vasilenko Eduard <vasilenko.eduard=40huawei.com@dmarc.ietf.org> wrote:

Hi Gabor,
In a real production environment there may be a situation where 1) connection establishment happens without teardown: after links go down/up (or the box reloads), subscribers effectively organize a DDoS.
But in the majority of other cases 2) the connection establishment rate is almost equal to the connection teardown rate (stable load).
Hence, nobody cares about a production case that is pure teardown. That test is just not representative and not needed.
Hence, you could optimize for these 2 cases that the business needs.

I strongly suspect that the teardown rate is faster than the connection establishment rate on hardware platforms, but I am not sure, because nobody has tested it. Nobody cares about pure teardown.
When people talk about cps, they typically mean: a new session connects and an old session disconnects (the typical situation).

I do not agree that the upload/download split is important. It is again just a question of the overall cps & pps, which may be achieved in different ways.
All engines that I have seen before were flexible in how they split performance between upstream and downstream.
The upstream/downstream ratio (1:6) is very important when you need to calculate how many engines to buy, but not for tests.

Think about the above. I hope you understand now why I said: “if pps is at 60%, then cps may probably be up to 40% of the tested maximum”.
I had some typical case (#2) in mind.
> I do not really understand the point of the above. It seems to be about a special hardware device that has several cards with perhaps multiple ports... ?
Looking at the document, people may conclude that they should test ports (including the picture). That is a very wrong strategy. They need to test the “processing engine”.
Because if you test a 35 Gbps NPU (capable of millions of pps) through one 10GE port, you would get a VERY optimistic result. It would be a very wrong result.
You need many ports in this case, or one high-speed port.
Please, come out of your software box. Hardware could have much more performance.

> What do you mean under "overload"?
Probably wrong wording on my side. What is the test trying to “stress”? It is not a port.

> Yes, this is a very good question. IMHO, if we want to imitate the situation of a commercial device used by an ISP, the number of sessions should begin around 1M and go up until our hardware is unable to handle it.
It should be proportional to the performance of the tested engine. You have to come up with some relationship between sessions and packets. It is not good to leave the tester without guidance on this topic – he/she is not as strong in this topic as you.

> thus 276 or 277 packets belonged to a single connection.
People may ask why this number.

My way to calculate the numbers would be:
1. Assume the average packet size on the Internet (I have seen a recent accurate measurement from a big telco: 753 bytes) – S
Then for every Gbps of traffic you would be able to recommend Gbps/(8S) = 0.166 Mpps
2. Assume the average number of active sessions per subscriber (something like 100, because 64 is the starting bucket typically reserved for a subscriber) – N
Assume the traffic per subscriber (unfortunately very different for fixed and mobile: 1.5 Mbps versus 150 kbps) – Gf, Gm
Then the average session consumes Gf/N = 15 kbps or Gm/N = 1.5 kbps
3. Assume the average session time on the Internet (possible to google, something like 2 minutes) – T
Then the average session sends
Gf/N*T/8 = 225 kB or Gm/N*T/8 = 22.5 kB, i.e.
Gf/N*T/(8S) ≈ 300 packets or Gm/N*T/(8S) ≈ 30 packets per session
I did not try to make the calculations line up with your numbers – it happened by itself.
4. Now we can calculate cps per Gbps: Gbps/Gf*N/T = 1.1k or Gbps/Gm*N/T = 11k old-session disconnects plus new-session connects for every Gbps.

If you put this logic into the document, testers would be able to adjust it to their own reality (which will change over the years).
If not, they would still be blind and would choose the relationships randomly.
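The same logic as a small script, so a tester can plug in his or her own statistics (all constants below are the assumptions from the text; the cps value counts a connect plus a disconnect as two events):

    # Sketch of the traffic model above; all constants are the assumptions
    # from the text (753-byte average packet, 100 sessions per subscriber,
    # 1.5 Mbps fixed traffic per subscriber, 2-minute session lifetime).
    S = 753                     # average packet size [bytes]
    N = 100                     # active sessions per subscriber
    G_FIXED = 1_500_000         # traffic per fixed subscriber [bit/s]
    T = 120                     # average session lifetime [s]
    GBPS = 10**9                # 1 Gbps of offered load [bit/s]

    pps_per_gbps = GBPS / (8 * S)                        # ~0.166 Mpps per Gbps
    kbytes_per_session = G_FIXED / N * T / 8 / 1000      # ~225 kB per session
    packets_per_session = G_FIXED / N * T / (8 * S)      # ~300 packets per session
    subscribers_per_gbps = GBPS / G_FIXED                # ~666 subscribers per Gbps
    # one old-session disconnect plus one new-session connect = 2 events:
    cps_per_gbps = 2 * subscribers_per_gbps * N / T      # ~1.1k events/s per Gbps

    print(f"{pps_per_gbps/1e6:.3f} Mpps, {kbytes_per_session:.0f} kB/session, "
          f"{packets_per_session:.0f} pkts/session, {cps_per_gbps:.0f} cps per Gbps")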

I feel that you expect to use a given number of sessions, and the packets should belong to those sessions.
No. I am just proposing not to assume that the tester is stupid. Of course they know that the packets should not be random: packets should be organized into sessions, and the sessions could be random.
I am just proposing to change the tone of Section 2.

>In fact, something like a "state table" is actually created in the Initiator of siitperf for performance considerations. The pseudorandom enumeration of all possible port number combinations happens before the preliminary test phase, and they are stored in an array.
I did not understand this. It is not stated in the document.
I do not understand why it is necessary to waste so much memory on creating this table. It is algorithmic – there is no need to occupy memory.
What if somebody implements it more optimally?
Is it possible to delete the "state table" definition? It is an internal implementation detail that does not need to be discussed in the document.
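For illustration only, a minimal sketch of the algorithmic alternative (not how siitperf works): a bijective mapping from a frame index to a unique (source port, destination port) pair, so no enumeration has to be kept in memory. The ranges and constants are assumptions:

    # Illustration only: map a frame index i to a unique (sport, dport) pair
    # without storing the whole enumeration. An affine permutation modulo the
    # size of the port-pair space is used; A must be co-prime with SPACE so
    # that the mapping is a bijection.
    SPORT_MIN, SPORT_COUNT = 1024, 40_000    # assumed source port range
    DPORT_MIN, DPORT_COUNT = 1024, 100       # assumed destination port range
    SPACE = SPORT_COUNT * DPORT_COUNT        # 4,000,000 possible pairs
    A, B = 999_983, 12_345                   # A is co-prime with SPACE

    def port_pair(i):
        """Return the i-th pseudorandomly ordered (sport, dport) pair."""
        k = (A * i + B) % SPACE
        return SPORT_MIN + k // DPORT_COUNT, DPORT_MIN + k % DPORT_COUNT

    # distinct indices yield distinct pairs (spot check on the first 100,000):
    assert len({port_pair(i) for i in range(100_000)}) == 100_000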


"connection tracking table" is claimed "unknown for the Tester" because it is inside the "NATxy gateway". But then why it has been mentioned in "Initiator"? It is unknown, right?
Do you mean the following text?

   *  Initiator: The port of the Tester that may initiate a connection
      through the stateful DUT in the client to server direction.
      Theoretically, it can use any source and destination port numbers
      from the ranges recommended by [RFC4814<https://datatracker.ietf.org/doc/html/rfc4814>]: if the used four tuple
      does not belong to an existing connection, the DUT will register a
      new connection into its connection tracking table.
The connection tracking table is mentioned as an explanation of what happens. Yes, the content of the connection tracking table is unknown to both the Initiator and the Responder.
It is better to invent a new name for the active stateful table inside the Initiator, because above you have claimed that the "connection tracking table" is in principle "unknown".

because at the time of designing siitperf (the only stateful NAT64 / NAT44 Tester that supports the draft) I did not want to make too much work for myself and I used a single fixed IP address pair.
Is it now possible to use many (100?) destination IP addresses and many (10k) source IP addresses to make the test closer to the real production environment?

It was our question that "Shall we add the usage of multiple IP addresses as a requirement?", and we still have this question open.
I prefer to stick to the real production environment as much as possible. There, every engine has:
- thousands of source IP addresses (the calculation above shows 666 subscribers per Gbps)
- hundreds of destination IP addresses (these are the load balancers of content providers)
- tens of thousands of source ports (multiply thousands of subscribers by 200 ports per subscriber)
- fewer than ten destination ports (80, 8080, 443, 53, etc.)
Is it possible for software tools? I am sure that it is possible for hardware testers.
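As a rough illustration of such a profile in software (all address ranges and the destination port list below are assumptions, not recommendations from the draft):

    # Sketch of a flow-tuple generator approximating the profile above;
    # all the ranges and the port list are illustrative assumptions.
    import random
    from ipaddress import IPv4Address

    SRC_NET = int(IPv4Address("10.0.0.0"))     # subscribers behind the NAT
    DST_NET = int(IPv4Address("198.18.0.0"))   # content-provider load balancers
    N_SRC_IP, N_DST_IP = 10_000, 200
    SPORTS = range(1024, 65536)
    DPORTS = [80, 443, 8080, 53, 123, 993]

    def random_four_tuple():
        """Pick one (src IP, dst IP, src port, dst port) combination."""
        return (str(IPv4Address(SRC_NET + random.randrange(N_SRC_IP))),
                str(IPv4Address(DST_NET + random.randrange(N_DST_IP))),
                random.choice(SPORTS),
                random.choice(DPORTS))

    print(random_four_tuple())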

Yes, when testing MAP BR, it is fair to assume many MAP CE-s. But MAP BR is a stateless device, and thus, it is out of scope of our draft. However, MAP CE is stateful, it may be in scope.
The statement is incorrect:
However, in some special cases the size of the source port range is
   limited.  E.g.  when benchmarking the CE and BR of a MAP-T [RFC7599<https://datatracker.ietf.org/doc/html/rfc7599>]
   system together (as a compound system performing stateful NAT44),
   then the source port range is limited to the number of source port
   numbers assigned to each subscriber.  (It could be as low as 2048
   ports.)


Section 4.5:

In practice, we RECOMMEND the usage of binary search.

We could listen to the vendor's claims and choose the initial point more smartly.
I am not sure, if I understand what you mean.
In the majority of cases, the vendor can make a good guess about what load is probably the maximum for a specific configuration.
It is possible to have a much better starting point than an exponential search.
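A hedged sketch of what a smarter starting point could look like: seed the binary search with bounds around the vendor-claimed rate and fall back to the usual [0, media maximum] interval if the claim does not hold. The trial() function is a placeholder for one test iteration:

    # Illustration: RFC 2544-style binary search for the highest lossless rate,
    # seeded with a vendor-claimed rate. trial(rate) is a placeholder for one
    # test iteration that returns True if no frames were lost.
    def binary_search(trial, media_max, vendor_claim=None, precision=0.001):
        lo, hi = 0.0, media_max
        if vendor_claim is not None:
            # probe a narrow window around the claim first
            lo_guess = 0.8 * vendor_claim
            hi_guess = min(1.2 * vendor_claim, media_max)
            if trial(lo_guess):
                lo = lo_guess
                hi = hi_guess if not trial(hi_guess) else media_max
            # otherwise keep the conservative [0, media_max] interval
        while (hi - lo) / media_max > precision:
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if trial(mid) else (lo, mid)
        return lo

    # fake DUT that forwards losslessly up to 6.2 Mfps on a 10GE link:
    print(binary_search(lambda r: r <= 6_200_000, media_max=14_880_952,
                        vendor_claim=6_000_000))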

Yes, they are very good, if the results follow NORMAL DISTRIBUTION. However, I believe that we may not assume normal distribution here, because, according to my experience, the distribution of the results of the multiple experiments is not even symmetric.
Unexpected. I do not understand why. OK. Then I withdraw my comment.


It is a severe bug when the device is "not responding". The device should drop packets after any resources are exhausted but be available for management and monitoring.
I agree with you that it should be so. But unfortunately, I experienced the opposite.
I have seen very severe punishment imposed on vendors for this type of bug.
They always fix it – the pressure from customers is too high,
because such a bug is a perfect DDoS tool.

Eduard
From: bmwg [mailto:bmwg-bounces@ietf.org] On Behalf Of Gabor LENCSE
Sent: Thursday, November 17, 2022 4:12 PM
To: bmwg@ietf.org
Subject: Re: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt

Dear Eduard,
Thank you very much for your review!
On 11/17/2022 4:02 AM, Vasilenko Eduard wrote:

[...]


I am still sure about the big dependency between "packets per second" and "new sessions per second".
I definitely agree with it.

But it would be utterly difficult to specify a profile for mixed traffic - it would be very different between fixed and mobile subscribers.

Hence, let us test them separately and assume a linear influence on each other: if pps is at 60%, then cps may probably be up to 40% of the tested maximum.

Does it make sense to specify it in the document?
To be more precise, the situation of a stateful NAT64 or NAT44 gateway is even more complicated. Several different things happen, including:
1) A new connection is established.
2) A packet is transmitted in the upload direction.
3) A packet is transmitted in the download direction.
4) A connection is torn down. (Either by closing a TCP connection or by the timeout of the TCP or UDP "connection".)
To have conditions similar to the traffic of the Internet, all the above should be considered.
Some further comments:
As for iptables, connection teardown is much, much more costly (in terms of CPU power) than connection establishment. According to my measurements, in the case of 100M connections, iptables could establish 2.237M connections per second, but it could terminate only 345k connections per second. Please see Table 4 and Table 5 in: https://datatracker.ietf.org/doc/html/draft-lencse-v6ops-transition-scalability-04
I tested the throughput of the Jool, iptables+tayga, and OpenBSD PF stateful NAT64 solutions using unidirectional traffic in the upload and download directions, and I found the results different.
The actual quantity of upload and download traffic can be very different in the case of a home user! Yet RFC 2544 / RFC 5180 / RFC 8219 require testing throughput with bidirectional traffic. We also kept it in our draft and added testing with unidirectional traffic as OPTIONAL.
Should we perhaps make testing with unidirectional traffic REQUIRED?

After all these considerations: What should we recommend and why?
Perhaps an appropriate mix of 1, 2, 3, and 4 could be the desired load for benchmarking a stateful NATxy gateway... But, we see a lot of hindrances, including:
- Following the long-established tradition (also required by RFC 2544 / RFC 5180 / RFC 8219), we use UDP for testing. In UDP, there is no such thing as "termination of a connection". Of course, the gateway still handles UDP "connections", but they can be "terminated" only by timeout. And that makes it very hard (perhaps impossible) to use, let us say, 10% connection teardown.
- We could still use a mix of, let us say: 10% of the packets result in new connections and 90% of the packets belong to an existing connection. (The ratios could be changed.) However, adding 10% new connections may significantly increase the number of connections during a 60s long throughput test, because we are not able to terminate the same number of old connections. (And the number of connections highly influences the performance of a stateful NATxy gateway; please see the connection scalability results in the above mentioned draft.)
So currently we cannot propose a better solution than measuring separately:
- connection setup performance (maximum connection establishment rate)
- packet forwarding performance measured with a constant number of connections (throughput with bidirectional traffic; optionally: throughput with uni-directional traffic)
- connection termination performance (connection tear down rate)
Can you propose a repeatable measurement that uses an appropriate mix of 1-4?
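A back-of-the-envelope illustration of the second hindrance (the rate and the 10% share are assumed example values):

    # Rough arithmetic behind the hindrance above: a fixed share of
    # "new connection" packets inflates the connection count over a 60 s
    # trial. The 5 Mpps rate and 10% share are assumed example values.
    rate_pps = 5_000_000     # offered load during the trial
    new_share = 0.10         # fraction of packets opening a new connection
    duration_s = 60
    initial_connections = 1_000_000

    added = rate_pps * new_share * duration_s
    print(f"{initial_connections:,} -> {initial_connections + added:,.0f} "
          "connections during one trial")   # roughly 31,000,000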

It is cheating to test ports, not the processing engine. Of course, a lightly loaded engine would show much better results.

Unfortunately, to get meaningful results it is important to overload the whole engine.

In the case of a hardware device, it is one NPU on the processing line card. Vendors share (under NDA) performance for it. It is always below 100GE but above 10GE. Hence, many ports may be needed in the 10GE case.

In the case of a virtual appliance, it is probably a VM with one vCPU from Intel or AMD. One vCPU would not be capable of overloading a 10GE port.

Typically, the number of sessions is so big (in a telco environment) that engines scale linearly (both software-based and hardware-based). Testing many engines (CPU cores) is not more useful, just more difficult to implement.
I do not really understand the point of the above. It seems to be about a special hardware device that has several cards with perhaps multiple ports... ?
In our simple model, we have a DUT with only two ports, e.g., in the case of NAT64 it looks as follows:

                 +--------------------------------------+
       2001:2::2 |Initiator                    Responder| 198.19.0.2
   +-------------|                Tester                |<------------+
   | IPv6 address|                         [state table]| IPv4 address|
   |             +--------------------------------------+             |
   |                                                                  |
   |             +--------------------------------------+             |
   |   2001:2::1 |                 DUT:                 | 198.19.0.1  |
   +------------>|        Stateful NAT64 gateway        |-------------+
     IPv6 address|     [connection tracking table]      | IPv4 address
                 +--------------------------------------+

       Figure 2: Test setup for benchmarking stateful NAT64 gateways



IMHO, a discussion of what we are trying to overload is mandatory.
What do you mean under "overload"?
If I understand it correctly, we do overload during both the maximum connection establishment rate test and the throughput test in order to find the maximum lossless rate using a binary search.

How big a session table should be created in phase 1? It may be 10, 10k, or 1M. How to decide?
Yes, this is a very good question. IMHO, if we want to imitate the situation of a commercial device used by an ISP, the number of sessions should begin around 1M and go up until our hardware is unable to handle it.
I used this approach, and the upper limit was 800M and 1600M for iptables and Jool, respectively. (Please see the connection scalability results in the above mentioned draft in Table 4 and Table 9.)


Or asking the same from a different direction: how many packets should be sent over every session?
In my measurements, this number was determined by the number of connections and the achieved throughput. For example, if you see Table 4 and choose the first column (1.56M connections), then the median throughput is 5.326M. It means that 319,560M packets were forwarded during the 60s long throughput test, thus 204,846 packets belonged to a single connection. However, if you choose the last column (800M connections), then the median throughput is 3.689M. It means that 221,340M packets were forwarded during the 60s long throughput test, thus 276 or 277 packets belonged to a single connection.
If you do a similar calculation with Jool, you will get values an order of magnitude lower due to its lower throughput.


Section 2:

I do not understand why pseudorandom port numbers for every packet were assumed in the basic design.
The aim of using pseudorandom enumeration of all port number combinations that are possible with the given source port and destination port number ranges is twofold:
1) To ensure that all test frames result in a new connection during the preliminary test phase. -- This is needed to measure the connection establishment performance.
2) To fill up the connection tracking table of the DUT and the state table of the Tester as soon as possible. -- This is important if the preliminary test phase is performed in preparation for a real test phase.


It is an unrealistic assumption that somebody would mistakenly generate random packets instead of sessions.

I could not believe such a mistake. People have been testing stateful devices (FW, LB, NAT) for ages.

Hence, it does not make sense to warn against it.

Could you rephrase this section: "Of course, it's known that the load for a stateful device should be flow based where many packets from both directions are simulating one session (many packets should look like 1 session)".

Hardware-based testers like Spirent support it.
I feel that you expect to use a given number of sessions, and the packets should belong to those sessions.
Am I right?
We do exactly that. But first we establish the connections for all the required sessions in the preliminary test phase and then generate packets in the real test phase that belong to those sessions.



It is just a question of how many packets are in one session and how many sessions.
Yes, it is a good question. There can be multiple approaches:
- If the tests are commissioned by a network operator, then the numbers should be tailored to the statistics of the network of the operator.
- If the tests are performed by an academic researcher (like me), then -- I think -- wide ranges should be examined: starting from a low realistic number up to the hardware limits, as I did in the above mentioned draft, to provide a general, wide-angle picture of the implementation.  :-)
What do you think?


New session generation could be pseudorandom (primarily on the source port).
Yes, I agree, but RFC 4814 requires pseudorandomness for both the SOURCE and DESTINATION port numbers: https://www.rfc-editor.org/rfc/rfc4814#section-4.5
Please see my considerations in Section 2.3 of http://www.hit.bme.hu/~lencse/publications/ECC-2022-SFNATxy-Tester-published.pdf

Section 3: a little inconsistency:

I guess that "state table" should be created on the Initiator too (you mentioned only the responder).
In fact, something like a "state table" is actually created in the Initiator of siitperf for performance considerations. The pseudorandom enumeration of all possible port number combinations happens before the preliminary test phase, and they are stored in an array. They are just read from there linearly during the preliminary test phase. (As the IP addresses are fixed, one can say that the "state table" is there.) But we do not need it after the preliminary test phase. The Initiator simply uses pseudorandom port numbers in the real test phase: as all possible combinations were enumerated in the preliminary test phase, no new combinations can occur.
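For illustration, a sketch of that two-phase idea as described above (Python pseudocode with assumed port ranges, not the actual siitperf implementation):

    # Illustration of the two-phase port handling described above: phase 1
    # reads a pre-shuffled enumeration of all (sport, dport) combinations
    # linearly, so every frame opens a new connection; phase 2 draws
    # pseudorandom pairs, all of which then belong to existing connections.
    import random

    SPORTS = range(1024, 1024 + 40_000)   # assumed source port range
    DPORTS = range(80, 90)                # assumed destination port range

    enumeration = [(s, d) for s in SPORTS for d in DPORTS]   # built once
    random.shuffle(enumeration)

    def preliminary_phase_pair(i):
        """i-th frame of the preliminary phase: a guaranteed-new combination."""
        return enumeration[i]

    def real_phase_pair():
        """Real test phase: any pseudorandom pair hits an existing connection."""
        return random.choice(SPORTS), random.choice(DPORTS)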

"connection tracking table" is claimed "unknown for the Tester" because it is inside the "NATxy gateway". But then why it has been mentioned in "Initiator"? It is unknown, right?
Do you mean the following text?

   *  Initiator: The port of the Tester that may initiate a connection
      through the stateful DUT in the client to server direction.
      Theoretically, it can use any source and destination port numbers
      from the ranges recommended by [RFC4814<https://datatracker.ietf.org/doc/html/rfc4814>]: if the used four tuple
      does not belong to an existing connection, the DUT will register a
      new connection into its connection tracking table.
The connection tracking table is mentioned as an explanation of what happens. Yes, the content of the connection tracking table is unknown to both the Initiator and the Responder.


Section 4.1: if destination port numbers are so concentrated around a few values in the real Internet, then why "from a few to several hundreds or thousands as needed" in the test?
YES, THIS IS A VERY MUCH VALID QUESTION!
My answer is historical: because at the time of designing siitperf (the only stateful NAT64 / NAT44 Tester that supports the draft) I did not want to make too much work for myself and I used a single fixed IP address pair. Thus, currently the only way to achieve hundreds of millions of connections with siitperf is to increase the destination port number range. Let us see some example numbers. If we have 40,000 source port numbers, then:
- 10 destination port numbers result in 400,000 connections
- 100 destination port numbers result in 4M connections
- 1000 destination port numbers result in 40M connections
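The same arithmetic as a one-liner, extended with the (still open) option of multiple IP addresses; the address counts are hypothetical:

    # Number of distinct four tuples (potential connections) the Initiator can
    # generate: with a single IP address pair it is the product of the two port
    # ranges; multiple addresses multiply it further. Example counts are
    # illustrative.
    def max_connections(src_ports, dst_ports, src_ips=1, dst_ips=1):
        return src_ports * dst_ports * src_ips * dst_ips

    print(max_connections(40_000, 10))                 # 400,000
    print(max_connections(40_000, 1_000))              # 40,000,000
    print(max_connections(40_000, 10, dst_ips=100))    # 40,000,000 with 10 dst ports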
Regarding this design decision of using a single IP address pair, Section 4.4. of our draft currently says:

   1.  A single source address destination address pair is used for all
       tests.  We make this assumption for simplicity.  Of course, we
       are aware that [RFC2544<https://datatracker.ietf.org/doc/html/rfc2544>] requires testing also with 256 different
       destination networks.
We have discussed it with my co-author, Keiichi Shima, and we believe that:
- On the one hand, the usage of multiple destination NETWORKS is not interesting here, because we do not do router testing.
- On the other hand, the usage of multiple IP ADDRESSES may be appropriate here, especially because of our experience with OpenBSD. Please see slides 11 and 12 of my IETF 115 BMWG presentation: https://datatracker.ietf.org/meeting/115/materials/slides-115-bmwg-benchmarking-methodology-for-stateful-natxy-gateways-using-rfc-4814-pseudorandom-port-numbers
It was our question that "Shall we add the usage of multiple IP addresses as a requirement?", and we still have this question open.
Using multiple IP addresses may be useful not only to generate entropy for the hash function to support RSS, but also to generate a high number of network flows using a low(er) number of destination port numbers.
What do you think?
However, then we must also convince the authors of RFC 4814 (and BMWG members) that narrowing down the destination port number range is not a violation of the following requirement in Section 4.5 of RFC 4814:

   In addition, it may be desirable to pick pseudorandom values from a
   selected pool of numbers.  Many services identify themselves through
   use of reserved destination port numbers between 1 and 49151
   inclusive.  Unless specific port numbers are required, it is
   RECOMMENDED to pick randomly distributed destination port numbers
   between these lower and upper boundaries.



It is probably not important to mention that one MAP subscriber has a very limited number of source ports, because the "Border Relay" (our DUT) would have many, many subscribers.
Yes, when testing MAP BR, it is fair to assume many MAP CE-s. But MAP BR is a stateless device, and thus, it is out of scope of our draft. However, MAP CE is stateful, it may be in scope.

Section 4.4: Why do we manipulate ports but keep only one IP address? Real scenarios manipulate IP addresses very much (tens of thousands of subscribers are possible for one NAT gateway). Is it possible to recommend at least source address randomization too? (destination addresses would be very limited in the wild Internet)
Please see my answers above: I am not against using multiple IP addresses.
However, if the Initiator uses only multiple SOURCE IP addresses, then they will appear only as multiple source port numbers after the stateful NATxy gateway has translated the packet. Therefore, they will not be visible when the Responder generates traffic based on the received four tuples.
If the Initiator uses multiple DESTINATION IP addresses, they will "survive" the stateful NATxy translation and they will also appear in the traffic generated by the Responder.

  Important warning: in normal (non-NAT) router testing, the port
  number selection algorithm, whether it is pseudo-random or enumerated
  in increasing (or decreasing) order does not affect final results.

In reality, it does affect them. Depending on the router configuration, the router may use port numbers for hash-based load balancing, which would affect the load distribution across many links.
I think that the order should not matter if the hash function is good enough.
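A quick sketch supporting that intuition with a generic hash over the five tuple (not any particular router's algorithm); whether the source ports are enumerated sequentially or drawn pseudorandomly, the flows spread about evenly:

    # Quick check: sequentially enumerated and pseudorandom source ports
    # spread flows roughly equally over 4 ECMP links when a generic hash
    # over the five tuple is used (not a specific vendor algorithm).
    import random
    from collections import Counter
    from zlib import crc32

    LINKS = 4

    def link_for(sport):
        five_tuple = f"10.0.0.1,198.19.0.2,{sport},443,udp".encode()
        return crc32(five_tuple) % LINKS

    sequential = Counter(link_for(p) for p in range(1024, 41024))
    pseudorandom = Counter(link_for(random.randint(1024, 65535)) for _ in range(40_000))
    print(sequential)      # roughly 10,000 flows on each link
    print(pseudorandom)    # roughly even as well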


Section 4.5:

In practice, we RECOMMEND the usage of binary search.

We could listen to the vendor's claims and choose the initial point more smartly.
I am not sure, if I understand what you mean.
Do you mean that compared to using 0 and the maximum frame rate for the media as the initial lower bound and the initial upper bound, respectively, one can have a better starting point for the binary search?

Section 4.5.1:

We RECOMMEND median as the summarizing function of the
  results complemented with the first percentile and the 99th
  percentile as indices of the dispersion of the results.

   connections/s 1st perc. (req.)
   connections/s 99th perc. (req.)

When I was a student, I was told (around 1987) that the "standard deviation" is better to use for filtering test data.

The theory is here: https://en.wikipedia.org/?title=3-sigma&redirect=no

In the case of Microsoft Excel it is something like this:

Min=AVERAGE(Array)-STDEV(Array)

Max=AVERAGE(Array)+STDEV(Array)

Look here: https://www.ablebits.com/office-addins-blog/calculate-standard-deviation-excel/
Yes, they are very good, if the results follow NORMAL DISTRIBUTION. However, I believe that we may not assume normal distribution here, because, according to my experience, the distribution of the results of the multiple experiments is not even symmetric. Please consider that due to some unexpected event, some frames may be lost, and thus there are outliers among the results towards zero. However, there should not be significant outliers in the other direction in a well working system. (There is no "accidental" high performance.)
We chose the median as the summarizing function because it is less sensitive to outliers than the average.
Section 7.2 of RFC 8219 redefines the Latency measurement of RFC 2544 to provide better quality results, and it recommends:

   To account for the variation, the 1st and 99th percentiles of the 20
   iterations MAY be reported in two separated columns.
We followed that approach.
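For illustration, a small sketch of that summarization (the 20 cps values are made-up example data; statistics.quantiles is used as one way to obtain the 1st and 99th percentiles):

    # Summarizing the iteration results as recommended above: median plus the
    # 1st and 99th percentiles instead of mean +/- stddev, because the
    # distribution is skewed towards zero by occasional losses.
    # The cps values below are made-up example data.
    import statistics

    results = [2.237e6, 2.231e6, 2.240e6, 2.198e6, 2.236e6,
               2.239e6, 2.102e6, 2.238e6, 2.233e6, 2.235e6,
               2.229e6, 2.241e6, 2.237e6, 2.230e6, 2.236e6,
               2.234e6, 2.228e6, 2.239e6, 2.232e6, 2.237e6]   # 20 iterations

    median = statistics.median(results)
    percentiles = statistics.quantiles(results, n=100)   # 99 cut points
    p1, p99 = percentiles[0], percentiles[98]
    print(f"median={median:.4g}  1st percentile={p1:.4g}  99th percentile={p99:.4g}")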


Section 4.6

  *  The Initiator counts the received frames, and if all N frames are
     arrived then the R frame rate of the maximum connection
     establishment rate measurement (performed in the preliminary test
     phase) is raised for the next iteration, otherwise lowered (as
     well as in the case if test frames were missing in the preliminary
     test phase).

I do not remember such a capability on hardware testers (it was many years ago). But hardware testers are mandatory for testing an engine that is close to 100 Gbps.

We need somebody from Spirent to judge: is it possible?
It can be easily implemented. It is just that all the elements of the state table of the Responder are to be used for packet generation.
The "cheapest" solution is to use them in linear order. (It can be done with siitperf.)
A more sophisticated implementation can use them in pseudorandom order. (I haven't yet implemented it, but it is only a matter of time and work.)

Section 4.8: how are you going to delete the connection table on a hardware device? The CLI may return the prompt even before the job is finished. The platform may be asynchronous.

Moreover, because of the following:

  We are aware that the performance of removing the entire content of
  the connection tracking table at one time may be different from
  removing all the entries one by one.

The value of such a test is very questionable.
Yes, I share all your concerns.
However, up to now, I have received consistent results when I used different numbers of connections. You can find them in Table 5 and Table 10 of my before-mentioned draft for iptables and Jool. Since then, I have also tested OpenBSD PF.

Alternatively, it is possible to establish connections very fast (at the same rate that was tested as the maximum) and then continue to send traffic over the first and the last sessions. Sessions would expire according to their creation time. It would be possible to monitor that the teardown time is not worse than the connection establishment time.

I worked for a decade for a vendor selling NAT/FW. I never discussed teardown performance with customers/partners/vendor R&D. I propose deleting it completely.
Well, connection tear down performance was also overlooked in RFC 8219, of which I am a co-author. However, after my experience with iptables, I cannot overlook it any more, because its connection tear down performance is much lower than its connection establishment performance (as I mentioned above).
Note: their proportion may depend on the ratio of the "hashsize" and "nf_conntrack_max" parameters.

Section 4.9:

  DUT was exhausted and it stopped responding

if the DUT collapses

It should be qualified as a test failure.
I definitely agree with it. In theory, it is easy to do so. But in practice, we need some mechanism to detect it, reboot the server, wait until it is ready, and continue the binary search with the lower half of the interval.

It is a severe bug when the device is "not responding". The device should drop packets after any resources are exhausted but be available for management and monitoring.
I agree with you that it should be so. But unfortunately, I experienced the opposite.

Once again, thank you very much for all your work reading and commenting on our draft!!!
Best regards,
Gábor

Eduard

-----Original Message-----
From: bmwg [mailto:bmwg-bounces@ietf.org] On Behalf Of internet-drafts@ietf.org
Sent: Thursday, October 20, 2022 5:14 AM
To: i-d-announce@ietf.org
Cc: bmwg@ietf.org
Subject: [bmwg] I-D Action: draft-ietf-bmwg-benchmarking-stateful-01.txt

A New Internet-Draft is available from the on-line Internet-Drafts directories.
This draft is a work item of the Benchmarking Methodology WG of the IETF.

        Title           : Benchmarking Methodology for Stateful NATxy Gateways using RFC 4814 Pseudorandom Port Numbers
        Authors         : Gabor Lencse
                          Keiichi Shima
        Filename        : draft-ietf-bmwg-benchmarking-stateful-01.txt
        Pages           : 25
        Date            : 2022-10-19

Abstract:
   RFC 2544 has defined a benchmarking methodology for network
   interconnect devices.  RFC 5180 addressed IPv6 specificities and it
   also provided a technology update, but excluded IPv6 transition
   technologies.  RFC 8219 addressed IPv6 transition technologies,
   including stateful NAT64.  However, none of them discussed how to
   apply RFC 4814 pseudorandom port numbers to any stateful NATxy
   (NAT44, NAT64, NAT66) technologies.  We discuss why using
   pseudorandom port numbers with stateful NATxy gateways is a difficult
   problem.  We recommend a solution limiting the port number ranges and
   using two phases: the preliminary test phase and the real test phase.
   We show how the classic performance measurement procedures (e.g.
   throughput, frame loss rate, latency, etc.) can be carried out.  We
   also define new performance metrics and measurement procedures for
   maximum connection establishment rate, connection tear down rate and
   connection tracking table capacity measurements.

The IETF datatracker status page for this draft is:
https://datatracker.ietf.org/doc/draft-ietf-bmwg-benchmarking-stateful/

There is also an htmlized version available at:
https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful-01

A diff from the previous version is available at:
https://www.ietf.org/rfcdiff?url2=draft-ietf-bmwg-benchmarking-stateful-01

Internet-Drafts are also available by rsync at rsync.ietf.org::internet-drafts




