[v6ops] Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

Gábor LENCSE <lencse@hit.bme.hu> Tue, 17 August 2021 12:51 UTC

To: "v6ops@ietf.org list" <v6ops@ietf.org>
From: Gábor LENCSE <lencse@hit.bme.hu>

Dear All,

At IETF 111, I promised to perform scale-up tests for stateful NAT64 and 
also for stateful NAT44. (Our primary focus is stateful NAT64, as it is 
part of 464XLAT. As there was interest in CGN, I am happy to invest 
some time into it, too. I think the comparison of their scalability may 
be interesting for several people.)

I am now in the preliminary testing phase: I have performed tests to 
explore the behavior of both the Tester (the stateful branch of 
siitperf) and the benchmarked application, so that I can determine the 
conditions for the production tests.

I am writing to ask all interested "IPv6 Operations" mailing list 
members to review my report below from two points of view:
- Do you consider the methodology sound?
- Do you think that the parameters are appropriate to provide meaningful 
results for network operators?

Control question:
- Will you support the inclusion of the results into 
draft-ietf-v6ops-transition-comparison and the publication of the draft 
as an RFC even if the results contradict your view of the scalability 
of the stateful technologies?

As I had more experience with iptables than with Jool, I started with 
the scale-up tests of stateful NAT44.

Now, I give a short description of the test system.

I used two identical HPE ProLiant DL380 Gen10 (DL380GEN10) servers with 
the following configuration:
- 2 Intel 5218 CPUs (the clock frequency was fixed at 2.3 GHz)
- 256 GB (8x32 GB) DDR4 SDRAM @ 2666 MHz (accessed in quad-channel mode)
- 2-port BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (used 
with 10 Gbps DAC cables)

The servers have 4 NUMA nodes (0, 1, 2, 3), and the NICs belong to NUMA 
node 1. All interrupts caused by the NICs are processed by 8 of the 
32 CPU cores (the ones that belong to NUMA node 1, that is, cores 8 
to 15); thus, regarding our measurements, the DUT (Device Under Test) is 
more or less equivalent to an 8-core server. :-)
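
For completeness, a minimal sketch of how such an IRQ-to-core mapping 
can be set from a script follows. The interface name is a hypothetical 
placeholder (not the actual device name on my servers), and the script 
must run as root:

# Pin the IRQs of a NIC to the CPU cores of its local NUMA node
# (cores 8-15 here, i.e. affinity mask 0xff00). Sketch only; the
# interface name "enp94s0f0" is a placeholder for the BCM57414 port.
IFACE = "enp94s0f0"   # hypothetical interface name
MASK = 0xff00         # bits 8..15 set -> CPU cores 8..15

with open("/proc/interrupts") as f:
    for line in f:
        if IFACE in line:
            irq = line.split(":")[0].strip()
            if irq.isdigit():
                # /proc/irq/<n>/smp_affinity takes a hex CPU bitmask
                with open(f"/proc/irq/{irq}/smp_affinity", "w") as aff:
                    aff.write(f"{MASK:x}")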

I used Debian Linux 9.13 with a 5.12.14 kernel on both servers.

The test setup was the same as our draft describes: 
https://datatracker.ietf.org/doc/html/draft-lencse-bmwg-benchmarking-stateful

                  +--------------------------------------+
         10.0.0.2 |Initiator                    Responder| 198.19.0.2
    +-------------|                Tester                |<------------+
    | private IPv4|                         [state table]| public IPv4 |
    |             +--------------------------------------+             |
    |                                                                  |
    |             +--------------------------------------+             |
    |    10.0.0.1 |                 DUT:                 | 198.19.0.1  |
    +------------>|        Stateful NATxy gateway        |-------------+
      private IPv4|     [connection tracking table]      | public IPv4
                  +--------------------------------------+

Both the Tester and the DUT were the above-mentioned HPE servers.

I wanted to follow my original plan from my IETF 111 presentation: to 
measure both the maximum connection establishment rate and the 
throughput using the following numbers of connections: 1 million, 10 
million, 100 million, and 1 billion. However, the DUT became unavailable 
during my tests with 1 billion connections, as the connection tracking 
table exhausted its memory. Thus, I reduced the highest number of 
connections to 500 million.

First, I had to perform the tests for the maximum connection 
establishment rate. To achieve the required growing numbers of 
connections, I always used 50,000 different source port numbers (from 1 
to 50,000) and increased the number of destination ports as 20, 200, 
2,000, and 10,000 (instead of 20,000).
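
In other words, the number of connections equals the number of unique 
source/destination port combinations. A trivial arithmetic check of the 
series (plain Python, nothing siitperf-specific):

# Number of connections = (source ports) x (destination ports),
# since every unique port pair identifies one connection.
src_ports = 50_000
for dst_ports in (20, 200, 2_000, 10_000):
    print(f"{src_ports:,} x {dst_ports:,} = "
          f"{src_ports * dst_ports:,} connections")
# 50,000 x 20     =   1,000,000 connections
# 50,000 x 200    =  10,000,000 connections
# 50,000 x 2,000  = 100,000,000 connections
# 50,000 x 10,000 = 500,000,000 connections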

As for the size of the connection tracking table, I used powers of 2. 
The very first value was 2^20=1,048,576, and after that I always used 
the lowest large enough power of 2, that is, 2^24, etc. The hash size 
parameter was always set to 1/8 of the size of the connection tracking 
table.
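
For clarity, this sizing rule can be sketched as follows. The sysctl 
and module parameter shown are the standard Linux conntrack knobs; the 
script only prints the commands and is a sketch of the rule, not my 
exact configuration steps:

# Pick the smallest power of 2 that can hold the planned number of
# connections, and set the hash size to 1/8 of it. The script prints
# the corresponding commands instead of executing them.
def conntrack_params(num_connections: int):
    table_size = 1
    while table_size < num_connections:
        table_size *= 2
    return table_size, table_size // 8

for n in (1_000_000, 10_000_000, 100_000_000, 500_000_000):
    table, hashsize = conntrack_params(n)
    print(f"# {n:,} connections -> table size 2^{table.bit_length() - 1}")
    print(f"sysctl -w net.netfilter.nf_conntrack_max={table}")
    print(f"echo {hashsize} > /sys/module/nf_conntrack/parameters/hashsize")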

I performed a binary search to determine the maximum connection 
establishment rate. The stopping criterion was expressed by the "error" 
parameter, which is the difference between the upper and lower bounds. 
It was set to 1,000 in all cases except the last one, where it was set 
to 10,000 to save some execution time. The experiments were executed 10 
times, except the last one (only 3 times).
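
To make the procedure explicit, here is a minimal sketch of such a 
binary search; measure_trial() and the initial bounds are hypothetical 
stand-ins for one siitperf test at a given rate, not the actual tester 
interface:

import statistics

def max_cps_binary_search(measure_trial, upper, error, lower=0):
    """Highest connection establishment rate at which a trial passes.

    measure_trial(rate) must return True if the DUT established all
    connections at the offered 'rate' (cps). The search stops when
    the upper and lower bounds differ by no more than 'error'.
    """
    while upper - lower > error:
        rate = (upper + lower) // 2
        if measure_trial(rate):
            lower = rate   # passed: the true maximum is >= rate
        else:
            upper = rate   # failed: the true maximum is < rate
    return lower

# Demo with a fake DUT whose true limit is 1,124,000 cps; in reality
# measure_trial would launch one siitperf run, and measurement noise
# would make the repeated search results differ.
def fake_dut(rate):
    return rate <= 1_124_000

runs = [max_cps_binary_search(fake_dut, 2_000_000, 1_000)
        for _ in range(10)]
print(statistics.median(runs), min(runs), max(runs))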

I calculated the median, the minimum, and the maximum of the measured 
connection establishment rates. I hope the values of the table below 
will be readable in my e-mail:

Num. conn.         1,000,000   10,000,000   100,000,000   500,000,000
src ports             50,000       50,000        50,000        50,000
dst ports                 20          200         2,000        10,000
conntrack t. s.         2^20         2^24          2^27          2^29
hash table size     c.t.s./8     c.t.s./8      c.t.s./8      c.t.s./8
num. exp.                 10           10            10             3
error                  1,000        1,000         1,000        10,000
cps median             1.124        1.311         1.01          0.742
cps min                1.108        1.308         1.007         0.742
cps max                1.128        1.317         1.013         0.742
c.t.s./n.c.            1.049        1.678         1.342         1.074


The cps (connections per second) values are given in millions of 
connections per second.

The rise of the median to 1.311M at 10 million connections (from 1.124M 
at 1 million connections) is very likely caused by the fact that the 
size of the connection tracking table was quite large compared to the 
actual number of connections. I have included this proportion in the 
last line of the table. Thus, only the very first and very last columns 
are directly comparable. Considering that the maximum connection 
establishment rate decreased only from 1.124M to 0.742M while the total 
number of connections increased from 1M to 500M, I think we can be 
satisfied with the scale-up of iptables.
(Of course, I can see the limitations of my measurements: the error of 
10,000 was too high; the binary search always finished at the same number.)

As I wanted more informative results, I abandoned the decimal increase 
of the number of connections. Instead, I used a binary increase to 
ensure that the ratio of the size of the connection tracking table to 
the number of connections remains constant.

I had another concern. Earlier, I pointed out that the roles of the 
source and destination port numbers are not completely symmetrical in 
the hash function that distributes the interrupts among the CPU cores 
[1]. Therefore, I decided to increase the source and destination port 
ranges together.

[1] G. Lencse, "Adding RFC 4814 Random Port Feature to Siitperf: 
Design, Implementation and Performance Estimation", International 
Journal of Advances in Telecommunications, Electrotechnics, Signals and 
Systems, vol. 9, no. 3, pp. 18-26, 2020, DOI: 10.11601/ijates.v9i3.291. 
Full paper in PDF: http://www.hit.bme.hu/~lencse/publications/291-1113-1-PB.pdf
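
Putting the two ideas together: doubling both port ranges quadruples 
the number of connections, and growing the table by two powers of 2 at 
each step keeps the table-to-connections ratio constant. A sketch of 
how the series below was constructed (plain arithmetic, not 
tool-specific):

# Scale source and destination port ranges together: each step doubles
# both, so the number of connections grows 4x; the conntrack table also
# grows 4x (two powers of 2), keeping their ratio constant at ~1.342.
src, dst, table_exp = 2_500, 625, 21
for _ in range(5):
    conns = src * dst
    ratio = 2**table_exp / conns
    print(f"{src:>6,} x {dst:>6,} = {conns:>11,} conns, "
          f"table 2^{table_exp}, ratio {ratio:.3f}")
    src, dst, table_exp = src * 2, dst * 2, table_exp + 2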

The results are shown in the table below (first, please see only the 
same lines as in the table above):

Num. conn.         1,562,500   6,250,000   25,000,000   100,000,000   400,000,000
src ports              2,500       5,000       10,000        20,000        40,000
dst ports                625       1,250        2,500         5,000        10,000
conntrack t. s.         2^21        2^23         2^25          2^27          2^29
hash table size     c.t.s./8    c.t.s./8     c.t.s./8      c.t.s./8      c.t.s./8
num. exp.                 10          10           10            10             5
error                  1,000       1,000        1,000         1,000         1,000
cps median             1.216       1.147        1.085         1.02          0.88
cps min                1.21        1.14         1.077         1.015         0.878
cps max                1.224       1.153        1.087         1.024         0.884
c.t.s./n.c.            1.342       1.342        1.342         1.342         1.342
throughput median      3.605       3.508        3.227         3.242         2.76
throughput min         3.592       3.494        3.213         3.232         2.748
throughput max         3.627       3.521        3.236         3.248         2.799


Now, the maximum connection setup rate deteriorates only very slightly, 
from 1.216M to 1.02M, while the number of connections increases from 
1,562,500 to 100,000,000. (A somewhat higher degradation can be 
observed only in the last column; I will return to it later.)

The last 3 lines of the table show the median, minimum, and maximum 
values of the throughput. As required by RFC 8219, the throughput was 
determined using bidirectional traffic, and each elementary step of 
the binary search lasted 60 s. (The binary search was executed 10 
times, except for the last column, where it was done only 5 times to 
save execution time.)

Note: commercial testers usually report the total number of frames 
forwarded, whereas siitperf reports the number of frames per direction. 
Thus, in the case of bidirectional tests, the reported value should be 
multiplied by 2 to obtain the total number of frames per second. I did 
so, too.

Although RFC 2544/5180/8219 require testing with bidirectional traffic, 
I suspect that unidirectional throughput may also be interesting for 
ISPs, as home users usually have much more download than upload 
traffic...

The degradation of the throughput is also moderate, except in the last 
column. I attribute the higher decrease of the throughput at 400,000,000 
connections (as well as that of the maximum connection establishment 
rate) to NUMA issues. In more detail: this time, nearly the entire 
memory of the server was in use, whereas in the previous cases iptables 
could use NUMA-local memory (if it was smart enough to do so). 
Unfortunately, I cannot add more memory to these computers to check my 
hypothesis, but in a few weeks I hope to be able to use some Dell 
PowerEdge R430 computers that have only two NUMA nodes and 384 GB of 
RAM; see the "P" nodes here: https://starbed.nict.go.jp/en/equipment/

Now, I kindly ask everybody who is interested in the scale-up tests to 
comment on my measurements regarding both the methodology and the 
parameters!

I am happy to provide more details, if needed.

I plan to start working on the NAT64 tests in the upcoming weeks. I plan 
to use Jool. I have very limited experience with it, so first I need to 
find out how I can tune its connection tracking table parameters to be 
able to perform fair scale-up tests.

In the meantime, please comment on my above experiments and results so 
that I may improve the tests and provide convincing results for all 
interested parties!

Best regards,

Gábor

P.S.: If someone would volunteer to repeat my experiments, I would be 
happy to share my scripts and experience and to provide support for 
siitperf, which is available from: https://github.com/lencsegabor/siitperf
The stateful branch of siitperf is in a very alpha state. It has only 
partial documentation in this paper, which is still under review: 
http://www.hit.bme.hu/~lencse/publications/SFNAT64-tester-for-review.pdf 
(It may be revised or removed at any time.)
The support for unique pseudorandom source and destination port number 
combinations, which I used for my current tests, is not described there, 
as I invented it after the submission of that paper. (I plan to update 
the paper if I get a chance to revise it. In the unlikely case that the 
paper is accepted as is, I plan to write a shorter paper about the new 
features. For now, the commented source code is the most reliable 
documentation.)