Re: [v6ops] Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

otroan@employees.org Tue, 17 August 2021 20:52 UTC

From: otroan@employees.org
To: Gábor LENCSE <lencse@hit.bme.hu>
Cc: "v6ops@ietf.org list" <v6ops@ietf.org>
Date: Tue, 17 Aug 2021 22:52:23 +0200
Subject: Re: [v6ops] Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/lyjeOXRZufpavmfcZXlogc8ZRRg>

Gábor,

Thanks for some great work!
I will take a more thorough look later, but here is a first set of comments.

On methodology:
 - Setting a baseline, e.g. by measuring plain IPv4 forwarding, would be useful for establishing how much extra work the translation involves (see the sketch below).
 - Scaling linearly with the number of cores is challenging in these systems. It would be interesting to see the results for 1, 2, 4, and 8 cores, not only
   for the 8-core configuration.
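
For illustration, the kind of arithmetic such a baseline enables. A minimal sketch with placeholder throughput numbers (they are not measurements; only the 2.3 GHz clock and the 8 active cores are taken from the setup described below):

    # Hypothetical baseline comparison; all packet rates are placeholders.
    CLOCK_HZ = 2.3e9        # DUT core clock (2.3 GHz, per the setup below)
    CORES = 8               # cores servicing the NIC queues

    plain_ipv4_pps = 5.0e6  # placeholder: plain IPv4 forwarding rate
    nat_pps = 3.2e6         # placeholder: NATed forwarding rate

    def cycles_per_packet(pps):
        # cycle budget of the active cores divided by the packet rate
        return CLOCK_HZ * CORES / pps

    extra = cycles_per_packet(nat_pps) - cycles_per_packet(plain_ipv4_pps)
    print(f"extra work per translated packet: ~{extra:,.0f} cycles")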

Regarding the number of connections: you should see a drop-off once the session table grows larger than the L3 cache.
You might also balance the maximum available bandwidth against the maximum session-table size; figure roughly 250G of forwarding per socket on a PCIe 3.0 system.
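
To make the cache argument concrete, a rough calculation (both numbers are assumptions: Linux conntrack entries are on the order of a few hundred bytes, and I am assuming a 22 MB L3 cache per socket):

    # When does the session table outgrow the L3 cache?
    ENTRY_BYTES = 320            # assumed size of one session entry
    L3_BYTES = 22 * 2**20        # assumed L3 cache size per socket

    for n in (10**6, 10**7, 10**8, 5 * 10**8):
        table = n * ENTRY_BYTES
        print(f"{n:>11,} sessions: {table / 2**30:5.2f} GiB "
              f"({table / L3_BYTES:6.0f}x the L3 cache)")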

I might have read a little too much between the lines of the draft, but I got the feeling (and just that) that the tests were coloured a bit by the behaviour of one particular implementation (iptables).

You are of course measuring how a particular implementation (or set of implementations) scales, and we in the IETF have to deduce whether the scaling limitations lie in the implementation or in the protocol mechanism. We do know that large-scale NAT44s and NAT64s can be built, so, back to my first point, it might be useful to provide a baseline to give an idea of the additional cost of the extra treatment of packets.

It would certainly be interesting to run your tool against the VPP implementation of NAT.
Here are some NAT performance results from runs with the CSIT benchmarking setup for VPP:
https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/packet_throughput_graphs/nat44.html
https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/throughput_speedup_multi_core/nat44.html

Best regards,
Ole



> On 17 Aug 2021, at 14:51, Gábor LENCSE <lencse@hit.bme.hu> wrote:
> 
> Dear All,
> 
> At IETF 111, I promised to perform scale-up tests for stateful NAT64 and also for stateful NAT44. (Our primary focus is stateful NAT64, as it is a part of 464XLAT. As there was interest in CGN, I am happy to invest some time into it, too. I think the comparison of their scalability may be interesting for several people.)
> 
> Now I am in the preliminary testing phase: I have performed tests to explore the behavior of both the Tester (the stateful branch of siitperf) and the benchmarked application, in order to determine the conditions for the production tests.
> 
> Now I write to ask all interested "IPv6 Operations" mailing list members to review my report below from two points of view:
> - Do you consider the methodology sound?
> - Do you think that the parameters are appropriate to provide meaningful results for network operators?
> 
> Control question:
> - Will you support the inclusion of the results in draft-ietf-v6ops-transition-comparison and the publication of the draft as an RFC even if the results contradict your view of the scalability of the stateful technologies?
> 
> As I had more experience with iptables than with Jool, I started with the scale-up tests of stateful NAT44.
> 
> Now, I give a short description of the test system.
> 
> I used two identical HPE ProLiant DL380 Gen10 (DL380GEN10) servers with the following configuration:
> - 2 Intel Xeon Gold 5218 CPUs (the clock frequency was fixed at 2.3 GHz)
> - 256 GB (8x32 GB) DDR4 SDRAM @ 2666 MHz (quad-channel access)
> - 2-port BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (used with 10 Gbps DAC cables)
> 
> The servers have 4 NUMA nodes (0, 1, 2, 3) and the NICs belong to NUMA node 1. All the interrupts caused by the NICs are processed by 8 of the 32 CPU cores (the ones that belong to NUMA node 1, that is, cores 8 to 15); thus, regarding our measurements, the DUT (Device Under Test) is more or less equivalent to an 8-core server. :-)
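> 
> For reference, a small sketch (not part of siitperf; the interface name is just an example) of how this core assignment can be verified by parsing /proc/interrupts:
> 
>     #!/usr/bin/env python3
>     # List which CPU cores have serviced interrupts of a given NIC,
>     # based on the per-CPU counters in /proc/interrupts.
>     import sys
> 
>     nic = sys.argv[1] if len(sys.argv) > 1 else "ens1f0"  # example name
>     with open("/proc/interrupts") as f:
>         cpus = f.readline().split()          # header row: CPU0 CPU1 ...
>         for line in f:
>             if nic in line:
>                 fields = line.split()
>                 irq = fields[0].rstrip(":")
>                 counts = fields[1:1 + len(cpus)]
>                 active = [c for c, n in zip(cpus, counts) if int(n) > 0]
>                 print(f"IRQ {irq}: serviced on {', '.join(active) or 'no core'}")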
> 
> I used Debian Linux 9.13 with a 5.12.14 kernel on both servers.
> 
> The test setup was the same as our draft describes: https://datatracker.ietf.org/doc/html/draft-lencse-bmwg-benchmarking-stateful
>                  +--------------------------------------+
>         10.0.0.2 |Initiator                    Responder| 198.19.0.2
>    +-------------|                Tester                |<------------+
>    | private IPv4|                         [state table]| public IPv4 |
>    |             +--------------------------------------+             |
>    |                                                                  |
>    |             +--------------------------------------+             |
>    |    10.0.0.1 |                 DUT:                 | 198.19.0.1  |
>    +------------>|        Stateful NATxy gateway        |-------------+
>      private IPv4|     [connection tracking table]      | public IPv4
>                  +--------------------------------------+
> 
> 
> Both the Tester and the DUT were the above-mentioned HPE servers.
> 
> I wanted to follow my original plan from my IETF 111 presentation and measure both the maximum connection establishment rate and the throughput using the following numbers of connections: 1 million, 10 million, 100 million, and 1 billion. However, the DUT became unavailable during my tests with 1 billion connections, as the connection tracking table exhausted its memory. Thus I reduced the highest number of connections to 500 million.
> 
> First, I had to perform the tests for the maximum connection establishment rate. To achieve the required growing number of connections, I always used 50,000 different source port numbers (from 1 to 50,000) and increased the number of destination port numbers as 20, 200, 2,000, and 10,000 (instead of 20,000).
> 
> As for the size of the connection tracking table, I used powers of 2. The very first value was 2^20=1,048,576, and then I always used the smallest power of 2 that was still large enough, that is 2^24, etc. The hash size parameter was always set to 1/8 of the size of the connection tracking table.
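> 
> (As a sketch, for illustration only: on Linux these values typically end up in net.netfilter.nf_conntrack_max and the nf_conntrack hashsize module parameter. The sizing logic is the following.)
> 
>     # Pick the smallest power of 2 that holds n connections; the hash
>     # table is set to 1/8 of it, as in these tests.
>     import math
> 
>     def conntrack_params(n_conn):
>         size = 1 << math.ceil(math.log2(n_conn))
>         return size, size // 8      # (table size, hash size)
> 
>     for dst in (20, 200, 2_000, 10_000):
>         n = 50_000 * dst            # one connection per (src, dst) port pair
>         size, hashsize = conntrack_params(n)
>         print(f"{n:>11,} conns: table 2^{size.bit_length() - 1}, hash {hashsize:,}")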
> 
> I performed a binary search to determine the maximum connection establishment rate. The stopping criterion was expressed by the "error" parameter, which is the difference between the upper and lower bounds. It was set to 1,000 in all cases except the last one, where it was set to 10,000 to save some execution time. The experiments were executed 10 times, except the last one (only 3 times).
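> 
> (Schematically, and not the actual siitperf code: the search logic, assuming a trial(rate) function that runs one elementary test step and reports whether the DUT sustained the offered rate, is the following.)
> 
>     # Schematic binary search for the maximum connection establishment rate.
>     def max_rate(trial, lo=0, hi=10_000_000, error=1_000):
>         while hi - lo > error:      # "error" = upper bound minus lower bound
>             mid = (lo + hi) // 2
>             if trial(mid):
>                 lo = mid            # rate sustained: move the lower bound up
>             else:
>                 hi = mid            # rate failed: move the upper bound down
>         return lo                   # highest rate known to have passed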
> 
> I have calculated the median, the minimum and the maximum of the measured connection establishment rates. I hope the values of the table below will be readable in my e-mail:
> 
> Num. conn.        1,000,000   10,000,000   100,000,000   500,000,000
> src ports            50,000       50,000        50,000        50,000
> dst ports                20          200         2,000        10,000
> conntrack t. s.        2^20         2^24          2^27          2^29
> hash table size     c.t.s/8      c.t.s/8       c.t.s/8       c.t.s/8
> num. exp.                10           10            10             3
> error                 1,000        1,000         1,000        10,000
> cps median            1.124        1.311          1.01         0.742
> cps min               1.108        1.308         1.007         0.742
> cps max               1.128        1.317         1.013         0.742
> c.t.s/n.c.            1.049        1.678         1.342         1.074
> 
> The cps (connections per second) values are given in million connections per second.
> 
> The rise of the median to 1.311M at 10 million connections (from 1.124M at 1 million connections) is evidently caused by the fact that the size of the connection tracking table was quite large compared to the actual number of connections. I have included this ratio in the last line of the table. Thus only the very first and very last columns are directly comparable. Considering that the maximum connection establishment rate decreased only from 1.124M to 0.742M while the total number of connections increased from 1M to 500M, I think we can be satisfied with the scale-up of iptables.
> (Of course, I can see the limitations of my measurements: the error of 10,000 was too high, so the binary search always finished at the same number.)
> 
> As I wanted more informative results, I abandoned the decimal increase of the number of connections. Instead, I used a binary increase to ensure that the ratio of the number of connections to the size of the connection tracking table remained constant.
> 
> I had another concern. Earlier, I pointed out that the roles of the source and destination port numbers are not completely symmetrical in the hash function that distributes the interrupts among the CPU cores [1]. Therefore, I decided to increase the source and destination port ranges together, as illustrated after the reference below.
> 
> [1] G. Lencse, "Adding RFC 4814 Random Port Feature to Siitperf: Design, Implementation and Performance Estimation", International Journal of Advances in Telecommunications, Electrotechnics, Signals and Systems, vol. 9, no. 3, pp. 18-26, 2020, DOI: 10.11601/ijates.v9i3.291
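> 
> (Illustration only: the series below reproduces the connection counts of the next table; both port ranges double from step to step, so their product grows fourfold.)
> 
>     # Port-range series: doubling both ranges quadruples the connections.
>     src, dst = 2_500, 625
>     for _ in range(5):
>         print(f"{src:>6,} x {dst:>6,} = {src * dst:>11,} connections")
>         src, dst = 2 * src, 2 * dst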
> 
> The results are shown in the table below (first, please see only the same lines as in the table above):
> 
> Num. conn.         1,562,500   6,250,000   25,000,000   100,000,000   400,000,000
> src ports              2,500       5,000       10,000        20,000        40,000
> dst ports                625       1,250        2,500         5,000        10,000
> conntrack t. s.         2^21        2^23         2^25          2^27          2^29
> hash table size      c.t.s/8     c.t.s/8      c.t.s/8       c.t.s/8       c.t.s/8
> num. exp.                 10          10           10            10             5
> error                  1,000       1,000        1,000         1,000         1,000
> cps median             1.216       1.147        1.085          1.02          0.88
> cps min                1.21        1.14         1.077         1.015         0.878
> cps max                1.224       1.153        1.087         1.024         0.884
> c.t.s/n.c.             1.342       1.342        1.342         1.342         1.342
> throughput median      3.605       3.508        3.227         3.242          2.76
> throughput min         3.592       3.494        3.213         3.232         2.748
> throughput max         3.627       3.521        3.236         3.248         2.799
> 
> Now, the maximum connection setup rate deteriorates only very slightly as the number of connections grows: from 1.216M to 1.02M, while the number of connections increased from 1,562,500 to 100,000,000. (A slightly higher degradation can be observed only in the last column; I will return to it later.)
> 
> The last 3 lines of the table show the median, minimum and maximum values of the throughput. As required by RFC 8219, the throughput was determined using bidirectional traffic, and the duration of the elementary steps of the binary search was 60 s. (The binary search was executed 10 times, except for the last column, where it was done only 5 times to save execution time.)
> 
> Note: commercial testers usually report the total number of frames forwarded, whereas siitperf reports the number of frames per direction. Thus, in the case of bidirectional tests, the reported value should be multiplied by 2 to obtain the total number of frames per second. I did so, too.
> 
> Although RFC 2544/5180/8219 require testing with bidirectional traffic, I suspect that unidirectional throughput may also be interesting for ISPs, as home users usually have much more download than upload traffic...
> 
> The degradation of the throughput is also moderate, except in the last column. I attribute the higher decrease of the throughput at 400,000,000 connections (as well as that of the maximum connection establishment rate) to NUMA issues. In more detail: this time nearly the entire memory of the server was in use, whereas in the previous cases iptables could use NUMA-local memory (if it was smart enough to do so). Unfortunately, I cannot add more memory to these computers to check my hypothesis, but in a few weeks I hope to be able to use some DELL PowerEdge R430 computers that have only two NUMA nodes and 384GB RAM; see the "P" nodes here: https://starbed.nict.go.jp/en/equipment/
> 
> Now I kindly ask everybody who is interested in the scale-up tests to comment on my measurements regarding both the methodology and the parameters!
> 
> I am happy to provide more details, if needed.
> 
> I plan to start working on the NAT64 tests in the upcoming weeks, using Jool. I have very limited experience with it, so first I need to find out how I can tune its connection tracking table parameters to be able to perform fair scale-up tests.
> 
> In the meantime, please comment on my above experiments and results so that I may improve the tests and provide convincing results for all interested parties!
> 
> Best regards,
> 
> Gábor
> 
> P.S.: If someone volunteered to repeat my experiments, I would be happy to share my scripts and experience and to provide support for siitperf, which is available from: https://github.com/lencsegabor/siitperf
> The stateful branch of siitperf is in a very alpha state. It has only partial documentation in this paper, which is still under review: http://www.hit.bme.hu/~lencse/publications/SFNAT64-tester-for-review.pdf (It may be revised or removed at any time.)
> The support for unique pseudorandom source and destination port number combinations, which I used for my current tests, is not described there, as I invented it after the submission of that paper. (I plan to update the paper if I get a chance to revise it. In the unlikely case that the paper is accepted as is, I plan to write a shorter paper about the new features. For now, the commented source code is the most reliable documentation.)
> 
> _______________________________________________
> v6ops mailing list
> v6ops@ietf.org
> https://www.ietf.org/mailman/listinfo/v6ops