Re: [v6ops] Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

From: Gábor LENCSE <lencse@hit.bme.hu>
To: otroan@employees.org
Cc: "v6ops@ietf.org list" <v6ops@ietf.org>
Date: Wed, 18 Aug 2021 22:52:08 +0200
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/VPSfUemiC6s0BSJ89jY6RzX1pPc>

Dear Ole,

Thank you very much for your reply!

Please see my answers inline.

On 8/17/2021 10:52 PM, otroan@employees.org wrote:
> Gábor,
>
> Thanks for some great work!
> I will try to take a more thorough look later, but here is a first set of comments.
>
> For methodology.
>   - Setting a baseline, e.g. by measuring base IPv4 forwarding, would be useful in establishing how much extra work is involved in doing the translation.
I already had the measurement results ready for both IPv4 and IPv6 kernel 
routing (20 experiments, error=1000). The table below shows the number 
of all forwarded packets of the bidirectional traffic (in million frames 
per second).

Linux kernel routing    IPv4    IPv6
throughput median       9.471   9.064
throughput min          9.443   9.029
throughput max          9.486   9.088


>   - Scaling linearly by number of cores is challenging in these systems. It would be interesting to see the results for 1, 2, 4, 8 cores and not only
>     for the 8 core system.

Before the current measurements, I always did exactly this kind of 
scale-up test. For example, I compared the performance of different 
DNS64 implementations using 1, 2, 4, 8, 16 cores in [2] and the 
performance of different authoritative DNS servers using 1, 2, 4, 8, 16, 
32 cores in [3]. I gained some interesting experience with switching 
CPU cores on and off. First, I switched the i-th CPU core on/off by 
writing 1/0 into the /sys/devices/system/cpu/cpu$i/online file of the 
running Linux kernel [2]. Whereas this seemed to work well when the 
query rates were moderate (a few times ten thousand queries per second), 
it caused problems in the second case, when I used query rates of up to 
3 million queries per second, and thus I rather set the number of active 
CPU cores at the DUT by using the maxcpus=n kernel parameter.
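
For illustration, the two methods look like this (a sketch only; core 0 
usually cannot be taken offline, and the maxcpus parameter requires a 
reboot):

    # runtime on/off switching of the i-th core:
    echo 0 > /sys/devices/system/cpu/cpu$i/online   # switch the core off
    echo 1 > /sys/devices/system/cpu/cpu$i/online   # switch the core on

    # boot-time limit on the number of active cores,
    # e.g. in /etc/default/grub (then run update-grub and reboot):
    GRUB_CMDLINE_LINUX_DEFAULT="maxcpus=4"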

[2] G. Lencse and Y. Kadobayashi, "Benchmarking DNS64 Implementations: 
Theory and Practice", /Computer Communications/ (Elsevier), vol. 127, 
no. 1, pp. 61-74, September 1, 2018, DOI: 10.1016/j.comcom.2018.05.005 
Review version in PDF 
<http://www.hit.bme.hu/~lencse/publications/ECC-2018-DNS64-BM-for-review.pdf>

[3] G. Lencse, "Benchmarking Authoritative DNS Servers", /IEEE Access/, 
vol. 8. pp. 130224-130238, July 2020. DOI: 10.1109/ACCESS.2020.3009141 
Revised version in PDF 
<http://www.hit.bme.hu/~lencse/publications/IEEE-Access-2020-AuthDNS-revised.pdf>

So now I feel I have the experience to perform such measurements.

I plan to start a series of measurements with 1, 2, 4, 8 cores. To save 
execution time, I need to choose one fixed number of connections. (It 
would take very long to test all possible combinations if I used several 
different numbers of connections!) As I expect good scale-up, and so far 
we have seen only the 8-core performance, the single-core performance is 
likely between 1/6 and 1/8 of it. This means that I should use a 
moderate number of connections; otherwise, filling up the conntrack 
table would take too long at the lower core counts.

So I plan to use the following parameters: number of connections: 
4,000,000; src ports: 4,000; dst ports: 1,000; conntrack table size: 
2^22; hash size: c.t.s/8.

Do you agree with it?
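
For concreteness, a sketch of how those parameters could be set on the 
DUT (using the standard sysctl key and nf_conntrack module parameter 
path):

    sysctl -w net.netfilter.nf_conntrack_max=4194304             # 2^22 connections
    echo 524288 > /sys/module/nf_conntrack/parameters/hashsize   # 2^22 / 8 buckets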

> Regarding the number of connections. You should see a drop-off when the size of the session table is larger than the L3 cache.

Good catch!

Although I cannot directly measure the size of the conntrack table, I 
can record the change of the free memory during the experiments and thus 
estimate when this happens. I do not promise to deal with it now, but I 
plan to use this approach later on.
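
A minimal sketch of such a recording (1 s sampling; the log file name is 
arbitrary):

    while sleep 1; do
        echo "$(date +%s) $(grep MemFree /proc/meminfo)" >> memfree.log
    done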

> You might also balance maximum bandwidth available with maximum session size. 250G of forwarding per socket on a PCIe 3.0 system.

Do you mean "250G" as 250Gbps link?

I am sorry, but that is beyond my dreams. The systems I usually use (at 
NICT StarBED, Japan) have 10Gbps NICs. Now I use two HPE servers with 
10/25Gbps NICs interconnected by 10Gbps DAC cables (at the Széchenyi 
István University, Győr, Hungary), and even though my colleague 
purchased 25Gbps DAC cables at my request, I do not use them, as my 
current rates are very far from the maximum frame rate (14,880,952 fps) 
of the 10Gbps links.
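
(For reference, that figure is the standard maximum for 10 Gbps Ethernet 
with minimum-size, 64-byte frames, each occupying an extra 20 bytes of 
preamble and inter-frame gap on the wire:

    10,000,000,000 / ((64 + 20) * 8) = 14,880,952 frames per second.)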

So it would be nice if someone with higher-performance hardware could 
repeat my measurements. Any volunteers?


> I might have read a little too much between the lines in the draft, but I got a feeling (and just that), that the tests were coloured a bit by the behaviour of a particular implementation (iptables).

Yes, my results reflect only the behavior of iptables.

If the results of one particular implementation show:
- poor scale-up, then it does not prove that the technology is bad; 
other implementations might perform much better;
- good scale-up, then it proves that the technology is good, but other 
implementations may still perform much worse.

Of course, our time is rather limited; thus, my approach is to test the 
implementations we expect to be good enough. As for NAT44, I expect that 
iptables is among them. If anyone can suggest a better one, I am open to 
trying it. It should be free software for two reasons:
- I do not have a budget to buy a proprietary implementation;
- the licenses of some vendors prohibit the publication of benchmarking 
results.

> You are of course measuring how a particular implementation (or set of implementations) scales and we as the IETF have to deduce if the scaling limitations are in the implementation or in the protocol mechanism. We do know that you can build large scale NAT44s and NAT64s, so back to my first point, it might be useful to provide a baseline to give an idea of the additional cost associated with the extra treatment of packets.

Yes, my results (9Mfps vs. 3Mfps) definitely show that stateful NAT44 is 
not without performance costs.

> It would certainly be interesting to run your tool against the VPP implementation of NAT.
> Here are some NAT performance results from runs with the CSIT benchmarking setup for VPP:
> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/packet_throughput_graphs/nat44.html
> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/throughput_speedup_multi_core/nat44.html
As far as I could figure out, their stateless tests seem to scale up 
well to 4 CPU cores, but their stateful tests do not scale up, 
regardless of whether we consider connections per second or throughput.

I really wonder how iptables behaves! :-)

Best regards,

Gábor

> Best regards,
> Ole
>
>
>
>> On 17 Aug 2021, at 14:51, Gábor LENCSE <lencse@hit.bme.hu> wrote:
>>
>> Dear All,
>>
>> At IETF 111, I promised to perform scale-up test for stateful NAT64 and also for stateful NAT44. (Our primary focus is stateful NAT64, as it is a part of 464XLAT. As there was interest in CGN, I am happy to invest some time into it, too. I think the comparison of their scalability may be interesting for several people.)
>>
>> Now, I am in the phase of preliminary tests. I mean that I have performed tests to explore the behavior of both the Tester (stateful branch of siitperf) and the benchmarked application to be able to determine the conditions for the production tests.
>>
>> Now I write to ask all interested "IPv6 Operations" mailing list members to review my report below from two points of view:
>> - Do you consider the methodology sound?
>> - Do you think that the parameters are appropriate to provide meaningful results for network operators?
>>
>> Control question:
>> - Will you support the inclusion of the results into draft-ietf-v6ops-transition-comparison and the publication of the draft as an RFC even if the results contradict your view of the scalability of the stateful technologies?
>>
>> As I had more experience with iptables than Jool, I started with the scale-up tests of stateful NAT44.
>>
>> Now, I give a short description of the test system.
>>
>> I used two identical HPE ProLiant DL380 Gen10 (DL380GEN10) servers with the following configuration:
>> - 2 Intel 5218 CPUs (the clock frequency was fixed at 2.3GHz)
>> - 256GB (8x32GB) DDR4 SDRAM @ 2666 MHz (accessed quad channel)
>> - 2-port BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (used with 10Gbps DAC cables)
>>
>> The servers have 4 NUMA nodes (0, 1, 2, 3) and the NICs belong to NUMA node 1. All the interrupts caused by the NICs are processed by 8 of the 32 CPU cores (the ones that belong to NUMA node 1, that is, cores 8 to 15); thus, regarding our measurements, the DUT (Device Under Test) is more or less equivalent to an 8-core server. :-)
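>>
>> In case someone wants to reproduce the setup, a quick way to check which NUMA node a NIC belongs to and which cores service its interrupts (the interface name is only an example):
>>
>>     cat /sys/class/net/ens1f0/device/numa_node   # prints the NUMA node, e.g. 1
>>     grep ens1f0 /proc/interrupts                 # per-core interrupt counters
>>     lscpu | grep "NUMA node"                     # which cores belong to which node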
>>
>> I used Debian Linux 9.13 with 5.12.14 kernel on both servers.
>>
>> The test setup was the same as our draft describes: https://datatracker.ietf.org/doc/html/draft-lencse-bmwg-benchmarking-stateful
>>                   +--------------------------------------+
>>          10.0.0.2 |Initiator                    Responder| 198.19.0.2
>>     +-------------|                Tester                |<------------+
>>     | private IPv4|                         [state table]| public IPv4 |
>>     |             +--------------------------------------+             |
>>     |                                                                  |
>>     |             +--------------------------------------+             |
>>     |    10.0.0.1 |                 DUT:                 | 198.19.0.1  |
>>     +------------>|        Stateful NATxy gateway        |-------------+
>>       private IPv4|     [connection tracking table]      | public IPv4
>>                   +--------------------------------------+
>>
>>
>> Both the Tester and the DUT were the above mentioned HPE servers.
>>
>> I wanted to follow my original plan included in my IETF 111 presentation: to measure both the maximum connection establishment rate and the throughput using the following numbers of connections: 1 million, 10 million, 100 million, and 1 billion. However, the DUT became unavailable during my tests with 1 billion connections, as the connection tracking table exhausted its memory. Thus I reduced the highest number of connections to 500 million.
>>
>> First, I had to perform the tests for the maximum connection establishment rate. To achieve the required growing number of connections, I always used 50,000 different source port numbers (from 1 to 50,000) and increased the number of destination port numbers as 20, 200, 2,000, and 10,000 (instead of 20,000).
>>
>> As for the size of the connection tracking table, I used powers of 2. The very first value was 2^20=1,048,576, and then I had to use the lowest sufficiently large power, that is, 2^24, etc. The hash size parameter was always set to 1/8 of the size of the connection tracking table.
>>
>> I performed a binary search to determine the maximum connection establishment rate. The stopping criterion was expressed by the "error" parameter, which is the difference between the upper and lower bounds. It was set to 1,000 in all cases except the last one, where it was set to 10,000 to save some execution time. The experiments were executed 10 times, except the last one (only 3 times).
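>>
>> A minimal sketch of this search logic in shell (run_trial is a hypothetical command standing for one elementary step at the given rate; siitperf's actual invocation differs):
>>
>>     lo=0; hi=2000000; error=1000        # bounds in connections per second
>>     while [ $((hi - lo)) -gt "$error" ]; do
>>         mid=$(( (lo + hi) / 2 ))
>>         if run_trial "$mid"; then       # hypothetical: succeeds if the DUT sustained the rate
>>             lo=$mid                     # rate sustained: raise the lower bound
>>         else
>>             hi=$mid                     # rate failed: lower the upper bound
>>         fi
>>     done
>>     echo "maximum rate: approximately $lo connections per second"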
>>
>> I have calculated the median, the minimum and the maximum of the measured connection establishment rates. I hope the values of the table below will be readable in my e-mail:
>>
>> Num. conn.        1,000,000   10,000,000   100,000,000   500,000,000
>> src ports            50,000       50,000        50,000        50,000
>> dst ports                20          200         2,000        10,000
>> conntrack t. s.        2^20         2^24          2^27          2^29
>> hash table size     c.t.s/8      c.t.s/8       c.t.s/8       c.t.s/8
>> num. exp.                10           10            10             3
>> error                 1,000        1,000         1,000        10,000
>> cps median            1.124        1.311         1.01          0.742
>> cps min               1.108        1.308         1.007         0.742
>> cps max               1.128        1.317         1.013         0.742
>> c.t.s/n.c.            1.049        1.678         1.342         1.074
>>
>> The cps (connections per second) values are given in million connections per second.
>>
>> The rise of the median to 1.311M at 10 million connections (from 1.124M at 1 million connections) is evidently caused by the fact that the size of the connection tracking table was quite large compared to the actual number of connections. I have included this proportion in the last line of the table. Thus only the very first and very last columns are directly comparable. If we consider that the maximum connection establishment rate decreased only from 1.124M to 0.742M while the total number of connections increased from 1M to 500M, I think we can be satisfied with the scale-up of iptables.
>> (Of course, I can see the limitations of my measurements: the error of 10,000 was too high, so the binary search always finished at the same number.)
>>
>> As I wanted more informative results, I abandoned the decimal increase of the number of connections. Instead, I used a binary increase to ensure that the ratio of the size of the connection tracking table to the number of connections remained constant.
>>
>> I had another concern. Earlier, I pointed out that the roles of the source and destination port numbers are not completely symmetrical in the hash function that distributes the interrupts among the CPU cores [1]. Therefore, I decided to increase the source and destination port ranges together.
>>
>> [1] G. Lencse, "Adding RFC 4814 Random Port Feature to Siitperf: Design, Implementation and Performance Estimation", International Journal of Advances in Telecommunications, Electrotechnics, Signals and Systems, vol 9, no 3, pp. 18-26, 2020, DOI: 10.11601/ijates.v9i3.291 Full paper in PDF
>>
>> The results are shown in the table below (first, please see only the same lines as in the table above):
>>
>> Num. conn.          1,562,500   6,250,000   25,000,000   100,000,000   400,000,000
>> src ports               2,500       5,000       10,000        20,000        40,000
>> dst ports                 625       1,250        2,500         5,000        10,000
>> conntrack t. s.          2^21        2^23         2^25          2^27          2^29
>> hash table size       c.t.s/8     c.t.s/8      c.t.s/8       c.t.s/8       c.t.s/8
>> num. exp.                  10          10           10            10             5
>> error                   1,000       1,000        1,000         1,000         1,000
>> cps median              1.216       1.147        1.085         1.02          0.88
>> cps min                 1.21        1.14         1.077         1.015         0.878
>> cps max                 1.224       1.153        1.087         1.024         0.884
>> c.t.s/n.c.              1.342       1.342        1.342         1.342         1.342
>> throughput median       3.605       3.508        3.227         3.242         2.76
>> throughput min          3.592       3.494        3.213         3.232         2.748
>> throughput max          3.627       3.521        3.236         3.248         2.799
>>
>> Now the maximum connection setup rate deteriorates only very slightly with the increase of the number of connections: from 1.216M to 1.02M, while the number of connections increased from 1,562,500 to 100,000,000. (A somewhat higher degradation can be observed only in the last column; I will return to it later.)
>>
>> The last 3 lines of the table show the median, minimum and maximum values of the throughput. As required by RFC 8219, throughput was determined by using bidirectional traffic and the duration of the elementary steps of the binary search was 60s. (The binary search was executed 10 times, except for the last column, where it was done only 5 times to save execution time.)
>>
>> Note: commercial testers usually report the total number of frames forwarded. Siitperf reports the number of frames per direction. Thus, in the case of bidirectional tests, the reported value should be multiplied by 2 to obtain the total number of frames per second. I did so, too.
>>
>> Although RFC 2544/5180/8219 require testing with bidirectional traffic, I suspect that unidirectional throughput may also be interesting for ISPs, as home users usually have much more download than upload traffic...
>>
>> The degradation of the throughput is also moderate, except in the last column. I attribute the higher decrease of the throughput at 400,000,000 connections (as well as that of the maximum connection establishment rate) to NUMA issues. In more detail: this time nearly the entire memory of the server was in use, whereas in the previous cases iptables could use NUMA-local memory (if it was smart enough to do so). Unfortunately, I cannot add more memory to these computers to check my hypothesis, but in a few weeks I hope to be able to use some DELL PowerEdge R430 computers that have only two NUMA nodes and 384GB RAM; see the "P" nodes here: https://starbed.nict.go.jp/en/equipment/
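>>
>> A possible way to check this hypothesis even without adding memory is to watch the per-NUMA-node memory usage during a run (numastat comes with the numactl package; a sketch):
>>
>>     watch -n 10 numastat -m                                # per-node memory usage
>>     grep MemFree /sys/devices/system/node/node*/meminfo    # alternative without numactl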
>>
>> Now I kindly ask everybody who is interested in the scale-up tests to comment on my measurements regarding both the methodology and the parameters!
>>
>> I am happy to provide more details, if needed.
>>
>> I plan to start working on the NAT64 tests in the upcoming weeks. I plan to use Jool. I have very limited experience with it, so first I need to find out how I can tune its connection tracking table parameters to be able to perform fair scale-up tests.
>>
>> In the meantime, please comment on my above experiments and results so that I may improve the tests to provide convincing results for all interested parties!
>>
>> Best regards,
>>
>> Gábor
>>
>> P.S.: If someone would volunteer to repeat my experiments, I would be happy to share my scripts and experience and to provide support for siitperf, which is available from: https://github.com/lencsegabor/siitperf
>> The stateful branch of siitperf is in a very alpha state. It has only partial documentation in this paper, which is still under review: http://www.hit.bme.hu/~lencse/publications/SFNAT64-tester-for-review.pdf (It may be revised or removed at any time.)
>> The support for unique pseudorandom source and destination port number combinations, which I used for my current tests, is not described there, as I invented it after the submission of that paper. (I plan to update the paper if I get a chance to revise it. In the unlikely case that the paper is accepted as is, I plan to write a shorter paper about the new features. For now, the commented source code is the most reliable documentation.)