Re: [v6ops] Scale-up tests of iptables for the number of CPU cores -- Re: Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

Gábor LENCSE <lencse@hit.bme.hu> Mon, 06 September 2021 09:10 UTC

To: "Joel M. Halpern" <jmh@joelhalpern.com>, Lorenzo Colitti <lorenzo=40google.com@dmarc.ietf.org>
Cc: "v6ops@ietf.org WG" <v6ops@ietf.org>
References: <1a82ae6f-db3d-4228-e176-5dab0049a156@hit.bme.hu> <DE1B80FE-6BD9-44F4-843F-78776E48C9F3@employees.org> <eff7c05a-15b5-e465-f558-5ef7eb291318@hit.bme.hu> <782eb836-23e6-b180-47f0-5118733f1e59@hit.bme.hu> <CAKD1Yr39JzNG6ng2TS3GNfM1z=rcs9jA2ph66f-3tjy5G_cjbA@mail.gmail.com> <388905fe-13da-e6e2-26c6-3c6a07d58574@joelhalpern.com>
From: Gábor LENCSE <lencse@hit.bme.hu>
Message-ID: <da4eda6b-0502-dc6e-6477-050eb4b3df7d@hit.bme.hu>
Date: Mon, 06 Sep 2021 11:09:21 +0200
In-Reply-To: <388905fe-13da-e6e2-26c6-3c6a07d58574@joelhalpern.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/TTTOHeUQ1LeVNjbL9ULtDIdL-Fw>
Subject: Re: [v6ops] Scale-up tests of iptables for the number of CPU cores -- Re: Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

Dear Joel and Lorenzo,

Thank you for your suggestions! I am sure that solutions eliminating the 
TCP/IP socket interface may produce much higher performance.

However, the aim of my current tests is to check the scalability of 
stateful technologies. In our draft 
(https://datatracker.ietf.org/doc/html/draft-ietf-v6ops-transition-comparison), 
we need to eliminate the following section:

    Stateful technologies, 464XLAT and DS-Lite (and also NAT444) can
    therefore be much more efficient in terms of port allocation and thus
    public IP address saving.  The price is the stateful operation in the
    service provider network, which allegedly does not scale up well.  It
    should be noticed that in many cases, all those factors may depend on
    how it is actually implemented.

    XXX MEASUREMENTS ARE PLANNED TO TEST IF THE ABOVE IS TRUE.  XXX

As for scalability, the single-core performance in itself is not what 
matters; rather, we focus on two things:

1. How does the performance scale up with the number of CPU cores?

This can be seen in the table below:

num. CPU cores          1          2          4          8         16
src ports           4,000      4,000      4,000      4,000      4,000
dst ports           1,000      1,000      1,000      1,000      1,000
num. conn.      4,000,000  4,000,000  4,000,000  4,000,000  4,000,000
conntrack t. s.      2^23       2^23       2^23       2^23       2^23
hash table size     c.t.s      c.t.s      c.t.s      c.t.s      c.t.s
c.t.s/num.conn.     2.097      2.097      2.097      2.097      2.097
num. exp.              10         10         10         10         10
error                 100        100        100      1,000      1,000
cps median          223.5      371.1      708.7      1,341      2,383
cps min             221.6      367.7      701.7      1,325      2,304
cps max             226.7      375.9      723.6      1,376      2,417
cps rel. scale up       1      0.830      0.793      0.750      0.666
throughput median   414.9      742.3      1,379      2,336      4,557
throughput min      413.9      740.6      1,373      2,311      4,436
throughput max      416.1      746.9      1,395      2,361      4,627
tp. rel. scale up       1      0.895      0.831      0.704      0.686

Of course, the performance of the 16-core system is only about 10 times 
that of a single core, not 16 times, but IMHO it is quite good.

For example, please refer to the scale-up results of NSD (a high 
performance authoritative DNS server) in Table 9 of this (open access) 
paper:

G. Lencse, "Benchmarking Authoritative DNS Servers", /IEEE Access/, vol. 
8. pp. 130224-130238, July 2020. DOI: 10.1109/ACCESS.2020.3009141 
https://ieeexplore.ieee.org/document/9139929

There, the scale-up of the medians in Table 9 is 
1,454,661/177,432 = 8.2-fold using 16 cores, that is, the relative 
scale-up is only about 0.51. And DNS is not a stateful technology!
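
To make the comparison concrete, here is a minimal Python sketch (my own 
illustration, not part of the measurement scripts) of how the relative 
scale-up values quoted above are obtained from the medians:

    # Relative scale-up: the rate measured with n cores divided by n times
    # the single-core rate; 1.0 would mean perfectly linear scaling.

    def relative_scale_up(single_core: float, n_cores: int, rate: float) -> float:
        return rate / (n_cores * single_core)

    # iptables connection establishment rate medians from the table above
    cps_medians = {1: 223.5, 2: 371.1, 4: 708.7, 8: 1341, 16: 2383}
    for n, rate in cps_medians.items():
        print(f"{n:2} cores: {relative_scale_up(cps_medians[1], n, rate):.3f}")
    # -> 1.000, 0.830, 0.793, 0.750, 0.666 (the "cps rel. scale up" row)

    # NSD medians from Table 9 of the DNS benchmarking paper (1 vs. 16 cores)
    print(f"NSD: {relative_scale_up(177_432, 16, 1_454_661):.2f}")  # about 0.51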


2. How does the performance degrade as the number of sessions stored in 
the connection tracking table of the stateful NATxy device grows?

Regarding this, I plan to perform measurements with the following 
parameters:

src ports             2,500      5,000     10,000      20,000      40,000
dst ports               625      1,250      2,500       5,000      10,000
num. conn.        1,562,500  6,250,000 25,000,000 100,000,000 400,000,000
conntrack t. s.        2^21       2^23       2^25        2^27        2^29
hash table size       c.t.s      c.t.s      c.t.s       c.t.s       c.t.s
c.t.s/num.conn.       1.342      1.342      1.342       1.342       1.342
num. exp.                10         10         10          10          10
error                 1,000      1,000      1,000       1,000       1,000
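
These parameters follow the sizing rule used throughout the thread: the 
number of connections is the product of the source and destination port 
ranges, and the conntrack table size is the lowest sufficiently large 
power of two, so the c.t.s/num.conn. ratio stays constant. A small Python 
sketch (illustration only) reproduces the table:

    # Reproduce the planned parameters: num. conn. = src ports * dst ports,
    # conntrack table size = the lowest power of two >= num. conn.
    # As the number of connections is quadrupled in every step, the
    # c.t.s/num.conn. ratio stays constant at about 1.342.

    src_ports = [2_500, 5_000, 10_000, 20_000, 40_000]
    dst_ports = [625, 1_250, 2_500, 5_000, 10_000]

    for s, d in zip(src_ports, dst_ports):
        conn = s * d
        exp = (conn - 1).bit_length()    # exponent of the next power of two
        print(f"src={s:6} dst={d:6} conn={conn:11,} conntrack=2^{exp} "
              f"ratio={(1 << exp) / conn:.3f}")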

As carrying out the measurements requires a lot of time (both my time to 
set up and start them, and the execution time itself, which is rather high 
for the last two columns), I would like to check in advance whether the 
members of the WG consider these parameters appropriate.

A question to all WG members:

Will you be convinced by results obtained with the parameters above?

If not, please point out what problems you see that could call the 
validity of my results into question!

Thank you very much in advance!

Best regards,

Gábor


On 9/4/2021 5:10 AM, Joel M. Halpern wrote:
> Or you could use fd.io, which is optimized for both performance and 
> flexible application of packet behaviors (NAT, IPSec, LISP, ...).
>
> Yours,
> Joel
>
> On 9/3/2021 9:02 PM, Lorenzo Colitti wrote:
>> Note that the Linux forwarding stack is not very optimized for 
>> forwarding. If you want high speeds you probably want to use XDP, 
>> which acts on packets as soon as the receiving NIC DMAs them into 
>> memory.
>>
>> That means you have to do all the packet modifications yourself 
>> though. Modifying IPv6 packets is trivial (just change the TTL) but 
>> implementing IPv4 NAT is more complicated.
>>
>> On Sat, 4 Sept 2021, 01:26 Gábor LENCSE <lencse@hit.bme.hu> wrote:
>>
>>     Dear Ole,
>>
>>     I have performed the scale-up tests of iptables using 1, 2, 4, 8,
>>     and 16 CPU cores. I used two "P" series nodes of NICT StarBED, which
>>     are DELL PowerEdge R430 servers, please see their hardware details
>>     here: https://starbed.nict.go.jp/en/equipment/
>>
>>     I have done some tuning of the parameters: number of connections:
>>     4,000,000; src ports: 4,000; dst ports: 1,000; conntrack table size:
>>     2^23; hash size = connection table size.
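
For readers who want to reproduce this setup, a hedged sketch of how these 
two parameters are typically set on a Linux DUT (the exact commands were 
not given in this mail, and the knob locations may vary with kernel 
version):

    # Illustrative sketch only (not from the original mail): set the
    # conntrack table size and hash size on a Linux DUT via procfs/sysfs.
    # Assumes root privileges and that the nf_conntrack module is loaded.

    CONNTRACK_TABLE_SIZE = 2 ** 23      # conntrack table size used above
    HASH_SIZE = CONNTRACK_TABLE_SIZE    # "hash size = connection table size"

    def write_kernel_param(path: str, value: int) -> None:
        with open(path, "w") as f:      # requires root privileges
            f.write(str(value))

    write_kernel_param("/proc/sys/net/netfilter/nf_conntrack_max",
                       CONNTRACK_TABLE_SIZE)
    write_kernel_param("/sys/module/nf_conntrack/parameters/hashsize",
                       HASH_SIZE)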
>>
>>     I think that the results are quite good, both the number of
>>     connections per second and throughput scaled up quite well with the
>>     number of CPU cores.
>>
>>     num. CPU cores          1          2          4          8         16
>>     src ports           4,000      4,000      4,000      4,000      4,000
>>     dst ports           1,000      1,000      1,000      1,000      1,000
>>     num. conn.      4,000,000  4,000,000  4,000,000  4,000,000  4,000,000
>>     conntrack t. s.      2^23       2^23       2^23       2^23       2^23
>>     hash table size     c.t.s      c.t.s      c.t.s      c.t.s      c.t.s
>>     c.t.s/num.conn.     2.097      2.097      2.097      2.097      2.097
>>     num. exp.              10         10         10         10         10
>>     error                 100        100        100      1,000      1,000
>>     cps median          223.5      371.1      708.7      1,341      2,383
>>     cps min             221.6      367.7      701.7      1,325      2,304
>>     cps max             226.7      375.9      723.6      1,376      2,417
>>     cps rel. scale up       1      0.830      0.793      0.750      0.666
>>     throughput median   414.9      742.3      1,379      2,336      4,557
>>     throughput min      413.9      740.6      1,373      2,311      4,436
>>     throughput max      416.1      746.9      1,395      2,361      4,627
>>     tp. rel. scale up       1      0.895      0.831      0.704      0.686
>>
>>     As you can see, the performance of the 16-core machine is about 10x
>>     that of a single core. I think these are very good results for the
>>     scale-up of a stateful NAT44 implementation.
>>
>>     What do you think?
>>
>>     Best regards,
>>
>>     Gábor
>>
>>     On 8/18/2021 10:52 PM, Gábor LENCSE wrote:
>>>     Dear Ole,
>>>
>>>     Thank you very much for your reply!
>>>
>>>     Please see my answers inline.
>>>
>>>     On 8/17/2021 10:52 PM, otroan@employees.org wrote:
>>>>     Gábor,
>>>>
>>>>     Thanks for some great work!
>>>>     I will try to get a more thorough look later, but for a first 
>>>> set of comments.
>>>>
>>>>     For methodology.
>>>>       - Setting a baseline, e.g. by measuring base IPv4 forwarding 
>>>> would be useful in establishing how much extra work is involved 
>>>> doing the translation.
>>>     I had the measurement results ready for both IPv4 and IPv6 kernel
>>>     routing (20 experiments, error=1,000). The table below shows the
>>>     number of all forwarded packets of the bidirectional traffic (in
>>>     million frames per second).
>>>
>>>     Linux kernel routing     IPv4     IPv6
>>>     throughput median       9.471    9.064
>>>     throughput min          9.443    9.029
>>>     throughput max          9.486    9.088
>>>
>>>
>>>>       - Scaling linearly by number of cores is challenging in these 
>>>> systems. It would be interesting to see the results for 1, 2, 4, 8 
>>>> cores and not only
>>>>         for the 8 core system.
>>>
>>>     Until the current measurements, I always did exactly this kind of
>>>     scale-up test. For example, I compared the performance of
>>>     different DNS64 implementations using 1, 2, 4, 8, 16 cores in [2]
>>>     and the performance of different authoritative DNS servers using
>>>     1, 2, 4, 8, 16, 32 cores in [3]. I have gained some interesting
>>>     experience with switching the CPU cores on and off. First, I did
>>>     the on/off switching of the i-th CPU core by writing 1/0 into the
>>>     /sys/devices/system/cpu/cpu$i/online file of the running Linux
>>>     kernel [2]. Whereas this seemed to work well when the query rates
>>>     were moderate (a few times ten thousand queries per second), it
>>>     caused problems in the second case, when I used query rates of up
>>>     to 3 million queries per second, so there I rather set the number
>>>     of active CPU cores of the DUT by using the maxcpus=n kernel
>>>     parameter.
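
A minimal sketch of the sysfs-based switching described above 
(illustration only; it assumes root privileges, and cpu0 usually cannot be 
taken offline):

    # Illustrative sketch: switch CPU cores on/off through sysfs, as
    # described above. Requires root; cpu0 usually cannot be set offline.

    def set_core_online(core: int, online: bool) -> None:
        with open(f"/sys/devices/system/cpu/cpu{core}/online", "w") as f:
            f.write("1" if online else "0")

    # Example: keep only cores 0-3 online to emulate a 4-core DUT
    for core in range(1, 16):
        set_core_online(core, online=(core < 4))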
>>>
>>>     [2] G. Lencse and Y. Kadobayashi, "Benchmarking DNS64
>>>     Implementations: Theory and Practice", /Computer Communications/
>>>     (Elsevier), vol. 127, no. 1, pp. 61-74, September 1, 2018, DOI:
>>>     10.1016/j.comcom.2018.05.005 Review version in PDF
>>> <http://www.hit.bme.hu/~lencse/publications/ECC-2018-DNS64-BM-for-review.pdf>
>>>
>>>     [3] G. Lencse, "Benchmarking Authoritative DNS Servers", /IEEE
>>>     Access/, vol. 8. pp. 130224-130238, July 2020. DOI:
>>>     10.1109/ACCESS.2020.3009141 Revised version in PDF
>>> <http://www.hit.bme.hu/~lencse/publications/IEEE-Access-2020-AuthDNS-revised.pdf>
>>>
>>>     So now I feel I have the experience to perform such measurements.
>>>
>>>     I plan to start a series of measurements with 1, 2, 4, 8 cores. To
>>>     save execution time, I need to choose one fixed number of
>>>     connections. (It would take very long to test all possible
>>>     combinations if I used several different numbers of connections!)
>>>     As I expect good scale-up, and so far we have seen only the 8-core
>>>     performance, I expect that the single-core performance is likely
>>>     between 1/6 and 1/8 of it. This means that I should use a moderate
>>>     number of connections, otherwise filling up the conntrack table
>>>     would take too long.
>>>
>>>     So I plan to use the following parameters: number of connections:
>>>     4,000,000; src ports: 4,000; dst ports: 1,000; conntrack table
>>>     size: 2^22; hash size: c.t.s/8.
>>>
>>>     Do you agree with it?
>>>
>>>>     Regarding the number of connections. You should see a drop-off 
>>>> when the size of the session table is larger than the L3 cache.
>>>
>>>     Good catch!
>>>
>>>     Although I cannot directly measure the size of the conntrack
>>>     table, I can record the change of the free memory during the
>>>     experiments and thus estimate when this happens.
>>>     I do not promise to deal with it now, but I plan to address it
>>>     later on.
>>>
>>>>     You might also balance maximum bandwidth available with maximum 
>>>> session size. 250G of forwarding per socket on a PCIe 3.0 system.
>>>
>>>     Do you mean "250G" as 250Gbps link?
>>>
>>>     I am sorry, but it is beyond my dreams. The systems I usually use
>>>     (at NICT StarBED, Japan) have 10Gbps NICs. Now I use two HPE
>>>     servers with 10/25Gbps NICs interconnected by 10Gbps DAC cables
>>>     (at the Széchenyi István University, Győr, Hungary), and even
>>>     though my colleague has purchased 25Gbps DAC cables as I
>>>     requested, I do not use them, as my current rates are very far
>>>     from the maximum frame rate (14,880,952fps) of the 10Gbps links.
>>>
>>>     So it would be nice, if someone having higher performance hardware
>>>     could repeat my measurements. Any volunteers?
>>>
>>>
>>>>     I might have read a little too much between the lines in the 
>>>> draft, but I got a feeling (and just that), that the tests were 
>>>> coloured a bit by the behaviour of a particular implementation 
>>>> (iptables).
>>>
>>>     Yes, my results reflect only the behavior of iptables.
>>>
>>>     On the one hand, if the results of one particular implementation 
>>> show:
>>>     - poor speed up, then it does not prove that the technology is
>>>     bad, other implementations might perform much better.
>>>     - good speed up, then it proves that the technology is good, but
>>>     other implementations may still perform much worse.
>>>
>>>     Of course, our time is rather limited, thus my approach is that I
>>>     would like to test the implementations we expect to be good
>>>     enough. As for NAT44, I expect that iptables is among them. If
>>>     anyone can suggest a better one, I am open to try it. It should be
>>>     free software for two reasons:
>>>     - I do not have a budget to buy a proprietary implementation
>>>     - the licenses of some vendors prohibit the publication of
>>>     benchmarking results.
>>>
>>>>     You are of course measuring how a particular implementation (or 
>>>> set of implementations) scales and we as the IETF have to deduce if 
>>>> the scaling limitations are in the implementation or in the 
>>>> protocol mechanism. We do know that you can build large scale 
>>>> NAT44s and NAT64s, so back to my first point, it might be useful to 
>>>> provide a baseline to give an idea of the additional cost 
>>>> associated with the extra treatment of packets.
>>>
>>>     Yes, my results (9Mfps vs. 3Mfps) definitely show that stateful
>>>     NAT44 is not without performance costs.
>>>
>>>>     It would certainly be interesting to run your tool against the 
>>>> VPP implementation of NAT.
>>>>     Here are some NAT performance results from runs with the CSIT 
>>>> benchmarking setup for VPP:
>>>> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/packet_throughput_graphs/nat44.html
>>>> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/throughput_speedup_multi_core/nat44.html
>>>>
>>>     As far as I could figure out, their stateless tests seem to scale
>>>     up well up to 4 CPU cores, but the stateful tests do not scale up,
>>>     whether we look at connections per second or at throughput.
>>>
>>>     I really wonder how iptables behaves! :-)
>>>
>>>     Best regards,
>>>
>>>     Gábor
>>>
>>>>     Best regards,
>>>>     Ole
>>>>
>>>>
>>>>
>>>>>     On 17 Aug 2021, at 14:51, Gábor LENCSE <lencse@hit.bme.hu> wrote:
>>>>>
>>>>>     Dear All,
>>>>>
>>>>>     At IETF 111, I promised to perform scale-up test for stateful 
>>>>> NAT64 and also for stateful NAT44. (Our primary focus is stateful 
>>>>> NAT64, as it is a part of 464XLAT. As there was interest in CGN, I 
>>>>> am happy to invest some time into it, too. I think the comparison 
>>>>> of their scalability may be interesting for several people.)
>>>>>
>>>>>     Now, I am in the phase of preliminary tests. I mean that I 
>>>>> have performed tests to explore the behavior of both the Tester 
>>>>> (stateful branch of siitperf) and the benchmarked application to 
>>>>> be able to determine the conditions for the production tests.
>>>>>
>>>>>     Now I write to ask all  interested "IPv6 Operations" mailing 
>>>>> list members to review my report below from two points of view:
>>>>>     - Do you consider the methodology sound?
>>>>>     - Do you think that the parameters are appropriate to provide 
>>>>> meaningful results for network operators?
>>>>>
>>>>>     Control question:
>>>>>     - Will you support the inclusion of the results into 
>>>>> draft-ietf-v6ops-transition-comparison and the publication of the 
>>>>> draft as an RFC even if the results contradict your view of the 
>>>>> scalability of the stateful technologies?
>>>>>
>>>>>     As I had more experience with iptables than Jool, I started 
>>>>> with the scale-up tests of stateful NAT44.
>>>>>
>>>>>     Now, I give a short description of the test system.
>>>>>
>>>>>     I used two identical HPE ProLiant DL380 Gen10 (DL380GEN10) 
>>>>> servers with the following configuration:
>>>>>     - 2 Intel 5218 CPUs (the clock frequency was fixed at 2.3 GHz)
>>>>>     - 256GB (8x32GB) DDR4 SDRAM @ 2666 MHz (accessed quad channel)
>>>>>     - 2-port BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet 
>>>>> Controller (used with 10Gbps DAC cables)
>>>>>
>>>>>     The servers have 4 NUMA nodes (0, 1, 2, 3) and the NICs belong 
>>>>> to NUMA node 1. All the interrupts caused by the NICs are 
>>>>> processed by 8 of the 32 CPU cores (the ones that belong to NUMA 
>>>>> node 1, that is, cores 8 to 15); thus, regarding our 
>>>>> measurements, the DUT (Device Under Test) is more or less 
>>>>> equivalent to an 8-core server. :-)
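
To check the same on other hardware, a small Python sketch (illustration 
only; the interface name is a hypothetical placeholder) that reports the 
NUMA node of a NIC and the CPU cores local to that node:

    # Illustrative sketch: find the NUMA node of a NIC and the CPU cores of
    # that node via sysfs. IFACE is a hypothetical interface name.

    IFACE = "ens1f0"

    with open(f"/sys/class/net/{IFACE}/device/numa_node") as f:
        node = int(f.read())

    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cores = f.read().strip()

    print(f"{IFACE} is on NUMA node {node}; its local CPU cores are {cores}")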
>>>>>
>>>>>     I used Debian Linux 9.13 with 5.12.14 kernel on both servers.
>>>>>
>>>>>     The test setup was the same as our draft describes:
>>>>> https://datatracker.ietf.org/doc/html/draft-lencse-bmwg-benchmarking-stateful
>>>>>
>>>>>                   +--------------------------------------+
>>>>>          10.0.0.2 |Initiator                    Responder| 198.19.0.2
>>>>>     +-------------|                Tester                |<------------+
>>>>>     | private IPv4|                         [state table]| public IPv4 |
>>>>>     |             +--------------------------------------+             |
>>>>>     |                                                                  |
>>>>>     |             +--------------------------------------+             |
>>>>>     |    10.0.0.1 |                 DUT:                 | 198.19.0.1  |
>>>>>     +------------>|        Stateful NATxy gateway        |-------------+
>>>>>       private IPv4|     [connection tracking table]      | public IPv4
>>>>>                   +--------------------------------------+
>>>>>
>>>>>
>>>>>     Both the Tester and the DUT were the above mentioned HPE servers.
>>>>>
>>>>>     I wanted to follow my original plan included in my IETF 111 
>>>>> presentation to measure both the maximum connection establishment 
>>>>> rate and the throughput using the following numbers of connections: 
>>>>> 1 million, 10 million, 100 million, and 1 billion. However, the 
>>>>> DUT became unavailable during my tests with 1 billion connections, 
>>>>> as the connection tracking table exhausted the memory. Thus I 
>>>>> reduced the highest number of connections to 500 million.
>>>>>
>>>>>     First, I had to perform the tests for the maximum connection 
>>>>> establishment rate. To achieve the required growing number of 
>>>>> connections, I always used 50,000 different source port numbers 
>>>>> (from 1 to 50,000) and increased the number of destination ports as 
>>>>> 20, 200, 2,000, and 10,000 (instead of 20,000).
>>>>>
>>>>>     As for the size of the connection tracking table, I used 
>>>>> powers of 2. The very first value was 2^20=1,048,576 and then I 
>>>>> always used the lowest sufficiently large power of two, that is, 
>>>>> 2^24, etc. The hash size parameter was always set to 1/8 of the 
>>>>> size of the connection tracking table.
>>>>>
>>>>>     I have performed binary search to determine the maximum 
>>>>> connection establishment rate. The stopping criterion was 
>>>>> expressed by the "error" parameter, which is the difference of the 
>>>>> upper and lower bound. It was set to 1,000 in all cases except the 
>>>>> last one: then it was set to 10,000 to save some execution time. 
>>>>> The experiments were executed 10 times except the last one (only 3 
>>>>> times).
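
For clarity, a hedged sketch of such a binary search (illustration only; 
siitperf's actual logic may differ in its details):

    # Illustrative sketch: binary search for the highest sustainable rate,
    # stopped when the gap between the upper and lower bound falls below
    # "error". test_passes(rate) is assumed to run one elementary test at
    # the given rate and return True on success.

    def highest_passing_rate(lower: int, upper: int, error: int, test_passes) -> int:
        while upper - lower > error:
            rate = (lower + upper) // 2
            if test_passes(rate):
                lower = rate      # rate sustained: raise the lower bound
            else:
                upper = rate      # rate failed: lower the upper bound
        return lower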
>>>>>
>>>>>     I have calculated the median, the minimum and the maximum of 
>>>>> the measured connection establishment rates. I hope the values of 
>>>>> the table below will be readable in my e-mail:
>>>>>
>>>>>     Num. conn.         1,000,000  10,000,000  100,000,000  500,000,000
>>>>>     src ports              50,000      50,000       50,000       50,000
>>>>>     dst ports                  20         200        2,000       10,000
>>>>>     conntrack t. s.          2^20        2^24         2^27         2^29
>>>>>     hash table size       c.t.s/8     c.t.s/8      c.t.s/8      c.t.s/8
>>>>>     num. exp.                  10          10           10            3
>>>>>     error                   1,000       1,000        1,000       10,000
>>>>>     cps median              1.124       1.311         1.01        0.742
>>>>>     cps min                 1.108       1.308        1.007        0.742
>>>>>     cps max                 1.128       1.317        1.013        0.742
>>>>>     c.t.s/n.c.              1.049       1.678        1.342        1.074
>>>>>
>>>>>     The cps (connections per second) values are given in million 
>>>>> connections per second.
>>>>>
>>>>>     The rise of the median to 1.311M at 10 million connections 
>>>>> (from 1.124M at 1 million connections) is caused by the fact that 
>>>>> the size of the connection tracking table was quite large compared 
>>>>> to the actual number of connections. I have included this 
>>>>> proportion in the last line of the table. Thus only the very first 
>>>>> and very last columns are directly comparable. If we consider that 
>>>>> the maximum connection establishment rate decreased only from 
>>>>> 1.124M to 0.742M while the total number of connections increased 
>>>>> from 1M to 500M, I think we can be satisfied with the scale-up of 
>>>>> iptables.
>>>>>     (Of course, I can see the limitations of my measurements: the 
>>>>> error of 10,000 was too high, so the binary search always finished 
>>>>> at the same value.)
>>>>>
>>>>>     As I wanted more informative results, I abandoned the 
>>>>> decimal increase of the number of connections. I rather used a 
>>>>> binary increase to ensure that the proportion of the number of 
>>>>> connections and the size of the connection tracking table remains 
>>>>> constant.
>>>>>
>>>>>     I had another concern. Earlier, I pointed out that the 
>>>>> role of the source and destination port numbers is not completely 
>>>>> symmetrical in the hash function that distributes the interrupts 
>>>>> among the CPU cores [1]. Therefore, I decided to increase the 
>>>>> source and destination port ranges together.
>>>>>
>>>>>     [1] G. Lencse, "Adding RFC 4814 Random Port Feature to 
>>>>> Siitperf: Design, Implementation and Performance Estimation", 
>>>>> International Journal of Advances in Telecommunications, 
>>>>> Electrotechnics, Signals and Systems, vol 9, no 3, pp. 18-26, 
>>>>> 2020, DOI: 10.11601/ijates.v9i3.291 Full paper in PDF
>>>>>
>>>>>     The results are shown in the table below (first, please see 
>>>>> only the same lines as in the table above):
>>>>>
>>>>>     Num. conn.        1,562,500  6,250,000  25,000,000  100,000,000  400,000,000
>>>>>     src ports             2,500      5,000      10,000       20,000       40,000
>>>>>     dst ports               625      1,250       2,500        5,000       10,000
>>>>>     conntrack t. s.        2^21       2^23        2^25         2^27         2^29
>>>>>     hash table size     c.t.s/8    c.t.s/8     c.t.s/8      c.t.s/8      c.t.s/8
>>>>>     num. exp.                10         10          10           10            5
>>>>>     error                 1,000      1,000       1,000        1,000        1,000
>>>>>     cps median            1.216      1.147       1.085         1.02         0.88
>>>>>     cps min                1.21       1.14       1.077        1.015        0.878
>>>>>     cps max               1.224      1.153       1.087        1.024        0.884
>>>>>     c.t.s/n.c.            1.342      1.342       1.342        1.342        1.342
>>>>>     throughput median     3.605      3.508       3.227        3.242         2.76
>>>>>     throughput min        3.592      3.494       3.213        3.232        2.748
>>>>>     throughput max        3.627      3.521       3.236        3.248        2.799
>>>>>
>>>>>     Now, the maximum connection setup rate deteriorates only 
>>>>> very slightly, from 1.216M to 1.02M, while the number of 
>>>>> connections increased from 1,562,500 to 100,000,000. (A slightly 
>>>>> higher degradation can be observed only in the last column. I will 
>>>>> return to it later.)
>>>>>
>>>>>     The last 3 lines of the table show the median, minimum and 
>>>>> maximum values of the throughput. As required by RFC 8219, 
>>>>> throughput was determined by using bidirectional traffic and the 
>>>>> duration of the elementary steps of the binary search was 60s. 
>>>>> (The binary search was executed 10 times, except for the last 
>>>>> column, where it was done only 5 times to save execution time.)
>>>>>
>>>>>     Note: commercial testers usually report the total number of 
>>>>> frames forwarded. Siitperf reports the number of frames per 
>>>>> direction. Thus, in the case of bidirectional tests, the reported 
>>>>> value should be multiplied by 2 to obtain the total number of 
>>>>> frames per second. I did so, too.
>>>>>
>>>>>     Although RFC 2544/5180/8219 require testing with bidirectional 
>>>>> traffic, I suspect that unidirectional throughput may also 
>>>>> be interesting for ISPs, as home users usually have much more 
>>>>> download than upload traffic...
>>>>>
>>>>>     The degradation of the throughput is also moderate, except at 
>>>>> the last column. I attribute the higher decrease of the throughput 
>>>>> at 400,000,000 connections (as well as that of the maximum 
>>>>> connection establishment rate) to NUMA issues. In more detail: 
>>>>> this time nearly the entire memory of the server was in use, 
>>>>> whereas in the previous cases iptables could use NUMA local memory 
>>>>> (if it was smart enough to do that). Unfortunately I cannot add 
>>>>> more memory to these computers to check my hypothesis, but in a 
>>>>> few weeks, I hope to be able to use some DELL PowerEdge R430 
>>>>> computers that have only two NUMA nodes and 384GB RAM, see the "P" 
>>>>> nodes here: https://starbed.nict.go.jp/en/equipment/
>>>>>
>>>>>     Now, I kindly ask everybody who is interested in the scale-up 
>>>>> tests to comment on my measurements regarding both the methodology 
>>>>> and the parameters!
>>>>>
>>>>>     I am happy to provide more details, if needed.
>>>>>
>>>>>     I plan to start working on the NAT64 tests in the upcoming 
>>>>> weeks. I plan to use Jool. I have very limited experience with it, 
>>>>> so first, I need to find out, how I can tune its connection 
>>>>> tracking table parameters to be able to perform fair scale up tests.
>>>>>
>>>>>     In the meantime, please comment on my above experiments and 
>>>>> results so that I may improve the tests to provide convincing 
>>>>> results for all interested parties!
>>>>>
>>>>>     Best regards,
>>>>>
>>>>>     Gábor
>>>>>
>>>>>     P.S.: If someone would volunteer to repeat my experiments, I 
>>>>> would be happy to share my scripts and experience and to provide 
>>>>> support for siitperf, which is available 
>>>>> from: https://github.com/lencsegabor/siitperf
>>>>>     The stateful branch of siitperf is in a very alpha state. It 
>>>>> has only partial documentation in this paper, which is still under 
>>>>> review: http://www.hit.bme.hu/~lencse/publications/SFNAT64-tester-for-review.pdf
>>>>> (It may be revised or removed at any time.)
>>>>>     The support for unique pseudorandom source and destination 
>>>>> port number combinations, which I used for my current tests, is not 
>>>>> described there, as I invented it after the submission of that 
>>>>> paper. (I plan to update the paper if I get a chance to revise it. 
>>>>> In the unlikely case that the paper is accepted as is, I plan to 
>>>>> write a shorter paper about the new features. For now, the 
>>>>> commented source code is the most reliable documentation.)