Re: [v6ops] Scale-up tests of iptables for the number of CPU cores -- Re: Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

Tom Herbert <tom@herbertland.com> Mon, 06 September 2021 13:06 UTC

From: Tom Herbert <tom@herbertland.com>
Date: Mon, 06 Sep 2021 09:05:53 -0400
Message-ID: <CALx6S37jhPeE2Z1tKH-g-=p=rh8CLhiVZApZyn7CMQh173xiOg@mail.gmail.com>
To: Gábor LENCSE <lencse@hit.bme.hu>
Cc: "Joel M. Halpern" <jmh@joelhalpern.com>, Lorenzo Colitti <lorenzo=40google.com@dmarc.ietf.org>, "v6ops@ietf.org WG" <v6ops@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/v6ops/nNc4OcSWORo-djjcRmb0gEFB1wo>
Subject: Re: [v6ops] Scale-up tests of iptables for the number of CPU cores -- Re: Preliminary scale-up tests and results for draft-ietf-v6ops-transition-comparison -- request for review

On Mon, Sep 6, 2021, 5:10 AM Gábor LENCSE <lencse@hit.bme.hu> wrote:

> Dear Joel and Lorenzo,
>
> Thank you for your suggestions! I am sure that solutions eliminating the
> TCP/IP socket interface may produce much higher performance.
>
> However, the aim of my current tests is to check the scalability of
> stateful technologies. From our draft (
> https://datatracker.ietf.org/doc/html/draft-ietf-v6ops-transition-comparison
> ), we need to eliminate the following section:
>

Hi Gabor,

You might want to look at Ptables, which was introduced at the most recent
Netdev conference. It is explicitly designed to address the performance
challenges of iptables.

https://netdevconf.info/0x15/session.html?Introducing-Ptables

Tom

>    Stateful technologies, 464XLAT and DS-Lite (and also NAT444) can
>    therefore be much more efficient in terms of port allocation and thus
>    public IP address saving.  The price is the stateful operation in the
>    service provider network, which allegedly does not scale up well.  It
>    should be noticed that in many cases, all those factors may depend on
>    how it is actually implemented.
>
>    XXX MEASUREMENTS ARE PLANNED TO TEST IF THE ABOVE IS TRUE.  XXX
>
> As for scalability, the single-core performance in itself is not what
> matters; rather, we focus on two things:
>
> 1. How does the performance scale up with the number of CPU cores?
>
> It can be seen in the table below:
>
> num. CPU cores          1          2          4          8         16
> src ports             4,000      4,000      4,000      4,000      4,000
> dst ports             1,000      1,000      1,000      1,000      1,000
> num. conn.        4,000,000  4,000,000  4,000,000  4,000,000  4,000,000
> conntrack t. s.        2^23       2^23       2^23       2^23       2^23
> hash table size       c.t.s      c.t.s      c.t.s      c.t.s      c.t.s
> c.t.s/num.conn.       2.097      2.097      2.097      2.097      2.097
> num. exp.                10         10         10         10         10
> error                   100        100         100     1,000      1,000
> cps median            223.5      371.1       708.7     1,341      2,383
> cps min               221.6      367.7       701.7     1,325      2,304
> cps max               226.7      375.9       723.6     1,376      2,417
> cps rel. scale up         1      0.830       0.793     0.750      0.666
> throughput median     414.9      742.3       1,379     2,336      4,557
> throughput min        413.9      740.6       1,373     2,311      4,436
> throughput max        416.1      746.9       1,395     2,361      4,627
> tp. rel. scale up         1      0.895       0.831     0.704      0.686
>
> Of course, the performance of the 16-core system is only about 10 times that
> of a single core, not 16 times, but IMHO it is quite good.
>
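> For clarity, the "rel. scale up" rows above are just the total speed-up
> divided by the number of cores. A small illustrative Python snippet (not
> part of siitperf) that reproduces the cps row:
>
>     cores      = [1, 2, 4, 8, 16]
>     cps_median = [223.5, 371.1, 708.7, 1341, 2383]  # cps medians from the table
>
>     for n, cps in zip(cores, cps_median):
>         speedup = cps / cps_median[0]   # total speed-up vs. a single core
>         rel     = speedup / n           # the "cps rel. scale up" row
>         print(f"{n:2} cores: {speedup:5.2f}x total, {rel:.3f} relative")
>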
> For example, please refer to the scale-up results of NSD (a high
> performance authoritative DNS server) in Table 9 of this (open access)
> paper:
>
> G. Lencse, "Benchmarking Authoritative DNS Servers", *IEEE Access*, vol.
> 8. pp. 130224-130238, July 2020. DOI: 10.1109/ACCESS.2020.3009141
> https://ieeexplore.ieee.org/document/9139929
>
> For example, the scale-up of the medians in Table 9 is:
> 1,454,661/177,432=8.2-fold performance using 16 cores, that is, the
> relative speed up is only 0.52. And DNS is not a stateful technology!
>
>
> 2. How does the performance degrade as the number of sessions stored in the
> connection tracking table of the stateful NATxy device grows?
>
> Regarding this, I plan to perform measurements with the following
> parameters:
>
> src ports             2,500      5,000     10,000      20,000      40,000
> dst ports               625      1,250      2,500       5,000      10,000
> num. conn.        1,562,500  6,250,000 25,000,000 100,000,000 400,000,000
> conntrack t. s.        2^21       2^23       2^25        2^27        2^29
> hash table size       c.t.s      c.t.s      c.t.s       c.t.s       c.t.s
> c.t.s/num.conn.       1.342      1.342      1.342       1.342       1.342
> num. exp.                10         10         10          10          10
> error                 1,000      1,000      1,000       1,000       1,000
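>
> The planned rows above follow a simple rule: the number of connections is the
> product of the source and destination port ranges, and the conntrack table
> size is the smallest power of two not smaller than the number of connections,
> which is why the c.t.s/num.conn. ratio is 1.342 in every column. An
> illustrative Python sketch of the derivation (not part of siitperf):
>
>     import math
>
>     src_ports = [2_500, 5_000, 10_000, 20_000, 40_000]
>     dst_ports = [  625, 1_250,  2_500,  5_000, 10_000]
>
>     for sp, dp in zip(src_ports, dst_ports):
>         conns = sp * dp                           # one connection per port pair
>         cts   = 2 ** math.ceil(math.log2(conns))  # smallest power of two >= conns
>         print(f"{conns:>11,} conns -> conntrack size 2^{int(math.log2(cts))}, "
>               f"ratio {cts / conns:.3f}")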
>
> As carrying out the measurements requires a lot of time (both my time to set
> up and start the measurements, and the execution time itself, which is rather
> high for the last two columns), I would like to check in advance whether the
> members of the WG consider these parameters good.
>
> A question to all WG members:
>
> Will you be convinced by the results obtained using the parameters above?
>
> If not, please point out any problem you see that may call the validity of my
> results into question!
>
> Thank you very much in advance!
>
> Best regards,
>
> Gábor
>
>
> On 9/4/2021 5:10 AM, Joel M. Halpern wrote:
>
> Or you could use fd.io, which is optimized for both performance and
> flexible application of packet behaviors (NAT, IPSec, LISP, ...).
>
> Yours,
> Joel
>
> On 9/3/2021 9:02 PM, Lorenzo Colitti wrote:
>
> Note that the Linux forwarding stack is not very optimized for forwarding.
> If you want high speeds you probably want to use XDP, which acts on packets
> as soon as the receiving NIC DMAs them into memory.
>
> That means you have to do all the packet modifications yourself though.
> Modifying IPv6 packets is trivial (just change the TTL) but implementing
> IPv4 NAT is more complicated.
>
> On Sat, 4 Sept 2021, 01:26 Gábor LENCSE <lencse@hit.bme.hu> wrote:
>
>     Dear Ole,
>
>     I have performed the scale-up tests of iptables using 1, 2, 4, 8,
>     and 16 CPU cores. I used two "P" series nodes of NICT StarBED, which
>     are DELL PowerEdge R430 servers; please see their hardware details
>     here: https://starbed.nict.go.jp/en/equipment/
>
>     I have done some tuning of the parameters: number of connections:
>     4,000,000; src ports: 4,000; dst ports: 1,000; conntrack table size:
>     2^23; hash size = connection table size.
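>
>     (For reference, this kind of conntrack tuning can be done roughly as in
>     the Python sketch below; it only illustrates the knobs involved, the
>     exact commands I used may have differed. It assumes the nf_conntrack
>     module is loaded and that the script runs as root.)
>
>         CONNTRACK_SIZE = 2 ** 23          # connection tracking table size
>         HASH_SIZE      = CONNTRACK_SIZE   # hash size = connection table size
>
>         # maximum number of tracked connections
>         with open("/proc/sys/net/netfilter/nf_conntrack_max", "w") as f:
>             f.write(str(CONNTRACK_SIZE))
>
>         # number of hash buckets of the conntrack hash table
>         with open("/sys/module/nf_conntrack/parameters/hashsize", "w") as f:
>             f.write(str(HASH_SIZE))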
>
>     I think that the results are quite good: both the number of
>     connections per second and the throughput scaled up quite well with the
>     number of CPU cores.
>
>     num. CPU cores          1          2          4          8         16
>     src ports             4,000      4,000      4,000      4,000      4,000
>     dst ports             1,000      1,000      1,000      1,000      1,000
>     num. conn.        4,000,000  4,000,000  4,000,000  4,000,000  4,000,000
>     conntrack t. s.        2^23       2^23       2^23       2^23       2^23
>     hash table size       c.t.s      c.t.s      c.t.s      c.t.s      c.t.s
>     c.t.s/num.conn.       2.097      2.097      2.097      2.097      2.097
>     num. exp.                10         10         10         10         10
>     error                   100        100        100      1,000      1,000
>     cps median            223.5      371.1      708.7      1,341      2,383
>     cps min               221.6      367.7      701.7      1,325      2,304
>     cps max               226.7      375.9      723.6      1,376      2,417
>     cps rel. scale up         1      0.830      0.793      0.750      0.666
>     throughput median     414.9      742.3      1,379      2,336      4,557
>     throughput min        413.9      740.6      1,373      2,311      4,436
>     throughput max        416.1      746.9      1,395      2,361      4,627
>     tp. rel. scale up         1      0.895      0.831      0.704      0.686
>
>     As you can see, the performance of a 16-core machine is about 10x the
>     performance of a single-core machine. I think this is a very good
>     result for the scale-up of a stateful NAT44 implementation.
>
>     What do you think?
>
>     Best regards,
>
>     Gábor
>
>     On 8/18/2021 10:52 PM, Gábor LENCSE wrote:
>
>     Dear Ole,
>
>     Thank you very much for your reply!
>
>     Please see my answers inline.
>
>     On 8/17/2021 10:52 PM, otroan@employees.org wrote:
>
>     Gábor,
>
>     Thanks for some great work!
>     I will try to take a more thorough look later, but here is a first set of
> comments.
>
>     For methodology.
>       - Setting a baseline, e.g. by measuring plain IPv4 forwarding, would
> be useful in establishing how much extra work is involved in doing the
> translation.
>
>     I had the measurement results ready both for IPv4 and IPv6 kernel
>     routing (20 experiments, error=1000). The table below shows the
>     total number of forwarded frames of bidirectional traffic (in
>     million frames per second).
>
>     Linux kernel routing     IPv4      IPv6
>     throughput median        9.471     9.064
>     throughput min           9.443     9.029
>     throughput max           9.486     9.088
>
>
>       - Scaling linearly by number of cores is challenging in these
> systems. It would be interesting to see the results for 1, 2, 4, 8 cores
> and not only
>         for the 8 core system.
>
>
>     Before the current measurements, I always did exactly this kind of
>     scale-up test. For example, I compared the performance of
>     different DNS64 implementations using 1, 2, 4, 8, 16 cores in [2]
>     and the performance of different authoritative DNS servers using
>     1, 2, 4, 8, 16, 32 cores in [3]. I have gained some interesting
>     experience with switching the CPU cores on/off. First, I did the
>     on/off switching of the i-th CPU core by writing 1/0 values into
>     the /sys/devices/system/cpu/cpu$i/online file of the running Linux
>     kernel [2]. Whereas it seemed to work well when the query rates
>     were moderate (a few tens of thousands of queries per second), it
>     caused problems in the second case, when I used query rates of up
>     to 3 million queries per second, and thus I instead set the number
>     of active CPU cores at the DUT by using the maxcpus=n kernel
>     parameter.
>
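>     (The two ways of limiting the number of active CPU cores mentioned
>     above look roughly like the Python sketch below; it is only an
>     illustration, it must run as root, and cpu0 usually cannot be taken
>     offline.)
>
>         def set_core_online(i: int, online: bool) -> None:
>             """Hot-(un)plug CPU core i via sysfs, as in [2]."""
>             with open(f"/sys/devices/system/cpu/cpu{i}/online", "w") as f:
>                 f.write("1" if online else "0")
>
>         # e.g. leave only cores 0-3 online on a 16-core DUT:
>         # for i in range(4, 16):
>         #     set_core_online(i, False)
>
>         # The alternative used in [3]: boot the DUT with the "maxcpus=n"
>         # kernel parameter (e.g. maxcpus=4 on the kernel command line).
>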
>     [2] G. Lencse and Y. Kadobayashi, "Benchmarking DNS64
>     Implementations: Theory and Practice", /Computer Communications/
>     (Elsevier), vol. 127, no. 1, pp. 61-74, September 1, 2018, DOI:
>     10.1016/j.comcom.2018.05.005 Review version in PDF
>
> <http://www.hit.bme.hu/~lencse/publications/ECC-2018-DNS64-BM-for-review.pdf>
> <http://www.hit.bme.hu/~lencse/publications/ECC-2018-DNS64-BM-for-review.pdf>
>
>     [3] G. Lencse, "Benchmarking Authoritative DNS Servers", /IEEE
>     Access/, vol. 8. pp. 130224-130238, July 2020. DOI:
>     10.1109/ACCESS.2020.3009141 Revised version in PDF
>
> <http://www.hit.bme.hu/~lencse/publications/IEEE-Access-2020-AuthDNS-revised.pdf>
> <http://www.hit.bme.hu/~lencse/publications/IEEE-Access-2020-AuthDNS-revised.pdf>
>
>     So now I feel I have the experience to perform such measurements.
>
>     I plan to start a series of measurements with 1, 2, 4, 8 cores. To
>     save execution time, I need to choose one fixed number of
>     connections. (It would take very long to test all possible
>     combinations if I used several different numbers of connections!)
>     As I expect good scale-up, and so far we have seen only the 8-core
>     performance, I expect that the single-core performance is likely
>     between 1/6 and 1/8 of it. This means that I should use a moderate
>     number of connections, otherwise filling up the conntrack
>     table would take too long.
>
>     So I plan to use the following parameters: number of connections:
>     4,000,000; src ports: 4,000; dst ports: 1,000; conntrack table
>     size: 2^22; hash size: c.t.s/8.
>
>     Do you agree with it?
>
>     Regarding the number of connections: you should see a drop-off when
> the size of the session table is larger than the L3 cache.
>
>
>     Good catch!
>
>     Although I cannot directly measure the size of the conntrack
>     table, I can record the change of the free memory during the
>     experiments and thus estimate when this happens.
>     I do not promise to deal with it now, but I plan to look into it later on.
>
>     You might also balance maximum bandwidth available with maximum
> session size. 250G of forwarding per socket on a PCIe 3.0 system.
>
>
>     Do you mean "250G" as a 250Gbps link?
>
>     I am sorry, but it is beyond my dreams. The systems I usually use
>     (at NICT StarBED, Japan) have 10Gbps NICs. Now I use two HPE
>     servers with 10/25Gbps NICs interconnected by 10Gbps DAC cables
>     (at the Széchenyi István University, Győr, Hungary), and even
>     though my colleague has purchased 25Gbps DAC cables as I
>     requested, I do not use them, as my current rates are very far
>     from the maximum frame rate (14,880,952fps) of the 10Gbps links.
>
>     So it would be nice, if someone having higher performance hardware
>     could repeat my measurements. Any volunteers?
>
>
>     I might have read a little too much between the lines in the draft,
> but I got a feeling (and just that), that the tests were coloured a bit by
> the behaviour of a particular implementation (iptables).
>
>
>     Yes, my results reflect only the behavior of iptables.
>
>     If the results of one particular implementation show:
>     - poor speed-up, then it does not prove that the technology is
>     bad; other implementations might perform much better.
>     - good speed-up, then it proves that the technology is good, but
>     other implementations may still perform much worse.
>
>     Of course, our time is rather limited, thus my approach is that I
>     would like to test the implementations we expect to be good
>     enough. As for NAT44, I expect that iptables is among them. If
>     anyone can suggest a better one, I am open to trying it. It should be
>     free software for two reasons:
>     - I do not have a budget to buy a proprietary implementation
>     - the licenses of some vendors prohibit the publication of
>     benchmarking results.
>
>     You are of course measuring how a particular implementation (or set of
> implementations) scales and we as the IETF have to deduce if the scaling
> limitations are in the implementation or in the protocol mechanism. We do
> know that you can build large scale NAT44s and NAT64s, so back to my first
> point, it might be useful to provide a baseline to give an idea of the
> additional cost associated with the extra treatment of packets.
>
>
>     Yes, my results (9Mfps vs. 3Mfps) definitely show that stateful
>     NAT44 is not without performance costs.
>
>     It would certainly be interesting to run your tool against the VPP
> implementation of NAT.
>     Here are some NAT performance results from runs with the CSIT
> benchmarking setup for VPP:
>
> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/packet_throughput_graphs/nat44.html
>
> https://docs.fd.io/csit/rls2106/report/vpp_performance_tests/throughput_speedup_multi_core/nat44.html
>
>     As far as I could figure out, their stateless tests seem to scale
>     up well up to 4 CPU cores, but the stateful tests do not scale up,
>     regardless of whether we consider connections per second or throughput.
>
>     I really wonder how iptables behaves! :-)
>
>     Best regards,
>
>     Gábor
>
>     Best regards,
>     Ole
>
>
>
>     On 17 Aug 2021, at 14:51, Gábor LENCSE <lencse@hit.bme.hu> wrote:
>
>     Dear All,
>
>     At IETF 111, I promised to perform scale-up tests for stateful NAT64
> and also for stateful NAT44. (Our primary focus is stateful NAT64, as it is
> a part of 464XLAT. As there was interest in CGN, I am happy to invest some
> time into it, too. I think the comparison of their scalability may be
> interesting for several people.)
>
>     Now, I am in the phase of preliminary tests. I mean that I have
> performed tests to explore the behavior of both the Tester (stateful branch
> of siitperf) and the benchmarked application to be able to determine the
> conditions for the production tests.
>
>     Now I am writing to ask all interested "IPv6 Operations" mailing list
> members to review my report below from two points of view:
>     - Do you consider the methodology sound?
>     - Do you think that the parameters are appropriate to provide
> meaningful results for network operators?
>
>     Control question:
>     - Will you support the inclusion of the results into
> draft-ietf-v6ops-transition-comparison and the publication of the draft as
> an RFC even if the results contradict your view of the scalability
> of the stateful technologies?
>
>     As I had more experience with iptables than Jool, I started with the
> scale-up tests of stateful NAT44.
>
>     Now, I give a short description of the test system.
>
>     I used two identical HPE ProLiant DL380 Gen10 (DL380GEN10) servers
> with the following configuration:
>     - 2 Intel 5218 CPUs (the clock frequency was fixed at 2.3GHz)
>     - 256GB (8x32GB) DDR4 SDRAM @ 2666 MHz (quad-channel access)
>     - 2-port BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (used
> with 10Gbps DAC cables)
>
>     The servers have 4 NUMA nodes (0, 1, 2, 3) and the NICs belong to NUMA
> node 1. All the interrupts caused by the NICs are processed by 8 of the 32
> CPU cores (the ones that belong to NUMA node 1, that is, cores 8 to 15);
> thus, regarding our measurements, the DUT (Device Under Test) is more or
> less equivalent to an 8-core server. :-)
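>
>     (For anyone who wants to reproduce the setup: the NUMA node of a NIC and
>     the CPU cores allowed to serve its interrupts can be checked with
>     something like the Python sketch below; the interface name is just an
>     example.)
>
>         IFACE = "ens2f0"  # example interface name
>
>         # NUMA node the NIC is attached to
>         with open(f"/sys/class/net/{IFACE}/device/numa_node") as f:
>             print("NUMA node:", f.read().strip())
>
>         # IRQs of the NIC and the cores allowed to serve them
>         with open("/proc/interrupts") as f:
>             irqs = [line.split(":")[0].strip() for line in f if IFACE in line]
>
>         for irq in irqs:
>             with open(f"/proc/irq/{irq}/smp_affinity_list") as f:
>                 print(f"IRQ {irq}: cores {f.read().strip()}")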
>
>     I used Debian Linux 9.13 with 5.12.14 kernel on both servers.
>
>     The test setup was the same as our draft describes:
> https://datatracker.ietf.org/doc/html/draft-lencse-bmwg-benchmarking-stateful
>
>                       +--------------------------------------+
>              10.0.0.2 |Initiator                    Responder| 198.19.0.2
>         +-------------|                Tester                |<------------+
>         | private IPv4|                         [state table]| public IPv4 |
>         |             +--------------------------------------+             |
>         |                                                                  |
>         |             +--------------------------------------+             |
>         |    10.0.0.1 |                 DUT:                 | 198.19.0.1  |
>         +------------>|        Stateful NATxy gateway        |-------------+
>           private IPv4|     [connection tracking table]      | public IPv4
>                       +--------------------------------------+
>
>
>     Both the Tester and the DUT were the above mentioned HPE servers.
>
>     I wanted to follow my original plan included in my IETF 111
> presentation to measure both the maximum connection establishment rate and
> the throughput using the following numbers of connections: 1 million, 10
> million, 100 million, and 1 billion. However, the DUT became unavailable
> during my tests with 1 billion connections, as the connection tracking table
> exhausted the memory. Thus I have reduced the highest number of
> connections to 500 million.
>
>     First, I had to perform the tests for the maximum connection
> establishment rate. To achieve the required growing number of connections,
> I always used 50,000 different source port numbers (from 1 to 50,000) and
> increased the number of destination ports as 20, 200, 2,000, and 10,000
> (instead of 20,000).
>
>     As for the size of the connection tracking table, I used powers of 2.
> The very first value was 2^20=1,048,576, and then I had to use the lowest
> sufficiently large power, that is 2^24, etc. The hash size parameter was
> always set to 1/8 of the size of the connection tracking table.
>
>     I have performed a binary search to determine the maximum connection
> establishment rate. The stopping criterion was expressed by the "error"
> parameter, which is the difference between the upper and lower bound. It was
> set to 1,000 in all cases except the last one: then it was set to 10,000 to
> save some execution time. The experiments were executed 10 times, except the
> last one (only 3 times).
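>
>     (For readers less familiar with this kind of benchmarking: the search is
>     roughly the procedure sketched below in Python; "trial" stands for one
>     elementary measurement step and is hypothetical, this is not siitperf
>     code.)
>
>         def binary_search(upper: float, error: float, trial) -> float:
>             """Find the highest rate at which trial(rate) still succeeds.
>
>             trial(rate) runs one elementary measurement and returns True if
>             it succeeded (e.g. all connections were established); "error" is
>             the stopping criterion: the difference between the upper and
>             lower bound.
>             """
>             lower = 0.0
>             while upper - lower > error:
>                 rate = (upper + lower) / 2
>                 if trial(rate):
>                     lower = rate   # rate sustained: raise the lower bound
>                 else:
>                     upper = rate   # rate failed: lower the upper bound
>             return lower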
>
>     I have calculated the median, the minimum and the maximum of the
> measured connection establishment rates. I hope the values of the table
> below will be readable in my e-mail:
>
>     Num. conn.           1,000,000   10,000,000  100,000,000  500,000,000
>     src ports               50,000       50,000       50,000       50,000
>     dst ports                   20          200        2,000       10,000
>     conntrack t. s.           2^20         2^24         2^27         2^29
>     hash table size        c.t.s/8      c.t.s/8      c.t.s/8      c.t.s/8
>     num. exp.                   10           10           10            3
>     error                    1,000        1,000        1,000       10,000
>     cps median               1.124        1.311        1.01         0.742
>     cps min                  1.108        1.308        1.007        0.742
>     cps max                  1.128        1.317        1.013        0.742
>     c.t.s/n.c.               1.049        1.678        1.342        1.074
>
>     The cps (connections per second) values are given in million
> connections per second.
>
>     The rise of the median to 1.311M at 10 million connections (from
> 1.124M at 1 million connections) is most likely caused by the fact that
> the size of the connection tracking table was quite large compared to the
> actual number of connections. I have included this proportion in the last
> line of the table. Thus only the very first and very last columns are
> directly comparable. If we consider that the maximum connection
> establishment rate decreased from 1.124M to 0.742M while the total
> number of connections increased from 1M to 500M, I think we can be
> satisfied with the scale-up of iptables.
>     (Of course, I can see the limitations of my measurements: the error of
> 10,000 was too high; the binary search always finished at the same number.)
>
>     As I wanted more informative results, I have abandoned the decimal
> increase of the number of connections. I instead used a binary increase to
> ensure that the proportion of the number of connections to the size of the
> connection tracking table remains constant.
>
>     I had another concern. Earlier, I pointed out that the role of
> the source and destination port numbers is not completely symmetrical in
> the hash function that distributes the interrupts among the CPU cores [1].
> Therefore, I decided to increase the source and destination port ranges
> together.
>
>     [1] G. Lencse, "Adding RFC 4814 Random Port Feature to Siitperf:
> Design, Implementation and Performance Estimation", International Journal
> of Advances in Telecommunications, Electrotechnics, Signals and Systems,
> vol 9, no 3, pp. 18-26, 2020, DOI: 10.11601/ijates.v9i3.291 Full paper in
> PDF
>
>     The results are shown in the table below (first, please see only the
> same lines as in the table above):
>
>     Num. conn.           1,562,500    6,250,000   25,000,000  100,000,000  400,000,000
>     src ports                2,500        5,000       10,000       20,000       40,000
>     dst ports                  625        1,250        2,500        5,000       10,000
>     conntrack t. s.           2^21         2^23         2^25         2^27         2^29
>     hash table size        c.t.s/8      c.t.s/8      c.t.s/8      c.t.s/8      c.t.s/8
>     num. exp.                   10           10           10           10            5
>     error                    1,000        1,000        1,000        1,000        1,000
>     cps median               1.216        1.147        1.085        1.02         0.88
>     cps min                  1.21         1.14         1.077        1.015        0.878
>     cps max                  1.224        1.153        1.087        1.024        0.884
>     c.t.s/n.c.               1.342        1.342        1.342        1.342        1.342
>     throughput median        3.605        3.508        3.227        3.242        2.76
>     throughput min           3.592        3.494        3.213        3.232        2.748
>     throughput max           3.627        3.521        3.236        3.248        2.799
>
>     Now, the maximum connection setup rate deteriorates only very
> slightly, from 1.216M to 1.02M, while the number of connections increases
> from 1,562,500 to 100,000,000. (A slightly higher degradation can be
> observed only in the last column; I will return to it later.)
>
>     The last 3 lines of the table show the median, minimum and maximum
> values of the throughput. As required by RFC 8219, throughput was
> determined by using bidirectional traffic and the duration of the
> elementary steps of the binary search was 60s. (The binary search was
> executed 10 times, except for the last column, where it was done only 5
> times to save execution time.)
>
>     Note: commercial testers usually report the total number of frames
> forwarded. Siitperf reports the number of frames per direction. Thus, in the
> case of bidirectional tests, the reported value should be multiplied by 2
> to obtain the total number of frames per second. I did so, too.
>
>     Although RFC 2544/5180/8219 require testing with bidirectional
> traffic, I suspect that perhaps unidirectional throughput may also be
> interesting for ISPs, as home users usually have much more download than
> upload traffic...
>
>     The degradation of the throughput is also moderate, except in the last
> column. I attribute the higher decrease of the throughput at 400,000,000
> connections (as well as that of the maximum connection establishment rate)
> to NUMA issues. In more detail: this time nearly the entire memory of the
> server was in use, whereas in the previous cases iptables could use
> NUMA-local memory (if it was smart enough to do that). Unfortunately I cannot
> add more memory to these computers to check my hypothesis, but in a few
> weeks, I hope to be able to use some DELL PowerEdge R430 computers that
> have only two NUMA nodes and 384GB RAM; see the "P" nodes here:
> https://starbed.nict.go.jp/en/equipment/
>
>     Now, I kindly ask everybody who is interested in the scale-up tests
> to comment on my measurements regarding both the methodology and the
> parameters!
>
>     I am happy to provide more details, if needed.
>
>     I plan to start working on the NAT64 tests in the upcoming weeks. I
> plan to use Jool. I have very limited experience with it, so first I need
> to find out how I can tune its connection tracking table parameters to be
> able to perform fair scale-up tests.
>
>     In the meantime, please comment on my above experiments and results so
> that I may improve the tests to provide convincing results for all
> interested parties!
>
>     Best regards,
>
>     Gábor
>
>     P.S.: If someone would volunteer to repeat my experiments, I would be
> happy to share my scripts and experience and to provide support for
> siitperf, which is available from: https://github.com/lencsegabor/siitperf
>     The stateful branch of siitperf is in a very alpha state. It has only
> partial documentation in this paper, which is still under review:
> http://www.hit.bme.hu/~lencse/publications/SFNAT64-tester-for-review.pdf
> (It may be revised or removed at any time.)
>     The support for unique pseudorandom source and destination port number
> combinations, which I used for my current tests, is not described there, as
> I invented it after the submission of that paper. (I plan to update the
> paper if I get a chance to revise it. In the unlikely case that the
> paper is accepted as is, I plan to write a shorter paper about the new
> features. For now, the commented source code is the most reliable
> documentation.)
>
> _______________________________________________
> v6ops mailing list
> v6ops@ietf.org
> https://www.ietf.org/mailman/listinfo/v6ops
>