Re: [IPsec] Comments on draft-pwouters-multi-sa-performance

Antony Antony <antony.antony@secunet.com> Tue, 16 November 2021 13:15 UTC

Date: Tue, 16 Nov 2021 14:15:34 +0100
From: Antony Antony <antony.antony@secunet.com>
To: "Bottorff, Paul" <paul.bottorff@hpe.com>
CC: Paul Wouters <paul.wouters=40aiven.io@dmarc.ietf.org>, "ipsec@ietf.org" <ipsec@ietf.org>, "draft-pwouters-ipsecme-multi-sa-performance@ietf.org" <draft-pwouters-ipsecme-multi-sa-performance@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipsec/C_657zn0CGDK7C8iQWwHD1WAB4M>
Subject: Re: [IPsec] Comments on draft-pwouters-multi-sa-performance

Hi Paul,

Let me try once more to explain how to use draft-pwouters-multi-sa-performance for load balancing across paths.
At the moment, the draft and the Linux XFRM code only cover per-CPU queuing. So forget the CPU use case for now! Let's focus on network path diversity.

As you said, the IPsec peers must be configured with the number of paths. Let's say there are 4 paths; configure that in the IKE daemon.

The IKE daemon will add the 4 paths to the security policy (SP). The IKE initiator will then negotiate 4 per-path Child SAs, i.e. 5 Child SAs in total (4 per-path plus 1 Fallback SA). The IKE initiator should force UDP encapsulation so that the load-balancing switches have something to work with.

First, the IKE initiator sets up a Fallback SA. The Fallback SA uses UDP port 4500 on both ends, with an additional attribute announcing the 4 per-path SAs.
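
To make that concrete, here is a rough sketch (Python, purely illustrative) of the SA set the initiator ends up with for 4 paths. The non-4500 source ports are made-up example values, not something the draft mandates:

# Hypothetical SA set after negotiation: one Fallback SA plus one
# per-path SA per path, each with a unique UDP source port so that
# load balancers see 5 distinct outer flows.
sa_set = {
    "fallback": {"udp_src": 4500, "udp_dst": 4500},
    "per_path": [
        {"path": i, "udp_src": 40001 + i, "udp_dst": 4500}
        for i in range(4)
    ],
}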

The data plane, the SAD, will hash the clear-text traffic using an n-tuple. For TCP and UDP, the clear-text n-tuple is source IP + source port + destination IP + destination port. All other traffic (think of ICMP...) is hashed using source IP + destination IP. This hashing is only an example; any other hash over clear-text flows used to choose among the IPsec SA paths would work too.
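
As an illustration only, a minimal Python sketch of that hashing; the function and field names are my own shorthand, not the XFRM implementation, and crc32 is just a stand-in for whatever stable hash the data plane uses:

import zlib

NUM_PATHS = 4

def path_index(src_ip, dst_ip, proto, src_port=None, dst_port=None):
    """Map a clear-text flow to a path index in [0, NUM_PATHS)."""
    if proto in ("tcp", "udp"):
        # TCP/UDP: hash the full clear-text 4-tuple.
        key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}"
    else:
        # All other traffic (think of ICMP): source IP + destination IP.
        key = f"{src_ip}|{dst_ip}"
    return zlib.crc32(key.encode()) % NUM_PATHS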

When traffic arrives, the IPsec gateway computes the hash. If there is no SA for that hash index, it uses the Fallback SA and sends an SADB_ACQUIRE to the IKE daemon. The IKE daemon will negotiate a new per-path SA for that index. Once the per-path SA is installed, the traffic will use that SA. Each per-path SA uses UDP encapsulation with a unique source port + destination port pair. The IKE initiator initiates with a source port other than 4500, and the IKE responder should respond to that new source port (and not to port 4500). Since each per-path SA has a different UDP source port + destination port, the switches or load balancers should find enough entropy. Can your load balancers forward the IPsec traffic via different paths based on the 4-tuple? When rekeying, keep the same source port + destination port.
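
Again only a sketch, reusing the hypothetical path_index() from above; acquire() stands in for the SADB_ACQUIRE message towards the IKE daemon, and the installed port numbers are invented:

FALLBACK_SA = {"udp_src": 4500, "udp_dst": 4500}
per_path_sa = {}  # filled in as the IKE daemon installs per-path SAs

def select_sa(src_ip, dst_ip, proto, src_port, dst_port, acquire):
    """Pick the outbound SA for a clear-text flow."""
    idx = path_index(src_ip, dst_ip, proto, src_port, dst_port)
    sa = per_path_sa.get(idx)
    if sa is None:
        # No per-path SA for this index yet: ask IKE to negotiate one
        # and keep traffic flowing over the Fallback SA meanwhile.
        acquire(idx)
        return FALLBACK_SA
    return sa

# Once the negotiation succeeds, the IKE daemon would install e.g.:
# per_path_sa[idx] = {"udp_src": 40001 + idx, "udp_dst": 4500}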

This will also support NATs, as well as sending DPD over each per-path SA's UDP ports using NON-ESP UDP encapsulation.

I can imagine you feel this solution is complex and changes IKE negotiations; however, I think the result is a more flexible and cleaner design. The UDP flows behave as symmetric UDP flows, and it is RSS-friendly as well.

Thanks for your feedback. I hope I didn't waste your time with another long e-mail :)

Cheers,
-antony

On Mon, Nov 15, 2021 at 08:10:25PM +0000, Bottorff, Paul wrote:
> Hi Antony:
> 
> Per-path SAs are completely inadequate for load-balanced systems, especially within the data center. Load-balanced systems are being used to separate mice from elephant flows and to dynamically re-arrange flows on paths based on load measures. The number of flows on any particular path is selected by the network and may change both hop-by-hop and on the fly. For the network to operate properly we want to identify every flow, allowing the network to allocate those flows to paths. For load balancing to work properly we need many more flows than paths. Current data center encapsulations support these operations by loading the source port with a flow identifier. Saying that IPsec will provide an inferior and cumbersome solution becomes a barrier to a wider deployment of IPsec in these environments.
> 
> Even for cases where the network is doing simple hash distribution, switches don't normally distribute based on the SPI; instead they typically identify flows based on the outer 5-tuple. Some of the switches could parse the IPsec packets and build a special hash for them, but "some of the switches some of the time" is not a satisfying solution. Identifying the flow in the source port is a perfect solution since it works for all the switches all the time, supports all the advanced modes of load balancing, and is an already well-established technique.
> 
> Further, the server interfaces where IKE would run don't know how many paths exist deep in the network and so don't know how to build for particular paths. If SAs were used, the only reasonable solution is to build an SA per flow (not per path), since that is information available to IKE. There are many flows for each CPU.
> 
> Cheers,
> 
> Paul
> 
> -----Original Message-----
> From: Antony Antony [mailto:antony.antony@secunet.com] 
> Sent: Wednesday, November 10, 2021 12:11 AM
> To: Bottorff, Paul <paul.bottorff@hpe.com>
> Cc: Paul Wouters <paul.wouters=40aiven.io@dmarc.ietf.org>; Panwei (William) <william.panwei@huawei.com>; ipsec@ietf.org; draft-pwouters-ipsecme-multi-sa-performance@ietf.org
> Subject: Re: [IPsec] Comments on draft-pwouters-multi-sa-performance
> 
> Hi Paul, 
> 
> I think our draft is a better solution for the network multipath problem too, definitely for a few per-path SAs. A larger number of paths, say 32 or more, may cause scaling issues in the SPD and/or SAD lookup, i.e. the data path lookup. However, data path lookup speed would depend on the implementation. Creating an SA per path would decrease the chance of out-of-order delivery, so ESP sequence number issues will be minimal.
> 
> To create per-path SAs, the IKE negotiations would be like the current draft's per-CPU negotiation. The SAD/SPD lookup code path will differ from the prototype code we have for Linux/XFRM. The multipath will be a property of the SPD; use the multipath attribute for the SAD lookup. Each SA can be created dynamically based on traffic, or rather the path. First, create the Fallback SA and install the policy with the new flag, the per-path SADB_ACQUIRE flag. When there is traffic and no SA matching the packet's path, the policy should trigger an SADB_ACQUIRE and an IKE negotiation follows. After a successful IKE negotiation, install the new per-path SA. While IKE negotiates the path SA, use the Fallback SA, so there won't be any packet drop.
> 
> I am working on a draft on how to deal with UDP encapsulation: each SA with a different UDP port. My idea is to make a UDP-encapsulated IPsec SA a pair of SAs. The IKE negotiator will choose the source port. In one direction the source port will change and in the other direction the destination port will change, based on path entropy. This solution is generic and will work with the common NAT gateways, with one limitation: the IKE peer which is not behind NAT must initiate the new SA.
> 
> RSS/ntuple would use the UDP source port plus destination port for entropy. Would this work for network switches/routers along the path? My guess is it will!
> 
> Are you seeing out-of-order delivery because ESP packets take multiple paths, or at the sender? I noticed a multi-CPU ESP sender can send packets out of order. I think if you install an SA per path, out-of-order issues due to paths will be resolved. There may be corner cases when there are multiple CPUs and multiple paths; in those cases, you may need CPU x paths SAs. CPU x paths SAs will add lookup complexity. Out-of-order sending and delivery seems to be a problem with multiple CPUs on the sender. Even with IPsec offload NICs in Linux, out-of-order sending could be an issue. Using per-CPU SAs would reduce it considerably.
> 
> On Wed, Nov 10, 2021 at 08:32:13AM +0100, Antony Antony wrote:
> > Hi Paul,
> > 
> > I think our draft is a better solution for the network multipath problem, definitely for a small number of per-path SAs. A larger number of paths, say 16 or more, may cause scaling issues in the SPD and/or SAD lookup, i.e. the data path. However, data path lookup speed would depend on the implementation.
> > Creating an SA per path would increase the chance of in-order delivery, so ESP sequence number issues will be minimal.
> > 
> > To create per-path SAs, the IKE negotiations would be similar to the current draft's per-CPU negotiation. The SAD/SPD code path will be different from the prototype we have for Linux/XFRM. The path will be a property of the SPD. When looking up the SAD, use the path attribute and the correct SA will be found.
> > 
> > Each SA can be created dynamically if and when necessary. First create
> > the Fallback SA and install the policy with a new flag, the per-path SADB_ACQUIRE flag.
> > And when there is traffic and no SA matching the per-path SAD entry, the
> > policy should trigger IKE negotiations and install a new per-path SA.
> > While the per-path SA is negotiated, use the Fallback SA. So there won't be
> > any packet drop.
> > 
> > I am working on a draft on how to deal with UDP encapsulation: each SA with a different UDP port.
> > My idea is to make a UDP-encapsulated IPsec SA a pair of SAs. In one direction the source port will change and in the other direction the destination port will change, based on path entropy. This solution is generic and will work with the common NAT gateways too.
> > RSS/ntuple would use source port plus destination port for entropy.
> > Would this work for network switches/routers along the path?
> > 
> > Are you seeing out-of-order delivery because ESP packets take multiple paths, or because of multiple CPUs on the ESP sender?
> > I think if you install an SA per path, out-of-order issues due to paths
> > will mostly be solved. There may be corner cases when there are multiple
> > CPUs and multiple paths. In those cases you may need CPU x paths SAs.
> > CPU x paths SAs will add lookup complexity. I have no operational experience yet.
> > Out-of-order delivery/sending seems to be a problem when there are multiple CPUs on the sender, even when using IPsec offload NICs in Linux. Using per-CPU SAs would reduce it considerably.
> > 
> > 
> > -antony
> > 
> > On Fri, Nov 05, 2021 at 21:39:05 +0000, Bottorff, Paul wrote:
> > > Hi Paul:
> > > 
> > > I've reviewed your draft to determine if it is viable as a solution to the network multi-pathing problems I've been investigating. Though I have no objection to your solution for multi-CPU balancing, it does not seem to provide a reasonable alternative to draft-xu-ipsecme-esp-in-udp-lb-08 for network multi-pathing. Hopefully, we can move forward with draft-xu-ipsecme-esp-in-udp-lb-08 for network multi-pathing. We currently have it implemented and working quite successfully in our smart NICs.
> > > 
> > > For network multi-pathing within a highly meshed data center we want to be able to create an identifier for every flow to allow the network to perform load balancing algorithms which measure the load on links and dynamically shift flows to achieve balance (rather than balancing the number of flows on links). For a typical big server we would expect 100s of flows and for a network middle box we would expect 1000s. The amount of state required to carry independent SA state for each of these is troublesome.
> > > 
> > > Further, to use the additional SAs we would need to dynamically generate new identical SAs based on a current flow table which will result in startup delays for new flows as we establish the new SA.
> > > 
> > > Finally, even though some of the switches some of the time could add the SPI into their hash for IPsec packets (and use a different hash for other packets), this feature is not available universally across all data center switches and routers. Also, the standard procedure for existing data center switches, even those that could generate a special hash for the SPI, is to use the 5-tuple hash as the flow identifier across all packets.
> > > 
> > > It is true that for network multi-pathing, delivery order is only guaranteed per flow, and therefore replay detection must either be disabled or would need to take re-ordering between flows into account.
> > > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: IPsec [mailto:ipsec-bounces@ietf.org] On Behalf Of Paul 
> > > Wouters
> > > Sent: Monday, November 1, 2021 8:27 PM
> > > To: Panwei (William) <william.panwei@huawei.com>
> > > Cc: ipsec@ietf.org; 
> > > draft-pwouters-ipsecme-multi-sa-performance@ietf.org
> > > Subject: Re: [IPsec] Comments on draft-pwouters-multi-sa-performance
> > > 
> > > On Fri, 29 Oct 2021, Panwei (William) wrote:
> > > 
> > > Hi William,
> > > 
> > > > Subject: [IPsec] Comments on draft-pwouters-multi-sa-performance
> > > 
> > > > I’ve read the recent version. This is an interesting solution. I think it should be adopted. Below are some comments.
> > > 
> > > Thanks for reading the draft and giving us feedback!
> > > 
> > > > 1.
> > > > 
> > > > The CPU_QUEUES notification value refers to the number of 
> > > > additional
> > > > 
> > > > resource-specific Child SAs that may be installed for this 
> > > > particular
> > > > 
> > > > TSi/TSr combination excluding the Fallback Child SA.
> > > > 
> > > > Is it necessary to limit the number of additional SAs at the 
> > > > beginning while TS_MAX_QUEUE can be used to reject the request of 
> > > > creating an additional SA at any time? In the virtualization scenario, new VMs can be launched on demand; in other words, the number of CPUs isn't fixed, so maybe limiting the additional SAs at the beginning will damage the flexibility.
> > > 
> > > The limit is really a very high maximum number, not a very low number exactly matching the CPUs. We had that at first, to try and optimize it, but there were too many race conditions, e.g. with rekeying. So if you have a peer with 4 CPUs and a peer with 2 CPUs, you might just want to set the max to 8 or even 12. It is mostly meant to avoid doing CREATE_CHILD_SAs that are just doomed to failure anyway. So see it more as a resource cap than a strict physical limitation.
> > > 
> > > > 2.
> > > > 
> > > > The CPU_QUEUES notification payload is sent in the IKE_AUTH or
> > > > 
> > > > CREATE_CHILD_SA Exchange indicating the negotiated Child SA is a
> > > > 
> > > > Fallback SA.
> > > > 
> > > > Before additional SAs are created, is there any difference between 
> > > > using this first/Fallback SA and using another normal SA? I think there is no difference. So maybe we don't need to add this notification payload when creating the first SA.
> > > 
> > > The problem right now is that implementations often will discard an older identical Child SA for the newest one. So one of the key intentions of the document is for the initiator/responder to clearly negotiate from the start that they are going to be using Child SAs with identical Traffic Selectors and that they want the older ones to stick around. Also, it is important that there is always 1 fallback Child SA that can be used on any CPU resource. So we really wanted to mark that one very clearly. For instance, if it becomes idle, it should NOT be deleted.
> > > 
> > > > When the initiator wants
> > > > to create an additional SA, it can directly send the request with CPU_QUEUE_INFO notification payload.
> > > 
> > > It would be good to know from the responder if they support this and if they are willing to do this before doing the CREATE_CHILD_SA. And as I said above, to ensure both parties agree on which Child SA is the "always be present" fallback SA to ensure things like adding a new CPU always results in encrypted packets via the fallback SA.
> > > 
> > > > There are 3 ways that the
> > > > responder may reply: 1) The responder doesn’t support/recognize 
> > > > this notification, it will ignore this notification and reply as usual.
> > > 
> > > But there is no "as usual" for what happens to the older Child SA. Some implementations will allow it, some will only allow it if it has its own IKE SA, and some will just delete the old one. This is the ambiguity we are trying to address with the draft.
> > > 
> > > > 2) It supports this function and is willing to create the 
> > > > additional SA, so it will reply with CPU_QUEUE_INFO notification too. 3) It supports this function, but it isn’t willing to create more additional SAs, so it will reply with TS_MAX_QUEUE.
> > > > Therefore, it seems like that CPU_QUEUE_INFO and TS_MAX_QUEUE 
> > > > these 2 notifications are enough to use, and the draft can be simplified to only use these 2 notifications.
> > > 
> > > I hope I explained why we think some clear signal has its use. If you take your assumptions to the max, one would need no document at all, as the IKEv2 specification states there can be Child SAs that are duplicates or with overlapping IP ranges, so in theory, nothing is needed.
> > > 
> > > > 3.
> > > > 
> > > > Both peers send
> > > > 
> > > > the preferred minimum number of additional Child SAs to install.
> > > > 
> > > > First, I think sending the number of additional Child SAs is 
> > > > unnecessary. Second, when using “minimum” here my first impression 
> > > > is that it means 0, so in order to remove ambiguity I suggest just saying “the preferred number” (if you think sending the number is necessary).
> > > 
> > > The use of minimum indicates what the peer needs. A peer with 4 CPUs does not prefer 4; it really prefers as many as the highest number of CPUs of the two peers - within reason. The preference is really about what works best for the combination of the two peers.
> > > 
> > > Note the minimum is not about the minimum number required for functioning, but the minimum number to get optimum performance.
> > > 
> > > By indicating the minimum, both sides can pick the highest minimum and then allow a few more (for race conditions during rekeying).
> > > 
> > > > 4.
> > > > 
> > > > If a CREATE_CHILD_SA exchange request containing both a 
> > > > CPU_QUEUE_INFO and a CPU_QUEUES notification is received, the 
> > > > responder MUST ignore the CPU_QUEUE_INFO payload. If a 
> > > > CREATE_CHILD_SA exchange reply is received with both 
> > > > CPU_QUEUE_INFO and CPU_QUEUES notifications, the initiator MUST 
> > > > ignore the notification that it did not send in the request.
> > > > 
> > > > I think there is ambiguity here. When the initiator sends the 
> > > > CREATE_CHILD_SA exchange request containing both a CPU_QUEUE_INFO 
> > > > and a CPU_QUEUES notification, and the responder also adds 
> > > > CPU_QUEUE_INFO and CPU_QUEUES notifications in the reply, the initiator doesn't know how to proceed in this situation: should the initiator ignore the CPU_QUEUE_INFO payload or notify an error to the responder?
> > > 
> > > We went back and forth on this a couple of times with the authors. We really wanted to keep it as simple as possible but also not be too pedantic. From a protocol point of view, we could say to just return an error like SYNTAX_ERROR, but that would cause the IKE SA and all its working Child SAs to also be torn down, and we wanted to avoid that so that bugs in the performance implementation do not result in complete tunnel failures. Hence our phrasing of "just ignore X"
> > > on both the initiator and responder.
> > > 
> > > We agree that a broken initiator with a broken responder leads to 
> > > something broken. I think specifying how a broken initiator should 
> > > respond to a broken responder is taking the Postel Principle a step 
> > > too far? :)
> > > 
> > > Paul
