Re: [IPsec] Comments on draft-pwouters-multi-sa-performance

Antony Antony <antony.antony@secunet.com> Wed, 10 November 2021 08:11 UTC

Return-Path: <antony.antony@secunet.com>
X-Original-To: ipsec@ietfa.amsl.com
Delivered-To: ipsec@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 2AE733A1397; Wed, 10 Nov 2021 00:11:06 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.899
X-Spam-Level:
X-Spam-Status: No, score=-1.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=unavailable autolearn_force=no
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id pAM4gUZjD4Cp; Wed, 10 Nov 2021 00:11:00 -0800 (PST)
Received: from a.mx.secunet.com (a.mx.secunet.com [62.96.220.36]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 22F983A1396; Wed, 10 Nov 2021 00:10:59 -0800 (PST)
Received: from localhost (localhost [127.0.0.1]) by a.mx.secunet.com (Postfix) with ESMTP id 1A8FD2059C; Wed, 10 Nov 2021 09:10:56 +0100 (CET)
X-Virus-Scanned: by secunet
Received: from a.mx.secunet.com ([127.0.0.1]) by localhost (a.mx.secunet.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id aiHfSw-8xlDh; Wed, 10 Nov 2021 09:10:54 +0100 (CET)
Received: from mailout2.secunet.com (mailout2.secunet.com [62.96.220.49]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by a.mx.secunet.com (Postfix) with ESMTPS id 8A8AF2049B; Wed, 10 Nov 2021 09:10:54 +0100 (CET)
Received: from cas-essen-01.secunet.de (unknown [10.53.40.201]) by mailout2.secunet.com (Postfix) with ESMTP id 7BB0380004A; Wed, 10 Nov 2021 09:10:54 +0100 (CET)
Received: from mbx-essen-01.secunet.de (10.53.40.197) by cas-essen-01.secunet.de (10.53.40.201) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.14; Wed, 10 Nov 2021 09:10:54 +0100
Received: from moon.secunet.de (172.18.26.121) by mbx-essen-01.secunet.de (10.53.40.197) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.14; Wed, 10 Nov 2021 09:10:53 +0100
Date: Wed, 10 Nov 2021 09:10:48 +0100
From: Antony Antony <antony.antony@secunet.com>
To: "Bottorff, Paul" <paul.bottorff@hpe.com>
CC: Paul Wouters <paul.wouters=40aiven.io@dmarc.ietf.org>, "Panwei (William)" <william.panwei@huawei.com>, "ipsec@ietf.org" <ipsec@ietf.org>, "draft-pwouters-ipsecme-multi-sa-performance@ietf.org" <draft-pwouters-ipsecme-multi-sa-performance@ietf.org>
Message-ID: <YYt+iIbwjZuh4uSI@moon.secunet.de>
Reply-To: antony.antony@secunet.com
References: <cc0b5528e7c047e0a9073f637218f013@huawei.com> <3c525728-22e8-d5ef-f183-c2c9d622cc54@nohats.ca> <CS1PR8401MB11924248C78CF59B633D931EFE8E9@CS1PR8401MB1192.NAMPRD84.PROD.OUTLOOK.COM>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CS1PR8401MB11924248C78CF59B633D931EFE8E9@CS1PR8401MB1192.NAMPRD84.PROD.OUTLOOK.COM>
Precedence: first-class
Priority: normal
Organization: secunet
X-ClientProxiedBy: cas-essen-02.secunet.de (10.53.40.202) To mbx-essen-01.secunet.de (10.53.40.197)
X-EXCLAIMER-MD-CONFIG: 2c86f778-e09b-4440-8b15-867914633a10
Archived-At: <https://mailarchive.ietf.org/arch/msg/ipsec/TcUhx3P3F3h_QF2_tWGkQ5wmOc4>
Subject: Re: [IPsec] Comments on draft-pwouters-multi-sa-performance
X-BeenThere: ipsec@ietf.org
X-Mailman-Version: 2.1.29
List-Id: Discussion of IPsec protocols <ipsec.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/ipsec>, <mailto:ipsec-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ipsec/>
List-Post: <mailto:ipsec@ietf.org>
List-Help: <mailto:ipsec-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/ipsec>, <mailto:ipsec-request@ietf.org?subject=subscribe>
X-List-Received-Date: Wed, 10 Nov 2021 08:11:07 -0000

Hi Paul, 

I think our draft is also a better solution for the network multipath problem, certainly for a small number of per-path SAs. A larger number of paths, say 32 or more, may cause scaling issues in the SPD and/or SAD lookup, i.e. the data-path lookup. However, data-path lookup speed would depend on the implementation. Creating an SA per path would decrease the chance of out-of-order delivery, so ESP sequence number issues would be minimal.

To create per-path SAs, the IKE negotiation would work like the per-CPU negotiation in the current draft. The SAD/SPD lookup code path would differ from the prototype code we have for Linux/XFRM: the path would be a property of the SPD, and the path attribute would be used for the SAD lookup. Each SA can be created dynamically based on traffic, or rather per path. First, create the Fallback SA and install the policy with a new flag, a per-path SADB_ACQUIRE flag. When there is traffic and no SA matching the packet's path, the policy should trigger an SADB_ACQUIRE, and an IKE negotiation follows. After a successful IKE negotiation, install the new per-path SA. While IKE negotiates the per-path SA, use the Fallback SA, so there won't be any packet drops.
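To make the flow above concrete, here is a minimal sketch of the lookup/acquire logic, assuming a toy dict-based SAD keyed by an opaque path id. The names (`Sad`, `acquires`, `install`) are illustrative only, not from XFRM or any real PF_KEY API:

```python
# Hypothetical sketch of per-path SA lookup: if no SA matches the packet's
# path, keep using the Fallback SA and queue an ACQUIRE so IKE can
# negotiate a per-path SA in the background -- no packets are dropped.

class Sad:
    """Toy Security Association Database keyed by path id."""

    def __init__(self, fallback_sa):
        self.fallback_sa = fallback_sa   # always-present Fallback SA
        self.per_path = {}               # path_id -> per-path SA
        self.acquires = []               # pending ACQUIRE events for IKE

    def lookup(self, path_id):
        sa = self.per_path.get(path_id)
        if sa is not None:
            return sa
        # No per-path SA yet: trigger an ACQUIRE at most once per path
        # and keep traffic flowing on the Fallback SA meanwhile.
        if path_id not in self.acquires:
            self.acquires.append(path_id)
        return self.fallback_sa

    def install(self, path_id, sa):
        # Called once IKE has negotiated the per-path SA.
        self.per_path[path_id] = sa
        if path_id in self.acquires:
            self.acquires.remove(path_id)
```

A real implementation would hook `lookup()` into the SPD match and deliver the ACQUIRE to the IKE daemon, but the fallback-first ordering is the point here.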

I am working on a draft about how to deal with UDP encapsulation, with each SA using a different UDP port. My idea is to make a UDP-encapsulated IPsec SA a pair of SAs, where the IKE negotiator chooses the source port. In one direction the source port will change, and in the other direction the destination port will change, based on path entropy. This solution is generic and will work with common NAT gateways, with one limitation: the IKE peer that is not behind NAT must initiate the new SA.
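A rough sketch of that port pairing, under stated assumptions: the per-path port is derived here from a path index and an arbitrary base port purely for illustration; in the actual scheme the IKE negotiator would choose it.

```python
# Illustrative per-path UDP encapsulation ports. The port that varies per
# path appears as the SOURCE port in one direction and as the DESTINATION
# port in the reverse direction, so a common NAT gateway sees an ordinary
# UDP flow. Base port values are assumptions, not from any draft.

NAT_T_PORT = 4500        # standard UDP encapsulation port for ESP
BASE_EPHEMERAL = 49152   # start of the dynamic port range (illustrative)

def encap_ports(path_index, initiator_behind_nat=False):
    """Return ((src, dst) outbound, (src, dst) inbound) for one path."""
    path_port = BASE_EPHEMERAL + path_index
    if initiator_behind_nat:
        # A NAT rewrites our source port anyway, so the peer that is not
        # behind NAT must be the one varying its port (hence it initiates).
        outbound = (NAT_T_PORT, path_port)
        inbound = (path_port, NAT_T_PORT)
    else:
        outbound = (path_port, NAT_T_PORT)
        inbound = (NAT_T_PORT, path_port)
    return outbound, inbound
```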

RSS/ntuple filtering would use the UDP source port plus destination port for entropy. Would this work for network switches/routers along the path? My guess is it will!
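As a toy illustration of why the port pair provides usable entropy: an ECMP device hashes the tuple to pick a link, so flows that differ only in the per-path port spread across links. The hash below is a deliberately simple stand-in; real NICs and switches use Toeplitz or CRC-based hashes.

```python
# Toy ECMP/RSS member selection from a UDP port pair. This only shows the
# distribution effect; the multiplier 59 is arbitrary, not a real device
# hash.

def pick_link(src_port, dst_port, n_links):
    """Deterministically map a UDP port pair onto one of n_links members."""
    return (src_port * 59 + dst_port) % n_links

# Eight SAs whose source ports differ by the path index cover all links:
links = {pick_link(49152 + i, 4500, 4) for i in range(8)}
```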

Are you seeing out-of-order delivery because ESP packets take multiple paths, or at the sender? I have noticed that a multi-CPU ESP sender can send packets out of order. I think if you install an SA per path, out-of-order issues due to paths will be resolved. There may be corner cases with multiple CPUs and multiple paths; in those cases you may need CPUs x paths SAs, which will add lookup complexity. Out-of-order sending and delivery seems to be a problem with multiple CPUs on the sender; even with IPsec offload NICs in Linux, out-of-order sending could be an issue. Using per-CPU SAs would reduce it considerably.
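The CPUs x paths corner case can be made concrete with a small sketch: if both the sending CPU and the network path need a dedicated SA, the SAD key becomes a `(cpu, path)` pair, and state (and lookup complexity) grows multiplicatively. Names here are illustrative.

```python
# Hypothetical enumeration of the SAD keys a full per-CPU, per-path
# deployment would need: one SA per (cpu, path) combination, per direction.

def sa_table(n_cpus, n_paths):
    """Enumerate the (cpu, path) keys of a per-CPU, per-path SAD."""
    return {(cpu, path) for cpu in range(n_cpus) for path in range(n_paths)}

# Even a modest 4 CPUs over 8 paths already means 32 SAs per direction.
```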

 

On Wed, Nov 10, 2021 at 08:32:13AM +0100, Antony Antony wrote:
> Hi Paul,
> 
> I think our draft is a better solution for the network multipath problem, definitely for small number of per Path SAs. Larger number of paths, say 16 or more paths, may cause scaling issues in SPD or/and SAD lookup; the data path. However, data path lookup speed would depend on the implementation.
> Creating an SA per path would increase chance in order delivery, ESP sequence number issues will be minimal.
> 
> To create per path SA the IKE negotiations would be similar to current draft, perCPU. The SAD/SPD code path will be different from the prototype we have for Linux/XFRM. The path will be a property of SPD. When looking up SAD use the path attribute and the correct SA will be used.
> 
> Each SA can be created dynamically if and when necessary. First create
> the Fallback SA and install the policy with new flag, per-Path SADB_ACQUIRE flag.
> And when there is traffic and no SA matching per path SAD  entry, the
> policy should trigger IKE negotiations and install a new per Path SA.
> While per path SA is negotiated use the Fallback SA. So there won't be
> any packet drop.
> 
> I am working a draft, how to deal with UDP encapsulation.  Each SA with different UDP port.
> My idea is make UDP encapsulated IPsec SA a pair of SAs. In one direction source port will change and in the other direction destination port will change based on path entropy. This solution is generic and will work with the common NAT gateways too.
> RSS/ntuple would use source port plus destination port for entropy.
> Would this work for network switches/routers along the path?
> 
> Are you seeing out of order delivery because ESP packets take of multiple paths or multiple CPUs on the ESP sender? 
> I think if you install an SA per path, out of order issues due to path
> will mostly solved. There may be corner cases when there are multiple CPUs
> and multiple paths. In those cases you may need CPU x paths SAs.
> CPU x Paths SAs will add lookup complexity. I have no operational experience yet.
> Out of order delivery/sending seems to be a problem when there are multiple CPUs on the sender.  Even when using IPsec offload NICs in Linux. Using per-CPU would reduce it considerably.
> 
> 
> -antony
> 
> On Fri, Nov 05, 2021 at 21:39:05 +0000, Bottorff, Paul wrote:
> > Hi Paul:
> > 
> > I've reviewed your draft to determine if it is viable as a solution to the network multi-pathing problems I've been investigating. Though I have no objection to your solution for multi-CPU balancing, it does not seem to provide a reasonable alternative to draft-xu-ipsecme-esp-in-udp-lb-08 for network multi-pathing. Hopefully, we can move forward with draft-xu-ipsecme-esp-in-udp-lb-08 for network multi-pathing. We currently have it implemented and working quite successfully in our smart NICs.
> > 
> > For network multi-pathing within a highly meshed data center we want to be able to create an identifier for every flow to allow the network to perform load balancing algorithms which measure the load on links and dynamically shift flows to achieve balance (rather than balancing the number of flows on links). For a typical big server we would expect 100s of flows and for a network middle box we would expect 1000s. The amount of state required to carry independent SA state for each of these is troublesome.
> > 
> > Further, to use the additional SAs we would need to dynamically generate new identical SAs based on a current flow table which will result in startup delays for new flows as we establish the new SA.
> > 
> > Finally, even though some of the switches some of the time could add the SPI into their hash for IPsec packets (and use a different hash for other packets), this feature is not available universally across all data center switches and routers. Also, the standard procedure for existing data center switches, even those that could generate a special hash for the SPI, is to use a 5-tuple hash as the flow identifier across all packets.
> > 
> > It is true that for network multi-pathing delivery order is only guaranteed per flow, and therefore replay detection must either be disabled or would need to take re-ordering between flows into account.
> > 
> > Cheers,
> > 
> > Paul  
> > 
> > 
> > 
> > -----Original Message-----
> > From: IPsec [mailto:ipsec-bounces@ietf.org] On Behalf Of Paul Wouters
> > Sent: Monday, November 1, 2021 8:27 PM
> > To: Panwei (William) <william.panwei@huawei.com>
> > Cc: ipsec@ietf.org; draft-pwouters-ipsecme-multi-sa-performance@ietf.org
> > Subject: Re: [IPsec] Comments on draft-pwouters-multi-sa-performance
> > 
> > On Fri, 29 Oct 2021, Panwei (William) wrote:
> > 
> > Hi William,
> > 
> > > Subject: [IPsec] Comments on draft-pwouters-multi-sa-performance
> > 
> > > I’ve read the recent version. This is an interesting solution. I think it should be adopted. Below are some comments.
> > 
> > Thanks for reading the draft and giving us feedback!
> > 
> > > 1.
> > > 
> > > The CPU_QUEUES notification value refers to the number of additional
> > > 
> > > resource-specific Child SAs that may be installed for this particular
> > > 
> > > TSi/TSr combination excluding the Fallback Child SA.
> > > 
> > > Is it necessary to limit the number of additional SAs at the beginning 
> > > while TS_MAX_QUEUE can be used to reject the request of creating an 
> > > additional SA at any time? In the virtualization scenario, new VMs can be launched on demand; in other words, the number of CPUs isn't fixed, so limiting the additional SAs at the beginning may hurt flexibility.
> > 
> > The limit is really a very high maximum number and not a very low number exactly matching the CPUs. We had that at first, to try and optimize it, but there were too many race conditions, e.g. with rekeying. So if you have a peer with 4 CPUs and a peer with 2 CPUs, you might just want to set the max to 8 or even 12. It is mostly meant to avoid doing CREATE_CHILD_SAs that are just doomed to fail anyway. So see it more as a resource cap than a strict physical limitation.
> > 
> > > 2.
> > > 
> > > The CPU_QUEUES notification payload is sent in the IKE_AUTH or
> > > 
> > > CREATE_CHILD_SA Exchange indicating the negotiated Child SA is a
> > > 
> > > Fallback SA.
> > > 
> > > Before additional SAs are created, is there any difference between 
> > > using this first/Fallback SA and using other normal SA? I think there is no difference. So maybe we don’t need to add this notification payload when creating the first SA.
> > 
> > The problem right now is that implementations often will discard an older identical Child SA for the newest one. So one of the key intentions of the document is for the initiator/responder to clearly negotiate from the start that they are going to be using Child SAs with identical Traffic Selectors and that they want the older ones to stick around. Also, it is important that there is always 1 fallback Child SA that can be used on any CPU resource. So we really wanted to mark that one very clearly. For instance, if it becomes idle, it should NOT be deleted.
> > 
> > > When the initiator wants
> > > to create an additional SA, it can directly send the request with CPU_QUEUE_INFO notification payload.
> > 
> > It would be good to know from the responder if they support this and if they are willing to do this before doing the CREATE_CHILD_SA. And as I said above, to ensure both parties agree on which Child SA is the "always be present" fallback SA to ensure things like adding a new CPU always results in encrypted packets via the fallback SA.
> > 
> > > There are 3 ways that the
> > > responder may reply: 1) The responder doesn’t support/recognize this 
> > > notification, it will ignore this notification and reply as usual.
> > 
> > But there is no "as usual" for what happens to the older Child SA. Some implementations will allow it, some will only allow it if it has its own IKE SA, and some will just delete the old one. This is the ambiguity we are trying to address with the draft.
> > 
> > > 2) It supports this function and is willing to create the additional 
> > > SA, so it will reply with CPU_QUEUE_INFO notification too. 3) It supports this function, but it isn’t willing to create more additional SAs, so it will reply with TS_MAX_QUEUE.
> > > Therefore, it seems that these 2 notifications, CPU_QUEUE_INFO and 
> > > TS_MAX_QUEUE, are enough, and the draft can be simplified to only use these 2 notifications.
> > 
> > I hope I explained why we think some clear signal has its use. If you take your assumptions to the max, one would need no document at all, as the IKEv2 specification states there can be Child SAs that are duplicates or with overlapping IP ranges, so in theory, nothing is needed.
> > 
> > > 3.
> > > 
> > > Both peers send
> > > 
> > > the preferred minimum number of additional Child SAs to install.
> > > 
> > > First, I think sending the number of additional Child SAs is 
> > > unnecessary. Second, when using “minimum” here my first impression is 
> > > that it means 0, so in order to remove ambiguity I suggest just saying “the preferred number” (if you think sending the number is necessary).
> > 
> > The use of minimum indicates what the peer needs. A peer with 4 CPUs does not prefer 4; it really prefers as many as the higher number of CPUs of the two peers - within reason. The preference is really what works best for the combination of the two peers.
> > 
> > Note the minimum is not about the minimum number required for functioning, but the minimum number to get optimum performance.
> > 
> > By indicating the minimum, both sides can pick the highest minimum and then allow a few more (for race conditions during rekeying).
> > 
> > > 4.
> > > 
> > > If a CREATE_CHILD_SA exchange request containing both a CPU_QUEUE_INFO 
> > > and a CPU_QUEUES notification is received, the responder MUST ignore 
> > > the CPU_QUEUE_INFO payload. If a CREATE_CHILD_SA exchange reply is 
> > > received with both CPU_QUEUE_INFO and CPU_QUEUES notifications, the 
> > > initiator MUST ignore the notification that it did not send in the 
> > > request.
> > > 
> > > I think there is ambiguity here. When the initiator sends the 
> > > CREATE_CHILD_SA exchange request containing both a CPU_QUEUE_INFO and 
> > > a CPU_QUEUES notification, and the responder also adds CPU_QUEUE_INFO 
> > > and CPU_QUEUES notifications in the reply, the initiator doesn’t know how to handle this situation: should the initiator ignore the CPU_QUEUE_INFO payload or notify the responder of an error?
> > 
> > We went back and forth on this a couple of times among the authors. We really wanted to keep it as simple as possible but also not be too pedantic. From a protocol point of view, we could say to just return an error like SYNTAX_ERROR, but that would cause the IKE SA and all its working Child SAs to also be torn down, and we wanted to avoid that, so that bugs in the performance implementation do not result in complete tunnel failures. Hence our phrasing of "just ignore X"
> > on both the initiator and responder.
> > 
> > We agree that a broken initiator with a broken responder leads to something broken. I think specifying how a broken initiator should respond to a broken responder is taking the Postel Principle a step too far? :)
> > 
> > Paul
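The "pick the highest minimum and allow a few more" logic Paul describes in the quoted discussion can be sketched as follows. The slack value and the local cap are assumptions for illustration, not numbers from the draft:

```python
# Sketch of the CPU_QUEUES minimum negotiation: each peer advertises the
# minimum number of additional per-resource Child SAs it wants for optimum
# performance; both sides take the higher of the two and tolerate a little
# extra for rekey races, bounded by a local resource cap.

REKEY_SLACK = 2  # extra SAs tolerated during rekey races (illustrative)

def allowed_additional_sas(my_min, peer_min, my_cap):
    """Higher of the two advertised minimums, plus slack, capped locally."""
    wanted = max(my_min, peer_min) + REKEY_SLACK
    return min(wanted, my_cap)
```

For example, a 2-CPU peer talking to a 4-CPU peer would settle on 4 plus slack, so the cap acts as a resource limit rather than a strict physical count.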
