[Anima] some implementor comments on RFC8994

Michael Richardson <mcr+ietf@sandelman.ca> Wed, 22 September 2021 21:20 UTC

From: Michael Richardson <mcr+ietf@sandelman.ca>
To: anima@ietf.org
cc: Minerva-project@lists.sandelman.ca
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg="pgp-sha512"; protocol="application/pgp-signature"
Date: Wed, 22 Sep 2021 17:20:03 -0400
Message-ID: <27729.1632345603@localhost>
Archived-At: <https://mailarchive.ietf.org/arch/msg/anima/NPegpCKfF2fwQqd0E4gYEvV3gGw>
Subject: [Anima] some implementor comments on RFC8994
Precedence: list

As some of you know I have been working on an RFC8994 (ACP) implementation
since Nov. 2020. About 700 hours internally funded, with bursts of effort
followed by weeks of it being lower priority.

The code is in Rust, and it's my first major project in Rust.
It has sub-components in C (OpenswanX:IKEv2) and C++ (Unstrung:RPL).
It is not yet integrated with my RFC8995 code; that comes next.

I wanted to share a few thoughts. Since this is long (TL;DR), I'm putting my discussion
points first, and then the summary of architecture. Some things belong in
IPsecME WG, but I'll post that separately at some point.
This posting is from notes I wrote up at:
https://github.com/AnimaGUS-minerva/connect/tree/main/doc

I've managed to assemble all the pieces, and I see DIO/DAOs in the overlay
network, but then I add a third node and things fail. Dr. Atwood and his
students want to play with this code.

I am now convinced that either my plan for using VTI/fwmark with the Linux
XFRM kernel stack will not work, or that I'm using it wrong, and I have to go
back to a simpler test case. Things are falling apart when I'm trying to
make the second tunnel. This email is not about that part.

1) the high-level IPsec policy that we do has essentially:
myid= certificate othername=rfc8994....@example
hisid= certificate othername=*

But, also is very specific about the IP addresses.

In Openswan terminology:
::/0===fe80::d4d4:9aff:fe37:72c0[E=rfc8994+fd739fc23c3440112233445500000300+@acp.example.com]
...fe80::e0d0:4eff:fee4:79d6[E=*]===::/0

First note: it is very odd to create a policy that accepts any identity on the right (remote),
while actually locking the right-hand side IP address down.
In general, this creates a template policy which is then instantiated as remote peers arrive.
Even when locked down to a single IP address, there could still be multiple
peers on the right, because of IPv4 NAPT44.

Normally, such a template cannot be initiated, but some changes were made
to enable this template to initiate if the right-hand IP address is present.
Upon reflection, an override in the policy system that stops the wildcard
from creating a template might have been better.
This change is being considered, and I've hacked it in as a test.
The template plus a single instance works, but is kind of a mess.

I think that this will help both ends to know that the policy is active.
Otherwise, the initial responding side sees that the TEMPLATE is not
active, even though an instance of it is.

**SPEC THOUGHTS**
An alternative is that the GRASP announcement could be extended to include
the DN of the announcing node. This could be done by adding a fourth
element to the locator: the DER-encoded DN of the announcing node.

There are privacy implications of this as the DN contains the assigned ULA
prefix for the node.
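
To make the privacy point concrete, here is a minimal sketch that pulls the ULA address back out of an identity string shaped like the one in the Openswan policy above. The layout ("rfc8994+" then 32 hex digits then "+" then "@domain") is inferred only from that one example string, not from the RFC 8994 grammar, and the names `AcpIdentity` and `parse_acp_identity` are hypothetical.

```rust
use std::net::Ipv6Addr;

/// Fields recovered from an identity string shaped like the example
/// in this message (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
pub struct AcpIdentity {
    pub acp_address: Ipv6Addr,
    pub domain: String,
}

/// Parse "rfc8994+<32 hex digits>+...@<domain>".
/// Assumption: this layout is taken from the single example above.
pub fn parse_acp_identity(s: &str) -> Option<AcpIdentity> {
    let (local, domain) = s.split_once('@')?;
    let mut parts = local.split('+');
    if parts.next()? != "rfc8994" {
        return None;
    }
    let hex = parts.next()?;
    // Exactly 128 bits of address, written as 32 hex digits.
    if hex.len() != 32 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let bits = u128::from_str_radix(hex, 16).ok()?;
    Some(AcpIdentity {
        acp_address: Ipv6Addr::from(bits),
        domain: domain.to_string(),
    })
}
```

Anyone who can read the announced DN can run the same few lines, which is the privacy concern in a nutshell.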

2) On a flat (ACP-unaware) LAN with a bunch of ACP nodes, the resulting set of
tunnels could be O(n^2), to no real advantage. All systems essentially
share most of the same fate being on the same L2 fabric.
An example of such a scenario is a cabinet full (~40 to ~80) of hypervisor systems
with an ACP system running in the hypervisor and/or in the BMC.
(This assumes that the L2 fabrics between cabinets are not bridged, which
they frequently are, via all manner of BGP-based DC stuff that the IETF RTG
area has worked on)

It's enough that each node connects to 3 or 4 other systems.
At the RPL level, for 80 nodes in a cabinet, a fan-out of 2 or 3 means a
depth of 5, which seems just fine to me. Otherwise, given no other
feedback, (i.e. "ETX") RPL would just find the node with the lowest rank
and that node would have 79 children.
Of course, if the ToR switch was GRASP/ACP aware, that's exactly what we'd
want, as that would exactly follow the physical topology. But in that
case, the ToR switch would really be there to support such a load.
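
The arithmetic behind this can be sketched as follows. This is only a back-of-envelope model: it assumes a perfectly balanced tree with the root at depth 0, and the function names are mine. For 80 nodes a full mesh is 3160 tunnels, while 4 tunnels per node is 160. The minimum depth is 6 at a fan-out of 2 and 4 at a fan-out of 3, which brackets the estimate of 5 above.

```rust
/// Tunnels in a full mesh of n nodes: n * (n - 1) / 2.
pub fn full_mesh_tunnels(n: u64) -> u64 {
    n * n.saturating_sub(1) / 2
}

/// Tunnels if every node keeps about `k` tunnels (each tunnel has two ends).
pub fn bounded_degree_tunnels(n: u64, k: u64) -> u64 {
    n * k / 2
}

/// Smallest depth (root at depth 0) of a tree with the given fan-out
/// that can hold n nodes. Assumes a perfectly balanced tree.
pub fn min_tree_depth(n: u64, fan_out: u64) -> u32 {
    assert!(fan_out >= 1, "fan-out must be at least 1");
    let mut depth = 0;
    let mut level = 1u64; // nodes on the current level
    let mut capacity = 1u64; // nodes in the tree so far
    while capacity < n {
        level = level.saturating_mul(fan_out);
        capacity = capacity.saturating_add(level);
        depth += 1;
    }
    depth
}
```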

We need two things:
a) some way to translate something from IKEv2 into an ETX that RPL can
deal with. IKEv2 is going to do Dead Peer Detection (DPD) messages
across the IKEv2 PARENT SA. We could do some kind of round-trip
calculation there. Of course, if DPD messages ever get lost, then
that would go into the ETX calculation.

b) an ability for a responding IKEv2 to say, perhaps in the R1 message, that
the node is already at ideal capacity, and that perhaps bringing a
tunnel up *now* isn't the best choice.
In which case, the I2/R2 would proceed without a CHILD SA proposal
(no tunnel).
It's better to do this at R1, even though that isn't authenticated
(yet!), because that avoids allocating CHILD SA resources that won't
get used.

I think that IKEv2 has some extensions involving gateway overcapacity
that might help us here.
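
For (a), a minimal sketch of what I have in mind, assuming the IKEv2 daemon can report each DPD probe as answered or lost. The `DpdEtx` type is hypothetical and not part of any existing code. It estimates ETX as probes sent divided by probes answered, and scales by 128, which is how RFC 6551 encodes the ETX metric for RPL.

```rust
/// Hypothetical estimator fed by DPD probe outcomes on the PARENT SA.
#[derive(Debug, Default)]
pub struct DpdEtx {
    sent: u32,
    acked: u32,
}

impl DpdEtx {
    /// Record one DPD probe; `answered` is false if no reply arrived.
    pub fn record(&mut self, answered: bool) {
        self.sent += 1;
        if answered {
            self.acked += 1;
        }
    }

    /// Expected transmissions per successful round trip.
    /// None until at least one probe has been answered.
    pub fn etx(&self) -> Option<f64> {
        if self.acked == 0 {
            None
        } else {
            Some(f64::from(self.sent) / f64::from(self.acked))
        }
    }

    /// ETX scaled by 128 and clamped to 16 bits, the RFC 6551 encoding.
    pub fn rpl_metric(&self) -> Option<u16> {
        self.etx()
            .map(|e| (e * 128.0).round().min(f64::from(u16::MAX)) as u16)
    }
}
```

A real version would probably want a sliding window or a moving average rather than lifetime counters, and could fold in the round-trip time as well.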

3) I had a lot of trouble with simultaneous initiation from both ends.
This will be the subject of an email to IPsecME WG. RFC7296 has text but
it needs slight clarification.
I don't quite understand why I had such issues, but I think that the
templates are partly responsible. That is, the occurrences should have
been rare, but I ran into them all the time. So I had to solve the problem.
There are still some issues that I haven't quite figured out, and I've
"solved" the problem by having some nodes never initiate.
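
One deterministic way to pick which nodes never initiate could look like the sketch below. This is purely an illustrative assumption, not what my code does today: the node with the numerically lower link-local address initiates, and the other only responds. The function name is hypothetical.

```rust
use std::net::Ipv6Addr;

/// Hypothetical tie-break: only the side with the lower link-local
/// address initiates IKEv2; the other side waits to respond.
pub fn should_initiate(local: Ipv6Addr, peer: Ipv6Addr) -> bool {
    // Byte-wise comparison of the 16 address octets.
    local.octets() < peer.octets()
}
```

With the two addresses from the policy above, fe80::d4d4:9aff:fe37:72c0 would initiate and fe80::e0d0:4eff:fee4:79d6 would not.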

4) I also noticed that there is a race condition between seeing the GRASP
AN_ACP and setting up the policy.

Node A says, "AN_ACP", "I'm here".
Node B sees it, and initiates to Node A.
But node A hasn't seen node B's AN_ACP yet, so node A hasn't got a policy
to talk to node B yet. The node B->A result is an IKEv2 authorization/authentication failure.
Then node A will see node B's AN_ACP, install a policy, and initiate from
A->B, and everything is fine.
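
The sequence above can be modelled in a few lines. This is a toy model with hypothetical names (`Node`, `on_an_acp`, `accepts_ike_from`), assuming a node keeps one policy per peer link-local address and rejects IKEv2 from any peer it has no policy for.

```rust
use std::collections::HashSet;
use std::net::Ipv6Addr;

/// Toy model of a node's per-peer policy table.
#[derive(Debug, Default)]
pub struct Node {
    policies: HashSet<Ipv6Addr>,
}

impl Node {
    /// Seeing a peer's GRASP AN_ACP installs a policy for that peer.
    pub fn on_an_acp(&mut self, peer_ll: Ipv6Addr) {
        self.policies.insert(peer_ll);
    }

    /// An incoming IKEv2 exchange only succeeds if a policy already exists.
    pub fn accepts_ike_from(&self, peer_ll: Ipv6Addr) -> bool {
        self.policies.contains(&peer_ll)
    }
}
```

B initiating before A has processed B's announcement is the `false` case; once `on_an_acp` has run on A, either direction works.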

What I could do is remove the very specific remoteip= in the
policy, and have a less specific policy that accepted a connection for
remoteid=* from any IPv6-LL. I'm not really crazy about that solution.

I'm not actually sure that there is a problem. The issue is noise.

5) Whether I create ethernet pairs (for physical interfaces that are part of
a bridge, which is common for hypervisors), or macvlan interfaces [which
are bridges down inside the Linux kernel, and so conflict with them], I
get randomized L2 addresses. Thus I get IIDs which change each time.
Plus, the VTI interfaces that I create have randomized IIDs as well.

The result is that "up-arrow-return" to restart a test results in new IPv6
LL addresses each time. This is not exactly a problem for any of the
protocols, but it is annoying for testing.

I haven't figured out how to clean up for failing interfaces/tunnels yet.
My tunnel interfaces are named "acp_XXX" [for incrementing XXX], and I
fear that while a system runs, there will be more than 999 interfaces
coming and going. I am thinking about having it be acp_HHHH, where HHHH
is the last 16 bits of the peer's IPv6. But conflicts will occur at some
point and mess stuff up. I could go to all 64 bits of IID.
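
A sketch of the naming scheme, with one constraint worth noting: Linux interface names are limited by IFNAMSIZ (16 bytes including the trailing NUL), so at most 15 characters are usable. With the "acp_" prefix that leaves room for 11 hex digits, so all 64 bits of IID written as 16 hex digits would not fit. The function name `acp_ifname` is hypothetical.

```rust
use std::net::Ipv6Addr;

/// Linux IFNAMSIZ is 16 bytes including the trailing NUL.
const MAX_IFNAME_LEN: usize = 15;

/// Build "acp_" plus the last `hex_digits` hex digits of the peer's IID.
/// Returns None if the digit count is out of range or the name is too long.
pub fn acp_ifname(peer: Ipv6Addr, hex_digits: usize) -> Option<String> {
    if hex_digits == 0 || hex_digits > 16 {
        return None;
    }
    // The IID is the low 64 bits of the address.
    let iid = u128::from(peer) as u64;
    let hex = format!("{:016x}", iid);
    let name = format!("acp_{}", &hex[16 - hex_digits..]);
    if name.len() <= MAX_IFNAME_LEN {
        Some(name)
    } else {
        None
    }
}
```

For the peer fe80::e0d0:4eff:fee4:79d6, four digits gives "acp_79d6" and eleven digits gives "acp_efffee479d6", which is the longest that fits.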

A reason to try to keep the same interface name for tunnels that connect to
the same peer host is network monitoring. It's relatively easy with
SNMP/YANG to make nice graphs for interfaces that keep the same name,
even if the ifindex changes.

--
Michael Richardson <mcr+IETF@sandelman.ca> . o O ( IPv6 IøT consulting )
Sandelman Software Works Inc, Ottawa and Worldwide

Attachment: signature.asc