Re: [Idr] BGP autoconfiguration - draft-ymbk-idr-l3nd - Please provide equivalent review for draft-minto-idr-bgp-autodiscovery

Jeff:

[author hat on] 
This first response to your email focuses on getting two equivalent chair
reviews for the L3 proposals. 

As the IDR Chair leading the auto-configuration effort, would you do the
same deep dive on the other L3 proposal for IDR:
draft-minto-idr-bgp-autodiscovery-01.txt 
https://datatracker.ietf.org/doc/draft-minto-idr-bgp-autodiscovery/

I do not see any record of a deep dive on this draft on the IDR list or from
the interim.  The interim minutes:
https://datatracker.ietf.org/meeting/interim-2022-idr-02/materials/minutes-interim-2022-idr-02-202201241000-01
do not provide this level of comments.

The only deep dive done on draft-minto-idr-bgp-autodiscovery is a response
to Robert Raszuk:
https://mailarchive.ietf.org/arch/msg/idr/2XZQ14OQ24TUlpbUWiq3H6oOVq0/

My 2nd response will solicit questions on scoping that you should ask.  

Sue 
[author hat on]

-----Original Message-----
From: Idr [mailto:idr-bounces@ietf.org] On Behalf Of Jeffrey Haas
Sent: Tuesday, March 1, 2022 4:22 PM
To: idr@ietf.org
Subject: [Idr] BGP autoconfiguration - draft-ymbk-idr-l3nd

Working Group,

One of the topics for our last IDR interim was an attempt to conclude
discussions on BGP autoconfiguration proposals prior to sending adoption
requests to the Working Group.  Just prior to that call going out, we were
requested to add one additional proposal for consideration that hadn't yet
been published: l3nd[1] and l3nd-ulpc[2].

The authors have requested a presentation slot at the upcoming IETF 113 for
this proposal.  Since this is a very late entrant, I'd like to try to get
discussion of the properties of this proposal started on the mailing list
using the same discussion points we've had from prior interim meetings.

The Working Group adopted requirements and some of the prior proposal
discussions are part of the design team document.[3]

Below, find my observations on the draft with questions and commentary
intermixed.  I finish this response with my high level analysis and personal
opinion on what these properties accomplish.

-----

Scoping discussion:

The l3nd proposal appears to be designed to start from similar premises as
l3dl.  l3dl has been previously discussed as a candidate for a layer 2 BGP
autoconfiguration proposal and is covered in [3]A.1.2.

:   This document (set) provides a similar solution at Layer-3,
:   attempting to be as similar as reasonable to L3DL.

The intended scoping is point to point links:

:   In this document, the use case for L3ND is for point to point links
:   in a datacenter Clos ([Clos])

This is somewhat unusual language since the typical deployment interface is
Ethernet.  While such Ethernet connections are typically deployed in a point
to point fashion, Ethernet itself is not specifically point to point.

There is also a pointer that the proposal may be intended for more general
purposes than BGP Autoconfiguration:

:   L3ND might be found to be widely applicable to a range of routing and
:   similar protocols which need Layer-3 neighbor discovery.

The proposal appears to be intended to be expandable to non-data center
cases.

:   While L3ND is designed for the MDC, there are no inherent reasons it
:   could not run on a WAN.  The authentication and authorization needed
:   to run safely on a WAN need to be considered, and the appropriate
:   level of security options chosen.

Such wider scoping was permitted by the autoconfiguration analysis but such
work was left as out of scope for the DC case.

l3nd seems to be targeting different levels of scale than prior proposals,
and that appears to motivate a different property:

:   The number of addresses of one Encapsulation type on an interface
:   link may be quite large given a TOR switch with tens of servers, each
:   server having a few hundred micro-services, resulting in an
:   inordinate number of addresses.  And highly automated micro-service
:   migration can cause serious address prefix disaggregation, resulting
:   in interfaces with thousands of disaggregated prefixes.
:
:   To meet such scaling needs, the L3ND protocol is session oriented and
:   uses incremental announcement and withdrawal with session restart, a
:   la BGP ([RFC4271]).

Given that the desired scoping context is point to point (typically a small
number of addresses on a given link), it'd be helpful if the authors discuss
scenarios wherein a large number of BGP sessions are expected to be
discovered on such a link.

-----

Transport layer considerations:

l3nd leverages a multicast hello message and requires a TCP (possibly
protected by TLS) session to do its work.

In comparison to other proposals previously discussed:
- draft-xu-idr-neighbor-autodiscovery was session based, with comparability
  to the LDP hello mechanism.  It uses TCP after multicast discovery.
- l3dl itself is session based and implements its own transport and session
  layers. 
- draft-raszuk-idr-bgp-autodiscovery, determined to be not a fit for the
  core data center case, used "BGP over BGP".
- draft-acee-idr-lldp-peer-discovery operates over layer 2 embedded in the
  LLDP protocol.  It is currently sessionless.  Discussion has suggested
  that if, in the future, security or large packet considerations drive
  requirements for larger packets, the LLDPv2 work currently being
  standardized in IEEE can serve those purposes.  LLDPv2 provides a session
  layer.
- draft-minto-idr-bgp-autodiscovery uses layer 3 IP multicast exclusively
  and does not have a session layer.

-----

Protocol state and FSM considerations:

- Similar to BGP, parallel sessions are a possibility.  I'm unclear how
  connection collisions are handled.
  + The procedures seem to try to cover this by saying "drop hello when
    there is an established session".  However, timing could result in two
    sessions happening in parallel prior to drop.
  + There isn't a clear idea of the FSM having an "established" state, or
    the related "OpenSent/OpenConfirm" states similar to BGP, that could
    prevent this?
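
For comparison, a minimal sketch of the BGP-style collision rule (RFC 4271,
Section 6.8), where both ends keep the connection initiated by the speaker
with the higher identifier.  This is not taken from the l3nd draft; the
integer identifiers and function names here are hypothetical and only show
how a deterministic tie-break lets both sides agree on which of two parallel
sessions to drop.

```python
def surviving_initiator(local_id: int, remote_id: int) -> int:
    """Return the identifier of the speaker whose outbound connection is kept.

    Mirrors the BGP rule: the connection opened by the higher identifier wins.
    The identifiers are assumed to be unique per speaker (an assumption here).
    """
    if local_id == remote_id:
        raise ValueError("identifiers must differ to break the tie")
    return max(local_id, remote_id)


def connection_to_close(local_id: int, remote_id: int) -> str:
    """Tell the local speaker which of its two parallel connections to close."""
    if surviving_initiator(local_id, remote_id) == local_id:
        return "remote-initiated"  # we win, so drop the peer's inbound connection
    return "local-initiated"       # peer wins, so drop our own outbound connection


# Both ends evaluate the rule independently and reach the same outcome.
a, b = 1, 2
print(connection_to_close(a, b))  # decision made by speaker a
print(connection_to_close(b, a))  # decision made by speaker b
```

Because each side evaluates the same comparison, no extra message exchange
is needed to agree on the survivor, but the rule does require a state in
which the peer's identifier is already known.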

Open message:
- If Attributes are single-octet values of local significance, is there any
  significance to receiving the same Attribute twice?  If not, consider
  adding text about their uniqueness.
  + Also, why so small?  Payload length appears to be 4 octets.  A larger
    value permits more flexibility for local semantics; e.g. bit vectors.
  + Is there significance to resuming a session using the documented
    procedures when the Attributes may not be identical to the ones
    previously received in the interrupted session?
- For clarity, consider labeling Serial Number as "last received serial
  number".

Ack message:
- The nybble next to Etype needs some flavor of the usual text to deal with
  its contents: "Must be zero on transmit, should be ignored on receipt."
- The Etype conditions are a bit lax for a useful state machine for what we
  have documentation for:
  + If Open can't be acked successfully, should the session be permitted to
    continue?
  + Is "restart is hopeless" a shutdown?  An admindown?  What should the
    implementation do with its PDUs?  Should it tear down the session?
- 3 seconds seems a bit low for a response timeout.  While the protocol
  seems to be structured toward easy restart, why disconnect so
  aggressively?
- The message exchange protocol is "one outstanding ack" which means there
  is no pipelining of the various messages.  Given this, the lack of a
  serial number being acked in the message is understandable, but seems like
  it'd be good debug info.
  + That lack of pipelining means that full exchange of state will be gated
    on the slowest party on the session, whether the transmitter or the
    acknowledger.
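
A rough, illustrative model of the cost of "one outstanding ack" versus
pipelining.  The formulas are a simplification, and all of the numbers
(round-trip time, per-PDU processing costs, PDU count) are hypothetical
values chosen for the example, not measurements or figures from the draft.

```python
def stop_and_wait_time(n_pdus: int, rtt: float,
                       sender_cost: float, acker_cost: float) -> float:
    """Every PDU pays a full round trip plus both ends' processing time."""
    return n_pdus * (rtt + sender_cost + acker_cost)


def pipelined_time(n_pdus: int, rtt: float,
                   sender_cost: float, acker_cost: float) -> float:
    """With pipelining, the round trip is paid once; after that the exchange
    proceeds at the rate of the slower of the two ends."""
    bottleneck = max(sender_cost, acker_cost)
    return rtt + sender_cost + acker_cost + (n_pdus - 1) * bottleneck


# Hypothetical inputs: 1000 PDUs, 1 ms RTT, 0.1 ms to send, 0.5 ms to ack.
args = (1000, 0.001, 0.0001, 0.0005)
print(f"one outstanding ack: {stop_and_wait_time(*args):.4f} s")
print(f"pipelined:           {pipelined_time(*args):.4f} s")
```

Under this model, stop-and-wait scales with the sum of the RTT and both
processing costs per PDU, whereas the pipelined exchange scales only with
the slower end.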

Encapsulations:
- The text for addressing conflicts needs to specify the Etype along with
  the Error Code.
  + It can't be 0, because that says No Error.
  + It's probably 1 as a warning?
  + In general, normative text needs to specify Etype + Error Code together.
- 10.1, the procedure for expecting an ACK seems conflicting with the
  retransmission text in 9.1.  In that, it says if you're expecting an ack
  and don't get one, the session should be closed.  However, it's implied
  (but unclearly so) that you should have one pending ack at a time.  If
  retransmits of an encapsulation pdu are possible, this means multiple
  outstanding acks may be needed.  In such a case, serial numbers being
  acknowledged are more desirable.
  + In general, the text here for retransmit vs. ack seems to be a mismatch
    with the expectations of working on a reliable stream protocol like
    TCP.
  + Similarly, the "link is broken below layer 3" comment seems a mismatch
    since TCP shouldn't pay direct attention to that.
- 24 bits is a peculiar length for a counter.  Why this size?

Encapsulation Flags:
- 10.2 typo on Encapsulation
- Primary is interesting operational state.  What should an implementation
  do if it gets conflicting primary entries?

Encapsulations (varying):
- You can pack more than one prefix in a PDU for a given serial number.  In
  the event that a single prefix is an error (see question above about Etype
  for such cases), are all prefixes in the bundle intended to be rejected?
  + How would the ack receiver know which prefix is the bad one if the whole
    bundle isn't rejected?

MPLS Label List/Encapsulation:
- After our "fun" dealing with label issues covering RFC 3107 and others,
  please make sure to tell people the obvious: the label stack needs to be
  well formed and bottom of stack bit should only be set on last entry.
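
A minimal sketch of the kind of check meant here, assuming the 4-octet
label stack entry layout of RFC 3032 (20-bit label, 3-bit traffic class,
1-bit bottom-of-stack, 8-bit TTL).  Whether l3nd's label list uses this
exact encoding is an assumption; the function names are hypothetical.

```python
import struct

BOTTOM_OF_STACK = 0x100  # S bit within a 32-bit RFC 3032 label stack entry


def encode_label_stack(labels: list) -> bytes:
    """Encode labels with TC=0 and TTL=0, setting S only on the last entry."""
    if not labels:
        raise ValueError("label stack must not be empty")
    out = b""
    for i, label in enumerate(labels):
        if not 0 <= label < 2 ** 20:
            raise ValueError(f"label {label} does not fit in 20 bits")
        s_bit = BOTTOM_OF_STACK if i == len(labels) - 1 else 0
        out += struct.pack("!I", (label << 12) | s_bit)
    return out


def parse_label_stack(data: bytes) -> list:
    """Return the labels, rejecting stacks that are not well formed."""
    if not data or len(data) % 4:
        raise ValueError("label stack must be a non-empty multiple of 4 octets")
    count = len(data) // 4
    labels = []
    for i, (entry,) in enumerate(struct.iter_unpack("!I", data)):
        bottom = bool(entry & BOTTOM_OF_STACK)
        is_last = i == count - 1
        if bottom != is_last:
            raise ValueError(f"bottom-of-stack bit is wrong on entry {i}")
        labels.append(entry >> 12)
    return labels


print(parse_label_stack(encode_label_stack([100, 200, 300])))
```

The parser rejects both a bottom-of-stack bit set before the final entry
and a final entry that lacks it.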

Hello discussion:
- The English of the first paragraph is in need of significant work for
  clarity.
- I think this is trying to say that in the presence of a LAG that
  transmission over more than one LAG member might be okay.
- I also think that it's trying to say that once you have a session up you
  can stop hellos on the individual member links.
- I think the intent of the second paragraph is that if a link is on a
  multi-access network and non-p2p behavior is desired, keep sending hellos.

ULPC draft:

3.1:
:   A peer receiving BGP ULPC PDUs has only one active BGP ULPC PDU for
:   an particular address family on a specific link at any point in time;
:   receipt of a new BGP ULPC PDU for a particular address family
:   replaces any previous one.

Okay, implicit replace.  That's fine.

:   If there are one or more open BGP sessions, receipt of a new BGP ULPC
:   PDU does not affect these sessions 

Do not disturb existing sessions.  That's fine.

:   and the PDU SHOULD be discarded.

Probably not fine.  If the session bounces, you likely intend for it to
connect to the newly advertised entry.
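
One possible alternative to discarding, sketched under assumptions: hold
the newer ULPC entry as pending while a BGP session is open, and promote it
when that session closes.  This is not behavior specified by the draft; the
class, method names, and the (link, address family) key are hypothetical.

```python
class UlpcTable:
    """Tracks the peer address to use per (link, address family)."""

    def __init__(self):
        self.active = {}           # key -> address used for (re)connects
        self.pending = {}          # key -> newer address held during a session
        self.open_sessions = set()

    def receive(self, link: str, afi: str, peer_addr: str) -> None:
        key = (link, afi)
        if key in self.open_sessions:
            # Do not disturb the running session, but remember the update.
            self.pending[key] = peer_addr
        else:
            # Implicit replace: the newest PDU wins.
            self.active[key] = peer_addr

    def session_opened(self, link: str, afi: str) -> None:
        self.open_sessions.add((link, afi))

    def session_closed(self, link: str, afi: str):
        """Return the address the next connection attempt should use."""
        key = (link, afi)
        self.open_sessions.discard(key)
        if key in self.pending:
            self.active[key] = self.pending.pop(key)
        return self.active.get(key)


table = UlpcTable()
table.receive("eth0", "ipv4", "192.0.2.1")
table.session_opened("eth0", "ipv4")
table.receive("eth0", "ipv4", "192.0.2.2")    # arrives while session is up
print(table.session_closed("eth0", "ipv4"))  # address used after the bounce
```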

:   If a peer wishes to replace an open BGP session, they MUST first
:   close the running BGP session and then send a new BGP ULPC PDU.

This seems to assume a tight coupling between the implementations for BGP
and for the discovery mechanism.  A peer bounce without having first
advertised the new address can mean a race condition where the bounced peer
continues trying (and possibly failing) to establish the connection with the
old peer.

It'd be cleaner to remove the prior binding first.

However, there's no withdraw procedure currently specified for these TLVs to
remove peer addresses, only to advertise them.

:   For each BGP peering on a link here MUST be one agreed encapsulation,
:   and the addresses used MUST be in the corresponding L3DP IPv4/IPv6
:   Announcement PDUs.  If the choice is ambiguous, an Attribute may be
:   used to signal preferences.

How?  Among other things, the attributes are bound to a session and not any
individual address.

3.1.2/3.1.3 Prefix-Len is probably not appropriate for the addresses.  If
there's an intent to permit peering from more than one address in the
subnet, how should the implementation pick?  From state learned in l3nd
encapsulation bindings for an overlapping subnet?

3.1.4 Authentication: You're referring to multiple possible ways security
could work.  This likely means multiple authentication types and should be
its own code point.

3.1.5 "No flags are currently defined"... and then defines GTSM.

:   As the ULPC PDU may contain keying material, see Section 3.1.4, it
:   SHOULD BE signed.

What does signed mean in this instance?  Do you mean that the transport
session should be given secrecy properties by TLS?

:   Any keying material in the PDU SHOULD BE salted and hashed.

I'm very confused here.

The material that a BGP peer needs for bringing up a session is the key to
pass to the relevant algorithm like TCP-MD5 or TCP-AO.  Typically when we
discuss salt and hashing, we're discussing the ciphertext from those
operations.  

Are you suggesting you send the ciphertext with the intent that an
implementation can consult its own tables to find what input password for
that salt should be used?  If so, this is a flavor of pre-shared key.  It'd
also require coordination of salt values between the implementations.

If this is an obvious technique that I'm not familiar with, a pointer to the
procedure would be useful.
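
To make the concern concrete, a small sketch of what a receiver could do
with a salted hash of a key.  It assumes, hypothetically, that the PDU
carries a salt and a PBKDF2-SHA256 digest; the draft does not specify this,
and the names and iteration count are illustrative.  The point is that the
digest cannot be fed to TCP-MD5 or TCP-AO directly; the receiver can only
use it to select a key it already holds.

```python
import hashlib
import hmac
import os

ITERATIONS = 10_000  # illustrative value only


def salted_digest(key: bytes, salt: bytes) -> bytes:
    """One-way digest of the key; the key cannot be recovered from it."""
    return hashlib.pbkdf2_hmac("sha256", key, salt, ITERATIONS)


def find_configured_key(salt: bytes, digest: bytes, configured: dict):
    """Return the name of the locally configured key matching the digest."""
    for name, key in configured.items():
        if hmac.compare_digest(salted_digest(key, salt), digest):
            return name
    return None  # no local key matches, so the session key remains unknown


# Sender side: advertises (salt, digest), never the key itself.
salt = os.urandom(16)
advertised = salted_digest(b"fabric-secret", salt)

# Receiver side: succeeds only if the same key was pre-provisioned.
print(find_configured_key(salt, advertised, {"dc-fabric": b"fabric-secret"}))
print(find_configured_key(salt, advertised, {"other": b"something-else"}))
```

In other words, under these assumptions the scheme behaves as a selector
over pre-shared keys rather than as a key distribution mechanism.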

-----

Misc:
- The embedded attempts at humor decrease the clarity of the document for
  readers that are less familiar with English and should be excised.
- In the PDU diagrams, explicitly document what the field lengths are.
  Don't make people interpret ASCII diagrams.
- The HELLO packet is itself without security.  In circumstances in which a
  man-in-the-middle attack is possible and raw TCP is offered as an option,
  active interception and replacement of the HELLO is sufficient to cause
  the less secure mechanism to be used.
- Section 7, the text about GTSM as a SHOULD on SYN should simply be GTSM on
  the TCP session.  GTSM applies to all of the packets in a TCP session.
- Section numbering under encapsulations could be more consistent.  For
  example, each section for a PDU type.  MPLS Label List probably shouldn't
  be a section at the same level as the mpls encapsulation pdus that use it.
- The session is long-lived but not protected similarly to BGP; i.e. tcp-md5
  or tcp-ao.  Given that the intent of the mechanism is to set up BGP, which
  may use those properties, but uses TLS for protection, this isn't
  surprising.  But it means that the sessions are vulnerable to the reset
  attacks those TCP mechanisms are intended to protect against.

-----

Jeff's read of the proposal:

1. The proposal is session based.  While the prior discussion was primarily
in the context of draft-xu, sessions were seen as potentially problematic
for a number of reasons.  They also had some possibly useful properties.
1a. Session based makes large amounts of state easy to deal with.  That
said, the requirements show we need very little state for a given session.
That's true even in the ULPC portion of this proposal.
1b. These sessions are long-lived.  You'll have one for each interface you
want to run auto-discovery on.  This minimally doubles the number of active
sockets your stack will need because for each BGP session you get, you'll
have a lingering discovery session.  Perhaps they don't need to be
long-lived if the only thing you're using this for is BGP
auto-configuration?
1c. If you're going to use some flavor of TLS (including DTLS) as we
discussed during the requirements phase as a possible mechanism, you need a
session.
1d. If you're going to use something that rides on top of TCP, you have all
of the TCP vulnerabilities that TCP-MD5 and TCP-AO are intended to mitigate.
Using those mechanisms to mitigate those vulnerabilities means that you're already
solving the keying issue... and perhaps don't need TLS at all.

2. The hello is completely unprotected.  This means there's an edge case
man-in-the-middle condition.  Protecting it means you've already solved the
keying discussions and wouldn't need TLS.

3. TLS crypto infrastructure is operationally weighty if you're covering the
usual things:
- What certificate authority are you using?
- How is revocation handled, if at all?
- If you want to use certificates for authentication, you now have to get
  those distributed to all of your participating routers.
- Certificates have lifetimes - although potentially very long lived ones.
  You have to now worry not only about usual key rollover considerations
  that all of the likely mechanisms need, but also making sure you don't
  lose your auto-discovered sessions on a bounce because a certificate
  expired.

I'm loosely aware of the ACME work in IETF and perhaps the intent is that
these things are no longer quite as operationally cumbersome.  However,
that's an entire ecosystem that needs to be integrated into your router
operations if that's the case.

4. TLS provides integrity to the transport session and its contents, even if
it can't protect against attacks on TCP itself.  Privacy is optional,
although given the somewhat odd ULPC security text it seems that a scenario
under consideration is some form of distribution of pre-shared keys or
something similar.  However, the majority of the state for ULPC and l3nd
isn't exactly secret; it's stuff you'll see in ARP and ND if you're on link.

I don't think most models consider keychain information to be something
that needs privacy.

5. The proposal likely shouldn't call itself BGP-like.  The explicit ack
model minimally makes that not the case.  The state machine isn't terribly
comparable.

In particular, the Serial Number restart infrastructure seems to be very
heavy-weight for something running over a reliable transport like TCP.  But
this perception is tied to the idea that the BGP piece of this is small.
And even if there's a modestly large number of interfaces on the link, the
protocol should be able to send the entire bulk of them in a small number of
TCP frames.  (Compare vs. BGP send rates in even low end implementations.)
So, why complicate the mechanism with a highly stateful restart mechanism?
I suspect this makes more sense for the layer 2 proposal.

6. The sheer weight of this thing would make more sense if it's a more
general mechanism that is already deployed and the small BGP piece is along
for the ride.  I understand the motivation for a feature like this at layer
2, especially in the lsvr context.

I don't understand the use case for something like this in its current form
at layer 3.  What's that use case?

7. The crypto itself makes the autoconfiguration mechanism a point of attack
on the system rather than something to mitigate the attack.  The overhead to
deal with an on-link attack on TCP-MD5/-AO (symmetrical ciphers) is lower
than the overhead of TLS negotiation.  Rate limiting the creation of
incoming l3nd sessions might help things to some extent, but then you're
worrying similarly about TCP session exhaustion as well.

-- Jeff

[1] https://datatracker.ietf.org/doc/html/draft-ymbk-idr-l3nd-00
[2] https://www.ietf.org/rfcdiff?url2=draft-ymbk-idr-l3nd-ulpc-02.txt
[3] https://datatracker.ietf.org/doc/html/draft-ietf-idr-bgp-autoconf-considerations-02
