Re: [Idr] Comments on draft-ietf-idr-5g-edge-service-metadata

Jeffrey Haas <jhaas@pfrc.org> Sun, 05 November 2023 10:18 UTC

Return-Path: <jhaas@slice.pfrc.org>
X-Original-To: idr@ietfa.amsl.com
Delivered-To: idr@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id A6D82C17C880; Sun, 5 Nov 2023 02:18:18 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -6.908
X-Spam-Level:
X-Spam-Status: No, score=-6.908 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_HI=-5, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, URIBL_DBL_BLOCKED_OPENDNS=0.001, URIBL_ZEN_BLOCKED_OPENDNS=0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id BttCodchkmq1; Sun, 5 Nov 2023 02:18:16 -0800 (PST)
Received: from slice.pfrc.org (slice.pfrc.org [67.207.130.108]) by ietfa.amsl.com (Postfix) with ESMTP id 2E143C17C894; Sun, 5 Nov 2023 02:17:10 -0800 (PST)
Received: by slice.pfrc.org (Postfix, from userid 1001) id 6930D1E2F2; Sun, 5 Nov 2023 05:17:10 -0500 (EST)
Date: Sun, 05 Nov 2023 05:17:10 -0500
From: Jeffrey Haas <jhaas@pfrc.org>
To: Linda Dunbar <linda.dunbar@futurewei.com>
Cc: "draft-ietf-idr-5g-edge-service-metadata@ietf.org" <draft-ietf-idr-5g-edge-service-metadata@ietf.org>, "idr@ietf.org" <idr@ietf.org>
Message-ID: <20231105101710.GA27820@pfrc.org>
References: <20231027210120.GA22739@pfrc.org> <CO1PR13MB4920E09EA39BE24554F8245485A2A@CO1PR13MB4920.namprd13.prod.outlook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CO1PR13MB4920E09EA39BE24554F8245485A2A@CO1PR13MB4920.namprd13.prod.outlook.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/5GE7YL6XvRzsXiJy63RIN_hh3a4>
Subject: Re: [Idr] Comments on draft-ietf-idr-5g-edge-service-metadata
X-BeenThere: idr@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Inter-Domain Routing <idr.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/idr>, <mailto:idr-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/idr/>
List-Post: <mailto:idr@ietf.org>
List-Help: <mailto:idr-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/idr>, <mailto:idr-request@ietf.org?subject=subscribe>
X-List-Received-Date: Sun, 05 Nov 2023 10:18:18 -0000

Linda,

On Sun, Oct 29, 2023 at 09:45:19PM +0000, Linda Dunbar wrote:
> Thank you very much for the detailed review and the comments.
> Please see below for the detailed resolutions.
> A couple of them I would need to chat with you during IETF118 for the exact changes.

Feel free to grab time with me when you see me.

> > While the intent of this feature appears to be "walled garden" scenarios, the fact that BGP is used in these environments in circumstances that can leak on accident, these issues should be discussed.
> 
> [Linda] The Security Consideration of the Version -11 has the following statement:
>       "The ingress routers should not propagate the Edge Service Metadata to any nodes that are not within the trusted domain."

That hasn't been working out very well for us in bgp-land.  There needs to
be an enforcement mechanism.

> Should we expand this statement with the following ?
>       BGP Route Filtering or BGP Route Policies [RFC 7454] can be used to ensure that BGP update messages with Metadata Path Attribute attached do not get forwarded out of the administrative domain.  BGP route filtering  [RFC 7454]  allows network administrators to control the advertisements and acceptance of BGP routes, ensuring that certain routes do not leak outside of the intended administrative domain. Here are the steps to achieve this:
> 
>       Use Route Filtering: Implement route filtering policies on the ingress routers to restrict the propagation of BGP update messages for the registered 5G edge services beyond the administrative domain. You can use access control lists (ACLs), prefix lists, or route maps to filter the BGP routes classified as the 5G edge services, which need the Metadata Path Attributes to be distributed from egress routers to ingress routers.
> 
>        Filter by Prefix: Use prefix filtering to specify which IP prefixes should be advertised to peers and which should be suppressed. This step ensures that only authorized routes are sent to external peers.
> 
>       Use Route Maps: Route maps provide a flexible way to filter and manipulate BGP route advertisements. You can create route maps to match specific conditions and then apply them to the BGP configuration.
> 
> 
> The majority of the PDU considerations are covered, but not all.  It'd be good to get some clarity on the items flagged below as they are likely to impact the ability to have interoperable implementations.

The necessary filters are on the new attribute itself rather than other,
more traditional, route properties.  Additionally, the real consideration
has to do with what happens when filtering fails either due to
misconfiguration, or other reasons.

> > I find the site-id procedures to be unclear and some additional text might be helpful.
> 
> [Linda] the Site-ID in this document is an identifier for a group of routes associated with a common physical characteristic, for example a pod, a row of server racks, a floor, or an entire DC.  The purpose is to use one UPDATE message to indicate a group of routes being impacted by a physical event. Those routes might be from different address families or NLRI.

I understand that, but where exactly is that tagged in the PDU?

> The Version 11 is specified as the following:
>       "Identifier for a site, which can be a pod, a row of server racks, a floor, or an entire DC. There could be multiple sites connected to the egress router (a.k.a. Edge DC GW)."
> 
> Is it clearer to change to the following?
>       "Site ID is an identifier for a group of routes associated with a common physical characteristic, for example, a pod, a row of server racks, a floor, or an entire DC. The purpose is to use one UPDATE message to indicate a group of routes impacted by a physical event. Those routes might be from different address families or NLRIs. There could be multiple sites connected to one egress router (a.k.a. Edge DC GW)."

Better text, but the real question is where the site ID is tagged in the PDU?

> > The potential inconsistencies in route selection and the partial deployment model MUST be addressed.  I suspect the intended deployment scenario is such that this may not be a problem when the feature is kept contained in a consistently deployed walled garden.
> 
> [Linda] Do you mean when some nodes don't support Metadata Path Attribute?
> is it appropriate to add a statement in Section 3?
>       The goal of this Edge Service Metadata Path Attribute is for egress routers to propagate the metrics about their running environment for a subset of Edge Services to ingress routers so that the ingress routers can make path selections based on not only the routing cost but also the running environment for those edge services. The BGP speakers that don't support the Metadata Path Attribute can ignore the Metadata Path Attribute in a BGP UPDATE Message. All intermediate nodes can forward the entire BGP UPDATE as it is.

The fundamental issue is that when subsets of the BGP Speakers in a network
inconsistently do route selection based on the presence of the new
attribute, you can end up with things like loops.

> The feature does note that filtering of the attribute may be necessary.
> However, we're getting very publicly reminded that path attributes often go further than we like, and cause outages and security issues.  I'd strongly recommend the attribute escape scenario be dealt with prior to early assignment.
> [Linda] What do you mean by "the attribute escape scenario"?

As pasted below:
https://datatracker.ietf.org/doc/draft-haas-idr-bgp-attribute-escape/

Also, see presentations in IDR 117.

> > Transitivity:
> > =============
> > 
> > The transitivity of thie new path attribute isn't currently clear in the document.  Here are some conflicting sections:
> > 
> > : 4.1. Metadata Path Attribute
> > :
> > : The Metadata Path Attribute is an optional transitive BGP Path attribute to
> > : carry metrics and metadata about the edge services attached to the egress
> > : router.
> > 
> > Here, we're saying the attribute is optional, transitive.
> 
> [Linda] In our current implementation, the Metadata Path Attribute is "Transitive" only on the RR, but NOT forwarded by the ingress nodes.  Majority of the routes' UPDATE don't have Metadata Path Attribute.  The Metadata Path Attribute is only included for a small set of prefixes.
> Can you suggest a better term to describe this?

Transitivity of Path Attributes is defined in RFC 4271, §4.3.  It's a key
property of BGP.  Please spend some time understanding this property.  The
attribute escape document can help you understand it better.

The request is to properly document the path attribute's transitivity so we
can discuss its protocol behaviors.

> > : 4.1.1. Metadata Path Attribute Handling Procedure
> > : [...]
> > :
> > : When a BGP Speaker does not recognize some of the Sub-TLVs within one
> > : Metadata Path Attribute in a BGP UPDATE message, the BGP Speaker should
> > : forward the received BGP UPDATE message without any change if the BGP UPDATE
> > : message is marked as transitive.
> > 
> > BGP UPDATEs themselves aren't "transitive".  What's the intention here?
> > Simply that if the BGP route is propagated that unrecognized TLVs should be propagated?
> [Linda] RFC 4271 says that the Transitive (bit 4 set to 1)  of the BGP Attribute Header: If the transitive bit is set to 1, the attribute is transitive. This means that the attribute must be passed along to other BGP routers in the path.
> I meant to say that if the Transitive bit is set to 1, the BGP speaker should forward the received BGP UPDATE unless the BGP Speaker is an AS boundary node.

That's not what transitive means.  Again, please see the attribute escape
document to help understand this property.

I'll skip responding to the other points relating to transitivity in this
reply, but please address them once you've had time to read the escape
document.

> > Other PDU details:
> > ==================
> > 
> > : 4.1.2. TLV Format
> > : [...]
> > :
> > : - The third high-order bit (bit 2): same as specified by RFC4721.
> > : - The fourth high-order bit (bit 3): set to 1 to indicate there are two
> > :   octets for the Length field.
> > 
> > Don't try to describe the details here.  It's base BGP protocol behavior.
> > In particular, the text covering the extended length behavior is erroneous and had been mentioned previously in another thread.  The Length will be one or two octets based on the length of the contained attribute value.
> 
> [Linda] Should those two bullets be deleted?
> Possible to sit down with you in IETF118 to discuss?

Yes, they should be deleted.  Just as importantly, you and anyone else who
wants to specify new path attributes should understand these core details
when authoring BGP extensions.

> > : 4.3. Capacity Availability Index Metadata
> > :
> > : The Capacity Availability Index Sub-TLV:
> > :
> > :  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > : |      CapAvailIdx Sub-Type     |         Reserved              |
> > : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > : |        Site-ID (2 octets)     | Site Availability Percentage  |
> > : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > 
> > Where's the Length?  Is Reserved a typo?
> [Linda] There is no Length field. Only 4 octets.
> Change the text to the following:
>         "The Capacity Availability Index Sub-TLV has a fixed length of 4 Octets. Therefore, there is no Length field."

Um, don't do that.  If you have a TLV protocol, don't throw in exceptions as
to when it's a TLV vs. a fixed sized field.

> > Note that the site-id and site availability percentage breaks your "all sub-tlv values are 32 bit integers". (Noted below commenting on §4.1.2)
> [Linda] two Octets is enough to represent a percentage: e.g., 100%, 50%, or 0%.

I'm not arguing that you can't put the information in a smaller field.  You
have presumably normative text in the document (cited in the reply) that
says "these things are all 32 bit integers".  If that's not true, remove
that text.

> > : 4.4.3. Raw Load Measurement Sub-TLV
> > : [...]
> > : Raw Load Measurement Sub-TLV has the following format:
> > :
> > :    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   | Raw-Load-Measurement Sub-Type |               Length          |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   |                   Measurement Period                          |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   |           total number of packets to the Edge Service         |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   |           total number of packets from the Edge Service       |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   |           total number of bytes to the Edge Service           |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > :   |           total number of bytes from the Edge Service         |
> > :   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > 
> > This also breaks the "all sub-tlv values are 32 bit integers" rule.
> 
> [Linda]  Each value is represented by 32 bits. How does it break the rule?

In a TLV protocol, the field after the length is the value.
This value is a vector of 5 32-bit items, thus longer than 32-bits.

> > Additionally, these values likely want to be 64-bits wide.  Given network capacities and speeds these days 32 bits is likely too small.
> [Linda] That is a good point. Does 64 bits break the 32 bits rule?

I think you want to delete your 32-bit text - it's just wrong.
As for size of the fields, this is your proposal.  I'd suggest making them
longer.

> > : - Measurement Period: BGP Update period in Seconds or user-specified period.
> > 
> > Bad idea.  Either specify seconds or provide an additional field so that that measurement quantum can be determined by the applications or management platforms.
> 
> [Linda] The 32 bits Measurement Period is intended to indicate the Period. Can you suggest another format?

The issue is "pick one".  Don't make it the field semantically mean more
than one possible thing in an unqualified fashion.

> > Error Handling:
> > ---------------
> > 
> > : 4.1.3. Error Handling
> > :
> > : A BGP speaker SHOULD NOT include more than one Metadata Path Attribute in
> > : one BGP Update message.
> > 
> > Largely shouldn't be mentioned.  It's a session fatal error per RFC 4271, §6.3.
> [Linda] Is it better to change the text to the following?
>       "A BGP speaker  MUST NOT include more than one Metadata Path Attribute in one BGP Update message. Per [RFC4271] Section 6.3, A BGP speaker must not include multiple instances with the same type for the Sub-TLVs specified in this document in one Metadata Path Attribute."
> 
>   I thought it is better to emphasize this.

Ideally we avoid restating core protocol procedures.  When that's done, it
has the appearance, when it's not a direct quote, of being a normative
update to the base protocol.  We don't do that.

> > : If one of the Sub-TLVs has an invalid value, e.g., out of its specified.
> > : ranges, the Sub-TLV with the invalid value is ignored by the BGP receiver.
> > 
> > Ignored is fine.  Should the TLV be stripped prior to propagation?  Unlike unknown TLVs, a known TLV with an invalid value could be cleaned up.
> [Linda] Is it OK just forward as it is? as the other nodes might be able to handle the value.

That's the question here.  Practice varies.

If you have a value that one device knows is outright invalid rather than
of unknown validity, you normally should trigger your error handling
procedures.

If you locally ignore it but propagate it, you have equal chances of "this
was really correct, but the local device thinks it isn't" and "this was
really wrong, and the next device may experience a fault trying to handle
it".  Also known as, 'why did you send me junk?'

I'd recommend cleaning up (do not propagate) known bad items.

> > : By default, no notification is required unless configured to send a
> > : notification to its management system.
> > 
> > I don't think you mean to use "notification" here since that word tends to be used for the NOTIFICATION PDU as part of a session reset.  Is your intent that logging of the error locally or to a management system is optional rather than a SHOULD?
> > 
> [Linda] yes. Logging to management system is optional. By default, it is not.
>       Logging the error locally or to a management system is optional.

Then you'll want to change the wording to logging rather than notify.

> > 4.3. Capacity Availability Index Metadata Issues:
> > -------------------------------------------------
> > 
> > Section 4.3.2 discussses that this TLV is used to help decorate routes that have a nexthop where the routes share a site-id.  However it's unclear if this TLV is intended to be used BOTH on the routes used as nexthops and the routes that resolve over said nexthops as the correlator?
> [Linda] How about changing the first paragraph to the following?
> 
>       The Capacity Availability Index indicates the percentage of impact on a group of routes associated with a common physical characteristic, for example, a pod, a row of server racks, a floor, or an entire DC. The purpose is to use one UPDATE message to indicate a group of routes of different NLRIs impacted by a physical event. For example, a power outage to a pod can cause the Capacity Availability Index to be 0% for all the routes in the pod. Partial fiber cut to a row of shelves can cause the Capacity Availability Index to 50% for all the routes in those shelves. The value is 0-100, with 100% indicating the site is fully functional, 0% indicating the site is entirely out of service, and 50% indicating the site is 50% degraded.
>        It is recommended to assign each route with one Site-ID. Depending on deployment, one DC can use POD number as Site-ID, another DC can use Row of Shelves as the Site-ID
> 
> > Can routes used as nexthops have more than one site-id bound to them?
> > Can the routes resolving over them?
> [Linda] It is recommended to assign each route with one Site-ID. Having multiple Site-ID, even though more flexible, is too complicated.  Depending on deployment, one DC can use POD number as Site-ID, another DC can use Row of Shelves as the Site-ID.  See the updated text for Section 4.3.2 above.
> 
> 
> > If the same TLV is used on the routes resolving over the nexthops, how is the site availability percentage filled in?
> [Linda] BGP UPDATE with standalone Site Availability Index is NOT intended for resolving NextHop.

I think the necessary thing is to review how site index is associated with
routes and used for resolution before trying to craft new text for it.


> > The following text is also unclear:
> > : However, it is unnecessary to include the Site Capacity Availability Index
> > : for every BGP Update message if there is no change to the site-reference
> > : identifier or the Capacity Availability value for the service instances.
> > 
> > BGP uses implicit withdraws when UPDATES advertise a given NLRI with a set of Path Attributes.  If the Path Attributes contain metadata that excludes the capacity TLV, I'd presume that it goes away from the route in question.
> > What is the intent of the text above?
> [Linda] the intent is to say that the Metadata Path Attribute values stays unchanged if not included in the UPDATE.  Is it OK?

No.

Every update containing an NLRI is considered to replace the NLRI's path
attributes that were previously received with the attributes in the current
update.  This is a fundamental BGP property.


> > Route Selection Considerations:
> > -------------------------------
> > 
> > : 5.2. Integrating with BGP decision process
> > 
> > : For the selected services configured to be influenced by the Edge Service
> > : Metadata, the ingress router BGP Decision process [IDR-CUSTOM-DECISION]
> > 
> > While the custom decision process draft is adopted work, it's not widely implemented.
> > 
> > If the proposal is using the cost community, what is the recommended contents of that community for these procedures?
> > 
> [Linda] Robert asked us to use the BGP Decision process [IDR-CUSTOM-DECISION]. Can we chat how to write this at IETF 118?

We can chat about this, but my discussion point (for the WG's consumption)
will be "don't use custom decision".  If you still want to use that, get
Robert to supply text.

> > Inconsistent Route Selection Considerations:
> > ============================================
> > 
> > Devices that recognize TLVs inconsistently may have inconsistent route selection.  This should be flagged as an issue.
> [Linda] having inconsistent route selection should be OK as any of the destinations can serve the request.

You like route loops?  That's what you can get if you have inconsistent
route selection.

> > Features that impact BGP's route selection need significant additional scrutiny, and IDR review hasn't always been great about catching such things.  The primary issue is that when such features are inconsistently deployed, or their inputs are inconsistently made use of (see comment above), BGP Speakers in the network can come to different conclusions about what the active route should be.
> [Linda] We are talking about a very small set of selective Edge services routes that are instantiated in the Edge DC directly attached to the Egress routers. Majority of the routes don't have Metadata Path Attributes attached.

Walled garden scenarios are not going to excuse security and route selection
impacts in BGP these days.  The issues need to be addressed.

If not, consider something that isn't BGP.

> > RTC procedures:
> > ===============
> > 
> > : 6. Service Metadata Propagation Scope
> > :
> > : For each registered low-latency Service, BGP RT Constrained Distribution
> > : [RFC4684] can be used to form the Group interested in the Service. The
> > : "Service ID", an IP address prefix, is the Route Target.
> > 
> > Is the intention here that a general-purpose IP-formatted route-target should contain the service ID?
> [Linda] Yes.
> > 
> > This seems to be limiting the feature's use to only VPN service routes using this specific format.  If that's not the case and this is intended as a way to mark subsets of routes using this feature for constrained distribution, you probably want to use a new extended community type/subtype.
> [Linda] the 5G edge service Metadata Path Attribute is not intended for VPN. Rather, using the IP prefixes.

route-targets are for VPNs.  If the intent is that this isn't a VPN, pick a
new extended community and suggest rtc semantics get used for it.

Basically, "don't overload route-targets".

> Attribute Escape Considerations:
> ================================
> > This new attribute may be attached to Internet scoped routes.
> [Linda] the Metadata Path Attribute is NOT intended for Internet scoped routes. Rather, for the limited routes instantiated in 5G edge DCs which are connected with the 5G ingress routers by 5G Local Data Networks.

Are those IPv4/IPv6 / Unicast routes?
You can't tell at the afi/safi level when those are "Internet scoped" or
not.  If you have a different afi/safi for this feature, and no leaking from
that family to/from IPv4/6 Unicast routes, then this point is better
addressed.

I don't think that's what you're doing.

> > Section 9 and Section 10 attempt to limit the deployment of this feature within "trusted domains" "between Ingress and egress routers of one single BGP domain".
> [Linda] Yes.

See prior comment about walled gardens.

> > Section 4.1.1 attempts to address the scoping consideration further by:
> > : In order to prevent distribution of the BGP Metadata Path Attribute beyond
> > : its intended scope of applicability, attribute filtering SHOULD be deployed
> > : to remove the BGP Metadata Path attribute at the administrative boundary.
> > 
> > As addressed in draft-haas-idr-bgp-attribute-escape, such filtering desires and expectations of limited domains have tended to be wishful thinking and we keep ending up with operational accidents.
> [Linda] any suggestions for the wording?  The Metadata Path Attribute is for the limited routes instantiated in 5G edge DCs which are connected with the 5G ingress routers by 5G Local Data Networks.

See text immediately below.

> 
> I would STRONGLY suggest that the Path Attribute definition be updated to provide additional scoping information wherein a remote BGP domain receiving an escaped metadata Path Attribute can determine that it should NOT be locally used for the procedures discussed in this document.
> [Linda] The Introduction section already has the following statement:
>       This document describes a new Metadata Path Attribute added to a BGP UPDATE message [RFC4271] for egress routers to advertise the Metadata about a subset of  edge services directly attached to the egress routers.
> 
> Is it enough?

No.

-- Jeff