[Idr] Some comments on draft-spaghetti-idr-bgp-sendholdtimer-09
Jeffrey Haas <jhaas@pfrc.org> Tue, 07 March 2023 22:23 UTC
Return-Path: <jhaas@slice.pfrc.org>
X-Original-To: idr@ietfa.amsl.com
Delivered-To: idr@ietfa.amsl.com
Received: from localhost (localhost [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id 668AEC14CF1A; Tue, 7 Mar 2023 14:23:13 -0800 (PST)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.899
X-Spam-Level:
X-Spam-Status: No, score=-1.899 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, RCVD_IN_DNSWL_BLOCKED=0.001, RCVD_IN_ZEN_BLOCKED_OPENDNS=0.001, SPF_PASS=-0.001] autolearn=ham autolearn_force=no
Received: from mail.ietf.org ([50.223.129.194]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id DjbgzhXPG0la; Tue, 7 Mar 2023 14:23:12 -0800 (PST)
Received: from slice.pfrc.org (slice.pfrc.org [67.207.130.108]) by ietfa.amsl.com (Postfix) with ESMTP id A9D23C1524DB; Tue, 7 Mar 2023 14:22:41 -0800 (PST)
Received: by slice.pfrc.org (Postfix, from userid 1001) id DBEE71E039; Tue, 7 Mar 2023 17:22:40 -0500 (EST)
Date: Tue, 07 Mar 2023 17:22:40 -0500
From: Jeffrey Haas <jhaas@pfrc.org>
To: draft-spaghetti-idr-bgp-sendholdtimer@ietf.org, idr@ietf.org
Message-ID: <20230307222240.GA16033@pfrc.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.5.21 (2010-09-15)
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/u_1u4fa3JpHHvwBzVYTEde8IDFg>
Subject: [Idr] Some comments on draft-spaghetti-idr-bgp-sendholdtimer-09
X-BeenThere: idr@ietf.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: Inter-Domain Routing <idr.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/idr>, <mailto:idr-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/idr/>
List-Post: <mailto:idr@ietf.org>
List-Help: <mailto:idr-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/idr>, <mailto:idr-request@ietf.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Mar 2023 22:23:13 -0000
Authors, Here's some comments on your proposal: In general, the draft will take a tone that the situation where the sendhold-timer feature is required because of "attacks". I don't think that serves to illustrate the scenario. It is also likely to bring undesired and poorly focsued attention from our security-minded people in the IETF. I'd suggest restraining your comparisons to clogged pipes at most. :-) Another general thought is the document is currently written solely with the thought that the only desired action is to bounce the session. While I agree that this is probably the most desired case for long-stalled sessions, have you given any consideration to permitting the mechanisms solely to raise an alarm and permit manual intervention? For short timers, this may be more desirable than aggressively bouncing a session in some circumstances; e.g. the remote side is undergoing an upgrade and has intentionally stalled the session. One detail that could use some expansion is more detail as to when you bump the MsgSent (noted below) counter. While you might consider it obvious that it should be bumped only when the entire BGP PDU is put into the socket, you'd be surprised at what some other people are likely to decide is fine. This clarifies the partial writes of the PDU scenario. Finally, before the misc. comments, while it's not your proposal, it might be useful to contrast this vs. tcp user-timeout. (Ignoring the inconsistent support in various stacks.) That mechanism works at a much finer level of granularity than your proposal in the sense that any unacknowledged TCP state within the user-defined time can result in the session being dropped. For your proposal, you're relying on counters being reset on PDU boundaries: - This means that in the face of large PDUs (max 4k without extension), and sluggish receivers, it may be difficult to reset the sendholdtimer. - By contrast, sluggish but still moving receivers won't trip the user time out feature as much. - As RFC 8654 gets deployed in the future, this constrast will become more of an issue. Misc comments: 11 Abstract 13 This document defines the SendHoldTimer session attribute for the 14 Border Gateway Protocol (BGP) Finite State Machine (FSM). SendHoldTime is the session attribute. Updating the document to distinguish the SendHoldTime vs. its timer should be done. Compare vs. the other session attributes that are times and their respective timers in RFC 4271. 107 2. Example of a problematic scenario 109 In implementations lacking the concept of a SendHoldTimer, a 110 malfunctioning or overwhelmed remote peer may cause data on the BGP 111 socket in the local system to accumulate ad infinitum. This could 112 result in forwarding failure and traffic loss, as the overwhelmed 113 peer continues to utilize stale routes. The above is an example where saying less may be helpful. It's not necessary to speculate what went wrong. All that is necessary is to say that stalled sessions can result in stale routing state downstream of the stalled session. 115 An example fault state: as BGP runs over TCP [RFC9293] it is possible 116 for hosts in the ESTABLISHED state to encounter a BGP peer that is 117 advertising a TCP Receive Window (RCV.WND) of size zero, this 0 118 window prevents the local system from sending KEEPALIVE, CEASE, 119 WITHDRAW, UPDATE, or any other critical BGP messages across the 120 network socket to the remote peer. Historically, many BGP 121 implementations were unable to handle this situation in a robust 122 fashion. Previous BGP RFC specifications would not give cause for 123 the session to be torn down in such situations. Here, I suggest you be cautious about worrying over-much about zero-windowing. This behavior is normal in TCP, and ideally short lived. If you must mention it at all, note that zero-windowing is an observable behavior when there is potentially long-lived congestion. 146 3.2. SendHoldTimer_Expires Event Definition [...] 154 If the SendHoldTimer_Expires (Event XX1), the local system: [...] 163 - drops the TCP connection, 165 - increments the ConnectRetryCounter, 167 - (optionally) performs peer oscillation damping if the 168 DampPeerOscillations attribute is set to TRUE, and Delete the "increments" and "peer damping". Those cases are handled by the Idle state machinery. 172 If the DelayOpenTimer_Expires event (Event 12) occurs in the Connect 173 state, the local system: 175 - sends an OPEN message to its peer, 177 - sets the HoldTimer to a large value, and 179 - sets the SendHoldTimer to a large value, and You should define a "large value" here compared to the default value of 8 minutes. The same comment applies to the section directly below the above one in the text. 201 3.3. MsgSent Event Definition 203 Section 8.1.5 [RFC4271] is extended as following: 205 Event XX2: MsgSent 206 Definition: An event is generated when a KEEPALIVE or UPDATE message is transmitted. 207 Status: Mandatory It's worth noting that this corresponds to the event that bumps this in the BGP MIB: bgpPeerOutTotalMessages OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The total number of messages transmitted to the remote peer on this connection." REFERENCE "RFC 4271, Section 4." ::= { bgpPeerEntry 13 } And similar in the BGP YANG modules (IETF and OC) By proxy, this means you have the ability to monitor the state of events that should reset the sendholdtimer. 209 3.4. Restarting the SendHoldTimer 211 On page 74 [RFC4271] before "If the local system receives an UPDATE 212 message, and the UPDATE message error handling procedure (see 213 Section 6.3) detects an error (Event 28), the local system:", add the 214 following: 216 If the local system transmits a KEEPALIVE or UPDATE message (MsgSent 217 (Event XX2)), the local system: 219 - restarts the SendHoldTimer, and 221 - remains in the Established state. I think this section is probably in error. The section you're doing surgery on is part of the update error detection and we're about to drop the connection. No additional work should be done. 223 4. Send Hold Timer Expired Error Handling 225 If a system does not send successive KEEPALIVE, UPDATE, and/or 226 NOTIFICATION messages within the period specified in the Send Hold 227 Time, then the BGP connection is closed and a log message is emitted. While this is a good attempt at RFC 4271 surgery, we've added other messages since then. (E.g., refresh.) Better text is to simply discuss sending a BGP message. RFC 4271, §4 and later uses this term. 253 * Failure to disconnect from a 'stuck' peer hinders the local 254 system's ability to construct a non-stale local Routing 255 Information Base (RIB). I don't follow this point. The local system has valid RIBs. What is failing is syncing the Adj-Rib-Out for a particular remote bgp speaker. -- Jeff
- [Idr] Some comments on draft-spaghetti-idr-bgp-se… Jeffrey Haas