Re: My BGP Route Update Pacing Draft

andrewl@xix-w.bengi.exodus.net Mon, 22 July 2002 21:36 UTC

Received: from trapdoor.merit.edu (postfix@trapdoor.merit.edu [198.108.1.26]) by ietf.org (8.9.1a/8.9.1a) with ESMTP id RAA11942 for <idr-archive@ietf.org>; Mon, 22 Jul 2002 17:36:23 -0400 (EDT)
Received: by trapdoor.merit.edu (Postfix) id 982B79120D; Mon, 22 Jul 2002 17:37:06 -0400 (EDT)
Delivered-To: idr-outgoing@trapdoor.merit.edu
Received: by trapdoor.merit.edu (Postfix, from userid 56) id 61A209121C; Mon, 22 Jul 2002 17:37:06 -0400 (EDT)
Delivered-To: idr@trapdoor.merit.edu
Received: from segue.merit.edu (segue.merit.edu [198.108.1.41]) by trapdoor.merit.edu (Postfix) with ESMTP id 1A9429120D for <idr@trapdoor.merit.edu>; Mon, 22 Jul 2002 17:37:05 -0400 (EDT)
Received: by segue.merit.edu (Postfix) id ED6785DE06; Mon, 22 Jul 2002 17:37:04 -0400 (EDT)
Delivered-To: idr@merit.edu
Received: from demiurge.exodus.net (demiurge.exodus.net [216.32.171.82]) by segue.merit.edu (Postfix) with ESMTP id 6BDEA5DDF6 for <idr@merit.edu>; Mon, 22 Jul 2002 17:37:04 -0400 (EDT)
Received: (from andrewl@localhost) by demiurge.exodus.net (8.9.3+Sun/8.9.3) id OAA19286; Mon, 22 Jul 2002 14:34:15 -0700 (PDT)
Date: Mon, 22 Jul 2002 14:34:15 -0700
From: andrewl@xix-w.bengi.exodus.net
To: "Abarbanel, Benjamin" <Benjamin.Abarbanel@Marconi.com>
Cc: "'idr@merit.edu'" <idr@merit.edu>
Subject: Re: My BGP Route Update Pacing Draft
Message-ID: <20020722143415.G18094@demiurge.exodus.net>
References: <39469E08BD83D411A3D900204840EC558227B4@vie-msgusr-01.dc.fore.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.2.5i
In-Reply-To: <39469E08BD83D411A3D900204840EC558227B4@vie-msgusr-01.dc.fore.com>; from Benjamin.Abarbanel@Marconi.com on Thu, Jul 18, 2002 at 03:31:54PM -0400
Sender: owner-idr@merit.edu
Precedence: bulk

Ben,

One thing we discussed at the meeting in Yokohama was that we MUST NOT
associate protocol actions with the INFORM message.  It is Informational
ONLY.  An INFORM may be followed by a protocol action (such as a NOTIFY),
but, in and of itself, it is not to be used to trigger protocol actions.

Andrew

On Thu, Jul 18, 2002 at 03:31:54PM -0400, Abarbanel, Benjamin wrote:
> Delivered-To: idr-outgoing@trapdoor.merit.edu
> Delivered-To: idr@trapdoor.merit.edu
> Delivered-To: idr@merit.edu
> From: "Abarbanel, Benjamin" <Benjamin.Abarbanel@Marconi.com>
> To: "'idr@merit.edu'" <idr@merit.edu>
> Subject: My BGP Route Update Pacing Draft
> Date: Thu, 18 Jul 2002 15:31:54 -0400
> X-Mailer: Internet Mail Service (5.5.2650.21)
> Precedence: bulk
> X-OriginalArrivalTime: 18 Jul 2002 19:34:50.0014 (UTC) FILETIME=[2E818BE0:01C22E92]
> 
> Hi all:
> 
>    Our recent discussions on this list and my recent work 
>  experiences have led me to write this draft and offer it 
>  to the IETF community. I would appreciate any comments 
>  anyone has to make.
>  
>  Thanks in advance,
>  Ben
> 
> Network Working Group                                     Ben Abarbanel
> Internet Draft                                            Marconi Communications
> Expiration Date: December 2002
>
>
>                          BGP Route Update Pacing
>
>             draft-abarbanel-bgp-route-update-pacing-00.txt
> 
> 
> 1. Status of this Memo
> 
>    This document is an Internet-Draft and is in full conformance with
>    all provisions of Section 10 of RFC2026 except that the right to
>    produce derivative works is not granted.
> 
>    Internet-Drafts are working documents of the Internet Engineering
>    Task Force (IETF), its areas, and its working groups.  Note that
>    other groups may also distribute working documents as Internet-
>    Drafts.
> 
>    Internet-Drafts are draft documents valid for a maximum of six months
>    and may be updated, replaced, or obsoleted by other documents at any
>    time.  It is inappropriate to use Internet-Drafts as reference
>    material or to cite them other than as ``work in progress.''
> 
>    The list of current Internet-Drafts can be accessed at
>    http://www.ietf.org/ietf/1id-abstracts.txt
> 
>    The list of Internet-Draft Shadow Directories can be accessed at
>    http://www.ietf.org/shadow.html.
> 
> 
> 2. Abstract
> 
> This document defines a mechanism for limiting the rate at which BGP update
> messages are sent from one peer to another when a BGP peer experiences
> internal congestion. With the introduction of the dynamic BGP capability
> message [DYN-CAP] and other messages that do not tear down the BGP session,
> it is necessary to limit the rate at which BGP update messages are sent
> without affecting the entire BGP session and without relying on the
> transport (TCP) layer to do so as its normal reaction to congestion of the
> TCP session across the network.
> 
> 3. Specification of Requirements
> 
> The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", 
> "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be 
> interpreted as described in RFC 2119 [RFC2119].
> 
>  
> 
> 
> Internet Draft      draft-abarbanel-bgp-route-update-pacing-00.txt    [Page 2]
> 
>  
> 
> 4. Introduction
> 
> This document defines a mechanism for limiting the rate at which BGP update
> messages are sent from one peer to another when a BGP peer experiences
> internal congestion. With the introduction of the dynamic BGP capability
> message [DYN-CAP] and other messages that do not tear down the BGP session,
> it is necessary to limit the rate at which BGP update messages are sent
> without affecting the entire BGP session and without relying on the
> transport (TCP) layer to do so as its normal reaction to congestion of the
> TCP session across the network.
> 
> When a router enters a state where either its CPU utilization is close to
> 100% or its memory is nearly depleted (less than 10% of memory left), it
> cannot handle new or heavy streams of updates from its peers and is at
> times unable to send messages to its peers in a timely manner (within 30
> seconds). As a consequence it cannot keep current with the topological
> state. In some scenarios, a non-congested peer might want to negotiate new
> capabilities with a congested peer, but the congested peer is so degraded
> that its TCP session is flow-controlled off and it is unable to see the
> new BGP messages. Peer-to-peer communication is severely hampered, and as
> a result the uncongested peer takes corrective action: when its hold timer
> expires, it drops the session. The uncongested peer then computes
> alternate routing paths that are suboptimal in distance or attributes,
> which affects the forwarding decisions of all routers in the network. It
> is possible that the congested peer's routing (control) plane is badly
> degraded while its forwarding plane is operating normally. In that case
> the uncongested peer makes a mistake: because it sees no Keepalive or
> Update messages before its hold timer expires, it drops the session and
> with it all associated routes. After the routes are dropped, network
> instability occurs and suboptimal paths are used by the remaining peers.
> 
> The MinRouteAdvertisementInterval and MinASOriginationInterval timers are
> inadequate since they are mostly implemented on a per-peer-session basis,
> and studies have shown that when they are used they severely degrade route
> convergence time. The problem with statically defined timers (initialized
> at system load or during session establishment) is that they do not adjust
> to a peer's dynamically changing internal congestion. The problem with the
> congested peer using TCP flow control to relieve its congestion is that it
> completely stops all incoming session traffic, thus preventing any
> high-priority messages from being seen.
> 
> TCP Out of Band Data or Urgent Message is one way to bypass the TCP flow 
> control condition and allow high priority BGP messages to get to the 
> congested peer, assuming it is able to read the session socket queue. This 
> solution has its drawbacks as described in [UNIX-NET] chapter 21, p. 568.  
> 
> This document presents a mechanism by which the BGP session of a congested
> router need not degrade to the point that communication with its peers is
> broken. With a peer-to-peer pacing mechanism that allows one peer to rate
> limit the number of update messages per second it receives from another
> peer, a router can avoid severe congestion and process these messages at a
> manageable level. In addition, the uncongested peer has the knowledge it
> needs to send high priority messages ahead of or in place of the normal
> high volume of update messages.
> 
> Usually, topological disturbances are spiky in nature, and once they
> subside the network returns to its optimum paths. Whatever caused the
> network to become unstable, such as routers handling too much data in a
> short period of time, routers losing their sessions, or links going up and
> down, occurs infrequently in most live networks. By using the pacing
> mechanism outlined in this draft, a BGP peer can head off serious
> congestion long before it is in trouble, and thus ride through topological
> disturbances and regain its stability without causing its sessions to drop
> or its routes/paths to be discarded and recomputed.
> 
> The assumption in this spec is that most or all of the internal congestion
> of a BGP router is caused by heavy reception of update messages, containing
> a large number of routes, from neighboring peers.
> 
> 
> 5. BGP Update Pacing Mechanism
> 
> The BGP Update Message Pacing Mechanism is used to slow down the rate at
> which a peer sends update messages to another. This extension to the BGP
> protocol is kept simple by using messages that do not tear down the
> session, such as INFORM as described in [INFORM]. The pacing is performed
> dynamically as congestion is detected and as it subsides, and thus needs
> the new [INFORM] message, which does not infringe on the underlying BGP
> protocol or any of its semantics and rules.
>  
> 
> 5.1 BGP Update Message Pacing (PACE) Dynamic Capability
> 
> The BGP Update Message Pacing capability is negotiated with all BGP
> speakers as capability type code=PACE (where PACE=TBD), using the TLV
> structures defined in [BGP-CAP], either in the OPEN message or at any time
> after the session is established using the Dynamic Capability message
> defined in the [DYN-CAP] specification. All peers that accept the PACE
> capability are expected to support the new INFORM message, as defined in
> the [INFORM] specification, to carry the pacing TLV structure.
> 
> All PACE capable routers will provide a configuration option to their 
> operator to enable the BGP Update Message Pacing mechanism on a per peer 
> basis. 
> 
> Any peer wishing to withdraw the PACE capability can do so dynamically using
> the Dynamic Capability message as outlined in the [DYN-CAP] specification.
> Once the capability is withdrawn, the affected peering session remains
> intact but no longer benefits from the performance improvements offered by
> the pacing mechanism.
> 
> 
> 5.2 Use of INFORM Message
> 
> The INFORM message as described in [INFORM] is used to carry the rate
> limiting (pacing TLV) control structure to neighboring peers. It tells the
> peer that the sending BGP router has entered a congestion state and that
> the peer is to rate limit its transmission to the level specified.
> 
> The INFORM message contains the following PACE TLV structure:
>  
>    Type = Pacing Information, type=TBD 
>    Length = 2
>    1st Byte of Value = Cmd as shown below 
>    2nd Byte of Value = Level as shown below
> 
>  Cmd       Description 
> -------    -----------
>   11       Request to Pace update messages
>   12       ACK Response to Pace Update messages
>   13       NACK Response to Pace Update Messages
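
A minimal sketch of how the two-byte Value could be packed, for illustration
only. The TLV header layout (one-byte type, one-byte length) and the type
code 0xFF are assumptions made here; the draft leaves the type as TBD, and
the INFORM framing belongs to [INFORM], not to this document.

```python
import struct
from enum import IntEnum
from typing import Tuple

PACE_TLV_TYPE = 0xFF  # placeholder only: the draft leaves the type code TBD


class PaceCmd(IntEnum):
    REQUEST = 11  # Request to Pace update messages
    ACK = 12      # ACK Response to Pace Update messages
    NACK = 13     # NACK Response to Pace Update messages


def encode_pace_tlv(cmd: PaceCmd, level: int) -> bytes:
    """Pack a PACE TLV, assuming a 1-byte type and a 1-byte length."""
    if not 0 <= level <= 0xFF:
        raise ValueError("level must fit in one byte")
    # Length is 2: one byte of Cmd followed by one byte of Level.
    return struct.pack("!BBBB", PACE_TLV_TYPE, 2, int(cmd), level)


def decode_pace_tlv(data: bytes) -> Tuple[PaceCmd, int]:
    """Unpack a PACE TLV built by encode_pace_tlv."""
    tlv_type, length, cmd, level = struct.unpack("!BBBB", data[:4])
    if tlv_type != PACE_TLV_TYPE or length != 2:
        raise ValueError("not a PACE TLV")
    return PaceCmd(cmd), level
```

Under these assumptions, encode_pace_tlv(PaceCmd.REQUEST, 3) yields the four
bytes ff 02 0b 03.
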
> 
> A. Request to Pace Update Messages Level Indication (code=11)
> 
> The rate is divided into sublevels grouped into color categories (Gray,
> Yellow, Orange, Red, and Green) that denote the level of pacing, which is
> directly related to the level of congestion experienced within the
> congested peer. The colors are simple handles for flagging the severity of
> the condition. They are also used as indicators for display by the NMS, to
> identify to the operator the rate limiting levels requested by any
> neighboring peer. Associated Level Management MIBs are to be defined in
> section 7.
> 
> The actual pacing control is done via the sub levels within these colors. 
>   
>     Level     Description
>    ------     -----------
>     Gray      (Entering minor congestion state)
>          1       Reduce update message traffic by 10% of your normal rate.
>          2       Reduce update message traffic by 20% of your normal rate.
>          3       Reduce update message traffic by 30% of your normal rate.
>
>     Yellow    (Entering medium congestion state)
>          4       Reduce update message traffic by 40% of your normal rate.
>          5       Reduce update message traffic by 50% of your normal rate.
>          6       Reduce update message traffic by 60% of your normal rate.
>
>     Orange    (Entering major congestion state)
>          7       Reduce update message traffic by 70% of your normal rate.
>          8       Reduce update message traffic by 80% of your normal rate.
>          9       Reduce update message traffic by 90% of your normal rate.
>
>     Red       (Entering critical congestion state)
>         10       Complete cessation of all update message traffic (flow off).
>                  At this Level the receiving peer might decide to bypass the
>                  congested router and pick another, less optimal router for
>                  the affected routes.
>
>     Green     (Exiting any congestion state)
>          0       Restore update message traffic to your normal level (flow on).
>                  Redirected routes can be returned to the original peer.
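
The Level table can be read as a simple lookup plus a percentage reduction.
A rough sketch follows; the function names are made up for illustration and
are not part of the draft.

```python
def color_for_level(level: int) -> str:
    """Map a pacing Level (0..10) to its color band from the Level table."""
    if level == 0:
        return "Green"
    if 1 <= level <= 3:
        return "Gray"
    if 4 <= level <= 6:
        return "Yellow"
    if 7 <= level <= 9:
        return "Orange"
    if level == 10:
        return "Red"
    raise ValueError("level must be in 0..10")


def paced_rate(normal_rate: float, level: int) -> float:
    """Rate to send at after reducing normal_rate by (level x 10) percent."""
    if not 0 <= level <= 10:
        raise ValueError("level must be in 0..10")
    return normal_rate * (100 - level * 10) / 100
```

So a peer that normally sends 100 updates/second and is asked for Level 5
(Yellow) would send 50/second, and Level 10 (Red) means it sends none.
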
> 
> The receiving peer will send an INFORM message with an ACK or NACK
> indicating that it has received and understood the pacing request and will
> either comply with it or refuse it. If the congested peer receives a NACK,
> it should remove that peer from its list of PACE capable peers but still
> maintain the session without the use of pacing. If the NACKing peer decides
> at a future date to re-enable pacing with the local peer, it will
> renegotiate the PACE capability with the local peer at that time.
> 
> B. Response Message Error Indication:
> 
>     Level     Description
>    ------     -----------
>         0     ACK indication. The peer will comply with the pacing level
>               request.
>         1     NACK indication. The peer is unable to perform pacing or to
>               comply with the pacing level requested.
> 
> 
> 5.3 INFORM Message Response Timer
>    
> When a congested peer sends an INFORM message with a "Request to Pace Update
> Messages (code=11)" to all peers that support the pacing feature, it also
> starts an associated INFORM Message Response Timer. The session with any
> peer that does not respond with an ACK or NACK INFORM response message
> within the 30 second timeout period will be dropped. This is done to clear
> any PACE capable peers that are themselves congested to the point where
> their communication with the local congested peer is severed.
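
One way to picture the ACK/NACK handling and the response timer together is
the bookkeeping sketch below. The class and method names are hypothetical,
and the caller is assumed to supply the current time in seconds and to do
the actual session teardown.

```python
RESPONSE_TIMEOUT = 30.0  # seconds, the timeout given in section 5.3


class PaceTracker:
    """Tracks outstanding pace requests sent to PACE capable peers."""

    def __init__(self, peers):
        self.pace_peers = set(peers)  # peers that negotiated PACE
        self.deadlines = {}           # peer -> time a response is due by

    def request_sent(self, peer, now):
        """Start the response timer for a peer we just sent a request to."""
        if peer in self.pace_peers:
            self.deadlines[peer] = now + RESPONSE_TIMEOUT

    def response_received(self, peer, ack):
        """Stop the timer; on NACK stop pacing but keep the session."""
        self.deadlines.pop(peer, None)
        if not ack:
            self.pace_peers.discard(peer)

    def expired(self, now):
        """Return the peers whose sessions are to be dropped."""
        late = [p for p, due in self.deadlines.items() if now >= due]
        for p in late:
            del self.deadlines[p]
            self.pace_peers.discard(p)
        return late
```

With three peers all sent a request at time 0, one ACKing and one NACKing,
only the silent peer is returned by expired() once 30 seconds have passed,
and only the ACKing peer remains in the PACE capable list.
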
> 
> 
> 5.4 Congestion Detection Within the Router
> 
> There are at least two ways congestion could be detected and measured by a 
> BGP router.
> 
> - A high CPU utilization condition 
> 
> - A Lack of available memory to accept incoming BGP messages or inability 
>   to get memory to successfully complete BGP processing. 
> 
> 5.4.1 CPU Utilization Based Level Computation
> 
> It is recommended that when the congested router detects its inability to
> perform route calculations or accept new BGP session messages at a normal
> rate, meaning that the router's CPU utilization is more than 49%, it
> inform all its peers of a rate limiting level and slow them down
> accordingly.
> 
> The recommended way for computing the Level, based on percent CPU 
> Utilization, is done using the following:
> 
>    Level = ((((%CPU Utilization - 50) x 2) + 5) / 10)
>    where %CPU Utilization is a whole integer number. Use integer math
>    here to remove any fractional value.
> 
>   Note: If computed Level is negative, go to Green and set Level = 0. If   
>          Level = 0 send INFORM message with pacing Level = 0, implying return 
>          to normal (100%) update rate.
> 
> A Level means: whatever number of messages you transmitted per second
> before, transmit (100 - (Level x 10))% of that now.
>
>   e.g. If you transmitted 100 messages/second before, a Level 2 (20%)
>        reduction will cause you to transmit 80 messages/second now.
>        The assumption is that these messages have an average size
>        (1500 bytes).
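
For what it is worth, the CPU formula works out as follows in code. This is
only a sketch of the recommended computation; the function name is made up,
and Python's floor division stands in for "integer math" (it differs from
truncation only for negative values, which are clamped to 0 anyway).

```python
def cpu_pacing_level(cpu_percent: int) -> int:
    """Pacing Level from whole-number CPU utilization, per section 5.4.1."""
    if not 0 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be in 0..100")
    # Integer math drops the fraction; the +5 rounds to the nearest level.
    level = ((cpu_percent - 50) * 2 + 5) // 10
    # A negative result means Green, i.e. Level = 0.
    return max(level, 0)
```

Working a few values through: 50% and 52% both give Level 0, 53% gives
Level 1, 75% gives Level 5, and 100% gives Level 10. So although the text
sets the threshold at more than 49%, the formula as written does not produce
a non-zero Level until utilization reaches 53%.
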
> 
> 
> 5.4.2 Memory Allocation Based Level Computation
> 
> It is recommended that when the congested router detects its inability to
> perform route calculations or accept new BGP session messages at a normal
> rate, meaning that the router's memory allocation is more than 69%, it
> inform all its peers of a rate limiting level and slow them down
> accordingly.
> 
> The recommended way for computing the Level, based on percent Memory 
> Allocation, is done using the following:
> 
>    Level = ((((%Memory Allocated - 70) x 2) + 5) / 10)
>    where %Memory Allocated is a whole integer number. Use integer math
>    here to remove any fractional value.
> 
>    Note: If computed Level is negative, go to Green sub-group and set 
>          Level = 0. If Level = 0 send INFORM message with pacing Level = 0, 
>          implying return to normal (100%) update rate.
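
The memory-based formula can be sketched the same way. Again the function
name is hypothetical and floor division is used for the integer math, with
negative results clamped to Level 0 as the note says.

```python
def memory_pacing_level(mem_percent: int) -> int:
    """Pacing Level from whole-number memory allocation, per section 5.4.2."""
    if not 0 <= mem_percent <= 100:
        raise ValueError("mem_percent must be in 0..100")
    level = ((mem_percent - 70) * 2 + 5) // 10
    return max(level, 0)
```

Note that with the formula as written, 100% memory allocation yields
((30 x 2) + 5) / 10 = 6, so the memory-based computation never goes above
Level 6 (Yellow); only the CPU-based computation can reach Level 10.
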
>  
> 
> 5.5 INFORM Message Throttling
> 
> Once the congested peer has received an acknowledgement from a peer, it
> sends that peer a modification INFORM message with a new Level whenever the
> computed pacing Level changes by at least 1, but no more often than one
> INFORM modification message every 5 seconds. This debounces bursts of
> INFORM messages to all PACE negotiated peers each time the computed pacing
> Level changes. Depending on the vendor implementation, the internal
> utilization levels could change on a microsecond or millisecond timescale.
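
One reading of this throttling rule, sketched per peer: send only when the
Level differs from the last one sent, and never within 5 seconds of the
previous INFORM. The class name and the caller-supplied clock are
assumptions for illustration.

```python
MIN_INTERVAL = 5.0  # seconds between INFORM modification messages


class InformThrottle:
    """Decides whether a new pacing Level should be sent to one peer."""

    def __init__(self):
        self.last_level = None  # Level carried by the last INFORM sent
        self.last_time = None   # time that INFORM was sent

    def should_send(self, level, now):
        if level == self.last_level:
            return False  # nothing changed since the last INFORM
        if self.last_time is not None and now - self.last_time < MIN_INTERVAL:
            return False  # too soon; wait out the 5 second window
        self.last_level = level
        self.last_time = now
        return True
```

A Level that changes and then changes back inside the window produces no
extra INFORM at all, which is the debouncing effect the section is after.
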
>   
> 
> 6. Implementation Specific Mechanisms
> 
> The Memory Allocation and CPU Utilization Level detection algorithms
> discussed in sections 5.4.1 and 5.4.2 are suggested ways to implement
> these solutions. However, each vendor can implement its own Memory
> Allocation and CPU Utilization Level detection algorithms that best suit
> its needs and that do not negatively impact the overall BGP Route Update
> Pacing mechanism described in this spec. Any issues relating to internal
> implementation algorithms are outside the scope of this document.
> 
>  
> 7. Level Indicators Management MIBs
> 
>    TBD
> 
> 
> 8. Security Considerations
> 
>    This extension to BGP does not change the underlying security issues.
> 
> 
> 9. References
>    
>    [BGP-CAP]     Chandra, R., Scudder, J., "Capabilities Advertisement with
>                  BGP-4", draft-ietf-idr-rfc2842bis-02.txt.
>
>    [DYN-CAP]     Chen, E., Sangli, S., "Dynamic Capability for BGP-4",
>                  draft-ietf-idr-dynamic-cap-02.txt, October 2002.
>
>    [INFORM]      Nalawade, G., Scudder, J., "BGPv4 INFORM Message",
>                  draft-nalawade-bgp-inform-00.txt, December 2002.
>
>    [BGP-4]       Rekhter, Y., and T. Li, "A Border Gateway Protocol 4
>                  (BGP-4)", RFC 1771, March 1995.
> 
>    [BGP-4-DRAFT] Rekhter, Y. and T. Li (editors), "A Border Gateway Protocol 4
>                  (BGP-4)", Internet Draft draft-ietf-idr-bgp4-18.txt,
>                  January 2002.
>
>    [UNIX-NET]    Stevens, R., "UNIX Network Programming, Vol 1", Second
>                  Edition, Prentice Hall, Inc., 1998.
>    
> 
> 10. Author Information
> 
>    Ben Abarbanel
>    Marconi Communications
>    1595 Spring Hill Road, 5th Floor
>    Vienna, VA 22182
>    Email: benjamin.abarbanel@marconi.com
> 
>