Re: My BGP Route Update Pacing Draft

vrishab sikand <v_sikand@yahoo.com> Thu, 18 July 2002 20:37 UTC

Message-ID: <20020718203720.79222.qmail@web12808.mail.yahoo.com>
Date: Thu, 18 Jul 2002 13:37:20 -0700
From: vrishab sikand <v_sikand@yahoo.com>
Subject: Re: My BGP Route Update Pacing Draft
To: "Abarbanel, Benjamin" <Benjamin.Abarbanel@Marconi.com>, "'idr@merit.edu'" <idr@merit.edu>
In-Reply-To: <39469E08BD83D411A3D900204840EC558227B4@vie-msgusr-01.dc.fore.com>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="0-504351559-1027024640=:78832"
Sender: owner-idr@merit.edu
Precedence: bulk

Specific comment:
Unless the pacing rate is defined in absolute terms (as opposed to subjectively, e.g.
yellow 4: reduce update message traffic by 40% of your normal rate), it will be very difficult to verify compliance.
Also, in my experience, the "normal rate" varies drastically from vendor to vendor.
 
General comment:
It appears to me that this is an attempt to solve an internal scheduling or horsepower problem with the help of an external protocol.
This mechanism may put an undue burden on the transmitting router, which, with quick convergence in mind, is sending prefixes as fast as possible.

"Abarbanel, Benjamin" <Benjamin.Abarbanel@Marconi.com> wrote:

Hi all:

Our recent discussions on this list and my recent work 
experiences have led me to write this draft and offer it 
to the IETF community. I would appreciate any comments 
anyone has to make.

Thanks in advance,
Ben







Network Working Group Ben Abarbanel
Internet Draft Marconi Communications
Expiration Date: December 2002 



BGP Route Update Pacing

draft-abarbanel-bgp-route-update-pacing-00.txt


1. Status of this Memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026 except that the right to
produce derivative works is not granted.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as ``work in progress.''

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


2. Abstract

This document defines a mechanism for controlling or limiting the rate at 
which BGP update messages are sent from one peer to the next when a BGP peer 
experiences internal congestion. With the introduction of the new dynamic BGP 
capabilities [CAP] message and other BGP messages that do not disrupt the 
session, it is necessary to limit the rate at which BGP update messages are 
sent without affecting the entire BGP session and without relying on the 
transport (TCP) layer to do so through its normal reaction to congestion of 
the TCP session across the network.

3. Specification of Requirements

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", 
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be 
interpreted as described in RFC 2119 [RFC2119].




Internet Draft draft-abarbanel-bgp-route-update-pacing-00.txt [Page 2]



4. Introduction

This document defines a mechanism for controlling or limiting the rate at 
which BGP update messages are sent from one peer to the next when a BGP peer 
experiences internal congestion. With the introduction of the new dynamic BGP 
capabilities [CAP] message and other BGP messages that do not disrupt the 
session, it is necessary to limit the rate at which BGP update messages are 
sent without affecting the entire BGP session and without relying on the 
transport (TCP) layer to do so through its normal reaction to congestion of 
the TCP session across the network.

When a router enters a state where either its CPU utilization is close to 
100% or its memory is nearly depleted (less than 10% of memory left), it 
cannot handle new or heavy streams of updates from its peers and is at times 
unable to send messages to its peers in a timely manner (within 30 seconds). 
As a consequence, it cannot keep current with the topological state. In some 
scenarios, a non-congested peer might want to negotiate new capabilities with 
a congested peer, but the congested peer is so degraded that its TCP session 
goes into prolonged flow-off conditions and it is unable to see the new BGP 
messages. Peer to peer communication is severely hampered, and as a result 
the uncongested peer takes corrective action when its Hold timer expires: it 
drops the session and thereby all of the associated routes. The uncongested 
peer then computes alternate routing paths that are suboptimal in distance or 
attribute, which affects the forwarding decisions of all routers in the 
network. Yet it is possible that only the congested peer's routing (control) 
plane is badly degraded while its forwarding plane is working normally. In 
that case the uncongested peer has made a mistake: because it saw no 
Keepalive/Update messages before its Hold timer expired, it dropped the 
session and all its associated routes. After the routes are dropped, network 
instability occurs and suboptimal paths are used by the remaining peers.

The MinRouteAdvertisementInterval and MinASOriginationInterval timers are 
inadequate, since they are mostly implemented on a per-peer session basis and 
studies have shown that, when used, they severely degrade route convergence 
time. The problem with statically defined timers (initialized during system 
load or the session establishment phase) is that they do not adjust to a 
peer's dynamically changing internal congestion conditions. The problem with 
the congested peer using TCP flow control to relieve its congestion is that 
it completely stops all incoming session traffic and thus prevents any high 
priority messages from being seen.

TCP Out of Band Data or Urgent Message is one way to bypass the TCP flow 
control condition and allow high priority BGP messages to get to the 
congested peer, assuming it is able to read the session socket queue. This 
solution has its drawbacks as described in [UNIX-NET] chapter 21, p. 568. 











This document presents a mechanism by which the BGP session of a congested 
router need not be degraded to the point where communication with its peers 
is broken. With a peer to peer pacing mechanism that allows one peer to rate 
limit the number of update messages per second it receives from another, the 
congested peer can avoid severe congestion and process these messages at a 
manageable level. In addition, the uncongested peer has the knowledge to send 
high priority messages ahead of or in place of the normal high volume of 
update messages.

Usually, topological disturbances are spiky in nature, and once they subside 
the network returns to its optimum path oriented level. The causes of network 
instability, such as routers handling too much data in a short period of 
time, routers losing their sessions, or links going up and down, occur 
infrequently in most live networks. By using the pacing mechanism outlined in 
this draft, a BGP peer can prevent serious congestion long before it is in 
trouble, and can thus ride through topological disturbances and regain its 
stability without causing its sessions to drop or its routes/paths to be 
discarded and recomputed.

The assumption in this spec is that most or all of the internal congestion in 
a BGP router is caused by the heavy reception of update messages, containing 
a large number of routes, from neighboring peers.


5. BGP Update Pacing Mechanism

The BGP Update Message Pacing Mechanism is used to slow down the rate at 
which a peer sends update messages to another. This extension to BGP is 
simplified by using messages that do not disrupt the session, such as the 
INFORM message described in [INFORM]. Pacing is applied dynamically as 
congestion is detected and as it subsides, and therefore relies on the new 
[INFORM] message, which does not infringe on the underlying BGP protocol or 
its semantics and rules.


5.1 BGP Update Message Pacing (PACE) Dynamic Capability

The BGP Update Message Pacing capability is negotiated with all BGP speakers 
as type code=PACE (where PACE=TBD), using the TLV structures defined in 
[BGP-CAP] for the OPEN message, or at any time after the session is 
established, using the Dynamic Capability message defined in the [DYN-CAP] 
specification. All peers that accept the PACE capability are expected to 
support the new INFORM message, defined in the [INFORM] specification, to 
carry the pacing TLV structure. 












All PACE capable routers will provide a configuration option to their 
operator to enable the BGP Update Message Pacing mechanism on a per peer 
basis. 

Any peer wishing to withdraw the PACE capability can do so dynamically using 
the Dynamic Capability message as outlined in the [DYN-CAP] specification. 
Once it is withdrawn, the affected peering session will remain intact but 
will not benefit from the performance improvements offered by the pacing 
mechanism.


5.2 Use of INFORM Message

The INFORM message as described in [INFORM] is used to carry the rate 
limiting (pacing TLV) control structure to neighboring peers. It tells the 
peer that the sending BGP router has entered a congestion state and that the 
peer is to rate limit its transmission to the level specified. 

The INFORM message contains the following PACE TLV structure:

   Type              = Pacing Information, type=TBD
   Length            = 2
   1st Byte of Value = Cmd as shown below
   2nd Byte of Value = Level as shown below

   Cmd     Description
   ------- -----------
   11      Request to Pace Update Messages
   12      ACK Response to Pace Update Messages
   13      NACK Response to Pace Update Messages
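
As a rough illustration, the PACE TLV above could be packed and parsed as in 
the following Python sketch. The one-byte Type and Length fields and the 
placeholder type code 0xFF are assumptions made only for this example; the 
draft leaves the type code TBD, and the actual TLV header layout belongs to 
[INFORM]. The function names are hypothetical.

```python
import struct
from typing import Tuple

# Hypothetical placeholder: the draft leaves the PACE TLV type code as TBD.
PACE_TLV_TYPE = 0xFF

# Cmd values from the table above.
CMD_REQUEST = 11
CMD_ACK = 12
CMD_NACK = 13


def encode_pace_tlv(cmd: int, level: int) -> bytes:
    """Pack a PACE TLV, assuming 1-byte Type and 1-byte Length fields."""
    if cmd not in (CMD_REQUEST, CMD_ACK, CMD_NACK):
        raise ValueError(f"unknown Cmd {cmd}")
    if not 0 <= level <= 255:
        raise ValueError("Level must fit in one byte")
    # Length = 2: one byte of Cmd followed by one byte of Level.
    return struct.pack("!BBBB", PACE_TLV_TYPE, 2, cmd, level)


def decode_pace_tlv(data: bytes) -> Tuple[int, int]:
    """Return (cmd, level) from a TLV produced by encode_pace_tlv."""
    if len(data) != 4:
        raise ValueError("PACE TLV must be exactly 4 bytes in this sketch")
    tlv_type, length, cmd, level = struct.unpack("!BBBB", data)
    if tlv_type != PACE_TLV_TYPE or length != 2:
        raise ValueError("not a PACE TLV")
    return cmd, level
```

For example, encode_pace_tlv(CMD_REQUEST, 4) would carry a Yellow, Level 4 
pacing request.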

A. Request to Pace Update Messages Level Indication (code=11)

The rate is divided into sub levels grouped into categories (Gray, Yellow, 
Orange, Red, and Green) that denote the level of pacing, which is directly 
related to the level of congestion experienced within the congested peer. The 
colors are simple handles for flagging the severity of the condition. They 
are also used as indicators for display by the NMS, to identify to the 
operator the rate limiting levels from any neighboring peer. Associated Level 
Management MIBs are defined in section 7. 

The actual pacing control is done via the sub levels within these colors. 

   Level   Description
   ------  -----------
   Gray (Entering minor congestion state)
   1       Reduce update message traffic by 10% of your normal rate.
   2       Reduce update message traffic by 20% of your normal rate.
   3       Reduce update message traffic by 30% of your normal rate.

   Yellow (Entering medium congestion state)
   4       Reduce update message traffic by 40% of your normal rate.
   5       Reduce update message traffic by 50% of your normal rate.
   6       Reduce update message traffic by 60% of your normal rate.

   Orange (Entering major congestion state)
   7       Reduce update message traffic by 70% of your normal rate.
   8       Reduce update message traffic by 80% of your normal rate.
   9       Reduce update message traffic by 90% of your normal rate.

   Red (Entering critical congestion state)
   10      Complete cessation of all update message traffic (flow off).
           At this Level the receiving peer might decide to bypass the
           congested router and pick another, less optimal router for the
           affected routes.

   Green (Exiting any congestion state)
   0       Restore update message traffic to your normal level (flow on).
           Redirected routes can be returned to the original peer.
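
The Level table can be read as a simple lookup, for instance when an NMS 
needs the color for display. The following Python sketch is only an 
illustration of the table; the function names are hypothetical and not part 
of the draft.

```python
def pacing_color(level: int) -> str:
    """Map a pacing Level (0-10) to its color category from the table."""
    if level == 0:
        return "Green"
    if 1 <= level <= 3:
        return "Gray"
    if 4 <= level <= 6:
        return "Yellow"
    if 7 <= level <= 9:
        return "Orange"
    if level == 10:
        return "Red"
    raise ValueError(f"Level {level} is outside the range 0-10")


def reduction_percent(level: int) -> int:
    """Percentage by which update traffic is reduced at the given Level."""
    pacing_color(level)  # reuse the range check above
    return level * 10
```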

The receiving peer will send an INFORM message with an ACK or NACK indicating 
that it has received and understood the pacing request and will either comply 
with it or refuse it. If the congested peer receives a NACK, it should remove 
that peer from its list of PACE capable peers while still maintaining the 
session without the use of pacing. If the NACKing peer decides at a future 
date to re-enable pacing with the local peer, it will renegotiate the PACE 
capability with the local peer at that time.

B. Response Message Error Indication:

   Level   Description
   ------  -----------
   0       ACK indication. The peer will comply with the
           pacing level request.
   1       NACK indication. The peer is unable to perform pacing or comply
           with the pacing level requested.


5.3 INFORM Message Response Timer

When a congested peer sends an INFORM message with a "Request to Pace Update 
Messages (code=11)" to all peers that support the pacing feature, it will 
also start an associated INFORM Message Response Timer. The session of any 
peer that does not respond with an ACK or NACK INFORM Response message within 
the 30 second timeout period will be dropped. This is done to clear any PACE 
capable peers that are themselves congested to the point where their 
communication with the local congested peer is severed.
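
One way the response timer might be tracked is sketched below in Python. The 
class and method names are hypothetical, timestamps are passed in explicitly 
to keep the example self-contained, and treating a response that arrives at 
exactly 30 seconds as late is an assumption of this sketch.

```python
RESPONSE_TIMEOUT_SECONDS = 30.0  # timeout period from section 5.3


class PaceResponseTracker:
    """Tracks peers that still owe an ACK/NACK to a pacing request."""

    def __init__(self) -> None:
        self._deadlines = {}  # peer id -> time at which the timer expires

    def request_sent(self, peer: str, now: float) -> None:
        # Start the INFORM Message Response Timer for this peer.
        self._deadlines[peer] = now + RESPONSE_TIMEOUT_SECONDS

    def response_received(self, peer: str) -> None:
        # Either an ACK or a NACK stops the timer.
        self._deadlines.pop(peer, None)

    def expired_peers(self, now: float) -> list:
        # Peers returned here are the ones whose sessions would be dropped.
        expired = [p for p, d in self._deadlines.items() if now >= d]
        for peer in expired:
            del self._deadlines[peer]
        return expired
```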









5.4 Congestion Detection Within the Router

There are at least two ways congestion could be detected and measured by a 
BGP router.

- A high CPU utilization condition 

- A lack of available memory to accept incoming BGP messages, or an inability 
to get memory to successfully complete BGP processing. 

5.4.1 CPU Utilization Based Level Computation

It is recommended that when the congested router detects its inability to 
perform route calculations or accept new BGP session messages at a normal 
rate, meaning that the router's CPU utilization is more than 49%, it should 
inform all its peers of a rate limiting level and slow them down accordingly. 

The recommended way of computing the Level from percent CPU Utilization is 
the following:

Level = ((((%CPU Utilization - 50) x 2) + 5) / 10)
where %CPU Utilization is a whole integer number. Use integer math 
here to remove any fractional value.

Note: If the computed Level is negative, go to Green and set Level = 0. If 
Level = 0, send an INFORM message with pacing Level = 0, implying a return 
to the normal (100%) update rate.

Level means: whatever number of messages you transmitted per second 
before, transmit (100 - (Level x 10))% of that now. 

e.g. If you transmitted 100 messages/second before, a Level 2 (20%) 
reduction will cause you to transmit 80 messages/second now.
The assumption is that these messages have an average size (1500 bytes).
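
The CPU formula and the resulting message rate can be checked with a short 
Python sketch. The function names are hypothetical, and the upper clamp at 
Level 10 is a guard added for this example only (inputs up to 100% never 
exceed it).

```python
def cpu_pacing_level(cpu_percent: int) -> int:
    """Level = ((((%CPU - 50) x 2) + 5) / 10), using integer math."""
    level = (((cpu_percent - 50) * 2) + 5) // 10
    # Python's // rounds toward negative infinity, so values below the
    # threshold can come out negative; the draft maps those to Level 0.
    return max(0, min(10, level))


def paced_rate(previous_rate: int, level: int) -> int:
    """Transmit (100 - (Level x 10))% of the previous messages/second."""
    return previous_rate * (100 - level * 10) // 100
```

With these definitions, cpu_pacing_level(60) gives Level 2, and 
paced_rate(100, 2) gives 80 messages/second, matching the example above.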


5.4.2 Memory Allocation Based Level Computation

It is recommended that when the congested router detects its inability to 
perform route calculations or accept new BGP session messages at a normal 
rate, meaning that the router's memory allocation is more than 69%, it 
should inform all its peers of a rate limiting level and slow them down 
accordingly. 

The recommended way of computing the Level from percent Memory Allocation is 
the following:

Level = ((((%Memory Allocated - 70) x 2) + 5) / 10)
where %Memory Allocated is a whole integer number. Use integer math 
here to remove any fractional value.

Note: If the computed Level is negative, go to the Green sub-group and set 
Level = 0. If Level = 0, send an INFORM message with pacing Level = 0, 
implying a return to the normal (100%) update rate.
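
A small Python sketch of the memory formula follows, mainly to show the 
range of Levels it produces. The function name is hypothetical.

```python
def memory_pacing_level(mem_percent: int) -> int:
    """Level = ((((%Memory Allocated - 70) x 2) + 5) / 10), integer math."""
    level = (((mem_percent - 70) * 2) + 5) // 10
    # Negative results (memory well below the threshold) map to Green, 0.
    return max(0, level)


# Levels produced at a few sample allocation percentages.
samples = {pct: memory_pacing_level(pct) for pct in (69, 73, 85, 100)}
```

Taken literally, the formula yields Level 6 at 100% memory allocation, so a 
memory-only trigger computed this way would not reach the Orange or Red 
levels.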


5.5 INFORM Message Throttling

Once the congested peer receives an acknowledgement from another peer, it 
will send a modification INFORM message with a new Level to that peer 
whenever the computed pacing Level changes by at least 1, but no more often 
than one INFORM modification message every 5 seconds. This is done to 
debounce bursts of INFORM messages to all PACE negotiated peers each time the 
computed pacing Level changes. Depending on the vendor implementation, the 
internal utilization levels could change at microsecond or millisecond 
granularity.
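
The debouncing described here might look like the following Python sketch, 
kept per peer. It assumes the 5 second figure is a minimum spacing between 
modification messages; the class name is hypothetical and timestamps are 
passed in explicitly.

```python
MIN_INFORM_INTERVAL_SECONDS = 5.0  # spacing taken from section 5.5


class InformThrottle:
    """Decides whether a new pacing Level should be sent to one peer."""

    def __init__(self) -> None:
        self._last_level = None  # Level carried by the last INFORM sent
        self._last_time = None   # time at which that INFORM was sent

    def should_send(self, level: int, now: float) -> bool:
        if level == self._last_level:
            return False  # Level has not changed by at least 1
        if (self._last_time is not None
                and now - self._last_time < MIN_INFORM_INTERVAL_SECONDS):
            return False  # too soon after the previous INFORM
        self._last_level = level
        self._last_time = now
        return True
```

A caller would recompute the Level as often as it likes and only build an 
INFORM message when should_send returns True.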


6. Implementation Specific Mechanisms

The Memory Allocation and CPU Utilization Level detection algorithms 
discussed in sections 5.4.1 and 5.4.2 are suggested implementations. 
However, each vendor can implement its own Memory Allocation and CPU 
Utilization Level detection algorithms that best suit its needs and that do 
not negatively impact the overall BGP Route Update Pacing mechanism 
described in this spec. Any issues relating to internal implementation 
algorithms are outside the scope of this document.


7. Level Indicators Management MIBs

TBD


8. Security Considerations

This extension to BGP does not change the underlying security issues.


9. References

[BGP-CAP]     Chandra, R., Scudder, J., "Capabilities Advertisement with
              BGP-4", draft-ietf-idr-rfc2842bis-02.txt.

[DYN-CAP]     Chen, E., Sangli, S., "Dynamic Capability for BGP-4",
              draft-ietf-idr-dynamic-cap-02.txt, October 2002.

[INFORM]      Nalawade, G., Scudder, J., "BGPv4 INFORM Message",
              draft-nalawade-bgp-inform-00.txt, December 2002.

[BGP-4]       Rekhter, Y., and T. Li, "A Border Gateway Protocol 4
              (BGP-4)", RFC 1771, March 1995.

[BGP-4-DRAFT] Rekhter, Y. and T. Li (editors), "A Border Gateway Protocol 4
              (BGP-4)", Internet Draft draft-ietf-idr-bgp4-18.txt,
              January 2002.

[UNIX-NET]    Stevens, R., "UNIX Network Programming, Vol 1", Second
              Edition, 1998, Prentice Hall, Inc.


10. Author Information

Ben Abarbanel
Marconi Communications
1595 Spring Hill Road, 5th Floor
Vienna, VA 22182
Email: benjamin.abarbanel@marconi.com




