Re: DPA replacement?

Curtis Villamizar <curtis@ans.net> Fri, 13 September 1996 23:46 UTC

Received: from cnri by ietf.org id aa09369; 13 Sep 96 19:46 EDT
Received: from merit.edu by CNRI.Reston.VA.US id aa16241; 13 Sep 96 19:46 EDT
Received: (from daemon@localhost) by merit.edu (8.7.5/merit-2.0) id TAA09087 for idr-outgoing; Fri, 13 Sep 1996 19:17:58 -0400 (EDT)
Received: from interlock.ans.net (interlock.ans.net [147.225.5.5]) by merit.edu (8.7.5/merit-2.0) with SMTP id TAA09080 for <bgp@merit.edu>; Fri, 13 Sep 1996 19:17:55 -0400 (EDT)
Received: by interlock.ans.net id AA07255 (InterLock SMTP Gateway 3.0 for bgp@ans.net); Fri, 13 Sep 1996 19:17:47 -0400
Received: by interlock.ans.net (Internal Mail Agent-1); Fri, 13 Sep 1996 19:17:47 -0400
Message-Id: <199609132316.TAA20378@brookfield.ans.net>
To: bgp@ans.net
Cc: curtis@ans.net
Reply-To: curtis@ans.net
Subject: Re: DPA replacement?
Date: Fri, 13 Sep 1996 19:16:53 -0400
From: Curtis Villamizar <curtis@ans.net>
Sender: owner-idr@merit.edu
Precedence: bulk

John,

I finally got around to reading this.  This would be useful for a dual
homed site with an absolute preference for one of the connections.  I
think this is a rare case, so I'm proposing an alternative.

Unfortunately, more sites seem to be interested in pulling more than
one circuit and then getting some sort of load split across the two
providers.  This amounts to everyone else taking shortest path out.
Such a site would benefit from a cumulative metric so that the reverse
traffic took the same path.  If the cummulative metric was applied the
same in both directions traffic (and therefore load balance) would be
symetric.  Given enough range, the metric could be tweaked to balance
the load a bit better.

This brings us back to the issue of a globally recognized metric.
That always brings up the issue of whether to use bandwidth, delay,
percent loading, percent loss, weighted hop count, something else, or
some combination of factors, or just punt and use a arbitrarily
choosen unitless integer.  On this issue I suggest we declare the
metric to be "pain" and allow the value of "pain" and the rules for
accumulating pain to be defined according to TOS or QoS.  We also have
to keep in mind that this attribute needs to be relatively fixed.

The attribute should be optional and transitive.  Any router that
can't figure out what its loss rate or has not idea what the attribute
means or how to deal with a particular TOS or QOS should set the BGP
partial attribute flag.  If we set aside two attribute numbers, we can
pack complete attributes in one and partial in another if a router
knows how to deal with a subset of the TOS or QoS.  Treat a partial
BGP attribute to mean a really bad path.

Still with me?  Good.  For the default TOS/QoS, I suggest we go with a
weighted hop count with some guidelines on weighting because its
simple and anyone can do it.  I'd like to suggest that we attempt to
make measure of "pain" for the throughput TOS to be as close as we can
get to a measure of "pain" inflicted on a moderate window size TCP
preferably considering the most widely deployed TCP behaviors.  So how
much "pain" does a link or router contribute to a TCP flow?

This can't be twitchy.  For all of these metrics, lower "pain" values
are good, higher values are bad.  If the value changes the value
should tend toward the higher values in the range.  Anyone origination
or modifying a metric should dampen.  An EBGP peer that dampens should
latch at a higher value and ignore changes lowering the value only
after the value has been lower for a long while.

These values can be used to bias local-pref at so interior routers that
don't understand it don't form routing loops.

The whole idea of a globally significant metric has been one that BGP
has avoided and load sensitive routing even more so.  This has been
all with good reason.  Now that we've figured out that dampening is
not rocket science it may be time to revisit the idea of load
sensitive routing and consider making such emperically derived metrics
highly damped and globally significant.

btw- It would probably be a lot more productive to get the wording
right in the BGP4 draft so now I'll shut up and see about suggesting
some text.

Curtis


ps- further detail - how are TOS or QoS handled?  Here are some
suggestions.

Pain is inflicted to TCP in one of two ways.  Dropping a packet causes
a slow down.  Dropping multiple closely spaced packets (within the
same TCP window) can cause considerably more pain.  A good measure of
pain would then be a loss rate with a factor for biasing closely
spaced loss.  Having coded BGP route flap damping, we already know how
to efficiently do an exponential decay.  So I suggest the forwarding
engine exponentially decay a cummulative prior drop value over a short
time period (note: use time here) and add that to a loss rate metric
that is exponsitially decayed over a very long number of packets
(note: use packets here).  

If you don't want to do the first decay, you can just add a ceiling
value, approximating "don't know" to mean "closely bunched".  With
that assumption you just need to keep the packet number (from your
SNMP counter) and the prior metric and take action on a queue drop.
We you queue drop you need to decay the value (table lookup and
multiply), and add to it, then store the new packet number and value
pair.  If your forwarding engine can't deal with that treat queue drop
the way you deal with link loss.  For link loss use LQM stats if you
can get them and every so often decay the metric according to how many
packets went by and increment according to how many were dropped.  If
you can't get LQM stats use some other approximation of loss.  Use a
pessimistic estimate.  For example, for ATM if you can get cell loss
stats assume each cell represents a lost packet, for DS3 HSSI or DS1
assume each SES or UAS represents loss of all packets during that
second.  These values are link and router specific and would have to
be flooded in the IGP so the border router could make the computation
periodically.

For TCP, a loss rate of "all packets are lost" would be 2^32-1 and no
loss would be 0.  Loss is combined by multiplying two unsigned 32 bit
integers and shifting down the 64 bit unsigned integer by 32 bits.

Other QoS can be treated similarly.  Delay TOS is additive.  Bandwidth
TOS takes the lowest value of remaining unused bandwidth.  QoS metrics
could be defined as an ability to support various services and tspecs
and is left as an exercise for the reader (that means I don't think
its impoertant and don't care but see a need to accomodate the QoS
routing zealots.


------- Forwarded Message

Received: from interlock.ans.net (interlock.ans.net [147.225.5.5]) by brookfield.ans.net (8.7.3/8.7.3) with SMTP id UAA11966 for <curtis@brookfield.ans.net>; Sat, 24 Aug 1996 20:57:16 -0400 (EDT)
Received: by interlock.ans.net id AA03678
  (InterLock SMTP Gateway 3.0 for curtis@ans.net);
  Sat, 24 Aug 1996 20:58:41 -0400
Received: by interlock.ans.net (Internal Mail Agent-4);
  Sat, 24 Aug 1996 20:58:41 -0400
Received: by interlock.ans.net (Internal Mail Agent-3);
  Sat, 24 Aug 1996 20:58:41 -0400
Received: by interlock.ans.net (Internal Mail Agent-2);
  Sat, 24 Aug 1996 20:58:41 -0400
Received: by interlock.ans.net (Internal Mail Agent-1);
  Sat, 24 Aug 1996 20:58:41 -0400
Message-Id: <v03007803ae44ecd551c6@[152.160.213.42]>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Date: Sat, 24 Aug 1996 17:50:58 -0700
To: bgp@ans.net
From: "John G. Scudder" <jgs@ieng.com>
Subject: DPA replacement?
Sender: owner-idr@merit.edu
Precedence: bulk

A little while ago I was wrestling with the current DPA drafts and figured
out the the respective drafts and RFCs (the DPA draft, the DPA usage draft,
and RFC 1771) form kind of a contradiction loop.  This seems to stem from
the fact that the current DPA draft wants to treat DPA as a global
administratively defined metric (which still carries around AS numbers for
some reason even though there's not a reasonable way to use them) or maybe
(it isn't clear) as an arbitrary tag (even though Communities does this
just fine).  It's quite clear in either case that the DPA should be used to
bias or set localpref.  The usage draft, on the other hand, keeps the
earlier idea of making DPA a MED-like metric which is only comparable when
the AS part of the DPA matches.  But, RFC 1771 says you can't do elections
when biasing or setting localpref.

This got me thinking about what DPA really wants to do and I ended up with
the following ideas, which are basically tricks to try to make a global
metric useful.  Your feedback is solicited.  Comments from network
operators as to the real-world usefulness of any mechanism such as this (or
the current DPA draft for that matter) would be particularly interesting.

This is still pretty drafty; there are a number of unresolved notes in
[square brackets].  Also, I didn't spend any time trying to come up with a
clever TLA, this leads to some textual awkwardness where I refer to "the
attribute" or "the metric" generically.  Sorry.

- --John


Introduction
- ------------

The current DPA proposal <draft-ietf-idr-bgp-dpa-05.txt> differs from
earlier versions of the document in that it eliminates the requirement that
only DPAs with like source ASes be comparable.  This makes the AS# portion
of the DPA largely redundant; it also increases the possibility of
conflicts in scenarios such as that shown in section 3 of
<draft-ietf-idr-dpa-application-02.txt>.

Also, scenarios such as the aforementioned suffer from DPA's inability to
select between multiple branching paths.  The application document suggests
using Local_Pref to overcome this limitation.  This proposal outlines
another approach, involving a globally significant metric which is refined
in smaller and smaller increments as it propagates further from its
originator.

The proposed algorithm allows the preference of each AS in the path to be
reflected in the metric.  Total ordering is preserved in many cases, but
see the "Limitations" section for notable exceptions.

Path Attribute
- --------------

This document proposes an optional transitive path attribute of fixed
length.  The attribute is a range of integers represented by a pair <lower,
upper>.  Both values are four octet, non-negative integers.  The attribute
may be modified by ASes other than the orginator, according to the
following algorithm.  The attribute must not be replaced other than
according to the following algorithm.  [How about removing it?]

[Originator's AS# could also be included if anyone can think of a good use
for it.]

Route Selection Process
- -----------------------

For the purposes of route selection, <lower> should be considered to be the
metric.  Routes with a higher metric should be preferred to routes with a
lower metric.  [This could be by biasing the local_pref, or by including
the comparison after the local_pref comparison.]  Routes with any metric
should be preferred over similar routes without any metric.

Operation
- ---------

When an AS originates routes for which it wishes to bias upstream selection
preference, it should set the metrics according to the following algorithm:

Call the received attribute <rcv_lower, rcv_upper>
range = rcv_upper - rcv_lower
n = number of distinct preferences with which the route is to be offered
p = ordinal preference of the route in question, with the most preferred
route having the highest preference, and the least preferred route having
preference ordinal 1.  Note that several routes could have the same
preference.

For a route with preference p,

        lower = rcv_lower + (p-1) * range/n
        upper = rcv_lower +   p   * range/n

When originating a route, rcv_lower is defined as 0 and rcv_upper is
defined as 2^32-1.  [Or rcv_lower could be defined as 1; this could be
convenient since 0 could be used as a magic cookie meaning "no metric"
which would also sort correctly per the rules above.]  Otherwise, when
propagating a route, rcv_lower and rcv_upper are taken from the received
attribute.

If the AS propagating the route has no wish to bias upstream selection
preference, it need not (and should not) modify the attribute.

The intent of this algorithm is to divide the original sender's preference
up into a set of non-overlapping ranges whose ordering reflects the
preferences of all transit ASes which have such preference.  [But see
"Limitations" for cases where ranges can overlap.]

[Hmm, what do we do when we work range sizes all the way down to zero?
This won't take long in principle, but in practice we might actually have a
low enough diameter to make this a non-issue?]

Example
- -------

Here is an example, with a comparison to DPA:

        A-------B----------------D----+
        |                        |\   |
        +-------C--------E-------+ \  |
                |\                  | |
                | -------F----------+ |
                |                     |
                +--------G------------+

Consider A-G to be ASes.  Suppose that A prefers to receive packets via B.
Likewise, C prefers to receive packets for A via E over F and F over G.

Using DPAs, A could send DPA <A, 2> to B and <A, 1> to C.  C could send DPA
<C, 3> to E, <C, 2> to F and <C, 1> to G.  D would have no way within the
context of DPA of choosing between B and E.  This could be fixed using
Local_Pref, but the amount of coordination required seems undesirable if
another workable option is available.

Using the algorithm above, A would originate two attributes.  To B, it
would send <2147483648, 4294967295>.  To C it would send <0, 2147483648>.
C would originate three attributes.  To E it would send <1431655765,
2147483648>.  To F it would send <715827883, 1431655765>.  To G it would
send <0, 715827883>.  D would then be able to determine the desired
ordering (B beats E beats F beats G).

Limitations
- -----------

This scheme preserves a total ordering on the metric so long as each AS
where branching occurs has a preference and thus modifies the attribute
(such as in the example).  In the event that a branching occurs with no
modification of the preference, followed by a reconvergence of paths, some
of the ordering property is lost.  That is, ranges may overlap.  For
example:

	B--------C--------E--------J
	|	 |		   |
	|	 +--------F--------|
	|			   |
	+--------D--------G--------|
		 |		   |
		 +--------H--------+

Suppose B has no preference between C or D, but C prefers E over F, and D
prefers G over H.  When J receives B's routes from E, F, G and H it will be
able to determine that it should prefer (E or G) over (F or H) but will not
be able to determine by examining the metric whether to prefer E or G.

As long as all "primary" paths are up, this behaves as desired, by allowing
J to choose whichever preferred path it likes.  However, if one primary
path (E, for example) goes down, traffic for B will be "attracted" to the
remaining "primary" path (G in this case).  This might not be desirable.

In a similar vein, consider this case, similar to the previous but with the
addition of AS K.  D prefers G over H over K:

	B--------C--------E--------J
	|	 |		   |
	|	 +--------F--------|
	|			   |
	+--------D--------G--------|
		 |		   |
		 |--------H--------|
		 |		   |
		 +--------K--------+

In this case J will prefer (E or G) over (F or H or K).  However, we also
now have H preferred over F preferred over K.  This is just an artifact of
the algorithm and doesn't really reflect any particular preference of any
downstream AS.

- --
John Scudder                        email:  jgs@ieng.com
Internet Engineering Group, LLC     phone:  (313) 669-8800
122 S. Main, Suite 280              fax:    (313) 669-8661
Ann Arbor, MI  41804                www:    http://www.ieng.com



------- End of Forwarded Message