Re: OSPF WG Charter Proposal

"Choudhury, Gagan L, ALASO" <gchoudhury@ATT.COM> Mon, 11 November 2002 17:42 UTC

Message-ID: <28F05913385EAC43AF019413F674A017023F684D@OCCLUST04EVS1.ugd.att.com>
Date: Mon, 11 Nov 2002 12:45:08 -0500
Reply-To: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
Sender: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
From: "Choudhury, Gagan L, ALASO" <gchoudhury@ATT.COM>
Subject: Re: OSPF WG Charter Proposal
To: OSPF@DISCUSS.MICROSOFT.COM

Dennis,
  Thanks for your message and for taking the time to read our draft.

You point out, quite correctly, that adjacencies should not be taken down in the presence of a spike of (non-Hello, non-adjacency-related) database traffic, i.e., LSUs.  I suspect you achieve this by looking at the OSPF packet header: if a packet is a Hello, you process it ahead of other packets (LSUs).  That way, even during a large spike of LSUs, you never miss a Hello and the adjacency stays up.  We call this "Prioritized Treatment of Hello Packets".  We do not doubt that you do this, but since it is not stated explicitly in the OSPF specification, we feel it is important to spell out the need to prioritize Hello packets.  We suggest a special marking so that Hello packets can easily be separated from LSUs and prioritized.  However, even without special marking, if you can separate out Hello packets and process them ahead of LSUs, that is perfectly all right.
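
For concreteness, here is a minimal Python sketch (with purely illustrative queue and function names, not taken from any particular implementation) of the kind of receive path we have in mind; it peeks at the OSPF type field, which is byte 1 of the standard OSPF packet header (RFC 2328, A.3.1), and drains Hellos before anything else:

    from collections import deque

    OSPF_HELLO = 1         # OSPF packet type codes (RFC 2328, A.3.1)
    OSPF_LS_UPDATE = 4

    hello_queue = deque()  # always drained first
    other_queue = deque()  # LSUs, acks, database description, etc.

    def enqueue(packet):
        """Classify a received OSPF packet by its type field."""
        ospf_type = packet[1]          # byte 0 is version, byte 1 is type
        if ospf_type == OSPF_HELLO:
            hello_queue.append(packet)
        else:
            other_queue.append(packet)

    def next_packet():
        """Return the next packet to process, Hellos ahead of all else."""
        if hello_queue:
            return hello_queue.popleft()
        if other_queue:
            return other_queue.popleft()
        return None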

We did the simulation study to show what happens if Hellos are processed at the same priority as incoming LSUs (and so can get behind many LSUs under congestion); we did not mean to say that implementations always do this.  In fact, we say in the abstract that some implementations may already be prioritizing Hellos in order not to lose adjacencies, but we want all implementations to do so.

In the simulation, once we prioritized Hellos, adjacencies were never declared down (as expected).  However, at a larger LSA storm threshold there were a large number of retransmissions, which caused sustained heavy congestion at a subset of the node processors (the ones with high adjacency).  (We used fixed retransmission timer values of 5 seconds or 10 seconds throughout the simulation.)  In this case, even though links were not being declared down, the sustained heavy congestion due to retransmissions is unacceptable, so we still call this a "cliff", even though it is not as harmful as losing adjacencies.  We show by simulation that the retransmission traffic can be reduced significantly (and therefore the "cliff" moved significantly to the right) by prioritizing Acknowledgment packets.  However, this is not necessarily the only solution; there are beneficial effects of slowing down retransmissions under congestion as well (those simulation results are not yet reported).  Again, we are not saying that implementations take no special measures to reduce retransmission traffic under heavy congestion.  However, these measures are not spelled out explicitly in the OSPF specification, and they are important enough to deserve explicit mention so that all implementations can benefit.
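
As an illustration only of what "slowing down retransmissions under congestion" might look like (a sketch with made-up constants, not what we simulated), the per-neighbor retransmission interval could back off exponentially while the retransmission list stays non-empty and reset once it clears:

    BASE_RXMT_INTERVAL = 5.0   # seconds, e.g. the fixed value from our simulation
    MAX_RXMT_INTERVAL = 40.0   # illustrative cap

    def next_rxmt_interval(current, backlog_cleared):
        """Exponential backoff of the retransmission timer under congestion."""
        if backlog_cleared:
            return BASE_RXMT_INTERVAL
        return min(current * 2.0, MAX_RXMT_INTERVAL)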

You wondered about the cause of the problem, the LSA storm (a near-synchronous update of a very large number of LSAs).  It can be caused by one or more link failures, one or more node failures, or the sudden generation of a large number of LSAs in near synchrony (by coincidence, or triggered by a software bug or an improper operational procedure).  The problem happens very rarely, but it does happen.  Also, it is a problem only in large networks with, say, hundreds of nodes, thousands of trunks, very high adjacency at some nodes, and a very large number of LSAs, e.g., many ASE LSAs or many LSAs carrying link reserved-bandwidth information (similar to traffic-engineering LSAs).

                Gagan Choudhury  

-----Original Message-----
From: Dennis Ferguson [mailto:dennis@JUNIPER.NET]
Sent: Saturday, November 09, 2002 12:05 AM
To: OSPF@DISCUSS.MICROSOFT.COM
Subject: Re: OSPF WG Charter Proposal


Hello,

I just read draft-ietf-ospf-scalability-02.txt, along with
draft-ash-manral-ospf-congestion-control-00.txt which seems
to be related, and find I am a little uncomfortable with the way the
problem(s) is being described, and hence with the inevitability (or even
desirability) of the solutions offered.
Please excuse me if this has been discussed before; I skimmed
and deleted a bunch of messages about this before actually
reading the documents and discovering that I cared.

> The various proposals for OSPF Scalability improvements are to move
> the LSA storm threshold (or "cliff") significantly upwards.  Some people
> feel that this should be done only through smart (proprietary)
> implementation.  However, the points against that are the following:

I think the LSA storm threshold, or cliff, is not a problem in itself,
but rather is a symptom of an underlying problem.  Because the problem
was not identified explicitly, however, it took some time to figure out what
might be causing this cliff (guessed at mostly by working backwards from
the proposed solution(s)).  The problem which causes this appears to me to
actually be that the implementation allowed load from non-adjacency-
maintenance traffic (i.e. essentially traffic which wasn't Hellos) to
affect adjacency maintenance, in particular by incorrectly dropping
adjacencies to routers which were in fact still operating.

The problem of erroneously dropping adjacencies to routers which are
still up when under (unrelated) stress is the canonical problem which
leads to network routing collapse.  Just about anyone who has implemented
a link state protocol (or any protocol where adjacencies are maintained
for that matter, e.g. BGP or PPP), deployed it and watched it break a
couple of times knows this by heart, though it is certainly worth writing
down for implementors who haven't got this far yet.  If you want stable
routing then implementations must be done in a way such that this *does*
*not* *happen*, that is the implementation must ensure that the cost of
spikes in database flooding is paid solely by increased convergence time,
and is never the cause of adjacencies being incorrectly declared down.  If
implementations are done this way then the ultimate scalability of the
protocol implementation (beyond database memory requirements, though
database memory has gotten inordinately cheap these days) is dictated
solely by how long you are willing to wait for your network to converge
after spikes, and that is a continuous degradation with no "cliff" at all.

Now I would simply assert that there is nothing about the OSPF protocol,
as currently specified, which prevents implementations from achieving the
nirvana described in the previous paragraph.  Nothing.  It may not be
easy to get an implementation to work this way, but I think this difficulty
is inherent in the problem that needs to be solved within the (inherently
proprietary) real-life environment that the implementation runs inside,
which sometimes leaves it short of resources, and is hence far removed from
the protocol definition itself.  Errors which cause adjacencies to be
incorrectly declared down are implementation errors, not protocol errors,
and are probably best addressed in the implementation which is broken
rather than the protocol which isn't.  An implementation which doesn't
work like the previous paragraph is one which still has bugs in need
of fixing.

So while I like the simulations done in the scalability draft a lot, I'll
just point out how this draft seems to go given the assumptions
and views above:

(1) You did simulations using an OSPF implementation which had a bug, in
    that it unnecessarily and erroneously declared adjacencies down in
    the presence of a spike of (non-Hello, not adjacency related) data
    base traffic.  How you came to the conclusion that implementations
    which don't make this error are "proprietary" while the particular
    implementation you simulated which made this error is not, I don't
    understand.

(2) Rather than addressing the problem within your implementation, you
    elected to try to solve the problem by having the implementation's
    neighbours help out by setting some new bits in packets (i.e. a
    protocol change) to assist in a prioritization scheme.  I don't
    mind this, except to note that changing an established protocol
    and then waiting for everyone to catch up to the change is usually
    the least productive, longest-term way to fix bugs in an implementation,
    since it relies on router neighbours from multiple vendors all doing
    this before the "fix" takes effect.

(3) You then redid the simulations after the "fix" was applied, and found
    that while there was still a cliff the "fix" had moved it up and to
    the right.  That is, despite the fact that you even went to the trouble
    of making a protocol change, the "fix" still didn't actually fix the
    problem, which is that you have a cliff at all.  The routing failure can
    apparently still occur, the danger of your network routing melting is
    still there, it just apparently needs a much bigger event to set it
    off.  This is telling me that, independent of this, you really need
    to look at the implementation you are simulating with and determine
    what's actually wrong with it, as it is still an accident waiting to
    happen.

But, more important than this, note that the above comments are still
based on speculation about the root cause of the problem, since according
to the draft the problem is a "cliff" and the fix is to move the
cliff up and to the right, and it is hard to relate any of this to the
underlying protocol.  You really, really must identify what the routers
are actually doing wrong, in words rather than from numerical simulation
results, since if you don't understand and present this your readers will
never understand the spectrum of possible fixes and why they should work.  For
example, if the problem really is one of routers erroneously dropping
adjacencies where the neighbour is still up, then prioritization of
Hello packet reception and processing seems like not a bad way to help
keep this from occurring (we can debate whether this is better done
by a protocol change or by just having the routers look at the OSPF
packet type before queuing), since its effect is to keep adjacency
errors from occurring by spending CPU keeping this accurate before
spending CPU doing anything which can be deferred.  It is far from
the only way to help, however.  For example, another way to help fix
this (perhaps in addition to the above) might be for the routers to
observe when they've been under load, with a long queue of work to do
that they've perhaps been dropping packets off the end of, and to defer
making any decisions to drop adjacencies until after this condition clears.
This still makes errors about the state of adjacencies, but changes the
nature of the errors from one where the errors you make are terminating
adjacencies which are really still up, which can melt your network, to
one where the errors you make are concluding adjacencies might still be
up when they have in fact gone down, a temporary booboo whose sole effect
might be to increase convergence time (which is how you should be paying
for load spikes in any event).  Note that this adds robustness (i.e.
probably is the best you can do in the circumstance) not only when you
are getting too many LSAs, but also when you are temporarily too short
of CPU to even keep up to the incoming Hellos.
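
To make that second suggestion concrete, here is a rough Python sketch (with hypothetical names and thresholds, purely for illustration) of deferring the "neighbour dead" decision while the router itself has been too overloaded to process Hellos reliably: the router records when its input backlog was last excessive and refuses to act on an expired dead timer until that condition has been clear for a while.

    import time

    OVERLOAD_QUEUE_LEN = 1000   # input backlog that counts as "under load"
    OVERLOAD_HOLDOFF = 30.0     # seconds to wait after the load clears

    last_overloaded_at = 0.0

    def note_input_backlog(queue_len):
        """Remember the last time our own input queue was excessively long."""
        global last_overloaded_at
        if queue_len > OVERLOAD_QUEUE_LEN:
            last_overloaded_at = time.monotonic()

    def may_declare_neighbor_down(last_hello_seen, dead_interval):
        """Declare a neighbour down only if the dead interval has expired AND
        we have not recently been dropping or delaying our own input."""
        now = time.monotonic()
        dead_timer_expired = (now - last_hello_seen) > dead_interval
        recently_overloaded = (now - last_overloaded_at) < OVERLOAD_HOLDOFF
        return dead_timer_expired and not recently_overloaded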

There are a lot of ways to address this particular problem (if this actually
is the "cliff" problem) in implementations, enough that your implementation
really shouldn't have "cliff" problems to begin with, and you should
just keep fixing the implementation until the "cliff" problems go away
so you don't have to live in fear that something big will happen which
will cause your network routing to melt and fail to recover.  This is
so fundamental that I wouldn't wait for the wide deployment of protocol
changes to ensure that this is fixed (even if protocol changes were
exceptionally useful, which I doubt), but would just get it fixed in
the implementations you've already got talking to the other routers you've
already got.

Once you've got the basic reliability thing out of the way you can then
begin to look at addressing the real scaling problem which remains, which
is convergence time and maximizing the size of the network you can have
while still having it converge in acceptably short periods.  This is the
problem which I think might benefit from protocol changes, both related
to minimizing flooding load and otherwise.  It is just important to be
clear on why these changes are being done as well, for about the same
reasons it is important to understand what a "cliff" problem is, that
is you'll never get the right answer without being very clear about
why you're doing it.  For example, sending a lot of unnecessary
retransmissions to a neighbour should never, ever cause that neighbour
to break (if it does he's the one with the problem needing fixing), but
it is still not a good thing to do since it makes everyone work harder
and that is guaranteed to slow convergence.  The latter thing is still
very much worth fixing, you just shouldn't be doing the fixing to hack
around implementation-robustness "cliff" problems.

To cut this short, then, I'll just point out what I wanted to say in the
first place about the congestion draft, which is that in general I
don't think using rate controls ever works right given the problem which
I think this should be addressing.  The problem with rate controls is
that if they are fixed configuration they're always wrong, while if
they're automatically adjustable you need to compute adjustments
slowly to ensure stability, which in turn means that the rate you compute
is usually wrong for current circumstances.  It is either too high
or too low.

In the case of flooding to neighbours, both of these cases are bad.
Sending too fast is bad, not because it will break anything (it
absolutely shouldn't) but because processing a big queue of stale
stuff, perhaps with accompanying spurious retransmissions, makes
everyone work harder than they need to and delays convergence.
Sending too slow, however, is also bad because it puts you in
the incredibly silly situation of having an unconverged network where
the routers are simultaneously idle some of the time, which slows
convergence for no discernible advantage at all.  I hence, without
having thought enough about it, like schemes which attempt to do
window control (i.e. limit the number of outstanding, un-acked LSAs
instead), using the delay of and rate at which acks come back as the
control signal, if only because there are examples of transport
protocols working like this which exhibit both stability and a fairly
quick response to changes.
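
As an illustration only (not a reference to any existing implementation), such a window-control scheme for flooding might look roughly like the following Python sketch: cap the number of un-acked LSAs outstanding to a neighbour, grow the window as acks come back, and shrink it when a retransmission timer fires, in the spirit of transport-protocol congestion windows.

    class FloodingWindow:
        def __init__(self, initial=4, maximum=256):
            self.window = initial
            self.maximum = maximum
            self.outstanding = 0      # LSAs sent but not yet acked

        def can_send(self):
            return self.outstanding < self.window

        def on_send(self):
            self.outstanding += 1

        def on_ack(self):
            self.outstanding = max(0, self.outstanding - 1)
            self.window = min(self.window + 1, self.maximum)  # additive increase

        def on_retransmit_timeout(self):
            self.window = max(1, self.window // 2)            # multiplicative decrease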

In the case of the SPF computation, however, imposing standardized
rate limits on the maximum frequency with which an implementation may
compute this just seems irrational, and makes no sense at all unless the
implementation entirely lacks the ability to prioritize its actions, an
attribute which is certainly not universal and is unlikely to lead to
great success in making any of this stuff work reliably.  I mean, if
a router is one SPF computation away from converging its routing, and
knows it has absolutely nothing else to do at the moment, is there any
rational reason to require it to sit idle for four seconds until some
timer expires before actually doing it?  I think most implementations
should be able to figure out for themselves when it is a good idea to
run the Dijkstra and when it isn't, and so should retain the freedom to
do this.
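
A sketch of that freedom, again in Python with made-up names (the four-second hold-down is only the hypothetical figure from the previous paragraph): run the SPF immediately whenever a recomputation is pending and there is no other work queued, and fall back to a hold-down timer only while flooding work is still outstanding.

    import time

    SPF_HOLDDOWN = 4.0   # seconds, applied only when other work is queued

    def should_run_spf(spf_pending, last_spf_at, pending_packets):
        """Decide whether to run the SPF computation now."""
        if not spf_pending:
            return False
        if pending_packets == 0:
            return True               # nothing better to do: converge now
        return (time.monotonic() - last_spf_at) >= SPF_HOLDDOWN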

So I think the "congestion" draft has some good ideas (I also like other
attempts to find ways to minimize flooding work) but seems confused about
why these are good ideas which makes some of its other ideas not-so-good,
while the "scalability" draft is proposing a protocol change to relieve
(but not entirely fix) a mysterious problem which, if better defined,
might turn out to be a classic adjacency maintenance implementation bug
which is fixable with no protocol changes at all.  I think the latter
draft would be a whole lot better (in fact, would be really, really
great) if, instead of making a protocol change, you instead did and
documented whatever was necessary to make the "cliff" disappear entirely
from your simulation implementation without making protocol changes,
at which point the (convergence) scaling problem the former draft should
probably be aimed at addressing would be a lot clearer.

Dennis Ferguson