Re: OSPF WG Charter Proposal

Dennis Ferguson <dennis@JUNIPER.NET> Sat, 09 November 2002 05:04 UTC

Message-ID: <200211090504.gA954sS67699@merlot.juniper.net>
Date: Fri, 08 Nov 2002 21:04:54 -0800
Reply-To: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
Sender: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
From: Dennis Ferguson <dennis@JUNIPER.NET>
Subject: Re: OSPF WG Charter Proposal
To: OSPF@DISCUSS.MICROSOFT.COM
In-Reply-To: Your message of "Thu, 07 Nov 2002 19:16:44 EST." <28F05913385EAC43AF019413F674A017123E4D@OCCLUST04EVS1.ugd.att.com>

Hello,

I just read draft-ietf-ospf-scalability-02.txt, along with
draft-ash-manral-ospf-congestion-control-00.txt which seems
to be related, and find I am a little uncomfortable with
the way the problem(s) is being described, and hence with the
inevitability (or even desirability) of the solutions offered.
Please excuse me if this has been discussed before; I skimmed
and deleted a bunch of messages about this before actually
reading the documents and discovering that I cared.

> The various proposals for OSPF Scalability improvements are to move
> the LSA storm threshold (or "cliff") significantly upwards.  Some people
> feel that this should be done only through smart (proprietary)
> implementation.  However, the points against that are the following:

I think the LSA storm threshold, or cliff, is not a problem in itself,
but rather is a symptom of an underlying problem.  Because the problem
was not identified explicitly, however, it took some time to figure out what
might be causing this cliff (guessed at mostly by working backwards from
the proposed solution(s)).  The problem which causes this appears to me to
actually be that the implementation allowed load from non-adjacency-
maintenance traffic (i.e. essentially traffic which wasn't Hellos) to
affect adjacency maintenance, in particular by incorrectly dropping
adjacencies to routers which were in fact still operating.

The problem of erroneously dropping adjacencies to routers which are
still up when under (unrelated) stress is the canonical problem which
leads to network routing collapse.  Just about anyone who has implemented
a link state protocol (or any protocol where adjacencies are maintained
for that matter, e.g. BGP or PPP), deployed it and watched it break a
couple of times knows this by heart, though it is certainly worth writing
down for implementors who haven't got this far yet.  If you want stable
routing then implementations must be done in a way such that this *does*
*not* *happen*, that is the implementation must ensure that the cost of
spikes in database flooding is paid solely by increased convergence time,
and is never the cause of adjacencies being incorrectly declared down.  If
implementations are done this way then the ultimate scalability of the
protocol implementation (beyond database memory requirements, though
database memory has gotten inordinately cheap these days) is dictated
solely by how long you are willing to wait for your network to converge
after spikes, and that is a continuous degradation with no "cliff" at all.

Now I would simply assert that there is nothing about the OSPF protocol,
as currently specified, which prevents implementations from achieving the
nirvana described in the previous paragraph.  Nothing.  It may not be
easy to get an implementation to work this way, but I think this difficulty
is inherent in the problem that needs to be solved within the (inherently
proprietary) real-life environment that the implementation runs inside
which sometimes leaves it short of resources, and is hence far removed from
the protocol definition itself.  Errors which cause adjacencies to be
incorrectly declared down are implementation errors, not protocol errors,
and are probably best addressed in the implementation which is broken
rather than the protocol which isn't.  An implementation which doesn't
work like the previous paragraph is one which still has bugs in need
of fixing.

So while I like the simulations done in the scalability draft a lot, I'll
just point out how this draft looks given the assumptions
and views above:

(1) You did simulations using an OSPF implementation which had a bug, in
    that it unnecessarily and erroneously declared adjacencies down in
    the presence of a spike of (non-Hello, not adjacency related)
    database traffic.  How you came to the conclusion that implementations
    which don't make this error are "proprietary" while the particular
    implementation you simulated which made this error is not, I don't
    understand.

(2) Rather than addressing the problem within your implementation, you
    elected to try to solve the problem by having the implementation's
    neighbours help out by setting some new bits in packets (i.e. a
    protocol change) to assist in a prioritization scheme.  I don't
    mind this, except to note that changing an established protocol
    and then waiting for everyone to catch up to the change is usually
    the least productive, slowest way to fix bugs in an implementation,
    particularly when the router neighbours it is relying on for the
    "fix" come from multiple vendors.

(3) You then redid the simulations after the "fix" was applied, and found
    that while there was still a cliff the "fix" had moved it up and to
    the right.  That is, despite the fact that you even went to the trouble
    of making a protocol change, the "fix" still didn't actually fix the
    problem, which is that you have a cliff at all.  The routing failure can
    apparently still occur and the danger of your network routing melting is
    still there; it just needs a much bigger event to set it off.  This is
    telling me that, independent of that fix, you really need
    to look at the implementation you are simulating with and determine
    what's actually wrong with it, as it is still an accident waiting to
    happen.

But, more important than this, note that the above comments are still
based on speculation about the root cause of the problem, since according
to the draft the problem is a "cliff" and the fix is to move the
cliff up and to the right, and it is hard to relate any of this to the
underlying protocol.  You really, really must identify what the routers
are actually doing wrong, in words rather than from numerical simulation
results, since if you don't understand and present this your readers will
never understand the spectrum of possible fixes and why they should work.

For example, if the problem really is one of routers erroneously dropping
adjacencies where the neighbour is still up, then prioritization of
Hello packet reception and processing seems like not a bad way to help
keep this from occurring (we can debate whether this is better done
by a protocol change or by just having the routers look at the OSPF
packet type before queuing), since its effect is to keep adjacency
errors from occurring by spending CPU keeping this accurate before
spending CPU doing anything which can be deferred.  It is far from
the only way to help, however.  For example, another way to help fix
this (perhaps in addition to the above) might be for the routers to
observe when they've been under load, with a long queue of work to do
that they've perhaps been dropping packets off the end of, and to defer
making any decisions to drop adjacencies until after this condition clears.
This still makes errors about the state of adjacencies, but changes the
nature of the errors from one where the errors you make are terminating
adjacencies which are really still up, which can melt your network, to
one where the errors you make are concluding adjacencies might still be
up when they have in fact gone down, a temporary booboo whose sole effect
might be to increase convergence time (which is how you should be paying
for load spikes in any event).  Note that this adds robustness (i.e.
probably is the best you can do in the circumstance) not only when you
are getting too many LSAs, but also when you are temporarily too short
of CPU to even keep up with the incoming Hellos.
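
To make that second idea a little more concrete, here is a rough sketch
of the sort of thing I mean (all of the names, thresholds and timer
values below are invented for illustration and are not from any
particular implementation): the router notes when its input queue has
grown long enough that it may be dropping or sitting on unread packets,
and it refuses to conclude a neighbour is dead until it has been keeping
up with its input for a full dead interval again.

    /*
     * Rough sketch only: defer declaring a neighbour dead while we know
     * we have been too backlogged to trust our own view of Hello arrivals.
     * All names, thresholds and numbers are invented for illustration.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct neighbor {
        const char *name;
        double      last_hello;     /* when we last processed a Hello from it */
        double      dead_interval;  /* RouterDeadInterval, in seconds         */
    };

    /* Set while the input queue is long enough that we may be dropping or
     * sitting on unread packets; cleared once the backlog has drained.     */
    static bool   overloaded;
    static double overload_cleared_at;   /* when the backlog last drained    */

    static void note_backlog(double now, unsigned queue_len)
    {
        if (queue_len > 100) {            /* arbitrary illustrative threshold */
            overloaded = true;
        } else if (overloaded) {
            overloaded = false;
            overload_cleared_at = now;
        }
    }

    /* Conclude the neighbour is gone only when the dead interval has elapsed
     * *and* we have been keeping up with our input for a full dead interval,
     * so that silence really is silence and not our own congestion.          */
    static bool neighbor_is_dead(const struct neighbor *n, double now)
    {
        if (overloaded)
            return false;                            /* defer the decision    */
        if (now - overload_cleared_at < n->dead_interval)
            return false;                            /* input not trusted yet */
        return now - n->last_hello > n->dead_interval;
    }

    int main(void)
    {
        struct neighbor n = { "rtr-a", 0.0, 40.0 };

        note_backlog(10.0, 500);     /* a big LSA spike arrives at t=10        */
        printf("t=45 dead? %d\n", neighbor_is_dead(&n, 45.0));  /* 0: deferred */

        note_backlog(50.0, 3);       /* backlog drains at t=50                 */
        printf("t=60 dead? %d\n", neighbor_is_dead(&n, 60.0));  /* 0: too soon */
        printf("t=95 dead? %d\n", neighbor_is_dead(&n, 95.0));  /* 1: now dead */
        return 0;
    }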

There are a lot of ways to address this particular problem (if this actually
is the "cliff" problem) in implementations, enough that your implementation
really shouldn't have "cliff" problems to begin with, and you should
just keep fixing the implementation until the "cliff" problems go away
so you don't have to live in fear that something big will happen which
will cause your network routing to melt and fail to recover.  This is
so fundamental that I wouldn't wait for the wide deployment of protocol
changes to ensure that this is fixed (even if protocol changes were
exceptionally useful, which I doubt), but would just get it fixed in
the implementations you've already got talking to the other routers you've
already got.

Once you've got the basic reliability thing out of the way you can then
begin to look at addressing the real scaling problem which remains, which
is convergence time and maximizing the size of the network you can have
while still having it converge in acceptably short periods.  This is the
problem which I think might benefit from protocol changes, both related
to minimizing flooding load and otherwise.  It is just important to be
clear on why these changes are being done as well, for about the same
reasons it is important to understand what a "cliff" problem is; that
is, you'll never get the right answer without being very clear about
why you're doing it.  For example, sending a lot of unnecessary
retransmissions to a neighbour should never, ever cause that neighbour
to break (if it does he's the one with the problem needing fixing), but
it is still not a good thing to do since it makes everyone work harder
and that is guaranteed to slow convergence.  The latter thing is still
very much worth fixing; you just shouldn't be doing the fixing to hack
around implementation-robustness "cliff" problems.

To cut this short, then, I'll just point out what I wanted to say in the
first place about the congestion draft, which is that in general I
don't think using rate controls ever works right given the problem which
I think this should be addressing.  The problem with rate controls is
that if they are fixed configuration they're always wrong, while if
they're automatically adjustable you need to compute adjustments
slowly to ensure stability, which in turn means that the rate you compute
is usually wrong for current circumstances.  It is either too high
or too low.

In the case of flooding to neighbours, both of these cases are bad.
Sending too fast is bad, not because it will break anything (it
absolutely shouldn't) but because processing a big queue of stale
stuff, perhaps with accompanying spurious retransmissions, makes
everyone work harder than they need to and delays convergence.
Sending too slow, however, is also bad because it puts you in
the incredibly silly situation of having an unconverged network where
the routers are simultaneously idle some of the time, which slows
convergence for no discernible advantage at all.  I hence, without
having thought enough about it, like schemes which attempt to do
window control (i.e. limit the number of outstanding, un-acked LSAs
instead), using the delay of and rate at which acks come back as the
control signal, if only because there are examples of transport
protocols working like this which exhibit both stability and a fairly
quick response to changes.
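
Purely to illustrate the shape of such a scheme (the names and constants
below are made up, and a real implementation would of course be rather
more careful about retransmission state), a windowed flooder to a single
neighbour might look something like this, with acks opening the window
additively and retransmission timeouts cutting it, much as TCP-like
transports do:

    /*
     * Toy sketch of window-controlled flooding to one neighbour: limit the
     * number of un-acked LSAs outstanding and let the acks set the pace.
     * Names and constants are invented; a real implementation would also
     * track per-LSA retransmission state.
     */
    #include <stdio.h>

    struct flood_window {
        unsigned window;        /* max LSAs allowed outstanding (un-acked) */
        unsigned outstanding;   /* LSAs sent and not yet acknowledged      */
    };

    /* May we flood another LSA to this neighbour right now? */
    static int flood_can_send(const struct flood_window *fw)
    {
        return fw->outstanding < fw->window;
    }

    static void flood_sent(struct flood_window *fw)
    {
        fw->outstanding++;
    }

    /* An ack came back: open the window a little (additive increase). */
    static void flood_acked(struct flood_window *fw)
    {
        if (fw->outstanding > 0)
            fw->outstanding--;
        if (fw->window < 256)
            fw->window++;
    }

    /* A retransmission timer fired: the neighbour is falling behind, so
     * cut the window (multiplicative decrease), as TCP-like schemes do.  */
    static void flood_rto(struct flood_window *fw)
    {
        fw->window = fw->window > 2 ? fw->window / 2 : 1;
    }

    int main(void)
    {
        struct flood_window fw = { 4, 0 };

        /* Flood while the window allows, then stop and wait for acks.    */
        while (flood_can_send(&fw))
            flood_sent(&fw);
        printf("window=%u outstanding=%u\n", fw.window, fw.outstanding);

        flood_acked(&fw);            /* an ack arrives: window opens to 5 */
        flood_rto(&fw);              /* a timeout: window is cut to 2     */
        printf("window=%u outstanding=%u\n", fw.window, fw.outstanding);
        return 0;
    }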

In the case of the SPF computation, however, imposing standardized
rate limits on the maximum frequency with which an implementation may
compute this just seems irrational, and makes no sense at all unless the
implementation entirely lacks the ability to prioritize its actions, an
attribute which is certainly not universal and is unlikely to lead to
great success in making any of this stuff work reliably.  I mean, if
a router is one SPF computation away from converging its routing, and
knows it has absolutely nothing else to do at the moment, is there any
rational reason to require it to sit idle for four seconds until some
timer expires before actually doing it?  I think most implementations
should be able to figure out for themselves when it is a good idea to
run the Dijkstra and when it isn't, and so should retain the freedom to
do this.
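
Purely as an illustration of what "figure out for themselves" might mean
(the names and the half-second "still churning" heuristic below are
invented), the decision could be as simple as running the SPF whenever
routing is out of date, there is no adjacency or flooding work pending,
and the database has stopped changing:

    /*
     * Illustrative sketch only: decide when to run the SPF computation from
     * what else there is to do, rather than from a fixed rate limit.  The
     * names and the half-second "still churning" heuristic are invented.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct spf_state {
        bool   spf_needed;        /* database changed since the last SPF run */
        double last_lsa_change;   /* when the database last changed          */
    };

    /* Run the SPF now if routing is out of date and there is nothing more
     * urgent to do.  If LSAs are still streaming in, hold off briefly so we
     * compute once over a settled database rather than repeatedly over a
     * moving one; but if the router is otherwise idle and the database has
     * settled, there is no reason to sit on a timer.                         */
    static bool spf_should_run(const struct spf_state *s, double now,
                               unsigned hellos_pending, unsigned lsas_pending)
    {
        if (!s->spf_needed)
            return false;
        if (hellos_pending > 0 || lsas_pending > 0)
            return false;                 /* adjacency and flooding work first */
        if (now - s->last_lsa_change < 0.5)
            return false;                 /* database still churning           */
        return true;                      /* idle and out of date: run it now  */
    }

    int main(void)
    {
        struct spf_state s = { true, 10.0 };

        printf("%d\n", spf_should_run(&s, 10.2, 0, 5));  /* 0: flooding work   */
        printf("%d\n", spf_should_run(&s, 10.3, 0, 0));  /* 0: still churning  */
        printf("%d\n", spf_should_run(&s, 11.0, 0, 0));  /* 1: idle -> run SPF */
        return 0;
    }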

So I think the "congestion" draft has some good ideas (I also like other
attempts to find ways to minimize flooding work) but seems confused about
why these are good ideas, which makes some of its other ideas not-so-good,
while the "scalability" draft is proposing a protocol change to relieve
(but not entirely fix) a mysterious problem which, if better defined,
might turn out to be a classic adjacency maintenance implementation bug
which is fixable with no protocol changes at all.  I think the latter
draft would be a whole lot better (in fact, would be really, really
great) if, instead of making a protocol change, you did and
documented whatever was necessary to make the "cliff" disappear entirely
from your simulation implementation without making protocol changes,
at which point the (convergence) scaling problem the former draft should
probably be aimed at addressing would be a lot clearer.

Dennis Ferguson