Re: OSPF WG Charter Proposal

Dave Katz <dkatz@JUNIPER.NET> Thu, 07 November 2002 20:17 UTC

References: <28F05913385EAC43AF019413F674A0170167B229@OCCLUST04EVS1.ugd.att.com>
Message-ID: <200211072019.gA7KJVi68469@cirrus.juniper.net>
Date: Thu, 07 Nov 2002 12:19:31 -0800
Reply-To: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
Sender: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
From: Dave Katz <dkatz@JUNIPER.NET>
Subject: Re: OSPF WG Charter Proposal
Comments: To: gash@att.com
To: OSPF@DISCUSS.MICROSOFT.COM
In-Reply-To: <28F05913385EAC43AF019413F674A0170167B229@OCCLUST04EVS1.ugd.att.com> (gash@att.com)
Precedence: list

Other folks have pretty well summarized my feelings on this stuff, and
I've said all this in the past, but at the risk of redundancy I'll
restate.


My problems with the document are as follows:

Firstly, the claim that LS protocol collapses are primarily due to
deficiencies in the protocol design is just not accurate, and lets the
implementors off the hook.  While a different protocol design might
make such collapses more difficult (or at least different), it can be
shown from first principles that the existing protocols as specified
can be implemented in such a way that collapse (defined as adjacency
loss, which is the symptom at the heart of such collapses) will happen
only if the volume of Hello traffic exceeds the capacity of the
receiver to sink it and get its own Hellos out the door.  If properly
implemented, the failure mode under extreme flooding conditions should
be to have convergence time suffer (there's no free lunch, of course)
and *not* to lose adjacencies.  Implementations that suffer from this
problem are essentially doomed, since the failure mode is not
predictable.  If everyone stays lucky, the various workarounds to try
to avoid the problem will keep the implementation from falling over
the edge, but of course you can't control what your neighbor is doing
to you.
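
To make that concrete, here is a rough sketch (in C, with entirely
hypothetical types and function names, not taken from any real
implementation) of the receive-path discipline I mean: Hellos get
their own queue and are always serviced, inbound and outbound, before
any flooding work gets a bounded slice of attention.

    /* Sketch of a Hello-first receive path.  All types and
     * functions are hypothetical stand-ins. */

    #include <stdbool.h>

    struct ospf_pkt;                       /* opaque received packet */

    struct pkt_queue { struct ospf_pkt *head, *tail; };

    extern bool             queue_empty(const struct pkt_queue *q);
    extern struct ospf_pkt *queue_pop(struct pkt_queue *q);
    extern void             process_hello(struct ospf_pkt *p);
    extern void             process_flooding(struct ospf_pkt *p);
    extern bool             hello_timer_expired(void);
    extern void             send_own_hellos(void);

    /* One pass of the scheduler.  Our own Hellos go out first
     * (failing to send them is what drops adjacencies), received
     * Hellos are drained next, and flooding gets only a bounded
     * budget of work so we come back around before the next Hello
     * deadline. */
    void ospf_service_once(struct pkt_queue *hello_q,
                           struct pkt_queue *flood_q)
    {
        if (hello_timer_expired())
            send_own_hellos();

        while (!queue_empty(hello_q))
            process_hello(queue_pop(hello_q));

        for (int budget = 16; budget > 0 && !queue_empty(flood_q);
             budget--)
            process_flooding(queue_pop(flood_q));
    }

With this structure, the only way to lose an adjacency is for the
Hello traffic itself to exceed the receiver's capacity, which is the
first-principles bound described above; under a flooding storm,
convergence slows down but the adjacencies hold.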


Secondly, most of what the document describes is not subject to
standardization.  Discussion of implementation techniques is a fine
subject for an informational RFC, and in fact a number of the
mechanisms described are already in use in some implementations and
are helpful (as are a bunch of other tricks that provide us with
product differentiation.)  All of the MUSTs and so forth give a false
impression that these techniques are required for protocol correctness
and interoperability, which they are not.  The fact that these are
implementation choices underscores my claim that this is not a
protocol design issue, which I define to be the bits on the wire and
the elements of procedure that determine interoperability.  As far as
I can see, there is only one suggestion in this paper that involves a
protocol change (the signalling of congestion.)


The appendices suffer from an excess of academia.  The reality of
routing protocol implementations is that analysis and simulation
seldom bear any resemblance to reality (not to mention that any
vendor-supplied data is likely to be self-serving.)  If we could
actually predict the behavior of networks of the size proposed, we
wouldn't be having collapses--they happen because of what the
implementors don't know, the subtle interactions and plain old
screwups, which of course won't figure into the analysis and
simulation.  It's certainly true that backing away from the cliff will
help avoid collapse, but the problem with this approach is that you
never know where the cliff is.


As Joel and others have pointed out, the addition of even more
complexity to try to fix problems is unlikely to be satisfying.  The
industry's history shows that these protocols are on the hairy edge of
what people are able to implement in a stable, robust way, and making
it even harder is not the path to nirvana.


There are also operational complexities involved--if you try to avoid
disaster by using parametric mechanisms like limiters and such, you
end up in a situation where either values are fixed (based on a wet
finger in the air or some empirical testing) or else a bunch of
mysterious knobs are provided that nobody has any idea how to set.  My
whipping boy for this approach is the multivariate SPF rate control
knob that was recently added by a major vendor.  If you put the
product together properly, the SPF rate shouldn't matter, and nobody
knows how to set the knob anyhow.
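
For illustration, the kind of knob I'm talking about looks something
like the following (the parameter names and semantics are invented for
the example; I'm not quoting anyone's CLI): three interacting timers
drive an exponential hold-down on SPF runs, and the operator is left
to guess at all three.

    /* Invented illustration of a multi-parameter SPF throttle.
     * Nothing in the protocol tells an operator how to set these. */

    struct spf_throttle {
        unsigned initial_delay_ms;  /* wait after the first change   */
        unsigned hold_ms;           /* starting gap between SPF runs */
        unsigned max_hold_ms;       /* ceiling on the backed-off gap */
        unsigned cur_hold_ms;       /* current, adaptively doubled   */
    };

    /* After each SPF run the gap doubles up to the ceiling; a quiet
     * period would reset cur_hold_ms back to hold_ms (not shown). */
    static void spf_throttle_backoff(struct spf_throttle *t)
    {
        t->cur_hold_ms *= 2;
        if (t->cur_hold_ms > t->max_hold_ms)
            t->cur_hold_ms = t->max_hold_ms;
    }

Whatever values you pick, you're trading convergence time against CPU
headroom on a curve nobody has actually measured--the wet finger in
the air again.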


I guess my primary point is this--if you build your implementation in
such a way that it is close to impossible to lose adjacencies under
load conditions, the rest of this stuff is gravy (albeit tasty gravy),
but if your implementation can lose adjacencies when it is constipated,
none of this will address the real robustness problems.  The document
operates under the assumption that flooding traffic causes adjacency
loss (which is certainly true in many implementations), but it is this
issue that is the key to stability--the techniques themselves are
band-aids.

I don't have any particular problem with the techniques discussed in the
document (though in-band congestion notification was shown to be
flawed many years ago--see any of the literature on ICMP Source Quench.)
Matter of fact, I use a lot of this stuff.  It's just that none of this
is really a fix, just a workaround, and the only part of it that I can
see belonging on the standards track is the congestion notification
extensions.

Where these and other techniques are quite valuable is in improving
convergence/recovery times after major network events (other than load-
related adjacency collapses, which, of course, should never happen.)
Expeditious flooding for its own sake is a fine goal indeed.


--Dave