Re: OSPF WG Charter Proposal

Manohar Naidu Ellanti <ellanti@ATTBI.COM> Thu, 07 November 2002 23:06 UTC

Message-ID: <20021107230916.JPPU12281.rwcrmhc52.attbi.com@rwcrwbc69>
Date: Thu, 07 Nov 2002 23:09:16 +0000
Reply-To: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
Sender: Mailing List <OSPF@DISCUSS.MICROSOFT.COM>
From: Manohar Naidu Ellanti <ellanti@ATTBI.COM>
Subject: Re: OSPF WG Charter Proposal
To: OSPF@DISCUSS.MICROSOFT.COM
Precedence: list

> Other folks have pretty well summarized my feelings on this stuff, and
> I've said all this in the past, but at the risk of redundancy I'll
> restate.
>
>
> My problems with the document are as follows:
>
> Firstly, the claim that LS protocol collapses are primarily due to
> deficiencies in the protocol design is just not accurate, and lets the
> implementors off the hook.  While a different protocol design might
> make such collapses more difficult (or at least different) it can be
> shown on first principles that the existing protocols as specified can
> be implemented in such a way that collapse (defined as adjacency loss,
> which is the symptom at the heart of such collapses) will happen only
> if the volume of Hello traffic exceeds the capacity of the receiver to
> sink it and get its own Hellos out the door.  If properly implemented,
> the failure mode under extreme flooding conditions should be to have
> convergence time suffer (there's no free lunch, of course) and *not*
> to lose adjacencies.  Implementations that suffer from this problem
> are essentially doomed, since the failure mode is not predictable.  If
> everyone stays lucky, the various workarounds to try to avoid the
> problem will keep the implementation from falling over the edge, but
> of course you can't control what your neighbor is doing to you.
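A minimal sketch of the receive-path discipline being described, assuming a hypothetical classifier and two queues (every name here is invented, not taken from any real implementation): Hellos are drained first on every dispatch pass, so a flooding storm backs up the LS Update queue and slows convergence instead of starving the keep-alives.

/* Sketch only: a receive path that cannot starve Hellos under flooding
 * load.  hello_q, flood_q and the dispatch loop are hypothetical. */

#include <stdlib.h>

enum pkt_type { PKT_HELLO, PKT_LSUPDATE };

struct pkt {
    enum pkt_type type;
    struct pkt   *next;
};

static struct pkt *hello_q;   /* serviced first on every pass            */
static struct pkt *flood_q;   /* allowed to back up when overloaded      */

static void enqueue(struct pkt **q, struct pkt *p)
{
    p->next = NULL;
    while (*q)
        q = &(*q)->next;      /* FIFO append */
    *q = p;
}

static struct pkt *dequeue(struct pkt **q)
{
    struct pkt *p = *q;
    if (p)
        *q = p->next;
    return p;
}

/* Incoming OSPF packets are split by type before any processing. */
void rx_classify(struct pkt *p)
{
    enqueue(p->type == PKT_HELLO ? &hello_q : &flood_q, p);
}

/* One dispatch pass: drain every pending Hello before doing a bounded
 * amount of flooding work.  Under extreme load the worst case is slower
 * convergence, never a dropped adjacency. */
void dispatch_once(void)
{
    struct pkt *p;

    while ((p = dequeue(&hello_q)) != NULL) {
        /* process_hello(p); */
        free(p);
    }

    if ((p = dequeue(&flood_q)) != NULL) {
        /* process_ls_update(p); */
        free(p);
    }
}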
>
>
> Secondly, most of what the document describes is not subject to
> standardization.  Discussion of implementation techniques is a fine
> subject for an informational RFC, and in fact a number of the
> mechanisms described are already in use in some implementations and
> are helpful (as are a bunch of other tricks that provide us with
> product differentiation.)  All of the MUSTs and so forth give a false
> impression that these techniques are required for protocol correctness
> and interoperability, which they are not.  The fact that these are
> implementation choices underscores my claim that this is not a
> protocol design issue, which I define to be the bits on the wire and
> the elements of procedure that determine interoperability.  As far as
> I can see, there is only one suggestion in this paper that involves a
> protocol change (the signalling of congestion.)
>
>
> The appendices suffer from an excess of academia.  The reality of
> routing protocol implementations is that analysis and simulation
> seldom bears any resemblance to reality (not to mention that any
> vendor-supplied data is likely to be self-serving.)  If we could
> actually predict the behavior of networks of the size proposed, we
> wouldn't be having collapses--they happen because of what the
> implementors don't know, the subtle interactions and plain old
> screwups, which of course won't figure into the analysis and
> simulation.  It's certainly true that backing away from the cliff will
> help avoid collapse, but the problem with this approach is that you
> never know where the cliff is.
>
>
> As Joel and others have pointed out, the addition of even more
> complexity to try to fix problems is unlikely to be satisfying.  The
> industry's history shows that these protocols are on the hairy edge of
> what people are able to implement in a stable, robust way, and making
> it even harder is not the path to nirvana.
>
>
> There are also operational complexities involved--if you try to avoid
> disaster by using parametric mechanisms like limiters and such, you
> end up in a situation where either values are fixed (based on a wet
> finger in the air or some empirical testing) or else a bunch of
> mysterious knobs are provided that nobody has any idea how to set.  My
> whipping boy for this approach is the multivariate SPF rate control
> knob that was recently added by a major vendor.  If you put the
> product together properly, the SPF rate shouldn't matter, and nobody
> knows how to set the knob anyhow.
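For concreteness, the kind of parametric knob being criticized is roughly an exponential SPF hold-down with a few tunables. The sketch below uses invented names and default values purely to illustrate the mechanism; it says nothing about how any of the values should actually be set, which is the point.

/* Illustration only: an SPF hold-down with three tunables (initial
 * delay, hold time, ceiling).  Names and defaults are made up. */

static unsigned spf_initial_ms = 50;     /* delay after the first trigger      */
static unsigned spf_hold_ms    = 200;    /* doubles while events keep arriving */
static unsigned spf_max_ms     = 10000;  /* ceiling on the backoff             */

static unsigned current_hold;

/* Called when a topology change is noticed; returns how long to wait
 * before running SPF.  Which values are "right" depends entirely on
 * the network, and nobody can say in advance. */
unsigned spf_schedule_delay(int quiet_period_expired)
{
    if (quiet_period_expired) {
        current_hold = 0;
        return spf_initial_ms;
    }
    current_hold = current_hold ? current_hold * 2 : spf_hold_ms;
    if (current_hold > spf_max_ms)
        current_hold = spf_max_ms;
    return current_hold;
}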
>
>
> I guess my primary point is this--if you build your implementation in
> such a way that it is close to impossible to lose adjacencies under
> load conditions, the rest of this stuff is gravy (albeit tasty gravy)
> but if your implementation can lose adjacencies when it is constipated,
> none of this will address the real robustness problems.  The document
> operates under the assumption that flooding traffic causes adjacency
> loss (which is certainly true in many implementations) but it is this
> issue that is the key to stability--the techniques themselves are
> band-aids.
>
> I don't have any particular problem with the techniques discussed in the
> document (though in-band congestion notification has been shown to be
> flawed many years ago--see any of the literature on ICMP Source Quench.)
> Matter of fact, I use a lot of this stuff.  It's just that none of this
> is really a fix, just a workaround, and the only part of it that I can
> see that should be in the standards track are the congestion notification
> extensions.
And is there no need to address ways of reducing the LSAs that must be flooded
in a timely manner? Today a link is declared FULL only after both sides have
synced up their LSDBs, so all the auxiliary information is exchanged before the
link is declared FULL. Shouldn't the protocol allow the adjacency to be declared
FULL once the basic link/node information is in sync, with the neighbors
exchanging auxiliary information such as link colors in a second phase, i.e. a
phased DB exchange? This all boils down to separating the basic functionality
from the extended functionality, which should reduce the congestion.
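A rough sketch of that phased-exchange idea, with invented names and states (RFC 2328 defines no FULL_BASIC state; this only illustrates the proposal): type 1/2 LSAs count as the basic information needed to declare the adjacency usable for SPF, and opaque/TE information follows in a lower-priority second phase.

/* Hypothetical two-phase database exchange.  classify() and the
 * FULL_BASIC state are inventions for illustration only. */

#include <stdbool.h>

enum lsa_class { LSA_BASIC, LSA_AUXILIARY };

/* Router (type 1) and network (type 2) LSAs carry what SPF needs;
 * opaque LSAs (types 9-11) carrying TE attributes, link colors and so
 * on could wait for the second phase. */
enum lsa_class classify(unsigned lsa_type)
{
    return (lsa_type == 1 || lsa_type == 2) ? LSA_BASIC : LSA_AUXILIARY;
}

enum nbr_state { EXCHANGE, FULL_BASIC, FULL };

enum nbr_state advance(enum nbr_state s, bool basic_synced, bool aux_synced)
{
    switch (s) {
    case EXCHANGE:
        /* Phase one complete: the adjacency is usable for routing. */
        return basic_synced ? FULL_BASIC : EXCHANGE;
    case FULL_BASIC:
        /* Phase two runs in the background at lower priority. */
        return aux_synced ? FULL : FULL_BASIC;
    default:
        return s;
    }
}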
>
> Where these and other techniques are quite valuable is when looking at
> convergence/recovery times after major network events (other than load-
> related adjacency collapses which, of course, should never happen.)
> Expeditious flooding for its own sake is a fine goal indeed.
>
>
> --Dave