Re: OSPF WG Charter Proposal

"Ash, Gerald R (Jerry), ALASO" <gash@ATT.COM> Thu, 07 November 2002 18:12 UTC

Rohit,

> > I think you should review the ample evidence presented in 
> > http://www.ietf.org/internet-drafts/draft-ash-manral-ospf-congestion-control-00.txt 
> > that the protocols need to be enhanced to better respond to congestion collapse:
> > Section 2: documented failures and their root-cause analysis, across multiple
> > service provider networks (also review the cited references)

> The cited references [att, cholewka, jander, pappalardo*] are from trade
> rags. Unless there is some other peer-reviewed paper or standards document,
> these incidents can hardly be used to point to the root cause. No, I am
> not saying that something wasn't wrong - just that these references don't
> carry much weight.

A summary of the extensive root-cause analysis of one incident (performed by both the service provider and the vendors) is presented in Section 2 of http://www.ietf.org/internet-drafts/draft-ash-manral-ospf-congestion-control-00.txt.  The other two incidents cited had similarly extensive root-cause analysis, with similar conclusions.  The trade articles give some additional information about the incidents; yes, they are not refereed papers.  I believe the cited failures and the summary of root-cause analysis presented in the I-D should carry some weight.

As a result of these failure experiences, our vendors made protocol upgrades (albeit proprietary upgrades) to address the problems.  These protocol upgrades were along the lines of the proposals made in the I-D to address these same problems in a standard, interoperable way.
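
To make "along the lines of the proposals" a bit more concrete, here is a minimal sketch (in Python, purely illustrative -- the names and constants are mine, not taken from the I-D or from any vendor's code) of one such mechanism, exponential backoff of LSA retransmissions:

    # Illustrative only: constants and names are made up, not taken
    # from the I-D or from any vendor implementation.
    INITIAL_RXMT_INTERVAL = 5.0   # seconds, like OSPF's RxmtInterval
    MAX_RXMT_INTERVAL = 40.0      # cap, so retransmission never stops entirely

    def next_rxmt_interval(current, acked):
        """Return the next retransmission interval for an LSA."""
        if acked:
            return INITIAL_RXMT_INTERVAL            # reset once acked
        return min(current * 2, MAX_RXMT_INTERVAL)  # back off while unacked

The point is only that a neighbor which is not acknowledging LSAs is offered progressively *less* retransmission traffic, instead of a fixed-rate stream, until it catches up.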

> In one of these reported incidents, I was physically present and happen
> to know the root cause. It was an "implementation mistake" which was
> triggered by two different versions of the software being present in
> the network simultaneously during a network-wide s/w upgrade 
> - a flooding storm resulted. 

Right, there are many problems/bugs/manual-errors/etc. that can trigger a catastrophic failure and flooding storm; this is pointed out specifically in the I-D.  From Section 2:

"For example, in the failure in the AT&T Frame Relay Network on April 13, 1998 [att], an initial procedural error triggered two undetected software bugs, leading to a huge overload of control messages in the network.  The result of this control overload was the loss of all topology information, which the LS protocol then attempted to recover using the usual Hello and LS updates. However, the LS protocol was overwhelmed and unable to recover, and manual means had to be used to restart the network after a long outage."

The other failures cited in Section 2 had different triggers for the flooding storm -- there are any number of ways to get into this situation.  We would not like flooding storms to be triggered, but unfortunately they *are* triggered, and when they are, the problem is getting out of them quickly (not taking the hours and days that were experienced).

> The problem here is _not_ in the flooding. 

The problem *was* in the flooding storm that was triggered in all of the failures cited, and in the inability to recover from it.

The scenario presented in Appendix B mimics the failure cited above (from Section 2), and a few vendors have provided analysis of how fast their protocol implementations recover from it.  We invite other vendors to analyze this scenario for their OSPF implementations.
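
For any vendor taking up that invitation, a hypothetical helper (the function and the example are mine, purely illustrative) for summarizing such an analysis: given per-second samples of how many adjacencies are up, report the time to full recovery after the disturbance:

    def recovery_time(samples, total_adjacencies, disturbance_index):
        """Seconds from the disturbance until every adjacency is back up,
        or None if the network never fully recovers within the samples."""
        for i in range(disturbance_index, len(samples)):
            if samples[i] == total_adjacencies:
                return i - disturbance_index
        return None

    # e.g. recovery_time([50, 50, 3, 10, 35, 50], 50, 2) -> 3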

> While one can certainly provide more knobs to limit flooding, 
> one can also solve the problem equally well or better by fixing 
> the base implementation or doing better tests before upgrading 
> a network.

As above, even with the best implementations and testing, stuff happens that triggers flooding-storm events from which the protocol cannot adequately recover.  We need to limit flooding, etc., *in addition to* having the best implementations and testing.
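
As one sketch of what "limit flooding" could look like (a standard token bucket; the class name, parameters, and numbers are mine, not from the I-D or from any implementation), pacing outgoing LSA updates so that a storm cannot saturate a neighbor's control plane:

    import time

    class LsaPacer:
        """Token bucket pacing outgoing LSA updates.  Illustrative only."""

        def __init__(self, rate_per_sec=100.0, burst=200):
            self.rate = rate_per_sec    # sustained LSAs-per-second budget
            self.burst = burst          # short bursts allowed up to this size
            self.tokens = float(burst)
            self.last = time.monotonic()

        def try_send(self):
            """Return True if one LSA update may be flooded now."""
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False    # caller queues the LSA and retries on a timer

A caller that gets False back would queue the LSA and retry later, so that under overload flooding is delayed rather than dropped.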

> > To say that network collapse in *every* case is due to *naive design 
> > choices* ignores the evidence/analysis presented.  Based on the 
> > evidence/analysis, there is clearly room for the protocols to be 
> > improved to the point where networks *never* go down for hours or 
> > days at a time (drawing unwanted headlines & business impact).

> I don't think this is what Dave is saying - he is simply referring to 
> all the cases that _he_ has seen.

OK.  I took his drift to mean that every failure, including the ones I cited, can be attributed to naive design, and that no protocol extensions are necessary, just better design.

> This matches my experience with the caveat that
> in some cases (a) there was faulty hardware (b) network/ospf 
> parameters were inconsistently or incorrectly applied.

Well, given your statement above that you were physically present at one of the reported incidents, know the root cause (an "implementation mistake"), and witnessed the flooding storm that resulted, you also know that the recovery from that incident was totally inadequate and unacceptable.  I would infer that your experience is similar to mine, and not to Dave's.

Jerry