Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Dave Katz <dkatz@juniper.net> Wed, 24 August 2016 20:21 UTC

From: Dave Katz <dkatz@juniper.net>
To: "Acee Lindem (acee)" <acee@cisco.com>
Date: Wed, 24 Aug 2016 20:21:19 +0000
Message-ID: <45CEC282-FBED-4BFF-9BF0-FC6AFBA18244@juniper.net>
References: <D3E36483.7B0EB%acee@cisco.com>
In-Reply-To: <D3E36483.7B0EB%acee@cisco.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ospf/hDElJ2XSqvvcQ8Ceag-Oo4GL6ZE>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxudong@huawei.com>, "ospf@ietf.org" <ospf@ietf.org>, "lizhenqiang@chinamobile.com" <lizhenqiang@chinamobile.com>
Subject: Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Speaking as a long time implementor of OSPF, IS-IS, et al, I agree.  While making protocols as robust as we can is a good thing, there are rapidly diminishing returns in trying to make protocol changes to help detect one-off bugs, especially if the protocol is not friendly to changes and extensions.  The number of possible bugs is essentially infinite.

I’ve seen a number of bugs in other implementations that have made it into production implementations, especially as I have had a tendency to “stretch” the specs in ways that are guaranteed to work so long as other implementations are following the spec.  These have been few and far between, however.

Classic example:  the 30 minute “architectural constant” refresh time.  This is *not* an architectural constant (defined as “must be true or things won’t work”);  the refresh just needs to happen often enough to keep the LSA from being maxaged anywhere.  So I made a one-line change to change the refresh time to 50 minutes, reducing the refresh load by 40% (not really “scaling” but it was easy, cheap, and guaranteed to work properly <cough>.)  This was fine for several years until someone introduced a router from a small, now long-dead vendor, at which point things started flapping weirdly.  Turns out that an engineer at said company got carried away jittering timers, and jittered the LSA age timer by 25% (very very bad).  So the LSA maxage timeout would fire after a random interval between 45 and 60 minutes.  Of course, if you were refreshing at 30 minute intervals, you’d never notice, but at 50 you get 1/3 of your LSAs being purged by said broken router.  I initially refused to change it, but then an engineer at a Very Large Router Company made exactly the same mistake.  Sigh.
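
The arithmetic in this anecdote can be checked with a short sketch. It assumes the broken router picks its MaxAge timeout uniformly between 45 and 60 minutes (the message only says "random"), and it ignores flooding delay; the function names are illustrative only.

```python
from fractions import Fraction

MAX_AGE_MIN = 60            # OSPF MaxAge, one hour
DEFAULT_REFRESH_MIN = 30    # the usual refresh interval
STRETCHED_REFRESH_MIN = 50  # the interval described in the message


def refresh_load_reduction(old_interval, new_interval):
    """Fractional drop in refreshes per unit time when the interval grows."""
    return 1 - Fraction(old_interval, new_interval)


def fraction_purged(refresh_interval, max_age=MAX_AGE_MIN, jitter=Fraction(1, 4)):
    """Share of LSAs whose jittered timeout fires before the next refresh.

    Assumption: the timeout is uniform in [max_age * (1 - jitter), max_age].
    """
    low = max_age * (1 - jitter)
    if refresh_interval <= low:
        return Fraction(0)
    if refresh_interval >= max_age:
        return Fraction(1)
    return (Fraction(refresh_interval) - low) / (max_age - low)


print(refresh_load_reduction(DEFAULT_REFRESH_MIN, STRETCHED_REFRESH_MIN))  # 2/5
print(fraction_purged(DEFAULT_REFRESH_MIN))    # 0
print(fraction_purged(STRETCHED_REFRESH_MIN))  # 1/3
```

Under that uniform assumption, the 40% load reduction and the 1/3 purge rate both follow directly from the numbers given.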

The same Very Large Router Company had a bug in their first implementation of OSPF back in about 1991 which was a subtlety involving handling the receipt of MaxAge LSAs in some circumstances where they were getting acked instead of dropped (IIRC;  it’s been 25 years);  this had the curious effect of causing MaxAge LSAs to slosh around the network at random intervals and light up the FDDI ring in the Stanford machine room.

But this has been the sum total of the kind of insidious bugs I’ve run across that affect network stability, and neither of them could have been helped by making a protocol change.  The vast majority of stability problems have to do with the dynamics of the implementation rather than adherence to the spec (say, melting down and dropping adjacencies when somebody redistributes BGP into OSPF, as used to be an annual occurrence somewhere).

As you point out, these kinds of bugs generally don’t make it out of the lab, unless you’re really unlucky.  As such, there’s little return for the cost of changing the protocol.

—Dave


On Aug 24, 2016, at 12:04 PM, Acee Lindem (acee) <acee@cisco.com> wrote:

Speaking as WG member:

Hi Zhenqiang,

I don’t doubt that this was a very disquieting experience. However, I still don’t think we should attempt to change the protocol to compensate for routers that do not adhere to the protocol. To make an analogy, in my years of OSPF experience I’ve been subject to a number of bugs related to OSPF’s usage of local wire multicast (some triggered by obscure conditions such as routing and bridging on the same port). However, I’ve never proposed to not use local wire multicast. Also, after 25 years of OSPFv2, it doesn’t make sense to try and change the protocol to avoid bugs in this area. As for identifying the nefarious router, I think adding a counter and possibly a separate notification to the YANG model might be warranted since purging a non-self-originated LSA should not be a common occurrence in most networks.

Thanks,
Acee
P.S. Since this is an OSPF standards list, I’ve purposely avoided the questions as to how this catastrophic bug made it into a production network.


From: "lizhenqiang@chinamobile.com" <lizhenqiang@chinamobile.com>
Date: Wednesday, August 24, 2016 at 2:11 PM
To: Jie Dong <jie.dong@huawei.com>, Acee Lindem <acee@cisco.com>, "Les Ginsberg (ginsberg)" <ginsberg@cisco.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxudong@huawei.com>
Subject: Re: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hello Jie, Acee and Les,

I am a coauthor of this draft, from the operator China Mobile. Thank you all for your discussion and suggestions in the previous mails. As you all discussed, a misbehaving OSPF router (due to a software or hardware problem) can cause severe problems in the whole OSPF domain.

Here I want to point out that OSPF route flapping DID occur in my field network, caused by a misbehaving OSPF router installed there. The procedure to analyze and look for the cause was very complicated because we did not know the source of the flushing. After two hours, we still could not identify the real cause and restore our network. The CPU utilization of the OSPF routers was high, the network traffic decreased significantly, and lots of tunnel-down warnings were raised. When we tried shutting down one OSPF router, the route flapping stopped. This router was a newly deployed one. Through communication with our vendor, they admitted that this product had some defects in dealing with the OSPF protocol. These kinds of defects are difficult for us to test for when products apply for entrance into our network. Once defective products are deployed in the field network, locating the problem is very hard and time consuming.

So, I think it is necessary for us to solve the problem and improve the robustness of the protocol. At least it should provide the means to help us locate the OSPF route flapping problem.

________________________________
lizhenqiang@chinamobile.com

From: Dongjie (Jimmy) <jie.dong@huawei.com>
Date: 2016-08-18 17:09
To: Acee Lindem (acee) <acee@cisco.com>; Les Ginsberg (ginsberg) <ginsberg@cisco.com>; ospf@ietf.org
CC: Zhangxudong (zhangxudong, VRP) <zhangxudong@huawei.com>; lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement
Hi Acee,

Please see my replies inline:

From: Acee Lindem (acee) [mailto:acee@cisco.com]
Sent: Thursday, August 18, 2016 2:23 AM
To: Dongjie (Jimmy); Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Speaking as a WG member who has some experience with OSPF implementations:

Hi Jie,

Along with Les, I’m also against progressing this draft.

From: Jie Dong <jie.dong@huawei.com>
Date: Tuesday, August 16, 2016 at 9:56 AM
To: Acee Lindem <acee@cisco.com>, "Les Ginsberg (ginsberg)" <ginsberg@cisco.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxudong@huawei.com>, "lizhenqiang@chinamobile.com" <lizhenqiang@chinamobile.com>
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi Acee,

Thanks a lot for your feedback.

For packet corruption which impacts the LS age before the LSAs are packed into an LSU packet, I agree it is less likely to happen than the other cases. However, I think we agree that OSPF authentication only protects against packet-level corruption, and cannot help to detect corruption at the LSA level.

So, you are suggesting that LSAs are corrupted in the database in such a way that the LSA Age is set exactly to 0xE10? How would the implementation know that this had happened and prematurely age the packet? Database aging just doesn’t work this way (unless the implementation is particularly naïve).

[Jie] Actually the case is that when the LSA is about to be exchanged with a neighbor, the LS age is corrupted during message packing to either MaxAge or a large number close to MaxAge. The sending router does not intend to do a MaxAge flush; however, the neighbor routers that receive the message would treat it as a flush. This is a possible case, although less likely to happen than the other cases.


In my understanding, robustness is an important feature of network protocols, which includes robustness to errors and failures that happen in the network. If there is a bug in a particular router in the network, the operator would not accept the whole network being impacted, which means the other routers in the network need to work properly in this situation. For example, in BGP the error handling mechanism has been optimized to avoid unnecessary session teardown.

So you agree your problem statement is confined to a software bug resulting in LSAs being aged too quickly? I think this is the third time I’ve raised this question.

[Jie] As I said before, the problems that happened in the production network were caused by a software bug in LSA aging, so I think this is the major case.

If it has such a problem (whether it be due to a system timer bug or some more specific aging problem), it seems the router would also be refreshing its LSAs all too frequently (at least at twice the rate) and it would be readily identifiable. For a system time problem, the router would likely have many other problems. For example, it would not maintain OSPF adjacencies if the dead timer advances fast enough. It would retransmit at a very fast rate as well. Are you going to write problem statements and suggest solutions for these situations as well?

[Jie] This depends on the implementation. The software bug may only impact the aging of LSAs received from other routers. And frequent LSA refreshing may be caused by other things such as link oscillation. For a system timer problem, the OSPF adjacency may oscillate, but if the management connection is impacted, such oscillation is difficult to identify.

What about other bugs? What if the router erroneously specifies a neighbor’s router-id as its own in a Router-LSA? Is this a problem the protocol should handle?

[Jie] It depends on the significance to the network; case-by-case analysis may be needed.


I agree that an OSPF YANG notification for LSA timeout is a nice thing to have and could be useful to identify the misbehaving router. My concern is that sometimes the network may be so severely impacted that netconf/restconf connectivity is also affected. To avoid this, some mechanism to mitigate the impact of this problem could help.

I believe a router having such an impact would be easy to identify…

[Jie] According to the feedback from on-site engineers, when IGP routing oscillates so severely that the management connection becomes unavailable, troubleshooting usually takes much longer, as logging in to any router cannot be done via the management network. So maybe it would be better to have some automatic mechanism to reduce the impact before it becomes a big problem to troubleshoot.

Best regards,
Jie

Thanks,
Acee


Best regards,
Jie

From: Acee Lindem (acee) [mailto:acee@cisco.com]
Sent: Saturday, August 13, 2016 3:27 AM
To: Les Ginsberg (ginsberg); Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Speaking as a WG member:

Hi Jie,

I believe we agree that the problem is confined to OSPF bugs, system timer bugs, and packet corruption. I’d assert that corruption can be detected via OSPF authentication. In fact, there is a well-known anecdote where IS-IS authentication was enabled solely for the purpose of filtering corrupted protocol packets in an environment with line cards that were prone to such corruption. Hence, we are left with problems based on OSPF or system timer bugs. If there were a system timer bug, I’d doubt that a networking device with such a bug would be functional to the point of being able to establish and maintain OSPF adjacencies. Do we really want to enhance the protocol to deal with bugs?

I’ve thought about this and one potential action I could envision would be to add a separate OSPF YANG notification where an LSA times out and a router other than the originator purges it. This way, the misbehaving OSPF router could be readily identified.
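
The notification idea can be sketched as a local hook that fires when a router's own aging logic brings an LSA to MaxAge and the LSA was originated by someone else. This is a sketch only: `PurgeMonitor`, `LsaKey`, and `on_local_maxage` are hypothetical names, and nothing here reflects the actual OSPF YANG model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LsaKey:
    ls_type: int
    link_state_id: str
    advertising_router: str


@dataclass
class PurgeMonitor:
    """Counts purges of LSAs this router did not originate (hypothetical)."""

    router_id: str
    foreign_purges: Counter = field(default_factory=Counter)
    notifications: list = field(default_factory=list)

    def on_local_maxage(self, key: LsaKey) -> bool:
        """Call when local aging brings `key` to MaxAge.

        Returns True if a notification was recorded.
        """
        if key.advertising_router == self.router_id:
            return False  # flushing our own LSA is normal behavior
        self.foreign_purges[key.advertising_router] += 1
        # Stand-in for a management notification naming the purging router.
        self.notifications.append((self.router_id, key))
        return True


monitor = PurgeMonitor(router_id="10.0.0.1")
monitor.on_local_maxage(LsaKey(1, "10.0.0.1", "10.0.0.1"))  # own LSA, ignored
monitor.on_local_maxage(LsaKey(1, "10.0.0.9", "10.0.0.9"))  # foreign LSA, counted
print(monitor.foreign_purges)
```

Because the event is raised by the router doing the purging, it only helps if that router's management plane is still reachable, which is the concern raised in the reply.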

Thanks,
Acee


From: OSPF <ospf-bounces@ietf.org> on behalf of "Les Ginsberg (ginsberg)" <ginsberg@cisco.com>
Date: Thursday, August 11, 2016 at 1:29 PM
To: Jie Dong <jie.dong@huawei.com>, OSPF WG List <ospf@ietf.org>
Cc: "Zhangxudong (zhangxudong, VRP)" <zhangxudong@huawei.com>, "lizhenqiang@chinamobile.com" <lizhenqiang@chinamobile.com>
Subject: Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Jie –

Having the discussion has certainly been a good thing, but if the consensus of the WG is that there is no protocol change required then there is no need for any draft – which is my current position.

The other point is that you seem to be confusing the IS-IS Purge origination TLV (RFC 6232) with detecting invalid purges/remaining lifetime corruption. This is not the case. RFC 6232 simply allows us to detect which router originated a purge – it is not able to detect whether a purge is valid/invalid – and was not motivated by concerns about remaining lifetime corruption.

   Les


From: Dongjie (Jimmy) [mailto:jie.dong@huawei.com]
Sent: Wednesday, August 10, 2016 9:24 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi Les,

The current draft is a problem statement, so IMO what the WG needs to consider is whether this is a vulnerability of the OSPF protocol, and whether it can have a negative impact on the network. If the problem is acknowledged, IMO it is worth documenting.

The “ROI” you mentioned is for the evaluation of the proposed solutions. I totally agree that for the timer bug case, recognizing and ignoring the received abnormal MaxAge LSAs cannot stop the misbehaving router from generating further MaxAge LSAs, as it is a systematic problem, which can only be fixed after the operator identifies that router. This is also similar to the systematic corruption of the IS-IS remaining lifetime. And this is why this draft mentions two kinds of potential solutions: the mitigation mechanism can avoid the network being severely impacted by the problem, while for systematic problems, problem localization is needed to identify the misbehaving router and then solve the problem.

Best regards,
Jie

From: OSPF [mailto:ospf-bounces@ietf.org] On Behalf Of Les Ginsberg (ginsberg)
Sent: Monday, August 08, 2016 2:14 AM
To: Dongjie (Jimmy) <jie.dong@huawei.com>; ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP) <zhangxudong@huawei.com>; lizhenqiang@chinamobile.com
Subject: Re: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Jie –

Thinking about the following some more:

<snip>
What remains is the possibility that an implementation has some bug and unintentionally modifies the age to something other than what it should be due to the actual elapsed time since LSA generation. I suppose a mechanism equivalent to what the IS-IS draft defined i.e. setting the age to “new” (0 in OSPF case) when first receiving a non-self-generated LSA could be useful to prevent negative impacts of such an implementation bug. Is this what you intend?

[Jie]: More specifically, the problem could be caused by either “setting the LS age field incorrectly due to implementation bug” or “system timer runs so fast that the LS age reaches MaxAge much earlier than other routers”. Another less likely case is that the LS age field is corrupted before the LSA is assembled into OSPF packet.
<end snip>

The benefits are extremely limited. If a router prematurely ages an LSA due to a timer bug, ignoring the received LSA age on reception isn’t going to prevent premature purging by the router which has the bug. So the effect of ignoring the received LSA age prior to reaching MAXAGE will be short lived. You are then left with the possibility that an implementation corrupts the LSA age BEFORE calculating checksum/crypto authentication – but its local timeout logic is unaffected. This has very limited value. Whether the WG considers this worth pursuing is something you need to ask. For myself, I don’t see much ROI here.

  Les



From: Dongjie (Jimmy) [mailto:jie.dong@huawei.com]
Sent: Monday, August 01, 2016 9:43 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi Les,

Please see my replies with [Jie2]:

From: Les Ginsberg (ginsberg) [mailto:ginsberg@cisco.com]
Sent: Monday, August 01, 2016 9:57 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Jie -

From: Dongjie (Jimmy) [mailto:jie.dong@huawei.com]
Sent: Monday, August 01, 2016 1:44 AM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi Les,

Please see inline with [Jie]:

From: Les Ginsberg (ginsberg) [mailto:ginsberg@cisco.com]
Sent: Monday, August 01, 2016 3:09 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Jie –

Fully agree that IS-IS and OSPF differ in this regard.

https://www.ietf.org/id/draft-ietf-isis-remaining-lifetime-01.txt addresses problems where corruption of the remaining lifetime occurs either during transmission/reception or due to some DoS attack. This isn’t a concern with OSPF (hope you agree).

[Jie]: Yes, for OSPF the corruption during packet transmission can be detected.

What remains is the possibility that an implementation has some bug and unintentionally modifies the age to something other than what it should be due to the actual elapsed time since LSA generation. I suppose a mechanism equivalent to what the IS-IS draft defined i.e. setting the age to “new” (0 in OSPF case) when first receiving a non-self-generated LSA could be useful to prevent negative impacts of such an implementation bug. Is this what you intend?
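
The mechanism floated here can be sketched as a single decision on reception. This is a hedged illustration of the idea under discussion, not behavior defined by RFC 2328 (which keeps the received age and adds the transmit delay); `age_to_install` is a hypothetical name.

```python
MAX_AGE = 3600  # seconds


def age_to_install(received_age: int, self_originated: bool) -> int:
    """LS age to store for a newly received LSA under the floated idea.

    Assumption: only non-self-originated LSAs below MaxAge have their age
    reset; a received MaxAge LSA is still honored as a flush.
    """
    if self_originated:
        return received_age
    if received_age >= MAX_AGE:
        return MAX_AGE
    return 0


print(age_to_install(3599, self_originated=False))  # 0
print(age_to_install(3600, self_originated=False))  # 3600
```

The second call shows the limit of the approach: an age that is wrong but below MaxAge is neutralized, while an erroneous MaxAge is indistinguishable from a legitimate purge and still takes effect.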

[Jie]: More specifically, the problem could be caused by either “setting the LS age field incorrectly due to implementation bug” or “system timer runs so fast that the LS age reaches MaxAge much earlier than other routers”. Another less likely case is that the LS age field is corrupted before the LSA is assembled into OSPF packet.

[Jie]: Regarding the solutions space, IMO we need to consider both cases: “LS age reaches MaxAge” and “LS age close to MaxAge”. For IS-IS, RFC 6232 and RFC 6233 provide solutions for the detection and identification of corrupted IS-IS purge, while OSPF does not have similar mechanisms.

[Les:] It is incorrect to say that RFC 6232 makes it possible to detect a corrupt purge. What it does do is to provide an indication as to which IS initiated a purge. I don’t know how OSPF would address this issue, but for OSPFv2 at least any solution would likely not be backwards compatible. For this reason I suggest that you not try to address this issue in the same draft.

[Jie2]: Agreed, RFC 6232 provides the mechanism to track the misbehaving routers so that the operator can fix the problem; the detection can be based on the rules in RFC 6233 or some other anomalies. Indeed, for OSPFv2 legacy LSAs it is difficult to introduce a mechanism similar to RFC 6232, while it can be easier for the OSPFv2/v3 Extended LSAs. So it depends on how backward compatible the solution should be. I agree with you that the solution for Problem Localization in OSPF needs to be provided in a separate document.

Solutions to LS age corruption can be done in a backwards compatible way, but they MUST NOT result in discarding purges which pass authentication; doing so places you at risk of having inconsistent LSDBs in the network.

[Jie2]: Exactly. The received MaxAge LSAs cannot simply be discarded; the decision must be made carefully, probably based on some additional information. The authors have discussed some possible solutions internally, and will prepare some material for further open discussion.

As written, the draft makes claims that are at least misleading – and I believe actually incorrect. In Section 6 you say:

“The LS age field may be altered as a result of
   packet corruption, such modification cannot be detected by LSA
   checksum nor OSPF packet cryptographic authentication.”

This isn’t correct.

[Jie] Thanks for pointing this out. This sentence needs to be revised to mention “LSA corruption” rather than “packet corruption”.

What would be helpful – at least to me – is to move from a generic problem statement to the specific problem you want to solve and the proposed solution. This also requires you to more clearly state the cases where there is an actual vulnerability. It would be a lot easier to support the draft if this were done.

[Jie] Thanks for your suggestion. Yes we can update this draft with more specific problem statements as I mentioned above.

[Jie] As for the proposed solutions, the current draft specifies the requirements on the potential solutions, from which we envision that different solutions may be needed for “Impact Mitigation” and “Problem Localization”. The solution for “Impact Mitigation” may be the easier one, and we can start discussing potential solutions for it now, while the solution for “Problem Localization” may need more consideration.

[Les:] A discussion of the requirements is useful and necessary, but IMO until you propose a solution there isn’t enough substance for the document to become a WG document.

[Jie2] Yes, the current draft focuses on the problem statement and the requirements; the goal is to first get the MaxAge flush problem acknowledged and reach consensus on the requirements. Then the plan is to specify the solutions in separate documents. Your valuable suggestions will be considered, and further contributions are welcome.

Best regards,
Jie

    Les

Best regards,
Jie

   Les


From: Dongjie (Jimmy) [mailto:jie.dong@huawei.com]
Sent: Sunday, July 31, 2016 11:48 PM
To: Les Ginsberg (ginsberg); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi Les,

Thanks for your comments.

OSPF packet-level checksum and authentication can only protect the assembled LSU packet for one hop on the wire; they cannot detect any change to an LSA made by the routers themselves. This is because OSPF packets are re-assembled at each hop, which is slightly different from IS-IS. So the problem for OSPF is mainly due to problems inside the router, for example protocol implementations, system timers, or some hardware problem. Actually this problem has been seen in several production networks.

We can improve the description in the draft to make this clear.

Best regards,
Jie

From: Les Ginsberg (ginsberg) [mailto:ginsberg@cisco.com]
Sent: Monday, August 01, 2016 1:30 PM
To: Dongjie (Jimmy); ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: RE: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Jie –

The draft says (Section 2):

“Since cryptographic authentication is executed at the OSPF packet
   level, it can only protect the assembled LSU packet for one hop and
   does not provide any additional protection for the corruption of LS
   age field.”

But as authentication is calculated at the OSPF packet level, any change to the LS age field for an individual LSA contained within the OSPF packet (e.g. by some packet corruption in transmission) would cause authentication to fail when the packet is received. So the statement you make is not correct. I therefore am struggling to understand what problem you believe is not addressed by existing authentication techniques.
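
The distinction can be shown with a minimal sketch. It uses a plain HMAC-SHA-256 over the raw bytes as a stand-in for OSPF cryptographic authentication (the real trailer also carries a key ID and sequence number), and the helper names and the toy LSA bytes are assumptions made for illustration.

```python
import hashlib
import hmac
import struct

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def sign(packet: bytes, key: bytes) -> bytes:
    """Append a MAC computed over the whole packet, LS age included."""
    return packet + hmac.new(key, packet, hashlib.sha256).digest()


def verify(signed: bytes, key: bytes) -> bool:
    """Recompute the MAC on reception and compare in constant time."""
    packet, digest = signed[:-DIGEST_LEN], signed[-DIGEST_LEN:]
    expected = hmac.new(key, packet, hashlib.sha256).digest()
    return hmac.compare_digest(expected, digest)


def set_ls_age(data: bytes, age: int) -> bytes:
    """Overwrite the leading 16-bit LS age field of a toy LSA."""
    return struct.pack("!H", age) + data[2:]


key = b"shared-secret"
lsa = struct.pack("!H", 120) + b"rest-of-lsa"

sent = sign(lsa, key)
print(verify(sent, key))                    # True: untouched packet
print(verify(set_ls_age(sent, 3600), key))  # False: age altered on the wire
print(verify(sign(set_ls_age(lsa, 3600), key), key))  # True: altered before signing
```

The last line is the only case authentication cannot catch: the sender itself wrote the bad age before computing the MAC, which is an implementation bug rather than packet corruption.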

   Les



From: OSPF [mailto:ospf-bounces@ietf.org] On Behalf Of Dongjie (Jimmy)
Sent: Sunday, July 31, 2016 8:15 PM
To: ospf@ietf.org
Cc: Zhangxudong (zhangxudong, VRP); lizhenqiang@chinamobile.com
Subject: [OSPF] Solicit feedbacks on draft-dong-ospf-maxage-flush-problem-statement

Hi all,

draft-dong-ospf-maxage-flush-problem-statement describes the problems caused by the corruption of the LS Age field, and summarizes the requirements on potential solutions. This draft received good comments during the presentation at the IETF meeting in B.A.

The authors would like to solicit further feedback from the mailing list, on both the problem statement and the solution requirements. Based on the feedback, we will update the problem statement draft, and work together to build suitable solutions.

The URL of the draft is:
https://tools.ietf.org/html/draft-dong-ospf-maxage-flush-problem-statement-00

Comments & feedback are welcome.

Best regards,
Jie

_______________________________________________
OSPF mailing list
OSPF@ietf.org
https://www.ietf.org/mailman/listinfo/ospf