Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt
"Chris Hall" <chris.hall@highwayman.com> Mon, 10 December 2012 16:18 UTC
From: Chris Hall <chris.hall@highwayman.com>
To: idr@ietf.org
References: <20121121191321.6164.6887.idtracker@ietfa.amsl.com> <50AD2986.90705@cisco.com> <058b01cdd3b4$9f5193b0$ddf4bb10$@highwayman.com> <8ED5B0B0F5B4854A912480C1521F973A0F4940@xmb-rcd-x13.cisco.com> <94913EE5-2864-4EE2-B474-9631430B1E22@ericsson.com> <068701cdd478$2cf01cf0$86d056d0$@highwayman.com> <CAEGVVtBy-zdLz8hVajLnuAqgzfgQHrseK4r-N9=pOZGtqV7LbA@mail.gmail.com> <CAH1iCipfup-GEeJduBti_KHvX1pUZfmZLA3Zz5Y9Aw9xV3fQ9w@mail.gmail.com> <07e901cdd667$31c593e0$9550bba0$@highwayman.com> <CAPWAtbJ4WqoyrzE87v-7hJpp_=fL=B-LevdSe9Q-_m8FLYdFZw@mail.gmail.com>
In-Reply-To: <CAPWAtbJ4WqoyrzE87v-7hJpp_=fL=B-LevdSe9Q-_m8FLYdFZw@mail.gmail.com>
Date: Mon, 10 Dec 2012 16:18:39 -0000
Organization: Highwayman
Message-ID: <007801cdd6f2$069cc720$13d65560$@highwayman.com>
Jeff Wheeler wrote (on Mon 10-Dec-2012 at 02:26 +0000):
> On Sun, Dec 9, 2012 at 6:44 PM, Chris Hall
> <chris.hall@highwayman.com> wrote:
> > end. However, forming these attributes is very simple, well
> > exercised, and easily tested code. It seems to me that this
> > sort of anomaly is more likely to be symptomatic of a framing
> > issue than it is to be a well-formed attribute which the
> > sender has, for reasons unknown, decided to send with
> > unexpected Flags and/or Length.

> Here is an example from two weeks ago of some routes injected to the
> DFZ with malformed attribute flags.

Evidence :-)

> The result was that everyone
> running OpenBGPd more than a few months old, and some networks with
> Alcatel routers, and who knows what else, had their BGP sessions
> resetting endlessly.

> http://mailman.nanog.org/pipermail/nanog/2012-November/053754.html

It would seem that some BGP implementation(s) managed to send an attribute with a Flags octet with a bit set in the LS part (which by RFC it MUST NOT do) and some implementation failed to ignore that (which by RFC it MUST do). [I guess the originator is using the LS bits for its own nefarious purposes.]

So, we have two bugs, in a trivial operation which is common to all attribute handling. <sigh>

The evidence is clear: even trivially silly bugs happen. What's more, such bugs can happen in code which is intended to improve robustness. So... the evidence suggests that adding more code, with the intent of improving robustness, also adds opportunities for more bugs.

I wonder whether the requirement to ignore the meaningless bits has improved or reduced robustness. By tolerating meaningless rubbish, the system manages to sweep under the carpet a failure at the sender end. If the rubbish was not tolerated, then one of these bugs would never have made it out of the lab -- unless, of course, there was a bug in the receiving code used during testing !
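For illustration only, the Flags octet handling discussed above might be sketched as follows. The bit layout is the one in RFC 4271 (the four low-order bits are unused: zero when sent, ignored when received). The helper names `split_flags` and `check_flags`, and the `report` callback, are hypothetical and not taken from any implementation mentioned in this thread.

```python
# Bit masks for the Attribute Flags octet, as laid out in RFC 4271.
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10
UNUSED_MASK = 0x0F  # low-order four bits: zero when sent, ignored when received


def split_flags(flags_octet):
    """Return (meaningful_bits, stray_bits) for one Flags octet.

    Hypothetical helper: stray_bits is non-zero when the sender has set
    any of the unused low-order bits.
    """
    if not 0 <= flags_octet <= 0xFF:
        raise ValueError("Flags must fit in one octet")
    return flags_octet & 0xF0, flags_octet & UNUSED_MASK


def check_flags(flags_octet, report):
    """Ignore stray bits, as the RFC requires, but do not do so silently.

    `report` is an assumed callback (a logger, say) taking one string.
    """
    meaningful, stray = split_flags(flags_octet)
    if stray:
        report(f"Flags 0x{flags_octet:02x}: unused bits 0x{stray:02x} set by sender")
    return meaningful
```

The receiver still tolerates the stray bits, so the session survives, but the sender's fault is made visible to whoever reads the report rather than being swept under the carpet.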
But the real lesson here is not so much that fault tolerance fails to reveal faults, but that *silent* fault tolerance does.

The draft goes further, and requires that some bits which *do* have meaning should be ignored if their "true" value can be deduced from other parts of the attribute. In the light of the above, this doesn't give me a warm feeling. What's worse, the extra code required by the draft is in exception paths, which 99.9...9% of the time are not exercised. The bugs in your example are on the main path for goodness sake !

...

> BGP is mission-critical for everyone. If it stops working, you
> start losing money, instantly. The BGP protocol is the single
> point of failure that we all live with. Increasing its
> robustness is highly important.

Amen to that.

The problem with any discussion of how to handle errors in attributes is that it tends to start with the unspoken assumption that the attributes in question have been correctly identified -- which means that the discussion starts with a false or at least doubtful premise.

For example: we all know that a LOCAL_PREF attribute is neither use nor ornament when received from an eBGP peer. So, we honestly don't care what the attribute says, or whether its length is correct, we can just throw it on the floor and get on with the business of keeping the network running. Further, LOCAL_PREF is a well-known attribute, so we know it's not Optional and it is Transitive, so it seems daft to worry about the state of those bits (also the Partial bit).

BUT: to arrive at the octets which appear to be a LOCAL_PREF attribute, unless it is the first attribute, we have stepped over one or more earlier attributes, on the basis that each one's length is correct. So... what we seem to feel happy treating as a malformed LOCAL_PREF, may actually be some part of some other attribute(s), because some earlier attribute length is broken and we either did not realise that, or we chose to ignore the problem (in the interests of robustness !).

There is no complete way to resolve this for all attributes.

For "treat-as-withdraw", however, the key thing is to be able to identify just the NLRI. If any NLRI attributes are guaranteed by the sender to be the first attributes, then the problem is finessed -- the receiver doesn't need to care which attribute is malformed or how, it can just "treat-as-withdraw" and move on, leaving the session running and (presumably) the operational layer running round trying to resolve the root cause.

Otherwise, we must consider ways to achieve an acceptable compromise between (a) accurately identifying every attribute the sender sent, and (b) the risk of proceeding with an incomplete set of attributes, some of which are malformed in some way. Noting that the goal is to at least identify the NLRI attributes or be comfortable assuming that one or both of MP_REACH_NLRI and MP_UNREACH_NLRI are not visible because they are not there (and not because they are buried under a heap of broken attribute(s)).

I'm sorry to keep harping on about this balls-aching detail, when the real aim is to keep the network running... But, to resolve the broken attributes problem we have to decide which part of each attribute to trust, given that all parts may be broken.

As discussed elsewhere, if the sum of the (apparent) attribute lengths is correct, then prima facie we can identify all the attributes. But, suppose we then find (say) something which appears to be LOCAL_PREF, but whose flags or length are incorrect, or which should not be there in the first place. Now we must decide, on the balance of probabilities, whether this is really a broken LOCAL_PREF or actually a symptom of a more serious problem, namely that the length(s) of some earlier attribute(s) are, in fact, broken, and we have failed to correctly identify all attributes. If we have failed to identify all attributes, we may be failing to find all the NLRI. If we fail to find all the NLRI we may be in more or less trouble network-state-wise. It's a bleedin' nightmare.

Anyway... bad cases make bad law, as they say. So, the incident you reference certainly says that there is no such thing as a bug which is too trivial to make it out into the wild. However, when faced with deciding whether some anomaly is the symptom of a trivial bug, or the symptom of something more serious, then (a) one might give the sending software the benefit of the doubt, and assume it's not a trivial bug, particularly as (b) assuming a trivial bug may well be the more dangerous option.

> It should be done with the goal of allowing operators,
> without much knowledge, to potentially work around problems
> until they can actually be solved. This should be true
> whether malformed updates are the result of wrong flags,
> length/type errors, or ascii-art unicorns dancing in RPKI
> signatures.

Well, perhaps what you really want is not some improved way for BGP to automagically work around these issues, but for:

  * significantly better diagnostic information, so that the operator can properly assess a given problem,

  * knobs, switches and dials to patch up particular broken UPDATE messages, pro tem.

From my (software) perspective, that's a more interesting challenge. Certainly more interesting than trying to solve the intractable problem of parsing the unparsable to some acceptable extent, TBD. And possibly more robust than layering more edge-cases onto the attribute handling code !
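As a rough sketch of the framing problem and of the diagnostic approach argued for above, the walk over the path attributes might look like this. It assumes the RFC 4271 attribute layout (Flags, Type Code, one- or two-octet Length, then value) and only reports what it sees, with offsets; it makes no attempt to repair anything. The names `Attr`, `walk_attributes` and `local_pref_anomalies` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

EXTENDED_LENGTH = 0x10  # Flags bit: Length field is two octets
LOCAL_PREF = 5          # attribute type code (RFC 4271)


class Attr(NamedTuple):
    offset: int      # where this attribute's header starts in the block
    flags: int
    type_code: int
    value: bytes


def walk_attributes(buf: bytes) -> Tuple[List[Attr], List[str]]:
    """Step over the attributes, trusting each Length in turn.

    Returns the attributes found plus diagnostics. A clean walk is only
    prima facie evidence that the framing is right.
    """
    attrs: List[Attr] = []
    problems: List[str] = []
    pos = 0
    while pos < len(buf):
        start = pos
        if len(buf) - pos < 3:
            problems.append(f"offset {start}: truncated attribute header")
            break
        flags, type_code = buf[pos], buf[pos + 1]
        pos += 2
        if flags & EXTENDED_LENGTH:
            if len(buf) - pos < 2:
                problems.append(f"offset {start}: truncated extended length")
                break
            length = int.from_bytes(buf[pos:pos + 2], "big")
            pos += 2
        else:
            length = buf[pos]
            pos += 1
        if pos + length > len(buf):
            problems.append(
                f"offset {start}: type {type_code} length {length} "
                "overruns the attribute block"
            )
            break
        attrs.append(Attr(start, flags, type_code, buf[pos:pos + length]))
        pos += length
    return attrs, problems


def local_pref_anomalies(attrs: List[Attr]) -> List[str]:
    """Describe anything odd about an apparent LOCAL_PREF; decide nothing."""
    notes: List[str] = []
    for a in attrs:
        if a.type_code != LOCAL_PREF:
            continue
        # Well-known attribute: Optional clear, Transitive set, Partial clear.
        if (a.flags & 0xE0) != 0x40:
            notes.append(f"offset {a.offset}: LOCAL_PREF flags 0x{a.flags:02x}")
        if len(a.value) != 4:
            notes.append(f"offset {a.offset}: LOCAL_PREF length {len(a.value)}")
    return notes
```

For instance, with an ORIGIN whose Length says 2 instead of 1 followed by a perfectly good LOCAL_PREF, the walk never sees a LOCAL_PREF at all: it finds a zero-length type 4 with odd flags, then runs off the end at offset 8. Which of those findings to trust is exactly the judgement the receiver is left to make.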
Mind you, I suspect that operational remedies will also depend on the accurate identification of all affected NLRI, which is the horse I rode in on :-(

Chris