Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

"Chris Hall" <chris.hall@highwayman.com> Mon, 10 December 2012 16:18 UTC

From: Chris Hall <chris.hall@highwayman.com>
To: idr@ietf.org

Jeff Wheeler wrote (on Mon 10-Dec-2012 at 02:26 +0000):
> On Sun, Dec 9, 2012 at 6:44 PM, Chris Hall
> <chris.hall@highwayman.com> wrote:
> > end.  However, forming these attributes is very simple, well
> > exercised, and easily tested code.  It seems to me that this 
> > sort of anomaly is more likely to be symptomatic of a framing
> > issue than it is to be a well-formed attribute which the
> > sender has, for reasons unknown, decided to send with
> > unexpected Flags and/or Length.

> Here is an example from two weeks ago of some routes injected to the
> DFZ with malformed attribute flags.

Evidence :-) 

> The result was that everyone
> running OpenBGPd more than a few months old, and some networks with
> Alcatel routers, and who knows what else, had their BGP sessions
> resetting endlessly.
> http://mailman.nanog.org/pipermail/nanog/2012-November/053754.html

It would seem that some BGP implementation(s) managed to send an
attribute with a Flags octet with a bit set in the LS part (which by
RFC it MUST NOT do) and some implementation failed to ignore that
(which by RFC it MUST do).  [I guess the originator is using the LS
bits for its own nefarious purposes.]

So, we have two bugs, in a trivial operation which is common to all
attribute handling.  <sigh>

The evidence is clear: even trivially silly bugs happen.

What's more, such bugs can happen in code which is intended to improve
robustness.

So... the evidence suggests that adding more code, with the intent of
improving robustness, also adds opportunities for more bugs.

I wonder whether the requirement to ignore the meaningless bits has
improved or reduced robustness.  By tolerating meaningless rubbish,
the system manages to sweep under the carpet a failure at the sender
end.  If the rubbish was not tolerated, then one of these bugs would
never have made it out of the lab -- unless, of course, there was a
bug in the receiving code used during testing !  But the real lesson
here is not so much that fault tolerance fails to reveal faults, but
that *silent* fault tolerance does.
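
To make the distinction concrete, here is a minimal sketch (Python, my
own illustration, not taken from any implementation) of tolerating the
stray bits without doing so silently.  The bit layout is that of the
Attribute Flags octet in RFC 4271; decode_flags is a hypothetical
helper name.

```python
# Attribute Flags octet as laid out in RFC 4271: the high-order four
# bits carry meaning, the low-order four are unused.
FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_PARTIAL = 0x20
FLAG_EXT_LENGTH = 0x10
FLAG_UNUSED = 0x0F  # must be zero when sent, must be ignored when received


def decode_flags(octet):
    """Return (meaningful_flags, unused_bits) for one Attribute Flags octet.

    decode_flags is a hypothetical helper.  The unused bits are masked
    out of the first result (so they are ignored, as required) but are
    returned separately so the caller can log them.
    """
    return octet & ~FLAG_UNUSED & 0xFF, octet & FLAG_UNUSED


# Transitive, plus one stray low-order bit.
flags, junk = decode_flags(0x41)
if junk:
    print("tolerated unused flag bits: 0x%02x" % junk)
```

The receiver still ignores the rubbish, but somebody gets told about
it.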

The draft goes further, and requires that some bits which *do* have
meaning should be ignored if their "true" value can be deduced from
other parts of the attribute.  In the light of the above, this doesn't
give me a warm feeling.  What's worse, the extra code required by the
draft is in exception paths, which 99.9...9% of the time are not
exercised.  The bugs in your example are on the main path for goodness
sake !

...
> BGP is mission-critical for everyone.  If it stops working, you
> start losing money, instantly.  The BGP protocol is the single
> point of failure that we all live with.  Increasing its
> robustness is highly important.

Amen to that.

The problem with any discussion of how to handle errors in attributes
is that it tends to start with the unspoken assumption that the
attributes in question have been correctly identified -- which means
that the discussion starts with a false or at least doubtful premise.

For example: we all know that a LOCAL_PREF attribute is neither use
nor ornament when received from an eBGP peer.  So, we honestly don't
care what the attribute says, or whether its length is correct, we can
just throw it on the floor and get on with the business of keeping the
network running.  Further, LOCAL_PREF is a well-known attribute, so we
know it's not Optional and it is Transitive, so it seems daft to worry
about the state of those bits (also the Partial bit).

BUT: to arrive at the octets which appear to be a LOCAL_PREF
attribute, unless it is the first attribute, we have stepped over one
or more earlier attributes, on the basis that each one's length is
correct.  So... what we seem to feel happy treating as a malformed
LOCAL_PREF, may actually be some part of some other attribute(s),
because some earlier attribute length is broken and we either did not
realise that, or we chose to ignore the problem (in the interests of
robustness !).
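
A minimal sketch of the effect, assuming nothing more than the RFC
4271 attribute layout (flags, type, one- or two-octet length, value).
The function name and the sample octets are mine, made up for
illustration; the AS numbers were chosen so that the octets inside the
AS_PATH happen to look like a LOCAL_PREF header.

```python
import struct

FLAG_EXT_LENGTH = 0x10


def walk_attributes(data):
    """Yield (type_code, flags, value), stepping over each attribute.

    Each step trusts the length of the attribute before it; nothing
    here can tell whether that trust was deserved.
    """
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated attribute header")
        flags, type_code = data[pos], data[pos + 1]
        pos += 2
        if flags & FLAG_EXT_LENGTH:
            if pos + 2 > len(data):
                raise ValueError("truncated extended length")
            (length,) = struct.unpack_from("!H", data, pos)
            pos += 2
        else:
            if pos + 1 > len(data):
                raise ValueError("truncated length")
            length = data[pos]
            pos += 1
        if pos + length > len(data):
            raise ValueError("attribute overruns the Path Attributes")
        yield type_code, flags, data[pos:pos + length]
        pos += length


# ORIGIN, AS_PATH (AS_SEQUENCE of 16389 and 2049), NEXT_HOP 192.0.2.1
good = bytes.fromhex("40010100" "400206020240050801" "400304c0000201")
# Same octets, except that the ORIGIN length is 6 instead of 1.
broken = good[:2] + b"\x06" + good[3:]

print([t for t, _, _ in walk_attributes(good)])    # [1, 2, 3]
print([t for t, _, _ in walk_attributes(broken)])  # [1, 5]
```

With one length octet wrong, the walk still ends exactly at the end of
the attributes, AS_PATH and NEXT_HOP have vanished, and an 8-octet
"LOCAL_PREF" (type 5) has appeared out of the middle of the AS_PATH.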

There is no complete way to resolve this for all attributes.  For
"treat-as-withdraw", however, the key thing is to be able to identify
just the NLRI.  If any NLRI attributes are guaranteed by the sender to
be the first attributes, then the problem is finessed -- the receiver
doesn't need to care which attribute is malformed or how, it can just
"treat-as-withdraw" and move on, leaving the session running and
(presumably) the operational layer running round trying to resolve the
root cause.
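
As a rough sketch of what that buys the receiver (hypothetical
function name, made-up sample octets, and assuming the sender really
does put MP_REACH_NLRI / MP_UNREACH_NLRI first):

```python
import struct

FLAG_EXT_LENGTH = 0x10
MP_REACH_NLRI, MP_UNREACH_NLRI = 14, 15
NLRI_TYPES = (MP_REACH_NLRI, MP_UNREACH_NLRI)


def leading_nlri_attributes(data):
    """Return [(type_code, value)] for NLRI attributes at the very front.

    Stops at the first attribute which is not MP_REACH_NLRI or
    MP_UNREACH_NLRI, or which does not fit; nothing after that point
    is examined at all.
    """
    found, pos = [], 0
    while pos + 3 <= len(data) and data[pos + 1] in NLRI_TYPES:
        flags, type_code = data[pos], data[pos + 1]
        if flags & FLAG_EXT_LENGTH:
            if pos + 4 > len(data):
                break
            (length,) = struct.unpack_from("!H", data, pos + 2)
            start = pos + 4
        else:
            length, start = data[pos + 2], pos + 3
        if start + length > len(data):
            break
        found.append((type_code, data[start:start + length]))
        pos = start + length
    return found


# MP_UNREACH_NLRI (AFI 2, SAFI 1, 2001:db8::/32) followed by rubbish.
update = bytes.fromhex("800f08" "0002012020010db8" "ffffff")
print(leading_nlri_attributes(update))
```

Whatever is wrong with the octets after the NLRI attribute, the
affected prefixes have already been found.  This is, of course, only
as trustworthy as the length of that first attribute itself.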

Otherwise, we must consider ways to achieve an acceptable compromise
between (a) accurately identifying every attribute the sender sent,
and (b) the risk of proceeding with an incomplete set of attributes,
some of which are malformed in some way.  Note that the goal is to
at least identify the NLRI attributes or be comfortable assuming that
one or both of MP_REACH_NLRI and MP_UNREACH_NLRI are not visible
because they are not there (and not because they are buried under a
heap of broken attribute(s)).

I'm sorry to keep harping on about this balls-aching detail, when the
real aim is to keep the network running... But, to resolve the broken
attributes problem we have to decide which part of each attribute to
trust, given that all parts may be broken.  As discussed elsewhere, if
the sum of the (apparent) attribute lengths is correct, then prima
facie we can identify all the attributes.  But, suppose we then find
(say) something which appears to be LOCAL_PREF, but whose flags or
length are incorrect, or which should not be there in the first place.
Now we must decide, on the balance of probabilities, whether this is
really a broken LOCAL_PREF or actually a symptom of a more serious
problem, namely that the length(s) of some earlier attribute(s) are,
in fact, broken, and we have failed to correctly identify all
attributes.  If we have failed to identify all attributes, we may be
failing to find all the NLRI.  If we fail to find all the NLRI we may
be in more or less trouble network-state-wise.  It's a bleedin'
nightmare.
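
To illustrate why "the lengths add up" is only prima facie evidence,
a small sketch.  The audit function and the EXPECTED table are
hypothetical, covering just three fixed-size well-known attributes
from RFC 4271, and the sample octets are invented: an ORIGIN, an
AS_PATH and a NEXT_HOP, then the same octets with the ORIGIN length
changed from 1 to 6.

```python
import struct

FLAG_EXT_LENGTH = 0x10
# type code -> (name, expected flags ignoring Extended Length, length)
EXPECTED = {
    1: ("ORIGIN", 0x40, 1),
    3: ("NEXT_HOP", 0x40, 4),
    5: ("LOCAL_PREF", 0x40, 4),
}


def audit(data):
    """Return (lengths_add_up, anomalies) for a block of Path Attributes."""
    anomalies, pos = [], 0
    while pos + 3 <= len(data):
        flags, type_code = data[pos], data[pos + 1]
        if flags & FLAG_EXT_LENGTH:
            if pos + 4 > len(data):
                break
            (length,) = struct.unpack_from("!H", data, pos + 2)
            pos += 4
        else:
            length = data[pos + 2]
            pos += 3
        if type_code in EXPECTED:
            name, want_flags, want_len = EXPECTED[type_code]
            if (flags & 0xE0) != want_flags:
                anomalies.append(name + ": unexpected flags")
            if length != want_len:
                anomalies.append(name + ": unexpected length")
        pos += length
    # True only if the walk ends exactly at the end of the block.
    return pos == len(data), anomalies


good = bytes.fromhex("40010100" "400206020240050801" "400304c0000201")
broken = good[:2] + b"\x06" + good[3:]

print(audit(good))    # (True, [])
print(audit(broken))
# (True, ['ORIGIN: unexpected length', 'LOCAL_PREF: unexpected length'])
```

In the broken case the sum of the lengths is still correct, and the
"LOCAL_PREF" being complained about was never sent: it is part of the
AS_PATH.  Treating it as a harmless malformed LOCAL_PREF would be
exactly the wrong conclusion.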

Anyway... bad cases make bad law, as they say.  So, the incident you
reference certainly says that there is no such thing as a bug which is
too trivial to make it out into the wild.  However, when faced with
deciding whether some anomaly is the symptom of a trivial bug, or the
symptom of something more serious, then (a) one might give the sending
software the benefit of the doubt, and assume it's not a trivial bug,
particularly as (b) assuming a trivial bug may well be the more
dangerous option.

> It should be done with the goal of allowing operators,
> without much knowledge, to potentially work around problems
> until they can actually be solved.  This should be true
> whether malformed updates are the result of wrong flags,
> length/type errors, or ascii-art unicorns dancing in RPKI
> signatures.

Well, perhaps what you really want is not some improved way for BGP to
automagically work around these issues, but:

  * significantly better diagnostic information, so that
    the operator can properly assess a given problem,

  * knobs, switches and dials to patch up particular
    broken UPDATE messages, pro tem.

From my (software) perspective, that's a more interesting challenge.
Certainly more interesting than trying to solve the intractable
problem of parsing the unparsable to some acceptable extent, TBD.  And
possibly more robust than layering more edge-cases onto the attribute
handling code !

Mind you, I suspect that operational remedies will also depend on the
accurate identification of all affected NLRI, which is the horse I
rode in on :-(

Chris