Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

Jakob Heitz <jakob.heitz@ericsson.com> Mon, 10 December 2012 16:31 UTC

From: Jakob Heitz <jakob.heitz@ericsson.com>
To: Chris Hall <chris.hall@highwayman.com>
Date: Mon, 10 Dec 2012 16:30:58 +0000
Message-ID: <828AAFF5-0260-4AA6-BBDC-6C1F69919837@ericsson.com>
References: <20121121191321.6164.6887.idtracker@ietfa.amsl.com> <50AD2986.90705@cisco.com> <058b01cdd3b4$9f5193b0$ddf4bb10$@highwayman.com> <8ED5B0B0F5B4854A912480C1521F973A0F4940@xmb-rcd-x13.cisco.com> <94913EE5-2864-4EE2-B474-9631430B1E22@ericsson.com> <068701cdd478$2cf01cf0$86d056d0$@highwayman.com> <CAEGVVtBy-zdLz8hVajLnuAqgzfgQHrseK4r-N9=pOZGtqV7LbA@mail.gmail.com>, <074d01cdd536$173f5830$45be0890$@highwayman.com> <9474D8DC-30FF-4C52-9504-15CBCC47E7D8@ericsson.com> <07df01cdd661$f28ef7c0$d7ace740$@highwayman.com> <36E98AE5-3EF8-4738-9982-42B9CA0BAAF5@rob.sh>, <005001cdd6da$099f1e90$1cdd5bb0$@highwayman.com>
In-Reply-To: <005001cdd6da$099f1e90$1cdd5bb0$@highwayman.com>
Cc: "idr@ietf.org" <idr@ietf.org>
Subject: Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

If, at the time a BGP speaker detects a malformation in a received UPDATE, it has completely parsed at least one of:
. Withdrawn routes section
. NLRI
. MP_REACH
. MP_UNREACH
then it assumes that no more of these follow.
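
A rough sketch of that rule, in Python. It is only an illustration, not code from any implementation: the Section type, the handle_update name and the three result strings are all made up here, and the UPDATE is assumed to be already split into decoded pieces in parse order. What to do when nothing carrying prefixes was parsed before the error is not covered by the rule, so the sketch just reports that case.

```python
from dataclasses import dataclass, field

# The four places in an UPDATE that carry prefixes, as listed above.
NLRI_SECTIONS = {"WITHDRAWN", "NLRI", "MP_REACH", "MP_UNREACH"}


@dataclass
class Section:
    """Hypothetical, already-decoded piece of an UPDATE, in parse order."""
    kind: str                                   # e.g. "MP_REACH", "ORIGIN"
    prefixes: list = field(default_factory=list)
    malformed: bool = False                     # True where parsing failed


def handle_update(sections):
    """Apply the rule: stop at the first malformation and act on what was parsed."""
    parsed = []          # prefixes from completely parsed sections
    parsed_any = False   # has at least one prefix-carrying section been parsed?
    for section in sections:
        if section.malformed:
            if parsed_any:
                # Assume no further prefix-carrying sections follow the error.
                return ("treat-as-withdraw", parsed)
            # Outside the rule above: no prefixes were identified.
            return ("no-nlri-identified", [])
        if section.kind in NLRI_SECTIONS:
            parsed_any = True
            parsed.extend(section.prefixes)
    return ("ok", parsed)
```

For example, an UPDATE decoded as MP_REACH, then a malformed ORIGIN, then MP_UNREACH gives "treat-as-withdraw" for the MP_REACH prefixes only; the MP_UNREACH after the error is never looked at.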

A capability to say so adds nothing.
The router will behave exactly the same way.
Either way, human intervention is required to restore correct routing.

--
Jakob Heitz.


On Dec 10, 2012, at 5:27 AM, "Chris Hall" <chris.hall@highwayman.com> wrote:

> Rob Shakir wrote (on Sun 09-Dec-2012 at 23:47 +0000):
> ....
>> In my opinion (and as per Jakob's comment earlier in the thread) you
>> are looking at the treat-as-withdraw behaviour in the wrong way. I
>> have written multiple messages about this previously, but let me
>> reiterate some of the discussions.
> 
> It seems to me that "treat-as-withdraw" is "safe" if, and only if, all
> NLRI in a broken UPDATE can be identified.
> 
> I suggest that this can be achieved in two ways:
> 
>  a) if the sender always sends the NLRI attributes
>     as the first attributes -- as per the draft,
> 
>     AND
> 
>     the receiver knows that is the case -- for which
>     a capability would serve.
> 
> or:
> 
>  b) if the receiver requires (at a minimum) all
>     attributes to be correctly "framed".
> 
> The advantage of (a) is that one no longer really cares how badly
> broken the attributes are.  The disadvantage of (a) is that it
> requires a small change to the protocol (a much smaller change than
> the improved error handling, but nevertheless an extra change).
> 
> The advantage of (b) is that it can be applied without any change at
> the sender end.  The disadvantage of (b) is that it will not accept
> every conceivable form of broken attribute.
> 
> The advantage of "safe" "treat-as-withdraw" is that it does not
> introduce any new inconsistency in the RIB -- "first do no harm".
> 
> I am not arguing that safety is essential -- I am trying to be precise
> about how safety may be achieved, and what the compromises are in
> doing so.  If those compromises are unacceptable, then we need to be
> clear on the impact of removing the safety belt, so that it is clear
> whether things are better or worse: under some circumstances we may be
> thrown clear of the pile-up and avoid being burnt to a crisp, or we
> may sail through the windscreen and kiss our backsides goodbye, while
> on the other hand, an air-bag may be better all round; who can tell ?
> 
>> The current BGP implementation of session reset optimises solely to
>> make sure it maintains single-device RIB consistency, and
>> knowledge of correct routing information being used by that local
>> speaker. What this misses out is the other dimension to the
>> correctness, which is that of whether services within the network as
>> a system are functional. Where we pursue anything around the revised
>> error handling functionality that is being discussed in this draft,
>> we balance the correctness of the local device's RIB, against the
>> functionality of the overall network system.
> 
> Sure... session-reset is an extreme measure, and can be positively
> destructive... cue sound of many babies being ejected with bathwater.
> 
> I'm happy to be counted as a fan of "safe" "treat-as-withdraw".  And
> yes, that would preserve the correctness of the RIB, or at least avoid
> incorrectness thereof.
> 
> In case (b) the safety of "treat-as-withdraw" means that there is an
> error for which session-reset would continue to be the result --
> namely a "framing" error.  For all other errors, case (b) avoids
> session-reset -- hurrah !  Over time case (b) would be overtaken by
> case (a), as more devices are upgraded to the new error handling.  So,
> the residual session-reset cases would dwindle away.
> 
> If "framing errors" are determined to be a significant risk, then I
> guess that's an incentive for the deployment of case (a).
> 
> But, "unsafe" "treat-as-withdraw" may still be better than
> session-reset, which we are agreed is simply ghastly.  Inconsistencies
> in the RIB may, as you say, be tolerable in the larger context of the
> network, and treatable at an operational level -- getting away from
> the tedious bits and stuff that I keep droning on about.
> 
> I note that the inconsistencies which may be introduced by "unsafe"
> "treat-as-withdraw" are perhaps different to other inconsistencies: in
> particular, the operator can no longer tell which routes are good and
> which are bad.  In "unsafe" "treat-as-withdraw", each broken UPDATE
> may or may not have contained some NLRI which should have been
> withdrawn, or which are now out of date, but the receiver does not
> know which (if any) NLRI are in that (inconsistent) state !
> 
> I note also that Appendix A of the draft waxes lyrical on the subject
> of "Why not Discard UPDATE Messages".  With "unsafe"
> "treat-as-withdraw" the effect is to discard *part* of the UPDATE
> message -- the part which may or may not (and you cannot tell which)
> contain NLRI attribute(s) which have been obscured by earlier broken
> attribute(s).
> 
> I do not know how to assess the possible impact of "unsafe"
> "treat-as-withdraw"... but Appendix A appears to argue against ?
> 
> It is perfectly possible that I have my hands clenched firmly around
> the wrong end of this stick.  If the risk of "unsafe"
> "treat-as-withdraw" is understood and it is determined that the cure
> is (generally) not worse than the disease, then I can let go (yay !). 
> 
>> As such, I think we have to accept that the current protocol
>> behaviour can be *very* damaging to network deployments and hence
>> operators of real networks are prepared to tweak this balance
>> somewhat. The requirements draft that I am continuing to edit tries
>> to lay out a framework whereby one can limit the amount of time over
>> which this inconsistency may affect the network through having means
>> by which RIB consistency may be recovered. For example, these are:
>> 
>> - A more selective means by which ROUTE REFRESH can be achieved
>> (e.g., one-time ORF, using rt-constrain to refresh a subset of
>> routes, or building upon the Enhanced GR UPDATE-VERSION message) -
>> which allows the individual speaker to recover consistency of the
>> RIB.
> 
> That appears to require the receiver to know which NLRI are no longer
> consistent, which is not entirely possible with "unsafe"
> "treat-as-withdraw".
> 
>> - Better ways to be able to do session reset (the observation being
>> that the session-level error handling causes most problems due to
>> forwarding outages during it) - which is answered by GR based on
>> NOTIFICATION, and Enhanced GR.
> 
> This would not help with "unsafe" "treat-as-withdraw", since the
> session-reset has been avoided in any case.
> 
> However, it would help in case (b).  So, for "framing" errors a
> session-reset is required if "unsafe" "treat-as-withdraw" is to be
> avoided, but that session-reset would be mitigated along with all
> other (residual) session-resets.
> 
> Mind you, changes in GR will require changes at both ends, and I
> suspect rather larger changes than those required for case (a) "safe"
> "treat-as-withdraw" -- but I guess those GR changes are more generally
> a Good Thing.
> 
> ....
>> My view is that we should *not* have a capability to indicate this
>> behaviour. I would like a means by which I am not reliant on 3rd
>> party actions (be it my peers in the dfz, or l3vpn deployments, or
>> all device vendors) to begin to address a risk within my network
>> deployments.
> 
> OK... to try to summarise succinctly, I think there are two levels at
> which "safe" "treat-as-withdraw" may be implemented, as above:
> 
>  a) where the sender sends NLRI attributes as required by
>     section 3 of the draft...
> 
>     ...PLUS a capability... without which the receiver
>     cannot *know* that the sender is being helpful, and
>     has to assume otherwise.
> 
>     This can tolerate any (non-NLRI attribute related)
>     aberrations.  
> 
>  b) without any change at the sender end,
> 
>     ...OR where the receiver does not *know* that the
>     sender is being helpful.
> 
>     This can tolerate anything except "framing" errors (as
>     defined elsewhere).
> 
> Ruling out (a) limits the choice to:
> 
>  i) case (b) "safe" "treat-as-withdraw"
> 
> ii) "unsafe" "treat-as-withdraw"
> 
> Since "unsafe" "treat-as-withdraw" gives me the screaming hab-dabs, my
> view would be that starting with case (b) is a reasonable compromise,
> as a first step towards case (a).  
> 
> There is obviously an incentive to deploy improved error handling.
> Let us assume (for a moment) that improved error handling includes the
> case (a) sender behaviour.  Early adopters reap the benefit of case
> (b) improved error handling immediately on the devices where new
> software is deployed.  And they reap the benefit of case (a) improved
> error handling for their iBGP just as quickly as new software is
> deployed across their network.  For eBGP, availability of case (a)
> improved error handling depends on the strength of the incentive --
> but good coverage requires only that the relatively small number of
> Transit Providers adopt reasonably quickly.
> 
> However, given some way of determining the (likely ?) impact of
> "unsafe" "treat-as-withdraw", then one could assess whether that is
> better or worse than session-reset (under some circumstances ?) -- in
> the (unlikely ?) event that some particularly dim BGP implementation
> fails to correctly frame a set of attributes.  I wish I knew where to
> start to untangle this problem.
> 
> Chris
>
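
For reference, the "framing" check that case (b) in the quoted message depends on could look roughly like the Python sketch below. It assumes "correctly framed" means that every attribute header and length fits inside the path attributes field and that the lengths consume it exactly; the message says the term is defined elsewhere, so that reading is an assumption. The function names are hypothetical. The header layout (flags octet, type code octet, then a one-octet length, or two octets when the Extended Length flag 0x10 is set) is the RFC 4271 encoding.

```python
EXTENDED_LENGTH = 0x10  # attribute flag bit: length field is two octets


def split_attributes(buf: bytes):
    """Split a path attributes field into (type_code, value) pairs.

    Returns None if the attributes are not correctly framed.
    """
    attrs = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 3:            # no room for flags, type, length
            return None
        flags, type_code = buf[pos], buf[pos + 1]
        if flags & EXTENDED_LENGTH:
            if len(buf) - pos < 4:        # no room for a two-octet length
                return None
            length = int.from_bytes(buf[pos + 2:pos + 4], "big")
            header = 4
        else:
            length = buf[pos + 2]
            header = 3
        end = pos + header + length
        if end > len(buf):                # value runs past the field
            return None
        attrs.append((type_code, buf[pos + header:end]))
        pos = end
    return attrs


def case_b_action(buf: bytes) -> str:
    """Case (b): framed attributes allow every NLRI attribute to be found."""
    if split_attributes(buf) is None:
        return "session-reset"
    return "treat-as-withdraw"
```

With framing intact, the receiver can step over a broken attribute and still reach any MP_REACH or MP_UNREACH that follows it, which is what makes treat-as-withdraw "safe" in case (b). A framing error leaves session-reset as the only safe answer.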