Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

"Chris Hall" <chris.hall@highwayman.com> Mon, 10 December 2012 13:27 UTC


Rob Shakir wrote (on Sun 09-Dec-2012 at 23:47 +0000):
....
> In my opinion (and as per Jakob's comment earlier in the thread) you
> are looking at the treat-as-withdraw behaviour in the wrong way. I
> have written multiple messages about this previously, but let me
> reiterate some of the discussions.

It seems to me that "treat-as-withdraw" is "safe" if, and only if, all
NLRI in a broken UPDATE can be identified.

I suggest that this can be achieved in two ways:

  a) if the sender always sends the NLRI attributes
     as the first attributes -- as per the draft,

     AND

     the receiver knows that is the case -- for which
     a capability would serve.

or:

  b) if the receiver requires (at a minimum) all
     attributes to be correctly "framed".

The advantage of (a) is that one no longer really cares how badly
broken the attributes are.  The disadvantage of (a) is that it
requires a small change to the protocol (a much smaller change than
the improved error handling, but nevertheless an extra change).

The advantage of (b) is that it can be applied without any change at
the sender end.  The disadvantage of (b) is that it will not accept
every conceivable form of broken attribute.
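
To make case (b) concrete, here is a minimal sketch of one possible reading of "correctly framed": every attribute header and its declared length must fit inside the path attribute block, without looking at what the attribute contains. The header layout (flags, type code, one- or two-octet length selected by the Extended Length bit) is from RFC 4271, and the type codes 14 and 15 are from RFC 4760. The names walk_attributes, nlri_attributes and FramingError are hypothetical, invented for this illustration, and are not taken from the draft.

```python
import struct

EXTENDED_LENGTH = 0x10   # attribute flag bit: length field is two octets
MP_REACH_NLRI = 14       # RFC 4760
MP_UNREACH_NLRI = 15     # RFC 4760


class FramingError(Exception):
    """An attribute header or declared length runs past the end of the block."""


def walk_attributes(block: bytes):
    """Split a path attribute block into (flags, type_code, value) tuples.

    Only the framing is checked; attribute contents are not validated.
    Raises FramingError if any attribute does not fit in the block.
    """
    attrs = []
    pos = 0
    while pos < len(block):
        if len(block) - pos < 3:
            raise FramingError("truncated attribute header")
        flags, type_code = block[pos], block[pos + 1]
        if flags & EXTENDED_LENGTH:
            if len(block) - pos < 4:
                raise FramingError("truncated extended length")
            (length,) = struct.unpack_from("!H", block, pos + 2)
            pos += 4
        else:
            length = block[pos + 2]
            pos += 3
        if pos + length > len(block):
            raise FramingError("attribute overruns the block")
        attrs.append((flags, type_code, block[pos:pos + length]))
        pos += length
    return attrs


def nlri_attributes(block: bytes):
    """Return the MP_REACH_NLRI / MP_UNREACH_NLRI attributes in a framed block."""
    return [a for a in walk_attributes(block)
            if a[1] in (MP_REACH_NLRI, MP_UNREACH_NLRI)]
```

If walk_attributes succeeds, every NLRI attribute in the UPDATE has been located, however badly the other attributes are broken inside; if it raises, the receiver cannot know what it has missed.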

The advantage of "safe" "treat-as-withdraw" is that it does not
introduce any new inconsistency in the RIB -- "first do no harm".

I am not arguing that safety is essential -- I am trying to be precise
about how safety may be achieved, and what the compromises are in
doing so.  If those compromises are unacceptable, then we need to be
clear on the impact of removing the safety belt, so that it is clear
whether things are better or worse: under some circumstances we may be
thrown clear of the pile-up and avoid being burnt to a crisp, or we
may sail through the windscreen and kiss our backsides goodbye, while
on the other hand, an air-bag may be better all round; who can tell ?

> The current BGP implementation of session reset optimises solely to
> make sure it maintains single device RIB consistency, and
> knowledge of correct routing information being used by that local
> speaker. What this misses out is the other dimension to the
> correctness, which is that of whether services within the network as
> a system are functional. Where we pursue anything around the revised
> error handling functionality that is being discussed in this draft,
> we balance the correctness of the local device's RIB, against the
> functionality of the overall network system.

Sure... session-reset is an extreme measure, and can be positively
destructive... cue sound of many babies being ejected with bathwater.

I'm happy to be counted as a fan of "safe" "treat-as-withdraw".  And
yes, that would preserve the correctness of the RIB, or at least avoid
incorrectness thereof.

In case (b) the safety of "treat-as-withdraw" means that there is an
error for which session-reset would continue to be the result --
namely a "framing" error.  For all other errors, case (b) avoids
session-reset -- hurrah !  Over time case (b) would be overtaken by
case (a), as more devices are upgraded to the new error handling.  So,
the residual session-reset cases would dwindle away.

If "framing errors" are determined to be a significant risk, then I
guess that's an incentive for the deployment of case (a).

But, "unsafe" "treat-as-withdraw" may still be better than
session-reset, which we are agreed is simply ghastly.  Inconsistencies
in the RIB may, as you say, be tolerable in the larger context of the
network, and treatable at an operational level -- getting away from
the tedious bits and stuff that I keep droning on about.

I note that the inconsistencies which may be introduced by "unsafe"
"treat-as-withdraw" are perhaps different to other inconsistencies: in
particular, the operator can no longer tell which routes are good and
which are bad.  In "unsafe" "treat-as-withdraw", each broken UPDATE
may or may not have contained some NLRI which should have been
withdrawn, or which are now out of date, but the receiver does not
know which (if any) NLRI are in that (inconsistent) state !

I note also that Appendix A of the draft waxes lyrical on the subject
of "Why not Discard UPDATE Messages".  With "unsafe"
"treat-as-withdraw" the effect is to discard *part* of the UPDATE
message -- the part which may or may not (and you cannot tell which)
contain NLRI attribute(s) which have been obscured by earlier broken
attribute(s).
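
A small illustration of the "obscured" case, as a sketch only. It assumes a hypothetical lenient receiver (lenient_walk is an invented name) which keeps whatever attributes it managed to frame before the first fault and gives up there. The example bytes are made up for the purpose: an ORIGIN, an AS_PATH whose length octet is wrong, and an MP_UNREACH_NLRI withdrawing 2001:db8::/32.

```python
EXTENDED_LENGTH = 0x10   # attribute flag bit (RFC 4271)
MP_UNREACH_NLRI = 15     # RFC 4760


def lenient_walk(block: bytes):
    """Collect attribute type codes up to the first framing fault.

    Returns (type_codes_seen, clean), where clean is False if the walk
    stopped early because an attribute did not fit in the block.
    """
    seen = []
    pos = 0
    while pos < len(block):
        hdr = 4 if block[pos] & EXTENDED_LENGTH else 3
        if pos + hdr > len(block):
            return seen, False
        if hdr == 4:
            length = int.from_bytes(block[pos + 2:pos + 4], "big")
        else:
            length = block[pos + 2]
        if pos + hdr + length > len(block):
            return seen, False
        seen.append(block[pos + 1])
        pos += hdr + length
    return seen, True


ORIGIN = bytes([0x40, 1, 1, 0])
AS_PATH_OK = bytes([0x40, 2, 0])      # empty AS_PATH, correctly framed
AS_PATH_BAD = bytes([0x40, 2, 200])   # claims 200 octets it does not have
UNREACH = bytes([0x80, MP_UNREACH_NLRI, 8,
                 0, 2, 1,             # AFI 2 (IPv6), SAFI 1 (unicast)
                 32, 0x20, 0x01, 0x0d, 0xb8])

good = lenient_walk(ORIGIN + AS_PATH_OK + UNREACH)
bad = lenient_walk(ORIGIN + AS_PATH_BAD + UNREACH)
```

Here good is ([1, 2, 15], True), while bad is ([1], False): the withdrawal is sitting in the message, but the receiver never reaches it and has no way of knowing that it was there.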

I do not know how to assess the possible impact of "unsafe"
"treat-as-withdraw"... but Appendix A appears to argue against it ?

It is perfectly possible that I have my hands clenched firmly around
the wrong end of this stick.  If the risk of "unsafe"
"treat-as-withdraw" is understood and it is determined that the cure
is (generally) not worse than the disease, then I can let go (yay !). 

> As such, I think we have to accept that the current protocol
> behaviour can be *very* damaging to network deployments and hence
> operators of real networks are prepared to tweak this balance
> somewhat. The requirements draft that I am continuing to edit tries
> to lay out a framework whereby one can limit the amount of time over
> which this inconsistency may affect the network through having means
> by which RIB consistency may be recovered. For example, these are:
> 
> - A more selective means by which ROUTE REFRESH can be achieved
> (e.g., one-time ORF, using rt-constrain to refresh a subset of
> routes, or building upon the Enhanced GR UPDATE-VERSION message) -
> which allows the individual speaker to recover consistency of the
> RIB.

That appears to require the receiver to know which NLRI are no longer
consistent, which is not entirely possible with "unsafe"
"treat-as-withdraw".

> - Better ways to be able to do session reset (the observation being
> that the session-level error handling causes most problems due to
> forwarding outages during it) - which is answered by GR based on
> NOTIFICATION, and Enhanced GR.

This would not help with "unsafe" "treat-as-withdraw", since the
session-reset has been avoided in any case.

However, it would help in case (b).  So, for "framing" errors a
session-reset is required if "unsafe" "treat-as-withdraw" is to be
avoided, but that session-reset would be mitigated along with all
other (residual) session-resets.

Mind you, changes in GR will require changes at both ends, and I
suspect rather larger changes than those required for case (a) "safe"
"treat-as-withdraw" -- but I guess those GR changes are more generally
a Good Thing.

....
> My view is that we should *not* have a capability to indicate this
> behaviour. I would like a means by which I am not reliant on 3rd
> party actions (be it my peers in the dfz, or l3vpn deployments, or
> all device vendors) to begin to address a risk within my network
> deployments.

OK... to try to summarise succinctly, I think there are two levels at
which "safe" "treat-as-withdraw" may be implemented, as above:

  a) where the sender sends NLRI attributes as required by
     section 3 of the draft...

     ...PLUS a capability... without which the receiver
     cannot *know* that the sender is being helpful, and
     has to assume otherwise.

     This can tolerate any (non-NLRI attribute related)
     aberrations.  

  b) without any change at the sender end,

     ...OR where the receiver does not *know* that the
     sender is being helpful.

     This can tolerate anything except "framing" errors (as
     defined elsewhere).

Ruling out (a) limits the choice to:

  i) case (b) "safe" "treat-as-withdraw"

 ii) "unsafe" "treat-as-withdraw"

Since "unsafe" "treat-as-withdraw" gives me the screaming hab-dabs, my
view would be that starting with case (b) is a reasonable compromise,
as a first step towards case (a).  
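
The choices above can be laid out as a small decision function. This is a sketch of my summary only, not anything the draft specifies: the Action names, the choose_action function and its three boolean inputs are all hypothetical, and it assumes the receiver has already established whether the capability was negotiated, whether the leading NLRI attributes parsed, and whether the whole attribute block is correctly framed.

```python
from enum import Enum


class Action(Enum):
    SAFE_TREAT_AS_WITHDRAW = "safe treat-as-withdraw"
    UNSAFE_TREAT_AS_WITHDRAW = "unsafe treat-as-withdraw"
    SESSION_RESET = "session reset"


def choose_action(nlri_first_capability: bool,
                  leading_nlri_ok: bool,
                  framing_ok: bool,
                  allow_unsafe: bool = False) -> Action:
    """Pick the response to a broken UPDATE.

    nlri_first_capability: the peer has promised NLRI attributes come first.
    leading_nlri_ok:       those leading NLRI attributes parsed correctly.
    framing_ok:            every attribute fits inside the attribute block.
    allow_unsafe:          operator accepts not knowing which NLRI were lost.
    """
    # Case (a): all NLRI are known, whatever state the rest is in.
    if nlri_first_capability and leading_nlri_ok:
        return Action.SAFE_TREAT_AS_WITHDRAW
    # Case (b): all attributes were located, so all NLRI were located.
    if framing_ok:
        return Action.SAFE_TREAT_AS_WITHDRAW
    # Some NLRI may be hidden: either accept that, or reset the session.
    if allow_unsafe:
        return Action.UNSAFE_TREAT_AS_WITHDRAW
    return Action.SESSION_RESET
```

Ruling out the capability amounts to fixing nlri_first_capability at False, which leaves the outcome resting entirely on framing_ok and allow_unsafe.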

There is obviously an incentive to deploy improved error handling.
Let us assume (for a moment) that improved error handling includes the
case (a) sender behaviour.  Early adopters reap the benefit of case
(b) improved error handling immediately on the devices where new
software is deployed.  And they reap the benefit of case (a) improved
error handling for their iBGP just as quickly as new software is
deployed across their network.  For eBGP, availability of case (a)
improved error handling depends on the strength of the incentive --
but good coverage requires only that the relatively small number of
Transit Providers adopt reasonably quickly.

However, given some way of determining the (likely ?) impact of
"unsafe" "treat-as-withdraw", then one could assess whether that is
better or worse than session-reset (under some circumstances ?) -- in
the (unlikely ?) event that some particularly dim BGP implementation
fails to correctly frame a set of attributes.  I wish I knew where to
start to untangle this problem.

Chris