Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

Chris,

In my opinion (and as per Jakob's comment earlier in the thread) you are looking at the treat-as-withdraw behaviour in the wrong way. I have written multiple messages about this previously, but let me reiterate some of the discussions.

The current BGP behaviour of resetting the session optimises solely for the consistency of a single device's RIB, and for certainty that the local speaker is using correct routing information. What this misses is the other dimension of correctness: whether services within the network as a system are functional. In pursuing the revised error handling discussed in this draft, we are balancing the correctness of the local device's RIB against the functionality of the overall network system.

As such, I think we have to accept that the current protocol behaviour can be *very* damaging to network deployments, and hence that operators of real networks are prepared to tweak this balance somewhat. The requirements draft that I am continuing to edit tries to lay out a framework that limits the time over which this inconsistency may affect the network, by providing means to recover RIB consistency. Examples include:

- A more selective form of ROUTE REFRESH (e.g., one-time ORF, using rt-constrain to refresh a subset of routes, or building upon the Enhanced GR UPDATE-VERSION message), which allows the individual speaker to recover consistency of its RIB.
- Better ways to do session reset (the observation being that session-level error handling causes most problems because of the forwarding outages during the reset), which is addressed by GR based on NOTIFICATION, and Enhanced GR.

It strikes me that your analysis aims to ensure 100% consistency at the device level, which I think we have to accept is incompatible with overall network system robustness. Can I suggest that you contribute some text to the draft on the caveats (i.e., where the treat-as-withdraw mechanism may fail), noting that these are cases where a device may wish to continue to implement session-level reset behaviour, or may otherwise face additional risk?
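
To make that caveat concrete, here is a rough sketch of the fallback I have in mind. It is illustrative only and not text from the draft: it assumes a parser that can report whether it located every NLRI in the malformed UPDATE, and all names (ParsedUpdate, nlri_complete, handle_malformed_update) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    TREAT_AS_WITHDRAW = "treat-as-withdraw"
    SESSION_RESET = "session-reset"


@dataclass
class ParsedUpdate:
    # Hypothetical parser output for a malformed UPDATE.
    nlri: list            # prefixes the parser managed to extract
    nlri_complete: bool   # True only if every NLRI field was located


def handle_malformed_update(update: ParsedUpdate, rib: dict) -> Action:
    """Decide how to react to a malformed UPDATE from one peer.

    rib maps prefix -> path attributes learned from that peer.
    """
    if not update.nlri_complete:
        # Caveat case: some NLRI may be hidden, so stale routes could
        # remain. Fall back to the existing session-level behaviour.
        return Action.SESSION_RESET
    for prefix in update.nlri:
        rib.pop(prefix, None)  # withdraw; ignore prefixes we never held
    return Action.TREAT_AS_WITHDRAW
```

Whether the incomplete case should reset the session or accept the risk and continue is exactly the balance discussed above; the sketch simply picks the conservative branch.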

My view is that we should *not* have a capability to indicate this behaviour. I would like a means by which I am not reliant on 3rd party actions (be it my peers in the dfz, or l3vpn deployments, or all device vendors) to begin to address a risk within my network deployments.

Kind regards,
r.

On 9 Dec 2012, at 23:07, Chris Hall <chris.hall@highwayman.com> wrote:

> Jakob Heitz wrote (on Sat 08-Dec-2012 at 16:43 +0000):
>> The goal of "treat as withdraw" is not to reinterpret a broken
>> update message and continue the session, like nothing happened.
>> 
>> IMO, the goal is to limit the disruption caused by a session reset,
>> while alerting a human to fix the problem that no machine can.
> 
> I guess you are suggesting that it does not then matter if a broken
> UPDATE message results in some NLRI being missed, and so not
> "treated-as-withdraw", and hence the receiver continues with some
> invalid or out of date routes, for some time.
> 
> Clearly session-reset is a less than perfect remedy.  But in proposing
> an alternative treatment, perhaps "first do no harm" is as good a
> guide as any.  I think that to achieve that, one needs to be sure that
> *all* NLRI in a broken update can be identified if "treat-as-withdraw"
> is to be applied.  
> 
> If the intention is to "treat-as-withdraw" any NLRI which is visible,
> but continue the session in any case (so, accepting the risks of
> invalid or out of date routes) then I think the draft should estimate
> the risks and set out a justification for this being a less-bad remedy
> than session-reset.
> 
> Of course, a major issue with session-reset is that the error may well
> simply be repeated, creating a ghastly cycle of session-reset/restart.
> It could well be better to avoid session-reset, and continue with
> some invalid or out of date routes -- for a while, defined somehow ?  I
> just don't know how to demonstrate that, or how to limit the downside
> of accepting that risk, etc.
> 
> "Treat-as-withdraw" is an excellent and minimally disruptive response
> in those cases where all NLRI can be identified.  But it is not the
> only alternative to session-reset.  If there is doubt and uncertainty
> about some routes, the receiver could deem *all* routes learned from
> the peer in question to be "routes-of-last-resort", which it then uses
> if and only if it had nothing else, but would not advertise them to
> other peers.  This is just short of a "session-reset", and avoids
> falling into a cycle of session-reset/restart.
> 
> Chris