Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

"Chris Hall" <chris.hall@highwayman.com> Tue, 11 December 2012 11:37 UTC

Jakob Heitz wrote (on Mon 10-Dec-2012 at 18:09 +0000):
> On Monday, December 10, 2012 9:52 AM, Chris Hall wrote:
....
> Once you have a malformed update, NOTHING is certain.
> We limit the damage and call for human intervention.

OK.  That's the "mission statement".  As we get into the detail I
think I see where my disconnect is.

The draft gets very picky about NLRI and NLRI attributes.  The moment
it sees any malformation in those, it throws up its hands and hits the
session-reset button.  The draft allows repeat attributes of the same
type, except for NLRI attributes.  The draft uses a fair amount of
pencil explaining why simply discarding UPDATE messages is unsafe,
which suggests that "treat-as-withdraw" is the minimum requirement if
session-reset is to be avoided.

[BTW, why are repeat NLRI attributes deemed a cardinal sin
(session-reset, already) ?  If they are valid, the semantics are
entirely clear... here are some more prefixes to be
updated/withdrawn.]

The introduction to the draft states: "The goal ... is to minimise the
impact on routing ... while maintaining protocol correctness ....
removing the routes carried in the malformed UPDATE from the routing
system."

From all that, I have been working on the assumption that "limit the
damage" implies, at a minimum, not proceeding unless all NLRI have
been identified, and can therefore be "treated-as-withdraw" -- or, at
least, not proceeding unless reasonable steps (TBD) have been taken to
identify all NLRI most of the time (also TBD).

As you say, with a malformed update "NOTHING is certain", so this is
Tricky, and the receiver has to take some view on what are reasonable
steps.  I have suggested verifying the attribute "framing" as a
minimum requirement for that.  It has also been suggested that, having
found a malformed attribute, the receiver should stop stepping through
attributes by attribute length, and scan octet-wise looking for NLRI
attribute(s).  However, if the sender is helpful, and places NLRI
attributes ahead of all others (as required by the draft) then the
receiver can stop worrying about trying to deal with attributes where
"NOTHING is certain", and can process the NLRI (effectively)
separately -- provided the receiver has some means of knowing the
sender is being helpful.
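
To make the "helpful sender" case concrete, here is a minimal sketch of
a check that the NLRI attributes lead the attribute list.  The type
codes 14 (MP_REACH_NLRI) and 15 (MP_UNREACH_NLRI) are from RFC 4760;
the function name is hypothetical, and the check only shows that one
UPDATE happens to be ordered this way, not that the sender promises
to order every UPDATE this way.

```python
MP_REACH_NLRI = 14    # RFC 4760
MP_UNREACH_NLRI = 15  # RFC 4760
NLRI_TYPES = {MP_REACH_NLRI, MP_UNREACH_NLRI}


def nlri_attributes_lead(type_codes):
    """Return True if every NLRI attribute precedes all other attributes.

    type_codes is the sequence of attribute type codes in the order they
    appear in the UPDATE.  An UPDATE with no NLRI attributes passes
    trivially.
    """
    seen_other = False
    for type_code in type_codes:
        if type_code in NLRI_TYPES:
            if seen_other:
                # An NLRI attribute follows some other attribute.
                return False
        else:
            seen_other = True
    return True
```

With [14, 1, 2] this returns True; with [1, 14] it returns False, which
is the case where a malformed attribute could hide NLRI that follow it.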

However, it seems that identifying all the NLRI is not as important as
I had understood it to be, so the receiver need take no special steps
to extract the NLRI.  So, if the sender is helpful, things will
generally work better, but otherwise it makes no difference to the
receiver.  I don't agree, but that seems to be the approach.

> > I can quite believe that, in practice, few BGP implementations (if
> > any) send more than one of the above forms of NLRI in a single
> > UPDATE message. 
> >
> > But that is not a requirement of the RFC or the draft -- so the
> > receiver is not (strictly speaking) entitled to assume it.

> It assumes the best it can to limit the damage.

What do you suggest that should be ?

In a world where "NOTHING is certain", does it matter if different
implementations make different assumptions ?  Or should the
specification require consistent behaviour in the face of uncertainty,
if nothing else, to avoid increasing that uncertainty ?

> > What is the receiver supposed to do if it has not found any NLRI
> > at the point that it hits a malformed attribute ?

> Reset the session.

We have established that the error-handling is not required to find
all NLRI.  That is, we are happy proceeding with a session where there
are some NLRI which we would prefer to have "treated-as-withdraw", but
could not.  In effect, we are prepared to tolerate a measure of
"UPDATE-discard" (Appendix A of the draft, notwithstanding).  I
suppose there is a difference between knowing that we have missed some
NLRI and not knowing whether we have found all NLRI.  However, given
the pain associated with session-reset, is there a good reason for
accepting one degree of "UPDATE-discard" and not another ?

> > So, the receiver scans the attributes, and on the first malformed
> > one it stops.  Yes ?  Or, perhaps it ploughs on to the end
> > stepping past malformed attributes, and truncating the final
> > attribute if it overruns the 'Total Attributes Length'.  Yes ?

> It doesn't stop.

So the parsing of attributes takes the Attribute Length of a malformed
attribute at face value.  Since "NOTHING is certain", it doesn't have
much choice.

If the final attribute overruns the 'Total Attributes Length' (or
there is an incomplete attribute header at the end) one thing is
certain, the attributes are badly broken -- the sender has been unable
to complete the simple task of correctly "framing" the attributes.  I
think the draft expects this case to be taken as a malformation of the
attribute.  The draft delegates the definition of malformation to the
relevant documentation for each Type of attribute.  If there is
intended to be a general or default way of dealing with this case,
then I think the draft needs to specify.
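
A rough sketch of the "framing" walk I have in mind, assuming the
RFC 4271 attribute header (Flags, Type, then a one-octet Length, or a
two-octet Length when the Extended Length flag 0x10 is set).  The
function name and the status strings are made up for illustration; the
input is exactly the 'Total Attributes Length' octets.

```python
from typing import List, Tuple

EXTENDED_LENGTH = 0x10  # RFC 4271 flag bit: Length field is two octets


def walk_attributes(attrs: bytes) -> Tuple[List[Tuple[int, int, bytes]], str]:
    """Step through attributes by length, taking each Length at face value.

    Returns the (flags, type, value) triples found so far and a status:
    "ok", "incomplete-header" or "overrun".
    """
    found = []
    pos = 0
    total = len(attrs)
    while pos < total:
        # Need at least Flags, Type and one Length octet.
        if total - pos < 3:
            return found, "incomplete-header"
        flags, type_code = attrs[pos], attrs[pos + 1]
        if flags & EXTENDED_LENGTH:
            if total - pos < 4:
                return found, "incomplete-header"
            length = int.from_bytes(attrs[pos + 2:pos + 4], "big")
            header = 4
        else:
            length = attrs[pos + 2]
            header = 3
        # The final attribute must not run past the Total Attributes Length.
        if pos + header + length > total:
            return found, "overrun"
        start = pos + header
        found.append((flags, type_code, attrs[start:start + length]))
        pos = start + length
    return found, "ok"
```

A status other than "ok" is the "badly broken" case above.  What the
receiver then does with it is the part I think the draft needs to
specify.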

If the sum of the Attribute Lengths is exactly the 'Total Attributes
Length' (but some attribute is malformed) then "NOTHING is certain",
but it is possible/probable that all attributes have been identified.
[The degree of confidence may be increased if the Flags and Length of
known and well-known attributes are correct, and there are no repeated
attribute types, and perhaps other "semantic" information is taken
into account.]  This all matters rather more if one is trying to
identify all NLRI attributes, but not one jot or iota otherwise apart
from the diagnostics.
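
For what it is worth, the extra "confidence" checks might look
something like the sketch below.  The flag and length expectations are
the RFC 4271 ones for a handful of attributes; the function name, the
choice of checks and the wording of the reports are assumptions of
mine, not anything the draft specifies.

```python
from collections import Counter

# Expected Optional/Transitive bits (top two flag bits) per RFC 4271.
EXPECTED_FLAGS = {
    1: 0x40,  # ORIGIN: well-known
    2: 0x40,  # AS_PATH: well-known
    3: 0x40,  # NEXT_HOP: well-known
    4: 0x80,  # MULTI_EXIT_DISC: optional non-transitive
    5: 0x40,  # LOCAL_PREF: well-known
    6: 0x40,  # ATOMIC_AGGREGATE: well-known
    7: 0xC0,  # AGGREGATOR: optional transitive
}

# Attributes whose value has a fixed length per RFC 4271.
EXPECTED_LENGTH = {1: 1, 3: 4, 4: 4, 5: 4, 6: 0}


def framing_doubts(attrs):
    """List reasons to doubt that the attribute framing is intact.

    attrs is a list of (flags, type_code, value) triples.  An empty
    result raises confidence; it does not prove anything.
    """
    doubts = []
    counts = Counter(type_code for _, type_code, _ in attrs)
    for type_code, n in sorted(counts.items()):
        if n > 1:
            doubts.append(f"type {type_code} repeated {n} times")
    for flags, type_code, value in attrs:
        expected = EXPECTED_FLAGS.get(type_code)
        if expected is not None and (flags & 0xC0) != expected:
            doubts.append(f"type {type_code} has unexpected flags 0x{flags:02x}")
        fixed = EXPECTED_LENGTH.get(type_code)
        if fixed is not None and len(value) != fixed:
            doubts.append(f"type {type_code} has length {len(value)}, expected {fixed}")
    return doubts
```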

The draft has a binary approach to attributes, they are either
malformed or not malformed (well-formed).  Is there room for a
"semantic error" ?  That is, an attribute which is well-formed as far
as its Flags, Type, Length and, perhaps, internal structure are
concerned, but makes no sense at all.  The result may still be
"treat-as-withdraw" (say) but the error does not cast doubt on the
attributes which follow.  The distinction would improve the
diagnostics.

So, it appears that the draft is taking a pretty relaxed view of what
is acceptable -- since "NOTHING is certain" why sweat it ?  I am
worried by the option to "attribute discard".  If things are not
certain, is "treat-as-withdraw" not the safer option ?  An
ATOMIC_AGGREGATE attribute, for example, is considered malformed if it
has any length other than 0.  Since nobody gives a rodent's posterior
about this attribute, it seems to make perfect sense to throw it on
the floor if it is malformed.  Except, except, the length of an
attribute also affects the attributes around it.  An ATOMIC_AGGREGATE
attribute with a length of (say) 700 octets is such obvious nonsense !
And that's to simply be discarded ?  Surely either the sender has
departed the reservation in something of a hurry, or some earlier
(possibly undetected) attribute error has thrown the parser off track
?  The balance of probabilities has to be that this is a symptom of
some problem deeper than a meaningless value for a meaningless
attribute... surely ?  If the attribute length is indeed invalid, then
the sum of all the attribute lengths is probably going to be wrong, so
this will decay into "treat-as-withdraw".  Nevertheless, I struggle to
see the point of applying "attribute discard" in any case of attribute
malformation.  IMO "attribute discard" will be appropriate for some
"semantic errors", only.
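
To illustrate the distinction I am arguing for (this is my proposal,
not what the draft does), here is a small sketch with a three-way
classification for two attributes.  The ORIGIN and ATOMIC_AGGREGATE
type codes, lengths and ORIGIN values are from RFC 4271; the function
names and category strings are hypothetical.

```python
ORIGIN = 1
ATOMIC_AGGREGATE = 6


def classify(type_code: int, value: bytes) -> str:
    """Classify an attribute as well-formed, semantic-error or malformed."""
    if type_code == ATOMIC_AGGREGATE:
        # Any length other than 0 is a malformation, 700 octets included.
        return "well-formed" if len(value) == 0 else "malformed"
    if type_code == ORIGIN:
        if len(value) != 1:
            return "malformed"
        # Correct length but a value outside IGP/EGP/INCOMPLETE (0, 1, 2).
        return "well-formed" if value[0] in (0, 1, 2) else "semantic-error"
    return "unknown"


def casts_doubt_on_following(category: str) -> bool:
    """A bad length may have thrown the parser off track; a bad value has not."""
    return category == "malformed"


def attribute_discard_allowed(category: str) -> bool:
    """On the view argued above, discard is only ever for semantic errors."""
    return category == "semantic-error"
```

So classify(6, bytes(700)) gives "malformed", which casts doubt on the
attributes that follow and is not a candidate for "attribute discard".
Whether a given semantic error should be discarded or
"treated-as-withdraw" would still be decided per attribute.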

....
> > Any NLRI that might be in the UPDATE, but are not visible
> > because of the malformed attributes, are simply ignored.
> > Yes ?

> yes.
> Again:
> Once you have a malformed update, NOTHING is certain.
> We limit the damage and call for human intervention.

Well, as above, this means it is OK to let (some, indeterminate) NLRI
fall into a state where they should have been withdrawn or are now out
of date.  So, why should session-reset *ever* be required ?
Damage-limitation-wise, avoiding session-reset is the big win... so
why not keep going no matter what, and let the human beans deal with
it... much better than falling into a cycle of
session-reset/restart/reset/... ?

>> ...  Also, I kinda suspect that the human bean will be
>> greatly assisted if it is clear which NLRI have been affected
>> (and which definitely have not). 

> The human bean will not rely on ANY routes or information
> from the broken session.

Well, even more reason to stop being picky at the protocol level, and
hence: "treat-as-withdraw" where you can, and "UPDATE-discard" the
rest ?

Chris