Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

Jeff Wheeler <jsw@inconcepts.biz> Mon, 10 December 2012 16:13 UTC

In-Reply-To: <005001cdd6da$099f1e90$1cdd5bb0$@highwayman.com>
References: <20121121191321.6164.6887.idtracker@ietfa.amsl.com> <50AD2986.90705@cisco.com> <058b01cdd3b4$9f5193b0$ddf4bb10$@highwayman.com> <8ED5B0B0F5B4854A912480C1521F973A0F4940@xmb-rcd-x13.cisco.com> <94913EE5-2864-4EE2-B474-9631430B1E22@ericsson.com> <068701cdd478$2cf01cf0$86d056d0$@highwayman.com> <CAEGVVtBy-zdLz8hVajLnuAqgzfgQHrseK4r-N9=pOZGtqV7LbA@mail.gmail.com> <074d01cdd536$173f5830$45be0890$@highwayman.com> <9474D8DC-30FF-4C52-9504-15CBCC47E7D8@ericsson.com> <07df01cdd661$f28ef7c0$d7ace740$@highwayman.com> <36E98AE5-3EF8-4738-9982-42B9CA0BAAF5@rob.sh> <005001cdd6da$099f1e90$1cdd5bb0$@highwayman.com>
Date: Mon, 10 Dec 2012 11:13:26 -0500
Message-ID: <CAPWAtbJO7dopCv9mbRHTTNDsSAimumqXu1Xy+Rn2XoE+7Rpk8Q@mail.gmail.com>
From: Jeff Wheeler <jsw@inconcepts.biz>
To: Chris Hall <chris.hall@highwayman.com>
Cc: idr@ietf.org
Subject: Re: [Idr] I-D Action: draft-ietf-idr-error-handling-03.txt

On Mon, Dec 10, 2012 at 8:26 AM, Chris Hall <chris.hall@highwayman.com>
wrote:
> However, given some way of determining the (likely ?) impact of
> "unsafe" "treat-as-withdraw", then one could assess whether that is
> better or worse than session-reset (under some  circumstances ?) -- in

Why is the question not answerable by one of three options?
1# ignore
2# treat-as-withdraw
3# session-reset

1# IGNORE is hazardous, but probably only to the prefixes in that update (or
withdraw), which means the scope of malfunction is relatively small.  If
someone announces a bad route to the DFZ, or a buggy switch announces wrong
L3VPN information, this will not "spill over" to any other prefixes, as
long as you honor any withdraw that appears in the same Message ahead of
the damaged part of the UPDATE.

Yes, perhaps reachability to the affected prefixes will be gone.  There
could be a loop created and packets will have to expire due to TTL until
this loop is resolved.  But either way, I think the scope of the
malfunction is only prefixes that were in the bad UPDATE.  Even if they
weren't packed into the same UPDATE they will share the same Attributes and
so the sender is likely to make the same mistake.  The exception here is if
an MP_REACH_NLRI or MP_UNREACH_NLRI Attribute is corrupt but the native NLRI
in the outside part of the Message are not damaged.  You would be ignoring
them and causing a little bit more RIB inconsistency.

2# TREAT-AS-WITHDRAW is hazardous.  Loops could be created in a native
forwarding path (IPv4/IPv6 with no labels) but in MPLS VPNs or when using
labeled-unicast, there will not be any loops.  Reachability may be lost but
if any undamaged path is available then it will be selected as best and
installed.

The great risk is, how do you guess what the prefixes are, if the framing
is wrong?  In my view, here is one way that is rather thorough for finding
MP NLRIs:

Beginning with or following the damaged Attribute (which one?), scan for
MP_REACH_NLRI:
*AttrFlags AttrType AttrLen* AFI SAFI NextHopLen NH *0x00* PfxLen Pfx
(PfxLen Pfx){0,}

You know what AFI, SAFI you are willing to support on this BGP session
because this was determined when the session established.  If you find 0x0e
AFI SAFI sequences you will hope the next octet is NextHopLen and then look
forward that many octets, hoping to find a 0x00 which is a handy reserved
field.*  If you found that 0x00 you can see if a PfxLen follows that is a
sane length for this AFI SAFI -- you know it won't be > 32 for IPv4 or >
128 for IPv6, for example.  After that you expect to skip the Pfx and keep
looking for more sequences like this until you run out of AttrLen.

* for information on this reserved field, see RFC4760 Pg4; for history,
RFC2283 Pg3 "Number of SNPAs"
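
A rough sketch of the scan described above, in Python, just to show how
little code it takes.  The NEGOTIATED table, the function names and the
handling of the Extended Length flag are my own assumptions for
illustration; this is not taken from the draft or from any implementation.

```python
import struct

MP_REACH_NLRI = 0x0E  # attribute type code 14
EXT_LEN_FLAG = 0x10   # Extended Length bit in AttrFlags

# Assumption: (AFI, SAFI) pairs negotiated at session establishment,
# mapped to the longest sane PfxLen in bits.
NEGOTIATED = {(1, 1): 32, (2, 1): 128}


def parse_prefixes(data, max_bits):
    """Parse (PfxLen Pfx)* exactly; return None on any inconsistency."""
    prefixes, pos = [], 0
    while pos < len(data):
        bits = data[pos]
        nbytes = (bits + 7) // 8
        if bits > max_bits or pos + 1 + nbytes > len(data):
            return None
        prefixes.append((bits, bytes(data[pos + 1:pos + 1 + nbytes])))
        pos += 1 + nbytes
    return prefixes


def scan_mp_reach(buf, start=0):
    """Return (afi, safi, prefixes) for the first plausible match, else None."""
    for i in range(start, len(buf) - 2):
        if buf[i + 1] != MP_REACH_NLRI:
            continue
        ext = bool(buf[i] & EXT_LEN_FLAG)
        hdr = 4 if ext else 3
        if i + hdr > len(buf):
            continue
        alen = struct.unpack_from("!H", buf, i + 2)[0] if ext else buf[i + 2]
        body, end = i + hdr, i + hdr + alen
        # AFI(2) + SAFI(1) + NextHopLen(1) + reserved(1) is the minimum.
        if alen < 5 or end > len(buf):
            continue
        afi, safi, nh_len = struct.unpack_from("!HBB", buf, body)
        max_bits = NEGOTIATED.get((afi, safi))
        reserved = body + 4 + nh_len  # where the 0x00 octet should be
        if max_bits is None or reserved >= end or buf[reserved] != 0x00:
            continue
        prefixes = parse_prefixes(buf[reserved + 1:end], max_bits)
        if prefixes is not None:
            return afi, safi, prefixes
    return None
```

Every check that fails just moves the scan forward one octet, so a false
match has to get the type code, a negotiated AFI SAFI, the 0x00 and an
exactly-fitting list of prefixes all right at once.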

So for updates to prefixes you might be able to find the MP NLRIs with a
good degree of confidence even if another Attribute is damaged.

With MP_UNREACH_NLRI you can similarly search for a known pattern and hope
to find the MP NLRIs.
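
The same idea for MP_UNREACH_NLRI, again only as a sketch with
hypothetical names.  There is no next hop and no reserved 0x00 octet to
anchor on here, so the only checks available are the type code, a
negotiated AFI SAFI, and prefixes that fill AttrLen exactly.

```python
import struct

MP_UNREACH_NLRI = 0x0F  # attribute type code 15
EXT_LEN_FLAG = 0x10     # Extended Length bit in AttrFlags

# Assumption: (AFI, SAFI) pairs negotiated on this session, mapped to
# the longest sane PfxLen in bits.
NEGOTIATED = {(1, 1): 32, (2, 1): 128}


def scan_mp_unreach(buf, start=0):
    """Return (afi, safi, withdrawn) for the first plausible match, else None."""
    for i in range(start, len(buf) - 2):
        if buf[i + 1] != MP_UNREACH_NLRI:
            continue
        ext = bool(buf[i] & EXT_LEN_FLAG)
        hdr = 4 if ext else 3
        if i + hdr > len(buf):
            continue
        alen = struct.unpack_from("!H", buf, i + 2)[0] if ext else buf[i + 2]
        body, end = i + hdr, i + hdr + alen
        # AFI(2) + SAFI(1) is the minimum attribute body.
        if alen < 3 or end > len(buf):
            continue
        afi, safi = struct.unpack_from("!HB", buf, body)
        max_bits = NEGOTIATED.get((afi, safi))
        if max_bits is None:
            continue
        withdrawn, pos, ok = [], body + 3, True
        while pos < end:
            bits = buf[pos]
            nbytes = (bits + 7) // 8
            if bits > max_bits or pos + 1 + nbytes > end:
                ok = False  # prefix is too long or overruns AttrLen
                break
            withdrawn.append((bits, bytes(buf[pos + 1:pos + 1 + nbytes])))
            pos += 1 + nbytes
        if ok:
            return afi, safi, withdrawn
    return None
```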

If you DO get tricked by the damaged packet, you will cause yourself to
withdraw routes that had nothing to do with the malfunction.  This is
unfortunate just to avoid the chance of loops on what is hopefully a
limited number of prefixes.

If the MP_REACH_NLRI or MP_UNREACH_NLRI itself is damaged then you will
have a hard time finding the prefixes.  Maybe they won't even be there at
all, so no clever pattern match to look for them would be helpful.
Whatever the sending side did to send this bad message is unknown.

For problems with native, non-MP Messages, you really do not have as much
context while you are looking for the NLRI.  It will be hard to find them
without false-positives.  Fortunately you still have the opportunity for a
fairly good sanity check if you simply look from the end of the Message
backwards, and do a bit of consistency checking.  This sounds
computationally expensive but I think it actually isn't.  It would not be
very hard to write an implementation of this search and test its speed.
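
One possible shape for that backward search, as a sketch only.  It walks
from the last octet toward the front and marks every offset from which
the rest of the Message parses exactly as IPv4 (PfxLen Pfx) pairs.  The
function names, the min_start parameter and the zero-padding check are
my assumptions; in particular, a sender is not required to zero the
spare bits in the last octet of a prefix, so that check is only a
heuristic.

```python
def prefix_end(msg, pos, max_bits=32):
    """Return the offset just past a sane prefix at pos, or None."""
    bits = msg[pos]
    nbytes = (bits + 7) // 8
    nxt = pos + 1 + nbytes
    if bits > max_bits or nxt > len(msg):
        return None
    # Heuristic (assumption): spare bits past PfxLen are usually zero.
    spare = nbytes * 8 - bits
    if spare and msg[nxt - 1] & ((1 << spare) - 1):
        return None
    return nxt


def candidate_nlri_starts(msg, min_start):
    """Offsets >= min_start from which msg parses cleanly to its end."""
    n = len(msg)
    parses = [False] * (n + 1)
    parses[n] = True  # an empty tail is trivially consistent
    # One pass from the end backwards: each offset is examined once.
    for pos in range(n - 1, min_start - 1, -1):
        nxt = prefix_end(msg, pos)
        parses[pos] = nxt is not None and parses[nxt]
    return [p for p in range(min_start, n) if parses[p]]
```

The work is linear in the Message length.  It can return more than one
candidate offset, which is the false-positive problem in practice, so
the caller still has to narrow the list with whatever else it trusts,
for example the earliest offset at which the attributes could have
ended.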

In any case, this code will not be executed except when damaged updates are
encountered, and at that point, you can compare the computational expense
of guessing around the problem, to the expense of resetting sessions over
and over and potentially creating work all through your network.

3# SESSION-RESET: everyone understands what this behavior does.

If my analysis above is correct, you could easily make an argument for
un-safe ignore.  I think un-safe withdraw is a good OPTION because I
usually believe in giving vendors lots of flexibility to offer knobs to the
operator.  The vendor can certainly decide how robust his un-safe withdraw
checks will be.  Operators can turn the knob whatever way they want, but
probably IGNORE will be smart in almost all cases.  If I could only have
one of those two things I would rather have IGNORE.

-- 
Jeff S Wheeler <jsw@inconcepts.biz>
Sr Network Operator  /  Innovative Network Concepts