Re: [rrg] RRG to hibernation

Shane Amante <shane@castlepoint.net> Sun, 11 November 2012 01:39 UTC

From: Shane Amante <shane@castlepoint.net>
In-Reply-To: <C64A3635-DE95-41F6-A70C-43597EB58CBB@tcb.net>
Date: Sat, 10 Nov 2012 18:38:57 -0700
Message-Id: <81767641-8399-466D-A9F2-F2C07D3BBE0C@castlepoint.net>
References: <20121110032942.BD27018C113@mercury.lcs.mit.edu> <4C845B01-B282-46FB-A4B8-7ADDBCC9C6E5@tcb.net> <B80A8335-49BD-4B90-A024-FA82B1E8EE5F@tony.li> <C64A3635-DE95-41F6-A70C-43597EB58CBB@tcb.net>
To: rrg@irtf.org

On Nov 10, 2012, at 10:35 AM, Danny McPherson <danny@tcb.net> wrote:
> On Nov 10, 2012, at 12:24 PM, Tony Li wrote:
[--snip--]
>> I agree that some security needs to be deployed.  I'm not convinced that it needs to be BGPSEC.  We've muddled along for many years and never found the gumption to actually deploy anything.  Must not be important to people.  I don't get it, but that's the observable behavior.  
>> 
>> In any case, this doesn't seem like a research topic.  This is pretty clearly an engineering issue.
> 
> I don't agree.  The engineering solution that SIDR is actively working (RPKI-enabled BGPSEC) is pumping out standards track RFCs like there's no tomorrow.  The USG has stated intentions of "expediting secure routing work through the Internet standard process" and "fostering adoption through government procurement vehicles".  
> 
> As an operator this scares the hell out of me, especially considering what they've designed is largely a system to control "what's routed on the Internet and by whom".  They can't seem to do anything in BGP(SEC) without introducing the equivalent of "periodic updates", and undoing all the goodness of things like update packing completely.  
> 
> Some serious thinkers working on this problem would be goodness...

Let me add that I share Danny's concerns ...

However, let me try to take a step back and share with everyone a much broader set of, potentially, architectural concerns that I'm not sure this RG considered during the last round.

BGP was originally designed for flooding of reachability information.  But, reachability information is the end-result /after/ the application of _routing_policy_, describing "intent", by operators of individual networks based on various contractual agreements they have with the parties with whom they directly interconnect.  Assuming you agree with this premise, this presents a paradox from a security PoV.  Specifically, if a downstream network does not have visibility into its upstream network's routing policy, is it practical/feasible for the downstream network to understand the _intended_ propagation of reachability information and, ultimately, connectivity?  Furthermore, is it feasible to carry such information within the control plane itself?  Or, should the control plane be relegated to carrying [strictly] reachability information in real-time, while offboard systems carry accompanying routing policy and security information in order to assist in making "optimal" Inter-Domain routing/forwarding decisions?

A second concern is also related to the original design of BGP and what it has organically evolved into, today.  Specifically, BGP is /also/ now being tasked as a generic "message bus" and service discovery mechanism.  Not to pick on anyone, in particular, but the following are recent examples that come to my mind wrt this trend:
http://tools.ietf.org/html/draft-ietf-idr-ls-distribution-01
http://tools.ietf.org/html/draft-ietf-idr-operational-message-00
... and, there may be others.  Although, contrast those proposals with what should be most concerning to people in this RG, and in the IETF:
http://tools.ietf.org/html/draft-ietf-grow-ops-reqs-for-bgp-error-handling-05
In short, operators (such as myself) are _extremely_ concerned that a single erroneous update results in a complete reset of BGP sessions.  Due to the overwhelming success of BGP, it's now (and, has been for a while) a mission-critical protocol, thus such catastrophic session resets -- caused by a single malformed UPDATE -- are widely visible/impactful.  This impact is compounded by the 'cost to recover'.  Namely, due to the large and growing amount of information in the RIB (again, not just reachability, but also service-discovery and completely orthogonal information), it takes longer to exchange RIB information and, ultimately, restore services.  Is this really the best we, as an industry, can do?

While the IETF IDR WG has been looking at mechanisms for how BGP may defend against certain types of erroneous BGP UPDATEs for external BGP sessions:
http://tools.ietf.org/html/draft-ietf-idr-error-handling-02
... there does not appear to be any [straightforward] answer with respect to internal BGP sessions, given the requirement that BGP speakers internal to an AS must have a globally consistent RIB and FIB, otherwise packet forwarding loops will result.  And, in my personal operational experience it's _rarely_ the case that malformed UPDATEs are detected at the first ASBR (attached to an eBGP neighbor) in my AS, thus I am concerned that mechanisms such as draft-ietf-idr-error-handling-02 are not an adequate solution to the problems we experience.  IOW, as an operator I desire "defense in depth" where a heterogeneous mix of vendor equipment (HW + SW), participating as interior BGP speakers, have mechanisms to detect *and* automatically recover from malformed UPDATEs received over iBGP sessions.  This is another area that I would point research colleagues toward.
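
To make the contrast concrete, here is a minimal sketch of the two ways a speaker might react to a malformed UPDATE: resetting the whole session, or withdrawing only the prefixes named in the bad message (the "treat-as-withdraw" idea discussed in the IDR error-handling work).  Everything in it is an illustrative assumption: the names Policy, Update, Session and replay are hypothetical, the "malformed" flag stands in for real attribute validation, and it assumes the prefixes in the bad UPDATE can still be parsed, which is not always true in practice.

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    SESSION_RESET = "session-reset"          # any malformed UPDATE tears the session down
    TREAT_AS_WITHDRAW = "treat-as-withdraw"  # drop only the prefixes in the bad UPDATE


@dataclass
class Update:
    prefixes: list
    malformed: bool = False  # hypothetical stand-in for real attribute checks


@dataclass
class Session:
    policy: Policy
    rib: set = field(default_factory=set)
    resets: int = 0

    def receive(self, update):
        if not update.malformed:
            # Well-formed UPDATE: install the announced prefixes.
            self.rib.update(update.prefixes)
        elif self.policy is Policy.SESSION_RESET:
            # Everything learned over this session is lost and must be re-exchanged.
            self.rib.clear()
            self.resets += 1
        else:
            # Only the prefixes carried in the malformed UPDATE are removed.
            self.rib.difference_update(update.prefixes)


def replay(policy, updates):
    """Feed the same sequence of UPDATEs to a fresh session under the given policy."""
    session = Session(policy)
    for update in updates:
        session.receive(update)
    return session


updates = [
    Update(["192.0.2.0/24", "198.51.100.0/24"]),
    Update(["203.0.113.0/24"], malformed=True),
]
after_reset = replay(Policy.SESSION_RESET, updates)
after_withdraw = replay(Policy.TREAT_AS_WITHDRAW, updates)
print(len(after_reset.rib), after_reset.resets)
print(sorted(after_withdraw.rib), after_withdraw.resets)
```

Under the reset policy the one bad message costs the entire table learned over that session, while under treat-as-withdraw the two good prefixes survive.  The sketch models a single session only; it says nothing about the harder problem above, namely keeping the RIB and FIB consistent across iBGP speakers that may each judge the same UPDATE differently.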

So, this raises the classic conundrum: increasing complexity and increasing RIB (and FIB) size, coupled with a contrasting need from operators who are concerned about the robustness of the protocol and require that it NOT sustain any failures[1].  Something's got to give.

Ultimately, this makes me question whether growth of RIB (and, FIB) size is still the _only_ thing this RG should be (primarily?) focused on.  Rather, will the requirements for:
a) operational robustness, in the face of critical messaging errors in an Inter-Domain Routing Protocol, which the IETF may be unable to address on its own;
b) designing security as a first-class principle of an Inter-Domain Routing Protocol -- either carried within or outside of control-plane reachability information;
c) increased scalability of RIB (and, other?) information
... lead us down a path of considering we may be approaching the end-of-the-road for BGPv4 and we need something new?

Does anyone on this list share similar concerns wrt operational robustness, time to recovery and (then) scalability of BGPv4?

-shane

[1] It is not cool to suggest that operators should just stop asking for new features and we wouldn't have this problem.  :)