Re: [rrg] routing security and scale impacts (was RRG to hibernation)

Russ White <russw@riw.us> Wed, 21 November 2012 12:36 UTC

+1

I might even have some ideas on where to get some ideas, if we can convince the researchers in question to come forward. We could start with a requirements doc, which I'd be willing to co-author with someone once I replace the hard drive in my computer.

Russ


On Nov 13, 2012, at 9:09 AM, "George, Wes" <wesley.george@twcable.com> wrote:

> Changing subject line to reflect topic
> 
> Shane has articulated a number of concerns that I think would be useful for RRG to spend some time working on, and I tend to agree with Danny that the current BGPSec solution seems to be more about "hacking at the edges" to get something that is marginally better in some ways than the [lack of] security that we have now, potentially ignoring the known scaling problems this group has discussed at length all the while doing several things likely to exacerbate them. It gives me concern about whether it will see significant deployment due to the large amount of required investment vs the potential benefit. I know I have asked more than once about the scaling implications of BGPSec since it potentially makes a large impact in the footprint of the routing data that must be stored and managed, and haven't exactly been pleased with the answers even though some analysis has been done to show that it's not a bad thing.
> 
> If I were to distill things down, today we have a growth curve for both the routing table (both RIB and FIB) and for cost-effective hardware with the horsepower necessary to manage it (CPU, ASIC, memory, etc). SIDR is likely not the one thing that will break the routing system by causing those curves to cross, but it certainly changes the curves' pitch such that it's more likely that the cost of keeping up with the demands of the system starts becoming unmanageable, even if it doesn't actually reach the limits of the technology. The investment in a network for scale and growth is incremental, and SIDR's full justification is that those incremental upgrades will bring hardware that can support its needs organically. However, things like BGPSec or other disturbances that increase the growth curve of the routing table and related scaling vectors mean that as an operator, I have to shorten my upgrade cycle, spend capital earlier than originally projected, possibly even to the point where I can't manage an entire depreciation cycle (5-7 years) before needing to spend additional money on upgrades. In a network that is driven by commoditization of prices, that's not a good position to be in.
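
[Editor's illustration] The upgrade-cycle argument above can be made concrete with a little arithmetic. The sketch below assumes simple compound annual growth and a fixed hardware capacity; the function name and every number in it (table size, capacity, growth rates) are hypothetical and are not taken from the thread or from any measurement.

```python
import math


def years_until_full(current_routes, capacity, annual_growth):
    """Years until a table growing at a compound annual rate reaches capacity.

    Solves current_routes * (1 + annual_growth) ** t == capacity for t.
    """
    if current_routes >= capacity:
        return 0.0
    return math.log(capacity / current_routes) / math.log(1.0 + annual_growth)


# Hypothetical inputs, chosen only to show the shape of the argument.
current_routes = 450_000   # assumed routes carried today
capacity = 1_000_000       # assumed capacity of the installed hardware
depreciation_years = 5     # low end of the 5-7 year cycle mentioned above

baseline = years_until_full(current_routes, capacity, 0.12)  # assumed 12%/year
steeper = years_until_full(current_routes, capacity, 0.20)   # assumed 20%/year

for label, years in (("baseline", baseline), ("steeper", steeper)):
    verdict = "outlasts" if years >= depreciation_years else "falls short of"
    print(f"{label}: {years:.1f} years of headroom, "
          f"{verdict} a {depreciation_years}-year depreciation cycle")
```

Under these assumed figures, the steeper curve exhausts the same hardware before the depreciation cycle ends, which is the situation described above: capital has to be spent earlier than projected even though the technology limit itself has not been reached.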
> 
> Additionally, as Shane alluded to in another message, this isn't simply about DFZ scale, but also internal scale, where there are commonly a *LOT* more routes being carried by your average router inside an ISP's network. There are also other considerations like the rate of updates due to background churn vs during an event, other things that the control plane must manage simultaneously, etc. Taking a step even further away from where RRG has been previously focused, there is a similar sort of scaling problem within the L3VPN space that is typically self-contained within the SP's network. While I think there are some engineering solutions that may help with the short-term scaling issues, there may also be some meat for research in the area of modeling and instrumentation of the routing system to give SPs better tools to use their available capacity efficiently, and possibly even changes to help the routing control plane degrade more gracefully and deterministically. The L3VPN discussion is detailed in draft-gs-vpn-scaling-01 (an -02 rev is due soon, waiting on co-author review and a few more updates), specifically in sections 6 and 6.5 for the modeling/instrumentation, and in sections 4 and 5 for ways that the control plane tends to break down at scale limits.
> 
> Thanks,
> 
> Wes George
> 
> 
> 
>> -----Original Message-----
>> From: rrg-bounces@irtf.org [mailto:rrg-bounces@irtf.org] On Behalf Of
>> Shane Amante
>> Sent: Saturday, November 10, 2012 8:39 PM
>> To: rrg@irtf.org
>> Subject: Re: [rrg] RRG to hibernation
>> 
>> 
>> On Nov 10, 2012, at 10:35 AM, Danny McPherson <danny@tcb.net> wrote:
>>> On Nov 10, 2012, at 12:24 PM, Tony Li wrote:
>> [--snip--]
>>>> I agree that some security needs to be deployed.  I'm not convinced
>> that it needs to be BGPSEC.  We've muddled along for many years and
>> never found the gumption to actually deploy anything.  Must not be
>> important to people.  I don't get it, but that's the observable
>> behavior.
>>>> 
>>>> In any case, this doesn't seem like a research topic.  This is pretty
>> clearly an engineering issue.
>>> 
>>> I don't agree.  The engineering solution that SIDR is actively working
>> (RPKI-enabled BGPSEC) is pumping out standards track RFCs like there's
>> no tomorrow.  The USG has stated intentions of "expediting secure
>> routing work through the Internet standard process" and "fostering
>> adoption through government procurement vehicles".
>>> 
>>> As an operator this scares the hell out of me, especially considering
>> what they've designed is largely a system to control "what's routed on
>> the Internet and by whom".  They can't seem to do anything in BGP(SEC)
>> without introducing the equivalent of "periodic updates", and undoing
>> all the goodness of things like update packing completely.
>>> 
>>> Some serious thinkers working on this problem would be goodness...
>> 
>> Let me add that I share Danny's concerns ...
>> 
>> However, let me try to take a step back and share with everyone a much
>> broader set of, potentially, architectural concerns that I'm not sure
>> this RG considered during the last round.
>> 
>> BGP was originally designed for flooding of reachability information.
>> But, reachability information is the end-result /after/ the application
>> of _routing_policy_, describing "intent", by operators of individual
>> networks based on various contractual agreements they have with parties
>> whom they directly interconnect.  Assuming you agree with this premise,
>> this presents a paradox from a security PoV.  Specifically, if a
>> downstream network does not have visibility into its upstream network's
>> routing policy is it practical/feasible for the downstream network to
>> understand the _intended_ propagation of reachability information and,
>> ultimately, connectivity?  Furthermore, is it feasible to carry such
>> information within the control plane itself?  Or, should the control
>> plane be relegated to carrying [strictly] reachability information in
>> real-time, while offboard systems carry accompanying routing policy and
>> security information in order to assist in making "optimal" Inter-Domain
>> routing/forwarding decisions?
>> 
>> A second concern is also related to the original design of BGP and what
>> it has organically evolved into, today.  Specifically, BGP is /also/
>> now being tasked as a generic "message bus" and service discovery
>> mechanism.  Not to pick on anyone, in particular, but the following are
>> recent examples that come to my mind wrt this trend:
>> http://tools.ietf.org/html/draft-ietf-idr-ls-distribution-01
>> http://tools.ietf.org/html/draft-ietf-idr-operational-message-00
>> ... and, there may be others.  Although, contrast those proposals with
>> what should be most concerning to people in this RG, and in the IETF:
>> http://tools.ietf.org/html/draft-ietf-grow-ops-reqs-for-bgp-error-handling-05
>> In short, operators (such as myself) are _extremely_ concerned that a
>> single erroneous update results in a complete reset of BGP sessions.
>> Due to the overwhelming success of BGP, it's now (and, has been for a
>> while) a mission-critical protocol, thus such catastrophic session
>> resets -- caused by a single malformed UPDATE -- are widely
>> visible/impactful.  This impact is compounded by the 'cost to recover'.
>> Namely, due to the large and growing amount of information in the RIB
>> (again, not just reachability, but also service-discovery and completely
>> orthogonal information), it takes longer to exchange RIB information
>> and, ultimately, restore services.  Is this really the best we, as an
>> industry, can do?
>> 
>> While the IETF IDR WG has been looking at mechanisms for how BGP may
>> defend against certain types of erroneous BGP UPDATEs for external BGP
>> sessions:
>> http://tools.ietf.org/html/draft-ietf-idr-error-handling-02
>> ... there does not appear to be any [straightforward] answer with
>> respect to internal BGP sessions, given the requirement that BGP
>> speakers internal to an AS must have a globally consistent RIB and FIB,
>> otherwise packet forwarding loops will result.  And, in my personal
>> operational experience it's _rarely_ the case that malformed UPDATEs
>> are detected at the first ASBR (attached to an eBGP neighbor) in my AS,
>> thus I doubt that mechanisms such as draft-ietf-idr-error-handling-02
>> are an adequate solution to the problems we experience.
>> IOW, as an operator I desire "defense in depth" where a heterogeneous
>> mix of vendor equipment (HW + SW), participating as interior BGP
>> speakers, have mechanisms to detect *and* automatically recover from
>> malformed UPDATEs received over iBGP sessions.  This is another area
>> that I would point research colleagues toward.
>> 
>> So, this raises the classic conundrum of: increasing complexity,
>> increasing RIB (and FIB) size information coupled with a contrasting
>> need from operators who are concerned about the robustness of the
>> protocol and the requirement to NOT sustain any failures[1].
>> Something's got to give.
>> 
>> Ultimately, this makes me question whether it's no longer _just_ growth
>> of RIB (and, FIB) size that this RG should be (primarily?) focused on.
>> Rather, will the requirements for:
>> a) operational robustness, in the face of critical messaging errors in
>> an Inter-Domain Routing Protocol, which the IETF may be unable to
>> address on its own;
>> b) designing security as a first-class principle of an Inter-Domain
>> Routing Protocol -- either carried within or outside of control-plane
>> reachability information; and
>> c) increased scalability of RIB (and, other?) information ... lead us
>> down a path of considering we may be approaching the end-of-the-road for
>> BGPv4 and we need something new?
>> 
>> Does anyone on this list share similar concerns wrt operational
>> robustness, time to recovery and (then) scalability of BGPv4?
>> 
>> -shane
>> 
>> [1] It is not cool to suggest that operators should just stop asking for
>> new features and we wouldn't have this problem.  :)
>> _______________________________________________
>> rrg mailing list
>> rrg@irtf.org
>> http://www.irtf.org/mailman/listinfo/rrg
> 