Re: [Idr] dear diary: well-known community vs new path attribute

Hi, Job,

Your message was very well written, well thought-out, and presents a good
case for use of communities.

I think it makes one or two assumptions that turn out to be incorrect.

I will try to address a couple of issues that I think need to be
discussed. (Sorry, it is a long message.)

First, a point of clarification - in Sri's presentation, the slide where he
is discussing "community vs attribute", the context is ONLY for intra-AS,
not inter-AS, usage.

This is also in scope for the "open" draft, which includes a non-transitive
attribute (iOTC, internal only to customer), and is probably more relevant
in that discussion. (For that case, I would point out that they are not
mutually exclusive, and for any AS already using communities, the impact of
adding the Open and the iOTC attribute would be classified as "belt and
suspenders": redundant methods of protection against accidental leaks.)

Second are the propagation characteristics of communities (RFC 1997 in
particular) vs. any new path attribute.

It is unfortunately the case, that in at least one major vendor's
implementation, propagation of communities is off by default. This means
that even if a new well-known community were agreed upon, the usefulness of
it would be extremely limited until every AS using firmware that does not
propagate-by-default, has changed the configuration on every BGP neighbor
to propagate communities.

In contrast, the BGP protocol RFCs require optional transitive attributes
which are not understood to be propagated. If the rules for a particular
new attribute also require the attribute to be propagated (regardless of
the attribute's value(s)), a new attribute would have global visibility as
soon as it gets deployed by any ASN, and as it gained adoption, the
usefulness of the attribute (based on the attribute values) would increase
monotonically.
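
To make the propagation contrast concrete, here is a minimal Python sketch.
It is a toy model under stated assumptions, not a description of any
vendor's implementation: the AS class, the sends_communities flag, and the
"feature_x" names are all hypothetical, and the route is just a dict.

```python
from dataclasses import dataclass


@dataclass
class AS:
    """A hypothetical AS on the path; the flag models per-neighbor config."""
    name: str
    sends_communities: bool = False  # assumption: models "off by default"


def propagate(route, path):
    """Forward a route through each AS in path; return what arrives at the end."""
    current = dict(route)
    for hop in path:
        forwarded = {"prefix": current["prefix"]}
        # Optional transitive attributes are passed on even when the
        # hop does not understand them.
        if "opt_transitive" in current:
            forwarded["opt_transitive"] = current["opt_transitive"]
        # Communities survive only if this hop is configured to send them.
        if hop.sends_communities and "communities" in current:
            forwarded["communities"] = current["communities"]
        current = forwarded
    return current


route = {
    "prefix": "192.0.2.0/24",
    "communities": ["feature_x"],            # hypothetical well-known community
    "opt_transitive": {"feature_x": True},   # hypothetical new path attribute
}
# B keeps the default and does not send communities to its neighbors.
path = [AS("A", sends_communities=True), AS("B"), AS("C", sends_communities=True)]
received = propagate(route, path)
print(sorted(received))
```

In this model a single AS with the default setting (B) is enough to strip
the community for everything downstream, while the attribute arrives intact.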

The _Feature_X_ value IS in its propagation, NOT in its active use. This
distinction is very important, and tips the equation in favor of
attributes. Note also, this does NOT preclude the use of communities, even
before the attribute is available in shipping code. It does mean that the
usefulness is significantly restricted outside of the contiguous zone of
bidirectional community usage. For example, does every ISP that accepts
communities also send them? If that is not the case, a community-based
implementation would suffer.

The presentation and current state of the I-D are not necessarily clear
(and possibly not correct) on the intended behavior. I will be working with
my co-authors to correct these problems. The correct rules for the
attribute are as follows:

   - Session-level Role is used to control marking and detection rules
   applicable for that session.
   - Marking and detection is per-prefix.
   - Some details on interactions with other BGP mechanisms need more work
   (I'll be working with my co-authors on that), such as atomic-aggregate.
   - Incremental deployment is a fundamental characteristic, and is
   intended to incrementally improve detection, without causing false
   positives if deployed consistently with actual neighbor roles
   - If a given neighbor does not support the Role attribute (during Open),
   the locally configured Role should still be used:
      - Since the inbound marking will not exist on the last hop, the
      locally configured Role should be used to apply an inbound marking as if
      the Role had been negotiated and the prefix marked by the neighbor
      accordingly
      - Inbound detection based on the Role should be done.
   - Of the four Role types, this unilateral Role setting (if not
   understood by a peer) will work correctly for three (peer, customer,
   transit) but not the fourth (complex)
      - Complex Role, and complex behavior requires successful negotiation
      with a peer that supports Role, and is also configured for complex
      - Complex is the one Role where marking is dependent on prefixes
      having existing markings, or local origination.
      - Complex
   - All marking is intended to be automatic, with no operator input other
   than Role setting towards neighbors
      - It may be the case that implementations provide a "safety-valve"
      during early deployments, to disable marking to handle situations where
      implementation errors are discovered
      - iBGP peers and confederation peers should probably be implemented
      to automatically have role Complex
      - Sensible defaults should be used to avoid creating the problem
      being solved.
      - Default Role, if any, should be Peer (or maybe Transit).
      - The combination of existing marking and outbound Role type must
      also prevent announcement or propagation of leaks
         - During ongoing WG discussion and revisions, the consensus on
         "must" might not be reached, in which case "should" is the
         minimum, with local policy override possible
         - Regardless of announcement, the received marking MUST NOT be
         modified, so that other parties are aware of the prefix's status
      - As long as a prefix is marked at some point as having been received
      from a peer or transit neighbor, it will be possible to detect that a
      leak has occurred by a receiving party whose neighbor is "peer" or
      "customer".
   - Two parties benefit in deploying the attribute immediately, by being
   protected against leaks of their announced routes by any direct or indirect
   mutual customer.
      - This happens even without coordination between the two parties.
   - Until very widely deployed, leaks will still be possible, however:
      - The scope of leaks will be reduced to the set of unmarked prefixes
      - Unmarked prefixes can only be heard from ISPs who don't mark, or
      from their customer cone(s)
         - At that point, the feedback mechanism provides strong incentive
         to mark
         - The bigger an ISP is, the more impact to a leak beneath them
         there is, and the more incentive there is to mark
      - The set of ISPs who don't mark could be significantly reduced
      if/when individual IXPs adopt policies that require marking
      - ISPs who require peers and customers to mark would also have
      significant impact on potential sources
   - Early deployment by large ISPs (preferably Tier-1) would directly
   benefit themselves, and impact the cone of leak-resistant networks.
   - There is benefit even to leaf networks, in protecting themselves from
   leaks heard from transit providers
      - Suppose a leaf network has providers A and B
      - Suppose a leak is propagated by B, but not by A
         - Suppose B only has one path to the leaked prefix(es) and elects
         to propagate a known leak
      - By filtering out (automatically) prefixes from B that are leaks:
         - The leaf network can send traffic to those prefixes via A
         - The path through A is highly likely to not be adversely affected
         by traffic following the leak
         - This is still per-prefix behavior

The incremental deployment does require setting the Role without the
benefit of negotiation. In the case of correctly applied Roles, the
automated per-prefix marking and detection works. The marking controls the
internal propagation behavior to prevent leak origination, and depending on
existing markings, may prevent leak propagation. Similarly, detection may
allow inbound detection of leaks or propagated leaks, making filtering
possible.

Unilateral Role-setting (and "proxy" marking) allows gap-filling in
late-deployment scenarios.
Here are the behaviors of unilateral role setting, at different stages of
partial deployment:

   - Suppose only one AS, X, does not participate
   - All of X's neighbors do participate
      - Every prefix propagated through X will be marked, including with
      values that correspond to the receiver's configured Role assignment of X
      - If all the Role assignments toward X are correct, then:
         - Leaks propagated and not blocked by X (which would have been
         blocked, if X implemented) will be blocked by every neighbor of X upon
         receipt (modulo local policy on the recipient)
         - Leaks by X will be detected and blocked (modulo local policy)
      - If a Role assignment toward X is wrong:
         - X is transit, configured as customer
            - Sender sends all routes, leak toward X; X may filter (no
            impact), otherwise everyone is impacted incl. sender's
            transits and peers
            - X sends all routes, recipient leaks all routes - X, X's
            peers/transits, and recipient are all impacted
         - X is peer, configured as customer
            - Sender sends all routes, leaks; impact is sender's upstreams
            - X sends X's customer's routes, recipient leaks to
            peers/transits - X and recipient + upstreams affected
         - X is transit, configured as peer
            - Sender sends only customer prefixes - correct behavior
            - X sends all routes, sent only to customers - correct
            behavior, except X's transit's prefixes are blocked
         - X is peer, configured as transit
            - Sender sends only customer prefixes - correct behavior
            - X sends only customer routes, sent only to customers -
            correct behavior
         - X is customer, configured as transit
            - Sender sends only customer prefixes - subset of expected
            prefixes - only X is impacted
            - X sends only customer prefixes - not sent to peers/transit,
            only X is impacted
         - X is customer, configured as peer
            - Sender sends only customer prefixes - subset of expected
            prefixes - only X is impacted
            - X sends only customer prefixes - not sent to peers/transit,
            only X is impacted
   - Suppose every AS that does not participate, has only
   neighbors that do participate
   - The same situation occurs as in the "only one AS does not participate":
      - A leak has to originate somewhere
         - If the leak originates on a non-participant, it gets detected
         and blocked by all its neighbors, unless the unilateral Role of the
         non-participant is incorrect on the receiver side
            - Role error on the sender side, would be an instance of "leak
            originates on a participant", below
            - Leaked non-local prefixes which are first correctly marked by
            a participant, and leaked by the non-participant:
               - Transit leaked to transit-marked-as-transit: sent only to
               customers, impact is leaker and its upstreams, and customers
               - Transit leaked to peer-marked-as-transit: same
               - Peer leaked to transit-marked-as-transit: same
               - Peer leaked to peer-marked-as-transit: same
               - Transit leaked to transit-marked-as-peer: same
               - Peer leaked to peer-marked-as-transit: same
            - In all cases, impacts are to non-participant, and to
            participant making error in manual Role setting
         - If the leak originates on a participant, it means there was a
         Role mismatch between that participant, and a non-participant
            - Marking a customer as a peer or provider fails "safe" - only
            customer routes are received, and get blocked towards
            peers or providers
            - Marking a peer or provider as a customer, results in the
            following:
               - Previously unmarked prefixes (local to the Peer or Provider)
               get marked as "customer"; only the Peer or Provider
               itself is affected
               - Prefixes marked by the non-participant's neighbors get
               marked, then detection rules apply:
                  - non-participant's peer/provider routes are detected as
                  leaks, and are blocked (modulo local policy)
                  - non-participant's customers' routes are not considered
                  leaks; only non-participant is affected
   - Note that all of the failures (in the singleton
   contiguous non-implementer topologies) are induced by errors in Role
   setting, and possibly a combination of Role error and leakage.
      - This provides incremental incentive to request peers to implement
      and configure their Role
      - The recommended method is to keep any existing measures in place to
      prevent leaking, and identify marking errors seen inbound
      - Once all Roles are set correctly, no further leaks can occur if all
      neighbors' neighbors either implement or have their Roles set correctly.
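
The "sender" half of the wrong-Role cases above can be sketched in a few
lines of Python. This is only an illustration under assumptions: it uses
the conventional export rule (local and customer routes go to everyone,
peer/transit routes go only to customers), and the function names and the
(prefix, learned-from-role) tuples are hypothetical.

```python
def may_export(learned_from_role, to_role):
    """Export check; learned_from_role is None for locally originated routes."""
    if learned_from_role in (None, "customer"):
        return True                # local and customer routes go to everyone
    return to_role == "customer"   # peer/transit routes go only to customers


def exported(routes, to_role):
    """Return the prefixes sent to a neighbor configured with to_role."""
    return [prefix for prefix, role in routes if may_export(role, to_role)]


routes = [
    ("192.0.2.0/24", None),           # locally originated
    ("198.51.100.0/24", "customer"),
    ("203.0.113.0/24", "peer"),
    ("2001:db8::/32", "transit"),
]

# X is really a transit but is configured as customer: everything is sent,
# which is the "sender sends all routes, leak toward X" case.
to_x_as_customer = exported(routes, "customer")
# X is really a transit but is configured as peer: only local and customer
# routes are sent, which is the "correct behavior" case.
to_x_as_peer = exported(routes, "peer")
print(to_x_as_customer)
print(to_x_as_peer)
```

The asymmetry is the point: configuring a neighbor as something "higher"
than it is fails safe, while configuring it as a customer does not.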

I realize that a lot of this needs to go in the document(s), but I hope I
have at least indicated that this has had some thought given to it, and
that I don't yet see situations where this makes anything worse, and in
most cases offers benefit even at relatively sparse deployment.

Brian

On Fri, Mar 31, 2017 at 9:01 PM, Job Snijders <job@ntt.net> wrote:

> Dear colleagues,
>
> In today's IDR session (thank you for the orderly meeting, chairs &
> secretary!) the topic of 'well-known BGP community vs BGP Path
> attribute' came up, in context of having a marker to signify or signal a
> route's audience [Sriram]
> https://datatracker.ietf.org/meeting/98/agenda/idr/
>
> I've come to the conclusion that we have no choice other than to use
> well-known communities for boolean functions like the ones currently on
> the table.
>
> There are a number of significant advantages to using communities, that
> in my opinion outweigh the perceived benefit of introducing a new path
> attribute. This is asserted from a deployment process and adoption rate
> perspective. Throughout this email I'll refer to the _function_ of the
> path attribute/community as "feature X". Most reasons relate to
> "accelerated" deployment rates. With "accelerated" I'm referring to a
> 2-3 year timescale rather than 8+ years. I'll share my analysis below.
>
> We have to assume that there are many networks which might (from this
> moment on) _never_ upgrade their software to the required newer software
> to support feature X natively. There are a number of reasons why a
> network might never receive the necessary software upgrades: the network
> chose to continue using devices well after the End-of-Sale/Support/Life
> date for economic reasons. Some networks use hardware longer than the
> vendor intended, sometimes because the operator disagreed with the
> vendor's view on longevity. Like some of you, I've travelled to regions
> of the world where a good router is a router without bullet holes in
> it; in such cases you'll have to make things work with whatever was
> loaded on there. In other cases, the vendor simply has gone bankrupt,
> and the network has to make do with what is available on their sparing
> shelves until the hardware is fully amortised.
>
> With the above in mind, I'd argue that for many BGP features it is
> entirely acceptable to state "upgrade your software and receive awesome
> feature Y!". But in the instance of routing security, one might need to
> salvage as much as one can in existing deployments, for altruistic
> reasons.
>
> I think it is fair to assume that in all cases, the BGP speakers will
> support RFC 1997 BGP Communities. We can also assume that the device
> supports neighbor-specific routing policy options to (at the very least)
> match, and subsequently deny or permit based on the RFC 1997 community.
>
> Another interesting (perhaps underappreciated) angle is that there are
> both open source and commercial ancillary configuration management
> systems on the market, which will happily manage devices which were not
> upgraded to support Feature X natively. When vendor B doesn't want to
> implement native support for Feature X, perhaps the ancillary third
> market will support Feature X on vendor B.
>
> A number of considerations apply for well-known BGP communities in context
> of route leak prevention:
>
> - on day 0, there will be no routes tagged with the well-known
>   community, likewise on day 365, there will be a small number of routes
>   tagged with the well-known community.
>
> - throughout the lifetime of feature X, the tagged routes are likely
>   to be outnumbered by the untagged routes, in all contexts.
>
> - ideally feature X can be deployed incrementally within an AS, so it
>   should _add_ an extra layer of protection, rather than replace or
>   hotswap an existing protection function.
>
> - RFC 1997 communities are transitive, so Feature X must at the very
>   least not be significantly hindered by the transitivity, and in an ideal
>   case actually benefit from the transitivity property. Enforcing
>   non-transitivity through an RFC 2119-style "When received on EBGP, MUST
>   delete" is also acceptable. Operators can manually emulate the
>   non-transitivity, and wait for software upgrades to do it for them.
>
> - the presence of a well-known community on one route cannot, and
>   should not, be superimposed on other routes received through the same
>   BGP session. In a more general sense, a well-known community on one
>   route cannot act as a semaphore for the entire session. I am not aware
>   of any implementations which allow matching/acting on one route and
>   performing congruent manipulation of properties on a different route.
>
> - When the well-known community for Feature X is present (aka 'true'),
>   we can assume feature X for the route was enabled intentionally,
>   however, in the case where it is absent, we're dealing with either an
>   'unknown' or 'false'. The deny/accept logic we expect to be present
>   either through manual manipulation or through
>
> We also might be able to assume that networks looking to implement
> Feature X, will do so out of their own volition, sufficiently motivated
> to do so correctly. (Even though they are implementing the feature
> manually!) Likewise, there will be a significant number of networks
> which will not hear about Feature X for the foreseeable future, and only
> receive the feature through software upgrades. In other words: network
> operators who are ignorant of Feature X until they read Release notes,
> could be considered harmless. Furthermore, networks which are well
> intended, might be in a position to rectify erroneous use of the
> well-known community for feature X when received across EBGP sessions.
>
> From my own deployment perspective, when using a BGP Community,
> (disclaimer: merely stating options!) I can start deploying _right_now_,
> _network-wide_ (meanwhile waiting for software upgrades to slowly start
> catching up and replace my manual implementation). More importantly, I
> can deploy in a heterogeneous environment where the timelines for policy
> deployment and software deployment are not aligned, or with parts of the
> network not even expected to receive the required software upgrade for
> native support.
>
> We haven't seen a standards track well-known community in a while, but
> I'd be supportive of a well-known community RFC which demands rigorous
> discipline related to the transitivity, semantics, and would provide
> configuration examples for those who cannot (yet) use Feature X
> natively, with an agreed upon upgrade path to native support for
> feature X.
>
> As long as the benefits of using Feature X are 'egocentric' (aka
> "deploying this concept helps me, and I don't need others to cooperate
> with me"), and misuse of Feature X through the well-known community is
> either merely self-inflicted pain, or harmless, we'll be fine.
>
> Since Feature X is positioned in context of routing security, something
> we'd probably like to see broad adoption on, I'd argue that the lowest
> common denominator should be used: well-known BGP communities.
>
> Kind regards,
>
> Job