Re: [lisp] question on EID reachability (section 6.4)

Robin Whittle <rw@firstpr.com.au> Mon, 23 May 2011 15:50 UTC

Message-ID: <4DDA8260.7020500@firstpr.com.au>
Date: Tue, 24 May 2011 01:50:56 +1000
From: Robin Whittle <rw@firstpr.com.au>
To: lisp@ietf.org
References: <201105201328.p4KDS3v25815@magenta.juniper.net> <D46AA8C0-499F-466D-BA2A-A1D22808A8FC@cisco.com> <201105231404.p4NE4nv55798@magenta.juniper.net>
In-Reply-To: <201105231404.p4NE4nv55798@magenta.juniper.net>

Short version:  Yakov points again to the scalability problems with
                short Map Reply caching times and frequent "RLOC-
                probing" - which are the only mechanisms LISP
                provides for ITRs to be sufficiently responsive to
                changes in ETR liveness and ETR-site connectivity.

                As usual, the answers from Planet LISP avoid the
                fundamental nature of these problems.

                I think the only way to make a core-edge separation
                system (such as LISP, Ivip or whatever) sufficiently
                responsive is if reachability and mapping decisions
                are done outside the system - not by the ITRs at all -
                and if the mapping system is capable of sending Mapping
                Updates to all ITRs which need them in a second or two.


Hi Yakov and Dino,

Yakov, you wrote, in part:


>> If the all the ETR's site-facing interfaces go down, it can set the R- 
>> bit to 0 when sending Map-Requests.
> 
> What would trigger the ETR to send Map-Requests when its link to the site 
> goes down ?

I guess Dino meant to write something like "when responding to Map
Requests from the Map Server".


>> And Robin, one ETR *can know* the other ETR cannot reach the rest of  
>> the site when a link-state routing protocol is used inside the site.
> 
> And when a non link-state routing protocol (e.g., RIP) is used
> inside the site, one ETR may *not* know the other ETR cannot reach
> the rest of the site.

Does the LISP specification require all ETRs for a given EID prefix to
know about the reachability of the end-user network for that EID prefix
from all its ETRs?

If so, what sort of delay times might be acceptable?

AFAIK, LISP doesn't provide any mechanism for this communication between
two or more ETRs.  So if you require it, then this needs to be stated as
a responsibility of the end-user network.  Also, I think it needs to be
shown that this is a reasonable requirement that all end-user networks
can achieve in order to adopt LISP.


>>> Another solution proposed in the above is for the ETR to clear its
>>> locator-status-bit in the encapsulation data header. But it applies
>>> only when the ETR is also an ITR for the traffic going in the
>>
>> Right.
> 
> So, one can not rely on locator-status-bit when the ETR is not also
> an ITR for the traffic going in the opposite direction. 

Yes.


>> Using the R-bit in *either* or *both* the Map-Reply it sends or the  
>> Map-Register it sends to the map-server.
> 
> Wrt setting the R-bit in the Map-Register, how would it help the ITR
> (that already received the mapping from the ETR) to discover that
> the ETR can no longer reach the EID ?

As you noted below, this depends on how long the ITR caches the Map
Reply information and on whether, and how frequently, it does "RLOC
Probing" (http://tools.ietf.org/html/draft-ietf-lisp-12#section-6.3.2).


> Moreover, if the R-bit in the Map-Register is set to 0, then when
> some other ITR requests mapping for the EID, and gets back a Map-Reply
> with the ETR as an RLOC and the R-bit set to 0, would this other
> ITR have to wait for the TTL of the mapping to expire before using
> the ETR as the RLOC (which would be 24 hours with default TTL) ?

I haven't kept up with every change to the LISP spec, but I guess that
is the purpose of the TTL - to inhibit the ITR from bugging the Map
Server again.  So again the LISP designers are back to trying to solve
problems by shortening the caching time of Map Replies.


>>> for the traffic going in the opposite direction, wouldn't that
>>> solution require the ETR to have some traffic going in the opposite
>>> direction ? If yes, then how would the ETR get this traffic, given
>>> that its connection to the site is down ?
>>
>> Note, when RLOC-probing is used and all site-facing interfaces are  
>> down, the RLOC-probe reply, which is a Map-Reply can have the R-bit  
>> set to 0.
> 
> That still does not answer my question. Namely, if one relies on
> setting the locator-status-bit in the traffic going from the
> ETR to the ITR, then how would the ETR get this traffic, given
> that its connection to the site is down ?  

Indeed . . .


> As if not, then one can not rely on the locator-status-bit to
> determine whether the ETR lost its connectivity to the site, and
> therefore the *only* possible mechanism for an ITR to determine
> this is the RLOC-probing. With this in mind the following should
> be added after the first paragraph in 6.4:
> 
>   Another possible failure mode is for an ETR to lose connectivity
>   to the site (e.g., due to the link failure between the ETR and 
>   the site). The only mechanism specified in this document to
>   detect such failure is RLOC-probing.

or perhaps:

  . . . RLOC-probing is the only mechanism specified in this document
  to detect such failure in less time than the period for which the
  ITR caches Map Replies.


RLOC-probing might work OK if there were not many ITRs sending packets
to a given ETR.  But it doesn't scale - so it is not something LISP can
rely upon.

Shortening the TTL of Map Replies which Map Servers send to ITRs is
likewise unscalable since it overloads the ALT (or whatever) system and
the Map Servers.


I think LISP was conceived on the assumption that it is possible - or
that it must be possible - for a scheme like this to work without Map
Servers and/or ETRs (whatever sends Map Replies to ITRs) having a means
of sending the ITR an update to a Map Reply it has already received
when the connectivity situation changes.  This includes when an ETR
dies, or comes alive again, or when an ETR's link to the end-user
network goes down or up.

In the absence of such a "Real Time Mapping Update" mechanism, the only
way ITRs could respond rapidly to any of these changes in connectivity
is one or both of:

  1 - Very short times for ITRs to cache Map Reply messages.

  2 - ITRs frequently probing connectivity by directly bugging the
      ETR with something like an "RLOC-Probe".

If one or both of these are implemented to the degree necessary to
ensure ITRs respond quickly, then the whole thing will NOT scale.

Short caching time leads to constant hammering of the Map Servers and/or
ETRs, and whatever mechanisms are used for ITRs to request and receive
this mapping.

ITRs hammering the hapless ETRs with "RLOC-probing" doesn't scale at all
when one ETR is getting packets from lots of ITRs, which will frequently
be the case.  Furthermore, the ITR could be sending packets for a bunch
of EID prefixes to the one ETR, so it should probe the ETR about every
one of these, leading to bulky and perhaps multiple probe and response
packets.
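
The same sort of arithmetic can be sketched for the probing load on a
single ETR.  This assumes each ITR probes the ETR once per interval
for every EID prefix it is sending to via that ETR, and that a probe
packet can carry a limited number of prefix records.  The
records_per_packet parameter and the figures are my assumptions, not
something taken from the LISP draft.

```python
import math


def etr_probe_load(num_itrs, prefixes_per_itr, probe_interval_s,
                   records_per_packet=10):
    """Return (inbound probes/s, total packets/s) at a single ETR.

    Assumption: every probe packet is answered by exactly one reply
    packet, so the total is twice the inbound rate.
    """
    # An ITR needs several probe packets if its prefixes don't fit in one.
    packets_per_round = math.ceil(prefixes_per_itr / records_per_packet)
    probes_in = num_itrs * packets_per_round / probe_interval_s
    return probes_in, 2 * probes_in


# Hypothetical figures: 10,000 ITRs, 25 prefixes each, probing every 5 s.
probes_in, total = etr_probe_load(10_000, 25, 5)
print(f"{probes_in:,.0f} probes/s in, {total:,.0f} packets/s in total")
```

The load grows linearly with the number of ITRs, which the ETR and its
end-user network cannot control.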

RLOC-probing would be a DoS vulnerability for ETRs.  ETRs can't tell
whether the probe packet comes from a genuine ITR or not.  So the
attacker can send a blizzard of packets, each apparently from an ITR at
a different address.  The ETR will dutifully send off responses to
whatever addresses these are.


The only way of making the system work, AFAIK, is to have exactly one
system test reachability of the end-user network via its various ETRs
and make decisions itself about how this should affect the mapping.
This means ETRs and beyond them the end-user network routers are not
handling an excessive volume of reachability probe packets.  It also
means the probing can be done with whatever protocols, security and
frequency the end-user network desires.

This removes all the reachability testing and map-change decision making
from the LISP, Ivip or whatever system.  End-user networks are required
to do this themselves, or to pay a company to do it for them.  It needs
to be done from outside the end-user's network, so it is best to pay a
specialised company to do it.
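
To make the division of labour concrete, here is a minimal sketch of
such an external monitor.  It is only an illustration: the class name,
the probe() and publish() callables and the rule of "use the first
reachable ETR in order of preference" are all my assumptions.  How the
probing is really done, and how the new mapping is handed to the
mapping system, is up to the end-user network or the company it hires.

```python
class MappingController:
    """Hypothetical external monitor for one EID prefix."""

    def __init__(self, etrs_in_preference_order, probe, publish):
        self.etrs = list(etrs_in_preference_order)
        self.probe = probe      # probe(etr) -> True if the site is reachable via etr
        self.publish = publish  # publish(etr) hands the new mapping onwards
        self.current = None     # ETR most recently published, if any

    def check_once(self):
        """Probe every ETR; publish only if the chosen ETR has changed."""
        reachable = [etr for etr in self.etrs if self.probe(etr)]
        best = reachable[0] if reachable else None
        if best == self.current:
            return False
        self.current = best
        self.publish(best)
        return True


# Stand-ins for real probing and for the real mapping system.
reachable_via = {"etr-a": True, "etr-b": True}
published = []
controller = MappingController(["etr-a", "etr-b"],
                               reachable_via.get, published.append)

controller.check_once()          # both reachable: the first choice is published
reachable_via["etr-a"] = False   # etr-a's link to the site goes down
controller.check_once()          # the monitor switches the mapping to etr-b
print(published)
```

Only this one monitor probes the ETRs, so the probing load does not
depend on how many ITRs are sending packets to the site.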

Three more things are required to complete the system:

   1 - ITRs can request mapping from either one centralised mapping
       system (Ivip in its original incarnation) or from the correct
       one of the multiple mapping systems - the one which handles this
       EID prefix (Distributed Real Time Mapping for Ivip and LISP:
       http://tools.ietf.org/html/draft-whittle-ivip-drtm-01
       Diagrams: http://www.firstpr.com.au/ip/ivip/drtm/).

   2 - The ITR caches the mapping for some time specified in the
       Map Reply packet, chosen by the Map Server.  10 to 30 minutes
       might be a good choice, but an ITR whose cache is overloaded
       will drop the entry sooner if it is no longer sending data
       for this EID prefix.

   3 - Most crucially, the Map Server system needs to be able to
       securely and quickly (a few seconds at most) fan out Mapping
       Updates to all ITRs which are caching the Map Replies for
       whatever EID prefix has just had its mapping changed.  Ivip
       has always had this.  It's not hard - both the Map Replies and
       the Map Updates are secured by a nonce in the Map Request.
       This would not scale very well with ITRs directly requesting
       mapping from a few authoritative Map Servers, but if they
       do it through one or more layers of caching Map Servers,
       which pass on the Updates, fanning them out to multiple
       ITRs or other Caching Map Servers closer to the ITRs, then
       it can scale well and still work, typically in a fraction of
       a second.
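
The three points above can be sketched as a toy model.  This is not
the DRTM protocol itself: the class and method names are hypothetical,
there is a single layer of caching Map Server, time is passed in as a
plain number and the mapping is reduced to one ETR address per EID
prefix.  It only shows the idea that each server remembers who asked
for a prefix, with which nonce, and pushes Updates to those requesters
while their cached entries are still live.

```python
import secrets


class MapServer:
    """Authoritative (parent=None) or caching Map Server."""

    def __init__(self, parent=None, authoritative=None):
        self.parent = parent
        self.mappings = dict(authoritative or {})  # prefix -> ETR address
        self.upstream_nonce = {}  # prefix -> nonce we sent to the parent
        self.subscribers = {}     # prefix -> [(requester, nonce, expiry)]

    def request(self, prefix, requester, nonce, cache_seconds, now):
        if prefix not in self.mappings:
            if self.parent is None:
                raise KeyError(prefix)
            # Caching server: ask upstream once, then answer locally.
            # (Expiry of the server's own copy is omitted for brevity.)
            my_nonce = secrets.token_hex(8)
            self.upstream_nonce[prefix] = my_nonce
            self.mappings[prefix] = self.parent.request(
                prefix, self, my_nonce, cache_seconds, now)
        self.subscribers.setdefault(prefix, []).append(
            (requester, nonce, now + cache_seconds))
        return self.mappings[prefix]

    def set_mapping(self, prefix, etr, now):
        """Authoritative server only: change mapping, push Updates."""
        self.mappings[prefix] = etr
        self._fan_out(prefix, etr, now)

    def receive_update(self, prefix, etr, nonce, now):
        if self.upstream_nonce.get(prefix) != nonce:
            return False  # nonce mismatch: ignore the Update
        self.mappings[prefix] = etr
        self._fan_out(prefix, etr, now)
        return True

    def _fan_out(self, prefix, etr, now):
        # Only requesters whose caching time has not expired get Updates.
        live = [s for s in self.subscribers.get(prefix, []) if s[2] > now]
        self.subscribers[prefix] = live
        for requester, nonce, _expiry in live:
            requester.receive_update(prefix, etr, nonce, now)


class ITR:
    def __init__(self, map_server):
        self.map_server = map_server
        self.cache = {}  # prefix -> (etr, nonce, expiry)

    def lookup(self, prefix, now, cache_seconds=1800):
        entry = self.cache.get(prefix)
        if entry and entry[2] > now:
            return entry[0]
        nonce = secrets.token_hex(8)
        etr = self.map_server.request(prefix, self, nonce, cache_seconds, now)
        self.cache[prefix] = (etr, nonce, now + cache_seconds)
        return etr

    def receive_update(self, prefix, etr, nonce, now):
        entry = self.cache.get(prefix)
        if entry is None or entry[1] != nonce or entry[2] <= now:
            return False
        self.cache[prefix] = (etr, nonce, entry[2])
        return True


prefix = "203.0.113.0/24"
root = MapServer(authoritative={prefix: "192.0.2.1"})
caching = MapServer(parent=root)
itrs = [ITR(caching) for _ in range(3)]
for itr in itrs:
    itr.lookup(prefix, now=0)
root.set_mapping(prefix, "198.51.100.7", now=5)
print([itr.lookup(prefix, now=6) for itr in itrs])
```

In this sketch the authoritative server sends one Update to the
caching Map Server, which then fans it out to its three ITRs.  That is
the property which lets the real thing scale: no server has to send
Updates to more than its own direct requesters.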


The unscalable nature of LISP was obvious from the outset four years
ago.  Short caching times and/or frequent RLOC-probing - choose your
poison.

Furthermore, the LISP architecture monolithically integrates the
reachability and mapping decision making work with the mechanisms which
actually handle the traffic.  Ivip separates the reachability and
mapping decision making so it is the responsibility of the end-user
network, and does not need to be specified by the Ivip system itself.
LISP requires every ITR to figure out for itself which ETR to send
packets to, so all aspects of probing and decision making need to be
done by the ITRs individually, and need to be defined in the LISP spec.

 - Robin
