IETF-62 RTGWG Minutes

Folks-

 Meeting minutes below. Thanks to Don for taking the notes!
 Corrections are welcome till Apr 22.

-- 
Alex
http://www.psg.com/~zinin

IETF-62 RTGWG Meeting Minutes.

THURSDAY, March 10, 2005
1300-1500 Afternoon Sessions I
Chairs: Alex Zinin and Bill Fenner

Alex Zinin started the discussion with a WG status update (see slides).

- GTSM document will be LC'ed before Paris IETF, Alex to send comments
- Basic IP FRR updated.
- MIB draft to move to WG document.

Alia Atlas gave a presentation on the status of the Loop-Free Alternates
draft (draft-ietf-rtgwg-ipfrr-spec-base) (see slides for details).

  Alia gave a rundown of the changes and updates to the draft.
  There are some issues with various topologies in OSPF:
  - Multiple areas: Paths can go back and forth across area boundaries, which
  can cause the LFA to loop.
  - Virtual links: Different router than the backbone topology.
  - Alternate ABRs:
  - Multiarea ABRs:
  - AS External Routes:

  Recommend we do not support virtual links or alternate ABRs.
  If an ASBR is in multiple non-backbone areas, then no other ABR may also be
  in more than one of those non-backbone areas.
  An AS external route of a specific type (1 or 2) must not be announced with
  forwarding addresses from multiple non-backbone areas if those non-backbone
  areas share at least one ABR.

  Future changes:
  Added other applicability and Strict Downstream Alternates.
  Strict Downstream Alternates guarantee loop freedom but cannot guarantee
  SRLG protection.
  Alia: Would like opinions to the list please.
  Alia: Applicability was for OSPF.
  Alex: Comments?

  Ted Seely: Was it Inter AS you were talking about?
  Alia: Yes, it was AS-external
  Ted: Would any operators use this for inter-AS?
       Isn't it easier to solve the intra-AS problem first?
  Alia: these are external routes, but the discussion is about
        behavior within a single AS

  Dave Ward: Regarding your applicability statement note, are you planning on
             having the protocol automatically figure out what's supported and what's not
             or have it as a user burden?
  Alia: It is a user burden today.
  Dave: Many problematic situations can be discovered automatically, may want to
        converge on some protocol mechanisms for that, rather than rely solely on the
        user.
  Alia: OK, we can discuss this
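
  As a rough illustration of the two kinds of alternate discussed above, the
  sketch below tests each neighbor of a router S against the conditions
  usually stated for loop-free and downstream alternates. The minutes do not
  spell the inequalities out, so they are an assumption here, and the
  topology, link costs and function names are hypothetical. SRLGs, multiple
  areas and the OSPF corner cases listed above are not modelled.

```python
import heapq

def shortest_distances(graph, source):
    """Dijkstra over a dict-of-dicts graph: graph[u][v] = cost of link u->v."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def classify_alternates(graph, s, primary, dest):
    """For each neighbor N of S other than the primary next hop, evaluate
    (assumed conditions, not quoted from the minutes):
      loop-free:  dist(N, D) < dist(N, S) + dist(S, D)
      downstream: dist(N, D) < dist(S, D)
    """
    dist_s = shortest_distances(graph, s)
    result = {}
    for n in graph[s]:
        if n == primary:
            continue
        dist_n = shortest_distances(graph, n)
        result[n] = {
            "loop_free": dist_n[dest] < dist_n[s] + dist_s[dest],
            "downstream": dist_n[dest] < dist_s[dest],
        }
    return result

# Hypothetical topology with symmetric costs; S reaches D via E (cost 2).
topology = {
    "S": {"E": 1, "N1": 5, "N2": 1, "N3": 3},
    "E": {"S": 1, "D": 1},
    "N1": {"S": 5, "D": 1},
    "N2": {"S": 1, "D": 4},
    "N3": {"S": 3, "D": 3},
    "D": {"E": 1, "N1": 1, "N2": 4, "N3": 3},
}

alternates = classify_alternates(topology, "S", "E", "D")
for name, flags in sorted(alternates.items()):
    print(name, flags)
```

  In this made-up topology N1 passes both tests, N2 passes neither (its own
  shortest path to D goes back through S), and N3 is loop-free but not
  downstream, which is the kind of neighbor the stricter condition excludes.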

Next, Alex discussed the micro-loop prevention DT update.
Reference: http://psg.com/~zinin/ietf/rtgwg/neverloop/
           draft-bryant-shand-lf-conv-frmwk-00.txt
           draft-zinin-microloop-analysis-00.txt

  Alex presented the micro-loop design team summary that was posted on the
  RTGWG list. So far the team has eliminated:

  - Incremental cost change, due to multiple convergence cycles and long
    convergence
  - Synched FIB installation, due to requirements for tight sync'ing, strong
    implementation constraints, service dependency on NTP, and operational
    constraints

  Brian Haberman: Is it only a service dependency or an architectural problem?
  Alex: No architectural problem, since general routing still doesn't depend on NTP;
        only the FRR piece would depend on it.

  Alex went through the various options currently under consideration by the design
  team.  <See Slides for details>

  DT: PLSN

   Pros:
     -Easy to understand
     -Constant delay
     -Covers all topo changes
     -Allows SRLGs
     -Covers about 90% of the loops
     -Changing topology is likely to improve coverage
     -Can be used as a basic method with further extensions
   Cons:
    - Less than 100 %
    - Looping traffic may congest loop-free traffic.
    - Asymmetric link costs require stricter safety condition (DS BF
      instead of LFA)

  Ordered FIB Install:
   Pros:
     - 100% coverage [for micro-loops]
     - Asym costs covered
     - SRLGs covered at a cost
   Cons:
     - Distributed
     - Longer convergence (can be improved using explicit signaling)

  Tunnels:
   Pros:
    -Cover 100%
    -Const Convergence time (shorter than the ordered FIB)
    -Can handle SRLGs
   Cons:
    -Requires "Covert announcements" in PRs
    -With Basic FRR requires an additional data-plane mechanism
    -Tunnels have different operational and security considerations
    -Distributed version involves increased complexity
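
  To make the problem behind these options concrete, the sketch below shows
  how a micro-loop can form when routers install their post-failure FIBs in
  different orders. It is only an illustration under stated assumptions: the
  four-node topology, the costs and all function names are hypothetical, link
  costs are symmetric, and this is not the design team's PLSN, ordered FIB or
  tunnel procedure.

```python
import heapq

def distances_to(graph, dest):
    """Dijkstra rooted at dest; valid as distance-to-dest for symmetric costs."""
    dist = {dest: 0}
    heap = [(0, dest)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def fib_toward(graph, dest):
    """Next hop toward dest for every router that can reach it."""
    dist = distances_to(graph, dest)
    inf = float("inf")
    fib = {}
    for node, nbrs in graph.items():
        if node == dest or node not in dist:
            continue
        # Pick the neighbor with the lowest total cost; break ties by name.
        fib[node] = min(nbrs, key=lambda n: (nbrs[n] + dist.get(n, inf), n))
    return fib

def without_link(graph, a, b):
    """Copy of the graph with the bidirectional link a-b removed."""
    return {n: {m: c for m, c in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}

def trace(fib, src, dest):
    """Follow next hops from src; return (path, looped)."""
    path, seen, node = [src], {src}, src
    while node != dest:
        node = fib[node]
        path.append(node)
        if node in seen:
            return path, True
        seen.add(node)
    return path, False

# Hypothetical topology; link B-D fails.
before = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}
after = without_link(before, "B", "D")
old_fib = fib_toward(before, "D")
new_fib = fib_toward(after, "D")

# B installs its new FIB while A still uses the old one: A and B loop.
b_first = dict(old_fib, B=new_fib["B"])
# A installs first instead: traffic entering at A already avoids B.
a_first = dict(old_fib, A=new_fib["A"])

print(trace(b_first, "A", "D"))
print(trace(a_first, "A", "D"))
```

  With B updated first, traffic entering at A bounces between A and B until A
  converges; with A updated first it follows A, C, D. Controlling or
  sidestepping that ordering is what the options above trade off against
  coverage and complexity.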

  The next step for the design team is to come up with recommendations for
  the WG. So far Path Locking Safe Neighbors for basic FRR is a high runner.
  The hope is that the design team will make a recommendation in about one
  month.
  Questions?
  Alex: Will summarize considerations and send this to the ML and move the
        discussion there.

  Dave Ward: How did the DT come to the conclusion that less than 100% coverage is good enough?
  Alex: Generally, there's no clear cut here--reaching 100% coverage comes at a
        high cost. Basic IP FRR doesn't give us 100% coverage, and PLSN's
        coverage is very similar to that.

  Dave: OK, do you envision using PLSN as the base method and something else
        for advanced?

  Alex: Both basic FRR and PLSN address local failures and micro-loops using
        redundancy in physical network topology. As we move towards better
        coverage, the physical-topology method may be augmented with advanced
        methods like tunnels. Use PLSN where possible, use advanced on top of
        that for the rest.

  Russ White: Which parts of the topology are not covered?

  Alex: Mike did simulations that show that the places where PLSN doesn't cover
        microloops are almost the same as the places where IP FRR wouldn't cover
        local failures.

  Mike: Most of the places where you can't repair you also can't prevent loops.
        There are some cases where loops and failures do not collocate, of
        course.

  [? Not sure if this was still Russ] I think IPFRR and PLSN may be valid.

  Alex: The physical meaning is that both basic IPFRR and PLSN use similar
        constructs for defining nbrs that can be used for local failure repair
        and micro-loop prevention, so if you have a problem protecting against
        a local failure, it's likely you'll have loops there too.

  Mike Shand: There are certain typical topologies that have problems:
       rigorous hub and spoke; some rings. If you have a well-constructed mesh,
       it will play much better.

  Thomas Eriksson [sp?]: If you play with metrics it does not add much.
  Stewart: You might come up with poor topologies.
  You can reduce coverage with metrics or asymmetric link metrics.

  This line of questioning went too fast for the scribe:
  [?] Are asymmetric link metrics intentional or accidental?
  [Dave Ward?] How do I set up my metrics for traffic and repair?
  [?] It would be useful if more people with operations experience could comment.

  IGP metric-based TE and IP FRR coverage will probably work against each
  other. Not going to solve it here and now. Tools should take this into
  account.

  Stewart asked for feedback on whether we're getting sufficient coverage given
  the restrictions.

  Alex: There was a panel at NANOG. The message Alex took from there, as well
  as from conversations with other SPs, is that operators are concerned about
  complexity and would rather see 80-90% coverage with simple methods they can
  understand and deploy.

  Stewart: Did they say if they want to pick the traffic that is covered [by the 80-90%]?

  Alex: Telling you what I heard.
  Alex: My position: basic IP FRR and advanced FRR will have different applicability
  compared to each other and MPLS FRR. There will be networks where they won't make
  any sense because of the topology, and networks where they fit nicely. All about
  trade-offs.

Stewart Bryant presented a new method called NotVia. This is an improvement on the
tunnels method. (See slides for details.)

  IP Fast reroute via NotVia

  Advantages of NotVia:
    - Repairs all non-partitioning failures. The repair delivers traffic
      along NotVia paths.
    - Two guardian routers deliver the traffic along NotVia paths.
    - NotVia addresses are created per interface.
    - Stewart used some diagrams illustrating NotVia.
    - Some sample results using NotVia:
    - Incremental SPF with early termination in networks with 40-400 nodes
      is the equivalent of 5-13 full SPFs per node.
    - Really a very cheap [computationally] algorithm.

  More Advantages:
    - Works with MPLS LDP. Just push the label for an intermediate to get to P.
      Labels needed by S source.
    - Encapsulation (any IP encapsulation will work).
    - One tunnel does the job.
    - Not just pt-pt unicast.
    - NotVia covers LANs.
    - When you have a LAN, you don't know whether the LAN or a component
      failed. Can diagnose what is going on. The simple case must assume the
      LAN failed. Can use the NotVia repair to test for the P failure.
      Powerful technique.
    - NotVia works for multicast, although this is a hard problem.
    - Loop-free alternates can be used with NotVia. Can use NotVia as an
      acceleration for LFA.
    - Multicast needs an encapsulation but works [claimed by Stewart and Mike].
    - ECMP can be used with NotVia.
    - NotVia supports incremental deployment [via capabilities].
    - You can exclude routers that are not NotVia capable.

  Stewart explained the routing extensions for NotVia:
  - Need to advertise the NotVia address.
  - Must advertise the protected component SRG.

  Stewart explained Link Failure for NotVia.
  S (a source router in the diagram) can give any packet to any router.  S can
  optimize which router is best.
  NotVia provides for diagnosis of faults in LANs by correlating NotVia paths.
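
  A minimal sketch of the idea behind a NotVia repair path, under the
  assumption (not spelled out in these minutes) that the path to "B not via
  P" is the shortest path to B computed with P removed from the topology. The
  topology and the function name path_not_via are hypothetical, and
  encapsulation, SRLGs and LANs are not modelled.

```python
import heapq

def path_not_via(graph, src, target, avoid):
    """Shortest path from src to target that never transits `avoid`.

    Returns the list of nodes on the path, or None when removing `avoid`
    partitions src from target (a failure no repair path can cover).
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, cost in graph[node].items():
            if nbr == avoid:
                continue  # treat the protected node as failed
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical topology: S normally reaches B through P.
topology = {
    "S": {"P": 1, "X": 2},
    "P": {"S": 1, "B": 1, "C": 1},
    "X": {"S": 2, "B": 2},
    "B": {"P": 1, "X": 2},
    "C": {"P": 1},
}

repair_to_b = path_not_via(topology, "S", "B", "P")
repair_to_c = path_not_via(topology, "S", "C", "P")
print(repair_to_b, repair_to_c)
```

  Here the repair for B runs S, X, B, while C is only reachable through P, so
  no repair exists for it. That matches the "non-partitioning failures"
  qualifier in the list of advantages.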

  Loops don't form[?]

  Stewart explained that it supports multi-homed prefixes, with two strategies.

  Joel Halpern: <referring to diagram> Node B [source node] must have
  detected the failure already to use the [NotVia path].
  Stewart: Yes.
  Joel: B sends to P; if P cannot forward, P drops [and does not loop]?
  [Yes]

  Thomas Eriksson [sp?]: How would this work with MPLS?
  Some fast discussion and debate among several people on whether all the
  cases are covered.
  Stewart: Does not change the problem--B needs to know about the failure
  quickly enough.

  Summary of NotVia by Stewart:
  - Solves the problem at an intuitive level.
  - Works with asymmetric links.
  - Uses MPLS FRR hardware
  - Single-level encapsulation
  - Repair time is bounded.

  Alex: How many NotVia addresses?
  Stewart: One per link [in network]
  Joel: Isn't it one per neighbor?
  Stewart: yes, one per neighbor
  Joel: Except for SRLG ?

  Alex: The assumption is that B [referring to diagram] will detect the failure
  soon enough that traffic is not dropped on the floor.

  Stewart: Standard failure detection applies to every problem and every
  solution here. Same response characteristics here as with basic IP FRR.

  <some more discussion on this>

  [Dave?] BFD time is comparable with all neighbors. It is part of every
  solution in all problem scenarios.

  Alia: This is different from Loop-Free Alternates.
  Might not be able to do the repair.
  Stewart: Exactly the same [problem as LFA].
  Mike: [Agrees] Does not make a difference. If BFD time is different, that is
  the least of the problems.

  George Swallow: Works for any packet?
  [Scribe missed the response; affirmative, I believe.]

  Joel: What about transients? Do you assume the path is available?
  Alex: The network is undergoing changes.
  Stewart: NotVia is never worse than one detection time.
  Alex and Stewart: Read the draft. It is publicly available.
  Thanks. Read and send comments.

  More Discussion on NotVia:
  Alia: If the link between S and P is a broadcast link I don't think it
  works.
  Stewart: If it is a broadcast link, it is like the pseudonode.
  Joel: Made a comment, and subsequently withdrew it.

Next, Mike Shand presented a summary of agreements from the DT and of
non-agreements between Alia, Mike and Stewart. <See slides for details>

  All agree: There are four advanced methods.
  - U-turn
  - IP-TE
  - PQ-Tunnels
  - NotVia

  Less agreement: What value do you assign [to complexity etc.]? It is a value
  judgment.

  Issue: Failure scenarios

  All agree methods must do:
  -Link failure,
  -node failure,
  -broadcast link failure,
  -local SRLG failure
  Some agreement, maybe:
  - SRLG failure
  - IP unicast failure
  - MPLS LDP failure
  - IP multicast failure?

  Questions on complexity of the encapsulation.

  Tunneling:
  Use IP or LDP for encapsulation?
  Some traffic types probably need tunneling.
  All tunneling requires label acquisition.

  On computation complexity (least agreement in DT):
  More SPFs: how many is too many?
  Final computation time?
  How long does it take to prepare for the next failure?
  Not all SPFs are equal?

  Network is vulnerable during the [preparation] time.

  Delayed convergence is provided, so you might get the computation done in
  time.

  Mike presented a table of the number of SPF calculations (see slides).

  Notable:
  U-turn - 20 node ~ 10 SPFs
  NotVia - 15 equivalent SPFs
  Mike also presented comments on Routing extensions, Forwarding extensions,
  Coverage. See slides

  What cost completeness, what cost lack of it?

Alex: Are we done?

Andy Smith:
Are the methods nodal based or link based? What if the node is congested?
Joel: This is about failures, not congestion management.
Andy: Do you have a delay mechanism for link flapping, e.g. link flap dampening?
Stewart: Absolutely.

Andy: Is it Nodal based or link based?  Does the algorithm delay announcements
for each link or for the node as a whole?
Alex: We're building on the existing IGP behavior. What you're asking is how IGP will
      react to a flapping link. Different vendors have different optimizations.
      There was a draft on that, which didn't survive. Might be useful to reconsider that.
      Assume the IGP can handle this.
Andy: These are all good ideas, but what about DoS attacks etc.?
Alex: If a router is overloaded and starts dropping hellos, it will be detected
      as a failure
Joel: Congestion is better dealt with by other aspects.
Alex: Agree with Joel, this is a general problem for IGPs, IP FRR, MPLS FRR, and
      should not be solved here. Valid problem, but IPFRR is not the right place.
Don Fedyk: Do we need to allow an option to bail out?
Some discussion for Don to clarify his comment.
Alex: Not required.

? [Does this mean there are] not a lot of nodal failures?
Ted: How do you tell? A common failure situation is many links.
Stewart: Do you mean SRLGs?
Ted: Yes.
Andy: SRLGs on a nodal basis or across a network?
Alia: Yes. Both local SRLGs and general SRLGs.
[?] If we have a meshy network. Would not change anything if it is.
Determinism.
Ted: When something happens in the night how do you adjust?
[Dave?] Pagers are for that.
Dave Ward: IP FRR is not for that.

[?] What is the path to coming to a conclusion on those things?

Alex: Need to finish the base spec first. Then do the MIBs, applicability statement.
      The framework could be the applicability statement. Will need implementation reports.

Chris Hopps [sp?]: Do you want [basic] implementations?

Alex: Basic implementations [first, yes].

Stewart: Concerned about Basic IPFRR; believes it is an accidental approach.
It may turn out not to be what we want. If so, the RFC will legitimize it,
and we may not be able to deploy other, better solutions. It's not entirely
clear if Basic is the right intermediate step. Look at it as more of a
facilitator. May mean we miss the real solution for the real [situations].

Alex: Talked to a few service providers. Believes the Basic FRR is still valid
and useful, more operationally feasible. The decision made while chartering
the WG--to produce basic first--is still valid, and shouldn't be revisited
now.

Thomas Eriksson [sp?]: Agree, keep down complexity.
Alex: Exactly why the basic first.

Alex: Thanks, read the drafts, send comments.

WG meeting concluded.
