[Teas] TEAS WG Virtual Interim Meeting: Notes/Observations

Vishnu Pavan Beeram <vishnupavan@gmail.com> Thu, 28 January 2016 18:31 UTC

Date: Thu, 28 Jan 2016 13:31:32 -0500
Message-ID: <CA+YzgTv_WHgO_sDrtUUm-T8NJUPq77hPSdSRpyoyNvDPvBSL=A@mail.gmail.com>
From: Vishnu Pavan Beeram <vishnupavan@gmail.com>
To: "teas@ietf.org" <teas@ietf.org>
Archived-At: <http://mailarchive.ietf.org/arch/msg/teas/4PFeeJ8kJkMMsFmJ_qWUZcIrHeQ>
Cc: TEAS WG Chairs <teas-chairs@ietf.org>
Subject: [Teas] TEAS WG Virtual Interim Meeting: Notes/Observations

Many thanks to everyone who attended the interim meeting today.

Raw notes are enclosed below, and are editable, via the URL:
http://etherpad.tools.ietf.org:9000/p/notes-interim-2016-teas-1

Please review and feel free to add your corrections via the link above.
Changes/notes will be reviewed and approved by the chairs (and WG)
before being finalized. Please limit changes to what actually
transpired in the meeting.

Lou and I had a quick chat after the meeting. The following
captures some of our main observations from the discussion on the
ingress protection draft:

(1) There is an acknowledgement of a problem, but there was also
acknowledgement that the problem should be formally documented. We
request the authors put together a brief addition to the draft
summarizing the problem. Adrian Farrel has agreed to work with the
authors on this.

(2) There wasn't consensus among meeting participants on either of the 2
approaches documented in the -03 rev of the draft. There is also a fair
bit of detail missing in the draft for both approaches before either
could be viewed as being "ready".

(3) From the participants' responses, it's not clear if there is general
interest in implementing either proposed solution (although there is
one experimental implementation of one of the proposed approaches).
Given this and (2), we are considering proposing that the work be
targeted for publication as "Experimental".

Comments and discussion on the meeting, as well as on our observations,
are welcome. Also, if you have any additional information you'd like to
provide the chairs or the AD, e.g., related to implementation, please
feel free to contact any of us directly.

Regards,
-Pavan (and Lou)

***

> Final Agenda - TEAS Virtual Interim Meeting (RSVP Ingress Protection /
Egress Protection)
> Jan 28th, 2016, 10:00 EST | 15:00 UTC (Duration: 90 mins)
> Drafts:
> https://datatracker.ietf.org/doc/draft-ietf-teas-rsvp-ingress-protection/
> https://datatracker.ietf.org/doc/draft-ietf-teas-rsvp-egress-protection/
>

> - 10 min - Intro/Agenda (Chairs)
>
https://www.ietf.org/proceedings/interim/2016/01/28/teas/slides/slides-interim-2016-teas-1-1.pdf
>

Lou Berger: we want to be clear that implementation status is requested
by the IESG, so we'll need to talk about that. It's important that we
can demonstrate interest and implementation to get a standards-track
document through the IESG.

> - 65 min - RSVP-Ingress Protection
> - 35 min - Solution Proposals: Comparative Analysis (Huaimo Chen)
>
https://www.ietf.org/proceedings/interim/2016/01/28/teas/slides/slides-interim-2016-teas-1-0.pdf
>
https://www.ietf.org/proceedings/interim/2016/01/28/teas/slides/slides-interim-2016-teas-1-0.pptx
Huaimo presents:

Greg asks: Why do you say this mechanism is faster than e2e
protection? Would you agree detection time is the same?
Huaimo: Yes.
Greg Mirsky: Question on claim that speed of recovery drives recovery
Huaimo Chen: two things to bear in mind. One is detection time, and a
later slide will give the whole picture. In addition to detecting the
failure at ingress, also need failure detection at the CE downstream.
Greg Mirsky: alternative solution has been defined in linear protection
and it solves this exact problem
Greg Mirsky: with detection and switchover, speed is not a real factor
Huaimo Chen:
Lou Berger: Greg brings up another point, at least indirectly. in
addition to having an existing solution that provides a solution at
perhaps the same or different speed, it's important that we not define
another solution if we already have one that's good enough
Huaimo Chen: speed is one effect, but also need to consider speed and
operability
Ross Callon: has a question about function. If want to cover any
failure along the path, end to end solves the general problem, but we
may want to go through the sides first
Lou Berger: I'd like to hear that discussion and it's fine to defer it
to the open discussion, but we need to cover it.

Ravi Torvi: Slide 5: On the semantics of the path message... you're
saying it's just like any other Path message, but it seems to be
completely different. Can you comment?
Huaimo Chen: it's different - it has an egress protection object
Ravi: slide doesn't say there are three steps. You have a totally
different state machine on the backup.
Huaimo Chen: have a slide with details later - we can discuss there.

Tarek Saad: (at slide 6) is there an implicit assumption that the
primary and backup ingress are connected back-to-back with a link? Can
they be multiple hops apart?
Huaimo Chen: yes, but need a tunnel to be one RSVP hop away
Tarek Saad: are we re-using the same RSVP session for the primary to
relay messages for the backup?
Huaimo Chen: yes
Tarek Saad: so the backup ingress will behave as an LER in that case
Huaimo Chen: yes
Tarek Saad: thanks - that's not been clear in the past.

Lou Berger: (slide 7) so to be clear on what you said earlier, even
though the backup terminates on P1 on the slide it's actually the same
egress as the primary?
Huaimo: yes

Tarek Saad: slide 9: for completeness on the relay method, if the backup
is multiple hops away I'd need to configure another tunnel, right?
Huaimo Chen: yes
Tarek Saad: so that's another bullet you need
Huaimo Chen: yes, good catch.

(slide 12)
Pavan Beeram: Can you elaborate on how the sync happens between primary
ingress and backup ingress?
Huaimo Chen: If something changes in the primary LSP it can be
duplicated to the backup.
Pavan Beeram: So in the case of the relay message you need to figure out
additional ways to maintain the state sync
Huaimo Chen: Any changes in the primary can be duplicated to the backup
Pavan Beeram: So you're talking about re-using the same Path message used
on the primary and redirecting it elsewhere. Any other changes? No
change to the ERO in the message from primary ingress to backup ingress?
Huaimo Chen: EROs are copied across
Pavan Beeram: So all object processing now needs special semantics
Huaimo Chen: No, it's just copied over so nothing special
Lou Berger: It sounds like this isn't a normal Path message - more a
notification to the backup that it needs to provision something. To me
it would make sense for this to be a Notify rather than a Path
Huaimo Chen: the Path from primary ingress to backup ingress is a
special Path message; we'll check cross-connections
Pavan Beeram: early versions of the draft had a different message, I think
Ravi Torvi: initial versions defined a new message type; later it was
changed to a Path. I don't remember why.
Pavan Beeram: In the current form we've changed the semantics of all
object processing. Conventional ERO processing will fail as the backup
ingress isn't on the path.
Huaimo Chen: After some thought we decided using an existing message was
easier.
Pavan Beeram: OK, it seems clear that there's some work to be done for
state sync and the relay message between primary ingress and backup ingress.

(slide 16)
Pavan Beeram: For the relay message, is it a separate session to the
backup or a different one?
Huaimo Chen: Same, but different state. Relay message has extra state
for Path and Resv messages to backup ingress. So for proxy-ingress
method we need to keep the state for the Resv messages received from
backup egress.
Tarek Saad: On relay message point 3 (Resv coming back from primary):
does it have a purpose besides being an ack?
Huaimo Chen: It relays protection state to the primary ingress node.
Tarek Saad: Thanks. And about the proxy-ingress - you talk about two
sessions - why?
Huaimo Chen: Better to talk about state. So there's two Path messages
and two Resv messages.
Tarek Saad: So how many sessions?
Huaimo Chen: At high level, session for the relay message is the same.
For proxy-ingress all sessions are the same, so just one. But if we want
to compare from the scalability point of view, number of states is
related to number of messages as we need to keep state for each message.

> - 25 min - Open Discussion

Ravi Torvi: would you consider tweaking the session handling after this
discussion? i.e. it's not really a Path message, but something that
needs special handling
Huaimo Chen: yes, it needs special handling
Ravi Torvi: Pavan mentioned that objects have a different interpretation
Huaimo Chen: yes, the ingress protection object is different and may
contain labels. Labels for proxy-ingress are carried by Resv message.

Greg Mirsky: I wanted to come back to the earlier question about
detection. I recall the document for ingress protection has an OAM
session to detect that the primary ingress node fails from the backup
ingress. Is that right? So this requires an OAM session between PE6 and
PE5, so the backup monitors the primary ingress, right?
Huaimo Chen: That would be nice to have. If a source delivers traffic to
PE5 first and then switches it to PE6, on PE6 the forwarding entry there
is active at the beginning. PE6 will forward traffic to the bypass LSP
so traffic continues in the data plane. We also have to keep the primary
LSP up by refreshing Path messages; this is achieved by detecting the
failure of PE5, and after that we have to put Path messages into the
tunnel and into the next hop (P1). We can use OAM, or a faster method is
better. We can use a number of ways to detect failure but faster ones
are better.
Greg Mirsky: So what method are you proposing? I don't see how routing
can identify that the node isn't functioning. e.g. if you use OSPF and
expect that your LSA will age out, that takes 30 minutes.
Huaimo Chen: Yes, that's too long. We can check routes in the routing
table, so if you have no routes to PE5 you can say that it's gone.
John Drake: You could wait for a sufficient period of time and you could
verify whether individual links to the node have gone down or all of
them, but that's a long time.
Huaimo Chen: As soon as one link goes down we can check the rest
John Drake: But that's not rapid detection
Tarek Saad: Do we need to determine whether the node has failed? If I
detect at the source that my link to PE5 has gone isn't that enough?
Can't you rely on the link to PE5 going down
Greg Mirsky: if PE6 detects that the link to PE5 is down, that doesn't
mean the link from PE5 to P1 is down
Tarek Saad: but a source behind both could check the liveness of PE5
Greg Mirsky: what the source checks isn't the liveness of PE5, but the
liveness of its connection to PE5.
Tarek Saad: OK
Greg Mirsky: so PE5 can be fine as far as P1 is concerned.
Tarek Saad: yes, that's the case in transport protection
Greg Mirsky: So if we lose the link between the source and PE5 and
switch over, PE5 doesn't know that this has happened
Ravi Torvi: Upstream source detecting the failure is one model; the
backup has to protect the primary going down.
Greg Mirsky: My point is that this method is no better than multi-homing
Ross Callon: to me this discussion needs to be clear in the spec. What
are we protecting against, and not protecting against? How do we
distinguish link down vs node down vs multiple links down?
Lou Berger: discussion today isn't about whether this is complete - I
hope we all agree there's a lot of work to be done. Question is which of
these approaches should we pursue in the WG? One thing that comes to
mind is implementation - do we have people interested in implementing one
or both models? Bearing in mind we're on the record.
Huaimo Chen: Huawei were interested in implementing the relay method for
ingress protection, and had a prototype for egress protection.
Lou Berger: Anyone else want to comment? Greg seems interested in
linear protection?
Ravi Torvi: We (Juniper) don't see any customer interest in ingress
protection given the complexity and issues with failure detection, so we
want to come up with a simple solution and decide whether we really need
this.
Lou Berger: so you don't like the relay message and don't think it's
consistent with rest of RSVP?
Ravi Torvi: yes
Ross Callon: on simplicity vs complexity - this seems a lot more complex
than end-to-end protection and only solves part of the problem.
Huaimo Chen: here the complexity is in the vendor-side, but we provide
simplicity to the provider.
Lou Berger: So this solution focuses on a narrow piece of protection;
it doesn't worry about the end-to-end problem or the transit problem.
Clearly some authors believe this is a problem that needs to be solved.
Anyone want to talk about that? Is an optimised ingress protection
solution something that people think is important?
Ravi Torvi: We don't see a need to solve this.
Lou Berger: So you think end-to-end or other methods are good enough?
Or this just isn't a problem?
Ravi Torvi: It's just not a problem.
Huaimo Chen: In the beginning we saw issues in real deployments and
presented these at previous IETFs with early draft versions. End-to-end
protection for P2MP LSPs is really complicated - this is easier. But the
motivation for this all comes from real deployments. Advantages are that
if we protect ingress and egress nodes we have a whole solution, so it's
fast, easy and efficient.
Ravi Torvi: Ross also mentioned that reliably detecting primary failure
is something we've gone through before - we need a lot more than a BFD
session. That leads to deployment issues and so this isn't just a RSVP
extension - there's a lot of deployment issues too.
Lou Berger: it seems from the folks here that there's some agreement
that there's problems to solve and multiple ways to solve it. There are
concerns about the original proposal which came to the point that
there's now a second one in the doc. And there's agreement that to
finalize this we need more details in the document - it's been thought
about but it needs to be documented properly. So there's three options:
carry on trying to make a standard, say there's not enough support, or
the third option is to say we've done some good work but we're not sure
how operationally viable it is, so we should run experiments - and we
could have a single experimental draft with multiple solutions. I'd be
interested in hearing from folks on this. And I'd like to specifically
ask the MPLS chairs their opinions as this work started in that WG.
Loa Andersson: yes, this started in MPLS and went to TEAS. I think that
if I view this from Huaimo's point of view we're asking him to redo what
he's done before, so I don't know where this is going. If we say that we
want to do an experiment... Huawei have done that. So I don't know what
there is to add.
Lou Berger: Linear protection has been standardized and can solve this
problem, and that wasn't the case when this was first adopted in MPLS.
So we had a real problem, but it's now solved by other means.
Loa Andersson: I'll need to look into that more to say that with confidence.
Ross Callon: Not sure I can speak as MPLS chair as there's no consensus
in the WG, but my own view... there are times in the IETF where we take
on work and we don't know how it'll play out until people do a lot of
work (e.g. LISP). So in this case a lot of work has been put in to
determine that something doesn't look as optimistic now as when we
started. Publication as experimental is what we usually do in this
situation. There's no alternative to doing the work and seeing how it
turns out. I don't feel good about asking someone to do all this work
and then not being able to publish it.
Huaimo Chen: Even though this draft moved to TEAS, I think there was
good support in WG meetings.
Lou Berger: anyone else have comments?
Huaimo Chen: Also, implementation of this is not that hard.
Pavan Beeram: so even if we go the experimental path, the draft is
nowhere near complete, right? There's more details that need to be put in.
Huaimo Chen: I've also talked to service providers who like this and
think it increases scalability.
George: if you're protecting end-to-end you don't need to protect hop by
hop, and if you protect hop by hop you have a lot of state
Lou Berger: issue is that end to end starts from one LER, and this
starts from two.
Huang Lu: In our network we have a lot of enterprise customers so local
protection is very useful to us; if we have ingress + egress protection
and FRR, that's useful.
Lou Berger: So existing protection mechanisms are too slow?
Huang Lu: yes
Lou Berger: And you care about protecting ingress and egress nodes?
Huang Lu: yes
Lou Berger: Do you care about what specific solution solves the
ingress/egress problem? Do you care about the implementation, or just
what it does?
Huang Lu: I care about fast protection
Lou Berger: So I think we've heard enough to say that people care about
the problem, but we don't have consensus on the mechanism and we don't
have consensus to throw out one of the options, and we don't have
support to push for a proposed standard. I'd like to talk to Pavan
offline and then bring proposals to the list - we can't make a decision
here.
Pavan Beeram: yes, we'll discuss between chairs and get back to the WG.
We'd abandon the work if there were no interest, which is a bit extreme.
If we heard that there was an immediate need for a solution we'd ask to
continue the work as a standard. And we could go with the experimental
approach. We'll use the list to get more details on implementations and
interest in this.
Lou Berger: we can separate the ingress and egress discussions; we
should continue the egress discussion on the list (I have a question I
want to ask about)
Adrian Farrel: we keep skirting around what this problem is and we
haven't really nailed it down. George asked if you're doing FRR at every
point along the path. Maybe we need a 1-page disposable document that
sets out what the problem is and what the potential solutions are so
that folks can decide whether or not they're interested.
Lou Berger: Can you help guide that?
Adrian Farrel: yes
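
Illustration of the link-down vs node-down question raised above (Greg
Mirsky, John Drake, Ross Callon): a minimal sketch, not taken from the
draft or from anything presented in the meeting. It assumes each
observer (e.g. the source, PE6, P1) can only report whether its own
link to PE5 is up. The function name, the verdict names and the input
format are hypothetical.

```python
from enum import Enum
from typing import Mapping


class Verdict(Enum):
    NODE_UP = "no failure observed"
    LINK_DOWN = "some links down, node still reachable by others"
    NODE_SUSPECT = "all observed links down, node may have failed"


def classify(link_up: Mapping[str, bool]) -> Verdict:
    """Classify the state of a monitored node (e.g. PE5).

    link_up maps an observer name to whether that observer's own
    link to the monitored node is currently up.
    """
    if not link_up:
        raise ValueError("need at least one observation")
    down = [name for name, up in link_up.items() if not up]
    if not down:
        return Verdict.NODE_UP
    if len(down) < len(link_up):
        # One observer losing its link says nothing about the others.
        return Verdict.LINK_DOWN
    # Even here the node is only suspected down: every observed link
    # could have failed independently.
    return Verdict.NODE_SUSPECT


# PE6 loses its link to PE5, but P1 still sees PE5.
print(classify({"PE6": False, "P1": True}).name)
```

With a single observer the sketch can never tell LINK_DOWN from
NODE_SUSPECT apart in any meaningful way, which is the objection made
in the discussion.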

Chairs: Thanks to all for coming; we'll discuss further on the list.
Special thanks to Huaimo for putting the slides together.

> - 05 min - Discussion Summary / Next-Steps (Chairs)
>

> - 15 min - RSVP-Egress Protection (Open Discussion)
>
<not enough time -- move discussion to list>

> Meeting Materials -
> https://www.ietf.org/proceedings/interim/2016/01/28/teas/proceedings.html
>

> Etherpad -
> http://etherpad.tools.ietf.org:9000/p/notes-interim-2016-teas-1
>

Attendees:
Vishnu Pavan Beeram (chair, leading meeting)
Lou Berger (chair)
Adrian Farrel
Andy Malis
Autumn Liu
Dan Romascanu
Daniel King
Dhruv Dhody
Matt Hartley
Dieter Beller
Greg Mirsky
Haomian Zheng
Huaimo Chen
Huang Lu
Igor Bryskin
Jeffery Zhang
John Drake
Lin Han
Loa Andersson
Padma
Quintin Zhao
Ross Callon
Tarek Saad
Xufeng Liu
Yimin Shen
Ravi Torvi
Mateusz Waldman
George Swallow
