Re: [L2tpext] WG last call for draft-ietf-l2tpext-failover-05.txt

Hi Vipin,

Thanks for considering the comments; please see inline.

Circa 9/1/2005 2:30 AM, Vipin Jain said the following:
> Carlos,
> 
> Thanks for your comments. My responses are inline.
> 
> 
> 
>>This looks like a very good document to me, please find a couple of
>>comments/queries:
>>
>>1.
>>      2.2.1 Recovery tunnel establishment
>>
>>         corresponding old tunnel.  An endpoint SHOULD not send any
>>         control message on this tunnel, other than the messages to
>>         establish and tear down the tunnel itself.
>>
>>***CP: I know this was updated with the "and tear down" to clarify about
>>***CP: StopCCN. Maybe just a nit, I wonder however if "establish,
>>***CP: keepalive and tear down" is more complete to include ZLB as well.
>>***CP: Or possibly simpler enumerate the allowed control messages: SCCRQ,
>>***CP: SCCRP, SCCCN, StopCCN, ZLB Ack and Explicit-Ack (for L2TPv3
>>***CP: only).
> 
> I think the intention is obvious, but if interpreted strictly it can lead to
> some confusion. Perhaps the following conveys it more clearly:
> "messages other than those required to manage the life of the recovery tunnel"

This text sounds good.

> 
> 
>>2.
>>         Tunnel Recovery AVP for L2TPv3 tunnels:
>>[snip]
>>       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>       |                        Recover Tunnel Id                      |
>>       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>       |                     Recover Remote Tunnel Id                  |
>>       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>
>>***CP: s/Tunnel Id/Control Connection ID/
>>***CP: Same clarification between tunnel id and control connection id in
>>***CP: the following 2 or 3 para (only mentions tunnel id).
> 
> The protocol defines Recovery Tunnel as a standard term, hence we thought using
> 'Recover Tunnel Id' would be more appropriate. Some tunnel definitions do
> refer to themselves as control connections.
> 
> 
>>3.
>>         Id and responds with an SCCRP. It MUST terminate the tunnel if:
>>         - Recover Tunnel Id or Remote Recover Tunnel Id is unknown.
>>         - Non failed endpoint did not indicate it was failover capable.
>>         - The L2TP version of recovery tunnel is different from the
>>
>>***CP: What if the _failed_ endpoint did not indicate it was failover
>>***CP: capable when it established the "old tunnel"? In such a case:
>>***CP: 1. the non-failed endpoint would not know the peer's Recovery
>>***CP: time, and any assumption for it could result in extended downtime.
>>***CP: 2. it seems this could open the door for a malicious party
>>***CP: impersonating a failed party?
>>***CP: So probably it should/must terminate the recovery tunnel if the
>>***CP: failed endpoint did not indicate it was failover capable?
> 
> If the failed endpoint did not indicate it was failover capable then we'd like
> to keep the behavior as is. The behavior mentioned in the failover protocol is
> applicable only for those tunnels that are failover capable. In that respect,
> yes it would extend the downtime, but how do we know if the remote endpoint
> really failed or not? Answering the question "how do you know if the remote
> endpoint failed if it didn't initiate a failover tunnel?" is tricky. Do you
> deduce it based on another tunnel to the same peer restarting?

This is the scenario I meant: LCCE receives an SCCRQ containing the
Tunnel Recovery AVP. That means that the remote LCCE is establishing a
recovery tunnel. The Recover Tunnel Id and Recover Remote Tunnel Id in
the AVP identify an old tunnel for which the remote LCCE had not
previously included Failover Capability AVP upon establishment.
Shouldn't the recovery tunnel be torn down in that case?
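
To make the scenario concrete, here is a minimal sketch of the check I have in
mind on the non-failed endpoint. It is only an illustration: the OldTunnel
record, the table keyed by the two ids from the Tunnel Recovery AVP, and the
function name are all hypothetical. The first two conditions paraphrase the
draft text quoted above; the third is the behavior I am asking about and is not
in the draft today. The L2TP version check from the draft is left out for
brevity.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class OldTunnel:
    # Hypothetical state the non-failed endpoint keeps per established tunnel.
    local_failover_capable: bool  # we sent Failover Capability AVP at setup
    peer_failover_capable: bool   # peer sent Failover Capability AVP at setup


# Hypothetical table keyed by (Recover Tunnel Id, Recover Remote Tunnel Id),
# using the pair exactly as it appears in the received Tunnel Recovery AVP.
TunnelTable = Dict[Tuple[int, int], OldTunnel]


def handle_recovery_sccrq(tunnels: TunnelTable,
                          recover_tunnel_id: int,
                          recover_remote_tunnel_id: int) -> str:
    """Return "accept" to answer with an SCCRP, or "terminate"."""
    old = tunnels.get((recover_tunnel_id, recover_remote_tunnel_id))
    if old is None:
        # Draft: Recover Tunnel Id or Recover Remote Tunnel Id is unknown.
        return "terminate"
    if not old.local_failover_capable:
        # Draft: non-failed endpoint did not indicate it was failover capable.
        return "terminate"
    if not old.peer_failover_capable:
        # Proposed in this mail, NOT in the draft: the endpoint now claiming
        # to have failed never sent Failover Capability AVP for the old tunnel.
        return "terminate"
    return "accept"
```

With the third condition in place, an SCCRQ naming an old tunnel whose peer
never advertised failover capability is rejected instead of starting recovery.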

> 
> 
>>4.
>>         tunnel. If, for any reason, the failed endpoint could not
>>         establish the recovery tunnel then it MUST silently clear the
>>         recovered tunnel and sessions within, assuming the recovery
>>         process has failed.
>>
>>         Any control packet received on the recovered tunnel, before
>>         control channel reset, MUST be silently discarded.
>>
>>***CP: The "recovered tunnel" in those 2 sentences is in fact the "old
>>***CP: tunnel" at this point, right? If the recovery tunnel
>>***CP: establishment fails then the old tunnel never makes it to
>>***CP: recovered, correct?
> 
> Yes, 'recovered tunnel' is indeed the 'old tunnel'. Will change in the text.
> 
> 
>>5.
>>         An endpoint MUST use tie breaker AVP (section 4.4.3 [L2TPv2]
>>         and section 5.4.3 [L2TPv3]) in the setup of the recovery tunnel
>>
>>***CP: the "tie breaker AVP" would be the "Control Connection Tie
>>***CP: Breaker AVP" for L2TPv3; some other parts of the document make
>>***CP: the distinction between different named v2/v3 AVPs.
> 
> Yes, that's the right terminology as per L2TPv3. I'll change that.
> 
> 
> 
>>6.
>>      2.2.1 Recovery tunnel establishment
>>
>>***CP: What are the guidelines for including Failover Capability AVP in
>>***CP: the Recovery Tunnel SCCRQ/SCCRP establishment? MUST NOT be used?
>>***CP: Or use the values included in the Failover Capability AVP
>>***CP: for the Recovery Tunnel as applicable for the recovered tunnel,
>>***CP: to be used in a subsequent failure?
> 
> I think using Failover Capability AVP in Recovery tunnel would not be of much
> use. If the system fails while establishing Recovery tunnel, it will establish
> another recovery tunnel in order to recover the 'old tunnel'.

I totally agree.

> I don't see a need for explicitly recommending it. Do you?

My point was that, if the AVP is not to be used, stating so explicitly could
be clearer for an implementor. Either way is fine though.

> Should we explicitly recommend it?
> 
> 
>>7.
>>   2.1 Pre Failover Operation
>>
>>      The D bit, when set indicates that an endpoint is capable of
>>      resetting Nr value based on received Ns value(s) from one or more
>>      'out of order but in sequence' packets from the peer.  This bit is
>>      applicable only for the sessions using sequence numbers on the
>>      data channel i.e. data channel failure on the system not
>>
>>      2.2.2 Control and Data Channel Reset
>>
>>         numbers and if data channel has failed over. Failed endpoint
>>         resets its Ns value to zero, where as non failed endpoint could
>>         continue to use the Ns values it was using previously. To reset
>>         Nr values during failover, if an endpoint receives 'n' out of
>>         order but in sequence packets then it MUST set the Nr value
>>         based on the Ns value of the incoming packets, as suggested in
>>         Appendix C [L2TPv3]. The value of 'n' should be configurable.
>>
>>***CP: Nit comment: I wonder if these paragraphs should say "expected
>>***CP: sequence number" instead of Nr and Sequence Number instead of Ns
>>***CP: for data channel, to make it clear it is not the control connection
>>***CP: Nr/Ns for L2TPv3. Or note what Nr/Ns are referring to in L2TPv3,
>>***CP: as sequencing fields do not have that name.
> 
> Will change it to 'expected sequence number' and 'sequence number'.
> 
> 
>>8.
>>   2.3 Session State Synchronization
>>      Step2: Both endpoints SHOULD identify the sessions that might have
>>      been in inconsistent states, perhaps based on data channel
>>      inactivity.
>>
>>***CP: Over what period is data channel inactivity measured? In any case
>>***CP: it may not be an accurate indicator of inconsistency, a silent
>>***CP: session may be consistent and a session receiving data packets
>>***CP: may be inconsistent. Shouldn't FSS be sent for all sessions instead?
> 
> The inactivity period was put in keeping the protocol 'echo/hello' period in
> mind. Sending FSS for all sessions is the sure way. We can leave that to the
> implementation.
> 
> Keyur suggested putting the following text in the beginning of section 2.3:
> "Two new messages FSQ/FSR have been introduced to synchronize session state at
> any given point during the life of a session between the two endpoints. These
> messages are used when one endpoint determines or suspects in an implementation
> specific manner that a session state between it and its peer is in an
> inconsistent state.  One way to make this determination may be based on data
> activity for the session."

That looks good, the method for the determination would not impact
interoperability. I wonder if in Step2 above MAY should be used instead
of SHOULD.

> 
> 
> 
>>9.
>>      Step3: An endpoint sends Failover Session Query (FSQ) message,
>>      message type 21, to query the state of stale sessions on its peer.
>>      An FSQ message MUST include at least one Failover Session State
>>      (FSS) AVPs.  An endpoint MAY send another FSQ message on the
>>
>>***CP: It may be useful to make explicit what AVPs MAY/MUST in FSQ. MUST
>>***CP: Message Type, MAY Message Digest and MUST one or more Failover
>>***CP: Session State AVPs. Right? Additionally, in the FSS description,
>>***CP: it would be useful to specify that FSS can only be used in FSQ
>>***CP: and FSR messages.
> 
> It is a good idea to list the AVPs explicitly. In that regard:
> FSQ and FSR messages MUST have 'Message Type AVP' and 'FSS AVP'. They MAY send
> 'Random Vector AVP' and, in L2TPv3, 'Message Digest AVP'.
> 
> Similarly it could mention that 'FSS MUST be used only in FSQ and FSR messages'.

Great, thanks.
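
For what it is worth, the MUST rules agreed above could be checked along these
lines. This is a rough sketch under my own assumptions, not text for the draft:
messages and AVPs are represented by plain name strings, and check_avps is a
hypothetical helper. It only covers the two MUST rules (Message Type plus at
least one FSS in FSQ/FSR, and FSS nowhere else); the MAY AVPs are simply
allowed.

```python
from typing import List

FSS = "Failover Session State"
MESSAGE_TYPE = "Message Type"
FAILOVER_SESSION_MESSAGES = {"FSQ", "FSR"}


def check_avps(message: str, avp_names: List[str]) -> List[str]:
    """Return a list of rule violations; an empty list means no violation."""
    problems = []
    if message in FAILOVER_SESSION_MESSAGES:
        # FSQ and FSR MUST carry Message Type and one or more FSS AVPs.
        if MESSAGE_TYPE not in avp_names:
            problems.append("missing Message Type AVP")
        if FSS not in avp_names:
            problems.append("missing FSS AVP")
    elif FSS in avp_names:
        # FSS MUST be used only in FSQ and FSR messages.
        problems.append("FSS AVP outside FSQ/FSR")
    return problems
```

For example, check_avps("FSQ", ["Message Type"]) reports the missing FSS AVP,
and check_avps("ICRQ", ["Message Type", "Failover Session State"]) reports FSS
used outside FSQ/FSR.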

> 
> 
>>10.
>>      Before all sessions are synchronized using FSQ/FSR mechanism, if
>>      an endpoint receives an ICRQ for a session it believe is already
>>      in established state, it MUST respond to such ICRQ with a CDN,
>>      setting Assigned/Local Session ID AVP ([L2TPv2] section 4.4.4,
>>      [L2TPv3] section 5.4.4) to its local session id, and clear the
>>
>>***CP: The first line seems to be the first mention of FSR (the name
>>***CP: "Failover Session Response" is first introduced further down). It
>>***CP: would be useful to expand the name before and detail which AVPs
>>***CP: MAY/MUST in FSRs: MUST Message Type, MAY Message Digest and MUST
>>***CP: one or more Failover Session State AVPs. Correct?
> 
> Yes. Will change that.
> 
> 
>>11.
>>4.0 Security Considerations
>>   The failover mechanism described here leaves a some room (1 in 2^32)
>>   for an intruder to discover the old tunnel id of an existing tunnel
>>
>>***CP: This probability number seems only for L2TPv2. More importantly
>>***CP: though, this section should include what can be done to
>>***CP: minimize the exposure. At least, control message security
>>***CP: mechanisms must be considered, specifically for mutual endpoint
>>***CP: authentication (for tunnel establishment for v2 and also for
>>***CP: control message auth for v3). Additionally, an impersonator may
>>***CP: try to create a recovery tunnel clearing C and D bits to drop the
>>***CP: old tunnel/sessions.
> 
> Actually the above mentioned text should have been '1 in 2^32 for an intruder
> to discover the old tunnel and session id of an existing tunnel/session', and
> that was applicable only for L2TPv2.
> The new text should look something like:
> The failover mechanism described here leaves a room (1 in 2^16 for L2TPv2 and 1
> in 2^32 for L2TPv3) for an intruder to discover the old tunnel id, which could

Perfect. A minor comment: since the tunnel id is non-zero, the odds are
1 in (2^16 - 1) for v2 and 1 in (2^32 - 1) for v3.
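
The arithmetic, spelled out as a small sketch (the function name is just for
illustration): a 16-bit or 32-bit id field with zero excluded leaves one fewer
usable value than the full power of two.

```python
def nonzero_id_space(bits: int) -> int:
    """Number of usable ids in a field of the given width when 0 is excluded."""
    return 2 ** bits - 1


v2_ids = nonzero_id_space(16)  # 65535 possible L2TPv2 tunnel ids
v3_ids = nonzero_id_space(32)  # 4294967295 possible L2TPv3 ids

# Chance that a single blind guess hits one specific existing id.
v2_guess_probability = 1 / v2_ids
v3_guess_probability = 1 / v3_ids
```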

> be misused to fake a failover, resulting in a complete shutdown of an
> existing tunnel. To avoid this, Control channel authentication as indicated in
> section 2.2.1 should be used. L2TPv3 tunnels should also use
> the 'Digest AVP' to make it secure. Protecting L2TP with IPsec would also help
> secure the tunnels for failover.
> 
> 
>>12.
>>Appendix A
>>
>>***CP: The 4 Appendices are most useful !!!
>>
>>13.
>>      -  The mechanism should be backward compatible; i.e. it should not
>>      redefine existing behavior of [L2TP] compliant systems.
>>
>>***CP: There's no [L2TP] reference, is that v2 only or both?
> 
> I think both as the standard is coming after L2TPv2 and L2TPv3.
> 
> 
>>14.
>>Appendix B
>>   from recovering multiple tunnels in parallel. It also allows an
>>   endpoint from sending multiple FSQs to recover quickly.
>>
>>***CP: In addition, from including multiple FSSs in FSQ and FSR to
>>***CP: recover quickly.
> 
> Yes. Will add that.
> 
> 
>>I hope these help !
> 
> 
> Yes. Thanks for a thorough review.

;-)

Thanks,

--Carlos.
> 
> -- vipin
> 
> 
> 

-- 
--Carlos.
Escalation RTP - cisco Systems

_______________________________________________
L2tpext mailing list
L2tpext@ietf.org
https://www1.ietf.org/mailman/listinfo/l2tpext