Re: [Dots] Behavior when keep-alives fail (RE: Mirja Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and COMMENT)

Hi Tiru,

Thanks for the updates. I think there is one remaining issue on the use of ping/heart-beats (see also my other message). However, I believe all other discuss points have been addressed now. Thanks for that!

Mirja

> On 15. Jul 2019, at 14:40, Konda, Tirumaleswar Reddy <TirumaleswarReddy_Konda@McAfee.com> wrote:
> 
> Hi Mirja,
> 
> We have updated the draft to address your Discuss and comments (https://tools.ietf.org/rfcdiff?url2=draft-ietf-dots-signal-channel-35.txt).  Please have a look and approve the draft. 
> 
> Best Regards,
> -Tiru
> 
>> -----Original Message-----
>> From: Jon Shallow <supjps-ietf@jpshallow.com>
>> Sent: Thursday, July 4, 2019 12:46 AM
>> To: mohamed.boucadair@orange.com; 'Mirja Kuehlewind'
>> <ietf@kuehlewind.net>; draft-ietf-dots-signal-channel@ietf.org; Konda,
>> Tirumaleswar Reddy <TirumaleswarReddy_Konda@McAfee.com>;
>> dots@ietf.org; frank.xialiang@huawei.com; 'The IESG' <iesg@ietf.org>; dots-
>> chairs@ietf.org; 'Benjamin Kaduk' <kaduk@mit.edu>
>> Subject: RE: [Dots] Behavior when keep-alives fail (RE: Mirja Kühlewind's
>> Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and COMMENT)
>> 
>> Hi Mirja,
>> 
>> In one sense Med has answered your queries below separately, but I will try
>> to expand on them from my implementation perspective.
>> 
>> Regards
>> 
>> Jon
>> 
>>> -----Original Message-----
>>> From: Dots [mailto: dots-bounces@ietf.org] On Behalf Of Mirja
>>> Kuehlewind
>>> Sent: 03 July 2019 18:12
>>> To: Jon Shallow
>>> Cc: draft-ietf-dots-signal-channel@ietf.org; Konda, Tirumaleswar
>>> Reddy; dots@ietf.org; frank.xialiang@huawei.com; The IESG;
>>> dots-chairs@ietf.org; mohamed.boucadair@orange.com; Benjamin Kaduk
>>> Subject: Re: [Dots] Behavior when keep-alives fail (RE: Mirja
>>> Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with
>>> DISCUSS and COMMENT)
>>> 
>>> Hi Jon,
>>> 
>>> Thanks for the extended explanation. Please see questions inline.
>>> 
>>>> On 3. Jul 2019, at 16:04, Jon Shallow <supjps-ietf@jpshallow.com> wrote:
>>>> 
>>>> Hi Mirja,
>>>> 
>>>> As an implementer of DOTS, I have the following comments to try and
>>>> help explain what is going on with the heartbeats.
>>>> 
>>>> In the peace time scenario, it is assumed that the heartbeats function
>>>> through the network in both directions and that it is possible to
>>>> disable one or both of the heartbeat directions.
>>>> 
>>>> Heartbeats can use UDP or TCP depending on the session set up by the
>>>> initial connection (UDP is preferred over TCP).  With UDP, there is a
>>>> CoAP Ping (an Empty Confirmable message, code 0.00) and a CoAP RST
>>>> response.  With TCP, there is a CoAP Ping (7.02) and a CoAP Pong
>>>> (7.03) response.
>>>> For UDP, https://tools.ietf.org/html/rfc7252#section-4.8.2 defines
>>>> ACK_TIMEOUT, ACK_RANDOM_FACTOR, and MAX_RETRANSMIT, which map onto the
>>>> DOTS heartbeat parameters ack-timeout, ack-random-factor, and
>>>> max-retransmit.  These three parameters determine the elapsed time
>>>> before a transmission failure is declared.  There is an additional
>>>> DOTS parameter, missing-hb-allowed, which allows more than one
>>>> heartbeat loss, should that be needed, before determining that a DOTS
>>>> agent has really gone away (instead of, say, going through a reboot or
>>>> a restart cycle).
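
A minimal Python sketch of the heartbeat encodings described above, for illustration only. The UDP layout follows the 4-byte CoAP header (version, type, token length, code, message ID), and the TCP layout assumes a signaling message with no token, options, or payload. The function names and the message ID 0x1234 are hypothetical.

```python
import struct

COAP_VERSION = 1
TYPE_CON, TYPE_RST = 0, 3  # Confirmable and Reset message types


def udp_empty_message(msg_type: int, message_id: int) -> bytes:
    """Build a 4-byte CoAP-over-UDP header with TKL=0 and code 0.00 (Empty)."""
    first = (COAP_VERSION << 6) | (msg_type << 4)  # low nibble: token length 0
    return struct.pack("!BBH", first, 0x00, message_id)


def coap_code(cls: int, detail: int) -> int:
    """Pack a c.dd code (for example 7.02) into its one-byte form."""
    return (cls << 5) | detail


def tcp_signal(code: int) -> bytes:
    """Build a CoAP-over-TCP message with Len=0, TKL=0 and the given code."""
    return bytes([0x00, code])


ping = udp_empty_message(TYPE_CON, 0x1234)  # CoAP Ping over UDP
rst = udp_empty_message(TYPE_RST, 0x1234)   # matching RST reply (same message ID)
tcp_ping = tcp_signal(coap_code(7, 2))      # 7.02 Ping
tcp_pong = tcp_signal(coap_code(7, 3))      # 7.03 Pong

print(ping.hex(), rst.hex(), tcp_ping.hex(), tcp_pong.hex())
```

The RST reuses the message ID of the Empty Confirmable message, which is how the sender matches the reply to its ping.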
>>> 
>>> Why do you need this additional parameter? Why is max-retransmit not
>>> enough? Isn’t the result the same: you send more ping frames before
>>> you finally give up?
>> 
>> Well no - we are running in a failing network environment due to the DDoS
>> attacks and need to be as robust as possible before finally giving up.
>> 
>> In RFC7252 ACK_TIMEOUT, ACK_RANDOM_FACTOR, and MAX_RETRANSMIT
>> have the suggested values of 2, 1.5 and 4 respectively.  Following
>> https://tools.ietf.org/html/rfc7252#section-4.8.2, the time between the
>> first transmission and the last retransmission with these values is up
>> to 45 seconds (MAX_TRANSMIT_SPAN), which means that the request times
>> out after 93 seconds (MAX_TRANSMIT_WAIT), at the point when
>> retransmission MAX_RETRANSMIT+1 would otherwise be sent.
>> 
>> This final wait of 48 seconds (93 sec - 45 sec) is bigger than a NAT
>> refresh timeout that we are comfortable with, so the recommended DOTS
>> value for max-retransmit is 3.  We have added the additional
>> missing-hb-allowed parameter to increase the timeout before failure is
>> determined.
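
The arithmetic above can be reproduced with the derived-time formulas from RFC 7252 section 4.8.2. This is only a sketch; the function names are made up for illustration, and the second parameter set uses max-retransmit 3 as stated above with the other two values left at the RFC 7252 suggestions.

```python
def max_transmit_span(ack_timeout: float, ack_random_factor: float,
                      max_retransmit: int) -> float:
    """Worst-case time from the first transmission to the last retransmission."""
    return ack_timeout * ((2 ** max_retransmit) - 1) * ack_random_factor


def max_transmit_wait(ack_timeout: float, ack_random_factor: float,
                      max_retransmit: int) -> float:
    """Worst-case time from the first transmission until the sender gives up."""
    return ack_timeout * ((2 ** (max_retransmit + 1)) - 1) * ack_random_factor


# RFC 7252 suggested values: ACK_TIMEOUT=2, ACK_RANDOM_FACTOR=1.5, MAX_RETRANSMIT=4
rfc_span = max_transmit_span(2, 1.5, 4)
rfc_wait = max_transmit_wait(2, 1.5, 4)
final_wait = rfc_wait - rfc_span

# Same timers with max-retransmit lowered to 3
dots_span = max_transmit_span(2, 1.5, 3)
dots_wait = max_transmit_wait(2, 1.5, 3)

print(rfc_span, rfc_wait, final_wait)  # 45.0 93.0 48.0
print(dots_span, dots_wait)            # 21.0 45.0
```

With max-retransmit set to 3 the worst-case give-up time drops from 93 to 45 seconds, and the longest silent gap before giving up drops from 48 to 24 seconds.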
>> 
>> However, as mentioned below, this determined failure may not be sufficient
>> to cause a session to be closed down as there could be network losses due to
>> the DDoS attacks.
>> 
>>> 
>>>> 
>>>> When handling an attack scenario, there is a good chance that the
>>>> inbound (DOTS server to DOTS client) data path is flooded/overloaded
>>>> and hence suffers packet loss (though this is not the case in all
>>>> DDoS scenarios).
>>>> 
>>>> A significant purpose of the DOTS client generating a heartbeat is to
>>>> make sure that any NAT devices in the path maintain their NAT
>>>> associations and allow any returning responses (which could be
>>>> unsolicited if an Observe of a mitigation is active).
>>>> 
>>>> Even in the attack scenario, the DOTS server will see these heartbeat
>>>> messages, but can only deduce that the connection from the DOTS client
>>>> to the DOTS server is good - but cannot make any assumptions about
>>>> traffic flowing in the other direction.
>>> 
>>> However, if the client never receives a CoAP RST or CoAP Pong, it
>>> will sooner or later give up and stop sending ping messages.  In this
>>> case the server will no longer receive pings and can decide to send
>>> its own pings.  What is important is that the idle timeout is known to
>>> both ends.
>> 
>> Agreed, but as mentioned below the DOTS client must continue to transmit if
>> there has been a mitigation request (i.e. running in under attack mode and
>> networks could be flaky).
>> 
>> The retransmission parameters are negotiated between the DOTS agents at
>> client session startup
>> ( https://tools.ietf.org/html/draft-ietf-dots-signal-channel-34#section-4.5.2 )
>> so both ends know what the timeouts are.
>> 
>>> 
>>> Further, if you really want to be sure whether the RST was received or
>>> not, I’d recommend using your own application-level ping that
>>> indicates whether the ping is a retransmission or not.
>> 
>> Well actually, while the CoAP layer handles the transmission of the
>> Empty/RST, the ping is initiated by the DOTS layer, not the CoAP layer
>> (and the DOTS layer is told whether a RST was received or a timeout
>> occurred), so the DOTS application is in control of what is going on
>> here.
>> 
>> The Ping is not handled in the same way as, say, TCP keep-alive packets,
>> which are handled completely by the TCP layer.
>> 
>>> 
>>> Detecting one-way congestion is a different function than keep-alive
>>> testing, and it is better to use an explicit mechanism for that than
>>> trying to infer something from a mechanism that was designed for a
>>> different purpose.
>>> 
>> 
>> Agreed, but the DOTS layer is triggering the CoAP ping to do this work for the
>> one-way congestion testing.
>> 
>>>> 
>>>> However, the DOTS client may not get a ping response due to the
>>>> flooded inbound pipe.  If the DOTS client has initiated a mitigation
>>>> request, then it is unsafe for the DOTS client to close down the
>>>> session - it will need to refresh the mitigation requests / create new
>>>> ones even if the mitigation is not being that effective as traffic can
>>>> still flow to the server.  It is possible that the DOTS server has
>>>> just restarted - hence the requirement to try and open up a new
>>>> session in parallel.
>>>> 
>>>> If the DOTS server also initiates heartbeat messages, sees the DOTS
>>>> client pings, but does not see any response to the DOTS server ping,
>>>> the DOTS server can now deduce that the outbound pipe is good, but the
>>>> inbound pipe to the DOTS client is failing.  The DOTS server then does
>>>> not need to close down the session as it will be expecting additional
>>>> mitigation requests from the DOTS client - even though the DOTS server
>>>> CoAP Ping is failing.
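
The server-side reasoning above could be sketched roughly as follows. The enum, the function names, and the two boolean inputs are assumptions made for illustration; they are not taken from the draft or from any implementation, and the treatment of the "nothing heard" case is a guess at one reasonable policy.

```python
from enum import Enum


class PathView(Enum):
    BOTH_DIRECTIONS_OK = 1
    SERVER_TO_CLIENT_FAILING = 2
    NOTHING_HEARD = 3


def server_view(client_ping_seen: bool, own_ping_answered: bool) -> PathView:
    """Classify the path from what the DOTS server has observed recently."""
    if own_ping_answered:
        # A reply to the server's own ping proves a full round trip.
        return PathView.BOTH_DIRECTIONS_OK
    if client_ping_seen:
        # Client pings arrive, but replies to the server's ping do not:
        # the direction toward the client is the one that is failing.
        return PathView.SERVER_TO_CLIENT_FAILING
    return PathView.NOTHING_HEARD


def keep_session(view: PathView) -> bool:
    """Assumed policy: only total silence is a candidate for loss handling."""
    return view is not PathView.NOTHING_HEARD


view = server_view(client_ping_seen=True, own_ping_answered=False)
print(view.name, keep_session(view))
```

The point of the sketch is that a failed server ping alone does not decide the outcome; it is combined with whether client heartbeats are still arriving.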
>>>> 
>>>> Furthermore, if the DOTS server initiates its CoAP Ping on receipt of
>>>> the DOTS client CoAP Ping, then there is a good chance that the NAT
>>>> sessions are "warm" on any intervening NAT devices.  If the DOTS
>>>> server initiates the CoAP Ping on its own cycle, there is a chance
>>>> that it may not get through and confuse the logic.
>>> 
>>> This also sounds to me like you should rather design your own testing
>>> during mitigation in the DOTS layer, e.g. don’t use the CoAP Ping, but
>>> send a Non-confirmable CoAP message which contains a DOTS-layer "ping"
>>> and an indication of whether a DOTS-layer "pong" has been received or
>>> not.
>> 
>> As stated above, the CoAP ping is not initiated by the CoAP layer like TCP
>> keep-alives, but is triggered by the DOTS application by sending the CoAP
>> Ping packet - which in effect is what you are suggesting (apart from the
>> non-confirmable aspect).  If using TCP, then what you describe is correct
>> (except Confirmable/Non-Confirmable are out of the picture).
>> 
>> The CoAP Empty packet must be confirmable to solicit a RST response (Table
>> 1: RFC7252).  I would rather not move away from RFC7252 here.
>> 
>>> 
>>> However, note that this still might not work with TCP, as messages
>>> cannot be transmitted unreliably, and untransmitted/unacknowledged
>>> application-layer data will block all other traffic on the same
>>> connection at some point because TCP will try to retransmit and shrink
>>> the congestion window to the minimum.
>> 
>> TCP CoAP Ping/Pong does work as we are initiating it from the DOTS layer.
>> 
>> ~Jon
>> 
>>> 
>>> Mirja
>>> 
>>> 
>>>> 
>>>> Regards
>>>> 
>>>> Jon
>>>> 
>>>>> -----Original Message-----
>>>>> From: Dots [mailto:dots-bounces@ietf.org] On Behalf Of supjps-
>>> mohamed.boucadair@orange.com
>>>>> Sent: 03 July 2019 14:46
>>>>> To: Mirja Kuehlewind
>>>>> Cc: draft-ietf-dots-signal-channel@ietf.org; Konda, Tirumaleswar
>>>>> Reddy; dots@ietf.org; frank.xialiang@huawei.com; The IESG; dots-
>>> chairs@ietf.org;
>>>>> Benjamin Kaduk
>>>>> Subject: Re: [Dots] Behavior when keep-alives fail (RE: Mirja
>>>>> Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with
>>>>> DISCUSS and
>>> COMMENT)
>>>>> 
>>>>> Re-,
>>>>> 
>>>>> Please see inline.
>>>>> 
>>>>> Cheers,
>>>>> Med
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
>>>>>> Sent: Wednesday, July 3, 2019 14:46
>>>>>> To: BOUCADAIR Mohamed TGI/OLN
>>>>>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk;
>>>>>> draft-ietf-dots-signal-channel@ietf.org; frank.xialiang@huawei.com;
>>>>>> dots@ietf.org; The IESG; dots-chairs@ietf.org
>>>>>> Subject: Re: Behavior when keep-alives fail (RE: [Dots] Mirja
>>>>>> Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with
>>>>>> DISCUSS and COMMENT)
>>>>>> 
>>>>>> Hi Med,
>>>>>> 
>>>>>> See below.
>>>>>> 
>>>>>>> On 3. Jul 2019, at 12:48, <mohamed.boucadair@orange.com>
>>>>>> <mohamed.boucadair@orange.com> wrote:
>>>>>>> 
>>>>>>> Mirja,
>>>>>>> 
>>>>>>>> Actually, to my understanding this will not work. Both TCP
>>>>>>>> heartbeat and CoAP Ping are transmitted reliably. If you don’t
>>>>>>>> receive an ack for these transmissions, you are not able to send
>>>>>>>> any additional messages and can only close the connection.
>>>>>>> 
>>>>>>> This behavior is implemented and tested between two implementations.
>>>>>>> The exact procedure is described in the draft, fwiw:
>>>>>>> 
>>>>>>> ==
>>>>>>> When a Confirmable "CoAP Ping" is sent, and if there is no
>>>>>>> response, the "CoAP Ping" is retransmitted max-retransmit number
>>>>>>> of times by the CoAP layer using an initial timeout set to a
>>>>>>> random duration between ack-timeout and
>>>>>>> (ack-timeout*ack-random-factor) and exponential back-off between
>>>>>>> retransmissions.  By choosing the recommended transmission
>>>>>>> parameters, the "CoAP Ping" will timeout after 45 seconds.  If
>>>>>>> the DOTS agent does not receive any response from the peer DOTS
>>>>>>> agent for 'missing-hb-allowed' number of consecutive "CoAP Ping"
>>>>>>> Confirmable messages, it concludes that the DOTS signal channel
>>>>>>> session is disconnected.  A DOTS client MUST NOT transmit a "CoAP
>>>>>>> Ping" while waiting for the previous "CoAP Ping" response from the
>>>>>>> same DOTS server.
>>>>>>> ==
>>>>>> 
>>>>>> First, can you explain why you need 'missing-hb-allowed'?
>>>>> 
>>>>> [Med] Because we need to make sure this is a "real/durable" defunct
>>>>> session, not a false positive. For example, a false positive would
>>>>> have implications on the server, as it may erroneously start
>>>>> automated mitigations (because it concludes the session is lost).
>>>>> 
>>>>>> If the ping is transmitted reliably, one "missed" should be enough
>>>>>> to conclude that the session is disconnected.
>>>>> 
>>>>> [Med] Hmm, under some DDoS attacks, both endpoints may be
>>>>> sending/replying to confirmable ping messages, but the reply may
>>>>> get dropped. The session is not disconnected in such case.
>>>>> 
>>>>>> 
>>>>>> Yes, as CoAP Ping is used, the agent should not only conclude that
>>>>>> the DOTS signal session is disconnected but also the CoAP session,
>>>>>> and not send any further CoAP messages.
>>>>>> 
>>>>>> If you want to send further UDP datagrams, you should send them
>>>>>> unreliably and not more often than one per 3 seconds.
>>>>>> 
>>>>>> Mirja
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Med
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
>>>>>>>> Sent: Wednesday, July 3, 2019 12:26
>>>>>>>> To: BOUCADAIR Mohamed TGI/OLN
>>>>>>>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk;
>>>>>>>> draft-ietf-dots-signal-channel@ietf.org; frank.xialiang@huawei.com;
>>>>>>>> dots@ietf.org; The IESG; dots-chairs@ietf.org
>>>>>>>> Subject: Re: Behavior when keep-alives fail (RE: [Dots] Mirja
>>>>>>>> Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with
>>>>>>>> DISCUSS and COMMENT)
>>>>>>>> 
>>>>>>>> Hi Med,
>>>>>>>> 
>>>>>>>> See below.
>>>>>>>> 
>>>>>>>>> On 3. Jul 2019, at 09:53, mohamed.boucadair@orange.com wrote:
>>>>>>>>> 
>>>>>>>>> Hi Mirja,
>>>>>>>>> 
>>>>>>>>> (Focusing on individual issues)
>>>>>>>>> 
>>>>>>>>> Please see inline.
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Med
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
>>>>>>>>>> Sent: Tuesday, July 2, 2019 16:00
>>>>>>>>>> To: BOUCADAIR Mohamed TGI/OLN
>>>>>>>>>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk;
>>>>>>>>>> draft-ietf-dots-signal-channel@ietf.org;
>>>>>>>>>> frank.xialiang@huawei.com; dots@ietf.org; The IESG;
>>>>>>>>>> dots-chairs@ietf.org
>>>>>>>>>> Subject: Re: [Dots] Mirja Kühlewind's Discuss on
>>>>>>>>>> draft-ietf-dots-signal-channel-31: (with DISCUSS and COMMENT)
>>>>>>>>>> 
>>>>>>>>> ...
>>>>>>>>>>>>>>> 10) The document should more explicitly provide more guidance
>>>>>>>>>>>>>>> about when a client should start a session and what should be
>>>>>>>>>>>>>>> done (from the client side) if a session is detected as
>>>>>>>>>>>>>>> inactive (other than during migration which is discussed a bit
>>>>>>>>>>>>>>> in 4.7). Is the assumption to have basically permanently an
>>>>>>>>>>>>>>> active session or connect for migration and configuration
>>>>>>>>>>>>>>> requests separately at a time?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I think there was some clarifying text added, but please
>>>>>>>>>>>>>> confirm if you think it is sufficient.
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry, don’t see where text was added. Can you provide a
>>>>>>>>>>>> pointer?
>>>>>>>>>>> 
>>>>>>>>>>> [Med] We do have this text, for example:
>>>>>>>>>>> 
>>>>>>>>>>> The DOTS signal channel can be established between two DOTS
>>>>>>>>>>> agents prior or during an attack.  The DOTS signal channel is
>>>>>>>>>>> initiated by the DOTS client.  The DOTS client can then
>>>>>>>>>>> negotiate, configure, and retrieve the DOTS signal channel
>>>>>>>>>>> session behavior with its DOTS peer (Section 4.5).  Once the
>>>>>>>>>>> signal channel is established, the DOTS agents periodically
>>>>>>>>>>> send heartbeats to keep the channel active (Section 4.7).  At
>>>>>>>>>>> any time, the DOTS client may send a mitigation request message
>>>>>>>>>>> (Section 4.4) to a DOTS server over the active signal channel.
>>>>>>>>>>> While mitigation is active (because of the higher likelihood of
>>>>>>>>>>> packet loss during a DDoS attack), the DOTS server periodically
>>>>>>>>>>> sends status messages to the client, including basic mitigation
>>>>>>>>>>> feedback details.  Mitigation remains active until the DOTS
>>>>>>>>>>> client explicitly terminates mitigation, or the mitigation
>>>>>>>>>>> lifetime expires.  Also, the DOTS server may rely on the signal
>>>>>>>>>>> channel session loss to trigger mitigation for pre-configured
>>>>>>>>>>> mitigation requests (if any).
>>>>>>>>>> 
>>>>>>>>>> Okay, thanks for the pointer. What I think is missing are some
>>>>>>>>>> sentences about what the client (or server) should do if the
>>>>>>>>>> keep-alive fails. Try to reconnect directly or just with the
>>>>>>>>>> next request or whatever. Basically who should reconnect and
>>>>>>>>>> when?
>>>>>>>>> 
>>>>>>>>> [Med] This is discussed in detail in Section 4.7, in particular.
>>>>>>>>> 
>>>>>>>>> As a generic rule, it is always the client who connects (see the
>>>>>>>>> excerpt above).
>>>>>>>>> 
>>>>>>>>> The server may use the failure to initiate automated mitigation
>>>>>>>>> (see the excerpt above). More details are provided in other
>>>>>>>>> sections.
>>>>>>>>> 
>>>>>>>>> There are several heartbeat failure cases to be handled by the
>>>>>>>>> client. Examples from 4.7 are provided below, fwiw:
>>>>>>>>> 
>>>>>>>>>   The DOTS client MUST NOT consider the DOTS signal channel
>>>>>>>>>   session terminated even after a maximum 'missing-hb-allowed'
>>>>>>>>>   threshold is reached.  The DOTS client SHOULD keep on using the
>>>>>>>>>   current DOTS signal channel session to send heartbeat requests
>>>>>>>>>   over it, so that the DOTS server knows the DOTS client has not
>>>>>>>>>   disconnected the DOTS signal channel session.
>>>>>>>>> 
>>>>>>>>>   After the maximum 'missing-hb-allowed' threshold is reached,
>>>>>>>>>   the DOTS client SHOULD try to resume the (D)TLS session.  The
>>>>>>>>>   DOTS client SHOULD send mitigation requests over the current
>>>>>>>>>   DOTS signal channel session, and in parallel, for example, try
>>>>>>>>>   to resume the (D)TLS session or use 0-RTT mode in DTLS 1.3 to
>>>>>>>>>   piggyback the mitigation request in the ClientHello message.
>>>>>>>>> 
>>>>>>>>>   As soon as the link is no longer saturated, if traffic from the
>>>>>>>>>   DOTS server reaches the DOTS client over the current DOTS signal
>>>>>>>>>   channel session, the DOTS client can stop (D)TLS session
>>>>>>>>>   resumption or if (D)TLS session resumption is successful then
>>>>>>>>>   disconnect the current DOTS signal channel session.
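
The client-side rules quoted above can be sketched as a small state tracker. This is a rough illustration under stated assumptions: the class name, method names, and action strings are hypothetical and only mirror the excerpt's wording; no real DOTS or (D)TLS library is involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientHeartbeatState:
    missing_hb_allowed: int   # negotiated threshold
    missed: int = 0           # consecutive unanswered heartbeats
    resuming: bool = False    # parallel (D)TLS resumption in progress

    def on_heartbeat(self, answered: bool) -> List[str]:
        """Return the actions to take once one heartbeat attempt completes."""
        if answered:
            # Traffic from the server is reaching the client again.
            self.missed = 0
            if self.resuming:
                self.resuming = False
                return ["stop (D)TLS session resumption"]
            return []
        self.missed += 1
        # The session is never declared terminated by the client here.
        actions = ["keep sending heartbeats on the current session"]
        if self.missed >= self.missing_hb_allowed and not self.resuming:
            self.resuming = True
            actions.append("try to resume the (D)TLS session in parallel")
        return actions

    def on_resumption_success(self) -> List[str]:
        """Resumption won the race: switch over and drop the old session."""
        self.resuming = False
        self.missed = 0
        return ["disconnect the old DOTS signal channel session"]


state = ClientHeartbeatState(missing_hb_allowed=2)
first = state.on_heartbeat(False)
second = state.on_heartbeat(False)
recovered = state.on_heartbeat(True)
print(first, second, recovered)
```

Reaching the threshold only triggers the parallel resumption attempt; the current session keeps carrying heartbeats and mitigation requests until one of the two paths proves usable.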
>>>>>>>>> 
>>>>>>>>> Do you think additional text is needed?
>>>>>>>> 
>>>>>>>> Actually, to my understanding this will not work. Both TCP
>>>>>>>> heartbeat and CoAP Ping are transmitted reliably. If you don’t
>>>>>>>> receive an ack for these transmissions, you are not able to send
>>>>>>>> any additional messages and can only close the connection.
>>>>>>>> 
>>>>>>>> Mirja
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Dots mailing list
>>>>> Dots@ietf.org
>>>>> https://www.ietf.org/mailman/listinfo/dots
>>>> 
>>>> 
>>> 
>