Re: [Dots] Behavior when keep-alives fail (RE: Mirja Kühlewind's Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and COMMENT)

Hi Mirja,

In one sense Med has answered your queries below separately, but I will try to expand on them from my implementation perspective.

Regards

Jon

> -----Original Message-----
> From: Dots [mailto:dots-bounces@ietf.org] On Behalf Of Mirja
> Kuehlewind
> Sent: 03 July 2019 18:12
> To: Jon Shallow
> Cc: draft-ietf-dots-signal-channel@ietf.org; Konda, Tirumaleswar Reddy;
> dots@ietf.org; frank.xialiang@huawei.com; The IESG; dots-chairs@ietf.org;
> mohamed.boucadair@orange.com; Benjamin Kaduk
> Subject: Re: [Dots] Behavior when keep-alives fail (RE: Mirja Kühlewind's
> Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and COMMENT)
> 
> Hi Jon,
> 
> Thanks for the extended explanation. Please see questions inline.
> 
> > On 3. Jul 2019, at 16:04, Jon Shallow <supjps-ietf@jpshallow.com> wrote:
> >
> > Hi Mirja,
> >
> > As an implementer of DOTS I have the following comments to make to try
> and help understand what is going on with the Heartbeats
> >
> > In the peace time scenario, it is assumed that the heartbeats function
> through the network in both directions and that it is possible to disable one
> or both of the heartbeat directions.
> >
> > Heartbeats can use UDP or TCP depending on the session set up based on
> the initial connection (UDP is preferred over TCP).  With UDP, there is a CoAP
> Ping (Empty request CON 0.00) and a CoAP RST response.  With TCP, there is
> a Coap Ping (7.02) and a Coap Pong (7.03) response.
> > For UDP, https://tools.ietf.org/html/rfc7252#section-4.8.2 defines the
> ACK_TIMEOUT, ACK_RANDOM_FACTOR, and MAX_RETRANSMIT which map
> onto the DOTS heartbeat parameters ack-timeout, ack-random-factor and
> max-retransmit.  These 3 parameters determine the elapsed time when
> there has been transmission failure.   There is an additional DOTS parameter
> missing-hb-allowed to support more than one heartbeat loss should it be
> needed before determining that a DOTS agent has really gone away (instead
> of, say, going through a reboot or a restart cycle).
> 
> Why do you need this additional parameter? Why is max-retransmit not
> enough? Isn’t the result the same: you send more ping frames before you
> finally give up?

Well no - we are running in a failing network environment due to the DDoS attacks and need to be as robust as possible before finally giving up.

In RFC7252, ACK_TIMEOUT, ACK_RANDOM_FACTOR, and MAX_RETRANSMIT have the suggested values of 2, 1.5 and 4 respectively.  Following https://tools.ietf.org/html/rfc7252#section-4.8.2, the time between the first transmission and the last retransmission with these values (MAX_TRANSMIT_SPAN) is 45 seconds, which means that the request times out after 93 seconds (MAX_TRANSMIT_WAIT), the point at which retransmission MAX_RETRANSMIT+1 would have been sent.

This final wait of 48 seconds (93sec - 45sec) is longer than a NAT refresh timeout that we are comfortable with, so the recommended DOTS value for max-retransmit is 3, which brings the overall timeout down to 45 seconds.  We have added the additional missing-hb-allowed parameter to extend the time before failure is determined.
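
To make the arithmetic above concrete, here is a minimal sketch of the RFC7252 section 4.8.2 derived-time formulas (MAX_TRANSMIT_SPAN and MAX_TRANSMIT_WAIT).  The function names are mine, for illustration only, and are not taken from any DOTS implementation.  For the max-retransmit 3 case I am assuming that ack-timeout and ack-random-factor stay at the RFC7252 values of 2 and 1.5.

```python
def max_transmit_span(ack_timeout, ack_random_factor, max_retransmit):
    """Longest time from the first transmission to the last retransmission."""
    return ack_timeout * (2 ** max_retransmit - 1) * ack_random_factor


def max_transmit_wait(ack_timeout, ack_random_factor, max_retransmit):
    """Longest time from the first transmission until the sender gives up."""
    return ack_timeout * (2 ** (max_retransmit + 1) - 1) * ack_random_factor


def final_wait(ack_timeout, ack_random_factor, max_retransmit):
    """Silent interval after the last retransmission before giving up."""
    args = (ack_timeout, ack_random_factor, max_retransmit)
    return max_transmit_wait(*args) - max_transmit_span(*args)


if __name__ == "__main__":
    # Assumed inputs: ack-timeout 2 s and ack-random-factor 1.5 in both cases.
    for label, retransmits in (("RFC 7252 default", 4), ("DOTS recommended", 3)):
        span = max_transmit_span(2, 1.5, retransmits)
        wait = max_transmit_wait(2, 1.5, retransmits)
        print(f"{label}: span={span}s wait={wait}s final wait={wait - span}s")
```

With max-retransmit 4 this gives 45, 93 and 48 seconds; with max-retransmit 3 it gives 21, 45 and 24 seconds.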

However, as mentioned below, this determined failure may not be sufficient to cause a session to be closed down as there could be network losses due to the DDoS attacks.

> 
> >
> > When handling an attack scenario, there is a good chance that the inbound
> (DOTS server to DOTS client) data path is flooded /overloaded and hence
> packet loss (but not the case with all DDoS type scenarios).
> >
> > A significant purpose of the DOTS client generating a heartbeat is to make
> sure that any NAT devices in the path maintain their NAT associations and
> allow any returning responses (which could be unsolicited if the observe of a
> mitigation is active).
> >
> > Even in the attack scenario, the DOTS server will see these heartbeat
> messages, but can only deduce that the connection from the DOTS client to
> the DOTS server is good - but cannot make any assumptions about traffic
> flowing in the other direction.
> 
> However, if the client never receives a Coap RST or Coap pong, it will
> sooner or later give up and stop sending ping messages. In this case
> the server will no longer receive pings and can decide to send its own pings.
> What is important is that the idle timeout is known to both ends.

Agreed, but as mentioned below the DOTS client must continue to transmit if there has been a mitigation request (i.e. it is running in under-attack mode, where networks could be flaky).

The retransmission parameters are negotiated between the DOTS agents at client session startup ( https://tools.ietf.org/html/draft-ietf-dots-signal-channel-34#section-4.5.2 ) so both ends know what the timeouts are.

> 
> Further, if you really want to be sure whether the RST was received or not, I’d
> recommend using your own application ping that indicates whether the ping is a
> retransmission or not.

Well actually, while the CoAP layer handles the transmission of the Empty/RST, the ping is initiated by the DOTS layer, not the CoAP layer, and the DOTS layer is told whether a RST or a timeout occurred, so the DOTS application is in control of what is going on here.

The Ping is not handled in the same way as, say, TCP keep-alive packets, which are handled completely by the TCP layer.
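
As a rough illustration of the DOTS layer being in control, the sketch below has the DOTS layer count the outcomes reported back to it by the CoAP layer.  All of the names here (PingOutcome, HeartbeatMonitor, the returned strings) are hypothetical, and the decision taken once missing-hb-allowed is reached is my simplification of the draft text, not a copy of any implementation.

```python
from dataclasses import dataclass
from enum import Enum


class PingOutcome(Enum):
    RST = "rst"          # peer answered the Empty CON with a Reset
    TIMEOUT = "timeout"  # CoAP layer exhausted its retransmissions


@dataclass
class HeartbeatMonitor:
    missing_hb_allowed: int          # negotiated DOTS parameter
    mitigation_active: bool = False  # True once a mitigation request is in place
    missed: int = 0                  # consecutive heartbeats without a response

    def on_ping_result(self, outcome: PingOutcome) -> str:
        """Called by the DOTS layer each time the CoAP layer reports back."""
        if outcome is PingOutcome.RST:
            self.missed = 0
            return "ok"
        self.missed += 1
        if self.missed < self.missing_hb_allowed:
            return "retry"
        if self.mitigation_active:
            # Under attack: keep using the session and try resumption in parallel.
            return "keep-session-and-resume"
        return "disconnected"
```

The point of the sketch is that the CoAP layer only reports RST or timeout; what to do about it is a DOTS decision.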

> 
> Detecting one-way congestion is a different function than keep-alive testing
> and it is better to use an explicit mechanism for that than trying to infer
> something from a mechanism that was designed for a different purpose.
> 

Agreed, but the DOTS layer is triggering the CoAP ping to do this work for the one-way congestion testing.
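
The deduction that a DOTS agent can make from its own pings and those of its peer could be sketched as follows.  This is only my simplification of the reasoning in this thread; the function name and the returned descriptions are made up for illustration.

```python
def deduce_paths(peer_pings_seen: bool, own_pings_answered: bool) -> str:
    """Classify the two directions of the path from one agent's point of view.

    peer_pings_seen:    heartbeats from the peer are arriving here.
    own_pings_answered: our own heartbeats are getting a response.
    """
    if own_pings_answered:
        # A response means our ping got there and the reply got back.
        return "both directions working"
    if peer_pings_seen:
        # The peer can reach us, so the loss is towards the peer.
        return "peer-to-us working, us-to-peer failing"
    return "no evidence that either direction is working"
```

In the second case the session is still usable in one direction, so there is no need to close it down.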

> >
> > However, the DOTS client may not get a ping response due to the flooded
> inbound pipe.  If the DOTS client has initiated a mitigation request, then it is
> unsafe for the DOTS client to close down the session - it will need to refresh
> the mitigation requests / create new ones even if the mitigation is not being
> that effective as traffic can still flow to the server.  It is possible that the
> DOTS server has just restarted - hence the requirement to try and open up a
> new session in parallel.
> >
> > If the DOTS server also initiates heartbeat messages, sees the DOTS client
> pings, but does not see any response to the DOTS server ping, the DOTS
> server can now deduce that the outbound pipe is good, but the inbound
> pipe to the DOTS client is failing.  The DOTS server then does not need to
> close down the session as it will be expecting additional mitigation requests
> from the DOTS client - even though the DOTS server Coap Ping is failing.
> >
> > Furthermore, if the DOTS server initiates its CoAP ping on receipt of the
> DOTS client Coap Ping, then there is a good chance that the NAT sessions are
> "warm" on any intervening NAT devices.  If the DOTS server initiates the
> Coap Ping on its own cycle, there is a chance that it may not get through and
> confuse the logic.
> 
> This also sounds to me that you should rather design your own testing
> during mitigation in the dots layer, e.g. don’t use the Coap Ping, but send a
> non-confirmable Coap message which contains a dots layer “ping" and an
> indication if a dots-layer “pong" has been received or not.

As stated above, the CoAP ping is not initiated by the CoAP layer like TCP keep-alives, but is triggered by the DOTS application by sending the Coap Ping packet - which in effect is what you are suggesting (apart from the non-confirmable aspect).  If using TCP, then what you describe is correct (except Confirmable/Non-Confirmable are out of the picture).

The CoAP Empty packet must be confirmable to solicit a RST response (Table 1: RFC7252).  I would rather not move away from RFC7252 here.
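
For reference, the UDP exchange is tiny on the wire.  The sketch below builds the 4-byte Empty Confirmable message and checks for the matching Reset, using the RFC7252 header layout (version, type, token length, code, message ID).  The helper names are hypothetical, and there is no socket handling or retransmission here.

```python
import struct

COAP_VERSION = 1
TYPE_CON, TYPE_NON, TYPE_ACK, TYPE_RST = 0, 1, 2, 3
CODE_EMPTY = 0x00  # code 0.00


def empty_message(msg_type: int, message_id: int) -> bytes:
    """Build a 4-byte Empty message (no token, no options, no payload)."""
    first = (COAP_VERSION << 6) | (msg_type << 4)  # token length (TKL) is 0
    return struct.pack("!BBH", first, CODE_EMPTY, message_id)


def is_reset_for(ping: bytes, reply: bytes) -> bool:
    """True if reply is an Empty Reset echoing the ping's message ID."""
    if len(ping) != 4 or len(reply) != 4:
        return False
    _, _, ping_mid = struct.unpack("!BBH", ping)
    first, code, mid = struct.unpack("!BBH", reply)
    return (
        first >> 6 == COAP_VERSION
        and (first >> 4) & 0x3 == TYPE_RST
        and first & 0xF == 0
        and code == CODE_EMPTY
        and mid == ping_mid
    )


ping = empty_message(TYPE_CON, 0x1234)  # the "CoAP Ping"
rst = empty_message(TYPE_RST, 0x1234)   # the expected response
```

Here ping is the bytes 40 00 12 34 and rst is 70 00 12 34.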

> 
> However, note that this still might not work with TCP as messages cannot be
> transmitted unreliably and not-transmitted/no-acked application layer data
> will block all other traffic on the same connection at some point because TCP
> will try to retransmit and shrink the congestion window to the minimum.

TCP CoAP Ping/Pong does work as we are initiating it from the DOTS layer.
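
For comparison, here is a sketch of the TCP framing, assuming the RFC 8323 layout with no token, options or payload, so that each message is just a length/token-length byte followed by the code byte.  The helper name is hypothetical.

```python
def coap_tcp_empty_signal(code_class: int, code_detail: int) -> bytes:
    """Build a CoAP-over-TCP signaling message with no token, options or payload."""
    len_tkl = 0x00  # Len = 0 (nothing after the code), TKL = 0 (no token)
    code = (code_class << 5) | code_detail  # c.dd packed as 3 bits + 5 bits
    return bytes([len_tkl, code])


PING = coap_tcp_empty_signal(7, 2)  # 7.02 Ping
PONG = coap_tcp_empty_signal(7, 3)  # 7.03 Pong
```

As with the UDP case, the DOTS layer decides when to send the Ping and is told whether the Pong arrived.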

~Jon

> 
> Mirja
> 
> 
> >
> > Regards
> >
> > Jon
> >
> >> -----Original Message-----
> >> From: Dots [mailto:dots-bounces@ietf.org] On Behalf Of supjps-
> mohamed.boucadair@orange.com
> >> Sent: 03 July 2019 14:46
> >> To: Mirja Kuehlewind
> >> Cc: draft-ietf-dots-signal-channel@ietf.org; Konda, Tirumaleswar Reddy;
> >> dots@ietf.org; frank.xialiang@huawei.com; The IESG; dots-
> chairs@ietf.org;
> >> Benjamin Kaduk
> >> Subject: Re: [Dots] Behavior when keep-alives fail (RE: Mirja Kühlewind's
> >> Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and
> COMMENT)
> >>
> >> Re-,
> >>
> >> Please see inline.
> >>
> >> Cheers,
> >> Med
> >>
> >>> -----Original Message-----
> >>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
> >>> Sent: Wednesday 3 July 2019 14:46
> >>> To: BOUCADAIR Mohamed TGI/OLN
> >>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk; draft-ietf-dots-signal-
> >>> channel@ietf.org; frank.xialiang@huawei.com; dots@ietf.org; The IESG;
> >>> dots-chairs@ietf.org
> >>> Subject: Re: Behavior when keep-alives fail (RE: [Dots] Mirja Kühlewind's
> >>> Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and
> >> COMMENT)
> >>>
> >>> Hi Med,
> >>>
> >>> See below.
> >>>
> >>>> On 3. Jul 2019, at 12:48, <mohamed.boucadair@orange.com>
> >>> <mohamed.boucadair@orange.com> wrote:
> >>>>
> >>>> Mirja,
> >>>>
> >>>>> Actually to my understanding this will not work. Both TCP heartbeat
> and
> >>>>> Coap Ping are transmitted reliably. If you don’t receive an ack for
> >>> these
> >>>>> transmissions you are not able to send any additional messages and
> can
> >>>>> only close the connection.
> >>>>
> >>>> This behavior is implemented and tested between two
> implementations.
> >> The
> >>> exact procedure is described in the draft, fwiw:
> >>>>
> >>>> ==
> >>>>  When a Confirmable "CoAP Ping" is sent, and if there is no response,
> >>>>  the "CoAP Ping" is retransmitted max-retransmit number of times by
> >>>>  the CoAP layer using an initial timeout set to a random duration
> >>>>  between ack-timeout and (ack-timeout*ack-random-factor) and
> >>>>  exponential back-off between retransmissions.  By choosing the
> >>>>  recommended transmission parameters, the "CoAP Ping" will timeout
> >>>>  after 45 seconds.  If the DOTS agent does not receive any response
> >>>>  from the peer DOTS agent for 'missing-hb-allowed' number of
> >>>>  consecutive "CoAP Ping" Confirmable messages, it concludes that the
> >>>>  DOTS signal channel session is disconnected.  A DOTS client MUST NOT
> >>>>  transmit a "CoAP Ping" while waiting for the previous "CoAP Ping"
> >>>>  response from the same DOTS server.
> >>>> ==
> >>>
> >>> First, can you explain why you need 'missing-hb-allowed’?
> >>
> >> [Med] Because we need to make sure this is a "real/durable" defunct session,
> >> not a false positive. For example, this would have implications on the
> server
> >> as it may erroneously start automated mitigations (because it concludes
> the
> >> session is lost).
> >>
> >> If the ping is
> >>> transmitted reliably, one “missed” should be enough to conclude that
> the
> >>> session is disconnected.
> >>
> >> [Med] Hmm, under some DDoS attacks, both endpoints may be
> >> sending/replying to confirmable ping messages, but the reply may get
> >> dropped. The session is not disconnected in such case.
> >>
> >>>
> >>> Yes, as Coap Ping is used, the agent should not only conclude that the
> >>> DOTS signal session is disconnected but also the Coap session and not
> send
> >>> any further Coap messages anymore.
> >>>
> >>> If you want to send further UDP datagrams, you should send them unreliably and
> >>> not more often than once per 3 seconds.
> >>>
> >>> Mirja
> >>>
> >>>
> >>>>
> >>>> Cheers,
> >>>> Med
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
> >>>>> Sent: Wednesday 3 July 2019 12:26
> >>>>> To: BOUCADAIR Mohamed TGI/OLN
> >>>>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk; draft-ietf-dots-
> signal-
> >>>>> channel@ietf.org; frank.xialiang@huawei.com; dots@ietf.org; The
> IESG;
> >>>>> dots-chairs@ietf.org
> >>>>> Subject: Re: Behavior when keep-alives fail (RE: [Dots] Mirja
> >>> Kühlewind's
> >>>>> Discuss on draft-ietf-dots-signal-channel-31: (with DISCUSS and
> >>> COMMENT)
> >>>>>
> >>>>> Hi Med,
> >>>>>
> >>>>> See below.
> >>>>>
> >>>>>> On 3. Jul 2019, at 09:53, mohamed.boucadair@orange.com wrote:
> >>>>>>
> >>>>>> Hi Mirja,
> >>>>>>
> >>>>>> (Focusing on individual issues)
> >>>>>>
> >>>>>> Please see inline.
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Med
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Mirja Kuehlewind [mailto:ietf@kuehlewind.net]
> >>>>>>> Sent: Tuesday 2 July 2019 16:00
> >>>>>>> To: BOUCADAIR Mohamed TGI/OLN
> >>>>>>> Cc: Konda, Tirumaleswar Reddy; Benjamin Kaduk; draft-ietf-dots-
> >>> signal-
> >>>>>>> channel@ietf.org; frank.xialiang@huawei.com; dots@ietf.org; The
> >> IESG;
> >>>>>>> dots-chairs@ietf.org
> >>>>>>> Subject: Re: [Dots] Mirja Kühlewind's Discuss on draft-ietf-dots-
> >>> signal-
> >>>>>>> channel-31: (with DISCUSS and COMMENT)
> >>>>>>>
> >>>>>> ...
> >>>>>>>>>>>> 10) The document should more explicitly provide more
> >> guidance
> >>>>> about
> >>>>>>>>>>>> when a client should start a session and what should be done
> >>> (from
> >>>>>>> the
> >>>>>>>>>>>> client side) if a session is detected as inactive (other than
> >>>>> during
> >>>>>>>>>>>> migration which is discussed a bit in 4.7). Is the assumption to
> >>>>>>> have
> >>>>>>>>>>>> basically permanently an active session or connect for
> >> migration
> >>>>> and
> >>>>>>>>>>>> configuration requests separately at a time?
> >>>>>>>>>>>
> >>>>>>>>>>> I think there was some clarifying text added, but please
> confirm
> >>> if
> >>>>>>> you
> >>>>>>>>> think it
> >>>>>>>>>>> is sufficient.
> >>>>>>>>>
> >>>>>>>>> Sorry, don’t see where text was added. Can you provide a
> pointer?
> >>>>>>>>
> >>>>>>>> [Med] We do have this text, for example:
> >>>>>>>>
> >>>>>>>> The DOTS signal channel can be established between two DOTS
> >> agents
> >>>>>>>> prior or during an attack.  The DOTS signal channel is initiated by
> >>>>>>>> the DOTS client.  The DOTS client can then negotiate, configure,
> and
> >>>>>>>> retrieve the DOTS signal channel session behavior with its DOTS
> peer
> >>>>>>>> (Section 4.5).  Once the signal channel is established, the DOTS
> >>>>>>>> agents periodically send heartbeats to keep the channel active
> >>>>>>>> (Section 4.7).  At any time, the DOTS client may send a mitigation
> >>>>>>>> request message (Section 4.4) to a DOTS server over the active
> >>> signal
> >>>>>>>> channel.  While mitigation is active (because of the higher
> >>>>>>>> likelihood of packet loss during a DDoS attack), the DOTS server
> >>>>>>>> periodically sends status messages to the client, including basic
> >>>>>>>> mitigation feedback details.  Mitigation remains active until the
> >>>>>>>> DOTS client explicitly terminates mitigation, or the mitigation
> >>>>>>>> lifetime expires.  Also, the DOTS server may rely on the signal
> >>>>>>>> channel session loss to trigger mitigation for pre-configured
> >>>>>>>> mitigation requests (if any).
> >>>>>>>
> >>>>>>> Okay, thanks for the pointer. What I think is missing are some
> >>>>>>> sentences about what the client (or server) should do if the keep-
> >>> alive
> >>>>>>> fails. Try to reconnect directly or just with the next request or
> >>>>>>> whatever. Basically who should reconnect and when?
> >>>>>>
> >>>>>> [Med] This is discussed in detail in Section 4.7, in particular.
> >>>>>>
> >>>>>> As a generic rule, it is always the client who connects (see the
> >>> excerpt
> >>>>> above).
> >>>>>>
> >>>>>> The server may use the failure to initiate automated mitigation (see
> >>> the
> >>>>> excerpt above). More details are provided in other sections.
> >>>>>>
> >>>>>> There are several heartbeat failure cases to handle by the client.
> >>>>> Examples from 4.7 are provided below, fwiw:
> >>>>>>
> >>>>>>    The DOTS client MUST NOT consider the DOTS signal channel
> session
> >>>>>>    terminated even after a maximum 'missing-hb-allowed' threshold is
> >>>>>>    reached.  The DOTS client SHOULD keep on using the current DOTS
> >>>>>>    signal channel session to send heartbeat requests over it, so that
> >>>>>>    the DOTS server knows the DOTS client has not disconnected the
> >>>>>>    DOTS signal channel session.
> >>>>>>
> >>>>>>    After the maximum 'missing-hb-allowed' threshold is reached, the
> >>>>>>    DOTS client SHOULD try to resume the (D)TLS session.  The DOTS
> >>>>>>    client SHOULD send mitigation requests over the current DOTS
> >>>>>>    signal channel session, and in parallel, for example, try to
> >>>>>>    resume the (D)TLS session or use 0-RTT mode in DTLS 1.3 to
> >>>>>>    piggyback the mitigation request in the ClientHello message.
> >>>>>>
> >>>>>>    As soon as the link is no longer saturated, if traffic from the
> >>>>>>    DOTS server reaches the DOTS client over the current DOTS signal
> >>>>>>    channel session, the DOTS client can stop (D)TLS session
> >>>>>>    resumption or if (D)TLS session resumption is successful then
> >>>>>>    disconnect the current DOTS signal channel session.
> >>>>>>
> >>>>>> Do you think additional text is needed?
> >>>>>
> >>>>> Actually to my understanding this will not work. Both TCP heartbeat
> and
> >>>>> Coap Ping are transmitted reliably. If you don’t receive an ack for
> >>> these
> >>>>> transmissions you are not able to send any additional messages and
> can
> >>>>> only close the connection.
> >>>>>
> >>>>> Mirja
> >>>>>
> >>>>>
> >>>>
> >>
> >> _______________________________________________
> >> Dots mailing list
> >> Dots@ietf.org
> >> https://www.ietf.org/mailman/listinfo/dots
> >
> >
> 