Re: statement regarding keepalives

Gorry Fairhurst <gorry@erg.abdn.ac.uk> Thu, 16 August 2018 07:54 UTC

Message-ID: <5B752DBB.9030705@erg.abdn.ac.uk>
Date: Thu, 16 Aug 2018 08:54:35 +0100
From: Gorry Fairhurst <gorry@erg.abdn.ac.uk>
Reply-To: gorry@erg.abdn.ac.uk
Organization: University of Aberdeen
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:12.0) Gecko/20120428 Thunderbird/12.0.1
MIME-Version: 1.0
To: Mikael Abrahamsson <swmike@swm.pp.se>
CC: Kent Watsen <kwatsen@juniper.net>, "tsv-area@ietf.org" <tsv-area@ietf.org>, "netconf-chairs@ietf.org" <netconf-chairs@ietf.org>, "tls-ads@ietf.org" <tls-ads@ietf.org>, "tsvwg-ads@tools.ietf.org" <tsvwg-ads@tools.ietf.org>
Subject: Re: statement regarding keepalives
References: <D3326DE0-3F31-4045-B945-82B3F417BE4B@juniper.net> <alpine.DEB.2.20.1807201340240.14354@uplift.swm.pp.se> <B50DC954-CBB6-41C5-BE3A-F1DECD6046A5@juniper.net> <717202c9c6c6b3d083bfa4c8a9925e45@strayalpha.com> <6377766E-9A03-41BA-A4D4-8796F46278BD@juniper.net> <CALx6S34+rG_rx+79=iaeu5YT4pYUWRqAym6S_CNzJq9-a40Yvw@mail.gmail.com> <513E9F0D-CFAD-4009-8F86-289D9DC55A79@juniper.net> <alpine.DEB.2.20.1808160919260.19688@uplift.swm.pp.se>
In-Reply-To: <alpine.DEB.2.20.1808160919260.19688@uplift.swm.pp.se>
Content-Type: text/plain; charset="ISO-8859-1"; format="flowed"
Content-Transfer-Encoding: 7bit
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsv-area/xaE7ktM1TpLvrJif9ciWaAShh7o>

Adding some comments here. I'm playing catch-up, so I may have comments
on some things that have been fixed, and missed others.

On 16/08/2018, 08:28, Mikael Abrahamsson wrote:
> On Wed, 15 Aug 2018, Kent Watsen wrote:
>
>> You bring up an interesting point, it goes to the motivation for
>> wanting to do keepalives in the first place.  The text doesn't yet
>> mention maintaining flow state as a motivation.
>
> It's not only to maintain flow state, it's also to close the
> connection when the network goes down and doesn't work anymore, and
> "give up" on connections that don't work anymore (for some
> definition of "anymore").
>
> I have operationally been in the situation where a server/client 
> application was implemented so that the server could only handle 256 
> connections (some filedescriptor limit). Every time the firewall was 
> rebooted, lost state, the connection hung around forever. So the 
> server administrators had to go in and restart the process to clear 
> these connections, otherwise there were 256 hung connections and no 
> new connections could be established.
>
> Sometimes the other endpoint goes down, and doesn't come back. We will 
> for instance deploy home gateways probably keeping netconf-call-home 
> sessions to an NMS, and we want them to be around forever, as long as 
> they work. TCP level keepalives would solve this, as if the customer 
> just powers off the device, after a while the session will be cleared. 
> Using TCP keepalives here means you get this kind of behaviour even if 
> the upper-layer application doesn't support it (netconf might have 
> been a bad example here). It's a single socket option to set, so it's 
> very easy to do.
>
Agree. Looking at the transport layer, I think that allowing a flow to
continue to use existing "network" state (in various forms) is an
important aspect - there are NATs, firewalls, QoS classifiers, etc., as
well as load balancers and layer 2/3 devices that take resource decisions
at the flow level. Normally all of these do the correct thing when there
is a continuous flow of packets.

Somewhere in the thread I also saw a statement that suggested that
associations should be short-lived - if that advice is carried to the
transport layer, I would expect it to have a serious impact on the
performance for some paths! (There are important trade-offs here, and we
should not make sweeping assumptions.)

> From knowing approximately what settings people have in their NAT44 and
> firewalls etc, I'd say the recommendation should be that keepalives
> are set to around 60-300 second interval, and then kill the connection
> if no traffic has passed in 3-5 of these intervals. Otherwise TCP will
> have backed off so far anyway, that it's probably faster to just
> re-try the connection instead of waiting for TCP to re-send the packet.
>
> I have seen so many times in my 20 years working in networking where 
> lack of keepalives have caused all kinds of problems. I wish everybody 
> would turn it on and keep it on.
>
I agree.  I have the feeling that this is not at all easy advice to get
correct in a general way (and this thread is not quite there yet), e.g.,
RFC 5245 set lower limits for timers - because that was thought important.
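
For concreteness, the "single socket option" and the timer values
suggested in the quoted text could be sketched roughly as below. This is
only an illustration, not a recommendation: the helper name
enable_keepalive and the default values (60 second idle time, 60 second
probe interval, 4 probes) are assumptions picked from within the quoted
60-300 second / 3-5 interval range, and the per-socket TCP_KEEP* options
are platform-specific (Linux names are used here), so the sketch skips
any that the local Python socket module does not expose.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=60, probes=4):
    """Enable TCP keepalives on sock (hypothetical helper, illustrative values)."""
    # The "single socket option": ask the stack to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timer tuning. These names exist on Linux; other platforms
    # may lack some of them, so each one is applied only if present.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_s),  # spacing between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With values like these, a silently dead peer would be detected after
roughly idle_s + probes * interval_s seconds of silence; whether that is
the right trade-off depends on the path and on what the keepalive is
for, which is the point made above.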

I don't agree that protocol stacks with a secure transport protocol
layer (e.g., TLS, SSH, DTLS) that sits on top of a cleartext protocol
layer (e.g., TCP, UDP) should be advised to do the aliveness check only
within the protection envelope afforded by the secure transport protocol
layer - to me that seems entirely wrong - it has the same "issue" as
above: it depends on the function of the aliveness check and the way
this is used by the layer's protocol machine. In many cases it is
absolutely desirable to do this within the layer that needs this
information. Passing the detailed state down between layers can be most
awkward. Higher layers can make their own decisions - and suppress
keep-alives or reaffirm state.

Guidance from the transport perspective on timers is in RFC 8085,
Section 3.1.1; there is also more advice in the "behave" RFCs and a
summary of the mechanisms in RFC 8085, Section 3.5 (noted by Lars). The
vulnerabilities are also noted in RFC 8085, and I think we should be
clear to differentiate between on-path and off-path knowledge when
understanding this.

Gorry