Re: [hybi] NAT reset recovery? Was: Extensibility mechanisms?

Jamie Lokier <jamie@shareable.org> Mon, 19 April 2010 14:07 UTC

Date: Mon, 19 Apr 2010 15:04:23 +0100
From: Jamie Lokier <jamie@shareable.org>
To: Vladimir Katardjiev <vladimir@d2dx.com>
Message-ID: <20100419140423.GC3631@shareable.org>
References: <Pine.LNX.4.64.1004181812370.751@ps20323.dreamhostps.com> <4BCB6641.70408@webtide.com> <Pine.LNX.4.64.1004182010070.751@ps20323.dreamhostps.com> <4BCB6FD0.7080003@webtide.com> <j2n5c4444771004181403o81184b00r294f3c3b878f24f6@mail.gmail.com> <20100419091736.GA28758@shareable.org> <p2w2a10ed241004190222ne3a61417i47b021dbe0422f71@mail.gmail.com> <B3F72E5548B10A4A8E6F4795430F841832040920C4@NOK-EUMSG-02.mgdnok.nokia.com> <20100419121000.GG28758@shareable.org> <87764B8E-5872-40EE-AA2F-D4E659B94F63@d2dx.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <87764B8E-5872-40EE-AA2F-D4E659B94F63@d2dx.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
Cc: Hybi <hybi@ietf.org>
Subject: Re: [hybi] NAT reset recovery? Was: Extensibility mechanisms?

Vladimir Katardjiev wrote:
> On 19 apr 2010, at 14.10, Jamie Lokier wrote:
> Take, for no particular reason, the case of a web-based Tic Tac Toe
> application. Suppose it is the opponent's turn to move, but he went
> afk. Meanwhile, some NAT on the path decided it wants to close the
> connection. Since it's not my turn to move, my client should not have
> any reason to send traffic over the connection. It also doesn't have
> any set time it expects a response (because the other party is human
> the application has no way to know when he'll move).

Yes, that's the main way in which WebSocket Tic Tac Toe breaks.

> - The client can actually be told when a connection is dropped by
> onclose(wasClean=false) being triggered, and thus reestablish the
> connection. I mean, if you open a connection, surely you want to
> know if it's no longer available...

Sounds like a good plan :-)

> - Our hypothetical amateur programmer, testing his websocket
> connection against localhost, won't see the need for keepalives, but
> the protocol requiring them will stop his application from failing
> on the Wild Web. All he needs to do is echo back the bytes the
> browser sent, and the expert browser programmer handles the rest.

Keepalives are great, but this isn't a nice way to implement them.

If they are required by the protocol, then the protocol should
implement them itself.  (Besides, echo-reply isn't always bandwidth
efficient, depending on the keepalive needs of client and server).
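
As a rough illustration of a keepalive that is cheaper than a fixed
echo-reply, here is a minimal sketch of an idle-based timer: a probe
is only due when nothing has crossed the connection for a while.  The
class name, the 30 second interval and the choice to count both sent
and received frames as traffic are assumptions for illustration, not
anything the protocol specifies.

```python
class IdleKeepalive:
    """Decides when a keepalive is due, based on time since last traffic."""

    def __init__(self, interval: float, now: float = 0.0) -> None:
        self.interval = interval      # seconds of silence tolerated (assumed value)
        self.last_traffic = now       # timestamp of the most recent frame

    def on_traffic(self, now: float) -> None:
        # In this sketch any frame, sent or received, counts as traffic
        # and pushes the next keepalive back.
        self.last_traffic = now

    def due(self, now: float) -> bool:
        # A keepalive is only needed once the link has been idle
        # for at least one full interval.
        return now - self.last_traffic >= self.interval


ka = IdleKeepalive(interval=30.0)
ka.on_traffic(now=10.0)
print(ka.due(now=35.0))  # False: only 25 seconds idle
print(ka.due(now=40.0))  # True: 30 seconds idle
```

A busy connection never sends a probe under this scheme, which is the
bandwidth point: the cost is only paid while the link is otherwise silent.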

> So I'd rather say I am having trouble seeing what applications
> would _not_ want keepalive functionality as part of the base
> offering of WebSockets.

Well, there may be disagreement over what type of keepalive, because
the optimal choice is both application dependent _and_ network link
dependent.

At the moment, it can be implemented by the WebSocket application
(albeit suboptimally, and any network link dependency has to be known
to the application as well).

> >>   but I bet NATs are smart enough to break even there too.  Data is
> >>   needed....
> > 
> > I can give you personal experience.  Yes: NATs do break connections
> > with traffic on them.
> 
> Of course, that's not the only thing that can fail silently. Lost
> connections due to other issues than NATs are also prevalent,
> anything from a cord being cut to someone using a wireless
> connection and going into a tunnel.

Yes, although those might be considered less transient errors :-)

The thing with NAT reset is that TCP won't recover, but reconnection
works immediately.

If you go into a tunnel and come out, TCP will recover.  If you cut a
cord, TCP won't recover but neither will reconnection work immediately.

So NAT reset is a bit different from other network faults.
Quick detection and recovery is useful with it.
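
The difference between those cases suggests a recovery loop that tries
once immediately and only then backs off.  A minimal sketch, assuming
a hypothetical connect() callable that returns True on success; the
function name, the delays and the injected sleep are all illustrative:

```python
import time
from typing import Callable, Iterable, Optional


def reconnect(connect: Callable[[], bool],
              backoff: Iterable[float],
              sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Try once immediately, then once after each delay in `backoff`.

    Returns the number of attempts used, or None if every attempt failed.
    """
    attempts = 1
    if connect():
        # Succeeding on the first try is what a NAT reset looks like:
        # the old connection is dead but the path itself is fine.
        return attempts
    for delay in backoff:
        # The path itself is down (e.g. a cut cord), so wait before retrying.
        sleep(delay)
        attempts += 1
        if connect():
            return attempts
    return None


# Simulated NAT reset: the very first reconnection attempt works.
print(reconnect(lambda: True, backoff=[1.0, 2.0, 4.0], sleep=lambda d: None))  # 1
```

The immediate first attempt costs little in the cut-cord case and
avoids a needless delay in the NAT reset case.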

> (This only appears to contradict what I said above on mobile
> keepalives if you assume all mobile networks are equal. They're
> not. What I want to say though is that even though we assume a
> general case where the connection WILL fail if it's left on its
> own, we should make it possible to make the transfer more
> efficient)

Making the failure event more efficient by pushing keepalive or other
link detection down to the lowest reliable level is generally a good idea.

> > There is currently no consensus on whether the application should
> > handle these network issues itself, or if the WebSocket implementation
> > should play a role.
> > 
> > [...]
> > 
> > Proxy and server implementers seem to prefer that WebSockets handles
> > it, so application programmers get a robust pipe without having to deal
> > with these issues (and it might use the network a bit better).
> 
> This really depends on how you define "robust". I'm okay with
> WebSockets failing because the recovery conditions aren't
> necessarily easy to define, and for some values of robustness you
> need to do stuff like deferring messages on the server-side, and
> then you need to identify the connection that requested them, and
> then authenticate it (if needed) and then you need to determine when
> you're NOT waiting for the robustness recovery, and it just goes on
> and on forever, much like this sentence.

In general you *can't* make a protocol fully robust.  There is always
a failure mode.  But:

> So, yeah. Failure is good. My preference, though, is that the
> protocol itself takes care of the failing part if the failure is due
> to the network conditions, so anything written on top of the
> protocol doesn't have to keep doing the same old networking traps
> every. single. time.

There are very good reasons to distinguish certain classes of failure:

   - Non-failures!  (Peer closes connection due to timeout / resource limit)

   - Those you can recover from quickly and safely (NAT reset)

   - Those which are recoverable but need user confirmation (POST data again?)

   - Those which are treated as unrecoverable (DNS failed, or 404 Not Found)

Some of the above list is best handled in the WebSocket-using
application, but graceful close and NAT reset detection are arguably
better handled by lower layers.

The key characteristic is which ones result in different *desirable*
behaviour from WebSocket application code.

When a condition is an expected behaviour of the network (server
closes connection, NAT reset) *and* it's possible to reopen a
connection and repeat the same messages without causing any problems -
those are particularly useful to distinguish.

Server timeout is also useful to distinguish for a user-interface
reason: the user should not be told there is an error when something
happens in normal operation and is not an error.
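
To make the distinction concrete, here is a minimal sketch mapping the
four classes above to what application code might do.  The enum names,
the replay_is_safe flag and the specific actions are assumptions for
illustration only; nothing here is part of any WebSocket API.

```python
from enum import Enum, auto


class Failure(Enum):
    PEER_CLOSED = auto()         # non-failure: timeout / resource limit
    NAT_RESET = auto()           # recoverable quickly and safely
    NEEDS_CONFIRMATION = auto()  # recoverable, but ask first (POST data again?)
    UNRECOVERABLE = auto()       # DNS failed, 404 Not Found


class Action(Enum):
    RECONNECT_SILENTLY = auto()  # no error shown to the user
    ASK_USER = auto()
    REPORT_ERROR = auto()


def choose_action(failure: Failure, replay_is_safe: bool) -> Action:
    """Pick the desirable behaviour for a given class of failure."""
    if failure in (Failure.PEER_CLOSED, Failure.NAT_RESET):
        # Expected network behaviour: reopen quietly, but only when the
        # same messages can be repeated without causing problems.
        # Falling back to asking the user is an assumption of this sketch.
        return Action.RECONNECT_SILENTLY if replay_is_safe else Action.ASK_USER
    if failure is Failure.NEEDS_CONFIRMATION:
        return Action.ASK_USER
    return Action.REPORT_ERROR


print(choose_action(Failure.NAT_RESET, replay_is_safe=True).name)       # RECONNECT_SILENTLY
print(choose_action(Failure.UNRECOVERABLE, replay_is_safe=True).name)   # REPORT_ERROR
```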

-- Jamie