[tcpm] More TCP option space on SYNs

Bob Briscoe <bob.briscoe@bt.com> Sat, 31 May 2014 18:20 UTC

Message-ID: <201405311819.s4VIJt2V003823@bagheera.jungle.bt.co.uk>
X-Mailer: QUALCOMM Windows Eudora Version 7.1.0.9
Date: Sat, 31 May 2014 19:19:53 +0100
To: Joe Touch <touch@isi.edu>
From: Bob Briscoe <bob.briscoe@bt.com>
In-Reply-To: <5388EB6F.4010405@isi.edu>
References: <20140425221257.12559.43206.idtracker@ietfa.amsl.com> <2586_1398464386_535ADF82_2586_915_1_535ADF56.9050106@isi.edu> <CF8D8E25-E435-4199-8FD6-3F7066447292@iki.fi> <5363AF84.8090701@mti-systems.com> <5363B397.8090009@isi.edu> <CAO249yeyr5q21-=e6p5azwULOh1_jUsniZ6YPcDYd69av8MMYw@mail.gmail.com> <DCC98F94-EA74-4AAA-94AE-E399A405AF13@isi.edu> <655C07320163294895BBADA28372AF5D2CFE36@FR712WXCHMBA15.zeu.alcatel-lucent.com> <20140503122950.GM44329@verdi> <655C07320163294895BBADA28372AF5D2D009E@FR712WXCHMBA15.zeu.alcatel-lucent.com> <201405221710.s4MHAY4S002037@bagheera.jungle.bt.co.uk> <537E3ACD.5000308@isi.edu> <1AD79820-22C1-4500-84D1-1383F264D68C@weston.borman.com> <201405231213.s4NCDa5P005525@bagheera.jungle.bt.co.uk> <537F8202.4020907@isi.edu> <201405281715.s4SHFMm0014634@bagheera.jungle.bt.co.uk> <538623B9.2060209@isi.edu> <201405301642.s4UGgcvY030471@bagheera.jungle.bt.co.uk> <5388EB6F.4010405@isi.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format="flowed"
X-Scanned-By: MIMEDefang 2.56 on 132.146.168.158
Archived-At: http://mailarchive.ietf.org/arch/msg/tcpm/-QxOlgqe00Vr-1QvhZR4bYdB434
Cc: "tcpm@ietf.org" <tcpm@ietf.org>
Subject: [tcpm] More TCP option space on SYNs

Joe,

Thx for consolidating this thread. I've given it a new subject line.

1) You've silently made an important alteration to the proposed 
protocol. You've put the extra-options directly in the TCP option 
space of the C-SYN, not within the payload. This creates problems:
a) it limits additional options to another 40B, beyond which we will 
need a third SYN, then a fourth.
b) it perpetuates the deployment problem that every newly defined TCP 
option will have: when deployed in year y, there will be a new crop 
of middleboxes that only forward options defined up to year y-1.

I had deliberately squirrelled the options away in the app layer to:
a) provide expansion space for options on a SYN, limited only by the 
max segment size
b) reduce the chances that middleboxes will alter the extra options, 
given there is a higher bar to altering the payload.
c) allow for future structured ways to make extra options invisible 
and/or immutable by middleboxes.
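
For a sense of scale, the arithmetic behind the two limits can be sketched as below. The 40B ceiling follows from the 4-bit Data Offset field in the TCP header. The payload figure assumes a 1460B MSS and a 36B inner-protocol prefix; both are illustrative values, not anything fixed by the proposal, and the function names are hypothetical.

```python
TCP_FIXED_HEADER = 20   # bytes in the TCP header before any options
DATA_OFFSET_BITS = 4    # Data Offset counts the header in 32-bit words


def max_tcp_option_bytes() -> int:
    """Largest option area the TCP header itself can describe."""
    max_header = (2 ** DATA_OFFSET_BITS - 1) * 4  # 15 words * 4 = 60 bytes
    return max_header - TCP_FIXED_HEADER


def payload_option_bytes(mss: int, app_prefix_len: int) -> int:
    """Room left for extra options carried in the SYN payload.

    mss and app_prefix_len are assumptions supplied by the caller.
    """
    return max(0, mss - app_prefix_len)
```

With the assumed values, `payload_option_bytes(1460, 36)` leaves well over a kilobyte, against 40B in the header.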

Yes, your altered proposal is cleaner. However, don't imagine I 
didn't think of this. I did and I deliberately didn't do it this way. 
We have a choice:
         clean and vulnerable vs. messy but robust.

I'm not wedded to using port 80 and http headers, but this is perhaps 
the most pragmatic approach. It will be really unorthodox to define 
such a protocol, I know. We would have to say something like:

         "The dst port of the C-SYN MUST be 80, and the payload MUST start
         with the constant magic_token, where
         magic_token = 'PUT / HTTP/1.1<CRLF>Connection : DSO<CRLF><CRLF>'
         "

I'm sorry if even thinking about this makes you feel dirty :|
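
A receiver-side check for the rule quoted above might look like the sketch below. It assumes <CRLF> means the two bytes "\r\n" and treats everything after magic_token as opaque, since the encoding of conn_id and extra_opt inside the payload is not defined here. The function name is hypothetical.

```python
CRLF = "\r\n"
# magic_token exactly as written in the proposal text, including the
# space before the colon.
MAGIC_TOKEN = (
    "PUT / HTTP/1.1" + CRLF + "Connection : DSO" + CRLF + CRLF
).encode("ascii")
C_SYN_DST_PORT = 80


def split_c_syn_payload(dst_port: int, payload: bytes):
    """Return the bytes following magic_token, or None if this is not a C-SYN."""
    if dst_port != C_SYN_DST_PORT:
        return None
    if not payload.startswith(MAGIC_TOKEN):
        return None
    return payload[len(MAGIC_TOKEN):]
```

A SYN that fails either test would simply be handled as an ordinary SYN carrying data.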

Other suggestions for inner protocols are welcome, including 
tunnelled protocols, as long as middleboxes widely forward them, 
given their dst port.


2) The main problem with your notation is it doesn't say /where/ the 
info is placed.
I've added notation as follows:
TCP(base header [TCP options [APP(header[payload])]])

And for the record I've made the if-else logic clearer.

Where I've made more than clarifying edits inline, I've described 
them and tagged them with [BB].

At 21:34 30/05/2014, Joe Touch wrote:
>Hi, Bob,
>
>Let's get back to the core, in a simpler fashion, so other can follow it.
>
>I stand by my "there's no way to extend the space in the initial 
>SYN", but you've convinced me there *might* be a way to provide 
>extended space that can occur during the first phase of the TWHS. I 
>think the dual-SYN approach still isn't viable, but I've outlined an 
>alternative below that's similar but doesn't have the same baggage, IMO.
>
>Again, I'm still concerned by what midboxes might do to this...
>
>What do others think??
>
>Joe
>
>For quick review, here's what I understand:
>
>                 dso = dual-syn option
>                         dso-D = data
>                         dso-C = control
>                 conn_id = identifier to link the two SYNs together
>                 extra_opt = options that didn't fit in legacy SYN
>                 fit_opt = options that do fit in the legacy SYN
         new client endpoint sends
                 TCP(port A SYN [dso-D(conn_id) + fit_opt] )
                 TCP(port B SYN [dso-C [APP(headers [conn_id + extra_opt] ) ] ] )

[BB]: i/APP(headers...)/

                         if (legacy server endpoint) { sends back two connections:
                                 TCP(port A SYN-ACK [fit_opt] )
                                 TCP(port B SYN-ACK [??] )
>                                 (it's interpretation of extra_opt)
                                         new client endpoint responds:
                                         TCP(port A ACK) (established)
                                         TCP(port B RST)

>                         Notes about legacy servers:
>                                 - they do twice the work on SYNs
>                                 - they might keep twice the state
>                                 (if not using cookies)
>                                 - they might clean state if the RST
>                                 is received, but that state might
>                                 persist indefinitely (until the next
>                                 connection, depending on timeouts, etc.)
>
>                         -----
                         } elif (new server endpoint) { sends back one connection:
                                 TCP(port A SYN-ACK [edo + fit_opt + extra_opt] )

[BB]: s/dso-d/edo/
                                         new client endpoint responds:
                                         TCP(port A ACK) (established)

>
>                         Notes:
>                                 - can stall when dso-D SYN arrives
>                                 before dso-C SYN, up to some limit
>                                 - twice the work on SYNs (or more)
                         }
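
The client side of the if-else above can be sketched as follows. This is only a model of the message flow, not a TCP implementation: options are represented as string labels, and the class and function names are hypothetical. It assumes the client recognises an upgraded server by the edo option in the SYN-ACK on port A.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SynAck:
    port: str                                  # "A" (D-SYN) or "B" (C-SYN)
    options: Set[str] = field(default_factory=set)


def client_replies(syn_acks: List[SynAck]) -> List[Tuple[str, str]]:
    """Segments the new client sends in response to the SYN-ACKs it received."""
    replies = []
    for sa in syn_acks:
        if sa.port == "A":
            replies.append(("A", "ACK"))   # connection A becomes established
        elif sa.port == "B":
            replies.append(("B", "RST"))   # only a legacy server answers on B
    return replies


def server_is_upgraded(syn_acks: List[SynAck]) -> bool:
    """True if the SYN-ACK on port A carries the edo option."""
    return any(sa.port == "A" and "edo" in sa.options for sa in syn_acks)
```

The same client rule covers both branches: ACK whatever arrives on A, RST whatever arrives on B.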

>Here's what I was assuming, though admittedly it's not documented (yet):
>
>         - no significant impact on TCP connection rate for
>         legacy servers
>
>         - no significant impact on TCP connection rate for
>         legacy clients
>
>         - impact dominated by processing the extended option space
>         for extended clients
>
>         - impact dominated by processing the extended option space
>         for extended servers
>
>         - compatible with typical TCP processing optimizations,
>         notably SYN cookies
>                 you did provide a potential way forward for these
>
>         - capable of successfully traversing typical NATs
>
>Your approach has the following properties:

The 3 bullets below are not useful ways to describe performance 
impact. They selectively describe whichever gives the most 
pessimistic picture out of:
a) either the instantaneous performance change at the moment of connection
b) or the worst-case long-run performance impact

They don't describe the average long-run performance impact, which is 
important for sizing machines.

Worse, the instantaneous performance impact is only significant when 
a machine's SYN processing time is large relative to the e2e delay, 
which would be a highly unusual scenario on public networks (even in 
scenarios such as intra-data-centre, it's hard to reduce e2e delay to 
approach SYN processing time, but you could for intra-machine connections).


>         - halves the server connection rate for updated servers
>         from legacy clients when this option is in use

Eh? The long-run server connection rate will be fractionally 
decreased due to updated clients using extra options (which is your 
third case below), but the instantaneous server connection rate seen 
by a legacy client is unchanged, because it only sends one SYN.


>         - lowers (to some extent, if not halves) the client
>         connection rate of updated clients to all servers
>         when this option is in use
>
>         - halves (roughly) the server rate for all servers
>         when this option is in use

Nope. All long-run server rates are scaled by 1/(1+e), i.e. reduced by 
the fraction e/(1+e), where e is the fraction of connections using 
extra options.


>It also:
>
>         - doubles the number of SYNs in the network

Nope. The number of SYNs in the network is inflated by a factor of 
(1+e), where e is the fraction of connections using extra options.
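
As a worked example of the long-run figures, the sketch below assumes that each connection using extra options costs the server two SYNs' worth of work and that SYN processing is the bottleneck. Under those assumptions the mean work per connection is 1+e. The function names are hypothetical.

```python
def relative_server_rate(e: float) -> float:
    """Long-run connection rate relative to a server seeing no extra options.

    e is the fraction of connections using extra options (0 <= e <= 1).
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must be a fraction between 0 and 1")
    return 1.0 / (1.0 + e)


def syn_inflation_factor(e: float) -> float:
    """Ratio of SYNs in the network with and without dual-SYN connections."""
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must be a fraction between 0 and 1")
    return 1.0 + e
```

Halving the rate and doubling the SYN count only appear at e = 1, i.e. when every connection uses extra options; at e = 0.1 the rate drops by about 9%.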


>         - susceptible to lack of fate-sharing problems, e.g.,
>         if the two SYNs experience different firewall configurations

Nope. It's fairer to say it's potentially susceptible to second-order 
fate-sharing problems like your firewall example (the first-order 
fate sharing problems have been addressed).


>         - reduces the space available for fit_opt due to the need
>         for the conn_id even in the fall-back D-SYN, which means
>         less option space in the SYNs for fall-back connections

Yup.


>         the conn_id which may need to be very large because it
>         needs to be unique per source port and source IP address
>         because that information is lost during NAT translation

Given many NATs will typically make the src IPs of both SYNs the 
same, I suggest a larger conn_id should be a fall-back option for the 
client, not a default.

Even if the src IPs of both SYNs are different once they reach the 
server, the high-order bits will invariably be the same. So the max 
size of the contents of the DSO TCP option can be 6B, and the server 
can take the rest of conn_id from the higher bits of the src IP addr 
of each SYN. This is a variant of the idea in <draft-wing-nat-reveal-option>.

In fact, the server doesn't even need a small conn_id for clients 
that know they are not behind a NAT and that want more option space 
in the D-SYN - then the server could use the src port & src IP for the conn_id.

To summarise, these options could be distinguished by the length 
field of the dual-syn option.
Length = 2B                          => conn_id = src netaddr + src port
Length = 6B = 2B + 4B conn_id_short  => conn_id = src netaddr + conn_id_short
Length = 8B = 2B + 6B conn_id_long   => conn_id = hsb(src netaddr) + conn_id_long

Given there have been numerous other attempts to reveal a connection 
ID that is preserved through middleboxes [RFC6967], rather than 
defining a dual-syn option that carries a conn_id, we might want to 
design a TCP connection ID option with a flag to say whether it is 
also part of a dual SYN pair or not.

Where
* hsb(src netaddr) is the netaddr with the lowest 16 bits truncated
* src netaddr is the network address (IPv4, IPv6, or any other 
network protocol)
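
The three length cases could be handled at the server roughly as sketched below. This assumes the 2B of overhead are the option's kind and length bytes, that "truncated" means the low 16 bits are shifted out, and that the conn_id is held as a tuple; none of that is fixed by the text above. The function names are hypothetical.

```python
import ipaddress


def hsb(src_netaddr: str) -> int:
    """Network address with its lowest 16 bits truncated (shifted out)."""
    return int(ipaddress.ip_address(src_netaddr)) >> 16


def derive_conn_id(option_length: int, src_netaddr: str, src_port: int,
                   option_body: bytes = b""):
    """Build the conn_id from the dual-syn option length and the SYN's source."""
    if len(option_body) != option_length - 2:
        raise ValueError("option body does not match the length field")
    addr = int(ipaddress.ip_address(src_netaddr))
    if option_length == 2:       # no conn_id carried in the option
        return (addr, src_port)
    if option_length == 6:       # 4B conn_id_short
        return (addr, int.from_bytes(option_body, "big"))
    if option_length == 8:       # 6B conn_id_long
        return (hsb(src_netaddr), int.from_bytes(option_body, "big"))
    raise ValueError("unsupported dual-syn option length")
```

With the 8B form, two SYNs arriving from 192.0.2.10 and 192.0.2.77 with the same conn_id_long map to the same conn_id, because only the high-order bits of the address are used.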

To reduce latency, a host could use the default conn_id_short for all 
connections at first, then:
- if it finds that DSO persistently doesn't work, it falls back to 
conn_id_long for all connections
- it occasionally tests the short and zero-length conn_id options to 
see if it can use shorter DSO options.
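
One way a client might track this is sketched below. The failure threshold, the probe interval and the rule of probing one step shorter at a time are all assumptions made for illustration; the class and method names are hypothetical.

```python
class DsoLengthPolicy:
    """Chooses the dual-syn option length (2B, 6B or 8B) for new connections."""

    LENGTHS = (2, 6, 8)   # no conn_id, short conn_id, long conn_id

    def __init__(self, failure_threshold: int = 3, probe_interval: int = 20):
        self.failure_threshold = failure_threshold   # assumed value
        self.probe_interval = probe_interval         # assumed value
        self.current = 6                             # start with the short form
        self.failures = 0
        self.attempts = 0

    def next_length(self) -> int:
        """Length to use for the next attempt; occasionally probes a shorter one."""
        self.attempts += 1
        if self.current > 2 and self.attempts % self.probe_interval == 0:
            return self.LENGTHS[self.LENGTHS.index(self.current) - 1]
        return self.current

    def report(self, length: int, worked: bool) -> None:
        """Record whether DSO worked for an attempt that used this length."""
        if worked:
            if length <= self.current:
                self.current = length    # a shorter (or equal) form works
                self.failures = 0
            return
        if length == self.current:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.current = 8         # persistent failure: use the long form
                self.failures = 0
```

A failed probe of a shorter length is ignored, so probing never makes the steady-state choice worse.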


>         - requires the ISNs to be related (see RFC6528 - if there's
>         a rule to generate it, there will be code to validate that
>         rule, and eventually a BCP to encourage that validation -
>         typically from the same RFC author)

Eh? The ISNs can and should be independent. To be robust against 
middleboxes that rewrite sequence numbers, we must not require ISNs 
to be related.


>I agree that you have proposed potentially viable ways to deal with 
>the SYN cookie, and that RST state is not an issue.

A feature that I think it's fair to add:
         - Good chance of passing through app-layer middleboxes that forward
         unrecognised TCP options unchanged, but not those that discard them.


>However, there are too many problems with this, IMO, to call it viable.

Once your over-pessimistic analyses of the performance impact are 
corrected, and my ideas to reduce the size of the conn_id are taken 
into account, it's a different story.

But it's up to the WG to decide whether this is worth taking further. 
Not just you or me.

>Here's another trick that might clean up the above a little:

<snip - I'll respond separately to your later updates on this ASO 
idea, with ACK=0>

Cheers


Bob


________________________________________________________________
Bob Briscoe,                                                  BT