Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

Joe Touch <> Wed, 21 May 2014 17:02 UTC

Date: Wed, 21 May 2014 10:02:02 -0700
From: Joe Touch <>
To: "Zhou, Han" <>, "" <>, "" <>, "" <>, "" <>
Cc: Erik Nordmark <>, Tom Herbert <>
Subject: Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt
List-Id: "Discussion list for Tunneling over Foo \(with\)in IP networks \(TOFOO\)." <>

On 5/20/2014 7:17 PM, Zhou, Han wrote:
> Hi Joe,
> This is an interesting topic.
>> TCP offloading is fine when the OS hands off user data, and the offload
>> engine creates the entire segment.
> Existing TSO/GSO mechanisms deliver a full (large) TCP segment to the
> "offload engine", which then creates smaller segments according to the
> physical MTU and recalculates checksums. This is the case even
> without overlays considered. So I suppose the problem you pointed out
> is not related to my change, but a general limitation of TSO/GSO,
> right?

It depends on what part of TCP happens in the guest OS vs. the 
underlying engine. If you expose the TCP API to the guest OS (the API 
spec'd in RFC793), and hand "Send" call data down to the engine, that's fine.

However, what I think is happening is this:

	- the guest OS receives the "Send" call and creates a TCP
	segment, including TCP header and TCP options

	- the guest OS hands the TCP segment to the engine

	- the engine parses that TCP segment to create multiple
	outgoing segments, typically by copying the passed segment's
	header and options, and recalculating the fields it
	thinks it needs to

Simply put, that's as bad as having any middlebox re-calculating TCP 
segments, and is guaranteed to create problems (even if the 'typical' 
case doesn't trip over them).
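
To make the copy step concrete, here is a minimal sketch in Python of the
kind of split described in the list above. It is not taken from any real
TSO/GSO implementation: the function name, the port and sequence values,
and the use of option kind 253 (an experimental kind) are all assumptions
for illustration. The checksum is left at zero and the flags are copied
unchanged, both of which a real engine would have to handle.

```python
import struct

# 20-byte fixed TCP header: ports, seq, ack, data offset, flags, window, checksum, urgent
TCP_FIXED_HDR = struct.Struct("!HHIIBBHHH")

def naive_tso_split(segment: bytes, mss: int) -> list:
    """Split one large TCP segment into MSS-sized segments (hypothetical helper)."""
    sport, dport, seq, ack, off_byte, flags, window, _csum, urg = \
        TCP_FIXED_HDR.unpack_from(segment)
    hdr_len = (off_byte >> 4) * 4                   # data offset counts 32-bit words
    options = segment[TCP_FIXED_HDR.size:hdr_len]   # copied blindly, never interpreted
    payload = segment[hdr_len:]
    out = []
    for start in range(0, len(payload), mss):
        fixed = TCP_FIXED_HDR.pack(
            sport, dport,
            (seq + start) & 0xFFFFFFFF,             # the one field this sketch recalculates
            ack, off_byte, flags, window,
            0,                                      # checksum omitted in this sketch
            urg)
        out.append(fixed + options + payload[start:start + mss])
    return out

# Example input: a 3000-byte payload carrying one option the engine does not know.
unknown_opt = bytes([253, 4, 0xAB, 0xCD])           # kind 253, length 4 (assumed example)
header = TCP_FIXED_HDR.pack(1234, 80, 1000, 0, 0x60, 0x18, 65535, 0, 0)
big = header + unknown_opt + bytes(3000)
segs = naive_tso_split(big, mss=1400)
```

Every segment in segs carries the same unknown option bytes, whether or not
the option's value is still valid for that particular segment.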

The problem is that the engine's TCP interpreter may not understand all 
TCP header options - when (not if) that happens, what does it do?

RFC793 is clear on this - when a SYN arrives with an option that isn't 
understood, the receiver MUST silently ignore that option.

So the engine ought to have stripped out all options it doesn't 
understand from the first SYN sent*. But I suspect that's not what it 
thinks it should do - I suspect it thinks it's OK to merely copy - or 
pass through - options it doesn't understand.
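
As a rough illustration of the stripping behaviour, the sketch below walks
the option list of an outgoing SYN and drops every kind the engine does not
implement. The set of "known" kinds (MSS, window scale, SACK-permitted,
timestamps) and the function names are assumptions, not a description of
any particular engine.

```python
# Option kinds this hypothetical engine implements:
# 2 = MSS, 3 = window scale, 4 = SACK-permitted, 8 = timestamps
KNOWN_KINDS = {2, 3, 4, 8}

def parse_options(raw: bytes) -> list:
    """Return (kind, value) pairs from a TCP option block, skipping NOPs."""
    opts = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                 # End of Option List
            break
        if kind == 1:                 # No-Operation, single byte
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]           # length covers kind and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        opts.append((kind, raw[i + 2:i + length]))
        i += length
    return opts

def strip_unknown(raw: bytes, known=KNOWN_KINDS) -> bytes:
    """Rebuild the option block keeping only options whose kind is in `known`."""
    kept = b"".join(bytes([kind, len(value) + 2]) + value
                    for kind, value in parse_options(raw) if kind in known)
    return kept + b"\x00" * (-len(kept) % 4)   # zero-pad to a 32-bit boundary

# MSS=1460, an unknown kind-253 option, NOP, window scale=7
syn_opts = bytes([2, 4, 0x05, 0xB4, 253, 4, 0xAB, 0xCD, 1, 3, 3, 7])
cleaned = strip_unknown(syn_opts)
```

Because the unknown option never leaves the host, the peer never agrees to
use it, so later segments cannot depend on it.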

What should happen is that the engine interface should NEVER be a TCP 
segment formed by the guest OS. If what you want is to offload 
segmentation, you ought to pass the user data and TCP header (and its 
options) as separate parameters.
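
One possible shape for such an interface, sketched in Python. All of the
names here (TcpOption, SendRequest, engine_send, UnsupportedOption) are
hypothetical; the point is only that the engine receives structured fields
rather than a pre-built segment, so it can refuse what it does not
understand instead of copying it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TcpOption:
    kind: int
    value: bytes = b""

@dataclass(frozen=True)
class SendRequest:
    seq: int          # sequence number of the first payload byte
    options: tuple    # structured TcpOption items, not raw header bytes
    payload: bytes    # user data only, no TCP header attached

class UnsupportedOption(Exception):
    pass

def engine_send(req: SendRequest, mss: int, supported_kinds: set) -> list:
    """Segment user data; reject any option the engine does not implement."""
    for opt in req.options:
        if opt.kind not in supported_kinds:
            raise UnsupportedOption(opt.kind)
    segments = []
    for off in range(0, len(req.payload), mss):
        # This sketch only returns (seq, options, data) tuples; a real engine
        # would build each header here and could regenerate per-segment options.
        segments.append(((req.seq + off) & 0xFFFFFFFF,
                         req.options,
                         req.payload[off:off + mss]))
    return segments

ok = engine_send(SendRequest(1, (TcpOption(8, bytes(8)),), bytes(2500)),
                 mss=1000, supported_kinds={2, 8})
```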

(* This is why a correctly written engine ends up reducing TCP 
functionality: a connection can support only what is supported 
by the endpoints AND the engine [on each end]. Any option the engine 
doesn't support should never be allowed on the connection.)

> In my understanding, the TCP implementation should decide whether to
> use offloading according to the features/options required by a TCP
> connection. If a required option (such as MD5) is not supported by
> offloading, the TCP stack should do the segmentation itself instead
> of using offloading.

That works if unknown options are assumed NOT SUPPORTED.
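
In other words, the decision has to be an allow-list, not a deny-list. A
minimal sketch, assuming a hypothetical engine that implements only MSS,
window scale, SACK-permitted and timestamps (kinds 2, 3, 4, 8); the
function name is made up for illustration:

```python
ENGINE_SUPPORTED = {2, 3, 4, 8}   # assumed feature set of the offload engine

def choose_path(option_kinds, engine_supported=ENGINE_SUPPORTED) -> str:
    """Offload only when every option on the connection is known to the engine."""
    if set(option_kinds) <= set(engine_supported):
        return "offload"
    return "software"             # anything unknown is treated as NOT SUPPORTED

paths = {
    "mss+timestamps": choose_path([2, 8]),
    "mss+md5": choose_path([2, 19]),        # kind 19 = MD5 signature option
    "experimental": choose_path([253]),     # a kind the engine has never seen
}
```

Only the first connection is handed to the engine; the other two fall back
to segmentation in the TCP stack.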

But I still don't quite understand why you want the segmentation 
happening in the VM - why not pass the MTU info to the virtual interface 
in the guest OS and let it handle things?
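
For a sense of the numbers involved, here is the arithmetic as a small
Python sketch. It assumes VXLAN over IPv4 with no VLAN tags, a 1500-byte
underlay IP MTU, and inner IPv4 and TCP headers without options; none of
these values come from the draft.

```python
# Bytes VXLAN adds inside the underlay IP MTU (assumed IPv4, untagged):
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HDR = 8
INNER_ETHERNET = 14
VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETHERNET

def guest_mtu(underlay_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """MTU to expose on the guest's virtual interface."""
    return underlay_mtu - overhead

def guest_mss(mtu: int, ip_hdr: int = 20, tcp_hdr: int = 20) -> int:
    """Largest TCP payload per segment for that MTU, without IP/TCP options."""
    return mtu - ip_hdr - tcp_hdr

mtu = guest_mtu(1500)
mss = guest_mss(mtu)
```

With that MTU on the virtual interface, the guest's own TCP produces
segments that already fit, and nothing below it has to re-segment.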

> In fact, the proposal in this draft should alleviate this
> limitation for TCP connections between VMs behind the same gateways,
> because in this case no real TCP segmentation is performed by the
> "offload engine".
> Let me know if you have more concerns, or maybe an example of how an
> option is broken by TSO/GSO; then we can check what the current
> solution in the kernel is.

See above - and thanks,