Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

"Zhou, Han" <> Thu, 22 May 2014 02:44 UTC

Hi Joe,

Thanks for the explanation; in theory I tend to agree with you.
However, look at a real-world offload implementation such as the Linux kernel. GSO is implemented the way I described: a whole TCP packet (including headers and options) is generated by the TCP layer, and segmentation is performed as late as possible, depending on the features supported by the net-device:
1) If TSO is NOT supported by the net-device, GSO software segmentation is performed (tcp_gso_segment()).
2) If TSO is supported by the NIC, segmentation is offloaded to the NIC hardware. (In this case tcpdump captures only the large packets before segmentation, not the small ones.)
3) In a guest OS where TSO is supported by the virtual net-device, segmentation is offloaded to the host OS.

My change is specific to 3): without the change, segmentation is performed according to the MSS specified by the guest OS; with the change, this segmentation is skipped.
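
To make the three cases concrete, here is a minimal sketch of the decision in Python. It is only an illustration of the logic described above, not kernel code: the names SegmentationPoint, segmentation_point and skip_host_segmentation are made up for this example, and the last flag simply models "with the change" for case 3).

```python
from enum import Enum


class SegmentationPoint(Enum):
    """Where a large TCP packet ends up being segmented (illustrative names)."""
    SOFTWARE_GSO = "software GSO in the sending stack"
    NIC_HARDWARE = "TSO in NIC hardware"
    HOST_OS = "host OS, using the MSS given by the guest"
    SKIPPED = "segmentation at the guest MSS is skipped"


def segmentation_point(device_has_tso, is_guest, skip_host_segmentation=False):
    """Return where segmentation happens for the three cases in this mail.

    skip_host_segmentation is a hypothetical switch standing in for the
    proposed change; it only matters for case 3).
    """
    if not device_has_tso:
        # Case 1: the net-device lacks TSO, so software GSO does the work.
        return SegmentationPoint.SOFTWARE_GSO
    if not is_guest:
        # Case 2: a physical NIC with TSO segments in hardware.
        return SegmentationPoint.NIC_HARDWARE
    if skip_host_segmentation:
        # Case 3 with the change: the host does not segment at the guest MSS.
        return SegmentationPoint.SKIPPED
    # Case 3 without the change: the host segments on behalf of the guest.
    return SegmentationPoint.HOST_OS
```

For example, segmentation_point(True, True) gives HOST_OS, while segmentation_point(True, True, skip_host_segmentation=True) gives SKIPPED; cases 1) and 2) are unaffected by the flag.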

As you can see, the compatibility problem you mentioned, if it is a real problem, is not introduced by my change.

Because of this, it might be better discussed elsewhere, such as in the Linux kernel community.
I don't think the current Linux kernel GSO implementation addresses the problem you mentioned (I could be wrong). Do you know of any example of a *correct* implementation of TCP offloading? Otherwise it could be a TODO for the Linux kernel.

Best regards,
> -----Original Message-----
> From: Joe Touch []
> Sent: Thursday, May 22, 2014 1:02 AM
> To: Zhou, Han;;;
> Cc: Erik Nordmark; Tom Herbert
> Subject: Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt
> On 5/20/2014 7:17 PM, Zhou, Han wrote:
> > Hi Joe,
> >
> > This is an interesting topic.
> >
> >> TCP offloading is fine when the OS hands off user data, and the offload
> >> engine creates the entire segment.
> >
> > Existing TSO/GSO mechanisms deliver a full (large) TCP segment to the
> > "offload engine", which then creates smaller segments according to
> > the physical MTU and recalculates checksums. This is the case even
> > when no overlay is involved. So I suppose the problem you pointed out
> > is not related to my change, but is a general limitation of TSO/GSO,
> > right?
> It depends on what part of TCP happens in the guest OS vs. the
> underlying engine. If you expose the TCP API to the guest OS (the API
> spec'd in RFC793), and hand "Send" call data down to the engine, that's
> fine.
> However, what I think is happening is this:
> 	- the guest OS receives the "Send" call and creates a TCP
> 	segment, including TCP header and TCP options
> 	- the guest OS hands the TCP segment to the engine
> 	- the engine parses that TCP segment to create multiple
> 	outgoing segments, typically by copying the passed segment's
> 	header and options, and recalculating the fields it
> 	thinks it needs to
> Simply put, that's as bad as having any middlebox re-calculating TCP
> segments, and is guaranteed to create problems (even if the 'typical'
> case doesn't trip over them).
> The problem is that the engine's TCP interpreter may not understand all
> TCP header options - when (not if) that happens, what does it do?
> RFC793 is clear on this - when a SYN arrives with an option that isn't
> understood, the receiver MUST silently ignore that option.
> So the engine ought to have stripped out all options it doesn't
> understand from the first SYN sent*. But I suspect that's not what it
> thinks it should do - I suspect it thinks it's OK to merely copy - or
> pass through - options it doesn't understand.
> What should happen is that the engine interface should NEVER be a TCP
> segment formed by the guest OS. If what you want is to offload
> segmentation, you ought to pass the user data and TCP header (and its
> options) as separate parameters.
> (* this is why a correctly-written engine ends up reducing TCP
> functionality, because a connection can support only what is supported
> by the endpoints AND the engine [on each end]). Any option the engine
> doesn't support should never be allowed on the connection.
> > In my understanding, the TCP implementation should decide whether to
> > use offloading according to the features/options required by a TCP
> > connection. If a required option (such as MD5) is not supported by
> > offloading, the TCP stack should do the segmentation itself instead
> > of utilizing offloading.
> That works if unknown options are assumed NOT SUPPORTED.
> But I still don't quite understand why you want the segmentation
> happening in the VM - why not pass the MTU info to the virtual interface
> in the guest OS and let it handle things?
> > In fact, the proposal in this draft should be able to alleviate the
> > limitation for TCP connections between VMs behind same gateways, because
> > in this case there is no real TCP segmentation performed by "offload
> > engine".
> >
> > Let me know if you have more concerns, or maybe an example of how an
> > option is broken by TSO/GSO, then we can check what's the current
> > solution in kernel.
> See above - and thanks,
> Joe