Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

Joe Touch <> Wed, 21 May 2014 01:14 UTC

Date: Tue, 20 May 2014 18:14:28 -0700
From: Joe Touch <>
To: "Zhou, Han" <>, "" <>, "" <>, "" <>, "" <>
Cc: Erik Nordmark <>, Tom Herbert <>
List-Id: "Discussion list for Tunneling over Foo \(with\)in IP networks \(TOFOO\)." <>

Hi, Han,

This helps, but doesn't quite address my concern.

TCP offloading is fine when the OS hands off user data, and the offload 
engine creates the entire segment.

The situation you're describing seems to be a hybrid, where the guest OS 
makes a TCP segment, and then something lower (in the VM system) parses 
that segment to create one or more segments on the wire.

If that happens, it will be incompatible with a number of existing TCP 
options, and will also cause side-effects with ACK clocking, timeouts, 
and more than a few other TCP features.
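
To make the option problem concrete, here is a toy sketch of what happens 
to a per-segment authentication code when a lower layer re-segments. This 
is only an illustration: the MAC layout, sign_segment(), resegment(), and 
the key are made up, and it is not the actual TCP MD5 or TCP-AO 
computation.

```python
import hashlib
import hmac


def sign_segment(key: bytes, seq: int, payload: bytes) -> bytes:
    """Toy per-segment MAC over (sequence number, length, payload).

    Hypothetical layout for illustration only; real TCP MD5 / TCP-AO
    cover different fields.
    """
    header = seq.to_bytes(4, "big") + len(payload).to_bytes(2, "big")
    return hmac.new(key, header + payload, hashlib.sha256).digest()


def resegment(seq: int, payload: bytes, mss: int):
    """Split one segment into MSS-sized pieces, as a layer below TCP might."""
    return [(seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]


key = b"shared-secret"          # known only to the two TCP endpoints
seq, payload = 1000, b"x" * 4000
mac = sign_segment(key, seq, payload)

# The splitting layer does not hold the key, so the best it can do is
# copy the original MAC onto every piece it emits.
pieces = resegment(seq, payload, 1460)

# The receiver recomputes the MAC per received segment and compares.
verified = [hmac.compare_digest(sign_segment(key, s, p), mac)
            for s, p in pieces]
print(len(pieces), verified)
```

Under these assumptions none of the pieces verifies, because each one has 
a different sequence number, length, and payload from the segment the 
sender signed.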

Although I appreciate the goal of efficiency and speed, there is a 
severe compatibility issue that doesn't seem to be addressed.

We can discuss this off-list if useful, FWIW.


On 5/20/2014 6:01 PM, Zhou, Han wrote:
> Hi Joe,
> Thanks for your comment.
> Yes, you are right that the "length" fields in packet headers can be regarded as a hard MTU limit, but in the draft we were referring to the interface MTU. We will make this more precise in the next version.
> As for your question, it seems there is a misunderstanding. It is always the guest OS (VM) that handles TCP, but segmentation is offloaded to the host. Let me explain the code change:
> - TX side:
> -- before the change:
>    TCP segmentation is offloaded from guest to host by TSO in the guest OS virt-driver, with the MSS carried as GSO metadata in the skbuff.
>    The VXLAN layer adds encapsulation, and overlay TCP segmentation is carried out right before sending to the host IP layer, which is the idea of GSO: segment at the latest point.
>    The host does IP fragmentation only if overlay segment + outer header exceeds the physical interface MTU (e.g., when both guest MTU and host MTU are configured to 1500).
> -- after the change:
>    TCP segmentation is offloaded from guest to host by TSO in the guest OS virt-driver, with the MSS carried as GSO metadata in the skbuff.
>    The VXLAN layer removes the GSO metadata and stores it in the VXLAN-SOE header, so overlay TCP segmentation is skipped.
>    The host does IP fragmentation according to the physical interface MTU.
> - RX side:
> -- before the change:
>    The host does IP reassembly if necessary (e.g., when both guest MTU and host MTU are configured to 1500).
>    Overlay TCP segments are decapsulated and delivered all the way to the guest, and the guest OS does the TCP handling. (high cost here)
> -- after the change:
>    The host does IP reassembly, and large packets are decapsulated and delivered to the guest, where the guest OS does the TCP handling. (cost reduced here because of the reduced number of packets)
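
A back-of-the-envelope sketch of the packet counts implied by the TX/RX 
description above. Everything here is an assumption for illustration: a 
60000-byte TSO buffer, MSS 1460, both MTUs at 1500, no TCP or IP options, 
an 8-byte VXLAN header, and a VXLAN-SOE header taken to be the same size 
(I have not checked that against the draft).

```python
import math

# Assumed header sizes in bytes (no IP or TCP options).
OUTER_IP, UDP, VXLAN = 20, 8, 8
INNER_ETH, INNER_IP, INNER_TCP = 14, 20, 20
ENCAP = UDP + VXLAN + INNER_ETH + INNER_IP + INNER_TCP  # outer IP payload overhead


def ip_fragments(ip_payload_len: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to carry ip_payload_len bytes."""
    per_fragment = (mtu - OUTER_IP) // 8 * 8  # fragment data is in 8-byte units
    return math.ceil(ip_payload_len / per_fragment)


def tcp_segments(data_len: int, mss: int) -> list:
    """Payload sizes produced by segmenting data_len bytes at mss."""
    full, rest = divmod(data_len, mss)
    return [mss] * full + ([rest] if rest else [])


DATA, MSS, HOST_MTU = 60000, 1460, 1500

# Before: overlay TCP segmentation, then IP fragmentation per segment.
segments = tcp_segments(DATA, MSS)
guest_packets_before = len(segments)
wire_before = sum(ip_fragments(s + ENCAP, HOST_MTU) for s in segments)

# After: one large encapsulated packet, fragmented only at the IP layer
# and reassembled by the receiving host before delivery to the guest.
guest_packets_after = 1
wire_after = ip_fragments(DATA + ENCAP, HOST_MTU)

print(guest_packets_before, wire_before, guest_packets_after, wire_after)
```

The point of the sketch is only that the receiving guest sees far fewer 
packets after the change; it says nothing about the TCP-level side 
effects discussed earlier in this thread.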
> I hope this clarifies.
> Best regards,
> Han
>> -----Original Message-----
>> From: Joe Touch []
>> Sent: Wednesday, May 21, 2014 1:29 AM
>> To: Zhou, Han;;;
>> Cc: Erik Nordmark; Tom Herbert
>> Subject: Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt
>> Hi, all,
>> I had a comment and a question:
>> Comment - (from the doc) overlays do have a hard MTU limit; it is the
>> limit of the encapsulation mechanism. E.g., without additional layers,
>> for UDP in IPv4 this would be at most 65507 bytes (i.e., IPv4 max -
>> (min IP header + UDP header)). Using additional adaptation layers, this
>> could be larger (e.g., see SEAL).
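
The arithmetic behind that limit, as a short sketch. The 65507 figure is 
from the comment above; the VXLAN and inner Ethernet sizes are my 
assumptions (8-byte VXLAN header, 14-byte untagged Ethernet header), and 
the constant names are arbitrary.

```python
# IPv4 Total Length is a 16-bit field, so a datagram is at most 65535 bytes.
IPV4_MAX_TOTAL_LENGTH = 0xFFFF
MIN_IPV4_HEADER = 20   # IPv4 header without options
UDP_HEADER = 8

# Largest UDP payload one IPv4 datagram can carry.
max_udp_payload = IPV4_MAX_TOTAL_LENGTH - (MIN_IPV4_HEADER + UDP_HEADER)

# Assumed overlay overhead inside that UDP payload.
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14

# Largest inner (overlay) IP packet that still fits.
max_inner_ip_packet = max_udp_payload - VXLAN_HEADER - INNER_ETHERNET_HEADER

print(max_udp_payload, max_inner_ip_packet)
```

So even before any interface MTU comes into play, the encapsulation 
itself caps what the overlay can carry in one outer datagram.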
>> Question - the code appears to have the VXLAN layer do the
>> fragmentation, with the OS layer implementing the rest of TCP. There are
>> a lot of interactions, notably:
>> 	- any mechanism outside of the TCP source and TCP destination
>> 	that interprets the TCP header will result in a decrease in
>> 	functionality
>> 		i.e., the TCP connection will support only the
>> 		intersection of options and features supported
>> 		by the source, dest, *and* VXLAN layers
>> 		(rather than being limited only by the
>> 		source-dest pair)
>> 	- if passed a full TCP segment, this mechanism will be
>> 	incompatible with TCP security (e.g., TCP MD5, TCP-AO, and
>> 	the results of the TCPCRYPT WG).
>> I'm not quite sure from your doc whether you're re-segmenting TCP
>> segments, or merely collecting them for aggregate transit (e.g., as is
>> done in burst-mode Ethernet).
>> Can you please clarify?
>> Joe
>> On 5/19/2014 8:01 PM, Zhou, Han wrote:
>>> Hi,
>>> We have updated the VXLAN-SOE draft according to earlier comments. It is now fully compatible with VXLAN-GPE, and some examples have been added for better understanding.
>>> A prototype is also implemented here (patch based on Open vSwitch):
>> 4ff85fa050f7dd2be
>>> netperf TCP_STREAM test result between a pair of VMs on hosts with 10G interfaces:
>>> Before the change: 2.62 Gbits/sec
>>> After the change: 6.68 Gbits/sec
>>> Speedup is ~2.5x (about 155% higher throughput).
>>> The patch attracted some interest in the OVS community, but since this draft is at a very early stage, Jesse regarded it as inappropriate to apply the change to the OVS tree.
>>> The discussion mail thread:
>>> So we would like to request a review here by the NVO3/TOFOO groups and the VXLAN authors: is this VXLAN extension worth formally putting into VXLAN as a standard, so that more people can benefit from it?
>>> Best regards,
>>> Han
>>> -----Original Message-----
>>> From: I-D-Announce [] On Behalf Of
>>> Sent: Friday, May 02, 2014 8:09 PM
>>> To:
>>> Subject: I-D Action: draft-zhou-li-vxlan-soe-01.txt
>>> A New Internet-Draft is available from the on-line Internet-Drafts directories.
>>>           Title           : Segmentation Offloading Extension for VXLAN
>>>           Authors         : Han Zhou
>>>                             Chengyuan Li
>>>           Filename        : draft-zhou-li-vxlan-soe-01.txt
>>>           Pages           : 13
>>>           Date            : 2014-05-02
>>> Abstract:
>>>      Segmentation offloading is nowadays common in network stack
>>>      implementation and well supported by para-virtualized network device
>>>      drivers for virtual machine (VM)s. This draft describes an extension
>>>      to Virtual eXtensible Local Area Network (VXLAN) so that segmentation
>>>      can be decoupled from physical/underlay networks and offloaded
>>>      further to the remote end-point thus improving data-plane performance
>>>      for VMs running on top of overlay networks.