Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

"Zhou, Han" <> Wed, 21 May 2014 03:07 UTC

From: "Zhou, Han" <>
To: Tom Herbert <>
Date: Wed, 21 May 2014 03:07:11 +0000
Cc: "" <>, "" <>, "" <>, "" <>, Erik Nordmark <>
Subject: Re: [Tofoo] FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt
List-Id: "Discussion list for Tunneling over Foo \(with\)in IP networks \(TOFOO\)." <>

Hi Tom,

> There has been a lot of work recently to get software variants (GSO and GRO) working in Linux with tunnels.
Are you referring to UDP GRO for VXLAN? I am aware of this implementation in kernel 3.14. The performance gain should be similar, but it still requires careful configuration of the VM MTU to avoid IP fragmentation at the physical MTU, rather than decoupling the overlay from the underlay. I think each of these solutions has its own advantages.

> I think this would be applicable to about all tunnel protocols. Then you would also want to do reassembly at each hop? Sounds expensive. Once you segment, I think you'd only want to reassemble at the end host.
No segmentation or reassembly is needed at each hop; instead, the "offload" is carried hop by hop: each hop simply copies the segmentation metadata into the next hop's tunnel header. Finally, segmentation can be avoided entirely if the destination is a VM on a hypervisor.
This scenario is similar to Example 2 (Section 5.2), except that the VXLAN-SOE tunnel between the two gateways is replaced by any other tunnel protocol that supports a similar offloading feature, such as STT.
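To make the hop-by-hop idea concrete, here is a minimal Python sketch. It is a hypothetical model, not the VXLAN-SOE or STT wire format: `TunnelFrame`, `gateway_forward` and `deliver` are illustrative names, and the only metadata carried is an assumed `mss` field. Gateways copy the metadata into the next tunnel header instead of segmenting, and only the final endpoint decides whether to segment at all.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical model only: NOT the VXLAN-SOE or STT header layout.
@dataclass
class TunnelFrame:
    payload: bytes  # large, un-segmented inner payload (e.g. a TSO send)
    mss: int        # segmentation metadata carried in the tunnel header

def gateway_forward(frame: TunnelFrame) -> TunnelFrame:
    """Re-encapsulate for the next tunnel hop by copying the metadata,
    instead of segmenting here and reassembling at the next hop."""
    return TunnelFrame(payload=frame.payload, mss=frame.mss)

def deliver(frame: TunnelFrame, dest_accepts_large: bool) -> list[bytes]:
    """Final endpoint: a VM that accepts large packets (LRO-style) gets the
    payload whole; otherwise segment it to the carried MSS."""
    if dest_accepts_large:
        return [frame.payload]
    return [frame.payload[i:i + frame.mss]
            for i in range(0, len(frame.payload), frame.mss)]

frame = TunnelFrame(payload=bytes(4000), mss=1448)
hop2 = gateway_forward(gateway_forward(frame))   # two gateway hops
print(len(deliver(hop2, dest_accepts_large=True)))                 # 1
print([len(s) for s in deliver(hop2, dest_accepts_large=False)])   # [1448, 1448, 1104]
```

The design point the sketch tries to capture is that intermediate hops touch only metadata, so the per-hop cost does not grow with the number of wire-size segments.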

Best regards,

From: Tom Herbert []
Sent: Wednesday, May 21, 2014 10:42 AM
To: Zhou, Han
Cc:;;;; Erik Nordmark
Subject: Re: FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

On Tue, May 20, 2014 at 5:28 PM, Zhou, Han <<>> wrote:
Hi Tom,

Thanks for your comments.
Yes, TSO/LRO with VXLAN support should provide similar or even better performance gains, but the mechanism proposed by this draft decouples the overlay from the underlay, and it is hardware independent.
Secondly, hardware offloading usually supports only TCP (TSO). The mechanism here can also improve performance for large UDP packets, which the prototype has verified.
BTW, how many off-the-shelf NIC types support VXLAN offloading? Is any performance data available?

HW is not a requirement for this offloading. There has been a lot of work recently to get software variants (GSO and GRO) working with tunnels in Linux, and these do show 2-3x performance improvements. HW support would mostly be an incremental improvement on that, or becomes interesting for OS bypass, as with SR-IOV. GSO/GRO can be presented to the guest as TSO and LRO, so it's possible to carry large packets from the guest all the way to the host driver. It should also be possible for two VMs on the same host to communicate without segmentation, i.e., the TSO output of one VM becomes the input of the other.
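As a rough illustration of what software GSO and GRO do for a single TCP flow, the toy sketch below splits one large send into MSS-sized segments and then merges contiguous segments back together. It is a simplification under stated assumptions: the real Linux code also rebuilds headers, checksums and flags, handles sequence-number wraparound, and flushes on timers; `gso_segment`, `gro_coalesce` and the 65535-byte cap are illustrative.

```python
from __future__ import annotations

def gso_segment(seq: int, payload: bytes, mss: int) -> list[tuple[int, bytes]]:
    """Split one large send into MSS-sized segments, each tagged with its
    starting TCP sequence number (wraparound ignored in this toy)."""
    return [(seq + off, payload[off:off + mss])
            for off in range(0, len(payload), mss)]

def gro_coalesce(segments: list[tuple[int, bytes]],
                 max_len: int = 65535) -> list[tuple[int, bytes]]:
    """Merge in-order, contiguous segments into larger packets, never
    exceeding max_len bytes per merged packet."""
    merged: list[tuple[int, bytes]] = []
    for seq, data in segments:
        if merged:
            prev_seq, prev_data = merged[-1]
            if prev_seq + len(prev_data) == seq and len(prev_data) + len(data) <= max_len:
                merged[-1] = (prev_seq, prev_data + data)
                continue
        merged.append((seq, data))
    return merged

segs = gso_segment(seq=1000, payload=b"x" * 64000, mss=1448)
print(len(segs))                # 45
print(len(gro_coalesce(segs)))  # 1
```

When both ends of a path can pass the large unit through (for example two VMs on the same host), neither function needs to run at all, which is the case described in the paragraph above.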

The advantage that I see in your draft is that it allows an intermediate device to perform segmentation/reassembly instead of fragmentation.

Likewise, setting a large MTU on overlay interfaces achieves a similar result, but the overlay/underlay decoupling issue remains. It is usually advised to set the overlay MTU slightly smaller than the underlay MTU to avoid inefficient fragmentation after the outer header is added, but to achieve really high performance between VMs, a large MTU is preferred.
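The "slightly smaller" overlay MTU can be worked out from the VXLAN encapsulation overhead. A minimal sketch, assuming an untagged inner Ethernet frame and no outer IP options: outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14) is 50 bytes, or 70 bytes with an outer IPv6 header.

```python
# Per-packet VXLAN overhead in bytes, relative to the underlay IP MTU.
# Assumes no inner 802.1Q tag and no outer IP options/extension headers.
INNER_ETHERNET = 14   # inner Ethernet header carried inside the tunnel
VXLAN = 8
UDP = 8
OUTER_IP = {"ipv4": 20, "ipv6": 40}

def overlay_mtu(underlay_mtu: int, outer: str = "ipv4") -> int:
    """Largest inner IP packet that fits without fragmenting the outer packet."""
    overhead = OUTER_IP[outer] + UDP + VXLAN + INNER_ETHERNET
    return underlay_mtu - overhead

print(overlay_mtu(1500))          # 1450
print(overlay_mtu(1500, "ipv6"))  # 1430
print(overlay_mtu(9000))          # 8950
```

So a guest left at the default 1500-byte MTU on a 1500-byte underlay produces outer packets 50 bytes too large, which is exactly the fragmentation case the paragraph warns about.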

It depends on the performance dimension being optimized. Larger MTUs could, for instance, increase the latency of high-priority small packets (HOL blocking), and UDP-based applications might use the MTU to decide how large their packets can be without fragmentation.
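The head-of-line blocking cost of a large MTU is bounded below by serialization delay: a small packet queued behind a full-size one waits at least the time it takes to put that packet on the wire. A quick sketch of the arithmetic, assuming a 10 Gbit/s link (the interface speed of the test hosts in this thread) and ignoring preamble and inter-frame gap:

```python
def serialization_delay_us(packet_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to transmit one packet at the given link rate."""
    return packet_bytes * 8 / (link_gbps * 1e9) * 1e6

for size in (1500, 9000):
    print(size, round(serialization_delay_us(size, 10), 2))
# 1500 1.2
# 9000 7.2
```

So a 9000-byte frame holds the link six times longer than a 1500-byte one before a queued small packet can go.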

And for connections between the overlay and the physical network, path MTU discovery does not always work.
This configuration complexity and pain point can be resolved simply by decoupling the overlay and underlay MTUs, as this draft proposes. Here is an example of configuration confusion:

Ideally, all NV tunnel protocols should support similar metadata, so that overlay segmentation can be offloaded hop by hop.

I think this would be applicable to about all tunnel protocols. Then you would also want to do reassembly at each hop? Sounds expensive. Once you segment, I think you'd only want to reassemble at the end host.

One important thing to keep in mind, and a hard lesson from real deployments :-): segmentation offload is opportunistic and very dependent on traffic conditions. Its value can be fleeting in real workloads. For instance, imagine a host with many active high-throughput connections (such as video serving) that hits a hiccup causing all congestion windows to shrink. If the system is not provisioned for this event, it can be very hard to recover (a lot more CPU is required to achieve the same throughput). This differentiates a larger MTU from relying on segmentation: the former offers more predictable CPU savings.
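The CPU argument can be illustrated with a deliberately crude model, which is an assumption and not a measurement: the host stack pays a fixed cost per packet it handles, a GSO/TSO super-packet is capped both by a 64 KB limit and by the congestion window, and a large-MTU packet carries its full size regardless of cwnd. `GSO_MAX`, `MSS` and the function names are illustrative.

```python
GSO_MAX = 65536  # assumed upper bound on one GSO/TSO super-packet, in bytes
MSS = 1448       # assumed TCP MSS for a 1500-byte MTU with timestamps

def offload_unit(cwnd_segments: int) -> int:
    """Bytes per super-packet: limited by the congestion window and GSO_MAX."""
    return min(GSO_MAX, cwnd_segments * MSS)

def stack_packets_per_sec(throughput_gbps: float, unit_bytes: int) -> int:
    """Packets the host stack must process per second at this throughput."""
    return round(throughput_gbps * 1e9 / 8 / unit_bytes)

print(stack_packets_per_sec(10, offload_unit(100)))  # 19073  (healthy cwnd)
print(stack_packets_per_sec(10, offload_unit(4)))    # 215815 (cwnd collapsed)
print(stack_packets_per_sec(10, 9000))               # 138889 (9000-byte packets)
```

Under this model, a collapse from a healthy window to four segments raises the per-packet work more than tenfold at the same throughput, while the 9000-byte case stays constant; that is the "more predictable CPU savings" point in numbers.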


Best regards,

From: Tom Herbert [<>]
Sent: Wednesday, May 21, 2014 2:44 AM
To: Zhou, Han
Cc:<>;<>;<>;<>; Erik Nordmark
Subject: Re: FW: I-D Action: draft-zhou-li-vxlan-soe-01.txt

Hi Zhou, a couple of questions inline.

On Mon, May 19, 2014 at 8:01 PM, Zhou, Han <<>> wrote:

We have updated the VXLAN-SOE draft based on earlier comments. It is now fully compatible with VXLAN-GPE, and some examples have been added for clarity.

A prototype has also been implemented (a patch based on Open vSwitch):

netperf TCP_STREAM test results between a pair of VMs on hosts with 10G interfaces:

Before the change: 2.62 Gbits/sec
After the change: 6.68 Gbits/sec
That is a ~2.5x speedup.
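For reference, the ratio between the two reported netperf results works out as follows (the figures are the ones quoted in this message; nothing else is assumed):

```python
before_gbps = 2.62  # netperf TCP_STREAM throughput before the change
after_gbps = 6.68   # netperf TCP_STREAM throughput after the change

speedup = after_gbps / before_gbps
improvement_pct = (after_gbps - before_gbps) / before_gbps * 100
print(f"speedup {speedup:.2f}x, improvement {improvement_pct:.0f}%")
# speedup 2.55x, improvement 155%
```

In other words, the new throughput is about 255% of the old one, i.e. a ~155% improvement.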
Can you provide some more details on this benefit? It seems like plain TSO/LRO that understands encapsulation should provide similar benefits when going between hosts.

The patch attracted some interest in the OVS community, but since this draft is at a very early stage, Jesse considered it inappropriate to apply the change to the OVS tree.
The discussion thread:

So we would like to request a review by the NVO3/TOFOO groups and the VXLAN authors: is this VXLAN extension worth formally standardizing as part of VXLAN, so that more people can benefit from it?
Could you get the same effect by setting larger MTUs on the overlay network interface and relying on path MTU discovery when going over the physical network?

Best regards,

-----Original Message-----
From: I-D-Announce [<>] On Behalf Of<>
Sent: Friday, May 02, 2014 8:09 PM
Subject: I-D Action: draft-zhou-li-vxlan-soe-01.txt

A New Internet-Draft is available from the on-line Internet-Drafts directories.

        Title           : Segmentation Offloading Extension for VXLAN
        Authors         : Han Zhou
                          Chengyuan Li
        Filename        : draft-zhou-li-vxlan-soe-01.txt
        Pages           : 13
        Date            : 2014-05-02

   Segmentation offloading is nowadays common in network stack
   implementations and well supported by para-virtualized network device
   drivers for virtual machines (VMs). This draft describes an extension
   to Virtual eXtensible Local Area Network (VXLAN) so that segmentation
   can be decoupled from physical/underlay networks and offloaded
   further to the remote end-point, thus improving data-plane performance
   for VMs running on top of overlay networks.
