Re: [Stackevo-discuss] Scope of stackevo and ossification in DC

Tom Herbert, Tue, 22 December 2015 18:49 UTC

Date: Tue, 22 Dec 2015 10:49:35 -0800
From: Tom Herbert
To: Brian Trammell
Subject: Re: [Stackevo-discuss] Scope of stackevo and ossification in DC
List-Id: IP Stack Evolution Discussion List

>> Similar to the use of protocols on the Internet we are hitting the
>> transport protocol ossification problem in the data center.
>> Specifically, performance optimizations in networking devices only
>> support TCP or UDP, and without these optimizations this negatively
>> impacts our use of other protocols.
> Right. But this is a fundamental problem, I think. NIC offloads reach pretty deeply into the transport protocol, and as such won't work with new transports whether they're encrypted or not until those new transports. A question: for NIC offload, how much of the win comes from segmentation offloading, and how much comes from other trickery? If the biggest win really is bundling a bunch of packets into a single context switch, then how would the performance of the current offload architecture compare with a smart library on top of approaches like netmap?
Segmentation offload (RX and TX) is considered a win because it reduces
the number of packets that need to be processed through the various
layers of the stack. This becomes really evident with deep layering,
such as we see with network virtualization. However, most of the
benefit can be achieved with software mechanisms, and LRO (RX
segmentation offload) is pretty controversial since the device is
compressing TCP headers and that has had some insidious effects.
Checksum offload and RSS are the critical offloads we need.
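
To make the packet-count argument concrete, here is a toy sketch of
software receive coalescing. It is an illustration only, not the Linux
GRO code or any NIC's LRO logic: the Segment type, the coalesce() and
stack_traversals() helpers, and the 65535-byte cap are all assumptions
made up for this example. It merges adjacent in-order segments of the
same flow, so fewer units have to traverse each layer.

```python
from collections import namedtuple

# Hypothetical, simplified view of a received TCP segment.
Segment = namedtuple("Segment", "flow seq payload")


def coalesce(segments, max_bytes=65535):
    """Merge adjacent, in-order segments of the same flow.

    Only the first segment's sequence number survives a merge, so any
    per-segment header detail is lost in the merged unit.
    """
    out = []
    for seg in segments:
        last = out[-1] if out else None
        if (last is not None
                and last.flow == seg.flow
                and last.seq + len(last.payload) == seg.seq
                and len(last.payload) + len(seg.payload) <= max_bytes):
            # Contiguous data for the same flow: extend the previous unit.
            out[-1] = Segment(last.flow, last.seq, last.payload + seg.payload)
        else:
            out.append(seg)
    return out


def stack_traversals(packets, layers):
    """Per-packet processing steps if every layer touches every packet."""
    return len(packets) * layers
```

With four contiguous 1000-byte segments of one flow and three layers of
encapsulation, stack_traversals() drops from 12 to 3 after coalescing.
The saving grows with the number of layers, which matches the point
about network virtualization.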

>> One example of this is the need
>> for fine-grained ECMP, which has become the driver behind many of the
>> foo-over-UDP proposals (e.g. MPLS/UDP, GRE/UDP, ...).
> So this is a separate issue -- ECMP is a (semi-elegant) hack, predicated on the assumption that things on a five-tuple need to stay together and things on separate five-tuples don't. NAT + TCP (any reordering-intolerant transport, really) makes this assumption more or less hold. Driving it in the opposite direction -- using knowledge that there's ECMP on path to do cheap traffic engineering -- leads to the unintended consequences that foo-over-udp brings with it.
> What you really want architecturally is a way for the network layer (at a gateway) to explicitly say "keep these packets together" and "it's okay to split these packets apart". It'd be even better if we had a way to request/measure/enforce actual path diversity without manually managing tunnels, but this is sadly explicitly a non-feature of our routing protocols. In any case this seems to have a harder incremental deployment story than simple transport state exposure.
Using the IPv6 flow label for ECMP (RFC 6438) solves the problem of
ECMP/RSS. With this, devices don't need to parse beyond the IPv6
header to switch packets, and we don't need the overhead of UDP
encapsulation just for the purpose of getting good ECMP.
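
As a rough sketch of what that looks like, the code below picks a next
hop from the source address, destination address and flow label alone,
all of which sit in the fixed 40-byte IPv6 header. The function names,
the build_header() test helper and the use of CRC32 as the hash are
assumptions for illustration; real switches use their own hash
functions.

```python
import ipaddress
import struct
import zlib

IPV6_HEADER_LEN = 40  # fixed IPv6 header size in bytes


def flow_key(packet: bytes):
    """Return (src, dst, flow_label) read from the fixed IPv6 header only."""
    if len(packet) < IPV6_HEADER_LEN:
        raise ValueError("truncated IPv6 header")
    (first_word,) = struct.unpack("!I", packet[:4])
    if first_word >> 28 != 6:
        raise ValueError("not an IPv6 packet")
    flow_label = first_word & 0xFFFFF  # low 20 bits of the first word
    return packet[8:24], packet[24:40], flow_label


def pick_next_hop(packet: bytes, next_hops):
    """Choose a next hop by hashing the 3-tuple; CRC32 is an assumed hash."""
    src, dst, label = flow_key(packet)
    digest = zlib.crc32(src + dst + label.to_bytes(3, "big"))
    return next_hops[digest % len(next_hops)]


def build_header(src, dst, flow_label, next_header=47, payload_len=0):
    """Hypothetical helper that builds a bare IPv6 header for experiments."""
    first_word = (6 << 28) | (flow_label & 0xFFFFF)
    return (struct.pack("!IHBB", first_word, payload_len, next_header, 64)
            + ipaddress.IPv6Address(src).packed
            + ipaddress.IPv6Address(dst).packed)
```

Nothing past byte 40 is read, so the choice is the same whatever is
encapsulated (GRE, MPLS, an encrypted transport), and the sender can
spread flows by setting different flow labels.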

>> This problem is likely a proper subset of the general problem, but
>> might be more amenable to some "simpler" solutions. Is this within
>> scope of stackevo?
> It very much seems to be, yes. Let's keep this discussion going on this list...
Protocol ossification is also now in the vernacular of Linux