Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

"Rodney W. Grimes" <> Thu, 08 August 2019 23:33 UTC

From: "Rodney W. Grimes" <>
To: Greg White <>
Date: Thu, 8 Aug 2019 16:33:28 -0700 (PDT)
CC: "Scaglione, Giuseppe" <>, "" <>

> Hi Giuseppe,
> Thanks for sharing your work.   At the TSVWG meeting at IETF105 it was mentioned that there was a hardware implementation of SCE running in a 25Gbps switch (1:47:30 in  I presume this is the implementation that was being referred to.

As the person who made that comment, I'll confirm that yes, this is the implementation I was referring to.

> If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE based on instantaneous queuing delay) in a single queue, rather than SCE-style.   The CE marks are then transformed into SCE marking by the receiver itself before being handed to dctcp-sce or reno-sce.  So, in essence this test appears to be a study of various AQM ramp options and modified ECN congestion responses in a homogeneous DCTCP datacenter environment, rather than being an SCE experiment specifically.

I'll defer to Giuseppe for details on that, but I do not believe this is simple DCTCP-style CE threshold marking when they are running in SCE test mode.  See figure 1 and the associated text.

As a person somewhat familiar with SCE, I can state that DCTCP CE marking as used in a data center, together with the end node's response to that CE mark, is in fact basically the same as SCE: there is no latching of CE state until CWR at the receiver, and the marks are in proportion to the queue depth.  BUT by using a different ECT() marking and a different TCP feedback bit, you can signal both CE and SCE at the same time, which the classical data center DCTCP modifications do not allow for.
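To make the distinction concrete, here is a minimal sketch of the two marking disciplines being contrasted: a DCTCP-style step function versus an SCE-style probabilistic ramp.  The threshold values and function names are illustrative assumptions, not taken from the draft or the switch implementation.

```python
import random

# Illustrative (assumed) queue-delay thresholds in microseconds -- not
# values from the draft or from the tested switch.
SCE_RAMP_START_US = 50    # below this, no SCE marking
SCE_RAMP_END_US = 200     # at or above this, 100% SCE marking
CE_STEP_US = 200          # DCTCP-style single step threshold

def dctcp_ce_mark(delay_us):
    """Classic DCTCP step: mark CE on every packet past one threshold."""
    return delay_us >= CE_STEP_US

def sce_ramp_mark(delay_us):
    """SCE-style ramp: marking probability rises linearly across the ramp,
    giving the sender an early, proportional congestion signal."""
    if delay_us <= SCE_RAMP_START_US:
        return False
    if delay_us >= SCE_RAMP_END_US:
        return True
    p = (delay_us - SCE_RAMP_START_US) / (SCE_RAMP_END_US - SCE_RAMP_START_US)
    return random.random() < p
```

The point of the separate ECT() codepoint and feedback bit is that a queue could apply both functions at once: the ramp for the high-fidelity SCE signal and the step (or tail drop) as the classic CE backstop.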

> Could you provide some information on the buffer size?  It looks like it may be ~200us.   How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a hard marking threshold or do they similarly use a ramp)?  Have you thought about adjusting the ramp so that 100% marking occurs before you reach the tail drop limit?  I wonder if that would eliminate the packet drops.
> For comparison, you may as well use stock DCTCP which would avoid the receiver needing to translate the CE marks into SCE marks, since DCTCP can just interpret the CE marks directly.   But, I didn't think that the goal of SCE was to replace DCTCP in datacenters anyway?

The goal of SCE is to provide a robust congestion mechanism, without having to be restricted to use in a datacenter.

> Are the tables missing some data?  I don?t see the bandwidth/retries/packet drop data for the continuous TCP connections.
> Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the ping time when the link is saturated?

I do have one for this: when the link is idle, your receiver is not going to see the packet until a device interrupt occurs, which may take a good deal of time, as these are not simple single-packet/single-interrupt type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the driver is reached or something else causes the driver to interrupt.  At high packet rates you're delivering a near-continuous stream of interrupts to the CPU, so your pings get processed by the receiver almost instantly.
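A back-of-envelope model of this interrupt-coalescing effect, under assumed (purely illustrative) NIC parameters, shows why a lone ping on an idle link waits out the coalescing timer while pings on a saturated link ride continuously firing interrupts:

```python
# Assumed NIC interrupt-coalescing timer: the NIC waits up to this long,
# or until a batch of packets arrives, before raising an interrupt.
# 100 us is an illustrative figure, not measured from the tested hardware.
COALESCE_TIMER_US = 100.0

def added_latency_us(pkts_per_sec, batch_size=8):
    """Extra delay a single ping sees before the driver services it.

    At low packet rates the batch never fills, so the ping waits out the
    full coalescing timer.  At high rates interrupts fire continuously
    and the wait shrinks toward the time it takes to fill one batch.
    """
    if pkts_per_sec <= 0:
        return COALESCE_TIMER_US          # idle link: full timer expires
    gap_us = batch_size / pkts_per_sec * 1e6  # time to fill one batch
    return min(COALESCE_TIMER_US, gap_us)
```

With these numbers, an idle link adds the full 100 us per direction, while at line-rate packet arrival the added wait drops to single-digit microseconds, which is consistent with the idle ping time in Pic 1 being roughly double the saturated one.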

> Best Regards,
> Greg
> From: tsvwg <> on behalf of "Scaglione, Giuseppe" <>
> Date: Tuesday, August 6, 2019 at 3:36 PM
> To: "" <>
> Subject: [tsvwg] Switch testing at 25G with ECN --SCE Draft
> Greetings,
> The attached document contains data collected testing with an Ethernet switch and the new TCP-dctcp-sce and TCP-reno-sce algorithms, draft-morton-taht-tsvwg-sce-00.
> Best regards,
> Giuseppe Scaglione
> Distinguished Technologist
> R&D Lab Hardware
> Aruba Networks
> Hewlett Packard Enterprise
> 8000 Foothills Blvd.  (MS 5555)
> Roseville, CA   95747
> Tel : 1-916-471-9189

Rod Grimes