Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

"Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com> Fri, 09 August 2019 00:00 UTC

From: "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>
To: "rgrimes@freebsd.org" <rgrimes@freebsd.org>, Greg White <g.white@CableLabs.com>
CC: "tsvwg@ietf.org" <tsvwg@ietf.org>
Date: Fri, 09 Aug 2019 00:00:05 +0000
Message-ID: <AT5PR8401MB07070C672C9F519C05D2D3F599D60@AT5PR8401MB0707.NAMPRD84.PROD.OUTLOOK.COM>
References: <A8E3F5E9-443D-4F5A-9336-9A0E2E72C278@cablelabs.com> <201908082333.x78NXS0T094756@gndrsh.dnsmgr.net>
In-Reply-To: <201908082333.x78NXS0T094756@gndrsh.dnsmgr.net>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/C003wJXJK-bd6toHye_B_6iX53w>
Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

Greetings,

Replying to Greg:

>>If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE
>>based on instantaneous queuing delay) in a single queue, rather than SCE-style.   

The switch only knows RFC 3168 rules. That is, if the packet is IP and the header is marked ECN capable (10 or 01), the packet can be remarked. The remarking probability is given by the formula on page 5:

remark_probability = probability_rise * (current_mem_utilization - mem_start_point)    (all values in %)

In the paper, wherever it says SCE testing, the settings were:
probability_rise = 1
mem_start_point = 0
current_mem_utilization --> dynamic, ranges from 0 to 100%, provided by the HW queue.

Therefore, for SCE testing the switch was remarking purely as a function of the queue memory utilization:

remark_probability = current_mem_utilization

This produces the 'early warning' of "some congestion" mentioned in the SCE draft, which would never occur with a traditional switch setting for CE remarking.
Yes -- all testing used one of the eight egress queues of the switch ports. It would be easy to do a multi-queue test.
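
As a rough illustration (not the actual switch firmware; the function name
and the random draw are my own assumptions), the remarking decision
described above could be sketched in C as:

    #include <stdbool.h>
    #include <stdlib.h>

    /* ECN field values per RFC 3168 */
    #define ECN_NOT_ECT 0x0
    #define ECN_ECT1    0x1   /* 01 */
    #define ECN_ECT0    0x2   /* 10 */
    #define ECN_CE      0x3   /* 11 */

    /* Decide whether an ECN-capable packet gets remarked to CE.
     * probability_rise and mem_start_point are the knobs from the paper;
     * current_mem_utilization is the live queue memory usage in percent.
     * For the SCE tests: probability_rise = 1, mem_start_point = 0,
     * so remark_probability == current_mem_utilization. */
    static bool should_remark_ce(unsigned ecn_field,
                                 double probability_rise,
                                 double mem_start_point,
                                 double current_mem_utilization)
    {
        if (ecn_field != ECN_ECT0 && ecn_field != ECN_ECT1)
            return false;                     /* only ECN-capable packets */

        double p = probability_rise *
                   (current_mem_utilization - mem_start_point);
        if (p < 0.0)   p = 0.0;
        if (p > 100.0) p = 100.0;

        /* uniform draw in [0, 100) compared against the remark probability */
        double draw = 100.0 * ((double)rand() / ((double)RAND_MAX + 1.0));
        return draw < p;
    }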

>>Could you provide some information on the buffer size?  It looks like it may be ~200us

Sorry, I cannot share any details on the HW and its memory at this point.

>>How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a 
>>hard marking threshold or do they similarly use a ramp)?

In general, some hardware implementations start remarking at a given queue depth (hard start) and then apply a more or less aggressive probability curve. Other hardware (like the one in this experiment) gets continuous feedback on the queue depth and applies the probability curve as a function of that memory usage.
Looking at the equation above, it would also be interesting to experiment with probability_rise, which was always left at 1.
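
For illustration only, the two styles could be written side by side (the
hard-start threshold and slope here are invented, not values from any
particular ASIC):

    /* Hard start: no remarking below a fixed queue depth, then an
     * increasingly aggressive probability curve above it. */
    static double mark_prob_hard_start(double queue_depth_pct,
                                       double start_pct, double slope)
    {
        if (queue_depth_pct <= start_pct)
            return 0.0;
        double p = slope * (queue_depth_pct - start_pct);
        return p > 100.0 ? 100.0 : p;
    }

    /* Continuous ramp, as in this experiment: the probability tracks the
     * instantaneous queue memory utilization, scaled by probability_rise. */
    static double mark_prob_ramp(double current_mem_utilization,
                                 double probability_rise)
    {
        double p = probability_rise * current_mem_utilization;
        return p > 100.0 ? 100.0 : p;
    }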

>>> Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the 
>>> ping time when the link is saturated?

>>I do have one for this: when the link is idle, your receiver is not going to see the packet until a device 
>>interrupt occurs, which may be a good deal of time, as these are not simple single-packet/single-interrupt 
>>type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the 
>>driver occurs or something else causes the driver to interrupt.  At high packet rates you're delivering a 
>>near continuous stream of interrupts to the CPU, and thus your pings get processed by the receiver almost 
>>instantly.

I agree with Rodney. I am not sure whether the kernel is using NAPI, doing interrupt-driven packet processing and then switching to polling -- if it is purely polled, the explanation makes perfect sense.
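
To make that concrete, here is a toy latency model (the coalescing-timer
behaviour and the numbers it implies are my own simplifying assumptions,
not any real driver API): with an interrupt moderation timer of T
microseconds and a background packet rate of R packets per second, a lone
ping waits roughly min(T, 1e6/R)/2 microseconds extra before the host even
sees it.

    /* Toy model of the extra receive latency a lone ping sees due to
     * interrupt moderation.  coalesce_usec is an assumed NIC coalescing
     * timer; pkt_rate_pps is the background packet rate on the link. */
    static double extra_ping_latency_usec(double coalesce_usec,
                                          double pkt_rate_pps)
    {
        /* An interrupt fires when the timer expires, or sooner if
         * background traffic keeps the interrupt stream going. */
        double gap_usec = (pkt_rate_pps > 0.0) ? 1e6 / pkt_rate_pps
                                               : coalesce_usec;
        double wait = (gap_usec < coalesce_usec) ? gap_usec : coalesce_usec;
        return wait / 2.0;    /* the ping lands, on average, mid-interval */
    }

At 25G line rate the inter-packet gap is a small fraction of a microsecond,
so the extra wait vanishes, while on an idle link it is bounded only by the
coalescing timer -- consistent with the idle ping time being the higher one.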

Best Regards!
Giuseppe


-----Original Message-----
From: Rodney W. Grimes [mailto:freebsd@gndrsh.dnsmgr.net] 
Sent: Thursday, August 8, 2019 4:33 PM
To: Greg White <g.white@CableLabs.com>
Cc: Scaglione, Giuseppe <giuseppe.scaglione@hpe.com>; tsvwg@ietf.org
Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

> Hi Giuseppe,
> 
> Thanks for sharing your work.   At the TSVWG meeting at IETF105 it was mentioned that there was a hardware implementation of SCE running in a 25Gbps switch (1:47:30 in https://www.ietf.org/audio/ietf105/ietf105-placeducanada-20190726-1220.mp3 ).  I presume this is the implementation that was being referred to.

As the person who made that comment, I'll confirm that yes, this is the implementation I was referring to.

> If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE based on instantaneous queuing delay) in a single queue, rather than SCE-style.   The CE marks are then transformed into SCE marking by the receiver itself before being handed to dctcp-sce or reno-sce.  So, in essence this test appears to be a study of various AQM ramp options and modified ECN congestion responses in a homogeneous DCTCP datacenter environment, rather than being an SCE experiment specifically.

I'll defer to Giuseppe for details on that, but I do not believe this is simple DCTCP-style CE threshold marking when they are running in SCE test mode.  See Figure 1 and the associated text.

As a person somewhat familiar with SCE, I can state that DCTCP CE marking as used in a data center, and the end-node response to that CE mark, is in fact basically the same as SCE; i.e., there is no latching of CE state until CWR at the receiver, and the marks are in proportion to the queue depth.  BUT by using a different ECT() marking and a different TCP feedback bit you can do both CE and SCE at the same time, which the classical data center DCTCP modifications do not allow for.
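
As a rough sketch of that last point (this is my reading of the SCE draft,
not the switch under test, which only does RFC 3168 CE remarking; the
probabilities are placeholders), an SCE-aware queue could apply both
signals at once:

    /* IP ECN codepoints; under draft-morton-taht-tsvwg-sce-00, ECT(1) is
     * reused as the "Some Congestion Experienced" mark. */
    enum ecn_codepoint {
        NOT_ECT = 0x0,
        ECT1    = 0x1,   /* SCE: soft, high-frequency congestion signal */
        ECT0    = 0x2,
        CE      = 0x3    /* classic RFC 3168 hard congestion signal */
    };

    /* Choose the output codepoint given a soft-mark probability p_sce,
     * a hard-mark probability p_ce, and a uniform draw in [0, 1). */
    static enum ecn_codepoint sce_aware_mark(enum ecn_codepoint in,
                                             double p_sce, double p_ce,
                                             double draw)
    {
        if (in == NOT_ECT || in == CE)
            return in;                /* nothing useful to add */
        if (draw < p_ce)
            return CE;                /* serious congestion: classic signal */
        if (in == ECT0 && draw < p_ce + p_sce)
            return ECT1;              /* early warning: SCE mark */
        return in;
    }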

> 
> Could you provide some information on the buffer size?  It looks like it may be ~200us.   How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a hard marking threshold or do they similarly use a ramp)?  Have you thought about adjusting the ramp so that 100% marking occurs before you reach the tail drop limit?  I wonder if that would eliminate the packet drops.
> 
> For comparison, you may as well use stock DCTCP which would avoid the receiver needing to translate the CE marks into SCE marks, since DCTCP can just interpret the CE marks directly.   But, I didn't think that the goal of SCE was to replace DCTCP in datacenters anyway?

The goal of SCE is to provide a robust congestion mechanism without being restricted to use in a datacenter.

> 
> Are the tables missing some data?  I don't see the bandwidth/retries/packet drop data for the continuous TCP connections.
> 
> Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the ping time when the link is saturated?

I do have one for this: when the link is idle, your receiver is not going to see the packet until a device interrupt occurs, which may be a good deal of time, as these are not simple single-packet/single-interrupt type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the driver occurs or something else causes the driver to interrupt.  At high packet rates you're delivering a near continuous stream of interrupts to the CPU, and thus your pings get processed by the receiver almost instantly.

> 
> Best Regards,
> Greg
> 
> 
> 
> From: tsvwg <tsvwg-bounces@ietf.org> on behalf of "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>
> Date: Tuesday, August 6, 2019 at 3:36 PM
> To: "tsvwg@ietf.org" <tsvwg@ietf.org>
> Subject: [tsvwg] Switch testing at 25G with ECN --SCE Draft
> 
> Greetings,
> 
> The attached document contains data collected while testing with an Ethernet switch and the new TCP-dctcp-sce and TCP-reno-sce algorithms (draft-morton-taht-tsvwg-sce-00).
> 
> Best regards,
> 
> 
> Giuseppe Scaglione
> Distinguished Technologist
> R&D Lab Hardware
> Aruba Networks
> Hewlett Packard Enterprise
> 8000 Foothills Blvd.  (MS 5555)
> Roseville, CA   95747
> Tel : 1-916-471-9189
> 
> 

-- 
Rod Grimes                                                 rgrimes@freebsd.org