Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

Greg White <g.white@CableLabs.com> Fri, 09 August 2019 19:16 UTC

From: Greg White <g.white@CableLabs.com>
To: "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>, "rgrimes@freebsd.org" <rgrimes@freebsd.org>
CC: "tsvwg@ietf.org" <tsvwg@ietf.org>
Date: Fri, 09 Aug 2019 19:16:34 +0000
Message-ID: <81354E22-777C-4BDB-96E0-0B1F6C1DCCD2@cablelabs.com>
References: <A8E3F5E9-443D-4F5A-9336-9A0E2E72C278@cablelabs.com> <201908082333.x78NXS0T094756@gndrsh.dnsmgr.net> <AT5PR8401MB07070C672C9F519C05D2D3F599D60@AT5PR8401MB0707.NAMPRD84.PROD.OUTLOOK.COM>
In-Reply-To: <AT5PR8401MB07070C672C9F519C05D2D3F599D60@AT5PR8401MB0707.NAMPRD84.PROD.OUTLOOK.COM>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/Cr5V-Ek2G5MCyid0Q-6kD0iH7qw>
Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

Rodney and Giuseppe, thanks for the clarification.

Just to be super clear, this isn't a hardware implementation of SCE running at 25Gbps.   It is essentially using a 25Gbps DCTCP switch implementation (with an adjustable marking probability slope) to study the behavior of the two experimental SCE congestion controllers in a data center environment.   If this is considered to be a scenario for deployment of SCE congestion controllers, then I would again suggest running the experiment also with the stock DCTCP congestion controller for comparison, since that is what you'd be trying to displace.

-Greg


On 8/8/19, 6:00 PM, "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com> wrote:

    Greetings,
    
    Replying to Greg:
    
    >>If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE
    >>based on instantaneous queuing delay) in a single queue, rather than SCE-style.   
    
    The switch only knows RFC 3168 rules. That is, if the packet is IP and its header is marked ECN-capable (ECT(0) = 10 or ECT(1) = 01), the packet can be remarked. The remarking probability is given by the formula on page 5:
    
    remark_probability = probability_rise * (current_mem_utilization - mem_start_point)    (all values in %)
    
    In the paper, anywhere it says SCE testing, the settings were:
    probability_rise = 1
    mem_start_point = 0
    current_mem_utilization --> dynamic; ranges from 0-100% and is provided by the HW queue.
    
    Therefore, for SCE testing the switch was remarking purely as a function of the queue memory utilization:
    
    remark_probability =  current_mem_utilization  
    
    This produces the 'early warning' of "some congestion" mentioned in the SCE draft, and would never occur with a traditional switch setting for CE remarking.
    Yes -- all testing used 1 of the 8 egress queues of the switch ports. It would be easy to run a multi-queue test.
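    
    In code terms, the per-packet decision reduces to something like this -- an illustrative Python sketch using the names from the formula above, not the actual switch firmware:
    
        import random
    
        def maybe_remark(ecn_field, mem_utilization,
                         probability_rise=1.0, mem_start_point=0.0):
            """Return True if an ECN-capable packet should be remarked to CE.
    
            ecn_field: the 2-bit IP ECN codepoint; only ECT(0) = 0b10 and
            ECT(1) = 0b01 are eligible for remarking under RFC 3168 rules.
            mem_utilization / mem_start_point: queue memory utilization in %.
            """
            if ecn_field not in (0b10, 0b01):  # Not-ECT or already CE: leave alone
                return False
            p = probability_rise * (mem_utilization - mem_start_point)  # percent
            p = min(max(p, 0.0), 100.0)
            return random.random() * 100.0 < p
    
        # With the SCE-test settings (probability_rise=1, mem_start_point=0) a
        # 40%-full queue remarks roughly 40% of the ECT packets passing through.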
    
    >>Could you provide some information on the buffer size?  It looks like it may be ~200us
    
    Sorry, I cannot share any details on the HW and its memory at this point.
    
    >>How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a 
    >>hard marking threshold or do they similarly use a ramp)?
    
    In general, some hardware implementations will start remarking at a given queue depth (hard start) and then apply a more or less aggressive probability curve. Other hardware (like the one in this experiment) gets continuous feedback on the queue depth and applies the probability curve as a function of that memory usage.
    Looking at the equation above, it would also be interesting to play with probability_rise, which was always left at 1.
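    
    To make the two styles concrete, here is a hypothetical side-by-side in Python (the threshold and slope numbers are made up for illustration, not taken from any vendor's hardware):
    
        def hard_start_probability(queue_depth_pct, threshold_pct=60.0, slope=2.5):
            """Hard start: no remarking below the threshold, then a steep ramp."""
            if queue_depth_pct < threshold_pct:
                return 0.0
            return min(slope * (queue_depth_pct - threshold_pct), 100.0)
    
        def continuous_probability(queue_depth_pct, probability_rise=1.0,
                                   mem_start_point=0.0):
            """Continuous feedback, as in this experiment: the marking
            probability tracks the queue depth from the first byte queued."""
            return min(max(probability_rise * (queue_depth_pct - mem_start_point),
                           0.0), 100.0)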
    
    >>> Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the 
    >>ping time when the link is saturated?
    
    >>I do have one for this: when the link is idle, your receiver is not going to see the packet until a device 
    >>interrupt occurs, which may be a good deal of time, as these are not simple single-packet/single-interrupt 
    >>type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the 
    >>driver is reached or something else causes the driver to interrupt.  At high packet rates you're delivering 
    >>a near-continuous stream of interrupts to the CPU and thus your pings get processed by the receiver 
    >>almost instantly.
    
    I agree with Rodney. I am not sure whether the kernel is using NAPI, doing interrupt-driven packet processing and then switching to polling -- if it is purely polled, the explanation makes perfect sense.
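    
    A back-of-the-envelope model of the effect Rodney describes (the coalescing parameters below are made-up examples; real NIC settings vary by driver):
    
        def ping_extra_latency_us(pps, coalesce_timer_us=250.0, coalesce_frames=32):
            """Rough extra receive latency added by NIC interrupt moderation.
    
            Assume the NIC fires an interrupt when either coalesce_frames
            packets have arrived or coalesce_timer_us has elapsed since the
            first one.  On an idle link a lone ping waits out the timer; on a
            saturated link the frame threshold is hit almost immediately.
            """
            if pps <= 0:
                return coalesce_timer_us      # nothing else arriving: full timer
            time_to_fill_us = coalesce_frames / pps * 1e6
            return min(coalesce_timer_us, time_to_fill_us)
    
        # idle link:          ping_extra_latency_us(10)         -> 250 us (timer-bound)
        # saturated 25G link: ping_extra_latency_us(2_000_000)  -> 16 us (frame-bound)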
    
    Best Regards!
    Giuseppe
    
    
    -----Original Message-----
    From: Rodney W. Grimes [mailto:freebsd@gndrsh.dnsmgr.net] 
    Sent: Thursday, August 8, 2019 4:33 PM
    To: Greg White <g.white@CableLabs.com>
    Cc: Scaglione, Giuseppe <giuseppe.scaglione@hpe.com>; tsvwg@ietf.org
    Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft
    
    > Hi Giuseppe,
    > 
    > Thanks for sharing your work.   At the TSVWG meeting at IETF105 it was mentioned that there was a hardware implementation of SCE running in a 25Gbps switch (1:47:30 in https://www.ietf.org/audio/ietf105/ietf105-placeducanada-20190726-1220.mp3 ).  I presume this is the implementation that was being referred to.
    
    As the person who made that comment, I'll confirm that yes, this is the implementation I was referring to.
    
    > If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE based on instantaneous queuing delay) in a single queue, rather than SCE-style.   The CE marks are then transformed into SCE marking by the receiver itself before being handed to dctcp-sce or reno-sce.  So, in essence this test appears to be a study of various AQM ramp options and modified ECN congestion responses in a homogeneous DCTCP datacenter environment, rather than being an SCE experiment specifically.
    
    I'll defer to Giuseppe for the details on that, but I do not believe this is simple DCTCP-style CE threshold marking when they are running in SCE test mode.  See figure 1 and the associated text.
    
    As a person somewhat familiar with SCE, I can state that DCTCP CE marking as used in a data center, and the end-node response to that CE mark, are in fact basically the same as SCE, i.e. there is no latching of CE state until CWR at the receiver, and the marks are in proportion to the queue depth.  BUT by using a different ECT() marking and a different TCP feedback bit you can do both CE and SCE at the same time, which the classical data-center DCTCP modifications do not allow for.
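    
    For reference, the stock DCTCP sender response being compared against works roughly like this (a simplified sketch following RFC 8257, not any particular implementation; the SCE variants apply a similar proportional response to the SCE feedback instead):
    
        def dctcp_update(alpha, cwnd, bytes_marked, bytes_acked, g=1.0 / 16):
            """One observation window of the DCTCP response (RFC 8257, simplified).
    
            alpha is an EWMA of the fraction of CE-marked bytes; cwnd is cut in
            proportion to alpha rather than halved, which is what turns the
            proportional marking above into a proportional rate response.
            """
            frac_marked = bytes_marked / bytes_acked if bytes_acked else 0.0
            alpha = (1 - g) * alpha + g * frac_marked
            if bytes_marked:
                cwnd = max(cwnd * (1 - alpha / 2), 1.0)
            return alpha, cwnd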
    
    > 
    > Could you provide some information on the buffer size?  It looks like it may be ~200us.   How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a hard marking threshold or do they similarly use a ramp)?  Have you thought about adjusting the ramp so that 100% marking occurs before you reach the tail drop limit?  I wonder if that would eliminate the packet drops.
    > 
    > For comparison, you may as well use stock DCTCP, which would avoid the receiver needing to translate the CE marks into SCE marks, since DCTCP can just interpret the CE marks directly.   But, I didn't think that the goal of SCE was to replace DCTCP in datacenters anyway?
    
    The goal of SCE is to provide a robust congestion-control mechanism without being restricted to use in a datacenter.
    
    > 
    > Are the tables missing some data?  I don't see the bandwidth/retries/packet-drop data for the continuous TCP connections.
    > 
    > Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the ping time when the link is saturated?
    
    I do have one for this: when the link is idle, your receiver is not going to see the packet until a device interrupt occurs, which may be a good deal of time, as these are not simple single-packet/single-interrupt type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the driver is reached or something else causes the driver to interrupt.  At high packet rates you're delivering a near-continuous stream of interrupts to the CPU and thus your pings get processed by the receiver almost instantly.
    
    > 
    > Best Regards,
    > Greg
    > 
    > 
    > 
    > From: tsvwg <tsvwg-bounces@ietf.org> on behalf of "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>
    > Date: Tuesday, August 6, 2019 at 3:36 PM
    > To: "tsvwg@ietf.org" <tsvwg@ietf.org>
    > Subject: [tsvwg] Switch testing at 25G with ECN --SCE Draft
    > 
    > Greetings,
    > 
    > The attached document contains data collected while testing with an Ethernet switch and the new TCP-dctcp-sce and TCP-reno-sce algorithms, draft-morton-taht-tsvwg-sce-00.
    > 
    > Best regards,
    > 
    > 
    > Giuseppe Scaglione
    > Distinguished Technologist
    > R&D Lab Hardware
    > Aruba Networks
    > Hewlett Packard Enterprise
    > 8000 Foothills Blvd.  (MS 5555)
    > Roseville, CA   95747
    > Tel : 1-916-471-9189
    > 
    > 
    
    -- 
    Rod Grimes                                                 rgrimes@freebsd.org