Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

"Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com> Fri, 09 August 2019 20:44 UTC

From: "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>
To: Greg White <g.white@CableLabs.com>, "rgrimes@freebsd.org" <rgrimes@freebsd.org>
CC: "tsvwg@ietf.org" <tsvwg@ietf.org>
Date: Fri, 09 Aug 2019 20:44:48 +0000
Message-ID: <AT5PR8401MB070700703FBF2318808238CF99D60@AT5PR8401MB0707.NAMPRD84.PROD.OUTLOOK.COM>
References: <A8E3F5E9-443D-4F5A-9336-9A0E2E72C278@cablelabs.com> <201908082333.x78NXS0T094756@gndrsh.dnsmgr.net> <AT5PR8401MB07070C672C9F519C05D2D3F599D60@AT5PR8401MB0707.NAMPRD84.PROD.OUTLOOK.COM> <81354E22-777C-4BDB-96E0-0B1F6C1DCCD2@cablelabs.com>
In-Reply-To: <81354E22-777C-4BDB-96E0-0B1F6C1DCCD2@cablelabs.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/gSgU9Ca3Sb8agNiws9GD7g9yY50>
Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft
List-Id: Transport Area Working Group <tsvwg.ietf.org>
List-Archive: <https://mailarchive.ietf.org/arch/browse/tsvwg/>

Greg,

>> Just to be super clear, this isn't a hardware implementation of SCE running at 25Gbps.

I am not sure I follow. The Test Setup section of the paper clearly describes the hardware used -- servers, switch, cables.
What I cannot disclose at this point is the exact model and characteristics of the HPE Aruba switch used. Yet, it is a "real" Ethernet switch, providing 25Gbps connectivity, configured to do bridging across the four ports, and implementing RFC 3168 with the ECN remarking configuration described in my previous email and in the paper.

Cheers,
Giuseppe 

-----Original Message-----
From: Greg White [mailto:g.white@CableLabs.com] 
Sent: Friday, August 9, 2019 12:17 PM
To: Scaglione, Giuseppe <giuseppe.scaglione@hpe.com>; rgrimes@freebsd.org
Cc: tsvwg@ietf.org
Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

Rodney and Giuseppe, thanks for the clarification.

Just to be super clear, this isn't a hardware implementation of SCE running at 25Gbps.   It is essentially using a 25Gbps DCTCP switch implementation (with an adjustable marking probability slope) to study the behavior of the two experimental SCE congestion controllers in a data center environment.   If this is considered to be a scenario for deployment of SCE congestion controllers, then I would again suggest running the experiment also with the stock DCTCP congestion controller for comparison, since that is what you'd be trying to displace.

-Greg


On 8/8/19, 6:00 PM, "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com> wrote:

    Greetings,
    
    Replying to Greg:
    
    >>If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE
    >>based on instantaneous queuing delay) in a single queue, rather than SCE-style.   
    
    The switch only knows RFC 3168 rules. That is, if the packet is IP and the header is marked ECN-capable (10 or 01), the packet can be remarked. The remarking probability is given by the formula on page 5:
    
    remark_probability = probability_rise * (current_mem_utilization - mem_start_point)    (all values in %)
    
    In the paper, anywhere it says SCE testing, the settings were:
    probability_rise = 1
    mem_start_point = 0
    current_mem_utilization --> dynamic, going from 0-100%, provided by the HW queue.
    
    Therefore, for SCE testing the switch was remarking purely as a function of the queue memory utilization:
    
    remark_probability = current_mem_utilization
    
    This produces the 'early warning' of "some congestion" mentioned in the SCE draft, which would never occur with a traditional switch setting for CE remarking.
    Yes -- all testing used 1 of the 8 egress queues of the switch ports. It would be easy to do a multi-queue test.
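    The marking decision described above can be sketched in a few lines of Python (my own illustrative sketch, not the switch firmware; the names follow the formula in the paper):

```python
import random

def remark_probability(mem_util, probability_rise=1.0, mem_start_point=0.0):
    """Remarking probability from the paper's formula (all values in %).

    With the SCE-test settings (probability_rise=1, mem_start_point=0)
    this reduces to remark_probability = current_mem_utilization.
    """
    p = probability_rise * (mem_util - mem_start_point)
    return min(max(p, 0.0), 100.0)  # clamp to a valid percentage

def maybe_remark_ce(ecn_bits, mem_util):
    """RFC 3168-style decision: only ECT(0) (10) or ECT(1) (01) packets
    are eligible to be remarked to CE (11)."""
    if ecn_bits not in (0b10, 0b01):  # Not-ECT or already CE: pass through
        return ecn_bits
    if random.random() * 100 < remark_probability(mem_util):
        return 0b11                   # remark to CE
    return ecn_bits
```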
    
    >>Could you provide some information on the buffer size?  It looks like it may be ~200us
    
    Sorry, I cannot share any details on the HW and its memory at this point.
    
    >>How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a 
    >>hard marking threshold or do they similarly use a ramp)?
    
    In general, some hardware implementations will start remarking at a given queue depth (hard start) and then apply a more or less aggressive probability curve. Other hardware (like the one in the experiment) gets continuous feedback on the queue depth and then applies the probability curve as a function of that memory usage.
    Looking at the equation above, it would also be interesting to experiment with probability_rise, which was always left at 1.
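    To illustrate (with a hypothetical steeper setting, not one taken from the paper), varying probability_rise and mem_start_point turns the gentle full-range ramp used in the SCE tests into one that stays silent at low occupancy and ramps faster near a full queue:

```python
def remark_probability(mem_util, probability_rise, mem_start_point):
    # The paper's formula, clamped to a valid percentage (all values in %)
    p = probability_rise * (mem_util - mem_start_point)
    return min(max(p, 0.0), 100.0)

# Ramp used in the SCE tests: marking grows linearly over the whole queue.
gentle = [remark_probability(u, 1, 0) for u in (0, 25, 50, 75, 100)]
# -> [0, 25, 50, 75, 100]

# Hypothetical steeper setting: no marking until 50% occupancy, then
# ramping twice as fast, reaching 100% marking at a full queue.
steep = [remark_probability(u, 2, 50) for u in (0, 25, 50, 75, 100)]
# -> [0, 0, 0, 50, 100]
```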
    
    >>Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the
    >>ping time when the link is saturated?
    
    >>I do have one for this, when the link is idle your receiver is not going to see the packet until a device
    >>interrupt occurs, which may be a good deal of time as these are not simple single-packet/single-interrupt
    >>type NICs.  Basically when idle the ping packets have to wait until either the poll threshold of the
    >>driver occurs, or something else causes the driver to interrupt.  At high packet rates you're delivering a near-
    >>continuous stream of interrupts to the CPU and thus your pings get processed by the receiver almost
    >>instantly.
    
    I agree with Rodney. I am not sure whether the kernel is using NAPI, doing interrupt-driven packet processing and then switching to polling -- if it is purely polled, the explanation makes perfect sense.
    
    Best Regards!
    Giuseppe
    
    
    -----Original Message-----
    From: Rodney W. Grimes [mailto:freebsd@gndrsh.dnsmgr.net] 
    Sent: Thursday, August 8, 2019 4:33 PM
    To: Greg White <g.white@CableLabs.com>
    Cc: Scaglione, Giuseppe <giuseppe.scaglione@hpe.com>; tsvwg@ietf.org
    Subject: Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft
    
    > Hi Giuseppe,
    > 
    > Thanks for sharing your work.   At the TSVWG meeting at IETF105 it was mentioned that there was a hardware implementation of SCE running in a 25Gbps switch (1:47:30 in https://www.ietf.org/audio/ietf105/ietf105-placeducanada-20190726-1220.mp3  ).  I presume this is the implementation that was being referred to.
    
    As the person that made that comment, I'll confirm that yes, this is the implementation I was referring to.
    
    > If I'm reading this correctly, the marking implementation in the switch is actually DCTCP-style (marking CE based on instantaneous queuing delay) in a single queue, rather than SCE-style.   The CE marks are then transformed into SCE marking by the receiver itself before being handed to dctcp-sce or reno-sce.  So, in essence this test appears to be a study of various AQM ramp options and modified ECN congestion responses in a homogeneous DCTCP datacenter environment, rather than being an SCE experiment specifically.
    
    I'll defer to Giuseppe for details on that, but I do not believe this is simple DCTCP-style CE threshold marking when they are running in SCE test mode.  See figure 1 and the associated text.
    
    As a person somewhat familiar with SCE, I can state that DCTCP CE marking as used in a data center, and the end-node response to that CE mark, is in fact basically the same as SCE, i.e. there is no latching of CE state until CWR at the receiver, and the marks are in proportion to the queue depth.  BUT by using a different ECT() marking and a different TCP feedback bit you can do both CE and SCE at the same time, which the classical data-center DCTCP modifications do not allow for.
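    A toy sketch of that distinction, as I read the SCE draft (which, on my reading, repurposes ECT(1) as the SCE codepoint -- an illustration, not an authoritative encoding):

```python
# IP ECN field codepoints (RFC 3168); draft-morton-taht-tsvwg-sce
# repurposes ECT(1) as the SCE ("some congestion") mark, as I read it.
NOT_ECT, ECT1_SCE, ECT0, CE = 0b00, 0b01, 0b10, 0b11

def receiver_view(ecn_bits):
    """Illustrative receiver interpretation: CE and SCE are distinct
    codepoints, so one flow can carry both signals at once -- unlike
    classic DCTCP, where CE alone carries the whole signal."""
    return {
        CE:       "hard congestion: classic RFC 3168 / CWR response",
        ECT1_SCE: "some congestion: gentle, proportional response",
        ECT0:     "no congestion signal",
        NOT_ECT:  "not ECN-capable",
    }[ecn_bits]
```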
    
    > 
    > Could you provide some information on the buffer size?  It looks like it may be ~200us.   How does this marking approach compare to existing DCTCP deployments (i.e. do they generally use a hard marking threshold or do they similarly use a ramp)?  Have you thought about adjusting the ramp so that 100% marking occurs before you reach the tail drop limit?  I wonder if that would eliminate the packet drops.
    > 
    > For comparison, you may as well use stock DCTCP which would avoid the receiver needing to translate the CE marks into SCE marks, since DCTCP can just interpret the CE marks directly.   But, I didn't think that the goal of SCE was to replace DCTCP in datacenters anyway?
    
    The goal of SCE is to provide a robust congestion mechanism, without having to be restricted to use in a datacenter.
    
    > 
    > Are the tables missing some data?  I don't see the bandwidth/retries/packet drop data for the continuous TCP connections.
    > 
    > Also, do you have an explanation for Pic 1, where the ping time when the link is idle is almost double the ping time when the link is saturated?
    
    I do have one for this: when the link is idle your receiver is not going to see the packet until a device interrupt occurs, which may be a good deal of time as these are not simple single-packet/single-interrupt type NICs.  Basically, when idle, the ping packets have to wait until either the poll threshold of the driver occurs, or something else causes the driver to interrupt.  At high packet rates you're delivering a near-continuous stream of interrupts to the CPU and thus your pings get processed by the receiver almost instantly.
    
    > 
    > Best Regards,
    > Greg
    > 
    > 
    > 
    > From: tsvwg <tsvwg-bounces@ietf.org> on behalf of "Scaglione, Giuseppe" <giuseppe.scaglione@hpe.com>
    > Date: Tuesday, August 6, 2019 at 3:36 PM
    > To: "tsvwg@ietf.org" <tsvwg@ietf.org>
    > Subject: [tsvwg] Switch testing at 25G with ECN --SCE Draft
    > 
    > Greetings,
    > 
    > The attached document contains data collected testing with an Ethernet switch and the new TCP-dctcp-sce and TCP-reno-sce algorithms, draft-morton-taht-tsvwg-sce-00.
    > 
    > Best regards,
    > 
    > 
    > Giuseppe Scaglione
    > Distinguished Technologist
    > R&D Lab Hardware
    > Aruba Networks
    > Hewlett Packard Enterprise
    > 8000 Foothills Blvd.  (MS 5555)
    > Roseville, CA   95747
    > Tel : 1-916-471-9189
    > 
    > 
    
    -- 
    Rod Grimes                                                 rgrimes@freebsd.org