Re: [tsvwg] Switch testing at 25G with ECN --SCE Draft

Jonathan Morton <> Sun, 11 August 2019 13:31 UTC

To: Wesley Eddy <>

> The advice in 3168 is to do marking based on average queue length, but *not* just before it needs to start dropping.  Is it more correct instead to understand what you're saying as the experiment being based on using the instantaneous queue length rather than average?

I think that is right.  In software we like to use the actual sojourn time within the queue of the candidate packet, but in hardware it is often convenient to use the queue length instead.
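To make the distinction concrete, here is a minimal sketch of the two tests. The class, function and parameter names are hypothetical and not taken from any particular qdisc or switch; a real AQM would add averaging, hysteresis or a marking probability on top of either test.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedPacket:
    size_bytes: int
    enqueue_time_s: float  # timestamp taken when the packet joined the queue


def over_sojourn_target(pkt: QueuedPacket, now_s: float, target_s: float) -> bool:
    """Software-style test: how long has this candidate packet been queued?"""
    return (now_s - pkt.enqueue_time_s) > target_s


def over_length_threshold(queue: deque, threshold_bytes: int) -> bool:
    """Hardware-style test: how many bytes are currently sitting in the queue?"""
    return sum(p.size_bytes for p in queue) > threshold_bytes
```

The sojourn-time test needs a per-packet timestamp, which is cheap in software; the length test needs only a byte counter, which is why it tends to suit hardware.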

More generally, I notice it's a surprisingly common misconception on this list that RFC 3168 states that marking should happen when, or just before, packets would otherwise be dropped.  A little critical thought will show that this is absurd; AQM is supposed to leverage congestion-control behaviour to keep the queue depth well away from the hard tail-drop zone on average.  What the spec *actually* says is that an AQM should drop a packet instead of marking it when the packet is not ECT and therefore cannot be marked; this is practically the inverse of the misconception noted.
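As a sketch of that rule: once the AQM has decided to signal congestion on a packet, the only question is which form the signal takes. The codepoint values are the two-bit ECN field from RFC 3168; the function name and the tuple it returns are illustrative assumptions.

```python
# Two-bit ECN field values as defined in RFC 3168.
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def apply_congestion_signal(ecn_field: int) -> tuple:
    """Called only after the AQM has already decided to signal congestion.

    Returns (forward, new_ecn_field). An ECT packet is forwarded with CE set;
    a Not-ECT packet cannot be marked, so it is dropped instead.
    """
    if ecn_field == NOT_ECT:
        return (False, ecn_field)  # drop: marking is not available
    return (True, CE)  # mark: ECT(0), ECT(1), or already CE
```

Note that the decision of *when* to signal sits entirely outside this function, which is the point: the drop is a fallback for the mark, not its trigger.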

In particular, I believe marking should definitely occur by the time the queue is half-full, or more precisely so that there is a full BDP (including queuing delay already incurred at the point the mark is applied) of space remaining in the queue.  Why?  To accommodate the RTT of control feedback delay before the signal to exit slow-start takes effect at the queue, during which the cwnd of a typical TCP will double.  Most AQMs do in fact satisfy that principle.
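One way to read that rule numerically, as a rough sketch: if marking starts at queue depth T, the effective RTT is the base RTT plus the time to drain T, so the BDP to leave free is the base BDP plus T. Requiring capacity - T >= base BDP + T gives T <= (capacity - base BDP) / 2, which reduces to "half-full" when the base BDP is small. The function and parameter names are hypothetical, and this is one interpretation of the paragraph above rather than a formula from any spec.

```python
def marking_threshold_bytes(capacity_bytes: int, link_rate_bps: int, base_rtt_us: int) -> int:
    """Highest queue depth at which marking can start while still leaving
    one full BDP (base path plus queuing delay at the threshold) of space.

    Assumes a single slow-starting flow whose cwnd doubles in the one RTT
    before the mark takes effect at the queue.
    """
    # Base-path BDP in bytes: bits/s * microseconds / (8 bits * 1e6 us/s).
    base_bdp_bytes = link_rate_bps * base_rtt_us // 8_000_000
    # capacity - T >= base_bdp + T  =>  T <= (capacity - base_bdp) / 2
    threshold = (capacity_bytes - base_bdp_bytes) // 2
    return max(0, threshold)


# Illustrative numbers only: a 25 Gbit/s port, 100 us base RTT, 1 MB buffer.
print(marking_threshold_bytes(1_000_000, 25_000_000_000, 100))
```

With those illustrative numbers the base BDP is 312500 bytes, so marking would need to begin by a depth of 343750 bytes, a little over a third of the buffer.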

>> RFC3168 ECN marking is already in the switch, the switch was modified to behave in a different manner, using the RFC3168 CE bits.  This different manner was turned OFF to do the dctcp tests and turned ON to do the dctcp-sce tests.
> Since the marking probability function and marking decision isn't part of 3168/ECN, the description here sounds like the switch is always conforming to 3168 behavior, but just using some custom decision logic for marking.
> Just trying to be clear on what we're really talking about ...

I think it's best to consider this an experimental prototype, in which the goal is to investigate whether the obtained behaviour is desirable and useful before continuing with more involved development.  From what I hear, and without going into needless detail, it was easier in the short term to set it up with CE marking and transform that into SCE codepoints at the receiver.  This was always going to be a temporary expedient.

Also from what I hear, experiments had previously been conducted using CE marking and the standard version of DCTCP in Linux, but without anywhere near the level of success seen in the results just posted.  It seems likely that faults in the DCTCP implementation were responsible, but this in turn raises the question of how well maintained that piece of code is, and how such faults were allowed to persist in the mainline codebase.

Linux is normally associated with better development practices than that.  The most reasonable explanation I can come up with is that DCTCP is not actually in widespread use, so when faults appeared they were not noticed by any party interested in having them fixed.  Or, perhaps more disturbingly, DCTCP *is* in widespread use in closed datacentre environments, but when the faults arose in production their effects were not recognised.

So in practice, this controlled experiment has validated certain principles associated with the DCTCP response function (well, a simplified version of it), and also shown that other response functions are similarly effective, when applied to an ultra-low-latency network.  It has also shown that the proposed feedback mechanism for SCE signals on TCP connections is viable, so the relatively complex AccECN mechanism isn't obviously needed.  We have also presented other results showing similarly good behaviour on Internet-scale latencies with real SCE marking, using a broadly similar marking function.  Overall these are encouraging results for us, and work will continue accordingly.

 - Jonathan Morton