Re: [tsvwg] Slides to support discussion of draft-ietf-tsvwg-rfc6040shim-update

Jonathan Morton <> Thu, 09 April 2020 12:01 UTC

From: Jonathan Morton <>
Date: Thu, 9 Apr 2020 15:01:38 +0300
Cc: tsvwg IETF list <>
To: Bob Briscoe <>

Taking the last point first:

>> I hope the above will help to inform the main discussion about RFC-3168 semantics, when that is taken up.
> That argument is happening here and now. There's no point writing a draft for a new reassembly RFC, if my arguments are flawed.

What you clearly plan to ask for in that draft is a change to the handling of CE upon reassembly, to a method that is explicitly forbidden by the existing (nearly 20-year-old) specification, which has been widely implemented.  I think you will not obtain WG consensus for that.  You would need to present compelling and incontrovertible evidence that the existing behaviour is inherently harmful, but the argument presented so far is not that strong.

Separately, if the WG decides that ECT(1) should become an output from the network, then a tightening of the language around reassembly of mixed ECN codepoints will be called for.  I'm willing to take on that task if that is more appropriate.

>> Therefore, it is incorrect to convert a timebase marking rate to a uniform marking probability, when packets of significantly different sizes are involved.  A different analysis must therefore be run to establish the effect of a shared timebase AQM on the tunnel path.
> I had looked at the code to check this. When a timer fires, it's the /next/ packet to reach the head that gets marked. Not the one being dequeued when the timer fires.

Since we're reading implementation code, may I direct your attention to the "dropping" flag in the Codel state vector, and the code which sets and clears it?  Marking only occurs when that flag is set, but it is cleared as soon as the queue drains.
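For reference, that control flow can be sketched as follows.  This is a hedged paraphrase of the RFC 8289 pseudocode, not the actual Linux source; the class name and the simplified in-episode behaviour are mine:

```python
# Simplified sketch of Codel's dropping-state machine (paraphrased from
# the RFC 8289 pseudocode; names are illustrative, not the real source).

class CodelState:
    def __init__(self, target_ms=5.0, interval_ms=100.0):
        self.target = target_ms
        self.interval = interval_ms
        self.dropping = False    # the "dropping" flag in question
        self.count = 0           # marks delivered in this episode
        self.first_above_time = None

    def dequeue(self, sojourn_ms, now_ms, queue_empty):
        """Return True if this packet should be CE-marked."""
        if sojourn_ms < self.target or queue_empty:
            # Queue has drained below target: leave the dropping state.
            # No marks are emitted until it re-enters, which is why long
            # quiescent periods appear between marking episodes.
            self.dropping = False
            self.first_above_time = None
            return False
        if self.first_above_time is None:
            # Sojourn just rose above target: arm a timer one interval out.
            self.first_above_time = now_ms + self.interval
            return False
        if not self.dropping and now_ms >= self.first_above_time:
            # Persistently above target for a full interval: start marking.
            self.dropping = True
            self.count = 1
            return True
        if self.dropping:
            # Simplified: real Codel marks at interval/sqrt(count), not
            # on every dequeue, while the flag remains set.
            self.count += 1
            return True
        return False
```

The point the sketch preserves is that marking is gated entirely on the `dropping` flag, and the flag is cleared the moment the queue drains.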

This means that, under traffic with the typical AIMD sawtooth, there are relatively short periods with a non-zero marking rate, interspersed with relatively long periods with no marking.  The same is true of most probability-based AQMs.  This is crucial to understanding why the timing of marks, not the overall probability, is the important factor to preserve.  It is dynamic behaviour that is poorly modelled by a static probability, as I pointed out yesterday.

We can confirm this by observing that your exemplar marking rates are one per 1.7 seconds, or less often.  But at default settings, the lowest marking frequency within a Codel marking episode is 10 per second, so under these conditions Codel spends roughly 16 times as long in the quiescent state as in the marking state; the 1.7-second marking interval comes entirely from the period of the sawtooth, not from any characteristic of the AQM.  The "fat flow" then requires only a single CE mark to drain the queue.  An FQ-AQM is designed to ensure that signal goes to the correct flow immediately; with a single queue and single AQM, this occurs probabilistically.
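The 16:1 figure is simple arithmetic on the numbers above (taking a marking episode that delivers one mark and lasts roughly one default interval):

```python
# Back-of-envelope check of the quiescent-to-marking time ratio.
codel_interval_s = 0.1        # default interval: the slowest in-episode
                              # marking rate is one mark per 100 ms (10/s)
observed_mark_period_s = 1.7  # one mark per 1.7 s, from the example

# If each marking episode delivers one mark and lasts about one interval,
# the rest of the cycle is quiescent:
quiescent_s = observed_mark_period_s - codel_interval_s
ratio = quiescent_s / codel_interval_s
print(round(ratio))  # → 16
```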

> Usually two fragments will get onto the wire back to back, with the runt behind. Then, what happens depends on if it's FQ or shared queue, and of course on the specifics of implementations, but in general:
> * If FQ, the likelihood of marking a runt vs a full-size fragment will be roughly the inverse of their sizes (e.g. 40/1540 for the larger packets and 1500/1540 for the runts). Then RFC3168 reassembly will still roughly double the marking probability for fragmented flows vs non-fragmented.
> * If shared queue, the runts are more likely to be marked than any other packet, because they will generally follow their larger sister fragment. The relative likelihood of marking of the rest of the packets will depend on how much of the packet mix are fragments. If there's a decent amount of non-fragments, the marking probability for the rest of the packets will be at least 'fairly equal' to each other.

I'm glad you're thinking at this level of detail, for once.

It's possible to imagine a Codel implementation which corrects for this effect, by looking ahead into the serialisation time of the packet currently under consideration.  Reference Codel doesn't do this, because it doesn't know the transmission rate of the link.  It's something I could theoretically add to my own qdiscs, because they have built-in rate shapers and can therefore calculate the serialisation time.
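Such a correction might look like the following.  This is a hypothetical sketch, not reference Codel; the function name is mine, and it assumes the link rate is known, which only holds for qdiscs with built-in shapers:

```python
# Hypothetical serialisation-time lookahead (NOT in reference Codel):
# add the time the packet itself will occupy the link to its measured
# sojourn time, so that runts and full-size fragments are judged on the
# delay they actually impose on the link.

def effective_sojourn_ms(sojourn_ms: float, pkt_bytes: int,
                         link_rate_bps: float) -> float:
    serialisation_ms = pkt_bytes * 8 * 1000.0 / link_rate_bps
    return sojourn_ms + serialisation_ms

# At 10 Mbit/s, a 1500-byte fragment adds 1.2 ms of serialisation time,
# while a 40-byte runt adds only 0.032 ms:
print(effective_sojourn_ms(5.0, 1500, 10e6))  # → 6.2
print(effective_sojourn_ms(5.0, 40, 10e6))    # → 5.032
```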

A probabilistic AQM could multiply its marking probability by the size of the packet to achieve a similar correction.  Does PIE or PI2 do this?  I'm pretty sure it isn't a standard RED feature.
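The size-proportional scaling would be as simple as this sketch; whether PIE or PI2 actually implement it is exactly the open question above, so this is illustrative only:

```python
# Size-proportional marking probability for a probabilistic AQM: scale
# the base probability by packet size relative to a reference MTU, so a
# 40-byte runt is ~37x less likely to be marked than a 1500-byte packet.
# (Illustrative sketch; function and parameter names are mine.)

def mark_probability(base_prob: float, pkt_bytes: int,
                     ref_mtu: int = 1500) -> float:
    return min(1.0, base_prob * pkt_bytes / ref_mtu)

print(mark_probability(0.01, 1500))  # → 0.01: full-size packet unchanged
print(mark_probability(0.01, 40))    # 40-byte runt: 37.5x less likely
```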

These fixes presuppose that correcting this effect is worthwhile.  It might be.  But consider the following point, again derived from your own example:

To maintain equal throughput, the fragmented 1500-byte flow and the unfragmented 1480-byte flow require the same rate of marking, one per 1.7 seconds.

The unfragmented 750-byte flow requires marks half as often, with a quarter of the per-packet marking probability (since it sends twice as many packets); the period of its sawtooth is twice as long because each CE mark removes twice as many segments from the cwnd, which are replenished at the same rate of one per RTT.  (This appears to show an error in your calculations, as you gave a marking frequency of a *quarter* of the others for this flow, as well as a marking probability of a quarter.)

If we assume (as a worst case for the sake of argument) that CE marks are doubled by the reassembly process, then this would correspond to a sqrt(2) reduction in average cwnd and throughput for the affected flow.

But if we were to apply the same marking frequency to that 750-byte flow as to the others, it would run sqrt(2) slower; if we were to apply the same marking *probability* as to the 1480-byte flow, it would run at half speed.  That is what you would get if these flows shared an idealised Codel or probabilistic AQM respectively.

So the notional doubling of CE marks has a smaller effect than that of halving the segment size of a TCP flow - probably the reverse of what you expected.
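The comparison can be checked numerically with the standard steady-state TCP relation, throughput proportional to MSS/(RTT*sqrt(p)) (RTT held constant here, so it drops out):

```python
# Worked check of the comparison above, using the standard steady-state
# relation: throughput ∝ MSS / (RTT * sqrt(p)).  RTT is constant across
# the flows, so it is omitted.
import math

def rel_throughput(mss_bytes: float, mark_prob: float) -> float:
    return mss_bytes / math.sqrt(mark_prob)

base          = rel_throughput(1500, 1.0)  # unfragmented 1500-byte flow
doubled_marks = rel_throughput(1500, 2.0)  # reassembly doubles p
half_mss      = rel_throughput(750, 1.0)   # halve the segment size

print(base / doubled_marks)  # → 1.414...: doubling p costs a factor sqrt(2)
print(base / half_mss)       # → 2.0: halving MSS costs a full factor of 2
```

So the MSS halving is the larger of the two effects, as claimed.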

Moreover, if you did try to compensate for the CE "doubling" effect at reassembly time, by preserving the fraction of marked packets, it means that a fragmented flow would need two CE marks to perform one sawtooth cycle, rather than only one.  A single-queue AQM therefore has a doubled chance of accidentally signalling to the wrong flow as well as the correct one, when the correct flow happens to be fragmented.  This is not the correct result.

> However, this is all beside the point. This just argues that the effects I have laid out will not be quite as pronounced as the simple approximations I have used. But the effects will still be there.
> * None of what you've said argues /for/ preserving the time of each marking during reassembly.
> * None of what you've said argues /against/ preserving probability rather than timing.

I think I have addressed those points above, fairly thoroughly.

> With shared buffers:
> * Preserving probability will remove the unfairness effects.
> * Preserving timing will tend to have some degree of unfairness effects.

I argue that the reverse is true.  RFC-3168 had good reasons to preserve individual congestion marks at reassembly, and those seem to still be relevant even in this case.

> With per-flow scheduling:
> * There are (obviously) no unfairness effects anyway.
> * The reason I put up the FQ example wasn't fairness, it was to illustrate that there is nothing special about the particular time between one mark and the next - because flows B&C with identical average packet sizes and packet rates converge on different times between marks.

But the reason for the difference in marking frequency is not due to AQM behaviour, but due to the slower growth rate of the flow with the smaller MSS, and thus the slower sawtooth period.  Also, I identified calculation errors in this example.

If you disagree with my analysis here, the best way forward might be to run a practical test to see what the real behaviour is.

> Do you have an argument /for/ preserving the time when individual marks occur, given it causes unfairness in shared queue AQMs (even if not to the degree my simple calculations have predicted)?

The "given" in this question is refuted, so the question is irrelevant.

>> Another implicit assumption is that AQMs apply a static marking rate (in whichever paradigm they choose) over a timescale of many seconds.  In fact, they typically react to changes in queue depth over timescales of milliseconds.  This has the effect of concentrating marking events on the flows with highest throughput at the moment congestion occurs, at the peak of each flow's sawtooth.  This is a further factor which may invalidate the assumption of uniform marking probability, even for probabilistic AQMs.
> During a dynamic episode of higher congestion, the fragmented packets will still end up with a higher incidence of marking, than if they had been identical but non-fragmented packets.

I think, ultimately, we just have to accept this as one of the downsides of using fragmentation.  Perhaps we could even highlight it as an incentive to tunnel vendors to fix their implementations, so that they do not emit fragmented traffic.  And refer to draft-ietf-intarea-frag-fragile.

 - Jonathan Morton