Re: [tsvwg] [Ecn-sane] per-flow scheduling

Sebastian Moeller <moeller0@gmx.de> Thu, 27 June 2019 07:50 UTC

To: "David P. Reed" <dpreed@deepplum.com>
Cc: Jonathan Morton <chromatix99@gmail.com>, "ecn-sane@lists.bufferbloat.net" <ecn-sane@lists.bufferbloat.net>, Brian E Carpenter <brian.e.carpenter@gmail.com>, tsvwg IETF list <tsvwg@ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/IufF9wBNiGGn8vkdzDp_o5VfLTU>

Hi David,

thanks for your response.

> On Jun 26, 2019, at 18:31, David P. Reed <dpreed@deepplum.com> wrote:
> 
> It's the limiting case, but also the optimal state given "perfect knowledge".
>  
> Yes, it requires that the source-destination pairs sharing the link in question coordinate their packet admission times so they don't "collide" at the link. Ideally the next packet would arrive during the previous packet's transmission, so it is ready-to-go when that packet's transmission ends.
>  
> Such exquisite coordination is feasible when future behavior by source and destination at the interface is known, which requires an Oracle.
> That's the same kind of condition most information theoretic and queueing theoretic optimality requires.

	Ah, great, I had feared I had missed something.

>  
> But this is worth keeping in mind as the overall joint goal of all users.
>  
> In particular,  "link utilization" isn't a user goal at all. The link is there and is being paid for whether it is used or not (looking from the network structure as a whole). Its capacity exists to move packets out of the way. An ideal link satisfies the requirement that it never creates a queue because of anything other than imperfect coordination of the end-to-end flows mapped onto it. That's why the router should not be measured by "link utilization" anymore than a tunnel in a city during commuting hours should be measured by cars moved per hour. Clearly a tunnel can be VERY congested and moving many cars if they are attached to each other bumper to bumper - the latency through the tunnel would then be huge. If the cars were tipped on their ends and stacked, even more throughput would be achieved through the tunnel, and the delay of rotating them and packing them would add even more delay.

	+1; this is the core of the movement under the "bufferbloat" moniker: putting latency back into the spotlight where it belongs (at least for common interactive network usage; bulk transfer is a different kettle of fish). Given the relatively low rates of common internet access links, running at capacity, while not a primary goal, still happens often enough to require special treatment to keep the latency increase under load in check. Both FQ solutions and L4S offer remedies for that case. (Being a non-expert home user myself, this case is also prominent on my radar; with my ISP's backbone and peerings/transits being well managed, the access link is the one point where queueing happens, just as you describe.)

>  
> The idea that "link utilization" of 100% must be achieved is why we got bufferbloat designed into routers.

	While I do not subscribe to this view (and actually trade in some "top speed" to keep latency sane), a considerable fraction of home users seem obsessed with maxing out their access links and comparing achievable rates; whether such behaviour should be encouraged is a different question.

> It's a worm's eye perspective. To this day, Arista Networks brags about how its bufferbloated feature design optimizes switch utilization (https://packetpushers.net/aristas-big-buffer-b-s/). And it selects benchmarks to "prove" it. Andy Bechtolsheim apparently is such a big name that he can sell defective gear at a premium price, letting the datacenters who buy it discover that those switches get "clogged up" by TCP traffic when they are the "bottleneck link". Fortunately, they are fast, so they are less frequently the bottleneck in datacenter daily use.
>  
> In trying to understand what is going on with congestion signalling, any buffering at the entry to the link should be due only to imperfect information being fed back to the endpoints generating traffic. Because a misbehaving endpoint generates Denial of Service for all other users.

	This is a good point, and one of the reasons why I conceptually like flow queueing, as it provides the tools to isolate bad actors; "trust, but verify" comes to mind as a principle. I would also add that the _only_ currently known L4S roll-out target (low latency DOCSIS) actually mandates a mechanism called "queue protection", which to me looks very much like an FQ system that carefully avoids calling itself FQ: it monitors the length/behaviour of individual flows and, if a flow exceeds some threshold, pushes it into the RFC 3168 queue, which to this layman means it needs to track the packets of each flow in the common queue separately in order to re-direct them.
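
	A rough Python sketch of how I picture such a per-flow "queue protection" check, purely to make my reading concrete; the byte-budget scoring, the constants and the names are my own invention, not the actual low latency DOCSIS algorithm:

    # Conceptual sketch only: per-flow accounting used to demote "heavy" flows
    # from the low-latency queue into the classic (RFC 3168) queue. The simple
    # bytes-per-interval score and the budget value are my own placeholders,
    # not the specified DOCSIS queue-protection algorithm.
    from collections import defaultdict

    class QueueProtection:
        def __init__(self, byte_budget=30000):
            self.byte_budget = byte_budget      # per-flow budget per interval (assumed)
            self.flow_bytes = defaultdict(int)  # 5-tuple -> bytes seen this interval

        def classify(self, flow_id, pkt_len):
            """Pick the queue for the next packet of flow_id."""
            self.flow_bytes[flow_id] += pkt_len
            if self.flow_bytes[flow_id] > self.byte_budget:
                return "classic"        # over budget: re-direct to the RFC 3168 queue
            return "low-latency"

        def end_interval(self):
            """Forget per-flow state at the end of each measurement interval."""
            self.flow_bytes.clear()

    qp = QueueProtection()
    qp.classify(("10.0.0.1", "10.0.0.2", 6, 443, 50000), 1500)  # -> 'low-latency'

	The point being: even this crude version has to keep per-flow state for packets in the shared queue, which is what makes it look FQ-ish to me.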

>  
> Priority mechanisms focused on protecting high-paying users from low-paying ones don't help much - they only help at overloaded states of the network.

	In principle I agree; in practice things get complicated: mixing latency-indifferent, capacity-devouring applications like BitTorrent with, say, VoIP packets (fixed rate, but latency sensitive) over too narrow a link makes it clear that giving the VoIP packets precedence/priority over the bulk-transfer packets is a sane policy (this becomes an issue due to the difficulty of running a narrow link below capacity). I am sure you are aware of all of this; I just need to spell it out for my own thinking process.


> Which isn't to say that priority does nothing - it's just that stable assignment of a sharing level to priority levels isn't easy.  (See Paris Metro Pricing, where there are only two classes, and the problem of deciding how to manage the access to the "first class" section - the idea that 15 classes with different metrics can be handled simply and interoperably between differently managed autonomous systems seems to be an incredibly impractical goal).

	+1; any prioritization scheme should be extremely simple, so that an end user can easily predict its behavior. Also, IMHO three classes of latency behaviour will go a long way: "normal", "don't care", and "important" should be enough. (L4S IMHO only offers "important" and "normal", so it gives no easy way to down-grade, say, bulk background transfers like BitTorrent; that is going to be an issue, with BitTorrent triggering on ~100 ms of induced latency increase while L4S's RFC 3168 queue uses a PIE offspring to keep induced latency << 100 ms, but I digress.)
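
	To spell out what I mean by "simple enough to predict", a toy sketch of such a three-class scheme; the class names and the strict service order are just my illustration, not any standardized mechanism:

    # Toy three-class scheduler: "important" > "normal" > "dont-care".
    # Class names and the strict-priority order are illustrative only.
    from collections import deque

    QUEUES = {"important": deque(), "normal": deque(), "dont-care": deque()}

    def enqueue(pkt, traffic_class="normal"):
        QUEUES.get(traffic_class, QUEUES["normal"]).append(pkt)

    def dequeue():
        for cls in ("important", "normal", "dont-care"):  # fixed, user-predictable order
            if QUEUES[cls]:
                return QUEUES[cls].popleft()
        return None

    enqueue("voip-frame", "important")
    enqueue("torrent-chunk", "dont-care")
    dequeue()  # -> 'voip-frame'; the bulk chunk waits

	The user-facing promise is then easy to state: whatever you mark "dont-care" only ever uses capacity the other two classes leave over.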


> Even in the priority case, buffering is NOT a desirable end user thing.

	+1; IMHO this is again a reason for FQ: misbehaving flows will not spoil the fun for everybody else.
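
	For completeness, the isolation I keep invoking is essentially what a deficit-round-robin style flow-queueing scheduler gives you; a minimal sketch (the quantum value is arbitrary):

    # Minimal deficit-round-robin sketch: every flow gets its own queue, so a
    # flow that over-sends only grows its own backlog and its own delay.
    from collections import defaultdict, deque

    QUANTUM = 1514                  # bytes of credit per flow per round (arbitrary)
    queues = defaultdict(deque)     # flow_id -> deque of (pkt, length)
    deficits = defaultdict(int)

    def enqueue(flow_id, pkt, length):
        queues[flow_id].append((pkt, length))

    def dequeue_round():
        """One DRR round: each backlogged flow may send up to its deficit."""
        sent = []
        for flow_id in list(queues):
            deficits[flow_id] += QUANTUM
            while queues[flow_id] and queues[flow_id][0][1] <= deficits[flow_id]:
                pkt, length = queues[flow_id].popleft()
                deficits[flow_id] -= length
                sent.append(pkt)
            if not queues[flow_id]:
                del queues[flow_id]
                del deficits[flow_id]
        return sent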

>  
> My personal view is that the manager of a network needs to configure the network so that no link ever gets overloaded, if possible. The response to overload should be to tell the relevant flows to all slow down (not just one, because if there are 100 flows that start up roughly at the same time, causing MD on one does very little).
> This is an example of something where per-flow stuff in the router actually makes the router helpful in the large scheme of things. Maybe all flows should be equally informed, as flows. Which means the router needs to know how to signal multiple flows, while not just hammering all the packets of a single flow.  This case is very real, but not as frequently on the client side as on the "server side" in "load balancers" and such like.
>  
> My point here is simple:
>  
> 1) the endpoints tell the routers what flows are going through a link already. That's just the address information. So that information can be used for fairness pretty well, especially if short term memory (a bloom filter, perhaps) can track a sufficiently large number of flows.
>  
> 2) The per-flow decisions related to congestion control within a flow are necessarily end-to-end in nature - the router can only tell the ends what is going on, but the ends (together - their admissions rates and consumption rates are coupled to the use being made) must be informed and decide. The congestion management must combine information about the source and the destination future behavior (even if it is just taking recent history and projecting it as an estimate of future behavior at source and destination). Which is why it is quite natural to have routers signal the destination, which then signals the source, which changes its behavior.

	In an ideal world the router would also signal the sender, as that would at least halve the time it takes for the congestion information to reach the most relevant party; but as I understand it, this is a) not generally possible and b) prone to abuse.
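
	(And on your point 1 above: here is roughly how I picture a router keeping approximate per-flow counts without storing every 5-tuple, via a counting Bloom filter; the table size, hash count and flow-key format are all placeholders of mine:)

    # Rough counting-Bloom-filter sketch for approximate per-flow packet counts;
    # table size, number of hashes and the flow-key format are placeholders.
    import hashlib

    SIZE = 1 << 16              # number of counters
    NUM_HASHES = 3
    counters = [0] * SIZE

    def _indexes(flow_id):
        for i in range(NUM_HASHES):
            digest = hashlib.sha256(f"{i}:{flow_id}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % SIZE

    def record_packet(flow_id):
        for idx in _indexes(flow_id):
            counters[idx] += 1

    def estimated_packets(flow_id):
        """Never under-estimates; hash collisions can only inflate the count."""
        return min(counters[idx] for idx in _indexes(flow_id))

    flow = ("192.0.2.1", "198.51.100.7", 6, 50000, 443)
    record_packet(flow)
    estimated_packets(flow)  # -> 1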

>  
> 3) there are definitely other ways to improve latency for IP and protocols built on top of it - routing some flows over different paths under congestion is one; call that per-flow routing. Another is scattering a flow over several paths (but that seems problematic for today's TCP, which assumes all packets take the same path).

	This is about re-ordering, no? 

>  
> 4) A different, but very coupled view of IP is that any application-relevant buffering should be driven into the endpoints - at the source, buffering is useful to deal with variability in the rate of production of data to be sent. At the destination, buffering is useful to minimize jitter, matching the consumption behavior of the application.  But these buffers should not be pushed into the network where they cause congestion for other flows sharing resources.
> So buffering in the network should ONLY deal with the uncertainty in resource competition.

	This, at least in my understanding, is one of the underlying ideas of the L4S approach; what is your take on how well L4S achieves that goal?
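
	(To make the "buffering belongs in the endpoints" part concrete for myself: at the destination this is just a play-out/jitter buffer; a toy sketch, where the frame interval and the deliberate play-out delay are made-up values:)

    # Toy receive-side play-out (jitter) buffer: the receiver deliberately adds
    # PLAYOUT_DELAY of buffering so network jitter does not cause play-out gaps.
    # FRAME_INTERVAL and PLAYOUT_DELAY are made-up values, not from any spec.
    import heapq

    FRAME_INTERVAL = 0.020   # seconds between frames at the source (assumed)
    PLAYOUT_DELAY = 0.060    # deliberate end-point buffering (assumed)

    class JitterBuffer:
        def __init__(self, base_time):
            self.base_time = base_time   # arrival time of the first frame
            self.heap = []               # (sequence_number, payload)

        def on_packet(self, seq, payload):
            heapq.heappush(self.heap, (seq, payload))

        def next_frame(self, now):
            """Return the next frame once its scheduled play-out time has passed."""
            if not self.heap:
                return None
            seq, payload = self.heap[0]
            if now >= self.base_time + PLAYOUT_DELAY + seq * FRAME_INTERVAL:
                heapq.heappop(self.heap)
                return payload
            return None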


>  
> This tripartite breakdown of buffering is protocol independent. It applies to TCP, NTP, RTP, QUIC/UDP, ...  It's what we (that is me) had in mind when we split UDP out of TCP, allowing UDP based protocols to manage source and destination buffering in the application for all the things we thought UDP would be used for - packet speech, computer-computer remote procedure calls (what would be QUIC today), SATNET/interplanetary Internet connections , ...).

	Like many great insights that look obvious in retrospect, I would guess this might have been controversial at the time?

>  
> Sadly, in the many years since the late 1970's the tendency to think file transfers between infinite speed storage devices over TCP are the only relevant use of the Internet has penetrated the router design community. I can't seem to get anyone to recognize how far we are from that.  No one runs benchmarks for such behavior, no one even measures anything other than the "hot rod" maximum throughput cases.

	I would guess that this obsession might be market-driven: as long as customers only look at the top-speed numbers, increasing that number will be the priority.


Again, thanks for your insights.

	Sebastian

>  
> And many egos seem to think that working on the hot rod cases is going to make their career or sell product.  (e.g. the sad case of Arista).
>  
>  
> On Wednesday, June 26, 2019 8:48am, "Sebastian Moeller" <moeller0@gmx.de> said:
> 
> > 
> > 
> > > On Jun 23, 2019, at 00:09, David P. Reed <dpreed@deepplum.com> wrote:
> > >
> > > [...]
> > >
> > > per-flow scheduling is appropriate on a shared link. However, the end-to-end
> > argument would suggest that the network not try to divine which flows get
> > preferred.
> > > And beyond the end-to-end argument, there's a practical problem - since the
> > ideal state of a shared link means that it ought to have no local backlog in the
> > queue, the information needed to schedule "fairly" isn't in the queue backlog
> > itself. If there is only one packet, what's to schedule?
> > >
> > [...]
> > 
> > Excuse my stupidity, but the "only one single packet" case is the theoretical
> > limiting case, no?
> > Because even on a link not running at capacity this effectively requires a
> > mechanism to "synchronize" all senders (whose packets traverse the hop we are
> > looking at), as no other packet is allowed to reach the hop unless the "current"
> > one has been passed to the PHY; otherwise we transiently queue 2 packets (I note
> > that this rationale should hold for any small N). The more packets per second a
> > hop handles, the less likely it is that a newcomer avoids running into an
> > already existing packet (or packets), that is, transiently growing the queue.
> > Not having a CS background, I fail to see how this required synchronized state can
> > exist outside of a few steady state configurations where things change slowly
> > enough that the seemingly required synchronization can actually happen (given
> > that the feedback loop e.g. through ACKs, seems somewhat jittery). Since packets
> > never know which path they take and which hop is going to be critical there seems
> > to be no a priori way to synchronize all senders, heck I fail to see whether it
> > would be possible at all to guarantee synchronized behavior on more than one hop
> > (unless all hops are extremely uniform).
> > I happen to believe that L4S suffers from the same conceptual issue (plus overly
> > generic promises, from the RITE website:
> > "We are so used to the unpredictability of queuing delay, we don’t know how
> > good the Internet would feel without it. The RITE project has developed simple
> > technology to make queuing delay a thing of the past—not just for a select
> > few apps, but for all." This seems to be missing a "conditions apply" statement)
> > 
> > Best Regards
> > Sebastian