Re: [tsvwg] Multicast suggestion for draft-ietf-tsvwg-le-phb-04

Toerless Eckert <tte@cs.fau.de> Fri, 06 April 2018 18:00 UTC

Date: Fri, 06 Apr 2018 20:00:13 +0200
From: Toerless Eckert <tte@cs.fau.de>
To: Roland Bless <roland.bless@kit.edu>
Cc: tsvwg@ietf.org, draft-ietf-tsvwg-le-phb@ietf.org
Message-ID: <20180406180013.znzqqbl3ycjzb3i4@faui48f.informatik.uni-erlangen.de>
References: <20180406020854.iqnpv5hok2jszj5b@faui48f.informatik.uni-erlangen.de> <f4eefb98-7672-492f-fc08-341b0ac9908f@kit.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <f4eefb98-7672-492f-fc08-341b0ac9908f@kit.edu>
User-Agent: NeoMutt/20170113 (1.7.2)
Archived-At: <https://mailarchive.ietf.org/arch/msg/tsvwg/fo8VCrjkykxIRG6-lcjmfAczEpQ>
Subject: Re: [tsvwg] Multicast suggestion for draft-ietf-tsvwg-le-phb-04

On Fri, Apr 06, 2018 at 05:39:58PM +0200, Roland Bless wrote:
> I think I understand the problem, so I try to rephrase
> in my own words:
> Packet replication happens before the network node can know
> that there is congestion at a particular output
> queue, so the replication to that output queue was
> not necessary. In my point of view this is an interesting
> implementation-related optimization problem, but it is not really
> specific for the LE PHB. The same problem exists for BE traffic,
> as also some output interfaces may be congested, whereas others
> would not.

On a typical network device, the aggregate ingress bandwidth
is the same as the aggregate egress bandwidth (asymmetric cases,
e.g. DSL, are a longer discussion). So the performance requirement
for forwarding is roughly to be able to fill all interfaces
on input and output, plus somewhat more performance as icing
on the cake.

So when you start to use badly designed LE, e.g. a fountain
codec over UDP unicast, the penalty on a network device is not
worse than with inelastic BE unicast. Its "badness" is severely
limited by the need to get all those bad packets into the device
before they can unnecessarily be dropped: with unicast, you can
never make the box drop more packets than it receives.

Badly designed LE multicast, however, can be a lot worse, because
it can be as bad as badly designed inelastic BE multicast. Except
that nobody understands the impact of really bad inelastic multicast,
because all deployments of multicast go to a
lot of trouble to provide guaranteed bandwidth for inelastic
multicast traffic and to protect it against attacks, such as in IPTV
and market data networks. And in theory, there is elastic multicast
as required by RFC8085, but only a few companies know how to
do it and have working code for it (*sigh*).

So the risk is really high that LE with multicast opens the door
to badly designed multicast solutions that ignore the realities
of network devices, because their designers do not understand how
those devices operate and assume that a lot of replication and
dropping is not a big problem, when in fact it is.

> At the time when we wrote RFC3754, there was a comment that on
> some router architectures it is possible to directly duplicate
> a multicast packet on the backplane of the router, so that all
> outgoing interfaces read the packet in parallel. In this case
> there may not be such a replication performance impact.

Yes, there are buses and multicast-capable fabrics and other
designs in which the impact of unnecessary replication varies. But
they can equally make the problem bigger: Imagine an 8-port
GigE switch. 4 input ports, each sourcing 1 Gbps of multicast into
its own group; 4 ports with receivers, all joining all 4 groups.
Ports connected via e.g. an 8 Gbps internal bus. So each of the
4 egress ports receives 4 Gbps and has to drop 3 Gbps of it. Aka:
the bus can make such a platform perform worse for this type
of problem than other designs.
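
To make the arithmetic of this example explicit, here is a minimal
sketch in Python. The function name and its parameters are purely
illustrative and model nothing beyond the numbers above: every egress
port is offered the sum of all groups it joined and can send at most
its line rate. Where a real device actually drops is an
implementation detail that this sketch does not capture.

```python
def egress_load(sources_gbps, receivers, port_rate_gbps=1.0):
    """Offered load and drop per egress port when every receiver
    port joins every group (hypothetical model of the example)."""
    offered = sum(sources_gbps)          # each port is offered all groups
    sent = min(offered, port_rate_gbps)  # line rate caps what can leave
    dropped = offered - sent             # replicated, then discarded
    return {
        "offered_per_port": offered,
        "sent_per_port": sent,
        "dropped_per_port": dropped,
        "total_replicated": offered * receivers,
        "total_sent": sent * receivers,
    }


# 4 sources at 1 Gbps each, 4 receiver ports at GigE line rate.
result = egress_load([1.0, 1.0, 1.0, 1.0], receivers=4)
print(result)
```

For the switch above this gives 4 Gbps offered and 3 Gbps dropped per
egress port, i.e. 16 Gbps of replicated traffic for 4 Gbps that
actually leaves the box.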

On all platforms I have seen over 20 years, the aggregate
replicated "virtual" number of egress packets (before drop)
needs to be at most 110% of the aggregate number of
physical output packets of the box, or you get big problems
(creating victim packets). And when LE systems are designed
to operate under higher loss than BE, you can very easily
run into those situations in well-loaded networks.
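
As a rough illustration of that rule of thumb, the check below
compares the replicated ("virtual") egress packet rate against the
physical output packet rate. The 110% limit is the figure quoted
above; the function names, the packets-per-second units and the
sample numbers are assumptions made only for this sketch.

```python
def replication_ratio(virtual_egress_pps, physical_egress_pps):
    """Ratio of replicated (pre-drop) egress packets to packets
    that can physically leave the box."""
    if physical_egress_pps <= 0:
        raise ValueError("physical_egress_pps must be positive")
    return virtual_egress_pps / physical_egress_pps


def exceeds_budget(virtual_egress_pps, physical_egress_pps, limit=1.10):
    """True if replication overhead is above the rule-of-thumb limit."""
    return replication_ratio(virtual_egress_pps, physical_egress_pps) > limit


# 5% replication overhead stays within the limit.
print(exceeds_budget(1_050_000, 1_000_000))
# Four times more replicated than physically sendable does not.
print(exceeds_budget(4_000_000, 1_000_000))
```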

Of course, it would be a lot of fun to design network devices
for which this is not an issue, and it's certainly possible,
but I could never persuade a HW designer to build all the tricky
HW needed for this. And besides: without clear use cases with
business value for LE with a lot of loss, nobody would invest
in this.

> > With BE traffic you would not even think of designing multicast
> > solutions, where the aggregate amount of egress traffic was larger
> > than what the outgoing interfaces support. With LE traffic and
> > network/fountain codecs you can easily run into the misconception that
> > such a setup would work perfectly well, because you are not aware
> > that replication and a large amount of dropping does waste
> > precious resources.
> 
> I'm not sure about this one...see next comment.
> 
> > Aka: I would highly recommend to make a statement that 
> > multicast LE traffic (outside controlled networks) MUST
> > be congestion controlled using the options summarized in
> > RFC8085, 4.1. - and add explanatory text about the reasons
> > (as above).
> 
> I don't see how congestion control would help to
> avoid the internal resource waste by unnecessary
> replication. A congestion signal by packet loss
> or ECN mark would only affect a particular multicast
> branch

Not correct.

> and also receiver-based congestion control
> would not change the situation.

Not correct.

RFC8085 4.1, first bullet:

Feedback-based is your "congestion signal". It
throttles the sender until no receiver has significant
loss. Effective against waste of replication, but really
only useful for non-congestion (random) loss or
limited congestion scenarios (well-defined, reliable
lower bound on available bandwidth). Otherwise you have
to kick bad receivers off the tree. The RFC8085
text is a bit too wiggly IMHO. Anyhow: totally useless
for LE.

RFC8085 4.1, second bullet:

This is the only useful CC option for multicast LE
unless you run in controlled networks where you can
clearly manage the replication overhead on every
replication device. Receivers drop themselves
off the tree when they experience loss and join a tree
with a lower bitrate. Aka: the sender needs to offer
content at a variety of bitrates, and a receiver can
join the subset of those groups/channels matching
the bitrate it can receive without significant loss.
With streaming data / fountain codecs this is a
great, piece-of-cake design. With video it requires somewhat
complex SVC designs, but anything LE is not realtime
IMHO, so fountain codecs plus this receiver-based
CC are IMHO a great option for LE. The
only limitation is that you cannot go overboard with
the amount of acceptable loss, but really need to keep it
below, IMHO, 1..3% (based on the above discussion of
replication overhead). If you experience higher loss, you
need to downspeed (join a different group).
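
The receiver-side decision described here can be sketched roughly as
follows. This is only an illustration, not a definitive design: the
function name, the idea of indexing a list of offered bitrates, and
the step up to a higher rate on low loss are assumptions added for
the example. The 3% bound comes from the range mentioned above; the
1% threshold for moving up is a made-up value. The actual group
join/leave signalling (IGMP/MLD) is not shown.

```python
def next_rate_index(current, loss_fraction, num_rates,
                    max_loss=0.03, step_up_loss=0.01):
    """Pick the index of the group/bitrate to be joined next.

    Rates are assumed to be sorted in ascending order, so a lower
    index means a lower bitrate. All thresholds are illustrative.
    """
    if loss_fraction > max_loss and current > 0:
        return current - 1      # too much loss: downspeed
    if loss_fraction < step_up_loss and current < num_rates - 1:
        return current + 1      # hypothetical: try the next higher rate
    return current              # loss acceptable: stay where we are


rates_mbps = [1, 2, 4, 8]       # hypothetical set of offered bitrates
idx = next_rate_index(2, 0.05, len(rates_mbps))
print(rates_mbps[idx])          # receiver moves from 4 Mbps to 2 Mbps
```

A receiver would run such a check periodically on its measured loss
and leave/join groups accordingly, so that sustained loss above the
bound always results in less replicated traffic toward it.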

This is pretty much in line with typical multicast
designs that do this for SVC, so the main thing to
make sure of is that we do not let LE designs assume they
can live with higher loss than is acceptable for BE
receiver-rate multicast CC. Funnily enough, that
rule would be a good rule anyhow, just in case LE gets into
the BE queue, but it also helps against the
multicast replication issue.

As said, let me know if you want me to help out on text
around this...

Oh, and btw: if we used BIER end-to-end, we
could do a lot more with LE. Might be worth another
sentence ;-)

> > If this is something you think is useful, let me know, i could
> > help rewrite to fit the draft.
> > 
> > b) I really had a hard time finding this draft because I am
> > always looking for "scavenger". Is there any way to smuggle that
> > word into the draft (some side remark) so that future Google searches
> > will find it?
> 
> > I don't think i am the only one that remembers this traffic
> > class by that name, even if the official IETF term is different (LE).
> 
> Point taken, but let me explain a little bit:
> The QBone Scavenger Service (QBSS) Definition was published
> March 16, 2001, whereas our "Lower than best effort" idea was first
> published in September 1999 as draft-bless-diffserv-lbe-phb-00:
> http://www.ietf.org/mail-archive/web-old/ietf-announce-old/current/msg05168.html
> or
> https://tools.ietf.org/html/draft-bless-diffserv-lbe-phb-00

Hah, there we go, copying that type of history into the draft
will get the needed word "scavenger" back into the draft ;-)

> From this time on, we developed it further within the DiffServ WG.
> So strictly speaking, the IETF term has been changed from "Lower than
> best effort" to "Lower Effort"
> (https://tools.ietf.org/html/draft-bless-diffserv-pdb-le-00).
> 
> I think that the QBSS definition was an Internet2 implementation
> of our LBE PHB, but the QBSS definition never mentioned that origin.
> So most people think that the scavenger service definition was the
> original idea, but indeed our lower than best-effort was published
> one and a half years before the QBSS specification.

> So I'll try to provide that linkage to scavenger service for those
> who are only familiar with the Internet2 side of the story :-)

History is always written by the winner, so as long as you
write this draft, and think you can defend the statements
against IESG challenges, I would be very happy to see the
above reformulated into a history section in the RFC.

Just see my other CC/bleaching text and the last section, where I
think it becomes clear why the word scavenger is really a lot
more natural to me. Ideally I would like to see a title like
this:

"A Lower Effort Per-Hop Behavior (LE PHB) and DSCP for Scavenger Traffic"

Aka: LE describes what the network does (less), scavenger
nicely describes what the traffic needs to do. Quite
complementary. And with all those words (LE, PHB, scavenger,
DSCP) in the title, everybody googling for this stuff will find it.

(Many operators know DSCP, but have no idea what PHB means; that's
why I would add it to the title.)


> 
> Best,
>  Roland

-- 
---
tte@cs.fau.de