[aqm] draft-ietf-aqm-pie-02 review

Polina Goltsman <polina.goltsman@student.kit.edu> Fri, 14 August 2015 13:26 UTC

Return-Path: <polina.goltsman@student.kit.edu>
X-Original-To: aqm@ietfa.amsl.com
Delivered-To: aqm@ietfa.amsl.com
Received: from localhost (ietfa.amsl.com [127.0.0.1]) by ietfa.amsl.com (Postfix) with ESMTP id DE6801A1A99 for <aqm@ietfa.amsl.com>; Fri, 14 Aug 2015 06:26:04 -0700 (PDT)
X-Virus-Scanned: amavisd-new at amsl.com
X-Spam-Flag: NO
X-Spam-Score: -1.909
X-Spam-Level:
X-Spam-Status: No, score=-1.909 tagged_above=-999 required=5 tests=[BAYES_00=-1.9, HTML_MESSAGE=0.001, T_RP_MATCHES_RCVD=-0.01] autolearn=ham
Received: from mail.ietf.org ([4.31.198.44]) by localhost (ietfa.amsl.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id edOfdeMud6dB for <aqm@ietfa.amsl.com>; Fri, 14 Aug 2015 06:25:59 -0700 (PDT)
Received: from scc-mailout-kit-01-web.scc.kit.edu (scc-mailout-kit-01-web.scc.kit.edu [IPv6:2a00:1398:9:f712::810d:e75d]) (using TLSv1.2 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) by ietfa.amsl.com (Postfix) with ESMTPS id 94E6D1A0062 for <aqm@ietf.org>; Fri, 14 Aug 2015 06:25:58 -0700 (PDT)
Received: from kit-msx-28.kit.edu ([2a00:1398:9:f612::28]) by scc-mailout-kit-01.scc.kit.edu with esmtps (TLS1.2:RSA_AES_256_CBC_SHA1:256) (envelope-from <polina.goltsman@student.kit.edu>) id 1ZQEza-0007nX-Jl for aqm@ietf.org; Fri, 14 Aug 2015 15:25:57 +0200
Received: from scc-wkit-clx-224-41.scc.kit.edu (2a00:1398:9:fb00:1acf:5eff:fe15:5525) by smtp.kit.edu (2a00:1398:9:f612::106) with Microsoft SMTP Server (TLS) id 15.0.1076.9; Fri, 14 Aug 2015 15:25:53 +0200
Message-ID: <55CDEC64.8030807@student.kit.edu>
Date: Fri, 14 Aug 2015 15:25:56 +0200
From: Polina Goltsman <polina.goltsman@student.kit.edu>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0
MIME-Version: 1.0
To: AQM IETF list <aqm@ietf.org>
Content-Type: multipart/alternative; boundary="------------000807030804070400080703"
X-Originating-IP: [2a00:1398:9:fb00:1acf:5eff:fe15:5525]
Archived-At: <http://mailarchive.ietf.org/arch/msg/aqm/BHt5UdTU_vm_Am5Akvsn_FiTk98>
Subject: [aqm] draft-ietf-aqm-pie-02 review
X-BeenThere: aqm@ietf.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: "Discussion list for active queue management and flow isolation." <aqm.ietf.org>
List-Unsubscribe: <https://www.ietf.org/mailman/options/aqm>, <mailto:aqm-request@ietf.org?subject=unsubscribe>
List-Archive: <https://mailarchive.ietf.org/arch/browse/aqm/>
List-Post: <mailto:aqm@ietf.org>
List-Help: <mailto:aqm-request@ietf.org?subject=help>
List-Subscribe: <https://www.ietf.org/mailman/listinfo/aqm>, <mailto:aqm-request@ietf.org?subject=subscribe>
X-List-Received-Date: Fri, 14 Aug 2015 13:26:05 -0000

Hello all,

Currently, I'm implementing various AQMs and one of them is PIE. 
However, one problem is that the implementation is made difficult due to 
slight inconsistencies between the draft section 4 and the pseudocode as 
well as it's difficulty
to understand without the corresponding (theoretical) background 
knowledge. Therefore, here are my suggestions for improvement from my 
review of the latest draft version [thanks to Roland for helping with 
the suggestions and text]:


      General feedback:

IHMO the current version of the draft, especially Section 4, is hard to 
understand without reading the PIE paper ([HSPR-PIE]) first. Ideally I 
would prefer an introduction section with the overview of the design 
before the "Terminology" section, but it should at least be suggested to 
read the paper first. BTW is the paper publicly available?

In Section 4 there is a subsection per feature added, as opposed to a 
subsection per component. As a result, requirements from the same 
component are written in pieces in different subsections. For me it is 
very hard to combine them together. One solution could be to include the 
pseudocode as last subsection of Section 4. The same applies to Section 5.

The variables in formulas are not introduced before the formulas. I 
would prefer to have a short description of what a formula does before 
the code and a long description after the code.

PIE is described as consisting of three components, however the 
"preambles" in the formulas (the lines that start with stars) do not 
referred directly to the components and the code for the same components 
are titled with different star-lines: e.g., the specification of "random 
dropping at enqueue" component appears under:

* upon a packet arrival MUST
* upon packet arrival
* if rand() < p

In the previous draft version there was only the link capacity 
estimation version and there was no latency component, only the capacity 
estimation component. In the current draft this is replaced with two 
options, which made the whole draft very confusing. For instance, why is 
delay called est-delay if it is either estimated or sampled? It is also 
not clear that estimated delay is calculated in 
drop_probability_calculation. I would *guess* it after reading the 
paper...  IMO it would be better to describe the whole original version 
with link capacity estimation and then suggest the alternatives: 
implementation with timestamps or reading capacity from some other 
(external to PIE) system component as in DOCSIS-PIE. Is it expected that 
most versions will use capacity estimation?

There are SHOULD values for parameters of PIE, but it is never explained 
in what range of conditions these parameters are [expected to be] valid. 
There are slides for PIE in Data Centers 
here:http://www.ietf.org/proceedings/86/slides/slides-86-iccrg-5.pdf. 
They use different parameters for alpha, beta, and T_update. So I assume 
that the parameters in the draft are not valid in these scenario. Also, 
since target/reference delay is what a user probably wants to configure, 
should it be SHOULD as opposed to RECOMMENDED. Finally, what minimum 
buffer size is required for these values of target and burst-allow, so 
that PIE doesn't revert to taildrop in the process?

Moreover, it is IMHO questionable whether implementers are familiar with 
the control theory and can perform calculations based on the formulas in 
the paper. Ideally, I would prefer a spreadsheet, where I can input 
visible network parameters - e.g.,  link capacity, average or max RTT, 
max_available_buffer, desired delay and get values for PIE variables.

As a nit, IMHO delay should always be called delay and not sometimes 
delay and sometimes latency.


      per-Section feedback:

Sec4.2:
 >in the formulas for autotune: if (drop_prob_ < X)
is p a value between 0 and 1 or between 0 and 100% ?

Sec4.2:
 >Here, the current queue length is denoted by qlen.
there is no qlen in the formulas

Sec4.2:
 > Variables, est_del and est_del_old represent the current and previous 
estimation of the queueing delay.
suggestion: which are calculate by the latency component (see Section 4.3)

Sec4.2:
IMO the rationale for PI controller belongs to Design Goals in Section 3 
and not to algorithm specification. By this line, it should already be 
clear that PI is used.

Sec4.4:
 >* if p == 0 and est_del < del_ref and est_del_old < del_ref
 >burst_allowance = max_burst;
to what component does this formula belongs?

Sec5.1

* if rand() < p
This is implementation consideration that belongs to Section 6, here 
should be "upon packet arrival"

Sec5.2
see the comment to the PIE enhanced pseudocode, denoted as [**]. it 
affects this line too.

Sec5.3
Is the size of a buffer specified somewhere? It can be bufferbloated, so 
that 1/3buffer is more than reference delay and burst_allowance together.

Sec6
 >If the implementation doesn’t rely on packet timestamps for 
calculating latency, PIE does not *require extra memory*.
extra operations?
I assume that Linux's Codel does not allocate extra space in skb for 
timestamps, it is already there...

 >The state requirement is only two variables per queue: est_del and 
est_del_old. Hence the memory overhead is small.
well the state is est_del_old, p, and burst_allow, and depending on the 
delay calculation either est_delay or average drain rate and drained 
bytes for measurement.

 >PIE calculates latency using the departure rate, which can be 
implemented using a multiplication.
see the comment to the PIE enhanced pseudocode, denoted as [**]. it 
affects this line too.

 >In summary, PIE is simple to implement
given the current state of the draft I really doubt this ...

 >SFQ can be combined with PIE to provide further improvement of latency 
for various flows with different priorities.
There was a discussion started by me here 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01269.html) in 
which it was concluded that for SFQ scheduler there is supposed to be a 
single PIE state and drop probability of each queue is multiplied by its 
qlen/max qlen 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01363.html). 
Additionally, it was stated that the same approach was taken for a 
scheduler with small number of queues ( probably in this email 
http://www.ietf.org/mail-archive/web/aqm/current/msg01358.html)
Since discussing the interaction of AQM with a scheduler is a 
requirement from draft-aqm-eval-guidelines, this qlen/max_qlen 
recomendation should be summarized


      PIE BASIC pseudocode feedback:

 >current_qdelay_, -qdelay_old_,
aren't they called est_delay and est_delay_old in Section 4. It would be 
good to use consistent names

enqueue():

Burst allowance is reset in this routine. However in the next pseudocode 
it is reset in calculate_drop_prob() which does make more sense to me.

 >if (PIE->burst_allowance_ < 0 && drop_early() == DROP && 
PIE->burst_allowance_ <= 0) {
the burst_allowance condition is verified two times, with < and with <=


 >if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 20%)
 >|| (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) {
 >return ENQUE;
 >}

is this line explained in Section 4?

calculate_drop_prob():

 >//can be implemented using integer multiply,
 >qdelay = PIE->current_qdelay_;
it probably can, but isn't simple assignment better :)

 >PIE->last_timestamp_ = now;
I assume this line came from calling the timer function based on 
timestamps, which is implementation decision and should not be present 
in pseudocode. Linux e.g., calls calculate_drop_prob() by timer.


      PIE enhanced pseudocode feedback:

enqueue:

 >if (queue_.byte_length()+packet.size() > TAIL_DROP) {
 >drop(packet);

this I assume is from Bob Briscoe's review 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01175.html), is it 
explained somewhere in Section 5?

calculate_drop_prob:

 >if ( (now - PIE->last_timestampe_) >= T_UPDATE &&
 >PIE->active_ == ACTIVE) {
1) it is not explained in Section 5 that when PIE is inactive, drop 
probability is not recalculated. Also should it really not?
2) now - timestamp is implementation choice and should not be here
3) there is a typo in timestamp*e*

 >if (PIE->drop_prob_ >= 10% && p > 2%) {
 >p = 0.02;
 >}
 >PIE->drop_prob_ += p;
it this line explained somewhere in Section 5? I believe this line was 
commented in this email 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01216.html) and 
should have appeared as MAY requirement.

dequeue:

 >weight = DQ_THRESHOLD/2^16
1) this is new and not explained in Section 5.2
2) since dq_thresh is in bytes, weight also has units - bytes. which 
makes avg_dq_time  in units bytes * sec or something unknown like (1 - 
bytes) * sec.

in Linux code it is first checked whether a new measurement cycle can be 
started, and then departed bytes are updated. With current version if 
there is exactly dq_threshold bytes in queue and new packets will not 
come the code will not count bytes in the first packet and the 
condition  (queue_.byte_length() >= DQ_THRESHOLD ) will not happen after 
new packets arrive.

[**]
the PIE version in [HSPR-PIE] and Linux version calculate instantaneous 
capacity (also called departure rate) as dq_count/dq_time. Then the 
average capacity was calculated. The new code proposes to calculate 
instantaneous dq_time and average dq_time and then calculate capacity as 
dq_thresh/dq_time in calculate_drop_prob. Since the queue can send 
packets and not bytes, dq_count is not dq_thresh and it may be well 
different and measured capacity not precise.

I think it would be better to include the correct formulas, and then 
suggest a less precise implementation, which requires less overhead in 
Section 6. In some situations there can be enough hardware power so that 
the optimization is not necessary (e.g., Linux on modern Desktops). Plus 
on 10Gb interface Linux sends 65K tso segments on 10Gb, I don't think I 
would want the value to be += one segment precise.


      PIE Linux Code:

This is code from 
https://github.com/torvalds/linux/blob/master/net/sched/sch_pie.c
As far as I've checked, the code in 
https://github.com/hironoriokano/fq-pie/blob/master/pie.h is the same.

enqueue:

line 125 - byte mode
byte mode is not explained anywhere in the draft

dequeue:

line 286:
do I understand correctly: old epsilon is 1/8 but new epsilon is 
threshold / 2^16 which is supposedly  16 * 2^10 / 2^16 = 2^-4.

line 300 updates burst-allow-= dtime:
according to the draft, burst-allow should be updated every Tupdate 
together with drop probability and not on every measure ?
since there is no guarantee that measurement is always active these are 
two different results. Is this code then wrong?

calculate_drop_prob:

line 378:
>if  (qdelay > (PSCHED_NS2TICKS(250 * NSEC_PER_MSEC)))
 >delta += MAX_PROB / (100 / 2);
this is not in the draft, is it ?

line 416 and below:

 > /* We restart the measurement cycle if the following conditions are 
met ...
why do we need to restart measurement cycle? In my understanding the 
measurement cycle is independent of p , and only update to burst-allow 
is necessary.


      Nits:

just before 4.
 >In the following, the design of PIE and its operation are described in 
*deta*.
detail

Sec4.1

 >PIE optionally supports ECN and will be discussed in Section 5.1.
PIE optionally supports ECN. See Section 5.1.

or

PIE optionally supports ECN which will be discussed in Section 5.1.

 >It can *b* reduced
it can be reduced

Sec5.2

 > This threshold would allow *us* a long enough period
us should not be there

Best Regards,
Polina