[tsvwg] Review: draft-wei-tsvwg-tunnel-congestion-feedback-03

Bob Briscoe <bob.briscoe@bt.com> Sun, 22 March 2015 20:45 UTC

Date: Sun, 22 Mar 2015 20:45:27 +0000
To: "Weixinpeng (Jackie)" <weixinpeng@huawei.com>
From: Bob Briscoe <bob.briscoe@bt.com>
Cc: "BLACK, David" <Black_David@emc.com>, tsvwg IETF list <tsvwg@ietf.org>
Subject: [tsvwg] Review: draft-wei-tsvwg-tunnel-congestion-feedback-03

Xinpeng,

Thank you for this draft. It is now focusing on the right thing: 
standardising the feedback protocol.

I suggest that this doc should become two:
i) a requirements and system model doc (Informational) for any 
candidate feedback protocols;
ii) a briefer doc (Proposed Standard) specifying one specific 
case of that feedback protocol.

Only a small amount of the text in this doc would need to be moved 
into the latter.

This review is divided into:
1) Technical Points
2) Document Structure
3) Editorial
---------------------

1) Technical Points

1.1 Resilience to loss of feedback

1.1a) Partial reliability != message frequency adjustment

S.5.1 requires that the frequency of congestion feedback messages 
should reduce as congestion increases; then S.5.2 claims to satisfy 
that requirement with IPFIX, given that IPFIX is built over the 
partial reliability of SCTP.

That reaches the right answer, but it states the requirement the 
wrong way round.

The frequency at which the Meter /sends/ feedback messages must not 
reduce with congestion. The frequency at which the Collector 
/receives/ messages can reduce, if they get lost on the way due to congestion.

The primary resilience requirement should be stated as:
* metered values must be sent as cumulative counters, so if a message 
is lost the next message will recover the missing information.

If that requirement is met, it allows message reliability 
requirements to be relaxed:
* If a feedback message is lost, it need not be retransmitted. Then, 
as congestion increases, timeliness suffers, but no information is 
lost, because the next message to arrive still carries the full 
cumulative counts.
Therefore, it is incorrect to feed back the congestion level as a 
fraction. It must be fed back as counters of each type of signal 
(ECT|ECT, CE|ECT, etc.); then the Collector can reconstruct the 
difference since the last message and divide the appropriate counters 
to get the required ratio.
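
For illustration, here is a minimal sketch of what I mean (the class 
names and message layout are hypothetical; the point is cumulative 
counters at the Meter and delta reconstruction at the Collector):

class Meter:
    def __init__(self):
        # Cumulative byte counters, never reset, so any later
        # message supersedes a lost one.
        self.counters = {"CE|ECT": 0, "ECT|ECT": 0, "Not-ECT": 0}

    def on_packet(self, signal, size):
        self.counters[signal] += size

    def export(self):
        # Always send absolute (cumulative) values, never
        # per-interval deltas.
        return dict(self.counters)

class Collector:
    def __init__(self):
        self.last = None

    def on_message(self, msg):
        if self.last is not None:
            marked = msg["CE|ECT"] - self.last["CE|ECT"]
            unmarked = msg["ECT|ECT"] - self.last["ECT|ECT"]
            total = marked + unmarked
            ratio = marked / total if total else 0.0
            print("marking ratio since last received msg:", ratio)
        self.last = msg

If a message is lost, the next one that arrives simply spans a longer 
interval; nothing needs retransmitting.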

It is also incorrect to consider feedback messages as less important 
than user data. The feedback messages ensure user data has sufficient 
system capacity, so they should be thought of as control signalling. 
Often control signalling is given highest (Diffserv) forwarding 
priority. Nonetheless, as long as metered values are cumulative, they 
can be sent in the user-plane without any higher priority.

1.1b) Extending GTP, GRE etc must meet the same resilience requirements

S.5 suggests that a native tunnel protocol could be extended to add 
feedback messages, as an alternative to using IPFIX. If so, it has to 
satisfy the same resilience requirement, but it need not do so in the 
same way that IPFIX does.

For instance, in mobile networks GTP can be classed as either control 
plane or user plane. If the control plane variant (GTPv2-C) is used, 
it has higher forwarding priority than user-plane traffic.

This would be an alternative way to be resilient during congestion 
without requiring counters to be cumulative. Nonetheless, I would say 
that if a resilient solution in the user-plane is feasible, why 
create a more complex solution in the control plane?

BTW, it is possible to extend NVGRE to add new capabilities, such as 
a feedback control channel. I think this became possible from v4 
onwards (from memory).

1.2 Resilience to loss of forward ECN signals

Even if a network has deployed ECN universally, not only do we have 
to deal with loss of feedback signals, but also loss of forward 
congestion signals (CE and ECT). I think Erik Nordmark pointed this 
out in Jul'14, and it's important.

There are two forms of loss of ECN; the first is easy to handle, the 
second is not:
a) A fraction of CE or ECT signals from upstream might be dropped by 
a queue further downstream that does not support ECN, or that has had 
to switch from ECN to drop because it is overloaded. This should be 
safe, because they should be dropped randomly so the proportions of 
each signal will not change.
b) An ECN-capable queue has to switch from ECN to loss if it is 
persistently too long (as required by RFC3168, S.7). If this happens, 
the congestion feedback system will not be safe unless it is also 
measuring loss.

You rightly say that the system must be able to detect loss as well 
as ECN. This is the missing piece of the current doc. Unfortunately, 
when you add loss measurement, depending how you do it, you might no 
longer be able to support the central collector model of Fig 6. I'll 
explain next.


1.3 Measuring and Reporting Loss

Another reason that the system has to be able to measure loss, not 
just ECN, is because not all transports support ECN (and of course, 
ECN might be defined for a transport, but not enabled).

In S.4.2 you say the tunnel egress 'cannot' be aware of dropped 
Not-ECT packets. That's not strictly true. It's only true if the 
egress must be transport-blind. Otherwise, if the egress is allowed 
to understand TCP, it can look for gaps in the sequence space.

However, being transport-blind ought to be a requirement (because 
congestion management has to work even if the transport is hidden, 
e.g. by IPsec).

There have been proposals to add sequence numbers to certain tunnel 
headers (e.g. PWE3), so that loss across the PWE can be metered at the egress.

If adding sequence numbers is too complex, the only way I know to 
measure loss is to compare the amount of data (bytes) entering the 
tunnel ingress to the amount arriving at the egress, over the same 
span of packets. This implies measurements at the egress will be 
time-shifted relative to the ingress by the one-way delay, and it 
also implies that re-ordering could lead to slight over-counting in 
one period and under-counting in the next.

So, loss metering requires measurements at both ingress and egress to 
be collected at one place, and it has to be possible to find synch 
points in the two sets of measurements. I can think of a couple of ways:
a) The ingress regularly sends cumulative byte counts of each type of 
ECN signal to the egress. When each message arrives, the egress adds 
its own byte counts to the message and either returns the whole 
message to the sender, or to a central collector.
b) The ingress sends a uniquely identifiable (e.g. time-stamped) 
synch message to the egress every time it locally records cumulative 
byte counts of each type of ECN signal. When the egress receives this 
synch message, it collects its own byte counts and returns them to 
the ingress, along with the identifier from the synch message.

I prefer scheme (a) because it supports either system model (central 
collector, or ingress collector). Scheme (b) could support the 
central collector model if the ingress sent its measurements to the 
central collector as well.

Both models suffer from inaccuracy during the delay between receiving 
a synch message and producing a meter reading message.
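
To make scheme (a) concrete, here is a rough Python sketch (purely 
illustrative; the function names and the "total" field are my own 
invention). The key point is that the egress takes its own reading at 
the moment the ingress's message arrives, so both cumulative counts 
cover approximately the same span of packets:

def ingress_export(ingress_counters, send_towards_egress):
    # Periodically send the ingress's cumulative byte counts of
    # each type of ECN signal towards the egress.
    send_towards_egress({"ingress": dict(ingress_counters)})

def egress_on_message(msg, egress_counters, send_to_collector):
    # Append the egress's own cumulative counts and forward the
    # combined record (back to the ingress, or to a collector).
    msg["egress"] = dict(egress_counters)
    send_to_collector(msg)

def collector_on_record(rec, prev):
    # Loss over the interval = bytes into the ingress minus bytes
    # out of the egress, over roughly the same span of packets.
    if prev is not None:
        sent = rec["ingress"]["total"] - prev["ingress"]["total"]
        rcvd = rec["egress"]["total"] - prev["egress"]["total"]
        print("bytes lost in this interval:", sent - rcvd)
    return rec  # keep as 'prev' for the next record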

1.4 Faked ECT on Outer to Defer Loss to the Egress

There is another possible way of measuring loss, which I describe in 
draft-briscoe-conex-data-centre-02:
* After standard RFC6040 processing to create the outer header, set 
the outer ECN field to ECT(0).
* At egress, measure any CE in the outer with Not-ECT in the inner.
* Standard RFC6040 decapsulation will drop such packets.

This effectively defers losses from the middle of the tunnel to the 
egress, so "CE|Not-ECT" packets can be counted as loss.

And "ECT|Not-ECT" packets can be counted as Not-ECT, because the 
standard RFC6040 egress will strip off the ECT from the outer and 
forward the inner as Not-ECT.
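
In rough Python terms (a sketch of the idea, not a spec; codepoints 
shown as strings for readability):

def encapsulate_outer_ecn(inner_ecn):
    # The only deviation from standard RFC6040: for a Not-ECT
    # inner, set the outer to ECT(0) so that queues inside the
    # tunnel can mark rather than drop.
    if inner_ecn == "Not-ECT":
        return "ECT(0)"
    # Otherwise standard RFC6040 normal mode (which resets an
    # inner CE to ECT(0) on the outer).
    return "ECT(0)" if inner_ecn == "CE" else inner_ecn

def egress_meter(outer_ecn, inner_ecn, counters, size):
    if inner_ecn == "Not-ECT" and outer_ecn == "CE":
        counters["deferred_loss"] += size  # counts as loss
    elif inner_ecn == "Not-ECT" and outer_ecn.startswith("ECT"):
        counters["not_ect"] += size  # forwarded as Not-ECT
    # Standard RFC6040 decapsulation then drops CE|Not-ECT packets
    # and strips the faked ECT, so end systems see ordinary loss.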

This has three potential problems (pointed out by Erik Nordmark as well):

a) If congestion gets serious, an ECN queue within the tunnel will 
revert to loss, so you have to measure loss anyway.

b) Deferring loss from a congested queue to the egress doesn't 
immediately remove load from the congested queue - it has to wait 
1RTT for the e2e congestion control to respond.

c) This assumes one CE equals one loss, which is true according to 
RFC3168, but we are trying to change that (and RFC3168 says we might 
want to change that in future).

I will try to rebut each one:

a) yes, admittedly, you have to measure loss anyway, but if loss is 
only ever a sign of serious congestion, you won't need to measure it 
so accurately. Comparing total bytes in with bytes out using only 
loose synch might be sufficient for loss detection.
If any loss at all is detected, the traffic controller would have to 
take drastic action. At the same time ECN measurements (indicating 
deferred losses) would still give fine-grained information to tune 
individual customers, or routing, or whatever.

b) Deferring losses to the egress does not delay the signals given to 
the e2e control loop. And ECN and AQM in general keep queues from 
overflowing even though control is delayed by 1 RTT. So this one is a 
non-problem.

c) If the relationship between CE and drop is standardised 
differently in future (e.g. no longer 1:1), the tunnel egress could 
translate CE markings into the right number of losses by reversing 
the mapping. The standardised mapping would have to be simple enough 
to make this work without complexity at the egress. For the mapping 
we have in mind (p_ECN = p_drop^2), I believe reversing the mapping 
at the egress would be trivial (tho admittedly at some extra 
processing cost).
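
That is, the egress would only need something like this (assuming the 
p_ECN = p_drop^2 mapping were the one standardised):

import math

def equivalent_drop_prob(p_ecn):
    # Invert p_ECN = p_drop ** 2 to recover the drop probability
    # that the CE marking level is standing in for.
    return math.sqrt(p_ecn)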

1.5 ConEx

You might want to mention that ConEx info would be an alternative (if 
deployed).

ConEx was deliberately designed to give congestion info from the 
measurement point all the way downstream to the ultimate destination. 
This is not the same as just over one tunnel segment, but I would 
argue that a domain needs to control traffic with knowledge of the 
congestion it is causing further downstream in other domains. 
Certainly a domain should not re-route traffic without knowledge of 
the downstream effects, otherwise domains will be continually 
re-routing traffic because of re-routes in other domains, and all the 
fighting could go unstable.

In draft-briscoe-conex-data-centre-02 I suggested that, even if ConEx 
is not widely deployed, mechanisms like your 
tunnel-congestion-feedback should use it if it does appear on packets 
in future, then the feedback would be unnecessary for such 
ConEx-capable packets (because the ConEx packets already give the 
info to the ingress).

1.6 Metering Granularity

The IPFIX examples in section 5.2 (p16) seem to meter at the 
granularity of the whole tunnel, not of traffic sources that the 
ingress might be able to control separately (e.g. individual 
customers, or sets of flows that match a hash so they can be 
re-routed separately as an aggregate).

1.7 Control Timescales and Stability

Altho the action taken as a result of congestion is out of scope, you 
will need to state high-level requirements it must meet, e.g.:
* Control decisions must be delayed by more than a worst-case global 
RTT (e.g. 500ms), otherwise tunnel traffic management will not give 
normal e2e congestion control enough time to do its job, and the 
system could go unstable (see the sketch below).
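
For instance (illustrative only, with an assumed 500ms worst case), 
the traffic manager might hold back like this:

import time

WORST_CASE_RTT = 0.5  # seconds; assumed worst-case global RTT

def maybe_act(measurement, last_action_time, act):
    # Only adjust traffic management if the previous adjustment
    # was at least a worst-case RTT ago, giving e2e congestion
    # control time to respond to the signals first.
    now = time.time()
    if now - last_action_time >= WORST_CASE_RTT:
        act(measurement)
        return now
    return last_action_time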

I strongly suggest you avoid the term "Congestion Control" for this 
traffic management at the tunnel ingress. Instead, I suggest "Traffic 
Management", or "Congestion Management".

Control implies faster timescales than management.

1.8 Security Considerations

* The system has to be resilient to spoofed feedback messages causing 
the traffic manager to throttle a user or users, which would otherwise 
be "DoS by proxy". In the ingress controller model, SCTP's inherent 
protection against connection hijacking should be sufficient, but 
message authentication might be advisable. In the central controller 
model, message authentication would be essential.

* Can't think of any other potential attacks at present...


2. Document Structure

2.1 Separate INF & PS docs.

I think this doc will be too long for an implementer who just wants 
to be told what to do, not why. So I suggest you plan to write an 
INFormational requirements doc (this one) and a brief Proposed 
Standard for just the protocol spec.

2.2. Text that's in the wrong place

The doc has the right structure (in terms of headings), but certain 
pieces of the text aren't under the correct heading. E.g.:
* Last para of section 4 about loss ought to be quite early in the 
Intro, and there will need to be a lot more about loss all through 
the doc, including the Intro.
* There ought to be a discussion of the Direct vs Mediated feedback 
models in Section 4, instead of showing one model in Fig 4 and then 
suddenly revealing in Fig 6 that there might be another.
* Section 5 starts by talking about GTP, then it suddenly says it's 
not going to talk about GTP, but it's going to focus on IPFIX. This 
just needs turning round the other way.
* Section 6 (Benefits) ought to be earlier. There is already a lot of 
motivation text within the Introduction, and the Benefits ought to go 
with that (even if in a new Section 2 after the Intro, on motivation).
* A section enumerating formal requirements for candidate tunnel 
congestion feedback protocols is needed (with the resilience and 
timescale requirements I've mentioned here).

2.3 Introduction needs a rewrite

The Intro reads like it was written before the ideas in the rest of 
the doc had become clarified in your mind. I suggest you wait until 
the whole body of the doc is more stable, then note down the main 
messages you want to get over, scrap the current Intro and write a 
new one from scratch.

3) Editorial

3.1. Motivation in the Intro and the Problem Statement are Misguided and Weak

Even tho I believe there are strong motivations for this work, I 
disagree with nearly everything said to motivate it.

Actually, RFC970 (by John Nagle) is still valid, and motivates this 
work, even tho the title doesn't sound like it would ("On Packet 
Switches With Infinite Storage"). Essentially it explains why e2e 
congestion control is not sufficient, because some users can cause 
harm to others (and it goes on to propose fair queuing). One could 
say that tunnel congestion feedback collects information from all the 
queues across a segment, so it can be used for better algorithms than 
fair queuing, because it gives more information than that from just 
the local queue.

That leads on to the next item missing from the motivation: it needs 
to justify why the ingress cannot use the locally visible traffic 
rates, and why congestion information is better:
* Using rates would require additional knowledge of downstream 
capacity and topology, as well as of cross traffic that does not pass 
through this ingress.
* Congestion information intrinsically measures the burstiness in the 
traffic as well as whether the average rate is too great for the capacity.

It ought to say that per-flow congestion control is not sufficient, 
because each flow doesn't know the whole effect of other flows from 
that user on all other users. The congestion control of one flow 
doesn't know whether it is one of millions of flows from the same 
user, or just two, and it doesn't know whether the other flows from 
the same user are carrying terabytes of data over time, or just 
kilobytes. And it doesn't know how other users compare.

Then there is the possibility of creating a new path. Only a network 
operator can do that. Even a transport like MPTCP can only choose 
from the paths available, it cannot create new ones, or bring in new capacity.

Motivation statements I disagree with:
* Intro:
   - Congestion is defined as if it is something bad that has to be 
eliminated. More care needs to be taken to define the difference 
between congestion and sustained high load. Congestion is an 
ambiguous term used in the IETF transport area for a level of drop or 
marking that shows that transport protocols are doing the best they 
can to fill the capacity. So lack of congestion is bad in our eyes.
   - The only e2e congestion control protocols it mentions are 
ECN-based, as if drop isn't relevant. Whereas, drop is universally 
relevant and ECN is only an ideal.

* Problem Statement
   - Whether hosts support ECN or not is not a motivation for this 
work. Drop or ECN give hosts sufficient info to do congestion control.
   - "To improve the performance
    of the network, it's better for operator to take network congestion
    situation into network traffic management."
   There is no justification for this statement, which is just plain 
wrong. As I explained earlier, if a number of networks re-route only 
on local congestion info without considering the whole path, it can 
lead to instability as they all fight each other. Whereas each host 
knows about the whole path, and there's considerable theory now about 
why TCP works so well.


3.2 Technical nits

I am indebted to people who write in English even tho it's not their 
native tongue; I cannot write or read even one word in Chinese. I 
could understand the draft (with difficulty in places), but the draft 
will need to be corrected by a native English speaker eventually. 
However, clear writing starts with clear ideas. So, in this review I 
have focused on fixing some problems with the logical flow of ideas, 
and I have just given a couple of technical nits below.

S.4.1: "...indicates traffic that does not support ECN, for example UDP..."
ECN support for RTP over UDP is defined in RFC6679.

s/RED/AQM/ throughout.



Bob



________________________________________________________________
Bob Briscoe,                                                  BT