Re: [tsvwg] Review: draft-wei-tsvwg-tunnel-congestion-feedback-03

"Weixinpeng (Jackie)" <weixinpeng@huawei.com> Mon, 23 March 2015 05:10 UTC

From: "Weixinpeng (Jackie)" <weixinpeng@huawei.com>
To: Bob Briscoe <bob.briscoe@bt.com>
Date: Mon, 23 Mar 2015 05:10:07 +0000
Archived-At: <http://mailarchive.ietf.org/arch/msg/tsvwg/bzTjEr5LgX6gmq-O1gScGUKrvPg>
Cc: "BLACK, David" <Black_David@emc.com>, tsvwg IETF list <tsvwg@ietf.org>
Subject: Re: [tsvwg] Review: draft-wei-tsvwg-tunnel-congestion-feedback-03

Hi Bob,

    Thanks for your valuable comments! I will review them in more detail and provide a revised draft later.



Regards,

Xinpeng



________________________________
From: Bob Briscoe [bob.briscoe@bt.com]
Sent: 23 March 2015 4:45
To: Weixinpeng (Jackie)
Cc: Zhulei (MBB Research); Lingli Deng; tsvwg IETF list; BLACK, David
Subject: Review: draft-wei-tsvwg-tunnel-congestion-feedback-03

Xinpeng,

Thank you for this draft. It is now focusing on the right thing: standardising the feedback protocol.

I suggest that this doc should become two:
i) a requirements and system model doc (Informational) covering any candidate feedback protocol;
ii) a briefer doc (Proposed Standard) specifying one specific feedback protocol.

Only a small amount of the text in this doc would need to be moved into the latter.

This review is divided into:
1) Technical Points
2) Document Structure
3) Editorial
---------------------

1) Technical Points

1.1 Resilience to loss of feedback

1.1a) Partial reliability != message frequency adjustment

S.5.1 requires that the frequency of congestion feedback messages should reduce as congestion increases; then S.5.2 claims to satisfy that requirement with IPFIX, given that IPFIX runs over the partial-reliability service of SCTP.

That reaches the right answer, but it starts from the wrong way of stating the requirement.

The frequency at which the Meter /sends/ feedback messages must not reduce with congestion. The frequency at which the Collector /receives/ messages can reduce, if they get lost on the way due to congestion.

The primary resilience requirement should be stated as:
* metered values must be sent as cumulative counters, so if a message is lost the next message will recover the missing information.

If that requirement is met, it allows message reliability requirements to be relaxed:
* If a feedback message is lost it need not be retransmitted. Then, as congestion increases, timeliness suffers but no information is lost.

Therefore, it is incorrect to feed back the congestion level as a fraction. It must be fed back as counters of each type of signal (ECT|ECT, CE|ECT etc), then the Collector can reconstruct the difference since the last message and divide the appropriate counters to get the required ratio.

It is also incorrect to treat feedback messages as less important than user data. The feedback messages are what ensure the user data has sufficient capacity, so they should be thought of as control signalling. Often control signalling is given the highest (Diffserv) forwarding priority. Nonetheless, as long as metered values are cumulative, they can be sent in the user-plane without any higher priority.
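
To make this concrete, here is a minimal sketch of cumulative-counter feedback (the class names and message layout are mine, purely illustrative, not from the draft). The Meter only ever sends monotonically increasing per-codepoint byte totals, and the Collector differences consecutive readings, so a lost message merely widens one measurement interval:

# Illustrative sketch: cumulative ECN byte counters at the Meter,
# delta reconstruction at the Collector.

class Meter:
    def __init__(self):
        # Cumulative byte counts per outer ECN codepoint; never reset.
        self.totals = {"not_ect": 0, "ect0": 0, "ect1": 0, "ce": 0}

    def on_packet(self, ecn, size):
        self.totals[ecn] += size

    def feedback_message(self):
        # Send absolute totals, not fractions or per-interval deltas.
        return dict(self.totals)


class Collector:
    def __init__(self):
        self.last = None

    def on_feedback(self, msg):
        if self.last is None:
            self.last = msg
            return None
        # Difference since the last message that actually arrived;
        # anything lost in between is absorbed into this delta.
        delta = {k: msg[k] - self.last[k] for k in msg}
        self.last = msg
        marked = delta["ce"]
        capable = delta["ect0"] + delta["ect1"] + delta["ce"]
        return marked / capable if capable else 0.0  # congestion ratio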

1.1b) Extending GTP, GRE etc must meet the same resilience requirements

S.5 suggests that a native tunnel protocol could be extended to add feedback messages, as an alternative to using IPFIX. If so, it has to satisfy the resilience requirement, but it need not do so in the same way that IPFIX does.

For instance, in mobile networks GTP can be classed as either control plane or user plane. If the control plane variant (GTPv2-C) is used, it has higher forwarding priority than user-plane traffic.

This would be an alternative way to be resilient during congestion without requiring counters to be cumulative. Nonetheless, I would say that if a resilient solution in the user-plane is feasible, why create a more complex solution in the control plane?

BTW, it is possible to extend NVGRE to add new capabilities, such as a feedback control channel. I think this became possible from v4 onwards (from memory).

1.2 Resilience to loss of forward ECN signals

Even if a network has deployed ECN universally, we have to deal not only with loss of feedback signals, but also with loss of forward congestion signals (CE and ECT). I think Erik Nordmark pointed this out in Jul'14, and it's important.

There are two forms of loss of ECN; the first is easy to handle, the second is not:
a) A fraction of CE or ECT signals from upstream might be dropped by a queue further downstream that does not support ECN, or that has had to switch from ECN to drop because it is overloaded. This should be safe, because they should be dropped randomly so the proportions of each signal will not change.
b) An ECN-capable queue has to switch from ECN to loss if it is persistently too long (as required by RFC3168, S.7). If this happens, the congestion feedback system will not be safe unless it is also measuring loss.

You rightly say that the system must be able to detect loss as well as ECN. This is the missing piece of the current doc. Unfortunately, when you add loss measurement, depending on how you do it, you might no longer be able to support the central collector model of Fig 6. I'll explain next.


1.3 Measuring and Reporting Loss

Another reason that the system has to be able to measure loss, not just ECN, is because not all transports support ECN (and of course, ECN might be defined for a transport, but not enabled).

In S.4.2 you say the tunnel egress 'cannot' be aware of dropped Not-ECT packets. That's not strictly true. It's only true if the egress must be transport-blind. Otherwise, if the egress is allowed to understand TCP, it can look for gaps in the sequence space.

However, being transport-blind ought to be a requirement (because congestion management has to work even if the transport is hidden, e.g. by IPsec).

There have been proposals to add sequence numbers to certain tunnel headers (e.g. PWE3), so that loss across the PWE can be metered at the egress.

If adding sequence numbers is too complex, the only way I know to measure loss is to compare the amount of data (bytes) entering the tunnel ingress to the amount arriving at the egress, over the same span of packets. This implies measurements at the egress will be time-shifted relative to the ingress by the one-way delay, and it also implies that re-ordering could lead to slight over-counting in one period and under-counting in the next.

So, loss metering requires measurements at both ingress and egress to be collected at one place, and it has to be possible to find synch points in the two sets of measurements. I can think of a couple of ways:
a) The ingress regularly sends cumulative byte counts of each type of ECN signal to the egress. When each message arrives, the egress adds its own byte counts to the message and either returns the whole message to the sender, or to a central collector.
b) The ingress sends a uniquely identifiable (e.g. time-stamped) synch message to the egress every time it locally records cumulative byte counts of each type of ECN signal. When the egress receives this synch message, it collects its own byte counts and returns them to the ingress, along with the identifier from the synch message.

I prefer scheme (a) because it supports either system model (central collector, or ingress collector). Scheme (b) could support the central collector model if the ingress sent its measurements to the central collector as well.

Both schemes suffer from inaccuracy during the delay between receiving a synch message and producing a meter reading message.
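
As a rough sketch of scheme (a), with all field names invented for illustration: the ingress emits its cumulative per-codepoint byte counts in-band, the egress appends its own cumulative counts to the same message, and whoever collects the result differences two such snapshots, taking ingress bytes minus egress bytes over the same span as the loss estimate:

# Illustrative sketch of scheme (a); message fields are invented.

def ingress_sync(ingress_counters, seq):
    # Cumulative byte counts recorded at the tunnel ingress.
    return {"seq": seq, "ingress": dict(ingress_counters)}

def egress_on_sync(msg, egress_counters):
    # The egress adds its own cumulative counts when the sync message
    # arrives, then returns it to the ingress or a central collector.
    msg["egress"] = dict(egress_counters)
    return msg

def loss_between(snap_old, snap_new):
    # Bytes that entered the tunnel minus bytes that left it,
    # over the span between two sync messages.
    sent = (sum(snap_new["ingress"].values())
            - sum(snap_old["ingress"].values()))
    rcvd = (sum(snap_new["egress"].values())
            - sum(snap_old["egress"].values()))
    # Re-ordering around a synch point can push one period slightly
    # negative and the next slightly positive; clamp at zero.
    return max(0, sent - rcvd)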

1.4 Faked ECT on Outer to Defer Loss to the Egress

There is another possible way of measuring loss, which I describe in draft-briscoe-conex-data-centre-02:
* After standard RFC6040 processing to create the outer header, set the outer ECN field to ECT(0).
* At egress, measure any CE in the outer with Not-ECT in the inner.
* Standard RFC6040 decapsulation will drop such packets.

This effectively defers losses from the middle of the tunnel to the egress, so "CE|Not-ECT" packets can be counted as losses, and "ECT|Not-ECT" packets can be counted as Not-ECT, because the standard RFC6040 egress will strip the ECT from the outer and forward the inner as Not-ECT.
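
To sketch the egress side of this trick (my own pseudo-logic, not text from the draft or from draft-briscoe-conex-data-centre), the metering reduces to classifying the outer/inner ECN pair:

# Illustrative egress classification for the faked-ECT(0) scheme,
# where the ingress set the outer field to ECT(0) even when the
# inner is Not-ECT.

def classify(outer_ecn, inner_ecn):
    if inner_ecn == "not_ect":
        if outer_ecn == "ce":
            # Marked inside the tunnel but the transport cannot see
            # ECN: RFC6040 decap drops it, so meter it as a loss.
            return "deferred_loss"
        # The faked outer ECT is stripped on decapsulation and the
        # packet is forwarded as plain Not-ECT.
        return "not_ect"
    if outer_ecn == "ce":
        return "ce"      # propagate CE into the inner header
    return inner_ecn     # ECT(0)/ECT(1) pass through unchanged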

This has three potential problems (pointed out by Erik Nordmark as well):

a) If congestion gets serious, an ECN queue within the tunnel will revert to loss, so you have to measure loss anyway.

b) Deferring loss from a congested queue to the egress doesn't immediately remove load from the congested queue - it has to wait 1 RTT for the e2e congestion control to respond.

c) This assumes one CE equals one loss, which is true according to RFC3168, but we are trying to change that (and RFC3168 says we might want to change that in future).

I will try to rebut each one:

a) yes, admittedly, you have to measure loss anyway, but if loss is only ever a sign of serious congestion, you won't need to measure it so accurately. Comparing total bytes in with bytes out using only loose synch might be sufficient for loss detection.
If any loss at all is detected, the traffic controller would have to take drastic action. At the same time ECN measurements (indicating deferred losses) would still give fine-grained information to tune individual customers, or routing, or whatever.

b) Deferring losses to the egress does not delay the signals given to the e2e control loop. And ECN and AQM in general keep queues from overflowing even though control is delayed by 1 RTT. So this one is a non-problem.

c) If the relationship between CE and drop is standardised differently in future (e.g. no longer 1:1), the tunnel egress could translate CE markings into the right number of losses by reversing the mapping. The standardised mapping would have to be simple enough to make this work without complexity at the egress. For the mapping we have in mind (p_ECN = p_drop^2), I believe reversing the mapping at the egress would be trivial (tho still admittedly more processing cost).
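
If that mapping were standardised, reversing it at the egress would just be (my arithmetic, assuming p_ECN = p_drop^2):

import math

def equivalent_drop_fraction(p_ecn):
    # With p_ECN = p_drop^2 as the standardised mapping, the egress
    # recovers the drop-equivalent congestion level by a square root.
    return math.sqrt(p_ecn)

For example, a 1% CE-marking fraction would translate back to a 10% drop-equivalent congestion level.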

1.5 ConEx

You might want to mention that ConEx info would be an alternative (if deployed).

ConEx was deliberately designed to give congestion info from the measurement point onwards downstream until the ultimate destination. This is not the same as just over one tunnel segment, but I would argue that a domain needs to control traffic with knowledge of the congestion it is causing further downstream in other domains. Certainly a domain should not re-route traffic without knowledge of the downstream effects, otherwise domains will be continually re-routing traffic because of re-routes in other domains, and all the fighting could go unstable.

In draft-briscoe-conex-data-centre-02 I suggested that, even if ConEx is not widely deployed, mechanisms like your tunnel-congestion-feedback should use it if it does appear on packets in future, then the feedback would be unnecessary for such ConEx-capable packets (because the ConEx packets already give the info to the ingress).

1.6 Metering Granularity

The IPFIX examples in section 5.2 (p16) seem to meter at the granularity of the whole tunnel, not of traffic sources that the ingress might be able to control separately (e.g. individual customers, or sets of flows that match a hash so they can be re-routed separately as an aggregate).
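
For illustration (the bucketing policy is mine, not the draft's), finer granularity only needs the cumulative counters to be keyed by something the ingress can act on, e.g. a flow-hash bucket:

import zlib

NUM_BUCKETS = 64  # illustrative aggregate granularity

def bucket(flow_tuple):
    # Hash the flow identifier so sets of flows can be metered, and
    # later re-routed or throttled, as separate aggregates.
    return zlib.crc32(repr(flow_tuple).encode()) % NUM_BUCKETS

# One cumulative counter set per bucket instead of one per tunnel:
counters = {b: {"not_ect": 0, "ect0": 0, "ect1": 0, "ce": 0}
            for b in range(NUM_BUCKETS)}

def on_packet(flow_tuple, ecn, size):
    counters[bucket(flow_tuple)][ecn] += size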

1.7 Control Timescales and Stability

Altho the action taken in response to congestion is out of scope, you will need to state high-level requirements it must meet, e.g.:
* Control decisions must be delayed by more than a worst-case global RTT (e.g. 500ms), otherwise tunnel traffic management will not give normal e2e congestion control enough time to do its job, and the system could go unstable.
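
A toy sketch of that timescale requirement (the thresholds and names are illustrative, not a proposal):

import time

WORST_CASE_RTT = 0.5  # seconds; example worst-case global RTT

class CongestionManager:
    def __init__(self):
        self.last_action = float("-inf")

    def maybe_act(self, congestion_ratio):
        now = time.monotonic()
        # Hold off so that e2e congestion control gets at least one
        # worst-case RTT to react before the tunnel ingress acts.
        if now - self.last_action < WORST_CASE_RTT:
            return
        if congestion_ratio > 0.05:  # illustrative trigger level
            self.last_action = now
            self.manage_traffic()

    def manage_traffic(self):
        # Throttling or re-routing policy; out of scope here.
        pass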

I strongly suggest you avoid the term "Congestion Control" for this traffic management at the tunnel ingress. Instead, I suggest "Traffic Management", or "Congestion Management".

Control implies faster timescales than management.

1.8 Security Considerations

* The system has to be resilient to spoofed feedback messages causing the traffic manager to throttle a user or users, which would otherwise be "DoS by proxy". In the ingress controller model, SCTP's built-in protection against connection hijacking should be sufficient, but message authentication might be advisable. In the central controller model, message authentication would be essential.

* Can't think of any other potential attacks at present...


2. Document Structure

2.1 Separate INF & PS docs.

I think this doc will be too long for an implementer who just wants to be told what to do, not why. So I suggest you plan to write an INFormational requirements doc (this one) and a brief Proposed Standard for just the protocol spec.

2.2 Text that's in the wrong place

The doc has the right structure (in terms of headings), but certain pieces of the text aren't under the correct heading. E.g.:
* Last para of section 4 about loss ought to be quite early in the Intro, and there will need to be a lot more about loss all through the doc, including the Intro.
* There ought to be discussion of Direct vs Mediated feedback model in Section 4, instead of one model in Fig 4, then suddenly Fig 6 says there might be another model.
* Section 5 starts by talking about GTP, then it suddenly says it's not going to talk about GTP, but it's going to focus on IPFIX. This just needs turning round the other way.
* Section 6 (Benefits) ought to be earlier. There is already a lot of motivation text within the Introduction, and the Benefits ought to go with that (even if in a new Section 2 after the Intro, on motivation).
* A section enumerating formal requirements for candidate tunnel congestion feedback protocols is needed (with the resilience and timescale requirements I've mentioned here).

2.3 Introduction needs a rewrite

The Intro reads like it was written before the ideas in the rest of the doc had become clarified in your mind. I suggest you wait until the whole body of the doc is more stable, then note down the main messages you want to get over, scrap the current Intro and rewrite a new one.

3) Editorial

3.1 Motivation in the Intro and the Problem Statement are Misguided and Weak

Even tho I believe there are strong motivations for this work, I disagree with nearly everything said to motivate it.

Actually, RFC970 (by John Nagle) is still valid, and motivates this work, even tho the title doesn't sound like it would ("On Packet Switches With Infinite Storage"). Essentially it explains why e2e congestion control is not sufficient, because some users can cause harm to others (and it goes on to propose fair queuing). One could say that tunnel congestion feedback collects information from all the queues across a segment, so it can be used for better algorithms than fair queuing, because it gives more information than that from just the local queue.

That leads on to the next item missing from the motivation: it needs to justify why the ingress cannot use the locally visible traffic rates, and why congestion information is better:
* Using rates would require additional knowledge of downstream capacity and topology, as well as cross traffic that does not pass through this ingress
* Congestion information intrinsically measures the burstiness in the traffic as well as whether the average rate is too great for the capacity.

It ought to say that per-flow congestion control is not sufficient, because each flow doesn't know the whole effect of other flows from that user on all other users. The congestion control of one flow doesn't know whether it is one of millions of flows from the same user, or just two, and it doesn't know whether the other flows from the same user are carrying terabytes of data over time, or just kilobytes. And it doesn't know how other users compare.

Then there is the possibility of creating a new path. Only a network operator can do that. Even a transport like MPTCP can only choose from the paths available, it cannot create new ones, or bring in new capacity.

Motivation statements I disagree with:
* Intro:
  - Congestion is defined as if it is something bad that has to be eliminated. More care needs to be taken to define the difference between congestion and sustained high load. Congestion is an ambiguous term used in the IETF transport area for a level of drop or marking that shows that transport protocols are doing the best they can to fill the capacity. So lack of congestion is bad in our eyes.
  - The only e2e congestion control protocols it mentions are ECN-based, as if drop isn't relevant. Whereas, drop is universally relevant and ECN is only an ideal.

* Problem Statement
  - Whether hosts support ECN or not is not a motivation for this work. Drop or ECN give hosts sufficient info to do congestion control.
  - "To improve the performance
   of the network, it's better for operator to take network congestion
   situation into network traffic management."
  There is no justification for this statement, which is just plain wrong. As I explained earlier, if a number of networks re-route only on local congestion info without considering the whole path, it can lead to instability as they all fight each other. Whereas each host knows about the whole path, and there's considerable theory now about why TCP works so well.


3.2 Technical Nits

I am indebted to people who write in English even tho it's not their native tongue; I cannot write or read even one word in Chinese. I could understand the draft (with difficulty in places), but the draft will need to be corrected by a native English speaker eventually. However, clear writing starts with clear ideas. So, in this review I have focused on fixing some problems with the logical flow of ideas, and I have just given a couple of technical nits below.

S.4.1: "...indicates traffic that does not support ECN, for example UDP..."
ECN support for RTP over UDP is defined in RFC6679.

s/RED/AQM/ throughout.



Bob



________________________________________________________________
Bob Briscoe,                                                  BT