Re: [Dime] draft-ietf-dime-doic-rate-control-08

<bump>

> On Jun 14, 2018, at 4:38 PM, Ben Campbell <ben@nostrum.com> wrote:
> 
> Hi,
> 
> See my responses inline. I removed sections that seem resolved.
> 
> Thanks,
> 
> Ben.
> 
>> On Jun 13, 2018, at 12:30 PM, Steve Donovan <srdonovan@usdonovans.com> wrote:
>> 
>> See my comments inline.
>> 
>> Steve
>> 
>> On 5/25/18 4:17 PM, Gunn, Janet P (CNV) wrote:
>>> Not an author, but I have a strong interest in this ID.  Comments in line.
>>> Janet
>>> 
>>> -----Original Message-----
>>> From: DiME <dime-bounces@ietf.org> On Behalf Of Ben Campbell
>>> Sent: Wednesday, May 16, 2018 1:31 AM
>>> To: draft-ietf-dime-doic-rate-control.all@ietf.org
>>> Cc: dime@ietf.org
>>> Subject: [Dime] draft-ietf-dime-doic-rate-control-08
>>> 
>>> Substantive Comments:
>>> 
>>> General:
>>> The document seems inconsistent about whether rate limits are only reported during overload conditions, or in advance of overload conditions.
>>> 
>>> <JPG> I think that would be "local policy" of the serving (reporting) node, and independent of the protocol used to communicate it.  I think most cases would be reactive, but I can see situations where it could be proactive.<JPG>
>> SRD> I agree with Janet: when the report is sent is very much local policy.  There is no reason to attempt to prevent proactive use of the rate mechanism.
> 
> That’s fine with me, but a sentence or two to that effect would be helpful. (On re-reading, I see where section 1 talks about “approaching overload or overloaded”, but I assume neither of those conditions is necessary?)
> 
>>> 
>>> I’d like to see the need to allocate the rate limit across all potential sources of traffic given some more emphasis. (Maybe a sub-section of its own?)
>>> 
>>> <JPG> I agree, but again I see that as "local policy" of the serving (reporting) node. In particular, there may be reacting nodes that do not support the rate abatement algorithm.<JPG>
>> SRD> Again, I agree with Janet.  This is local policy and there may well be a mix of rate and loss if not all nodes support rate.  I don't think it is appropriate to say that rate should be preferred over loss.  But maybe I'm missing your meaning on "allocate the rate limit".
> 
> Okay, but you still have to allocate the rate limit across all sources that you apply the rate algorithm to, right? That is, the total offered rate will be something like (average rate per source) * (number of sources). Or am I misunderstanding something?
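The allocation question above can be made concrete with a small sketch. It assumes the reporting node knows which sources use the rate algorithm and splits its total budget across them in proportion to a weight. The function name, the source names, and the weights are all hypothetical; the draft leaves the actual split to local policy.

```python
def allocate_rate(total_rate, weights):
    """Split total_rate (messages/second) across sources in proportion to weights.

    Hypothetical helper: how a reporting node divides its budget is local policy.
    """
    if total_rate < 0:
        raise ValueError("total_rate must be non-negative")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return {src: total_rate * w / total_weight for src, w in weights.items()}


# Three equally weighted sources sharing a 75 msg/s budget.
even = allocate_rate(75.0, {"agent-a": 1, "agent-b": 1, "agent-c": 1})

# One source weighted twice as heavily as the other.
weighted = allocate_rate(75.0, {"agent-a": 2, "agent-b": 1})
```

Whatever the weights, the per-source limits add up to the total budget, which is the constraint being asked about.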
> 
> 
>>> 
>>> 
>>> §1:
>>> - “ While this can effectively decrease the load handled by the
>>> server, it does not directly address cases where the rate of arrival
>>> of service requests increases quickly."
>>> 
>>> I think it fails to address cases where the load changes rapidly in either direction, right? At least, the following text seems to say that.
>>> 
>>> <JPG> I agree.  When there are rapid fluctuations in the offered load, the "loss" algorithm errs both in  throttling TOO MUCH when there is a dip in offered load, and throttling NOT ENOUGH when there is a spike in offered load.<JPG>
>> SRD> The text in section 1 talks about this already.  Is there a specific change being suggested?
> 
> Section 1 talks about rapidly increasing load. Did I miss mention of rapidly _decreasing_ load?
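To illustrate the point about load changing in either direction, here is a minimal sketch comparing the two abatement styles. The numbers are made up: a server that wants about 60 msg/s, and a loss percentage that was chosen while the offered load was 100 msg/s. The function names are mine, not the draft's.

```python
def loss_abatement(offered, reduction_pct):
    """Traffic sent when a fixed percentage of offered requests is dropped."""
    return offered * (100 - reduction_pct) / 100


def rate_abatement(offered, max_rate):
    """Traffic sent when requests are capped at an absolute rate."""
    return min(offered, max_rate)


# Offered load (msg/s): steady, spike, dip. Values are illustrative only.
offered_loads = [100, 200, 40]

# 40% loss was the right setting at 100 msg/s offered.
with_loss = [loss_abatement(o, 40) for o in offered_loads]

# A 60 msg/s rate limit needs no retuning as the offered load moves.
with_rate = [rate_abatement(o, 60) for o in offered_loads]
```

With these numbers the loss setting lets 120 msg/s through during the spike (not enough throttling) and only 24 msg/s during the dip (too much), while the rate limit yields 60 and 40 respectively.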
> 
>>> 
>>> 
>>> 
>>> §3: Does the need for future report types to consider the rate algorithm have IANA implications?
>> SRD> Are you suggesting that the IANA section indicate that all new report types MUST indicate whether or not the rate algorithm can be used with that report type?  I can make that change to the IANA section if it would be appropriate.
>>> 
> 
> I’m not suggesting, I’m asking :-)  But my point was more about whether the IANA registry for report types should include a field about supporting the rate algorithm.
> 
> 
>>> §5.1: The first paragraph indicates state should be kept for every reacting node to which it sends an OLR. But the 5th paragraph can be interpreted to say it sends an OLR to every reacting node with which it has negotiated use of the rate algorithm. (see general comment).
>> SRD> I'm missing something.  The first paragraph says the reporting node maintains state any time it sends a rate overload report.  The fifth paragraph is just saying that the report must include the rate information.
> 
> Actually, my question relates to my question on 5.4. The first paragraph says to keep state for every reacting node to which it sends an OLR. The 5th paragraph implies that it sends an OLR when it selects the rate algorithm for a reacting node. (If you are going to include the rate info, you have to have a report to include it in.)
> 
> So it comes down to clarity about what events happen “when the reporting node selects the algorithm for a reacting node” vs “when the reporting node sends an OLR”.
> 
>>> 
>>> §5.4: The first paragraph seems to suggest the reacting node keeps OCS for every server that has indicated support for the rate algorithm, not just nodes that have sent OLRs. Is that the intent?
>> SRD>  Yes, that is the intent.  This allows the reacting node to make sure that the machinery needed to respond to a rate request is in place prior to receiving an OLR.
> 
> See above.
> 
> 
>>> 
>>> §5.6, first paragraph: The MAY seems weak here. I know and agree that we don’t want to force a particular application. But don’t we need to say that if an implementation uses a different algorithm, it MUST have the same behavior as the algorithm in section 7?
>>> 
>>> <JPG> I think it MUST "limit the message rate to the OC-Maximum-Rate AVP value in units of messages per second" (as stated in 7.3.1).  The algorithm described in the rest of 7.3.1 and 7.3.2 is somewhat more sophisticated, allowing for a smoothing factor (TAU) and prioritization.  I do not think we need  to say that the selected algorithm MUST have those features.<JPG>
>> SRD> Again, I agree with Janet.
> 
> Okay.
> 
> 
>>> 
>>> §7.2, third and 4th paragraphs: I don’t understand what this is trying to say. Please elaborate.
>>> 
>>> <JPG>3rd para - Just as a "for instance": if the reacting node has 50/second low priority messages and 50/second high priority messages that it wants to send, and has a rate limit of 75/second, it will send 25/second low priority messages and 50/second high priority messages.  The limit of 75/second applies to the combined stream of high and low priority messages, even though only the low priority messages are being abated.<JPG>
>>> 
>>> <JPG> 4th para - in the same example, it could be that the high priority messages typically require more processing resources (cpu, etc) than the low priority messages (or vice versa).  So cutting the rate to 75/sec may NOT produce the expected reduction in resource usage.<JPG>
>> SRD> Thanks Janet, I couldn't have explained it better.
> 
> Janet’s explanation is good, but I don’t get that from the text, at least in the case of the third paragraph.
> 
> In the 4th paragraph, I don’t understand how the reporting node would “take into account the workload” in a useful way. That seems to suggest the reporting node can predict the impact of workload based decisions made by the reacting node, which seems unlikely unless there is some out-of-band agreement.
> 
> Is this really saying anything more than “The reacting nodes will decide which messages to send and which to drop, and the result may not be predictable by the reporting node”?
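Janet's 50/50/75 example from the §7.2 discussion above can be written out as a short sketch. It assumes strict priority ordering at the reacting node, and the per-message processing costs (2 units for high priority, 1 for low) are invented purely to illustrate the 4th-paragraph concern.

```python
def admit_by_priority(demand, limit):
    """Fill the rate limit from the highest priority class downward.

    demand: list of (class_name, rate) pairs, highest priority first.
    """
    remaining = limit
    admitted = {}
    for name, rate in demand:
        sent = min(rate, remaining)
        admitted[name] = sent
        remaining -= sent
    return admitted


def workload(rates, cost_per_message):
    """Total processing cost per second for the given per-class rates."""
    return sum(rates[name] * cost_per_message[name] for name in rates)


demand = [("high", 50), ("low", 50)]
admitted = admit_by_priority(demand, 75)

# Hypothetical costs: a high priority message takes twice the work of a low one.
cost = {"high": 2, "low": 1}
work_before = workload(dict(demand), cost)
work_after = workload(admitted, cost)
```

Under these assumed costs the message rate drops by 25% (100 to 75) but the work drops only from 150 to 125 units, about 17%, which is the kind of mismatch the reporting node cannot predict without knowing the reacting node's choices.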
> 
> 
> 
>>> 
>>> 
>>> -6th paragraph: “  may receive requests at a rate below its target maximum Diameter  request rate while others above that target rate.  But the resulting request rate presented to the overloaded reporting node will converge towards the target Diameter request rate.”
>>> 
>>> Why do we expect traffic to converge to the rate limit? It seems like that won't happen if some reacting nodes are not sending at full capacity, unless work can be shifted from the high-rate sources to the slow-rate ones.
>>> 
>>> <JPG> Probably would be better to say that it "will converge  toward a rate at or below the target Diameter request rate.”<JPG>
>> SRD> I'm okay with making Janet's suggested change.
> 
> WFM.
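A quick worked example of why "at or below" is the safer wording. It assumes three sources that each received an equal 25 msg/s share of a 75 msg/s target; the offered rates are invented.

```python
def aggregate_rate(offered, per_source_limit):
    """Total rate reaching the reporting node when each source caps itself."""
    return sum(min(o, per_source_limit) for o in offered)


# One source offers less than its share; its unused share is not reassigned.
below_target = aggregate_rate([10, 40, 40], 25)

# Every source offers at least its share.
at_target = aggregate_rate([40, 40, 40], 25)
```

The first case settles at 60 msg/s, below the 75 msg/s target, and only reaches the target if the limits are reallocated toward the busier sources.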
> 
> 
>>> 
>>> §7.3.1: paragraph starting with “ In situations where reacting nodes are configured with some knowledge”
>>> 
>>> that requires knowledge of other traffic sources, not just knowledge of the reporting node.
>>> 
>>> The example code says to transmit a message if (Xp <= TAU). But the text said the limit was “T+TAU”.
>>> 
>>> <JPG> I think it is supposed to be "T+TAU"<JPG>
>> SRD> I'd like to get Eric's opinion on this.  This section was copied from the SOC RFC, so if it is in error here then it is in error there as well.
> 
> I agree with getting Eric’s opinion :-)
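While waiting for Eric, here is a sketch of a generic continuous-state leaky bucket that may help frame the Xp <= TAU versus T+TAU question. This is not the draft's pseudocode. The class and attribute names are mine, and I am assuming the usual formulation in which the bucket drains one unit per unit of time and grows by T on each admitted request.

```python
class LeakyBucket:
    """Generic continuous-state leaky bucket (illustrative sketch only)."""

    def __init__(self, target_rate, tau):
        self.T = 1.0 / target_rate  # nominal spacing between requests, seconds
        self.tau = tau              # tolerance, seconds
        self.x = 0.0                # bucket level after the last admission
        self.lct = None             # time of the last admitted request

    def admit(self, ta):
        """Return True if a request arriving at time ta may be sent."""
        if self.lct is None:
            xp = 0.0                         # first arrival: bucket is empty
        else:
            xp = self.x - (ta - self.lct)    # level after draining since lct
        if xp > self.tau:
            return False                     # too full: do not send
        self.x = max(0.0, xp) + self.T       # admitted: add one request's worth
        self.lct = ta
        return True


bucket = LeakyBucket(target_rate=1.0, tau=1.0)
burst = [bucket.admit(0.0) for _ in range(3)]  # three arrivals at the same instant
level_after_burst = bucket.x
later = bucket.admit(1.0)                      # one second later
```

In this formulation the admission test compares the drained level Xp against TAU, but the stored level right after an admission can be as high as T+TAU. If the draft intends the same thing, the code and the text might both be right and just describe different quantities.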
> 
> 
>>> 
>>> §9: I think the security considerations need more thought. What are the security considerations specific to the rate algorithm? If there aren’t any, then please describe the rationale behind that. But I suspect there are. For example, can this be used for a DoS? Can it be used to help _mitigate_ a DoS? Could one reacting node cause others to be traffic starved?
>>> 
>>> <JPG>It is possible that a reacting node that does not support overload control could starve the nodes that do support overload control, but this is also true of the loss based version<JPG>
>> SRD> I'm not convinced that there are security scenarios that are new or different for rate versus those documented in the existing DOIC specifications.
> 
> This draft adds a new feature that isn’t in the base mechanism. I gather your point is to say that the new feature shares the same security considerations as the base, and adds no new ones. Right now, section 9 only states the former.
> 
>>> 
>>> Editorial Comments:
>>> 
>>> General: IDNits returns several issues. Some of those may be errors on its part, but I’m pretty sure some of them are real. Please resolve these.
>> SRD> I'll look at those next time I try to submit but I've not gotten IDNits errors in the past.
>> 
>> SRD> I've made the suggested changes below unless indicated otherwise.
>>> 
> 
> … and I’ve deleted sections that seem resolved.
> 
> […]
> 
> 
>>> 
>>> §5.1, third paragraph: The text is not clear whether this means OCS should be maintained per supported application, etc, or that it should maintain state when the rate algorithm on a per supported application, etc, basis.
>> SRD> I don't understand the point being made here.
> 
> Probably because I failed to make it.
> 
> I _think_ that the reporting node keeps state for each reacting node for which it selects rate, and further needs to subdivide that state by application and report type. But it doesn’t need to keep state for a reacting node for which it doesn’t select rate, even though that reacting node might have an application in common with the first reacting node?
> 
> […]
> 
> 
>>> 
>>> §6.1.1, definition of " OLR_RATE_ALGORITHM”: Two periods at end of sentence.
>>> 
>> SRD> I am hesitant to change any of the text in section 7 given that it is taken from the SOC specification.  I would prefer that Eric comment on these proposed changes before including them in the next version.
> 
> Makes sense.
> 
>>> §7.1, 2nd paragraph: “ signal one another support for rate-based overload
>>> control”: This seems awkward; are there missing words?
>>> 
>>> §7.2, last two paragraphs: The MUSTs do not seem necessary. 2119 keywords should be used when there is some sort of choice or room for error. You don’t need them to define the basic operation of the protocol.
>>> 
>>> §7.3.1: I found the text hard to follow. It would help to declare all the identifiers and initialization up front, and to present things in more of a stepwise fashion.
>>> 
>>> - T is effectively a time interval, right? It would help to say that, especially later when you subtract a different time interval from it.
>>> 
>>> - paragraph 9: Should “admit” be “emit”?
>>> 
>>> - the example code has several mentions of SIP requests.
>>> 
>>> §7.3.2: “ Request candidates for reduction, requests not subject to reduction (except under extenuating circumstances when there aren’t any messages in the first category that can be reduced).”: That seems like an awkward way to say that the second category is the set of requests that is only subject to reduction if there are no messages left in the first category.
>>> 
>>> <JPG> Yes, that is what it means.<JPG>
>>> 
>>> - “ This can be generalized to n priorities using n thresholds for n>2 in the obvious way.”: I suggest you refrain from calling it “obvious”.
>>> 
>>> §7.3.3: Paragraph starting with “ Then (only) if the arrival is admitted, increase the bucket by an amount…”: I think you increase the bucket _count_, right?
>>> 
>>> _______________________________________________
>>> DiME mailing list
>>> DiME@ietf.org
>>> https://www.ietf.org/mailman/listinfo/dime
> 
> 
>
