Re: [NSIS] AD review: draft-ietf-nsis-ntlp-09 (M1: congestion control/rate limiting issues)

Hancock, Robert wrote:
> Hi all,
> 
> comments on the congestion control/rate limiting issues
> (M1 and in fact L16-part as well):
> 
>> -----Original Message-----
>> From: Magnus Westerlund [mailto:magnus.westerlund@ericsson.com] 
>> Sent: 01 June 2006 15:23
>> To: nsis@ietf.org
>> Subject: [NSIS] AD review: draft-ietf-nsis-ntlp-09
>>
>> M1. Congestion Control
>>
>> This affects text not only in 5.3.3 but also 7.1.3 and possibly other 
>> places. But I do have a general concern that the congestion control 
>> measures described in the specification are underspecified.
>>
>> First, in 5.3.3 I don't see any normative minimal values, or even
>> recommended values, for T1 and T2 that will be safe to deploy on the
>> Internet. I don't find it acceptable that the developer needs to
>> investigate which values are safe to use and which are not. There
>> should also be some criteria documented for when it is acceptable to
>> go beyond these values. So my concern here is that retransmission
>> runs havoc and creates way too many packets.
> 
> This is a fair point. I suppose it is impossible to get away
> permanently without giving concrete numbers in this case.
> 
> In terms of what sort of values to use and how to describe the
> constraints on them, I think a good model to follow is probably
> the SIP INVITE transaction specification (17.1.1.2 of rfc3261)
> which says [fortunately the timer names are the same...]
> 
> "  The default value for T1 is 500 ms.  T1 is an estimate of the RTT
>    between the client and server transactions.  Elements MAY (though it
>    is NOT RECOMMENDED) use smaller values of T1 within closed, private
>    networks that do not permit general Internet connection.  T1 MAY be
>    chosen larger, and this is RECOMMENDED if it is known in advance
>    (such as on high latency access links) that the RTT is larger.
>    Whatever the value of T1, the exponential backoffs on retransmissions
>    described in this section MUST be used."
> 
> and later that T2 should be 64*T1. In our case, 500ms seems a reasonable
> default also; I think T2<=64*T1 since there is a separate bound on the
> T2 value from the signalling application (see the second paragraph of
> 5.3.3). I would be tempted to relax the NOT RECOMMENDED clause, since
> a smaller timeout would be valid and possibly quite useful on a wider
> range of networks, in particular Internet-connected networks but
> where it is known that the Query should be answered within the local
> network. Comments and text suggestions welcome.

I think this is a reasonable start. It seems to me to be a good idea to 
allow adaptation of the timers based on feedback from the peer. It 
seems possible to use the GIST Query and the Response to measure the 
current RTT. Allowing modification of the timers to track that value + 
some processing delay + fudge factors seems one way forward.
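
A minimal sketch of that adaptation, in Python. Everything named here is
hypothetical and not taken from the draft: the class AdaptiveT1, the
smoothing gain alpha, the processing_delay and fudge values, and the floor.
It assumes a simple exponentially weighted RTT average fed by Query/Response
round trips, with the 500 ms default used until a sample exists, and it
assumes that the retransmission interval doubles from T1 until it would
exceed T2.

```python
class AdaptiveT1:
    """Tracks a smoothed Query/Response RTT and derives T1 from it.

    All parameter values are illustrative assumptions, not values
    from draft-ietf-nsis-ntlp.
    """

    def __init__(self, default_t1=0.5, alpha=0.125,
                 processing_delay=0.05, fudge=2.0, floor=0.05):
        self.default_t1 = default_t1              # used before any RTT sample
        self.alpha = alpha                        # smoothing gain for new samples
        self.processing_delay = processing_delay  # allowance for peer processing
        self.fudge = fudge                        # safety multiplier
        self.floor = floor                        # never go below this T1
        self.srtt = None                          # smoothed RTT in seconds

    def on_response(self, rtt_sample):
        """Feed one measured Query/Response round-trip time (seconds)."""
        if self.srtt is None:
            self.srtt = rtt_sample
        else:
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_sample

    def t1(self):
        """Current T1: default until measured, then RTT + delay, scaled."""
        if self.srtt is None:
            return self.default_t1
        return max(self.floor, self.fudge * (self.srtt + self.processing_delay))


def backoff_intervals(t1, t2):
    """Yield retransmission intervals, doubling from t1 while <= t2."""
    interval = t1
    while interval <= t2:
        yield interval
        interval *= 2
```

With the defaults discussed in this thread (T1 = 500 ms, T2 = 64*T1 = 32 s),
list(backoff_intervals(0.5, 32.0)) gives seven intervals, from 0.5 s up to
32 s.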

> 
>> Secondly, there are also no values documented for the rate control. I
>> think it is necessary to document what Internet-safe values are here
>> so that one does not cause problems. In addition, it seems a bit
>> simplistic to use a token bucket with some parameters selected based
>> on the local link, as GIST clearly sends messages beyond the local
>> link. Thus one might have to consider being a bit smarter and more
>> adaptive to what is seen for different flows.
> 
> Here I am not so sure. The text here was informed by the equivalent
> discussion for ICMPv6 (reference [26], now RFC4443), which caused
> an extensive thread on the v6 mailing list (start at
> http://www1.ietf.org/mail-archive/web/ipv6/current/msg01343.html +
> another 60 or so messages).
> 
> GIST is not ICMP but many of the same issues arise: messages are
> generated in the IP 'control plane' (in so far as this is a
> meaningful term), partly autonomously but mainly in response to
> events initiated by end systems, the messages go beyond the local
> link, rules have to be written so they apply to a host and a core
> router and everything in between. The end result for ICMP was to 
> write something minimal. (The link bandwidth here is used as an
> indicator for where a router is in a network - core/access/whatever.
> It's clearly imperfect but there's nothing else apart from dynamic
> adaptiveness to make use of, and fixed values seem even worse.)
> 
> We'd really like to avoid adaptiveness in the D-mode state machine.
> The main use of D-mode should be for Queries/Responses for which 
> adaptation is not meaningful for initial messages (there is no
> pre-existing state); if there is a large amount of signalling data
> to send for a given flow, then GIST should transition to C-mode
> anyway, and the rate limits are chosen to be cautious to encourage
> that. We aimed for robustness and simplicity rather than performance.
> 
> It might be possible to use some sort of adaptiveness to select
> an appropriate rate to apply to refresh queries used for GIST
> probing (see also your point at the end of L16). The current situation
> is that you can probe as fast as you like until you hit the rate limit,
> and that it's up to the implementer to decide how fast is really
> necessary depending on an assessment of route stability (for which
> I don't know any good objective estimator). On the assumption that 
> most probes will go to the peer you already know about, one could
> refine this to apply a separate token bucket limiter for probe 
> messages towards that peer, which was adapted according to knowledge
> of congestion state with that peer (based on message loss). We need
> input on whether that complexity is really necessary however, since
> it doesn't change the situation for the whole of D-mode but just a
> particular subset of it.

I can understand the reluctance to use adaptation. I am also fine with 
not using adaptation as long as the traffic is not causing any severe 
problem. The question is whether the parameters present in the ICMP RFC 
are good enough for GIST as well. Has someone made any analysis of this? 
What bit-rate are we talking about? Do we need to provide guidance on 
when the default parameters present in ICMP are appropriate to use for 
GIST?
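
To make the bit-rate question concrete, here is a rough Python sketch of a
token bucket limiter for D-mode messages, plus the sustained bit-rate it
implies. The rate, burst and message size used in the usage note are
placeholders chosen for illustration, not values from the draft or a
recommendation, and the names TokenBucket and sustained_bitrate are
hypothetical.

```python
class TokenBucket:
    """Token bucket: at most `burst` messages at once, `rate` per second on average."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate            # tokens (messages) added per second
        self.burst = burst          # bucket depth, i.e. the largest burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now             # time of the last refill

    def allow(self, now):
        """Return True if one message may be sent at time `now` (seconds)."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def sustained_bitrate(rate, message_bytes):
    """Long-run bit-rate (bits/s) if every token is spent on one message."""
    return rate * message_bytes * 8
```

For example, with an assumed rate of 10 messages per second and an assumed
500-byte message, sustained_bitrate(10, 500) is 40000 bits/s, and a burst
depth of 10 adds at most 10 back-to-back messages on top of that. Whether
such figures are safe for GIST is exactly the open question.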

> 
>> Third, the implications and congestion issues with local repair seem
>> to have been brushed over. Section 7.1.3 does indicate that you need
>> to take care, but nothing more. Is there some potential for
>> aggregation of the queries to minimize the load and have quicker
>> convergence?
> 
> There is no mechanism that I can think of. Certainly there is
> no aggregation possible in general, since every affected flow
> might be affected differently, especially if next-NSIS-routers
> are many hops away. We depend on the rate limiting to prevent
> the generated Queries causing a flood, but that's about it. 
> (There are aggregation techniques for transmitting the notifications,
> but they take place at the NSLP level.)
> 

Okay, it might not be a real issue, as long as there is no risk that the 
load caused by local repair leads to instabilities by preventing other 
operations from happening. But if I understand correctly, most 
keep-alives towards NSLPs are done in C-mode.

Cheers

Magnus Westerlund

Multimedia Technologies, Ericsson Research EAB/TVA/A
----------------------------------------------------------------------
Ericsson AB                | Phone +46 8 4048287
Torshamsgatan 23           | Fax   +46 8 7575550
S-164 80 Stockholm, Sweden | mailto: magnus.westerlund@ericsson.com
