Re: [MEDIACTRL] [sip-overload] WGLC: draft-ietf-soc-overload-design

Janet P Gunn <jgunn6@csc.com> Tue, 31 August 2010 15:24 UTC

To: mediactrl@ietf.org, sip-overload@ietf.org

Comments on draft-ietf-soc-overload-design-01

Intro, third paragraph says:
“For example, a PSTN gateway that runs
   out of trunk lines but still has plenty of capacity to process SIP
   messages should reject incoming INVITEs using a 488 (Not Acceptable
   Here) response [RFC4412].”

While it is true that RFC 4412 does say to use 488 in this case, we have 
found that, in the real world, this can lead to incorrect mapping back to 
ISUP.  In at least some contexts, “503 with a Reason header field Q.850 
cause value of 34 (no circuit available)” may be used instead of 488.  (I 
believe this is covered in a PTSC document.)  So I suggest:

“For example, a PSTN gateway that runs
   out of trunk lines but still has plenty of capacity to process SIP
   messages should reject incoming INVITEs using a response such as 488
   (Not Acceptable Here), as described in RFC4412.”

After this paragraph, I would add a new paragraph saying something like:

“There are other failure cases in which a SIP server also serves non-SIP 
traffic (e.g., RTP packets, database queries and updates, event handling) 
which can lead to server overload.  These other loads may, or may not, be 
correlated with the SIP message volume. The server is unable to process 
all SIP requests due to resource constraints, but simply reducing the flow 
of SIP messages may not sufficiently reduce the load to avoid congestion 
collapse.  In this context, it is to be expected that the server has some 
other method of overload control addressing these other sources of load. 
However, the specifics of the overload control for other traffic types, 
and the coordination of the different overload controls, are out of scope 
for this document.” 

This should address Partha’s, and others’ concerns.

Fourth paragraph.

In addition to the other problems with 503 and Retry-After, 503 is used 
for other situations (with or without Retry-After), not just SIP Server 
overload.  A SIP Overload Control process based on 503 would have to 
specify exactly which cause values trigger the Overload Control.
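
To make the point concrete, here is a minimal Python sketch of a check that tells an overload 503 apart from other 503s by the Q.850 cause value in the Reason header field. The set of cause values that count as overload (here only 42, switching equipment congestion) and the rule that a 503 without a Q.850 cause is not treated as overload are assumptions made for illustration; an actual mechanism would have to specify both. The function names are hypothetical.

```python
import re

# Assumption for illustration only: which Q.850 cause values mean
# "SIP server overload". 42 = switching equipment congestion.
# Cause 34 (no circuit/channel available) is deliberately NOT included.
OVERLOAD_Q850_CAUSES = {42}


def q850_cause(reason_header):
    """Return the Q.850 cause value from a Reason header value, or None.

    Simplified parsing: reason values are split on commas, so a comma
    inside a quoted text="..." parameter is not handled specially.
    """
    if reason_header is None:
        return None
    for value in reason_header.split(","):
        parts = [p.strip() for p in value.split(";")]
        if parts[0].upper() != "Q.850":
            continue
        for param in parts[1:]:
            match = re.fullmatch(r"cause\s*=\s*(\d+)", param, re.IGNORECASE)
            if match:
                return int(match.group(1))
    return None


def is_overload_503(status_code, reason_header=None):
    """True only for a 503 whose Q.850 cause is in the overload set."""
    if status_code != 503:
        return False
    return q850_cause(reason_header) in OVERLOAD_Q850_CAUSES
```

With this rule, a 503 carrying `Q.850;cause=34` (the "no circuit available" case from the Intro comment) would not trigger overload control, while one carrying `Q.850;cause=42` would.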

Section 2

Even when SIP messages are not dropped, significant delay can cause 
time-outs which lead to retransmission.  I would change the second 
sentence to 
“When SIP is running over the UDP protocol, it will retransmit messages 
that were dropped or excessively delayed by a SIP server due to overload 
and thereby increase the offered load for the already overloaded server.”
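
As a rough illustration of how delay alone multiplies load, the following Python sketch counts how many times a single INVITE is sent over UDP when the first response (provisional or final) is delayed. It assumes the RFC 3261 defaults: T1 = 500 ms, Timer A starting at T1 and doubling after each retransmission, and Timer B (transaction timeout) at 64*T1. The function name is hypothetical.

```python
T1 = 0.5  # seconds; RFC 3261 default round-trip estimate


def invite_transmissions(response_delay, t1=T1):
    """Count sends of one INVITE over UDP, including the initial send.

    response_delay is the time in seconds between the initial send and
    the arrival of the first response. Retransmission stops when that
    response arrives or when Timer B (64 * t1) fires.
    """
    timeout = 64 * t1        # Timer B
    sends = 1                # the initial transmission
    interval = t1            # Timer A starts at T1
    next_retransmit = t1     # absolute time of the next retransmission
    while next_retransmit < min(response_delay, timeout):
        sends += 1
        interval *= 2        # Timer A doubles each time it fires
        next_retransmit += interval
    return sends
```

Under these assumptions, a response delayed by 2 seconds means the server sees the same INVITE three times, even though nothing was dropped.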

At the end of section 2 you say
  “Another challenge for SIP overload control is that the rate of the
   true traffic source usually cannot be controlled.  Overload is often
   caused by a large number of UAs each of which creates only a single
   message.  These UAs cannot be rate controlled as they only send one
   message.  However, the sum of their traffic can overload a SIP
   server.”

In fact, the various wireless technologies do have methods for controlling 
the load “caused by a large number of UAs each of which creates only a 
single message.”  Some of these are of the form “pick a random number and 
see if it exceeds the threshold you have been given”.

Examples include Access Class Barring, and Access Persistence Mechanism. 
It would be possible to do something similar at the SIP level, though it 
would probably be redundant.
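
The “pick a random number” idea can be sketched in a few lines of Python. This is only an illustration of the general form, not of Access Class Barring or any other specific mechanism; the threshold range of 0 to 1 and the function name are assumptions.

```python
import random


def may_send(threshold, rng=random):
    """Decide whether a UA may send its single message.

    threshold is a value between 0 and 1 that the network has given to
    the UA. The UA draws a uniform random number in [0, 1) and sends
    only if the draw exceeds the threshold.
    """
    return rng.random() > threshold


# Each UA decides on its own, yet the aggregate load is reduced:
# with a threshold of 0.3, roughly 70% of the UAs end up sending.
rng = random.Random(1)
attempts = 10000
sent = sum(may_send(0.3, rng) for _ in range(attempts))
fraction_sent = sent / attempts
```

No individual UA is rate controlled here, but raising the threshold lowers the sum of their traffic.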

My suggested rewording would be:

“Another challenge for SIP overload control is controlling the rate of the 
true traffic source.  Overload is often caused by a large number of UAs 
each of which creates only a single message.  However, the sum of their 
traffic can overload a SIP server. The overload mechanisms suitable for 
controlling a SIP server (e.g., rate control) may not be effective for 
individual UAs.  In some cases, there are other non-SIP mechanisms for 
limiting the load from the UAs.  These may operate independently from, or 
in conjunction with, the SIP overload mechanisms described here.  In 
either case, they are out of scope for this document.”

Section 4

Your model is built on the premise of a “sending entity” and a “receiving 
entity”.  In the real world, not only is Server A sending SIP messages to 
Server B, but Server B is also sending SIP messages to Server A.

I don’t think you should clutter up your model by trying to address both 
directions at once, but you should state somewhere in the text that you 
have made that simplification/abstraction for ease of comprehension, and 
that any mechanism must work in the context of “SIP messages going both 
ways”.

My suggestion would be to add another sentence after 
“The model in Figure 1 shows a scenario with one sending and one
   receiving entity.  In a more realistic scenario a receiving entity
   will receive traffic from multiple sending entities and vice versa
   (see Section 6).”

The added sentence would be:
“In addition, in a more realistic scenario, SIP messages will be going in 
both directions, from B to A as well as A to B.  However, the overload 
control mechanisms in each direction can be considered independently.”

Then, in section 5.1, change 
“Each control loop between two servers is
   completely independent of the control loop between other servers
   further up- or downstream.” 
To
“Each control loop between two servers is
   completely independent of the control loop between other servers
   further up- or downstream, and of the control loop between the two
   servers in the other direction.” 
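
One way to picture the independence of the two directions is to key the control state by the ordered pair (sender, receiver). The Python sketch below is purely illustrative; the class and method names are hypothetical and the use of a loss percentage as the feedback value is an assumption.

```python
class ControlLoops:
    """Overload control state, one entry per direction between servers."""

    def __init__(self):
        # Key is the ordered pair (sender, receiver), so the A-to-B loop
        # and the B-to-A loop are separate entries.
        self._loss_percent = {}

    def set_feedback(self, receiver, sender, loss_percent):
        """Record that receiver asked sender to reduce traffic toward it."""
        self._loss_percent[(sender, receiver)] = loss_percent

    def loss_percent(self, sender, receiver):
        """Throttle for traffic from sender to receiver (0 if none set)."""
        return self._loss_percent.get((sender, receiver), 0.0)


loops = ControlLoops()
loops.set_feedback(receiver="B", sender="A", loss_percent=10.0)
a_to_b = loops.loss_percent("A", "B")  # throttled: B is overloaded
b_to_a = loops.loss_percent("B", "A")  # unaffected: A gave no feedback
```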

Section 8, 
second paragraph
After “An
   overload control mechanism should ensure that the delay encountered
   by a SIP message is not increased significantly during periods of
   overload.”
Add
“Significantly increased delay can lead to time-outs, and retransmission 
of SIP messages, making the overload worse.”

“Reactiveness” doesn’t seem the right word to me.  “Responsiveness” sounds 
better to me.

End of section 8
Another important metric is the (CPU) load used by the overload “monitor” 
and “actuator”.

End of section 9
Suggest changing 
“Explicit overload control
   mechanisms can be differentiated based on the type of information
   conveyed in the overload control feedback and whether the control
   function is in the receiving or sending entity (receiver- vs. sender-
   based overload control).” 
To
“Explicit overload control
   mechanisms can be differentiated based on the type of information
   conveyed in the overload control feedback and whether the control
   function is in the receiving or sending entity (receiver- vs. sender-
   based overload control), or both.” 

In 9.2, I think 
“A loss percentage enables a SIP server to ask an upstream neighbor to
   reduce the number of requests it would normally forward to this
   server by a percentage X. For example, a SIP server can ask an
   upstream neighbor to reduce the number of requests this neighbor
   would normally send by 10%.  The upstream neighbor then redirects or
   rejects X percent of the traffic that is destined for this server.”
Should be
“A loss percentage enables a SIP server to ask an upstream neighbor to
   reduce the number of requests it would normally forward to this
   server by X%. For example, a SIP server can ask an
   upstream neighbor to reduce the number of requests this neighbor
   would normally send by 10%.  The upstream neighbor then redirects or
   rejects 10% of the traffic that is destined for this server.”
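
A minimal Python sketch of what the upstream neighbor might do with a loss percentage follows. Random selection of the requests to reject is an assumption (the draft text does not say how the X% are chosen), and the class and method names are hypothetical.

```python
import random


class LossThrottle:
    """Upstream-side throttle driven by a loss percentage from downstream."""

    def __init__(self, loss_percent=0.0, rng=None):
        self.loss_percent = loss_percent
        self.rng = rng or random.Random()

    def update(self, loss_percent):
        """Apply new feedback from the downstream server."""
        if not 0 <= loss_percent <= 100:
            raise ValueError("loss_percent must be between 0 and 100")
        self.loss_percent = loss_percent

    def admit(self):
        """True to forward the request, False to redirect or reject it."""
        # The draw is uniform in [0, 100), so 0% admits everything
        # and 100% admits nothing.
        return self.rng.random() * 100 >= self.loss_percent


throttle = LossThrottle(rng=random.Random(1))
throttle.update(10)  # downstream asks for a 10% reduction
requests = 10000
forwarded = sum(throttle.admit() for _ in range(requests))
fraction_forwarded = forwarded / requests  # close to 0.9
```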

End of 9.2
WRT:
“Thus, percentage throttling requires an adjustment of the throttling
   percentage in response to the traffic received and may not always be
   able to prevent a server from encountering brief periods of overload
   in extreme cases.”
This is not unique to percentage throttling.  It is possible in rate-based 
and window-based methods as well.  In all cases, it is heavily dependent 
on the frequency of updates by the control mechanism.  But that needs to 
be balanced against the load generated by the control mechanism.  I am not 
sure whether it makes sense to say something in each method, or put it up 
front as a general comment.
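
A small worked example of stale feedback, as a Python sketch. The numbers and the two scenarios are hypothetical: in the first, the offered load surges between updates under percentage throttling; in the second, the server's usable capacity drops between updates (for instance because of other, non-SIP load) under a rate cap. The function names are made up for illustration.

```python
def admitted_percentage(offered_load, loss_percent):
    """Requests/s reaching the server under loss-percentage throttling."""
    return offered_load * (1 - loss_percent / 100)


def admitted_rate(offered_load, rate_cap):
    """Requests/s reaching the server under a rate cap."""
    return min(offered_load, rate_cap)


def excess(admitted, capacity):
    """Requests/s beyond what the server can process."""
    return max(0.0, admitted - capacity)


# Percentage throttling: 10% was right for 110 req/s against a capacity
# of 100 req/s, but the offered load jumps to 200 req/s before the next
# feedback update.
pct_before = excess(admitted_percentage(110, 10), capacity=100)
pct_after = excess(admitted_percentage(200, 10), capacity=100)

# Rate-based: a cap of 100 req/s was right for a capacity of 100 req/s,
# but the usable capacity drops to 60 req/s before the next update.
rate_before = excess(admitted_rate(200, rate_cap=100), capacity=100)
rate_after = excess(admitted_rate(200, rate_cap=100), capacity=60)
```

In both scenarios the excess is zero before the change and positive afterwards, until the next feedback update arrives. More frequent updates shorten that window, at the cost of more load from the control mechanism itself.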

Sec 9.4
Here again, remember that there are many other things that can generate 
503, with or without Retry-After.

Sec 11
Last paragraph add:
“Conversely, the semantics of any proposed approach should permit a 
variety of different algorithms.”




Nits/wordsmithing
Note at end of section 6, change “different than” to “different from”.

Section 12 first para
Change 
“Overload control can require a SIP server to prioritize requests and
   select requests that need to be rejected or redirected.”
To
“Overload control can require a SIP server to prioritize requests and
   select requests to be rejected or redirected.”

Sec 12
Third para
Change 
“Responses should not be targeted when a SIP server is trying to
   reduce load for a number of reasons.”
To
“For a number of reasons, SIP responses should not be dropped in order to
   reduce SIP processing load.”


Janet