RE: [PCN] traffic matrix scenario

Lars, All 

> But this only means that there is a *potential* for a problem.
> 
> Do you have an estimate on how likely it is that a sufficient 
> number of new flows start transmitting over these empty 
> aggregates at roughly the same time, such that they together 
> can push the bottleneck straight into overload?

We cannot really give any measure of likelihood.  In this deployment
scenario, at least, we design our network so PCN almost never rejects
calls unless the traffic matrix is highly anomalous.  Such anomalies are
inherently unpredictable.  For example, a scenario we sometimes consider
is an event which simultaneously destroys an exchange building and
causes a surge of traffic to and from the affected exchange region
consisting of emergency traffic mixed with concerned relatives calling
up to see what has happened.  We cannot tell you how likely this
scenario is.  But it is essential to us that PCN operates correctly in
this scenario, regardless of the scenario's precise likelihood.

So, what do we need to do to ensure that PCN does operate correctly in
the presence of extreme anomalous events?  We have a few parameters to
play with.  One is the margin between pre-congestion and actual
congestion.  Another is a cap on the rate at which new flows are
admitted.

What the statistics we gave before imply is that this rate must be quite
tightly capped, at least for ingress-egress aggregates with no flows.
The pulse of flows we admit before we get good feedback could be
multiplied across many other ingresses, and together this chunk of
admissions must be reliably kept below our safety margin.  If the rate
of admission is capped per ingress, then the multiplication factor
cannot reliably be assumed to be much smaller than the total number of
ingresses, 100 in our case.  If the rate of admission is capped per
aggregate, then our traffic matrix information suggests that we could
have a situation where many aggregates (a few thousand) are all ramping
up into the same bottleneck.  So the multiplication factor needs to be
of that order.

We can do the sums: given the safety margin, the time lag before
feedback becomes reliable, and the multiplication factor, we can directly
compute the maximum rate of admission per ingress and per aggregate.
Alternatively, given the smallest value we are willing to accept for
this maximum rate of admission, we can compute the needed safety margin.
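
As a rough sketch of those sums, assuming the simplest possible model:
every source admits at its capped rate for the whole time lag, all flows
are the same size, and the safety margin is expressed as a number of
flows.  The margin (1000 flows) and lag (2 seconds) below are purely
illustrative placeholders, not measurements; the multiplication factors
are the 100 ingresses and "a few thousand" aggregates (taken here as
3000) mentioned above.  The function names are made up for this example.

```python
def max_admission_rate(margin_flows, lag_s, multiplier):
    """Largest per-source admission rate (flows/s) such that the pulse
    admitted by `multiplier` sources during `lag_s` seconds stays
    within `margin_flows`.  Assumes equal-sized flows."""
    if lag_s <= 0 or multiplier <= 0:
        raise ValueError("lag_s and multiplier must be positive")
    return margin_flows / (lag_s * multiplier)


def required_margin(rate_flows_per_s, lag_s, multiplier):
    """Inverse calculation: safety margin (in flows) needed to absorb
    the pulse when each source may admit `rate_flows_per_s`."""
    return rate_flows_per_s * lag_s * multiplier


# Illustrative placeholders only (assumptions, not measured values).
MARGIN_FLOWS = 1000   # hypothetical safety margin, in flows
LAG_S = 2.0           # hypothetical time before feedback is reliable

# Multiplication factors from the discussion above.
N_INGRESSES = 100     # cap applied per ingress
N_AGGREGATES = 3000   # cap applied per aggregate ("a few thousand")

per_ingress = max_admission_rate(MARGIN_FLOWS, LAG_S, N_INGRESSES)
per_aggregate = max_admission_rate(MARGIN_FLOWS, LAG_S, N_AGGREGATES)

print(f"max rate per ingress:   {per_ingress:.3f} flows/s")
print(f"max rate per aggregate: {per_aggregate:.3f} flows/s")
print(f"margin for 5 flows/s per ingress: "
      f"{required_margin(5.0, LAG_S, N_INGRESSES):.0f} flows")
```

The point of the sketch is only that the permitted rate scales inversely
with both the time lag and the multiplication factor, so a per-aggregate
cap ends up far tighter than a per-ingress cap for the same margin.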

What should we take to be the time lag before the feedback becomes
reliable?  One component is the time between the admission control
decision and data actually reaching the ingress.  Can we assume this to
be reliably bounded?  The other components are propagation and queueing
delay, plus the time needed for the bottleneck queues to build and for
the congestion estimates to rise.  Do we have good estimates for these
times?

Ben Strulo

_______________________________________________
PCN mailing list
PCN@ietf.org
https://www1.ietf.org/mailman/listinfo/pcn