[bmwg] FW: draft-ietf-bmwg-igp-dataplane-conv

"Peter De Vriendt" <pdevrien@cisco.com> Wed, 22 October 2008 12:51 UTC

From: Peter De Vriendt <pdevrien@cisco.com>
To: bmwg@ietf.org
Date: Wed, 22 Oct 2008 14:52:51 +0200
Subject: [bmwg] FW: draft-ietf-bmwg-igp-dataplane-conv

 
Hi,

	 
I've reviewed
http://www.ietf.org/internet-drafts/draft-ietf-bmwg-igp-dataplane-conv-meth-16.txt
and
http://www.ietf.org/internet-drafts/draft-ietf-bmwg-igp-dataplane-conv-term-16.txt
and would like to propose changes that I consider important for a
complete, clear, consistent and correct benchmark, one that leaves no room
for misinterpretation or wrongly executed test cases.

I hope you will take the changes below into account.
	 
	 
Thanks,
	 
Peter
	 
	 
Terminology
==========
 
General comments:
 
There is no mention of how to measure convergence events that do not cause any traffic loss.
Some events, such as a metric change, may not cause any traffic loss.
IMO the terminology draft is not complete unless this is taken into account. I propose
corrections for this in the more specific sections below and take this behavior into
account when commenting.
 
More specific:
 
3.1. Convergence event: should have a time unit (yy/mm/dd hh:mm:ss:ms),
which indicates the moment the event occurs.
I would remove "better next-hop learned via a routing protocol" from the Discussion because it is
vague and confusing (a better next-hop is learned as a result of an event).
The convergence event should be seen as the trigger that causes the change from the
preferred egress to the next-best egress, and nothing more.
So its time is not necessarily the same as the convergence event instant, which
should be defined as follows:
3.8 Convergence event instant
The time at which a convergence event becomes observable in the data plane, i.e.
the time at which the DUT begins to exhibit packet loss or stops forwarding traffic
on the preferred egress interface.
 
Indeed, suppose we make a metric change at time t0.
The router will apply its LSA generation and SPF timers and will eventually
start rerouting traffic at time t1, which can be many seconds later.
So at t1 we will observe traffic loss on the preferred egress interface.
So t1 is the convergence event instant and t0 is the time of the convergence event.
 
I think it might even be better to use different and less confusing terms;
for example, we could call 3.1 the "convergence event trigger".
 
Note: the drawing in Figure 1 should be adapted accordingly, such that this difference
is clearly visible. 
 
 
 
3.3 Full convergence:
It is not clear if this is a state or a transition:
- state after convergence recovery instant
- transition from convergence event instant until convergence recovery instant.
 
The two are distinct and should be clearly defined.
 
Also, the state should not be confused with 'convergence time'.
I would therefore remove the paragraph "Full convergence may be measured ...." from the Discussion.
 
So what can be seen as convergence times?
 
(a) What can be called "convergence transition time":
the time from the convergence event trigger until the convergence recovery instant.
During this time loss may be seen (link failure) or no loss may be seen
(metric change or recovery of a link).

(b) "Convergence loss transition time":
the time from the convergence event instant until the convergence recovery instant, in case packet loss is observed.
Convergence loss transition time can be measured using the rate derived convergence time.
 
(c) "Route specific convergence time": see 3.15
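
To make the difference between (a) and (b) concrete, here is a minimal Python sketch. The function names and the timestamps (in seconds) are invented for illustration and do not come from either draft.

```python
from typing import Optional


def convergence_transition_time(trigger: float, recovery: float) -> float:
    """(a) Time from the convergence event trigger to the convergence recovery instant."""
    return recovery - trigger


def convergence_loss_transition_time(event_instant: Optional[float],
                                     recovery: float) -> Optional[float]:
    """(b) Time from the convergence event instant to the convergence recovery instant.

    Only defined when packet loss is observed; returning None otherwise is an
    assumption of this sketch.
    """
    if event_instant is None:
        return None
    return recovery - event_instant


# Hypothetical metric change: trigger at t0, loss first observed at t1,
# traffic fully recovered at t_rec.
t0, t1, t_rec = 0.0, 5.25, 5.5
print(convergence_transition_time(t0, t_rec))       # (a) includes the LSA/SPF timer delay
print(convergence_loss_transition_time(t1, t_rec))  # (b) only the period with loss
```

With these made-up numbers (a) is dominated by the timers between t0 and t1, while (b) only covers the period in which loss is actually seen.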
 
 
 
3.5 Route specific convergence: why not merge this with route specific convergence time?
Route specific convergence is neither a state nor an action.
I don't see its purpose, and it is confusing.
 
 
3.7. Convergence packet loss:
If you mention "packets that were delayed due to buffering" you should
define when a packet counts as delayed due to buffering.
Example: What if my switching path to the next-best egress interface is a longer path
(e.g. due to a different and slower device on that path)? Will you then
discard all packets that arrive, because they have a larger delay?
I think it would be better to remove "packets that were delayed due to buffering"
and let the tester determine when such packets should be discarded.

3.8 see above
 
3.10 First route convergence instant: should have a time unit (yy/mm/dd hh:mm:ss:ms).
 
3.14 Loss derived convergence time
The definition should mention that this gives only an average derived from the packet loss when
a convergence recovery transition or convergence event transition is present (i.e. convergence event/recovery transition > 0),
and hence that its use is not recommended in that case.
But it should also mention that, if the convergence event/recovery transition is 0,
it can give a very accurate value of full convergence.
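
A small Python sketch of the averaging effect follows. It assumes that loss derived convergence time is computed as lost packets divided by offered load, and that every route is offered the same load; the function name and all numbers are hypothetical.

```python
def loss_derived_convergence_time(outage_seconds, per_route_pps=1000):
    """Lost packets divided by offered load (assumed formula), in seconds.

    outage_seconds: how long each route loses traffic. per_route_pps is a
    made-up, equal offered load per route.
    """
    lost_packets = sum(per_route_pps * s for s in outage_seconds)
    offered_load_pps = per_route_pps * len(outage_seconds)
    return lost_packets / offered_load_pps


# Routes recover at different moments (transition > 0): the result is an average.
staggered = [1, 3]
print(loss_derived_convergence_time(staggered), max(staggered))

# All routes lose traffic for the same period (transition = 0): the result is exact.
simultaneous = [2, 2]
print(loss_derived_convergence_time(simultaneous), max(simultaneous))
```

In the staggered case the loss derived value (2 s) is the mean outage and understates the slowest route (3 s); in the simultaneous case both agree.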
 
 
3.15 Route-specific convergence time.
It is a mistake to assume that "convergence loss transition time" (or what you call full convergence)
equals max(route specific convergence time).
 
For example:
convergence event at t0.
prefix 1 loss starts at t1
prefix 2 loss starts at t2
prefix 1 recovery at t3
prefix 2 recovery at t4
t4>t3>t2>t1>>t0
	 
Assume t4-t2 > t3-t1

Here max(route-specific) = t4-t2 < t4-t1, which is the convergence loss transition time.
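
The same example with concrete numbers, as a short Python check. The timestamps (in seconds) are arbitrary values chosen only to satisfy the ordering above.

```python
# Hypothetical timestamps satisfying t4 > t3 > t2 > t1 >> t0 and t4-t2 > t3-t1.
t0, t1, t2, t3, t4 = 0, 10, 11, 12, 14

route_specific = {
    "prefix 1": t3 - t1,  # loss from t1 until recovery at t3
    "prefix 2": t4 - t2,  # loss from t2 until recovery at t4
}
max_route_specific = max(route_specific.values())

# First loss (convergence event instant, t1) until last recovery (t4).
loss_transition_time = t4 - t1

print(route_specific, max_route_specific, loss_transition_time)
```

Here the largest route-specific value is 3 s while loss is observed for 4 s in total, so the two quantities differ.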
 
3.17 First route convergence time
Equation 4b should be removed from the Discussion section.
It is not necessarily true. Example:
convergence event at t0.
prefix 1 loss starts at t1 = convergence event instant
prefix 2 loss starts at t2
prefix 2 recovery at t3
prefix 1 recovery at t4
t4>t3>t2>t1>>t0
	 
route-specific convergence for prefix 1 = t4-t1 > route specific convergence for prefix 2 = t3-t2 
	 
First route convergence according to Equation 4a = t3-t1 > Equation 4b, which gives t3-t2
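
A numeric version of this counterexample in Python. It reads Equation 4a as first recovery instant minus convergence event instant and Equation 4b as the minimum route-specific convergence time, as the example above implies; the timestamps are invented.

```python
# Hypothetical timestamps (seconds) satisfying t4 > t3 > t2 > t1 >> t0.
t0, t1, t2, t3, t4 = 0, 10, 12, 13, 15

loss_start = {"prefix 1": t1, "prefix 2": t2}
recovery = {"prefix 1": t4, "prefix 2": t3}

route_specific = {p: recovery[p] - loss_start[p] for p in loss_start}

event_instant = min(loss_start.values())   # t1
first_recovery = min(recovery.values())    # t3

eq_4a = first_recovery - event_instant     # t3 - t1
eq_4b = min(route_specific.values())       # t3 - t2

print(route_specific, eq_4a, eq_4b)
```

With these values Equation 4a gives 3 s and Equation 4b gives 1 s, so the two cannot both define the same quantity.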
	 
3.18 Reversion convergence time: I see no added value in this.
Going back to the normal state is an event like any other.
	 
	 
Methodology
=========
 
I think more needs to be written on which measurement method to use and when
(i.e. rate derived, loss derived (with a warning about the average) or route-specific).
 
For example, if the sampling interval is too large (1 sec) but the number of IGP routes is
relatively low (e.g. 5k), one could prefer to measure route-specific convergence for each of the 5k routes.

In this context,
3.6.2 is incorrect and incomplete (incorrect on loss derived).
 
Further comments:
3.2.8 Reversion Convergence time is not a metric (and should be removed - see terminology).
I think this section needs some rewriting once there is consensus on the changes to the terminology doc.
 
 
3.3. interface failure indication delay is not part of IGP timer values.
 
 
4. Test cases
For each .7 : why wait so long (30 sec) instead of making it 5 sec?
This would save the automation guys a lot of time ;-)


For test cases 4.4, 4.5, 4.6 and 4.7, the question is: what do you want to measure?
Do we measure "convergence transition time" (see term. 3.3 a) or
the "convergence loss transition time" (see term. 3.3 b)?
This should be clearly specified such that equipment vendors will have the same behavior.
	 
4.6 Convergence due to route withdrawal.
The doc is not clear on how to execute this test. I think we should limit it to
only one convergence trigger (i.e. one LS update packet withdrawing many routes),
rather than the multiple events (i.e. multiple LS update packets withdrawing many routes)
that could be understood from the text. Having multiple triggers will bias the results,
and it is said that only one event should trigger convergence (no nested events).
As an example, what if the tester sends the multiple LSAs at such a slow pace
that not all LSAs are received before SPF starts? For example, the last withdraw LSA
arrives at time 200 ms while the SPF starts at time 100 ms due to its short
SPF delay timers; a second SPF with slower timers might then be applied to
complete the full convergence. Hence we should limit it to only one trigger, meaning
only one LS update packet should withdraw all the routes.
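
The effect can be sketched with a toy model in Python. It assumes a fixed SPF delay that starts at the first LSA received while no SPF is pending, and ignores SPF computation time and any back-off to slower timers; real implementations differ, and the function name is made up.

```python
def count_spf_runs(lsa_arrivals_ms, spf_delay_ms):
    """Count SPF runs needed to process all LSAs under a fixed SPF delay.

    Toy model (an assumption of this sketch): an LSA arriving while an SPF is
    pending is covered by that SPF; an LSA arriving after the SPF has started
    schedules a new one.
    """
    runs = 0
    next_spf_start = None
    for arrival in sorted(lsa_arrivals_ms):
        if next_spf_start is None or arrival > next_spf_start:
            runs += 1
            next_spf_start = arrival + spf_delay_ms
    return runs


# One LS update packet carrying all withdrawals: a single trigger, one SPF.
print(count_spf_runs([0], spf_delay_ms=100))

# Withdrawals spread out, last LSA at 200 ms, SPF starts at 100 ms: two SPFs.
print(count_spf_runs([0, 200], spf_delay_ms=100))
```

In this model the paced case needs a second SPF run, so the measured convergence time depends on the tester's pacing rather than on the DUT alone.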
 
In reality, this test case corresponds to an ABR or ASBR (using OSPF terminology)
withdrawing inter-area or externally learned routes. The test case should
be more specific on this point.
 
4.7 Convergence due to cost change.
This is the same feedback as for 4.6.
It should be clear that there is only one trigger, hence only one LS update
packet should be sent indicating the cost change. As such, it will be the 
neighbor of the DUT that will generate an LSA increasing the cost for all its neighbors.
 
 
Regards,
 
Peter
			
