Re: [ippm] [spring] Monitoring metric to detect and locate congestion

"MORTON, ALFRED C (AL)" <acm@research.att.com> Sun, 01 March 2020 21:25 UTC

From: "MORTON, ALFRED C (AL)" <acm@research.att.com>
To: "Ruediger.Geib@telekom.de" <Ruediger.Geib@telekom.de>, "robert@raszuk.net" <robert@raszuk.net>
CC: "spring@ietf.org" <spring@ietf.org>, "ippm-chairs@ietf.org" <ippm-chairs@ietf.org>, "ippm@ietf.org" <ippm@ietf.org>
Thread-Topic: [spring] Monitoring metric to detect and locate congestion
Thread-Index: AQHV7Vqes6DvElEb3EuzJ92ptCCruag0Qfaw
Date: Sun, 01 Mar 2020 21:25:01 +0000
Message-ID: <4D7F4AD313D3FC43A053B309F97543CFED653C41@njmtexg5.research.att.com>
References: <FRXPR01MB03926E7F0A28B837D69C3C819CEA0@FRXPR01MB0392.DEUPRD01.PROD.OUTLOOK.DE><CAOj+MMHZWoVaP8-h17n86cMB_ZY0CjyO9GNShDjKM_NDTxOXxA@mail.gmail.com> <FRXPR01MB039210375C9806E4D47DD14C9CEA0@FRXPR01MB0392.DEUPRD01.PROD.OUTLOOK.DE><CAOj+MMFTOOEW-Q6meFmvMgJstS59-wk7nfWPBnW+m2fFmmStbw@mail.gmail.com> <FRXPR01MB03927B00B7202BC549DF7ECA9CEB0@FRXPR01MB0392.DEUPRD01.PROD.OUTLOOK.DE><CAOj+MMFKJjQdvk5jsVtbjg85oQ7yFYFQvV9WBYknDuCM+FM48g@mail.gmail.com> <FRXPR01MB03926B6680111B4E52F282F39CEB0@FRXPR01MB0392.DEUPRD01.PROD.OUTLOOK.DE>
In-Reply-To: <FRXPR01MB03926B6680111B4E52F282F39CEB0@FRXPR01MB0392.DEUPRD01.PROD.OUTLOOK.DE>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/8RgdGxJmMaWnLhuc_UVxpKW2qWY>
Subject: Re: [ippm] [spring] Monitoring metric to detect and locate congestion

Hi Rudiger,

I’m looking at your proposal again after several months.

Thanks for preparing the slides, which help.
I have a question about one of the inferred results
on slide 2, where it says:

Simultaneous packet loss or a change in delay of measurement
paths 1, 2 and 3 indicate loss of connectivity between LSR a and LER i

What are the details leading up to the conclusion above?
I might be more inclined to agree if the conditions were
“Simultaneous packet loss AND a change in delay of measurement”,
possibly because the data plane has been re-routed.
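
To make the difference between the two readings concrete, here is a minimal Python sketch. The names `PathObservation`, `or_rule` and `and_rule`, and the idea of reducing each path to two booleans per evaluation interval, are illustrative assumptions and not something defined in the draft or on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathObservation:
    """State of one measurement path in one evaluation interval (hypothetical model)."""
    loss: bool          # packet loss was seen on this path
    delay_change: bool  # delay shifted relative to the path's baseline


def or_rule(paths):
    """Slide wording: every path shows packet loss OR a change in delay."""
    return all(p.loss or p.delay_change for p in paths)


def and_rule(paths):
    """Stricter reading: every path shows packet loss AND a change in delay."""
    return all(p.loss and p.delay_change for p in paths)


# Paths 1, 2 and 3 all see a delay shift but no loss, e.g. after a re-route.
delay_only = [PathObservation(loss=False, delay_change=True)] * 3
print(or_rule(delay_only), and_rule(delay_only))  # True False
```

Under the OR reading, a pure delay shift on all three paths already triggers the "loss of connectivity" inference; under the AND reading it does not.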

Just a thought, more to consider here,
Al


From: ippm [mailto:ippm-bounces@ietf.org] On Behalf Of Ruediger.Geib@telekom.de
Sent: Thursday, February 27, 2020 5:42 AM
To: robert@raszuk.net
Cc: spring@ietf.org; ippm-chairs@ietf.org; ippm@ietf.org
Subject: Re: [ippm] [spring] Monitoring metric to detect and locate congestion

Hi Robert,

thanks for your support and comments. I’d be glad if other interested operator representatives indicate their interest on the list...

Regards, Ruediger

From: Robert Raszuk <robert@raszuk.net>
Sent: Thursday, 27 February 2020 10:38
To: Geib, Rüdiger <Ruediger.Geib@telekom.de>
Cc: ippm-chairs@ietf.org; SPRING WG <spring@ietf.org>; ippm@ietf.org
Subject: Re: [spring] Monitoring metric to detect and locate congestion

Hi,

As I mentioned in my first mail, I am a big supporter of end-to-end path measurement; I have mainly been focusing on the hop-by-hop timestamping approach.

So my comments were not meant to discourage end-to-end path probing in any way - it is extremely useful. They were just to get your point of view on alternative options for the specific goal here: congestion detection.

Will be watching how your proposal evolves with interest!

Cheers,
R.

On Thu, Feb 27, 2020 at 9:17 AM <Ruediger.Geib@telekom.de> wrote:
Hi Robert,

regarding scalability, I hope the difference between our positions is just whether it’s a router or a dedicated CPE. I don’t promote deploying 20k PCs (I hope to promote a metric to replace them). I prefer the dedicated CPE, but routers will do as well.

The telemetry threshold might be an option too, if congestion is to be detected. In September last year, we had to replace a line card in a production router whose ingress port hardware was corrupted. There were no drops, just random delay variations, which showed up in our performance measurement system. I wonder whether problems like that can be detected by evaluating telemetry.
Further, telemetry needs to work reliably when the router hardware is busy dealing with heavy loads (i.e., telemetry must be sufficiently privileged). External measurements on the forwarding layer don’t rely on router-internal processing resources.

Regards,

Ruediger


From: Robert Raszuk <robert@raszuk.net>
Sent: Wednesday, 26 February 2020 14:19
To: Geib, Rüdiger <Ruediger.Geib@telekom.de>
Cc: ippm-chairs@ietf.org; SPRING WG <spring@ietf.org>; ippm@ietf.org
Subject: Re: [spring] Monitoring metric to detect and locate congestion

Hi,

Two clarifications:

[RG1] the measurements pass the routers on forwarding plane.

Well, if I have 20K CEs and would like to measure this end to end, that means it had better run on a router ... I cannot envision installing 20K new PCs just for this. At a minimum, such an end point should run on a well-designed CPE as an LXC.

[RG1] Up to now, the conditions in Deutsche Telekom’s backbone network require a careful design of router probing interval and counters to be read.

Sure. I actually had in mind a telemetry push model where your monitoring station only gets notified when an applied queue threshold is crossed. Very little management traffic in the network, limited to only the information which is important.
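
A minimal Python sketch of such a push filter, under stated assumptions: the function name `threshold_crossings`, the sample values and the threshold of 50 are all hypothetical, and the code does not model any vendor's telemetry API. It only shows the idea of emitting a notification when the queue depth crosses the threshold, rather than streaming every sample.

```python
def threshold_crossings(samples, threshold):
    """Yield (index, depth, direction) only when the queue depth crosses the threshold.

    Samples that stay on the same side of the threshold produce no output,
    which is what keeps the management traffic small.
    """
    above = False  # assumed initial state: queue below threshold
    for i, depth in enumerate(samples):
        now_above = depth > threshold
        if now_above != above:
            yield (i, depth, "exceeded" if now_above else "cleared")
            above = now_above


# Hypothetical queue-depth samples read locally on the line card.
queue_depth = [10, 12, 80, 95, 90, 20, 15, 85]
events = list(threshold_crossings(queue_depth, threshold=50))
print(events)
# [(2, 80, 'exceeded'), (5, 20, 'cleared'), (7, 85, 'exceeded')]
```

Eight local samples result in three notifications to the monitoring station in this example.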

I know some vendors resist applying filtering to telemetry streaming locally (at the LC or local RE), but this is IMHO just a sick approach.

Thx,
R.





On Wed, Feb 26, 2020 at 12:01 PM <Ruediger.Geib@telekom.de> wrote:
Hi Robert,

Thanks, my replies in line marked [RG1]

I have read your draft and presentation with interest, as I am a big supporter of, and involved in some lab trials of, end-to-end network path probing.

Few comments, observations, questions:

You are essentially measuring and comparing delay across N paths traversing a known network topology (I love the "network tomography" name!)

[RG1] it’s telemetry with a constrained set up, but the term doesn’t appear in the draft yet…that can be changed.

------
* First question - this will likely run on the RE/RP, and on some platforms the path between LC and RE/RP is completely non-deterministic and can take 10s or 100s of ms locally in the router. So right here the proposal to compare anything may not really work, unless the spec mandates that timestamping is actually done in hardware on the receiving LC. Then the CPU can process it when it has cycles.

[RG1] the measurements pass the routers on forwarding plane. High-end routers add variable processing latencies. These are on the order of double-digit or low triple-digit microseconds on Deutsche Telekom backbone routers. If a dedicated sender/receiver system is used, timestamping may be optimized for the purpose.
------

* Second question is that congestion usually has a very transient character ... You would need to be super lucky to find any congestion in a normal network using test probes of any sort. If you have interfaces that are always congested, then just the queue depth time delta may not be visible in end-to-end measurements.

[RG1] The probing frequency depends on the characteristics of the congestion the operator wants to be aware of. Unplanned events may cause changes in measurement delays lasting for minutes or longer (congestion or hardware issues). The duration of a reliably detectable “event” corresponds to the spacing between measurement packets (I don’t intend to replace hello exchanges or BFD with the metric).
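
The relation between probe spacing and detectable event duration can be sketched as follows. This is an illustration under assumptions, not a rule from the draft: probes are taken to be strictly periodic, and the requirement that a number of probes (`probes_required`, a hypothetical parameter) must fall inside the event before it counts as detected is an assumption made here.

```python
import math


def guaranteed_hits(event_duration_s, probe_interval_s):
    """Minimum number of periodic probes falling inside an event,
    regardless of when the event starts relative to the probe schedule."""
    return math.floor(event_duration_s / probe_interval_s)


def min_detectable_duration(probe_interval_s, probes_required):
    """Shortest event duration that is guaranteed to contain
    `probes_required` probes (hypothetical detection criterion)."""
    return probes_required * probe_interval_s


# One probe every 10 s, and 3 affected probes required before raising an alarm.
print(min_detectable_duration(10, 3))  # 30
print(guaranteed_hits(25, 10))         # 2
```

With a 10 s probe interval, a 25 s event is only guaranteed to affect two probes, so it may or may not meet a three-probe criterion depending on its start time.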
------

* Third - why not simply look at the queue counters at each node? Queue depth, queue history, min, avg, max on a per-interface basis offer tons of information that is readily available. Why would anyone need to inject loops of probe packets into a known network to detect this? And in black-box unknown networks this is not going to work, as you would not know the network topology in the first place. Likewise, link down/up is already reflected in your syslog via BFD and IGP alarms. I really do not think you need an end-to-end protocol to tell you that.

[RG1] Up to now, the conditions in Deutsche Telekom’s backbone network require a careful design of router probing interval and counters to be read. The proposed metric allows capturing persistent issues impacting forwarding. It points out where these likely occur. An operator may then have a closer look at an interface/router to analyse what’s going on, using the full arsenal of accessible information and tools. As unusual events happen rarely, it may still be a fair question for which purpose linecard and central processing cycles of routers are consumed.
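
How overlapping measurement paths can point to a likely location may be sketched as below. This is a generic boolean-tomography illustration, not the algorithm specified in the draft: it assumes a single fault, a known topology, and that each path can be classified as affected or unaffected. The path names, link labels and the function `candidate_links` are hypothetical.

```python
def candidate_links(paths, affected):
    """Return links shared by all affected paths and used by no unaffected path.

    paths:    dict mapping path name -> set of link labels it traverses
    affected: set of path names currently showing a delay change or loss
    """
    if not affected:
        return set()
    shared = set.intersection(*(paths[name] for name in affected))
    healthy = set().union(*(links for name, links in paths.items()
                            if name not in affected))
    return shared - healthy


# Hypothetical topology: three measurement paths over four links.
paths = {
    "path1": {"a-b", "b-c"},
    "path2": {"a-b", "b-d"},
    "path3": {"e-b", "b-c"},
}
print(candidate_links(paths, affected={"path1", "path2"}))  # {'a-b'}
```

Here paths 1 and 2 are affected while path 3 is not, so link `b-c` is ruled out and `a-b` remains the only candidate for a closer look.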
-------
+ Thanks for catching the nit below.

Regards, Ruediger

s/nodes L100 and L200 one one/nodes L100 and L200 on one/

:)

Many thx,
R.

On Wed, Feb 26, 2020 at 8:55 AM <Ruediger.Geib@telekom.de> wrote:
Dear IPPM (and SPRING) participants,

I’m soliciting interest in a new network monitoring metric which allows detecting and locating congested interfaces. Important properties are:

  *   Same scalability as ICMP ping, in the sense that one measurement relation is required per monitored connection
  *   Adds detection and location of congested interfaces as compared to ICMP ping (otherwise the measured metrics are compatible with ICMP ping)
  *   Requires Segment Routing (which means measurement on the forwarding layer and no other interaction with the routers passed, in contrast to ICMP ping)
  *   Active measurement (may be deployed using a single sender & receiver or a separate sender and receiver; Segment Routing allows for both options)

I’d be happy to present the draft in Vancouver if there’s community interest. Please read and comment.

You’ll find slides at

https://datatracker.ietf.org/meeting/105/materials/slides-105-ippm-14-draft-geib-ippm-connectivity-monitoring-00

Draft URL:

https://datatracker.ietf.org/doc/draft-geib-ippm-connectivity-monitoring/

Regards,

Ruediger
_______________________________________________
spring mailing list
spring@ietf.org
https://www.ietf.org/mailman/listinfo/spring