Re: [ippm] [spring] Monitoring metric to detect and locate congestion

Robert Raszuk <robert@raszuk.net> Thu, 27 February 2020 09:38 UTC

From: Robert Raszuk <robert@raszuk.net>
Date: Thu, 27 Feb 2020 10:38:17 +0100
Message-ID: <CAOj+MMFKJjQdvk5jsVtbjg85oQ7yFYFQvV9WBYknDuCM+FM48g@mail.gmail.com>
To: Ruediger.Geib@telekom.de
Cc: ippm-chairs@ietf.org, SPRING WG <spring@ietf.org>, ippm@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/ippm/HHz7JpCzXdZOLiWeIZazkynHYkk>
Subject: Re: [ippm] [spring] Monitoring metric to detect and locate congestion

Hi,

As I mentioned in my first mail, I am a big supporter of end-to-end path
measurement; I have mainly been focusing on the hop-by-hop timestamping
approach.

So my comments were not meant to discourage end-to-end path probing in any
way; it is extremely useful. They were just meant to get your point of view
on alternative options for the specific goal here: congestion detection.

I will be watching with interest how your proposal evolves!

Cheers,
R.

On Thu, Feb 27, 2020 at 9:17 AM <Ruediger.Geib@telekom.de> wrote:

> Hi Robert,
>
>
>
> Regarding scalability, I hope the only difference between our positions is
> whether it’s a router or a dedicated CPE. I don’t promote deploying 20k PCs
> (I hope to promote a metric to replace them). I prefer the dedicated CPE,
> but routers will do as well.
>
>
>
> The telemetry threshold might be an option too if congestion is to be
> detected. In September last year, we had to replace a line card in a
> production router whose ingress port hardware was corrupted. There were no
> drops, just random delay variations, which showed up in our performance
> measurement system. I wonder whether problems like that can be detected by
> evaluating telemetry.
>
> Further, telemetry needs to work reliably when the router hardware is busy
> dealing with heavy loads (i.e., telemetry must be sufficiently privileged).
> External measurements on the forwarding layer don’t rely on router-internal
> processing resources.
>
>
>
> Regards,
>
>
>
> Ruediger
>
>
>
>
>
> *From:* Robert Raszuk <robert@raszuk.net>
> *Sent:* Wednesday, 26 February 2020 14:19
> *To:* Geib, Rüdiger <Ruediger.Geib@telekom.de>
> *Cc:* ippm-chairs@ietf.org; SPRING WG <spring@ietf.org>; ippm@ietf.org
> *Subject:* Re: [spring] Monitoring metric to detect and locate congestion
>
>
>
> Hi,
>
>
>
> Two clarifications:
>
>
>
> [RG1] The measurements pass the routers on the forwarding plane.
>
>
>
> Well, if I have 20K CEs and would like to measure this end to end, the
> measurement had better run on a router ... I cannot envision installing 20K
> new PCs just for this. At a minimum, such an endpoint should run on a
> well-designed CPE as an LXC.
>
>
>
> [RG1] Up to now, the conditions in Deutsche Telekom’s backbone network
> require a careful design of router probing interval and counters to be
> read.
>
>
>
> Sure. I actually had in mind a telemetry push model where your monitoring
> station is notified only when the applied queue threshold is crossed. That
> means very little management traffic in the network, limited to the
> information that is important.
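
The threshold-crossing push model described above can be sketched roughly as follows. This is only an illustration, not any vendor's telemetry API: the function name `threshold_events`, the `(timestamp, queue_depth)` sample format, and the "raised"/"cleared" labels are all assumptions made for this example. The point is that the device filters locally and exports one record per crossing instead of every sample.

```python
from typing import Iterable, Iterator, Tuple

Sample = Tuple[float, int]       # (timestamp in seconds, queue depth), assumed format
Event = Tuple[float, str, int]   # (timestamp, "raised" or "cleared", queue depth)


def threshold_events(samples: Iterable[Sample], threshold: int) -> Iterator[Event]:
    """Yield an event only when the queue depth crosses the threshold.

    Hypothetical local filter: samples that stay on the same side of the
    threshold produce no output, so only state changes would be pushed.
    """
    above = False  # assume the queue starts below the threshold
    for ts, depth in samples:
        now_above = depth > threshold
        if now_above != above:
            yield (ts, "raised" if now_above else "cleared", depth)
            above = now_above


# Six local samples, of which only the two crossings would be exported.
samples = [(0.0, 10), (1.0, 40), (2.0, 120), (3.0, 150), (4.0, 90), (5.0, 20)]
events = list(threshold_events(samples, threshold=100))
```

A real deployment would probably add hysteresis (separate raise and clear thresholds) so that a queue hovering around the threshold does not generate a burst of notifications.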
>
>
>
> I know some vendors resist applying filtering to telemetry streaming
> locally (at the LC or local RE), but IMHO this is just a sick approach.
>
>
>
> Thx,
>
> R.
>
> On Wed, Feb 26, 2020 at 12:01 PM <Ruediger.Geib@telekom.de> wrote:
>
> Hi Robert,
>
>
>
> Thanks, my replies in line marked [RG1]
>
>
>
> I have read your draft and presentation with interest, as I am a big
> supporter of, and involved in some lab trials of, end-to-end network path
> probing.
>
>
>
> Few comments, observations, questions:
>
>
>
> You are essentially measuring and comparing delay across N paths
> traversing a known network topology (I love the name "network tomography"!)
>
>
>
> [RG1] It’s telemetry with a constrained setup, but the term doesn’t
> appear in the draft yet… that can be changed.
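
The idea of comparing delay across paths over a known topology can be illustrated with a simple set-based heuristic. This is not the algorithm of the draft; it is a minimal sketch under assumptions made here: each measurement path is given as a list of link names, a path is simply classified as "delayed" or not, and a single congested link explains the observations. The function name `suspect_links` and the link labels are hypothetical.

```python
from typing import Dict, List, Set


def suspect_links(paths: Dict[str, List[str]], delayed: Set[str]) -> Set[str]:
    """Return links that lie on every delayed path and on no undelayed path.

    Illustrative heuristic only: assumes a known topology and a single
    congested link.
    """
    delayed_sets = [set(links) for name, links in paths.items() if name in delayed]
    if not delayed_sets:
        return set()
    common = set.intersection(*delayed_sets)
    healthy: Set[str] = set()
    for name, links in paths.items():
        if name not in delayed:
            healthy.update(links)
    return common - healthy


# Three measurement paths over a small, known topology (hypothetical names).
paths = {
    "p1": ["A-B", "B-C"],
    "p2": ["A-B", "B-D"],
    "p3": ["A-E", "E-C"],
}
result = suspect_links(paths, delayed={"p1", "p2"})
```

Here `p1` and `p2` show increased delay and share only link `A-B`, while `p3` is unaffected, so `A-B` is the one remaining candidate.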
>
>
>
> ------
>
> * First question - this will likely run on the RE/RP, and on some platforms
> the path between LC and RE/RP is completely non-deterministic and can take
> 10s or 100s of ms locally in the router. So right here the proposal to
> compare anything may not really work, unless the spec mandates that the
> actual timestamping is done in hardware on the receiving LC. Then the CPU
> can process it when it has cycles.
>
>
>
> [RG1] The measurements pass the routers on the forwarding plane. High-end
> routers add variable processing latencies, on the order of two-digit or
> low three-digit microseconds on Deutsche Telekom backbone routers. If a
> dedicated sender/receiver system is used, timestamping may be optimized for
> the purpose.
>
> ------
>
>
>
> * Second question is that congestion usually has a very transient
> character ... You would need to be super lucky to find any congestion in a
> normal network using test probes of any sort. If you have interfaces that
> are always congested, then the queue depth time delta alone may not be
> visible in end-to-end measurements.
>
>
>
> [RG1] The probing frequency depends on the characteristics of the
> congestion the operator wants to be aware of. Unplanned events may cause
> changes in measurement delays lasting for minutes or longer (congestion or
> hardware issues). The duration of a reliably detectable “event” corresponds
> to the spacing between measurement packets (I don’t intend to replace
> hello exchanges or BFD with the metric).
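
The relation between probe spacing and detectable event duration can be made concrete with a small calculation. It is a sketch under two assumptions that are not stated in the thread: probes are sent strictly periodically, and an event affects every probe that falls within it. The helper name `min_probes_hit` is hypothetical. With an interval T, an event of duration D is hit by at least floor(D / T) probes in the worst case, so an event shorter than T may be missed entirely.

```python
import math


def min_probes_hit(event_duration_s: float, probe_interval_s: float) -> int:
    """Worst-case number of periodic probes falling inside an event.

    Assumes strictly periodic probing with an arbitrary phase relative to
    the start of the event.
    """
    if probe_interval_s <= 0:
        raise ValueError("probe interval must be positive")
    return math.floor(event_duration_s / probe_interval_s)


# With one probe every 10 s: a 5 s event can fall between two probes,
# while a 60 s event is always sampled several times.
short_event = min_probes_hit(5.0, 10.0)
long_event = min_probes_hit(60.0, 10.0)
```

Requiring several affected probes before raising an alarm, to filter out single outliers, lengthens the minimum detectable event accordingly.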
>
> ------
>
>
>
> * Third - why not simply look at the queue counters at each node? Queue
> depth, queue history, min, avg, max on a per-interface basis offer tons of
> information readily available. Why would anyone need to inject loops of
> probe packets into a known network to detect this? And in black-box unknown
> networks this is not going to work, as you would not know the network
> topology in the first place. Likewise, link down/up is already reflected in
> your syslog via BFD and IGP alarms. I really do not think you need an
> end-to-end protocol to tell you that.
>
>
>
> [RG1] Up to now, the conditions in Deutsche Telekom’s backbone network
> require a careful design of router probing interval and counters to be
> read. The proposed metric allows capturing persistent issues impacting
> forwarding. It points out where these likely occur. An operator may then
> have a closer look at an interface/router to analyse what’s going on, using
> the full arsenal of accessible information and tools. As unusual events
> happen rarely, it may still be a fair question for which purpose linecard
> and central processing cycles of routers are consumed.
>
> -------
>
> + Thanks for catching the nit below.
>
>
>
> Regards, Ruediger
>
>
>
> s/nodes L100 and L200 one one/nodes L100 and L200 on one/
>
>
>
> :)
>
>
>
> Many thx,
>
> R.
>
>
>
> On Wed, Feb 26, 2020 at 8:55 AM <Ruediger.Geib@telekom.de> wrote:
>
> Dear IPPM (and SPRING) participants,
>
>
>
> I’m soliciting interest in a new network monitoring metric which allows
> detecting and locating congested interfaces. Important properties are
>
>    - Same scalability as ICMP ping, in the sense that one measurement
>    relation is required per monitored connection
>    - Adds detection and location of congested interfaces as compared to
>    ICMP ping (otherwise measured metrics are compatible with ICMP ping)
>    - Requires Segment Routing (which means measurement on the forwarding
>    layer and no other interaction with the routers passed, in contrast to
>    ICMP ping)
>    - Active measurement (may be deployed using a single sender&receiver
>    or separate sender and receiver, Segment Routing allows for both options)
>
>
>
> I’d be happy to present the draft in Vancouver if there’s community
> interest. Please read and comment.
>
>
>
> You’ll find slides at
>
>
>
>
> https://datatracker.ietf.org/meeting/105/materials/slides-105-ippm-14-draft-geib-ippm-connectivity-monitoring-00
>
>
>
> Draft url:
>
>
>
> https://datatracker.ietf.org/doc/draft-geib-ippm-connectivity-monitoring/
>
>
>
> Regards,
>
>
>
> Ruediger
>
> _______________________________________________
> spring mailing list
> spring@ietf.org
> https://www.ietf.org/mailman/listinfo/spring
>
>