Re: [ippm] [spring] Monitoring metric to detect and locate congestion

Hi,

As I mentioned in my first mail, I am a big supporter of end-to-end path
measurement; I have mainly been focusing on the hop-by-hop timestamping
approach.

So my comments were not meant to discourage end-to-end path probing in any
way - it is extremely useful. They were just meant to get your point of view
on alternative options for the specific goal here: congestion detection.

I will be watching with interest how your proposal evolves!

Cheers,
R.

On Thu, Feb 27, 2020 at 9:17 AM <Ruediger.Geib@telekom.de> wrote:

> Hi Robert,
>
>
>
> Regarding scalability, I hope the difference between our positions is just
> whether it’s a router or a dedicated CPE. I don’t promote deploying 20k PCs
> (I hope to promote a metric to replace them). I prefer the dedicated CPE,
> but routers work as well.
>
>
>
> The telemetry threshold might be an option too, if congestion is to be
> detected. In September last year, we had to replace a line card in a
> production router whose ingress port hardware was corrupted. There were no
> drops, just random delay variations, which showed up in our performance
> measurement system. I wonder whether problems like that can be detected by
> evaluating telemetry.
>
> Further, telemetry needs to work reliably when the router hardware is busy
> dealing with heavy loads (i.e., telemetry must be sufficiently privileged).
> External measurements on the forwarding layer don’t rely on router-internal
> processing resources.
>
>
>
> Regards,
>
>
>
> Ruediger
>
>
>
>
>
> *From:* Robert Raszuk <robert@raszuk.net>
> *Sent:* Wednesday, 26 February 2020 14:19
> *To:* Geib, Rüdiger <Ruediger.Geib@telekom.de>
> *Cc:* ippm-chairs@ietf.org; SPRING WG <spring@ietf.org>; ippm@ietf.org
> *Subject:* Re: [spring] Monitoring metric to detect and locate congestion
>
>
>
> Hi,
>
>
>
> Two clarifications:
>
>
>
> [RG1] the measurements pass the routers on the forwarding plane.
>
>
>
> Well, if I have 20K CEs and would like to measure this end to end, that
> means it had better run on a router ... I cannot envision installing 20K
> new PCs just for this. At minimum, such an endpoint should run on a
> well-designed CPE as an LXC.
>
>
>
> [RG1] Up to now, the conditions in Deutsche Telekom’s backbone network
> require a careful design of router probing interval and counters to be
> read.
>
>
>
> Sure. I actually had in mind a telemetry push model where your monitoring
> station only gets notified when a configured queue threshold is crossed.
> That means very little management traffic in the network, limited to the
> information that is important.
>
>
>
> I know some vendors resist applying filtering to telemetry streaming
> locally (at the LC or local RE), but this is IMHO just a sick approach.
>
>
>
> Thx,
>
> R.
>
> On Wed, Feb 26, 2020 at 12:01 PM <Ruediger.Geib@telekom.de> wrote:
>
> Hi Robert,
>
>
>
> Thanks, my replies in line marked [RG1]
>
>
>
> I have read your draft and presentation with interest, as I am a big
> supporter of end-to-end network path probing (and involved in some lab
> trials of it).
>
>
>
> A few comments, observations, and questions:
>
>
>
> You are essentially measuring and comparing delay across N paths
> traversing a known network topology (I love the name "network tomography"!)
>
>
>
> [RG1] It’s telemetry with a constrained setup, but the term doesn’t
> appear in the draft yet… that can be changed.
>
>
>
> ------
>
> * First question - this will likely run on the RE/RP, and on some platforms
> the path between LC and RE/RP is completely non-deterministic and can take
> 10s or 100s of ms locally in the router. So right here the proposal to
> compare anything may not really work, unless the spec mandates that
> timestamping is actually done in hardware on the receiving LC. Then the CPU
> can process it when it has cycles.
>
>
>
> [RG1] The measurements pass the routers on the forwarding plane. High-end
> routers add variable processing latencies. These are in the range of
> double-digit to low triple-digit microseconds on Deutsche Telekom backbone
> routers. If a dedicated sender/receiver system is used, timestamping may be
> optimized for the purpose.
>
> ------
>
>
>
> * Second question: congestion usually has a very transient character ...
> You would need to be super lucky to find any congestion in a normal
> network using test probes of any sort. And if you have interfaces that are
> always congested, then the queue depth delta over time may not be visible
> in end-to-end measurements.
>
>
>
> [RG1] The probing frequency depends on the characteristics of the
> congestion the operator wants to be aware of. Unplanned events may cause
> changes in measured delays lasting for minutes or longer (congestion or
> hardware issues). The duration of a reliably detectable “event” corresponds
> to the interval between measurement packets (I don’t intend to replace
> hello exchanges or BFD with the metric).
>
> ------
>
>
>
> * Third - why not simply look at the queue counters at each node? Queue
> depth, queue history, min, avg, max on a per-interface basis offer tons of
> information that is readily available. Why would anyone need to inject
> loops of probe packets in a known network to detect this? And in black-box
> unknown networks this is not going to work, as you would not know the
> network topology in the first place. Likewise, link down/up is already
> reflected in your syslog via BFD and IGP alarms. I really do not think you
> need an end-to-end protocol to tell you that.
>
>
>
> [RG1] Up to now, the conditions in Deutsche Telekom’s backbone network
> require a careful design of router probing interval and counters to be
> read. The proposed metric allows capturing persistent issues that impact
> forwarding. It points out where these likely occur. An operator may then
> have a closer look at an interface/router to analyse what’s going on, using
> the full arsenal of accessible information and tools. As unusual events
> happen rarely, it may still be a fair question for which purpose line card
> and central processing cycles of routers are consumed.
>
> -------
>
> + Thanks for catching the nit below.
>
>
>
> Regards, Ruediger
>
>
>
> s/nodes L100 and L200 one one/nodes L100 and L200 on one/
>
>
>
> :)
>
>
>
> Many thx,
>
> R.
>
>
>
> On Wed, Feb 26, 2020 at 8:55 AM <Ruediger.Geib@telekom.de> wrote:
>
> Dear IPPM (and SPRING) participants,
>
>
>
> I’m soliciting interest in a new network monitoring metric that allows
> detecting and locating congested interfaces. Important properties are:
>
>    - Same scalability as ICMP ping, in the sense that one measurement
>    relation is required per monitored connection
>    - Adds detection and location of congested interfaces as compared to
>    ICMP ping (otherwise the measured metrics are compatible with ICMP ping)
>    - Requires Segment Routing (which means measurement on the forwarding
>    layer, with no other interaction with the routers passed – in contrast
>    to ICMP ping)
>    - Active measurement (may be deployed using a single combined
>    sender/receiver or a separate sender and receiver; Segment Routing
>    allows for both options)
>
>
>
> I’d be happy to present the draft in Vancouver if there’s community
> interest. Please read and comment.
>
>
>
> You’ll find slides at
>
>
>
>
> https://datatracker.ietf.org/meeting/105/materials/slides-105-ippm-14-draft-geib-ippm-connectivity-monitoring-00
>
>
>
> Draft url:
>
>
>
> https://datatracker.ietf.org/doc/draft-geib-ippm-connectivity-monitoring/
>
>
>
> Regards,
>
>
>
> Ruediger
>
> _______________________________________________
> spring mailing list
> spring@ietf.org
> https://www.ietf.org/mailman/listinfo/spring
>
>
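
For illustration only: the threshold-based telemetry push model mentioned in
the thread (the monitoring station is notified only when a queue threshold
is crossed) could be sketched roughly as below. This is a minimal sketch in
Python under stated assumptions, not any vendor's telemetry API. The names
QueueSample, ThresholdEvent and threshold_events, the byte-based threshold,
and the interface names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class QueueSample:
    """One locally taken queue-depth reading (hypothetical record layout)."""
    interface: str
    timestamp_s: float
    depth_bytes: int


@dataclass(frozen=True)
class ThresholdEvent:
    """Notification that would be pushed to the monitoring station."""
    interface: str
    timestamp_s: float
    depth_bytes: int
    state: str  # "above" or "below" the threshold


def threshold_events(samples: Iterable[QueueSample],
                     threshold_bytes: int) -> Iterator[ThresholdEvent]:
    """Yield an event only when an interface crosses the threshold.

    Samples that stay on the same side of the threshold produce nothing,
    which is what keeps the management traffic small.
    """
    above: Dict[str, bool] = {}  # interface -> was the last sample above?
    for s in samples:
        now_above = s.depth_bytes > threshold_bytes
        was_above = above.get(s.interface, False)
        if now_above != was_above:
            yield ThresholdEvent(s.interface, s.timestamp_s, s.depth_bytes,
                                 "above" if now_above else "below")
        above[s.interface] = now_above


if __name__ == "__main__":
    readings = [
        QueueSample("if1", 0.0, 100),
        QueueSample("if1", 1.0, 900),   # crosses upwards -> event
        QueueSample("if1", 2.0, 950),   # still above -> silent
        QueueSample("if1", 3.0, 200),   # crosses downwards -> event
    ]
    for event in threshold_events(readings, threshold_bytes=500):
        print(event)
```

The filter runs where the samples are produced, so only the crossings leave
the device. Note that such a filter only sees queue depth; delay variation
without queue build-up, like the line card case described in the thread,
would not trigger it.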