Re: [Idr] Comments of draft-ietf-idr-5g-edge-service-metadata

Cheng Li <c.l@huawei.com> Wed, 03 January 2024 17:38 UTC

From: Cheng Li <c.l@huawei.com>
To: Linda Dunbar <linda.dunbar@futurewei.com>, "draft-ietf-idr-5g-edge-service-metadata@ietf.org" <draft-ietf-idr-5g-edge-service-metadata@ietf.org>
CC: "idr@ietf. org" <idr@ietf.org>
Thread-Topic: Comments of draft-ietf-idr-5g-edge-service-metadata
Date: Wed, 03 Jan 2024 17:38:00 +0000
Message-ID: <406087ac017f4016ad5b8e850716fb23@huawei.com>
References: <ab04e74bfbb041488e1748290a114d21@huawei.com> <CO1PR13MB4920FEE1BAB0BBD53F4CD8F5858EA@CO1PR13MB4920.namprd13.prod.outlook.com> <eeefb099f66f4fb7ac1fdad57f6df81d@huawei.com> <CO1PR13MB49204E7C730A96DFB0DFE5BA8590A@CO1PR13MB4920.namprd13.prod.outlook.com> <ba1d4fbbd1404c739637631c29a58136@huawei.com> <CO1PR13MB4920A9F3DFF8CC146BB262928597A@CO1PR13MB4920.namprd13.prod.outlook.com>
In-Reply-To: <CO1PR13MB4920A9F3DFF8CC146BB262928597A@CO1PR13MB4920.namprd13.prod.outlook.com>
Archived-At: <https://mailarchive.ietf.org/arch/msg/idr/t3nY1Q9Ox4uTW1vuL-D6N3QSHaA>
Subject: Re: [Idr] Comments of draft-ietf-idr-5g-edge-service-metadata

Hi Linda,

Sorry for the delayed reply; I was away over the Christmas break.
Please see my reply inline.

Thanks,
Cheng


From: Linda Dunbar <linda.dunbar@futurewei.com>
Sent: Tuesday, December 19, 2023 7:08 PM
To: Cheng Li <c.l@huawei.com>; draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: RE: Comments of draft-ietf-idr-5g-edge-service-metadata

Cheng,

Do you think adding the following subsection to indicate the Resource Utilization for a Service (prefix) can address your comments?
[Cheng] Generally speaking, it looks good to me. Many thanks for your reply; I have some comments below.

4.1.2. Service Delay caused by Resource Utilization Anomalies.
One of the challenges for getting ultra-low latency services hosted in edge data centers is the extra service delay caused by higher-than-normal resource utilization of the hosting environment. Although the details are out of scope for this document, many cloud services employ queue-based systems to handle incoming requests and monitor queue lengths to gain insight into the workload; unusually long queues highlight delays in processing requests. With advanced monitoring and sophisticated algorithms, some edge data centers can gauge or estimate the potential service delay caused by physical resource utilization. For those edge data centers, the following subTLV can be included in the Metadata Path Attribute to indicate the estimated service delay for the prefix and the overall resource utilization of the environment that hosts the prefix.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Resource Utilization Sub-Type |   Length      |T| Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Resource utilization                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Delay caused by resource utilization anomalies               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
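
For illustration only, the layout above could be encoded and decoded roughly as follows. This is a sketch under stated assumptions, not text from the draft: the sub-type code point 0xFFFF is a placeholder, T is assumed to be the most significant bit of the flags octet, and Length is assumed to count every octet after the Length field (flags octet included).

```python
import struct

# Placeholder code point (assumption): no value is assigned in the quoted text.
RESOURCE_UTILIZATION_SUBTYPE = 0xFFFF

# Assumption: T is the most significant bit of the octet that follows Length.
T_FLAG = 0x80


def pack_subtlv(utilization, delay_ms=None):
    """Encode the sub-TLV; the delay word is appended only when T is set."""
    flags = T_FLAG if delay_ms is not None else 0
    # Flags octet followed by the 32-bit resource utilization word.
    value = struct.pack("!BI", flags, utilization)
    if delay_ms is not None:
        value += struct.pack("!I", delay_ms)
    # Assumption: Length covers all octets after the Length field.
    return struct.pack("!HB", RESOURCE_UTILIZATION_SUBTYPE, len(value)) + value


def unpack_subtlv(data):
    """Decode the sub-TLV into (subtype, utilization, delay_ms or None)."""
    subtype, length = struct.unpack_from("!HB", data, 0)
    value = data[3:3 + length]
    if len(value) != length:
        raise ValueError("truncated sub-TLV")
    flags, utilization = struct.unpack_from("!BI", value, 0)
    delay_ms = None
    if flags & T_FLAG:
        delay_ms = struct.unpack_from("!I", value, 5)[0]
    return subtype, utilization, delay_ms
```

With these assumptions, `pack_subtlv(75, 12)` yields three 32-bit words (12 octets), and `pack_subtlv(75)` yields two (8 octets) with T cleared.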

[Cheng] It is interesting to put resource utilization and delay data into one single TLV. This design is common in PCEP (IMHO), but I am not so familiar with BGP TLVs.
In BGP, do people prefer to separate different types of information into different TLVs, or to load them into a container/combined TLV? To me, delay and resource utilization are two types of information. However, your argument also makes sense to me to some extent: the delay is caused by the utilization.

If we follow this design principle, then other information should be included in this TLV, indicated by new flags in the Reserved field. That works as well. If we need jitter, how can we include it in the Metadata Path Attribute? That is my question: how do we organize the information that we need?


  1.  Type (2 byte):
     *   Identifier for the type of subTLV.
  2.  Length (1 byte):
     *   Number of bytes in the Value field.
  3.  Value: including the following information:
     *   Resource Utilization Percentage (4 bytes):
        *   A percentage value indicating the level of resource utilization that triggered the delay.
[Cheng] Why does a percentage value need 4 bytes? Is 1 byte enough?

     *   Delay caused by resource utilization anomalies (4 bytes):
        *   The duration of the delay caused by resource utilization anomalies, measured in milliseconds.
[Cheng] Milliseconds make sense. However, one day we may enter an era in which computers are so fast that nanosecond differences matter. It may happen, but not now. What if we use the format from NTP? The 64-bit format may be too long; would the 32-bit NTP short format [RFC5905] be better?


  4.  Flag T: if set to 1, both the Resource Utilization Percentage and the Delay caused by resource utilization anomalies are included. If set to 0, only the Resource Utilization Percentage is included.
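
As a point of comparison for the delay encoding question raised above, the 32-bit NTP short format in RFC 5905 carries 16 bits of unsigned seconds and 16 bits of fraction. That gives a resolution of 1/65536 s (about 15.3 microseconds) and a range just under 65536 s, whereas a 32-bit millisecond counter gives 1 ms resolution and a range of roughly 49.7 days. The sketch below only illustrates the conversion; the function names are hypothetical and nothing here is specified by the draft.

```python
def to_ntp_short(seconds):
    """Convert a non-negative delay in seconds to a 32-bit NTP short value."""
    if not 0 <= seconds < 65536:
        raise ValueError("delay does not fit in 16 bits of seconds")
    # Upper 16 bits are whole seconds, lower 16 bits are the fraction.
    # Clamp in case rounding pushes the value past the 32-bit maximum.
    return min(round(seconds * 65536), 0xFFFFFFFF)


def from_ntp_short(value):
    """Convert a 32-bit NTP short value back to seconds."""
    return value / 65536
```

For example, a delay of 1.5 s encodes as 0x00018000, and a 12.5 ms delay encodes as 819, which decodes to roughly 12.497 ms.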

Although out of scope for this document, the following example practices can be used to gauge or estimate the service delay caused by resource utilization anomalies. They are listed mainly to demonstrate that estimating service delay metrics is feasible.
Baseline Comparison:
Establishing a baseline for normal resource utilization is essential. Deviations from this baseline can be a strong indicator of abnormal behavior. Regularly comparing current resource utilization with historical data helps identify trends and anomalies.
Real-Time Monitoring:
Implementing real-time monitoring solutions enables prompt detection of spikes or sustained increases in resource utilization. Automated alerts can notify administrators when predefined thresholds are breached, facilitating swift intervention.
Machine Learning Algorithms:
Leveraging machine learning algorithms allows for the creation of predictive models that can anticipate resource utilization patterns. By training these models on historical data, the system can proactively identify potential service delays before they impact user experience.

[Cheng] The above text is really helpful. Thank you, Linda!



Linda


From: Cheng Li <c.l@huawei.com>
Sent: Tuesday, December 19, 2023 8:42 AM
To: Linda Dunbar <linda.dunbar@futurewei.com>; draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: RE: Comments of draft-ietf-idr-5g-edge-service-metadata

Thank you for your quick reply! We are almost on the same page.  Please see my reply inline.

Respect,
Cheng


From: Linda Dunbar <linda.dunbar@futurewei.com>
Sent: Tuesday, December 19, 2023 12:33 AM
To: Cheng Li <c.l@huawei.com>; draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: RE: Comments of draft-ietf-idr-5g-edge-service-metadata

Cheng,

Please see below for the resolutions. I snipped the text where you have agreed to the changes, to make the discussion easier to follow.

Linda

From: Cheng Li <c.l@huawei.com>
Sent: Thursday, December 14, 2023 10:50 AM
To: Linda Dunbar <linda.dunbar@futurewei.com>; draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: RE: Comments of draft-ietf-idr-5g-edge-service-metadata

Hi Linda,

Thank you for your reply, please see my reply inline.

Respect,
Cheng


From: Linda Dunbar <linda.dunbar@futurewei.com>
Sent: Tuesday, December 12, 2023 11:38 PM
To: Cheng Li <c.l@huawei.com>; draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: RE: Comments of draft-ietf-idr-5g-edge-service-metadata

Cheng,

Thank you very much for the comments.
Please see below for the detailed resolutions to your comments.

Linda

From: Cheng Li <c.l=40huawei.com@dmarc.ietf.org>
Sent: Monday, December 11, 2023 2:28 PM
To: draft-ietf-idr-5g-edge-service-metadata@ietf.org
Cc: idr@ietf. org <idr@ietf.org>
Subject: Comments of draft-ietf-idr-5g-edge-service-metadata

Hi authors,

<snipped>

1.      Why is the value range 0-100? It is a 32-bit value, so why not let it be 0-2**32-1? A deployment can choose whatever range it wants. The ingress node only cares whether the value of one site is bigger or smaller than the others, doesn’t it?

I suggest making it open and letting each deployment choose the range it likes. Otherwise, we still need to check the value before processing it.
[Linda] Sure, it can change to 0-2**32-1. Then, deployment probably can use a range to compare.
How about changing to the following:
Preference Index value: 1-2^31; the higher the value, the more preferred the site.
[Cheng] Why 2^31? The maximum is 2^32-1, so we may not need to describe the range at all. Can we just say that a bigger value means higher priority? Then people can choose any 32-bit value, unless you want to reserve 0 for some reason.

[Linda] Sure, it can be changed to 2^32-1.
[Cheng] Thanks.
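
The outcome agreed above (any 32-bit value is allowed, and a bigger value means higher priority) can be illustrated with a small selection sketch. The site names and the function `best_site` are hypothetical; the thread does not say how an ingress node would store or compare the values.

```python
MAX_PREFERENCE = 2**32 - 1  # largest value that fits in the 32-bit field


def best_site(preferences):
    """Return the site with the highest preference index.

    `preferences` maps a site name to its advertised 32-bit index.
    """
    for site, value in preferences.items():
        if not 0 <= value <= MAX_PREFERENCE:
            raise ValueError(f"preference for {site} is not a 32-bit value")
    # Only the relative order matters, so no fixed range such as 0-100 is needed.
    return max(preferences, key=preferences.get)
```

Because only the ordering is compared, a deployment using 0-100 and one using the full 32-bit range would both work with the same logic.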

<snipped>

BTW, I recommend adding a new sub-TLV to describe the available resource capacity of a service, because from the perspective of a service, what matters is how much resource it can access.
[Linda] Is the new subTLV for a specific prefix, or for a group of prefixes?
[Cheng] For the prefix that it is associated with? It can be used for many prefixes if we want and BGP allows it. What do you think?
[Linda] I think per prefix is more reasonable, as a BGP UPDATE notifies the information about a prefix. One prefix can represent a group of services.
[Cheng] Agree.


The whole resource of the data center or the site is important, but it is not all for this service.
We can use a unified value to represent the capacity of the resource, with the unit defined by the service itself, so that it can be standardized for many services instead of one or several.

# Service Delay Prediction Value
I see this info can even be used to predict the load, or do I misunderstand something? I recommend adding a separate sub-TLV for load or utilization.

[Linda] Is the load/utilization per prefix? Is it the same as the earlier “resource of the service”, or is it different?
[Cheng] It is per service, associated with a prefix. The above discusses the total resource of the service, and this is its load. Can this load (percentage) x capacity give the real-time available resource?
[Linda] A BGP UPDATE is about a route (a.k.a. a service) running on a server. The load utilization is about the server, which can host many routes. So the metric is more about the underlay physical resource utilization that supports the service. How does the egress learn this information?
[Cheng] Some agency will collect this information and send it to the egress router. The specific method is out of scope for this draft, I think.

The reason is that a site may have its own means to measure the load/utilization of a service, and this percentage times the capacity can tell us how much resource is available now.
This is important for the ingress to know whether it can send more service packets to a site. We may not need to care about how the load is measured, but only need to distribute it to the ingress node.
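
A small worked example of the arithmetic, under the assumption that the advertised percentage is the share of capacity currently in use: load x capacity then gives the amount in use, and the available amount is capacity x (1 - load). The unit of capacity is left to the service, as suggested above; the function name is hypothetical.

```python
def available_capacity(capacity, load_percent):
    """Return the unused capacity, in the same unit as `capacity`.

    Assumes `load_percent` is the share of capacity in use (0-100).
    """
    if capacity < 0 or not 0 <= load_percent <= 100:
        raise ValueError("capacity must be >= 0 and load within 0-100")
    return capacity * (100 - load_percent) / 100
```

For instance, a service with a capacity of 2000 units at 75% load has 500 units available, which is the figure the ingress would compare across sites.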



The original delay is required for the ingress node to calculate the lowest delay from it to the instance inside a site.
The network delay can be obtained by existing means. The delay between the egress router attached to the site and the instance inside the site can be measured by existing means as well, like TWAMP.
The internal processing delay can be measured by existing tools. We may only care about the data value instead of how it is generated.
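
The three components above can be combined as a simple sum per site, with the ingress choosing the smallest total. This is only a sketch of that reasoning; the site names, the tuple layout, and the use of milliseconds are assumptions for illustration.

```python
def total_delay_ms(network_ms, egress_to_instance_ms, processing_ms):
    """End-to-end delay from the ingress to the service instance."""
    return network_ms + egress_to_instance_ms + processing_ms


def lowest_delay_site(sites):
    """Return the site whose summed delay is lowest.

    `sites` maps a site name to a tuple of
    (network_ms, egress_to_instance_ms, processing_ms).
    """
    return min(sites, key=lambda name: total_delay_ms(*sites[name]))
```

A site with a short network path can still lose to a more distant one if its processing delay is high, which is why all three components are needed.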
[Linda] correct.
[Cheng] Thanks. Jitter or other types may be included as well. Let's see how to define the flag. If we have different types of delay info, will they be included at the same time?
[Linda] Yes. Maybe using another bit in the Reserved field?
[Cheng] Yes, but if different flags are set at the same time, the order of the optional fields should be defined clearly.
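
One possible way to make the ordering unambiguous is to require optional fields to appear in flag-bit order, most significant bit first. The sketch below illustrates that rule only; the J (jitter) flag, its bit position, and the 4-byte field widths are hypothetical and are not defined anywhere in the thread.

```python
import struct

# (flag bit, field name) in the order the fields would appear on the wire.
# T = 0x80 is assumed for delay; J = 0x40 for jitter is purely hypothetical.
OPTIONAL_FIELDS = [(0x80, "delay_ms"), (0x40, "jitter_ms")]


def parse_optional_fields(flags, payload):
    """Read 4-byte optional fields from `payload` in flag-bit order."""
    fields = {}
    offset = 0
    for bit, name in OPTIONAL_FIELDS:
        if flags & bit:
            if offset + 4 > len(payload):
                raise ValueError(f"payload too short for {name}")
            fields[name] = struct.unpack_from("!I", payload, offset)[0]
            offset += 4
    if offset != len(payload):
        raise ValueError("payload has bytes not covered by any flag")
    return fields
```

With a fixed order like this, a receiver can parse any combination of flags without extra per-field type codes, at the cost of every implementation agreeing on the bit-to-field mapping.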

Similar comments may have been raised in the past; sorry that I did not follow the discussion in time.

Hope my comments make sense.

Thanks and respect,
Cheng