[Cats] Re: draft-ietf-cats-usecases-requirements-10 ietf last call Artart review

kehan yao <khyao78@gmail.com> Fri, 12 December 2025 09:37 UTC

References: <176539583482.1334877.15163936020458948002@dt-datatracker-5bd94c585b-wk4l4>
In-Reply-To: <176539583482.1334877.15163936020458948002@dt-datatracker-5bd94c585b-wk4l4>
From: kehan yao <khyao78@gmail.com>
Date: Fri, 12 Dec 2025 17:37:26 +0800
Message-ID: <CABYiY4uNOG4FFj5ROisb4usR_8q757XgRKi8iYk4sUfm20Ms_g@mail.gmail.com>
To: Tim Bray <tbray@textuality.com>
CC: art@ietf.org, cats@ietf.org, draft-ietf-cats-usecases-requirements.all@ietf.org, last-call@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/cats/52r4Ujvr0h0yzdIHoIaLbLIOOWo>

Hi Tim,

Thank you so much for your detailed review. Please see some of my replies
inline below.
The modifications will be reflected in the next revision.

Best regards,
Kehan

On Thu, 11 Dec 2025 at 03:44, Tim Bray via Datatracker <noreply@ietf.org> wrote:

> Document: draft-ietf-cats-usecases-requirements
> Title: Computing-Aware Traffic Steering (CATS) Problem Statement, Use
> Cases,
> and Requirements Reviewer: Tim Bray Review result: Not Ready
>
> This is the ARTART review of draft-ietf-cats-usecases-requirements-10. It
> has
> no special standing and is offered as input to further discussion of the
> subject.
>
> While I have never looked at ALTO, I spent 5+ years as an employee of AWS
> where
> a central everyday concern was the design and operation of distributed
> systems,
> so I feel I have some exposure to the issues being addressed.
>
> I feel that this document is not suitable for publication as an RFC.
> Quoting
> from the Shepherd Report:
>
>   The WG milestones only explicitly say to adopt this document (not to
> publish
>   as an RFC). However, the charter does not preclude this. The working
> group
>   discussed this point and had strong consensus that publication as an
>   Informational RFC would be helpful for future protocol work.
>
> This document contains a lot of RFC 2119 language, which I don't think
> belongs
> in an informational RFC.  After my review, I am left dubious of the claim
> that
> this "would be helpful for future protocol work".  Perhaps this would be
> suitable for leaving as a draft for guiding the work of the WG?
>
*[KY] The working group has conducted detailed discussions on this matter
previously. The authors and contributors are eager to publish this document
as an informational RFC, and this initiative has received substantial
support from the working group. On one hand, it provides guidance for the
CATS framework, metric definitions, solution design, and other related
work. On the other hand, a more critical reason is that publication of this
work by the IETF will facilitate its reference by external parties and
other organizations (e.g., 3GPP, ETSI, etc.). During the document update
process, we have also received significant interest and support from
outside the working group. Publishing it as an RFC may therefore better
enhance the influence of both this work and the IETF.*


> I found this draft difficult (and very time-consuming) to read and am not
> convinced that it offers practical value.  Perhaps it is aimed at a class
> of
> system or protocol designer who is working on problems different from
> those I
> faced, so my experience is not relevant and the comments below are not
> helpful.
>  If so, sorry.
>
> The draft is extremely verbose, 11K words in length. I found it difficult
> to
> read and understand because of this and because the language is often
> general
> and nontechnical.  (Also the quality of the language needs work, there are
> many
> grammatical errors.)  It would benefit from the attention of an editor
> with the
> goal of reducing its size and increasing its clarity.  For example, I
> think the
> entirety of Section 1 could be replaced by the following without loss of
> value:
> "It is often desirable to distribute compute workloads across multiple
> compute
> resources.  These resources can include servers and load balancers in data
> centers and compute capacity deployed in CDN POPs.  Routing requests for
> service to such nodes with the goals of providing good response to variable
> loads presents multiple complex problems."
>
*[KY] Thanks. I acknowledge that the document is relatively lengthy,
particularly the sections on problem description and scenario elaboration.
I am one of the editors and English is not my native language, which has
indeed led to some grammatical errors; I apologize for this. In subsequent
updates I will condense redundant content to improve readability for
audiences with diverse professional backgrounds.*

> 2, 3.1 Edge computing could mean two different things: Resources at CDN
> POPs,
> or resources at infrastructure locations which are specialized at mediating
> access to internal servers and the Internet. These offer functions
> including
> load balancing and firewalling. The draft uses the term "edge" in a very
> generalized way.
>
>


*[KY] Regarding "edge computing", please allow me to provide a brief
clarification. Similar questions were raised by experts in the working
group, including the chair, during the document update. To better
distinguish between the two terms, we have defined "network edge" and
"edge computing" in Section 2 as follows:*

*“Network Edge: The network edge is an architectural demarcation point used
to identify physical locations where the corporate network connects to
third-party networks.*

*Edge Computing: Edge computing is a computing pattern that moves computing
infrastructures, i.e, servers, away from centralized data centers and
instead places it close to the end users for low latency communication.*

*Relations with network edge: edge computing infrastructures connect to
corporate network through a network edge entry/exit point.”*

*I believe the first meaning you describe (resources at CDN PoPs)
corresponds to ‘edge computing’, while the second corresponds to ‘network
edge’. Can we reach consensus on this point?*


> I am unconvinced that some of the scenarios offered are realistic:
>
> 4.1 "Cloud VR/AR introduces the concept of cloud computing to the
> rendering of
> audiovisual assets in such applications. Here, the edge cloud helps
> encode/decode and render content.” I'm surprised. Rendering AR/VR requires
> considerable compute cycles and typically would be accomplished either on
> client hardware (mobile phone, AR/VR headset) or in a data center server,
> the
> results being cached by the edge. But rendering on edge devices? I don't
> think
> so? I haven't worked on AR in a few years so maybe I'm out of date, but
> this is
> still surprising.
>
>
*[KY] AR/VR scenarios have not yet been deployed at large scale. However,
with the emergence of more intelligent terminals (e.g., AI headsets, AI
smart glasses), computing resources deployed in edge data centers will be
required to provide rendering services. While this scenario is evolving
relatively slowly, we believe it still has significant deployment potential
and value for the future. For example, at IETF 118 the BBC presented similar
work, the AI4ME project, in the CATS working group. For reference:
https://datatracker.ietf.org/doc/slides-118-cats-2a-ai4me-and-bbc-cats-use-cases/*


> 4.2 Repeated discussions of the same problem which could be summarized
> “try to
> use the nearest edge PoP to reduce latency, unless it’s overloaded, in
> which
> case fall back to somewhere else, while reporting the problem”
>
*[KY] Thank you. I will tighten the wording of this section in the next
update.*
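
To make the one-sentence summary above concrete, here is a minimal sketch
of that policy in Python. It is illustrative only and is not taken from the
draft: the `Pop` type, the `select_pop` function, the `max_load` threshold
of 0.8, and the `report` callback are all hypothetical names and values
chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Pop:
    """Hypothetical view of one edge PoP as seen by the steering point."""
    name: str
    latency_ms: float  # measured network latency to this PoP
    load: float        # utilization in [0.0, 1.0]


def select_pop(
    pops: Sequence[Pop],
    max_load: float = 0.8,  # assumed overload threshold, not from the draft
    report: Optional[Callable[[Pop], None]] = None,
) -> Optional[Pop]:
    """Return the lowest-latency PoP that is not overloaded.

    Every overloaded PoP that is skipped on the way is passed to `report`.
    Returns None when all PoPs are overloaded.
    """
    for pop in sorted(pops, key=lambda p: p.latency_ms):
        if pop.load < max_load:
            return pop
        if report is not None:
            report(pop)  # "while reporting the problem"
    return None


if __name__ == "__main__":
    skipped = []
    pops = [Pop("near", 5.0, 0.95), Pop("mid", 12.0, 0.40), Pop("far", 30.0, 0.10)]
    chosen = select_pop(pops, report=skipped.append)
    print(chosen.name if chosen else None, [p.name for p in skipped])
```

In this sketch the nearest PoP is skipped because its load exceeds the
assumed threshold, and the next-nearest one is chosen.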


> 4.5.2 “Distributed AI training” - Is this really a thing?  It’s not my
> understanding of how model building/training is done in practice.  This
> and the
> other use cases would benefit from citations to real-world research.
>
>





*[KY] First, inference is a typical use case for CATS, as it is highly
sensitive to metrics such as latency and memory consumption. Distributed
training, federated learning, and related technologies are also strongly
relevant to CATS. Relevant references are listed below:*

*FedFog: Resource-Aware Federated Learning in Edge and Fog Networks
https://arxiv.org/abs/2507.03952*

*ARES: Adaptive Resource-Aware Split Learning for Internet of Things
https://dl.acm.org/doi/10.1016/j.comnet.2022.109380*

*The authors will add relevant citations in the next revision.*


> 5.2, R5 “The Resource Model MUST be implementable in an interoperable
> manner.“
> The use of RFC2119 language on such a vague, general statement feels like
> mis-use to me.  This comment applies to a high proportion of the
> requirement
> assertions.
>
>



*[KY] There is some explanation after this requirement: “R5: The Resource
Model MUST be implementable in an interoperable manner. That is,
independent implementations of the Resource Model must be interoperable.”
If you think this is not sufficient, how about the following modification:*

*_NEW_ “R5: The Resource Model MUST be implementable in an interoperable
manner. That is, metrics generated by this resource model MUST be
understood and interoperable across independent implementations.”*


> R6: "The Resource Model MUST be executable in a scalable manner. That is,
> an
> agent implementing the Resource Model MUST be able to execute it at the
> required time scale and at an affordable cost (e.g., memory footprint,
> energy,
> etc.)” The absence of discussion of scaling metrics such as for example
> “p99
> latencies” is striking. Note that 5.3 is about metrics, but provides no
> examples nor does it enumerate any specific metrics.
>
*[KY] How about adding some examples to the first paragraph of section
5.2:*

*_NEW_ “Computing metrics can have many different semantics, particularly
because they are service-specific. For example, delay, measured in
milliseconds (ms), can gauge packet transmission time as well as service
processing time; GPU memory, measured in gigabytes (GB) or megabytes (MB),
can represent the computing load that influences how many service requests
can be handled in a pre-defined time duration. These representations may
carry information on the semantics of the metric, or a metric may be
purely one or more semantic-free numerals.”*
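
As a rough illustration of the distinction in the proposed text between
metrics that carry semantics and semantic-free numerals, the following
Python sketch may help. It is not a CATS metric encoding: the `Metric`
type, its field names, and the `gpu_memory_mb` helper are hypothetical, and
the conversion 1 GB = 1024 MB is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric: a value with optional semantics attached."""
    value: float
    name: Optional[str] = None  # e.g. "delay", "gpu_memory"
    unit: Optional[str] = None  # e.g. "ms", "GB", "MB"

    def is_semantic_free(self) -> bool:
        """True when the metric is a bare numeral with no name or unit."""
        return self.name is None and self.unit is None


# Assumed binary convention (1 GB = 1024 MB); a real encoding would have to
# state which convention it uses.
_TO_MB = {"MB": 1.0, "GB": 1024.0}


def gpu_memory_mb(metric: Metric) -> float:
    """Normalize a GPU-memory metric to megabytes so values are comparable."""
    if metric.unit not in _TO_MB:
        raise ValueError(f"unsupported unit for GPU memory: {metric.unit!r}")
    return metric.value * _TO_MB[metric.unit]


if __name__ == "__main__":
    delay = Metric(12.5, name="delay", unit="ms")
    memory = Metric(16, name="gpu_memory", unit="GB")
    opaque = Metric(7)  # semantic-free numeral
    print(delay.is_semantic_free(), opaque.is_semantic_free(), gpu_memory_mb(memory))
```

A receiver can compare two semantic-free numerals only by convention,
whereas named metrics with units can be normalized before comparison, as
the helper does for GPU memory.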


> R7: "The Resource Model MUST be useful." Once again, the 2119 language
> feels
> inapplicable.
>
*[KY] Thanks for the criticism. If this requirement is not appropriate or
compelling, the authors could consider deleting it.*

> R18: "CATS systems MUST maintain instance affinity for stateful sessions and
> transactions." This may be true in some service scenarios but in
> large-scale
> distributed systems it can cause all sorts of problems.  I personally was
> severely bitten by a misguided attempt to provide instance affinity in a
> large-scale cloud application, see
> https://www.tbray.org/ongoing/When/201x/2019/09/25/On-Sharding (also have
> a
> look at some of the other issues discussed there, which feel like they
> ought to
> be relevant to this subject matter)
>
> There is no discussion of shuffle sharding, which is overwhelmingly seen
> as a
> best practice to make systems resilient in the face of inevitable server
> failures.  In fact, there is little discussion of resilience in the face of
> server failures. That feels like one of the big and hard problems in
> operating
> real-world distributed systems.


*[KY] Thank you for providing the reference materials. Indeed, we lack
practical experience in distributed systems. We will strive to incorporate
considerations related to fault resilience in the next update. Regarding
instance affinity, after reading your blog, we agree that the current
wording is overly absolute. In situations such as traffic surges, this
requirement could restrict the system's optimization options. Therefore, we
propose weakening the wording as follows:*

*_NEW_ R18: "CATS systems are RECOMMENDED to maintain instance affinity for
stateful sessions and transactions."*
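
To show what the weaker wording could mean in practice, here is a minimal
Python sketch in which affinity is a preference rather than a hard
constraint. It is illustrative only and does not describe any CATS
mechanism: `pick_instance`, the `affinity` table, and the `fallback`
selector are hypothetical names, and the sketch assumes session state can
be recovered or rebuilt on the new instance.

```python
from typing import Callable, Dict, Set


def pick_instance(
    session_id: str,
    affinity: Dict[str, str],  # session id -> instance it is pinned to
    available: Set[str],       # instances currently healthy and not overloaded
    fallback: Callable[[Set[str]], str],
) -> str:
    """Prefer the pinned instance; re-pin elsewhere if it is unavailable."""
    if not available:
        raise RuntimeError("no instance available")
    pinned = affinity.get(session_id)
    if pinned is not None and pinned in available:
        return pinned  # affinity honored
    # Affinity cannot be honored (new session, failure, or overload):
    # choose another instance and remember the new pinning.
    chosen = fallback(available)
    affinity[session_id] = chosen
    return chosen


if __name__ == "__main__":
    table = {"s1": "inst-a"}
    print(pick_instance("s1", table, {"inst-a", "inst-b"}, min))
    print(pick_instance("s1", table, {"inst-b", "inst-c"}, min))
```

With a MUST-level requirement the second call would have no valid answer
once the pinned instance is gone; under RECOMMENDED the session is re-pinned
instead.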
>
>
>
> The Security Considerations section seems short.  One of the functions
> required
> of every system is authentication of its users, and not all classes of
> servers
> can perform this task; how does authentication figure in the CATS
> ecosystem?
>
*[KY] Similar comments were also raised in the DNSdir review. I proposed
some revisions in a previous thread; please take a look and see whether
they are appropriate:*
*https://mailarchive.ietf.org/arch/msg/cats/mpjybHZE9X91EaiY7oJrSKbTFkU/*
>
>
> --
> Cats mailing list -- cats@ietf.org
> To unsubscribe send an email to cats-leave@ietf.org
>
