[ai-control] Re: WG Last Call: draft-ietf-aipref-vocab-03 (Ends 2025-09-18)

Mark Nottingham <mnot@mnot.net> Wed, 10 September 2025 02:55 UTC

In-Reply-To: <CABcZeBNW4cy1EBXzx1A=qgWS+F4a8zJFyxEP33XEBNSimt9UMw@mail.gmail.com>
Date: Wed, 10 Sep 2025 12:54:51 +1000
Message-Id: <7AFC1626-0E0B-4017-ABA5-3006A8720687@mnot.net>
References: <175703816389.1311.5141574230046433427@dt-datatracker-f7c8fdcb7-pjx77> <CABcZeBNW4cy1EBXzx1A=qgWS+F4a8zJFyxEP33XEBNSimt9UMw@mail.gmail.com>
To: Eric Rescorla <ekr@rtfm.com>
CC: ai-control@ietf.org
Archived-At: <https://mailarchive.ietf.org/arch/msg/ai-control/v-FTNhqoYrlYOaqvEz0KLKRjbP0>

Hi EKR,

I've attempted to capture this in the following issues:

https://github.com/ietf-wg-aipref/drafts/issues/151 - Definition of AI
https://github.com/ietf-wg-aipref/drafts/issues/152 - Use of 'Machine Learning'
https://github.com/ietf-wg-aipref/drafts/issues/153 - Difference between "unknown" and "allowed"
https://github.com/ietf-wg-aipref/drafts/issues/154 - Defaults
https://github.com/ietf-wg-aipref/drafts/issues/155 - Automated Processing is Too Broad
https://github.com/ietf-wg-aipref/drafts/issues/156 - Search's Parent
https://github.com/ietf-wg-aipref/drafts/issues/157 - AI Training and Generative AI are Too Broad
https://github.com/ietf-wg-aipref/drafts/issues/158 - Bots Collect Data for Multiple Purposes

If you see anything I missed, please point it out (either in e-mail, a comment on those, or a new issue). I didn't include the OVERALL prelude in each of the issues that section spawned, but can copy it into a comment on them if you like -- the information is still in the links back to your message. Or would it be better as a separate issue?

Cheers,


> On 10 Sep 2025, at 9:09 am, Eric Rescorla <ekr@rtfm.com> wrote:
> 
> OVERALL
> I don't think this taxonomy is really going in the right direction. I
> have some critiques of the specific categories but, more broadly, I
> think this is tied to specific technology choices in a way that's
> unlikely to age well. I think instead it would be much more useful to
> focus on the uses to which the data will be put (i.e., the output of the model),
> as that's really the source of unhappiness about these systems.
> 
> Just so we're on the same page, modern LLMs work (very approximately)
> by training on an enormous corpus of data, which sets the model weights
> (I'm deliberately conflating pre-training and fine-tuning).  They are
> then prompted with input and asked to create output based on that
> input, as well as the output they've already generated (collectively
> the context). As part of that process, some models can also collect new
> data and use that as part of the context (RAG). In the context of web
> crawling, then, we have two ways for data to get into the system:
> 
> - In the training phase and stored in the model weights.
> - In the generation phase as part of the context via RAG.
> 
> This is an important technical distinction, but it's not clear why it
> matters from the perspective of the site. Consider an AI system which
> collects the same corpus as is currently used for pre-training but
> only trains on a small portion of it and then when it's asked to
> generate content, uses some kind of "internal RAG" to suck in the
> relevant documents and generate the output (this is a lot more like
> how human brains seem to work in my experience, because we just can't
> remember much stuff). From the perspective of the site this is the
> same: the system is using the site's content to generate new content,
> but in the taxonomy of this draft, this isn't AI training or even
> generative AI training because it doesn't impact the model
> weights. Obviously I have no idea if this is a productive technical
> direction, but neither do you, and conformance shouldn't hinge
> on whether it turns out to be.
> 
> In my opinion a much more productive direction would be to focus on
> the application to which this data is being put.  I don't have
> anything like a complete story, but it seems to me that we have some
> understanding of the categories at either end of the spectrum:
> 
> - Indexing (search): where you attempt to determine which
>   piece of content the user wants and steer them towards it.
> 
> - Substitution: where the service generates something that
>   is effectively a substitute good for the content.
> 
> I think these are useful conceptually because the first is the
> traditional one that I think people have generally accepted as "good"
> and the second is the one that seems to be of most concern. Obviously,
> it's not as clear as that because even before generative AI search
> systems were producing substitute goods (infoboxes, etc.), but I think
> that's a feature of this analysis, because sites often didn't like
> that and I think this kind of taxonomy captures that intuition without
> worrying about whether the substitute good was generated via some
> LLM, deterministic hand-written code, or something in between.
> 
> In between these two, we have some other applications. For instance:
> 
> - Summarization (think AI overviews): this is a partial substitute, but
>   often it comes with a source link which can steer traffic to the
>   site.
> 
> - Generating original content that isn't a direct substitute.  An
>   example here might be code generation: the other day I asked an AI
>   to help write me a scraper for the IETF agenda [0] and while I'm
>   sure it was inspired by a lot of existing scraping scripts, it's not
>   like it plagiarized one of them in total or somehow reduced the
>   demand for them (this is a distinct question from whether it
>   injected some verbatim code).
> 
> Again, I don't claim that these are the right dividing lines, but I
> think that this kind of analysis does a better job of capturing what
> is actually concerning to sites about AI. The challenge then becomes
> how to turn these into relatively precise definitions. This is
> probably harder than the current definitions, but I don't think
> precision is a virtue if the definitions aren't actually a good fit to
> the problem they are trying to solve (as they say, "for every complex
> problem there is an answer that is clear, simple, and wrong").
> 
> 
> To that end...
> 
> - As I and others have noted, "automated processing" essentially
>   sweeps in any web crawler. In particular, as noted by Greg Lindahl,
>   parsing the document to find links is plainly "automated
>   processing", so you just can't really have a crawler without
>   "automated processing", which makes this whole category redundant
>   with forbidding crawling in the existing robots.txt framework.
> 
> - I don't understand why "search" is somehow a subset of "automated
>   processing" rather than "AI training". In many if not most cases,
>   search will involve training an AI model, especially given the very
>   wide parameters of "AI" (see below).
>   
> - AI training is a coherent category but seems like a trap for the
>   unwary, because a lot of people are going to turn it on and thus
>   exclude all kinds of useful applications, when what they really want
>   is much more narrow (something like gen AI). I also have a problem
>   with the term "training" for the reasons indicated above.
> 
> - Generative AI training is a remarkably wide category (see
>   aforementioned comments about substitution, summarization, and
>   generating original content).
> 
> As an aside, it's obviously possible for a bot to collect data for
> multiple purposes. Maybe I've missed it, but does this draft say what
> the rules are there? I guess you're supposed to somehow only proceed
> if the intersection of all the uses is allowed? S 5.1 seems to be
> about the related but different problem of harmonizing multiple
> statements for a given usage.
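
The "intersection" reading of the multi-purpose question above can be sketched roughly as follows. This is only an illustration under stated assumptions, not something the draft specifies: the category labels ("search", "ai-training"), the `Preference` enum, and the `may_collect` helper are all hypothetical names, and treating "unknown" as insufficient is a choice made for the sketch.

```python
from enum import Enum


class Preference(Enum):
    # The three preference values named in S 3 of the draft.
    ALLOWED = "allowed"
    DISALLOWED = "disallowed"
    UNKNOWN = "unknown"


def may_collect(preferences, purposes):
    """Return True only if every intended purpose is explicitly allowed.

    `preferences` maps a category label to a Preference; a category with
    no entry is treated as UNKNOWN. Both the labels and this strict
    "all purposes must be allowed" rule are assumptions of this sketch.
    """
    return all(
        preferences.get(purpose, Preference.UNKNOWN) is Preference.ALLOWED
        for purpose in purposes
    )


# Hypothetical preferences stated for one asset.
prefs = {
    "search": Preference.ALLOWED,
    "ai-training": Preference.DISALLOWED,
}

search_only = may_collect(prefs, ["search"])
search_and_training = may_collect(prefs, ["search", "ai-training"])
```

Under this reading a bot that fetches once for both search and training would not proceed at all, whereas a bot that fetches for search alone would. A looser rule (fetch once, then restrict each downstream use separately) is equally expressible; the sketch takes no side on which the draft intends.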
> 
> 
> 
> DETAILED
> S 2.
> 
>    Artificial Intelligence (AI):
>       An engineered system of sufficient complexity that, for a given
>       set of human-defined objectives, learns from data to generate
>       outputs such as content, predictions, recommendations, or
>       decisions.
> 
> As I mentioned previously, I think this definition of AI is extremely
> problematic and would effectively sweep in any statistical
> technique. I understand that "sufficient complexity" is intended to
> somehow reduce the scope, but it's a totally subjective standard.
> 
>    AI Training:
>       The application of machine learning to data to produce or improve
>       a model for an artificial intelligence system.
> 
> I'm not sure what "machine learning" is doing here. Why not just:
> 
>       The use of data to produce or improve an artificial
>       intelligence system.
> 
> I would then strike the term "machine learning" entirely.
> 
> 
> S 3.
>    After processing a statement of preferences the recipient associates
>    each category of use one of three preference values: "allowed",
>    "disallowed", or "unknown".  In the absence of a statement of
>    preference, all usage categories are assigned a preference value of
>    "unknown".
> 
> What's the semantic difference between "unknown" and "allowed"?
> Eventually, I need to either use or not use a given piece of
> data. Is this just an internal detail of the algorithm?
> 
> 
> S 3.1.
> 
>    An entity that receives usage preferences MAY choose to respect those
>    preferences it has discovered, according to an understanding of how
>    the asset is used, how that usage corresponds to the usage categories
>    where preferences have been stated, and the applicable legal context.
> 
>    Usage preferences can be ignored due to express agreements between
>    relevant parties, explicit provisions of law, or the exercise of
>    discretion in situations where widely recognized priorities justify
>    doing so.  Priorities that could justify ignoring preferences
>    include—but are not limited to—free expression, safety, education,
>    scholarship, research, preservation, interoperability, and
>    accessibility.
> 
> As stated before, I think we should strike this text and the
> following. If the specification doesn't require respecting
> the preferences then it doesn't need to take positions on
> why not respecting them is OK, and despite the text, this
> will inevitably be taken as indicating that other reasons
> aren't as valid.
> 
> 
> S 5.
>    One approach for dealing with an "unknown" outcome is to assign a
>    default value.  This document takes no position on what default might
>    be assigned.
> 
> I don't think this makes any sense. The purpose of this document
> is to allow the processing agent to understand the declaring
> party's preferences, and we don't even require the agent
> to respect them, so it's weird to talk about defaults in
> that context.
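
To make the "unknown" and default discussion concrete, here is a minimal sketch of the two steps the quoted S 3 and S 5 text describes: a lookup that yields "unknown" when no preference was stated, and a separate resolution step that substitutes a default. The function names and the category label are hypothetical, and the default is whatever the caller picks; the quoted draft text takes no position on it.

```python
def preference_for(category, stated):
    """Look up a category; absence of a statement yields "unknown" (S 3)."""
    return stated.get(category, "unknown")


def resolve(category, stated, default):
    """Replace an "unknown" outcome with a caller-chosen default (S 5).

    `default` is an assumption supplied by the processing agent, not a
    value defined by the draft.
    """
    value = preference_for(category, stated)
    return default if value == "unknown" else value


# No preferences stated at all: the raw outcome is "unknown", and only
# the caller's default turns it into a usable decision.
raw = preference_for("search", {})
resolved = resolve("search", {}, default="allowed")

# A stated preference is never overridden by the default.
stated = resolve("search", {"search": "disallowed"}, default="allowed")
```

Seen this way, "unknown" only differs from "allowed" for an agent whose chosen default is not "allowed"; whether that distinction belongs in the vocabulary or is an internal detail is the question raised above.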
> 
> 
> S 6.
> I think it's premature to worry about this.
> 
> -Ekr
> 
> [0] https://github.com/ekr/ietf-agenda
> 
> 
> On Thu, Sep 4, 2025 at 7:09 PM Mark Nottingham via Datatracker <noreply@ietf.org> wrote:
> 
> Subject: WG Last Call: draft-ietf-aipref-vocab-03 (Ends 2025-09-18)
> 
> This message starts a 2-week WG Last Call for this document.
> 
> Abstract:
>    This document defines a vocabulary for expressing preferences
>    regarding how digital assets are used by automated processing
>    systems.  This vocabulary allows for the declaration of restrictions
>    or permissions for use of digital assets by such systems.
> 
> File can be retrieved from:
> https://datatracker.ietf.org/doc/draft-ietf-aipref-vocab/
> 
> Please review and indicate your support or objection to proceed with the
> publication of this document by replying to this email keeping
> ai-control@ietf.org in copy. Objections should be motivated and suggestions
> to resolve them are highly appreciated.
> 
> Authors, and WG participants in general, are reminded again of the
> Intellectual Property Rights (IPR) disclosure obligations described in BCP 79
> [1]. Appropriate IPR disclosures required for full conformance with the
> provisions of BCP 78 [1] and BCP 79 [2] must be filed, if you are aware of
> any. Sanctions available for application to violators of IETF IPR Policy can
> be found at [3].
> 
> Thank you.
> 
> [1] https://datatracker.ietf.org/doc/bcp78/
> [2] https://datatracker.ietf.org/doc/bcp79/
> [3] https://datatracker.ietf.org/doc/rfc6701/
> 
> 
> 
> -- 
> ai-control mailing list -- ai-control@ietf.org
> To unsubscribe send an email to ai-control-leave@ietf.org

--
Mark Nottingham   https://www.mnot.net/
