[ai-control] Re: WG Last Call: draft-ietf-aipref-vocab-03 (Ends 2025-09-18)

Mark Nottingham <mnot@mnot.net> Wed, 10 September 2025 02:55 UTC

To: Eric Rescorla <ekr@rtfm.com>
CC: ai-control@ietf.org
List-Id: AI Control <ai-control.ietf.org>
Archived-At: <https://mailarchive.ietf.org/arch/msg/ai-control/v-FTNhqoYrlYOaqvEz0KLKRjbP0>
List-Archive: <https://mailarchive.ietf.org/arch/browse/ai-control>

Hi EKR,

I've attempted to capture this in the following issues:

https://github.com/ietf-wg-aipref/drafts/issues/151 - Definition of AI
https://github.com/ietf-wg-aipref/drafts/issues/152 - Use of 'Machine Learning'
https://github.com/ietf-wg-aipref/drafts/issues/153 - Difference between "unknown" and "allowed"
https://github.com/ietf-wg-aipref/drafts/issues/154 - Defaults
https://github.com/ietf-wg-aipref/drafts/issues/155 - Automated Processing is Too Broad
https://github.com/ietf-wg-aipref/drafts/issues/156 - Search's Parent
https://github.com/ietf-wg-aipref/drafts/issues/157 - AI Training and Generative AI are Too Broad
https://github.com/ietf-wg-aipref/drafts/issues/158 - Bots Collect Data for Multiple Purposes

If you see anything I missed, please point it out (in e-mail, in a comment on one of those issues, or in a new issue). I didn't include the OVERALL prelude in each of the issues that section spawned, but can copy it into a comment on them if you like -- the information is still in the links back to your message. Or would it be better as a separate issue?

Cheers,


> On 10 Sep 2025, at 9:09 am, Eric Rescorla <ekr@rtfm.com> wrote:
> 
> OVERALL
> I don't think this taxonomy is really going in the right direction. I
> have some critiques of the specific categories but, more broadly, I
> think this is tied to specific technology choices in a way that's
> unlikely to age well. I think instead it would be much more useful to
> focus on the uses to which the data will be put (i.e., the output of the model),
> as that's really the source of unhappiness about these systems.
> 
> Just so we're on the same page: modern LLMs work (very approximately)
> by training on an enormous corpus of data which sets the model weights
> (I'm deliberately conflating pre-training and fine-tuning).  They are
> then prompted with input and asked to create output based on that
> input, as well as the output they've already generated (collectively the
> context). As part of that process, some models can also collect new
> data and use that as part of the context (RAG). In the context of web
> crawling, then, we have two ways for data to get into the system:
> 
> - In the training phase and stored in the model weights.
> - In the generation phase as part of the context via RAG.
> 
> This is an important technical distinction, but it's not clear why it
> matters from the perspective of the site. Consider an AI system which
> collects the same corpus as is currently used for pre-training but
> only trains on a small portion of it and then when it's asked to
> generate content, uses some kind of "internal RAG" to suck in the
> relevant documents and generate the output (this is a lot more like
> how human brains seem to work in my experience, because we just can't
> remember much stuff). From the perspective of the site this is the
> same: the system is using the site's content to generate new content,
> but in the taxonomy of this draft, this isn't AI training or even
> generative AI training because it doesn't impact the model
> weights. Obviously I have no idea if this is a productive technical
> direction, but neither do you, and conformance shouldn't hinge
> on whether it turns out to be.
> 
> In my opinion a much more productive direction would be to focus on
> the application to which this data is being put.  I don't have
> anything like a complete story, but it seems to me that we have some
> understanding of the categories at either end of the spectrum:
> 
> - Indexing (search): where you attempt to determine which
>   piece of content the user wants and steer them towards it.
> 
> - Substitution: where the service generates something that
>   is effectively a substitute good for the content.
> 
> I think these are useful conceptually because the first is the
> traditional one that I think people have generally accepted as "good"
> and the second is the one that seems to be of most concern. Obviously,
> it's not as clear as that because even before generative AI, search
> systems were producing substitute goods (infoboxes, etc.), but I think
> that's a feature of this analysis, because sites often didn't like
> that and I think this kind of taxonomy captures that intuition without
> worrying about whether the substitute good was generated via some
> LLM, deterministic hand-written code, or something in between.
> 
> In between these two, we have some other applications. For instance:
> 
> - Summarization (think AI overviews): this is a partial substitute, but
>   often it comes with a source link which can steer traffic to the
>   site.
> 
> - Generating original content that isn't a direct substitute.  An
>   example here might be code generation: the other day I asked an AI
>   to help write me a scraper for the IETF agenda [0] and while I'm
>   sure it was inspired by a lot of existing scraping scripts, it's not
>   like it plagiarized one of them in total or somehow reduced the
>   demand for them (this is a distinct question from whether it
>   injected some verbatim code).
> 
> Again, I don't claim that these are the right dividing lines, but I
> think that this kind of analysis does a better job of capturing what
> is actually concerning to sites about AI. The challenge then becomes
> how to turn these into relatively precise definitions. This is
> probably harder than the current definitions, but I don't think
> precision is a virtue if the definitions aren't actually a good fit to
> the problem they are trying to solve (as they say, "for every complex
> problem there is an answer that is clear, simple, and wrong").
> 
> 
> To that end...
> 
> - As I and others have noted, "automated processing" essentially
>   sweeps in any web crawler. In particular, as noted by Greg Lindahl,
>   parsing the document to find links is plainly "automated
>   processing", so you just can't really have a crawler without
>   "automated processing", which makes this whole category redundant
>   with forbidding crawling in the existing robots.txt framework.
> 
> - I don't understand why "search" is somehow a subset of "automated
>   processing" rather than "AI training". In many if not most cases,
>   search will involve training an AI model, especially given the very
>   wide parameters of "AI" (see below).
>   
> - AI training is a coherent category but seems like a trap for the
>   unwary, because a lot of people are going to turn it on and thus
>   exclude all kinds of useful applications, when what they really want
>   is much more narrow (something like gen AI). I also have a problem
>   with the term "training" for the reasons indicated above.
> 
> - Generative AI training is a remarkably wide category (see
>   aforementioned comments about substitution, summarization, and
>   generating original content).
> 
> As an aside, it's obviously possible for a bot to collect data for
> multiple purposes. Maybe I've missed it, but does this draft say what
> the rules are there? I guess you're supposed to somehow only proceed
> if the intersection of all the uses is allowed? S 5.1 seems to be
> about the related but different problem of harmonizing multiple
> statements for a given usage.
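
A minimal sketch of the "intersection" reading guessed at in the paragraph above, for a crawler that collects for several purposes at once. This is an assumption about how such a rule could work, not something the draft states: the category labels, the `combined_preference` function, and the choice to let a single "disallowed" dominate are all hypothetical.

```python
from enum import Enum


class Preference(Enum):
    ALLOWED = "allowed"
    DISALLOWED = "disallowed"
    UNKNOWN = "unknown"


def combined_preference(purposes, stated):
    """Combine the preferences for every purpose a crawler collects for.

    `stated` maps a usage-category label to a Preference. Categories with
    no statement are treated as UNKNOWN, mirroring the quoted Section 3 text.
    """
    values = [stated.get(p, Preference.UNKNOWN) for p in purposes]
    if Preference.DISALLOWED in values:
        return Preference.DISALLOWED  # one disallowed use decides the result
    if Preference.UNKNOWN in values:
        return Preference.UNKNOWN     # nothing disallowed, but not all stated
    return Preference.ALLOWED         # every intended use is allowed


# Hypothetical category labels, for illustration only.
stated = {"search": Preference.ALLOWED, "train-ai": Preference.DISALLOWED}
print(combined_preference(["search"], stated))
print(combined_preference(["search", "train-ai"], stated))
print(combined_preference(["search", "summarize"], stated))
```

Under this reading, a bot that wants both search and training data gets the most restrictive answer across its purposes. Whether it may still fetch once and use the content only for the allowed purpose is exactly the question left open.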
> 
> 
> 
> DETAILED
> S 2.
> 
>    Artificial Intelligence (AI):
>       An engineered system of sufficient complexity that, for a given
>       set of human-defined objectives, learns from data to generate
>       outputs such as content, predictions, recommendations, or
>       decisions.
> 
> As I mentioned previously, I think this definition of AI is extremely
> problematic and would effectively sweep in any statistical
> technique. I understand that "sufficient complexity" is intended to
> somehow reduce the scope, but it's a totally subjective standard.
> 
>    AI Training:
>       The application of machine learning to data to produce or improve
>       a model for an artificial intelligence system.
> 
> I'm not sure what "machine learning" is doing here. Why not just:
> 
>       The use of data to produce or improve an artificial
>       intelligence system.
> 
> I would then strike the term "machine learning" entirely.
> 
> 
> S 3.
>    After processing a statement of preferences the recipient associates
>    each category of use one of three preference values: "allowed",
>    "disallowed", or "unknown".  In the absence of a statement of
>    preference, all usage categories are assigned a preference value of
>    "unknown".
> 
> What's the semantic difference between "unknown" and "allowed"?
> Eventually, I need to either use or not use a given piece of
> data. Is this just an internal detail of the algorithm?
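
One possible reading of the distinction, sketched under two assumptions: that the recipient chooses to honour stated preferences (the draft only says MAY), and that it applies a default of its own to "unknown", as the Section 5 text quoted further down suggests. The `resolve` function and its parameter names are hypothetical.

```python
VALUES = {"allowed", "disallowed", "unknown"}


def resolve(preference, recipient_default):
    """Turn a three-valued preference into a use / don't-use decision.

    "allowed" and "disallowed" come from the declaring party. "unknown"
    carries no statement, so the outcome depends entirely on the
    recipient's own default (an assumption; the draft names no default).
    """
    if preference not in VALUES:
        raise ValueError(f"unexpected preference: {preference!r}")
    if recipient_default not in {"allowed", "disallowed"}:
        raise ValueError(f"unexpected default: {recipient_default!r}")
    effective = recipient_default if preference == "unknown" else preference
    return effective == "allowed"


# The same "unknown" input yields different decisions for different recipients,
# while an explicit "allowed" does not depend on the default at all.
print(resolve("unknown", "allowed"))
print(resolve("unknown", "disallowed"))
print(resolve("allowed", "disallowed"))
```

In this sketch "unknown" and "allowed" coincide only for a recipient whose default is "allowed". For any other recipient the two values lead to different decisions.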
> 
> 
> S 3.1.
> 
>    An entity that receives usage preferences MAY choose to respect those
>    preferences it has discovered, according to an understanding of how
>    the asset is used, how that usage corresponds to the usage categories
>    where preferences have been stated, and the applicable legal context.
> 
>    Usage preferences can be ignored due to express agreements between
>    relevant parties, explicit provisions of law, or the exercise of
>    discretion in situations where widely recognized priorities justify
>    doing so.  Priorities that could justify ignoring preferences
>    include—but are not limited to—free expression, safety, education,
>    scholarship, research, preservation, interoperability, and
>    accessibility.
> 
> As stated before, I think we should strike this text and the
> following. If the specification doesn't require respecting
> the preferences then it doesn't need to take positions on
> why not respecting them is OK, and despite the text, this
> will inevitably be taken as indicating that other reasons
> aren't as valid.
> 
> 
> S 5.
>    One approach for dealing with an "unknown" outcome is to assign a
>    default value.  This document takes no position on what default might
>    be assigned.
> 
> I don't think this makes any sense. The purpose of this document
> is to allow the processing agent to understand the declaring
> party's preferences, and we don't even require the agent
> to respect them, so it's weird to talk about defaults in
> that context.
> 
> 
> S 6.
> I think it's premature to worry about this.
> 
> -Ekr
> 
> [0] https://github.com/ekr/ietf-agenda
> 
> 
> On Thu, Sep 4, 2025 at 7:09 PM Mark Nottingham via Datatracker <noreply@ietf.org> wrote:
> 
> Subject: WG Last Call: draft-ietf-aipref-vocab-03 (Ends 2025-09-18)
> 
> This message starts a 2-week WG Last Call for this document.
> 
> Abstract:
>    This document defines a vocabulary for expressing preferences
>    regarding how digital assets are used by automated processing
>    systems.  This vocabulary allows for the declaration of restrictions
>    or permissions for use of digital assets by such systems.
> 
> File can be retrieved from:
> https://datatracker.ietf.org/doc/draft-ietf-aipref-vocab/
> 
> Please review and indicate your support or objection to proceed with the
> publication of this document by replying to this email keeping
> ai-control@ietf.org in copy. Objections should be motivated and suggestions
> to resolve them are highly appreciated.
> 
> Authors, and WG participants in general, are reminded again of the
> Intellectual Property Rights (IPR) disclosure obligations described in BCP 79
> [2]. Appropriate IPR disclosures required for full conformance with the
> provisions of BCP 78 [1] and BCP 79 [2] must be filed, if you are aware of
> any. Sanctions available for application to violators of IETF IPR Policy can
> be found at [3].
> 
> Thank you.
> 
> [1] https://datatracker.ietf.org/doc/bcp78/
> [2] https://datatracker.ietf.org/doc/bcp79/
> [3] https://datatracker.ietf.org/doc/rfc6701/
> 
> 
> 
> -- 
> ai-control mailing list -- ai-control@ietf.org
> To unsubscribe send an email to ai-control-leave@ietf.org

--
Mark Nottingham   https://www.mnot.net/