**Session Date/Time:** 08 Apr 2025 11:00

# [AIPREF](../wg/aipref.html)

## Summary

The AIPREF Working Group convened to discuss proposals for a foundational vocabulary for expressing AI preferences. The primary focus was on Paul's "Open Future" proposal (draft-paul-aipref-opt-out-vocabulary), which had emerged from discussions in an EU regulatory context. A sense of those present indicated broad support for adopting this draft as a starting point, with a clear understanding that significant discussion and revision would be necessary to address a range of open issues, particularly concerning the scope and definition of "search," "inference," "RAG" (Retrieval Augmented Generation) use cases, and the balance between specific and broad vocabulary terms. The session also touched upon the practical implications of defining an abstract vocabulary model versus a concrete serialization.

## Key Discussion Points

* **Vocabulary Proposals Review**: The Chair noted two main proposals for the vocabulary: Paul's "Open Future" and Tom's proposal. The Chair expressed a bias towards Paul's proposal as a smaller, more focused starting point. It was also noted that Tom's draft was largely derived from Paul's work.
* **Paul's "Open Future" Presentation**:
  * **Origin**: The vocabulary was developed in an EU regulatory context, influenced by the EU Copyright Directive's TDM exceptions and the AI Act's connection between TDM and AI training.
  * **Purpose**: To provide a common vocabulary for machine-readable opt-outs/ins by parties wishing to restrict or allow the use of their assets for AI training and other TDM forms, enabling interoperability across different preference signaling mechanisms.
  * **Design Considerations**:
    * Developed with an opt-out bias due to the EU context, but intended to be applicable for both opt-in and opt-out in any legal context.
    * Uses "TDM" as an umbrella category, based on EU definitions, to allow for the exercise of related rights.
    * Focuses on the *use of assets* (e.g., training, inference) rather than the act of crawling itself, as crawling often serves multiple, undifferentiated purposes.
  * **Proposed Vocabulary Categories**:
    * **Text and Data Mining (TDM)**: The broadest category, defined per the EU Copyright Directive as "the act of using one or more assets... in the context of any automated analytical technique aimed at analyzing text and data in digital form in order to generate information with improvement...".
    * **AI Training**: Defined simply as "the act of training AI models."
    * **Generative AI Training**: Defined as "the act of training general purpose AI models that have the capacity to generate text, images or other forms of synthetic content or the act of training other types of AI models... that have the purpose of generating text and images and other forms of synthetic content." This combines capacity-based and purpose-based approaches.
    * **Hierarchy**: TDM is the overarching category; AI Training is a subset of TDM; Generative AI Training is a subset of AI Training.
  * **Known Shortcomings**:
    * No explicit category for "inference" or "RAG" (Retrieval Augmented Generation) uses; these concerns arose later, and the legal framework and conceptual clarity around them are less developed.
    * "Search and discovery" is currently sidestepped, with the draft pointing to the Robots Exclusion Protocol for governing such uses.
* **Discussion on Vocabulary Scope and Definitions**:
  * **Inference, RAG, and Search**: A significant portion of the discussion centered on the omission or underspecification of these categories.
    * Many participants felt that "inference," "RAG," and "search" use cases are critical and should be included in the initial scope of work, not deferred.
    * Concerns were raised that current search experiences extensively use AI, making a clean "search carveout" difficult.
    * Different interpretations of "search" were highlighted: traditional "blue links," AI-augmented search, generative AI search providing direct answers, and agentic uses.
    * The impact on publishers (e.g., traffic, revenue, brand reputation, accuracy of generated content) was emphasized as a key driver for these concerns.
    * The lack of clear legal frameworks for inference-time uses, compared to training, was acknowledged.
  * **TDM Broadness and EU-Centricity**:
    * Concerns were raised about the broadness of the "TDM" definition and its direct derivation from EU law.
    * Paul clarified that while the definition is EU-centric, it needs to fit the EU context to be workable, as the EU currently has the most specific rights reservations in this area. It also acts as a "big hammer" for rights holders to object broadly.
    * The WG discussed adapting the language to be more jurisdiction-agnostic and evaluating other TDM definitions (e.g., the UK's).
  * **Derivative Works and Synthetic Data**: The question of expressing preferences for AI-generated synthetic data derived from original content was raised. It was noted that the current vocabulary provides building blocks for policies, but this specific scenario (e.g., allowing training but restricting subsequent training on synthetic outputs) is complex and not explicitly addressed.
  * **AI Training Broadness**: Some participants felt "AI Training" was still too broad, potentially encompassing internal AI systems (e.g., spam filters, search ranking algorithms) that users may not wish to opt out of.
  * **Hierarchy and Abstraction**:
    * The proposed nested hierarchy was debated, with some suggesting a less rigid or overlapping model might be more appropriate for search/inference.
    * The idea of an abstract vocabulary model, allowing for multiple serialization formats (e.g., JSON, Structured Fields), with a *recommended* default serialization, gained support. This aims to balance interoperability with flexibility for different attachment mechanisms.
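The nested hierarchy discussed above (TDM as the umbrella category, AI Training within it, Generative AI Training within that) and the inheritance it implies can be sketched in a few lines. This is purely illustrative: the category identifiers echo the draft's concepts, but the data structure, function name, and string values are invented here and are not part of any proposal.

```python
# Hypothetical sketch of the proposed nesting; not part of any draft.
PARENT = {
    "genai-training": "ai-training",  # Generative AI Training is a subset of AI Training
    "ai-training": "tdm",             # AI Training is a subset of TDM
    "tdm": None,                      # TDM is the overarching category
}

def resolve(prefs, category):
    """Walk up the hierarchy: a category without an explicit preference
    inherits from its parent; silence all the way up stays None (unset)."""
    while category is not None:
        if category in prefs:
            return prefs[category]
        category = PARENT[category]
    return None  # no signal at all: deliberately left uninterpreted

# A broad TDM opt-out, with a narrower allowance for AI training:
prefs = {"tdm": "disallow", "ai-training": "allow"}
print(resolve(prefs, "genai-training"))  # -> allow (inherited from ai-training)
print(resolve(prefs, "tdm"))             # -> disallow
```

One limitation the group debated is visible even in this toy model: a strict tree cannot express overlapping categories such as search or RAG, which is part of why a less rigidly nested model was floated.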
  * **User Types and Granularity**: Discussion of how many internet users (e.g., CMS users, individual creators, professional publishers) would use these preferences, and of the need for granularity to avoid blanket opt-outs.
  * **Agentic Uses**: Briefly mentioned as a potential area to consider, especially if not explicitly out of scope.
* **Chair's Summary of Issues Identified for Future Discussion**:
  1. Search and discovery carve-out (and its definition).
  2. The nexus of RAG, inference time, and agentic cases.
  3. The role and definition of TDM (e.g., its broadness, EU-centricity).
  4. General EU-centricity of language and definitions.
  5. Definitional questions around "AI" itself.
  6. Mapping vocabulary to actual preferences (e.g., how "yes/no" is expressed).
  7. The interface to attachment mechanisms (abstract model vs. concrete syntax/protocol bits).
  8. The time dimension / staleness of preferences.
  9. Enabling "good uses" (e.g., spam filtering, offensive language detection) vs. disabling "bad uses" (e.g., weapons, biometrics).
  10. The proposed nesting/hierarchy of terms.

## Decisions and Action Items

* **Decision**: The Working Group decided to adopt Paul's "Open Future" proposal (draft-paul-aipref-opt-out-vocabulary) as the starting point for the vocabulary draft, recognizing that substantial work is needed to address identified issues.
* **Action Item**: The Chairs will issue a Call for Adoption (CfA) on the AIPREF mailing list for draft-paul-aipref-opt-out-vocabulary within the next day or so.
* **Action Item**: The Working Group will begin to flesh out and discuss the issues list identified during this session.

## Next Steps

* **Draft Adoption**: Proceed with the CfA for draft-paul-aipref-opt-out-vocabulary on the mailing list for approximately one week.
* **Issue Discussion**: Continue detailed discussions on the identified issues, potentially using GitHub issues to track progress.
* **Draft Iteration**: Following adoption, the Working Group will iterate on the draft to incorporate resolutions from these discussions.
* **Future Session Planning**: The Chairs will determine whether the next session should continue discussions on vocabulary issues or pivot to attachment mechanisms, balancing the need for foundational vocabulary work with the aggressive timeline.
* **Remote Participation Improvements**: The Chairs will explore ways to improve A/V quality and speaker identification for remote participants in future sessions (e.g., encouraging all participants to join the Meetecho meeting, even if physically present).

---

**Session Date/Time:** 08 Apr 2025 07:15

# [AIPREF](../wg/aipref.html)

## Summary

The first interim meeting of the AIPREF Working Group began with an extensive overview of IETF processes, norms, and the lifecycle of working group documents, primarily for the benefit of new participants. This was followed by a round of self-introductions from those present both in person and remotely. The main technical discussion for the day focused on the scope and requirements for the "vocabulary" deliverable, including fundamental concepts like opt-in/opt-out models, extensibility, and the interpretation of signals within various legal and technical contexts. While many points were discussed, a key theme emerged around balancing the need for a minimal, shippable core with anticipating future needs and use cases. The chairs proposed to shift to more specific proposals after lunch.

## Key Discussion Points

### IETF Process Overview

The chairs (Mark Nottingham and Krishnan) provided a detailed introduction to IETF processes and participation, including:

* **The Note Well:** Emphasizing legal provisions, intellectual property rights (IPR) disclosures, anti-harassment policies, and the code of conduct.
* **Participation Norms:** Highlighting that IETF participation is individual, not representative of employers or external groups, and is open to anyone.
  Professional behavior and constructive disagreement are expected. The concept of "rat holing" (hyper-focusing on a single issue) was noted as something chairs might manage to ensure efficient use of time.
* **Contributions and IPR:** All contributions (in meetings, on mailing lists, or on GitHub) are made under IETF IPR terms, and personal awareness of relevant patents must be disclosed.
* **Transparency:** All working group activities are in the open, including recordings and minutes (though this specific session was not recorded).
* **Consensus:** Decisions are made by "rough consensus," not unanimity. Silence is difficult to interpret, and participants are encouraged to express support or explain technical/policy arguments for disagreement. All consensus is formally established on the mailing list; synchronous meetings only gather a "sense of the room." RFC 7282 was recommended for understanding IETF consensus.
* **Appeals Process:** If fundamental process errors are perceived, appeals can be made to the chairs, then to the Area Director (Mike Bishop, AD for Web and Internet Transport, introduced himself), and further if necessary.
* **Meetings:** Interim meetings are primarily for active work and discussion, not just presentations. Remote participation is enabled, requiring clear speaking and active monitoring of chat for remote participants. Participants were encouraged to state their name when speaking.
* **Document Lifecycle:**
  * Individuals can publish "individual drafts" (e.g., `draft-yourname-aipref-xyz`) as proposals.
  * Working Group "adoption calls" are issued on the mailing list to select starting points for WG work (e.g., `draft-ietf-aipref-xyz`). Adoption does not imply consensus on content, only on the document as a starting point.
  * Editors are appointed by the chairs and serve the working group's consensus, not their own opinions. Change control transfers to the working group upon adoption.
  * **DataTracker vs. GitHub:** DataTracker is the official document store, while GitHub is used by editors/chairs for managing minutiae and issue tracking. Substantial discussions and consensus *must* happen on the mailing list, not in GitHub issues or pull requests. Issues are for tracking discussion topics, not direct proposals.
  * **Working Group Last Call (WGLC):** The WG reviews a mature draft for readiness.
  * **IETF Last Call (IETF LC):** The entire IETF reviews the draft, including Directorate reviews (e.g., by the Security Directorate).
  * **IESG Review:** The Internet Engineering Steering Group (IESG) evaluates feedback and votes on the document.
  * **RFC Editor Queue:** Final editorial changes are made before publication as an RFC.
* **Future Work:** After the deliverables are shipped, the group may consider new work, potentially through re-chartering to expand scope.

### Initial Discussion on Vocabulary Deliverable

The discussion then moved to the first deliverable: vocabulary.

* **Proposed Discussion Structure (Mark Nottingham):**
  1. The three states: opt-in, opt-out, silence.
  2. Application of rules: acquisition, training, or ultimate use of trained models.
  3. The set of things to express preferences about.
  4. The scope of search indexing (should it be included, given its interaction with existing `robots.txt` practices and the potential for differing publisher preferences for search vs. generative AI?).
* **Legal Frameworks and Scope Limitations:**
  * Sash and Glenn raised concerns about the technical mechanisms being misinterpreted or over-applied by policymakers to contexts not intended or worked through by the IETF.
  * Mark Nottingham suggested a guide for policymakers to clarify what the mechanism does and doesn't do.
  * Krishnan noted that the lack of a signal could be interpreted differently in various legal frameworks.
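The tri-state framing listed above (opt-in, opt-out, silence) can be made concrete with a small sketch. The enum and function names here are invented for illustration; the one point the sketch encodes is the position discussed in the session, that absence of a signal is a distinct state rather than a default to either consent or refusal.

```python
from enum import Enum
from typing import Optional

class Preference(Enum):
    """Hypothetical model of the three states under discussion."""
    OPT_IN = "opt-in"
    OPT_OUT = "opt-out"
    UNSET = "unset"  # silence: its legal meaning is left to each jurisdiction

def interpret(signal: Optional[str]) -> Preference:
    """Map a raw signal to one of the three states; a missing signal
    becomes UNSET rather than being coerced into consent or refusal."""
    if signal is None:
        return Preference.UNSET
    return Preference(signal)

print(interpret("opt-out"))  # Preference.OPT_OUT
print(interpret(None))       # Preference.UNSET
```

The proposed "fourth state" ("but with criteria") would extend this model with attached terms, which is one reason the group weighed extensibility against shipping a minimal core.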
* **The "Fourth State" and Extensibility:**
  * Leonard Rosenthal and D proposed a "fourth state": "but with criteria," allowing for expressions like "contact me" or detailed terms (e.g., payment, specific conditions). Leonard cited ongoing EU work on machine-readable detailed rights.
  * Mark Nottingham emphasized the aggressive charter timeline (August deliverables) and suggested focusing on a minimal core first, then re-chartering for more elaborate extensions, to ensure progress. This view was generally supported, with a caveat against preventing future extensibility.
  * Bradley suggested that "no" could be the start of a negotiation, leading to contact and custom agreements, fitting within the tri-state model while enabling the "but with criteria" need.
* **AI Search vs. Traditional Search:**
  * Bradley, Matt Roger, and Mia highlighted a critical distinction: publishers want to appear in traditional search but may not want their content summarized by generative AI search results, as this reduces click-through rates. This suggests a need for granular control.
* **"Data is out there" Problem:**
  * Joe Bradley and Tim Brock noted the difficulty of retracting data once scraped and the need for preferences to apply not just at crawl time but also pre-training and during use. Mia clarified that the focus is on *how* crawled data is used, not just crawling itself.
* **Opt-in vs. Opt-out (again):**
  * The group discussed the implications of default opt-in versus opt-out models, especially concerning unforeseen future uses and rapid business-model evolution (e.g., actors licensing their voice/likeness for AI training).
  * Mark Nottingham reiterated that the IETF defines *preferences*, but whether a legal regime interprets "no signal" as consent, non-consent, or something else is outside the WG's control. The goal is to accommodate both opt-in and opt-out *legal regimes* by providing appropriate signals.
  * Bradley emphasized that the vocabulary should be agnostic to specific legal regimes, but the WG *should* state that "no signal does not imply consent," leaving legal interpretation to the relevant jurisdictions.
* **"Good Actors" vs. "Bad Actors":**
  * Gady, Paul Keller, and Alyssa noted that the protocol is for "good faith actors" willing to obey signals. It won't stop malicious scraping but aims to provide clear signals for those who comply.
  * Ted argued that `robots.txt`'s user-agent targeting is a "misfeature" because it's a crude form of bot authentication and can entrench dominant players. He would oppose using identity as a primary mechanism. Gady confirmed this with an example of IP range repurposing leading to impersonation.
  * Alyssa clarified that "good actors" includes site owners who lack tools for granular control, not just crawlers.
  * K further argued against identity-specific rules and against trying to protect against misuse, stating that external mechanisms exist for piracy/misconduct.
  * Fabrice suggested focusing on a standard, universal vocabulary that applies to "all actors" by default.
* **Versioning and Audit:**
  * Ted proposed including an ability to reference versions of preference signals (like an `ETag` or `nonce`) to track compliance and facilitate protocol evolution. Krishnan noted this is more involved than simple versioning.
  * Matt Roger highlighted the need for tracking provenance for "fair trade" AI models, which was acknowledged as an important future consideration but out of scope for the current charter's initial phase.
* **Attachment Mechanisms and Contexts:**
  * Ted and Brian noted the limitations of `robots.txt` (a crawl-time artifact) and discussed the need for preferences to travel *with* content (e.g., in HTTP headers, embedded in media, or via email protocols like IMAP/SMTP), as well as the challenges of re-checking usage rights (e.g., the cost of re-crawling).
  * Leonard spoke about asset-based preferences (e.g., embedded in video) making the distribution point irrelevant. Ted countered that distributors (e.g., YouTube) also have terms and preferences that might need to be combined or reconciled with content-creator preferences.
* **Territoriality/Jurisdiction and Conflicting Signals (Glenn's Concern):**
  * Glenn raised a complex scenario: original content distributed by two good-faith actors in different jurisdictions (e.g., EU vs. US), each attaching preferences relevant to its local context. A global crawler would then find conflicting signals for the same content. How does a good-faith crawler resolve this without a "territory" signal?
  * This discussion was identified as a "rat hole" by the chairs, due to its complexity and its relation to legal/contractual issues beyond the WG's technical scope. The group acknowledged the problem of conflicting signals for known mechanisms and the need for *some* guidance, but agreed that a complete solution for territoriality is unlikely within the current charter.

## Decisions and Action Items

* **Decision:** The working group will continue the open, wide-ranging discussion until lunch, then shift to focusing on specific proposals for the vocabulary deliverable after lunch.
* **Decision:** The chairs will aim to facilitate side conversations and speculative ideas without letting them block or distract from the main deliverables, potentially exploring separate mailing lists or clear guidance on the main list.
* **Action Item:** Glenn to clearly articulate the issue of conflicting policies and jurisdictional signals in a message to the mailing list for further discussion.
* **Action Item:** The note-taker (Ted) was requested to share the link to the notes in the chat, and possibly on the mailing list, for review to ensure accuracy and completeness.

## Next Steps

* **Today (after lunch):** Resume the meeting to focus on specific proposals for the "vocabulary" deliverable.
* **Remainder of the Week:** Continue discussions on attachment mechanisms (e.g., location-based, embedded).
* **Thursday:** Conduct a status check of progress, evaluate timelines, and discuss the potential need for additional interim meetings.
* **Ongoing:** Continue the discussion of complex, cross-cutting issues (e.g., conflicting signals, audit mechanisms, bot authentication, territoriality) on the mailing list, with the understanding that some aspects may be outside the current charter and may lead to future re-chartering efforts.
* **Remote Participants:** Note that the afternoon session will use a different Meetecho link, available via the DataTracker meeting materials page.
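The versioning idea raised under "Versioning and Audit" above (giving each published preference signal a referenceable version, in the spirit of an HTTP `ETag`, so that compliance with a specific version can be tracked) might look something like the following sketch. The hashing scheme and function name are invented here for illustration; nothing of the sort has been adopted or proposed in a draft.

```python
import hashlib

def preference_version(payload: bytes) -> str:
    """Derive a stable, opaque version tag from a preference payload,
    in the spirit of a strong HTTP ETag: any change yields a new tag,
    and identical payloads always yield the same tag."""
    return '"' + hashlib.sha256(payload).hexdigest()[:16] + '"'

v1 = preference_version(b"tdm=disallow")
v2 = preference_version(b"tdm=disallow, ai-training=allow")
assert v1 != v2  # an updated policy gets a new version tag
assert v1 == preference_version(b"tdm=disallow")  # stable across fetches
```

As noted in the session, a full audit trail (recording which version a crawler honored, and when) is more involved than simple versioning, and provenance tracking was deferred beyond the current charter's initial phase.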