Session Date/Time: 23 Jun 2025 13:00

AIPREF

Summary

The AIPREF Working Group held a virtual interim meeting to discuss the current status of the vocabulary and attachment drafts, gather feedback from participants, and prepare for the upcoming design team meeting in July. Key discussions revolved around refining the definitions and hierarchical structure of the vocabulary categories, particularly Text and Data Mining (TDM), AI Inference, and Search, as well as the practical implications and extensibility of the attachment mechanisms in HTTP headers and robots.txt. A strong emphasis was placed on documenting the rationale behind design choices and engaging the broader community.

Key Discussion Points

Meeting Logistics:
- A scribe (Felix) volunteered to take notes, with collaborative editing encouraged.
- The meeting was recorded.
- Chairs (Mark Nottingham, Suresh Krishnan) welcomed participants and described the meeting as an informal session for gathering feedback on the drafts.
- Attendees were reminded to use the queuing function for orderly discussion.
Draft Overview (presented by Martin Thomson):
- Vocabulary Draft:
  - Defines five broad categories: AI Training, Generative AI Training, AI Inference, Search, and Text and Data Mining (TDM).
  - TDM is currently defined as a superset of AI Training in the proposed hierarchy, with other categories potentially inheriting preferences.
  - Key open issues include: refining definitions for AI Inference and Search, ensuring clear hierarchical arrangement, and discussing methods for combining and extending the vocabulary.
  - The diagram in the draft was noted as needing correction.
- Attachment Draft:
  - Proposes an example syntax for expressing preferences (e.g., a label with a "yes" or "no" value).
  - Discusses potential extensions using parameters (e.g., day=Tuesday), new orthogonal terms (e.g., non-commercial), or arbitrary terms_of_use URLs.
  - Identifies two primary attachment methods: HTTP headers and robots.txt.
  - robots.txt adds complexity because preferences must be applied to different parts of a site and must be tracked for content acquired at different times.
  - A fundamental question remains whether these two attachment methods are sufficient or if others should be defined.
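As a rough illustration of the label/value syntax and parameter extension described above, the following Python sketch parses a preference string. The comma separator between preferences and the semicolon separator before parameters are assumptions made for this example only; the minutes do not fix a concrete grammar, and parse_preferences is a hypothetical helper, not something defined in the draft.

```python
def parse_preferences(text):
    """Parse items of the form label=yes|no[;param=value...].

    Assumed for illustration only: items are separated by commas and
    parameters follow the value after semicolons.
    """
    prefs = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        head, *param_parts = [part.strip() for part in item.split(";")]
        label, sep, value = head.partition("=")
        label, value = label.strip(), value.strip().lower()
        if not sep or not label or value not in ("yes", "no"):
            continue  # skip anything this sketch does not understand
        params = {}
        for part in param_parts:
            key, psep, pval = part.partition("=")
            if psep:
                params[key.strip()] = pval.strip()
        prefs[label] = {"allow": value == "yes", "params": params}
    return prefs


example = parse_preferences("tdm=no, generative-ai-training=yes;day=Tuesday")
```

Skipping unrecognized items rather than failing is one possible design choice; it keeps the sketch tolerant of extensions it does not know about.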
Discussion on Vocabulary (TDM, Hierarchy, Issues #3, #5):
- TDM Scope and Naming: Some participants expressed reluctance regarding having all five categories or the specific name "Text and Data Mining." A more neutral term like "computational analysis" was suggested for the broad top-level category.
- Hierarchy and Overrides: The intended design is a hierarchical structure in which a more specific preference (e.g., generative-ai-training=yes) overrides a broader one (e.g., tdm=no). This allows a catch-all default of "no" combined with specific opt-ins.
- Declarant vs. Rights Holder: Fars raised concerns about implying that the declarant is always the rights holder. Paul clarified that the draft uses "declaring party" to accommodate various legal contexts and that the system is independent of underlying legal frameworks.
- End-User Impact: Concerns were raised by Fars and Felix about how these preferences might impact end-user rights or access, particularly regarding search snippets or AI system interfaces (e.g., RAG scenarios).
- Diagrams and Rationale: Martin advised against fixating on the current illustrative diagram, which needs updates to reflect the complex relationships between categories. He emphasized the need to document the rationale behind definitions to provide context and prevent repetitive discussions.
- TDM as Catch-all: A sense of the room indicated support for a broad, catch-all category (like TDM or "all") to cover unforeseen future uses and to enable a "default no, then opt-in" preference model.
- EU Copyright Directive: Paul highlighted the importance of a TDM-equivalent category for European rights holders to exercise rights under the EU Copyright Directive.
- Determinism: Suresh noted that a TDM category provides determinism across different legal regimes (opt-in vs. opt-out).
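The override behavior discussed above (a more specific preference wins over a broader one, with the broad category acting as a catch-all) can be sketched as a walk up a category tree. This is a minimal sketch under stated assumptions: the labels and the PARENT mapping are hypothetical, only the TDM, AI Training, Generative AI Training chain is modeled, and AI Inference and Search are left out because their place in the hierarchy was still under discussion.

```python
# Hypothetical hierarchy for illustration: each category points to its
# broader parent; None marks the top-level catch-all.
PARENT = {
    "generative-ai-training": "ai-training",
    "ai-training": "tdm",
    "tdm": None,
}


def resolve(category, stated):
    """Return the most specific stated preference that applies to category.

    stated maps category labels to True (yes) or False (no). The result
    is None when neither the category nor any ancestor has a preference.
    """
    current = category
    while current is not None:
        if current in stated:
            return stated[current]
        current = PARENT.get(current)
    return None


# Default "no" at the top level, with one specific opt-in.
stated = {"tdm": False, "generative-ai-training": True}
generative = resolve("generative-ai-training", stated)  # the opt-in wins
training = resolve("ai-training", stated)  # inherits the tdm default
```

Returning None for "no preference stated" keeps that case distinct from an explicit "no", which matters if different legal regimes treat silence differently.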
Discussion on Vocabulary (AI Inference, Search):
- Inference Definition: Felix raised concerns that the term "inference" is ambiguous, potentially covering both AI system behaviors (e.g., RAG – independently fetching resources) and individual user behaviors (e.g., prompting with self-supplied assets). He feared that preferences applied to system behavior could inadvertently restrict user rights if intermediaries treat voluntary signals as legal directives.
- Narrowing Scope: Paul agreed that the "inference" category in the draft was intentionally broad, but that feedback indicates a need to narrow it, likely to specific AI system actions such as Retrieval-Augmented Generation (RAG), to avoid unintended harm to user autonomy.
- Style Reference: Leonard reiterated that the "inference" category originated from the use case of "style reference," where an asset influences an AI's output without being stored or used for training. He emphasized the need for authors to express preferences for such uses, regardless of the specific term.
- Preference vs. Access Control: Timid Robot and Leonard stressed that the system defines preferences, not an access control mechanism. Companies' incentives to restrict users based on these preferences, while a practical outcome, are largely outside the WG's scope. Martin and Paul agreed on the need to document potential less obvious effects.
Discussion on Attachment (Syntax, HTTP Header, robots.txt, Issue #51):
- Broader Attachment Methods: There was agreement that HTTP headers and robots.txt form a baseline. Liaison with groups like IPTC, C2PA, JPEG Trust, and ISO/IEC/ITU is planned to ensure alignment of this vocabulary with existing and future embedded metadata/attachment approaches. Leonard committed to collecting a list of relevant contacts.
- JSON Encoding: Brian suggested exploring a simple JSON key:value encoding for preferences. Chairs encouraged this as a separate, exploratory effort.
- robots.txt Syntax (Issue #51): Martin highlighted trade-offs in different proposed syntaxes for robots.txt preferences, especially regarding multi-level extensibility (attaching parameters to individual preferences vs. the entire preference set) and user-centric authoring. Backwards compatibility with existing robots.txt parsers and their handling of multiple groups was a key consideration.
- Hierarchy in Attachment: Sonia questioned how preferences from different attachment mechanisms (e.g., robots.txt, HTTP headers, embedded metadata) would combine. Martin clarified that robots.txt primarily governs acquisition (crawling), while preferences apply to use. How to combine potentially conflicting preferences from multiple sources remains an open question.
- Crawling vs. Use: The distinction between crawling (acquisition) and subsequent use of content is critical and needs clear documentation. Paul foresaw potential misunderstandings from declaring parties who might equate crawler presence with usage violations.
- Timeline of Preferences (Issue #12): This issue covers the validity of preferences over time (time of collection vs. time of use) and for content from sources that no longer exist. It was suggested to split Issue #12 into more granular topics.
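The JSON key:value encoding raised above was left as an exploratory effort, so no format exists yet. Purely as an illustration of what such an encoding might look like, the sketch below maps category labels to "yes" or "no" strings in a flat JSON object. The labels, the flat shape, and both helper functions are assumptions made for this example.

```python
import json


def encode_preferences(prefs):
    """Serialize a {label: bool} mapping as a flat JSON object of yes/no strings."""
    return json.dumps(
        {label: "yes" if allow else "no" for label, allow in prefs.items()},
        sort_keys=True,
    )


def decode_preferences(text):
    """Parse the flat JSON object back into {label: bool}, ignoring other values."""
    data = json.loads(text)
    return {
        label: value == "yes"
        for label, value in data.items()
        if value in ("yes", "no")
    }


encoded = encode_preferences({"tdm": False, "generative-ai-training": True})
decoded = decode_preferences(encoded)
```

Sorting the keys gives a stable serialization, which makes encoded preferences easier to compare or test.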

Decisions and Action Items

Agreements Reached:
- The Working Group will focus on the current two drafts (Vocabulary and Attachment) and the two primary attachment methods (HTTP headers and robots.txt) for initial delivery, with future expansion to other embedding mechanisms.
- A broad, top-level category similar to Text and Data Mining (potentially renamed to "computational analysis") will be retained due to its utility as a catch-all for unknown future uses and its relevance for aligning with legal frameworks like the EU Copyright Directive.
- Definitions for "AI Inference" and "Search" need to be narrowed to avoid ambiguity and potential unintended restrictions on end-user rights or behaviors.
- The system defines preferences, not an access control mechanism, and participants acknowledged that market dynamics will influence how these preferences are implemented by AI providers.
Action Items:
- Martin Thomson: Open an issue to discuss explicitly documenting the use case of deliberately overriding preferences and acknowledging its implications within the document.
- Leonard Rosenthol: Collect and share a list of individuals and groups actively involved in embedded metadata and other attachment mechanisms for future liaison and coordination.
- Editors/Chairs: Actively document the rationale and shared understandings behind design choices and definitions in the drafts to provide context and prevent rehashing past discussions.
- Chairs: Initiate a discussion on the mailing list regarding participation in a hackathon in Madrid to develop code or tests related to AIPREF.
- All Participants: Continue to engage actively with the mailing list and upcoming draft revisions, providing feedback early to ensure all issues are identified and addressed before the London meeting.
- Editors: Ensure clear documentation distinguishes between the acquisition of content (crawling) and its subsequent use, as preferences apply to the latter.

Next Steps

Editors will continue to refine the drafts based on the discussions, with a focus on narrowing definitions, clarifying hierarchy, and documenting rationale.
Participants are encouraged to review new draft versions and submit comments to the mailing list well in advance of the London meeting to enable early discussion.
The Working Group aims to ensure all major issues are identified and addressed to progress the drafts towards standardization.
Discussions will commence regarding participation in the Madrid hackathon to build reference implementations or tests.
The next in-person discussion will be at the London design team meeting in July.