Session Date/Time: 09 Apr 2025 11:00
AIPREF
Summary
The session focused on defining attachment mechanisms for AI preferences, specifically discussing HTTP headers and the reuse or extension of robots.txt. Strong support emerged for both HTTP headers and robots.txt (with modifications for purpose-based selection) as initial attachment mechanisms. The working group agreed to appoint editors to consolidate existing proposals into a concise draft for these two mechanisms. Key discussions also covered the challenge of combining multiple preference signals and the timeliness of preference application, with a general consensus that the working group should define technical interpretation rules but not delve into legal enforcement or retroactive application.
Key Discussion Points
- Note-Taker Volunteer: A volunteer (Farms) agreed to take notes for the afternoon session, with a request for others to assist in capturing major points and emerging conclusions.
- Order of Discussion: The session commenced by discussing "unit-based" attachment mechanisms (HTTP headers) as a warm-up before moving to "location-based" mechanisms (`robots.txt`).
- Attachment by HTTP Header Fields
- Concept: HTTP headers serve as metadata for content (e.g., PDF, image, HTML). They are key-value pairs indicating content type, caching info, and can convey generic information about the content.
- Advantages: HTTP headers offer granularity, as their scope is implicit to a single response. Servers can be configured to add specific headers, enabling fine-grained policies.
- Interaction with `robots.txt`: It was noted that crawlers generally retain HTTP headers when fetching content, which is crucial for their function. The possibility of `robots.txt` explicitly pointing to HTTP headers was raised but not fully resolved.
- Link Headers (RFC 8288): The use of link headers (RFC 8288) was discussed as a mechanism to reference external data. While recognized as in use (e.g., for C2PA data), the sense of the room was that link relations might constitute a separate form of attachment, to be considered distinct from direct HTTP header fields if pursued.
- Relevance to Content: The discussion highlighted the importance of defining "content-related" HTTP headers that travel with the content, ensuring metadata is preserved as content moves through systems.
- Scope for Composite Media: For HTTP headers, the policy applies to the specific representation provided by HTTP (e.g., an HTML page or an image file). For complex composite media (e.g., streaming video with ads) where elements might be assembled server-side and delivered as one, the HTTP header would apply to the entire container. If individual elements are fetched via separate HTTP requests, they would each have their own headers. It was acknowledged that defining policies for deeply embedded or tunnelled content (e.g., the MASQUE protocol) or very granular sub-elements within a single stream would likely require specific attachment mechanisms outside the current scope.
- Bad Actors / Rights Verification: The issue of content reposted by bad actors without correct headers was raised. The working group reiterated that its charter does not include technical mechanisms for verifying rights holders or enforcing compliance; rather, it focuses on defining how preferences can be expressed.
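For illustration, a server-attached preference could travel as a response header field alongside a link to richer external metadata. The `Content-Usage` field name and its values below are hypothetical, not an agreed design; `Link` with the registered `describedby` relation shows the RFC 8288 style discussed above:

```http
HTTP/1.1 200 OK
Content-Type: image/png
Content-Usage: ai-training=n, search=y
Link: <https://example.com/prefs.json>; rel="describedby"
```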
- Attachment by Location (`robots.txt`)
- Need for Purpose-Based Selection: A primary concern was the lack of a global opt-out for AI training in current `robots.txt` and the need to enable selection by "purpose" (e.g., AI training, search) rather than just "user-agent" or "location."
- Support for `robots.txt` Reuse: There was strong support for leveraging `robots.txt` due to its wide implementation, respect by crawlers, and performance benefits (e.g., avoiding fetching disallowed resources).
- Extensibility: The key-value syntax of `robots.txt` was considered extensible, with parsers designed to skip unrecognized fields, which can make it robust to new preference expressions.
- Concerns:
  - `robots.txt` is an older, "dumb" protocol, which can be prone to misconfiguration.
  - Some creative communities (e.g., photographers) view `robots.txt` as irrelevant to their needs.
  - While supporting purpose-based selection, the continued need for user-agent-specific rules was also acknowledged.
- Alternative (`ai.txt` pointer): An alternative of using `robots.txt` to point to a separate file (e.g., `ai.txt`) for AI preferences was discussed. The general sense was to prioritize integrating preferences directly into `robots.txt` first, and only consider a separate file if insurmountable syntax conflicts arise.
- Applicability beyond HTTP: It was clarified that `robots.txt` applies to any protocol (e.g., FTP) that uses URIs, not solely HTTP.
- "Crawling" vs. "Scraping": The distinction between "crawling" (mapping the internet) and "scraping" (making copies for use, often for training) was briefly discussed, highlighting that `robots.txt` preferences need to be relevant to both.
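As a sketch of what purpose-based selection could look like: the `User-agent`/`Disallow` lines are standard RFC 9309 syntax, while the `Disallow-Purpose`/`Allow-Purpose` fields are hypothetical extensions, relying on the behavior discussed above that existing parsers skip unrecognized fields:

```text
User-agent: *
Disallow: /private/

# Hypothetical purpose-based rules; legacy RFC 9309 parsers would ignore them.
Disallow-Purpose: ai-training
Allow-Purpose: search
```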
- Combination Rules (Multiple Preferences)
- The Problem: A single piece of content might have multiple, potentially conflicting, preference signals from different sources (e.g., `robots.txt`, HTTP header, embedded metadata).
- Proposed Algorithm: The most discussed algorithm was "most specific, most restrictive wins." This implies:
  - Resolve Specificity: Determine which preference signal is more specific (e.g., embedded metadata is often seen as more specific than an HTTP header, which is more specific than a `robots.txt` rule).
  - Apply Restrictiveness: Among the most specific applicable rules, the most restrictive preference takes precedence.
- Hierarchy of Specificity: The discussion touched on the hierarchy of location-based vs. unit-based (asset-based) preferences, with a leaning towards asset-based (embedded/HTTP header) being more specific.
- Legal Context vs. Technical: The group acknowledged that external factors like laws, agreements, or platform policies might override technical combination rules, but these are out of the scope of the WG's technical definition.
- Multiple Copies: For identical content existing on different servers with different preferences, the current consensus is to treat these as distinct instances; a technical solution for "shopping for least restrictive" content across the web is not within the current charter.
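The "most specific, most restrictive wins" algorithm can be sketched as follows. This is illustrative only: the specificity ranking (embedded > HTTP header > `robots.txt`) reflects the leaning discussed in the session, and the signal names and boolean allow/disallow model are assumptions, not WG output:

```python
# Sketch of "most specific, most restrictive wins" combination.
# Higher number = more specific, per the hierarchy leaned toward in the session.
SPECIFICITY = {"robots.txt": 0, "http-header": 1, "embedded": 2}

def combine(signals):
    """signals: list of (source, allow) pairs, where allow is True/False.
    Returns the effective allow (True) / disallow (False) decision."""
    if not signals:
        return True  # no preference expressed: default to allow
    # Step 1: resolve specificity -- keep only the most specific signals.
    top = max(SPECIFICITY[src] for src, _ in signals)
    most_specific = [allow for src, allow in signals if SPECIFICITY[src] == top]
    # Step 2: among equally specific signals, the most restrictive wins.
    return all(most_specific)

# An embedded "allow" overrides a less specific robots.txt "disallow"...
print(combine([("robots.txt", False), ("embedded", True)]))      # True
# ...but conflicting signals at the same level resolve to disallow.
print(combine([("http-header", True), ("http-header", False)]))  # False
```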
- Timeliness and Staleness of Preferences
- The Problem: How should preferences apply to content obtained at different times? For example, if content was scraped years ago, should the current `robots.txt` policy apply, or the one from when it was fetched?
- Current Practice: Crawler operators generally refresh `robots.txt` every 12-24 hours and apply the policy contemporaneous with content acquisition.
- Retroactive Application: There were strong arguments against defining retroactive application of new preferences to old data, citing legal issues (certainty, fairness) and technical complexity (content changing over time).
- Decision: The WG will aim to define how preferences are interpreted at the time of usage/acquisition but will explicitly state that retroactive application of preferences is out of scope. This ensures the technical framework doesn't overstep its authority. HTTP caching semantics are not suitable for defining content freshness in this context.
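The refresh-and-apply-at-acquisition practice described above can be sketched as a small cache. The class and method names and the 24-hour window are illustrative assumptions, not a specified mechanism:

```python
# Sketch of "apply the policy contemporaneous with acquisition":
# refresh the policy periodically, and record the policy in force at fetch
# time alongside the content rather than re-judging it against later policy.
import time

REFRESH_SECONDS = 24 * 60 * 60  # crawlers reportedly refresh every 12-24h

class PolicyCache:
    def __init__(self, fetch_policy):
        self._fetch = fetch_policy  # callable: host -> policy text
        self._cache = {}            # host -> (fetched_at, policy)

    def policy_for(self, host, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(host)
        if entry is None or now - entry[0] > REFRESH_SECONDS:
            entry = (now, self._fetch(host))
            self._cache[host] = entry
        return entry[1]

    def acquire(self, host, content, now=None):
        # Snapshot the policy at acquisition time with the content, so later
        # uses are judged against it -- no retroactive reinterpretation.
        return {"content": content, "policy": self.policy_for(host, now)}

cache = PolicyCache(lambda host: f"policy-for-{host}")
record = cache.acquire("example.com", b"...")
print(record["policy"])  # policy-for-example.com
```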
- Registry of Attachment Mechanisms
- Proposal: An IANA registry was proposed to authoritatively list the various attachment mechanisms (e.g., HTTP headers, `robots.txt`, embedding in specific file formats).
- Arguments For: Provides a clear, extensible list for implementers, aids discovery for actors needing to respect all preferences, and offers a common reference point for policymakers.
- Arguments Against: Some felt it was premature or unnecessary "handholding," potentially raising political issues about inclusion criteria or implying enforcement beyond the WG's scope.
- Status: No firm decision was made. It was noted as an issue to consider for future work or guidance, rather than an immediate deliverable.
Decisions and Action Items
- Attachment Mechanism for HTTP Headers: The working group supports defining an attachment mechanism using HTTP header fields.
- Attachment Mechanism for `robots.txt`: The working group supports reusing and updating `robots.txt` for location-based attachment, with the specific goal of accommodating purpose-based selection. The idea of a separate pointer file (`ai.txt`) is considered a fallback if direct integration into `robots.txt` proves technically infeasible due to syntax constraints.
- Document Structure:
- The overall standard is envisioned to consist of at least two main documents: one for the core vocabulary, abstract data model, and default serialization; and a second for specific attachment mechanisms.
- The initial attachment document will focus on `robots.txt` and HTTP headers.
- The document defining combination rules will likely be part of the vocabulary document.
- Editors for Attachment Draft: Chairs will appoint editors to survey existing drafts and create a compact, succinct proposal for the `robots.txt` and HTTP header attachment mechanisms, to be brought to the working group for a call for adoption.
- Issues List: An issues list will be established to track all identified problems related to attachment, including:
- Timeliness of preference application.
- Definition and use of link relations.
- Discovery of applicable attachment mechanisms.
- Handling of encapsulated/multipart media (containers).
- Specific `robots.txt` integration details (e.g., single file vs. indirection).
- Proposals for embedding preferences in specific file formats.
- Out of Scope for Current Charter: The working group confirmed that its current charter explicitly excludes:
- Enforcement mechanisms for preferences.
- Determining the validity or authority of parties making preference assertions.
- Retroactive application of preferences to previously acquired content.
- Technical mechanisms for verifying authenticity or rights.
- Defining attachment mechanisms for all other protocols (e.g., SMTP, MASQUE) or deeply embedded composite media beyond HTTP's direct content representation, though external drafts on these are welcome.
Next Steps
- Chairs will proceed with appointing editors for the attachment mechanisms draft.
- The working group will establish an issues list to track attachment-related problems.
- Discussions on combination rules and potential proposals will continue, likely on the mailing list and in future sessions.
- The next session is scheduled for tomorrow, where the group plans to discuss next steps for the working group and potential future meetings. Attendees are encouraged to prepare issues they wish to discuss productively.
Session Date/Time: 09 Apr 2025 07:15
AIPREF
Summary
The session continued the in-depth discussion on the core vocabulary for AI preferences, prioritizing it over the attachment mechanism due to its foundational nature. A significant portion of the discussion focused on whether and how to include "inference" and "RAG" (Retrieval Augmented Generation) use cases within the vocabulary, alongside a renewed focus on "search." While there was a strong inclination to address these areas, participants emphasized keeping the definitions coarse-grained for an initial deliverable by August, with a clear path for future extensibility. The discussion also touched on the hierarchy and nesting of terms, the definition of "AI," and the importance of ensuring the vocabulary is understandable by content holders and actionable by AI system operators, while acknowledging the broader legal and policy landscape without directly legislating.
Key Discussion Points
- Meeting Logistics: Audio checks were performed for remote participants. Joe was volunteered as the note-taker for the first two hours. New and existing participants introduced themselves.
- Agenda Prioritization: The Chairs noted the significant progress made on vocabulary discussions yesterday. It was decided to continue focusing on vocabulary for the morning session, reserving more time for this foundational topic, and defer the attachment discussion until after lunch.
- Review of Previous Discussions: The Chairs recapped yesterday's discussion, highlighting the abstract model for preferences with a suggested serialization, and the inclination to include search and discovery use cases.
- "RAG" (Retrieval Augmented Generation) and Inference Use Cases:
- Paul explained that his initial draft did not accommodate these use cases due to their late emergence in initial discussions and the undefined nature of what "inference" truly encompasses. He expressed concern that including them could significantly slow down the working group.
- Leonard (Adobe) expressed strong support for including inference, citing real-world user requests from creatives who want to prevent "style referencing" (e.g., "make me something like this"). Leonard highlighted that C2PA AI preferences already include an "inference" flag based on these requests and that Adobe prioritizes the owner's preference over the end-user's in such cases.
- Bradley (News/Publishing community) emphasized that RAG is the number one concern for the publishing community, particularly in signalling preferences for content not to be used for advertising purposes. He argued that omitting RAG would be a "blow" to the news/magazine community and suggested the group should at least try to tackle it.
- Gary (Microsoft) stated that Microsoft AI solutions today do not typically look at `robots.txt` beyond meta tags like `noarchive`. He suggested a generic "no gen AI" rule could be sufficient for publishers, covering no LLM training and no retrieval.
- Hiroshi (Keio University) highlighted that Japanese copyright law is highly permissive for AI training, making "inference" or "use of generated content" a critical concern in Japan, especially if it violates copyright. He argued against ignoring inference.
- Many participants expressed a desire to be inclusive of these use cases, even if it meant an initial "imperfect" definition, rather than deferring them. Some suggested focusing on "usage" or "actions" rather than specific AI techniques (like RAG), which are rapidly evolving.
- Joe proposed distinguishing between "agentic" (non-human eyeballs) and "user-based prompting," suggesting that agentic uses of content (e.g., an AI summarizing a paywalled article to circumvent paywalls) are a core concern.
- Sonia (European Commission) raised concerns about the broad impact of including RAG/inference on accessibility tools, safety features, and spam filters, noting that the European Accessibility Act will come into force soon. She suggested considering alternatives to input prevention, such as output filters.
- The discussion highlighted the tension between rights holders' desire for fine-grained control and the need for a simple, actionable vocabulary for both content holders and AI operators.
- Granularity of Vocabulary: There was a strong sense that the initial vocabulary should be coarse-grained, focusing on high-level uses/actions rather than overly specific AI techniques or legal interpretations. This approach aims for an achievable "Minimum Viable Product" (MVP) by August, with mechanisms for future extensibility.
- Nesting and Hierarchy of Terms: The existing draft's use of nesting (e.g., Generative AI Training being a subset of AI Training) was discussed.
- Martin illustrated conceptual overlaps and subsets, noting that while strict subsets exist for some terms, others (like search vs. inference) may overlap or be independent.
- The value of nesting for expressing preferences at a higher level (e.g., "no AI training" applying to "no generative AI training") and for long-term extensibility was acknowledged by some.
- Others questioned whether nesting simplifies or complicates the framework for the average user, suggesting clear, independent definitions might be preferable if there are only a small number of core terms.
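The nesting idea discussed above can be sketched as a walk up a term hierarchy: a preference stated for a parent category (e.g., "no AI training") also covers its subsets (e.g., generative AI training), unless a more specific preference is stated. The term names and tree below are illustrative assumptions, not the draft's vocabulary:

```python
# Sketch of nested vocabulary resolution. A use with no explicit preference
# inherits the nearest ancestor's preference; an explicit child entry wins.
PARENT = {
    "genai-training": "ai-training",  # Generative AI Training as a subset
    "ai-training": "tdm",             # placement under TDM was debated
}

def effective(preferences, use):
    """preferences: dict mapping term -> 'allow'/'disallow'.
    Walks up the hierarchy until an explicit preference is found;
    returns None if no applicable preference exists."""
    while use is not None:
        if use in preferences:
            return preferences[use]
        use = PARENT.get(use)
    return None

prefs = {"ai-training": "disallow"}
print(effective(prefs, "genai-training"))  # disallow (inherited from parent)
print(effective(prefs, "search"))          # None (independent term, no rule)
```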
- TDM (Text and Data Mining): There was an inclination to retain the TDM term in the draft due to its prevalence in legal frameworks globally, but with a need for clear definition within the vocabulary's context and potentially a clarification of its relationship to other terms (e.g., whether inference is a strict subset).
- Other Vocabulary Issues (Briefly Touched/Deferred):
- "What is AI?": Acknowledged as complex and deferred.
- "Mapping to preferences" (e.g., "allow" or "disallow"): To be addressed as part of the attachment mechanism.
- "Time dimension" (e.g., a timestamp for when a policy was set): Discussed whether it belongs in the core vocabulary or the attachment mechanism; deferred for further proposals defining use cases and requirements.
- "Spam filtering, offensive language, weapons/biometrics" (purpose-based controls): Agreed to defer these for later, ensuring the framework allows for future extension.
- "Eurocentricity": Agreed to address editorially, aiming for exemplary language rather than canonical legal text, acknowledging that some terms have legal origins but striving for global applicability.
Decisions and Action Items
- Decision: The working group will prioritize continuing the discussion and development of the core vocabulary for AI preferences.
- Decision: A sense of those present indicates strong interest in trying to address "inference" and "RAG" use cases within the vocabulary, aiming for coarse-grained definitions with extensibility.
- Decision: A sense of those present indicates continued interest in addressing "search" use cases within the vocabulary.
- Action Item: Participants interested in proposing vocabulary terms and their definitions for "inference," "RAG," and "search" use cases are invited to submit their proposals to the mailing list in the coming weeks. Proposals can be sent as emails or Internet-Drafts and should explicitly describe the meaning of the terms for both content holders and AI system operators.
- Decision: Detailed discussions on "what is AI," "mapping to preferences," "time dimension" (for the vocabulary itself), and "purpose-based controls" (e.g., spam, offensive content, biometrics) are deferred for later stages, with the understanding that the framework should allow for future extension.
- Action Item: Editors will continue to refine the draft to address "eurocentricity" editorially, aiming for exemplary language and broader applicability.
- Action Item: Glenn (and others interested) are invited to submit a proposal outlining use cases and requirements for including a "time dimension" (e.g., policy timestamp) in the context of preferences for future discussion.
Next Steps
- Continue the vocabulary discussion on the mailing list, focusing on the new proposals for "inference," "RAG," and "search" terms and their precise definitions.
- The afternoon session will shift focus to the attachment mechanism.
- The Chairs will consider potential future meetings to continue refining the vocabulary based on proposals and ongoing discussions.