GPAI Code Draft 3: Copyright Changes

Background

The European AI Act (Regulation (EU) 2024/1689, or the ‘AI Act’) places specific obligations on providers of General-Purpose AI (‘GPAI’) models. These models, including those from the GPT family, Llama, and Gemini, must adhere to requirements such as comprehensive documentation and the establishment of a policy ensuring compliance with EU copyright law.

To facilitate adherence to these stipulations, the AI Act anticipates the development of Codes of Practice tailored for GPAI models. Following an invitation from the AI Office, various experts and stakeholders formed four working groups dedicated to drafting an initial Code of Practice. Approval of this Code by the EU Commission would grant it ‘general validity’ across the EU. Adoption of the approved GPAI Code of Practice offers companies a means to demonstrate proactive compliance, potentially mitigating regulatory scrutiny and associated penalties.

The AI Office recently released the third draft of the Code of Practice (‘3rd Draft’) produced by these working groups. This draft encompasses several key areas:

  • Commitments
  • Transparency
  • Copyright
  • Safety and Security

The final version of this Code of Practice is slated for release on May 2, 2025.

This document delves into the key details of the copyright section of the 3rd Draft. A notable shift from the second draft (‘2nd Draft’) is the 3rd Draft’s more streamlined and concise approach. Another key change is that, unlike the 2nd Draft, the 3rd Draft generally provides that compliance efforts should be commensurate with the provider’s size and capabilities.

Who is this relevant for?

The Code of Practice primarily targets providers of GPAI models. These models are characterized by their significant generality and their ability to proficiently execute a broad spectrum of distinct tasks. This encompasses providers of well-known large language models like GPT (OpenAI), Llama (Meta), Gemini (Google), and Mistral (Mistral AI). However, smaller model providers may also fall under its purview, provided their models can be utilized for a diverse range of tasks. Furthermore, businesses that fine-tune models for their specific applications might also be classified as GPAI model providers.

‘Downstream providers,’ or businesses that integrate GPAI models into their AI systems, should also familiarize themselves with the Code of Practice. This Code is poised to become a quasi-standard for GPAI models, defining the expectations for AI system developers regarding GPAI model capabilities. This understanding can be crucial during contract negotiations with GPAI model providers.

Establishing a Copyright Policy

Providers of GPAI models are obligated to establish a policy that ensures compliance with EU copyright law (Art. 53 (1) (c) AI Act). Given the novelty of this requirement, practical guidance on the structure and content of such a policy has been lacking. The Code of Practice aims to address this gap.

The Code of Practice mandates that providers implement the following measures:

Providers who sign the Code of Practice (‘Signatories’) are required to formulate, maintain, and implement a copyright policy that aligns with EU copyright law. This requirement is directly derived from the AI Act. Signatories must also ensure that their organizations adhere to this copyright policy.

A significant departure from the 2nd Draft is that the 3rd Draft no longer mandates the publication of the copyright policy. Signatories are merely encouraged to do so. This reduced requirement is logical, as the AI Act itself does not compel model providers to publish their copyright policies.

Web Crawling of Copyrighted Content

Signatories are generally permitted to employ web crawlers for text and data mining (‘TDM’) purposes to gather training data for their GPAI models. However, they must ensure that these crawlers respect technologies designed to restrict access to copyrighted materials, such as paywalls.

Moreover, Signatories are obligated to exclude ‘piracy domains,’ which are online sources that primarily engage in the distribution of copyright-infringing materials.

Identifying and Complying with TDM Opt-outs When Web Crawling

Signatories must ensure that their web crawlers identify and respect TDM opt-outs declared by rightsholders. While EU copyright law generally permits TDM, rightsholders retain the right to opt out. For web content, this opt-out must be machine-readable. The 3rd Draft elaborates on the requirements for web crawlers, specifying that they must identify and comply with the widely adopted robots.txt protocol. Additionally, web crawlers must adhere to other relevant machine-readable TDM opt-outs, such as metadata established as an industry standard or solutions commonly used by rightsholders.
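To make the robots.txt obligation concrete, here is a minimal sketch in Python of the kind of check a compliant crawler must perform before fetching a page, using the standard library’s urllib.robotparser. The user-agent names, site, and robots.txt content are hypothetical illustrations, not taken from the Code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a news site: it bars a (fictional)
# AI-training crawler entirely while allowing other crawlers
# everywhere except a private area.
ROBOTS_TXT = """\
User-agent: example-ai-trainer
Disallow: /

User-agent: *
Disallow: /private/
"""

def may_crawl(user_agent: str, url: str, robots_txt: str) -> bool:
    """Return True if the given robots.txt permits this agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The AI-training crawler must skip the site entirely; a generic
# crawler may still fetch public pages but not the private area.
print(may_crawl("example-ai-trainer", "https://news.example/articles/1", ROBOTS_TXT))  # False
print(may_crawl("generic-bot", "https://news.example/articles/1", ROBOTS_TXT))         # True
print(may_crawl("generic-bot", "https://news.example/private/x", ROBOTS_TXT))          # False
```

In a real crawler the check would run against the live robots.txt of each host (e.g. via RobotFileParser.set_url and read) before every request, and a TDM-specific user agent would be declared so that rightsholders can target it.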

Signatories are required to take reasonable steps to inform rightsholders about the web crawlers in use and how these crawlers handle robots.txt directives. This information can be disseminated through various channels, such as a web feed. Notably, the 3rd Draft no longer includes an obligation to publish this information.

Identifying and Complying with a TDM Opt-out for Non-Web-Crawled Content

GPAI model providers may also acquire datasets from third parties rather than conducting web crawling themselves. While the 2nd Draft mandated copyright due diligence on third-party datasets, the 3rd Draft only requires reasonable efforts to obtain information on whether the web crawlers used to gather the data complied with the robots.txt protocol.

Mitigating the Risk of Copyright-Infringing Output

A significant risk associated with AI usage is that a model may generate output that infringes copyright, for example by reproducing copyrighted code or images found online.

Signatories are required to make reasonable efforts to mitigate this risk. This represents a more lenient approach compared to the 2nd Draft, which prescribed measures to avoid ‘overfitting.’ The 3rd Draft adopts a more technology-neutral stance, emphasizing reasonable efforts.

Furthermore, Signatories must incorporate a clause in their terms and conditions (or similar documents) for providers of downstream AI systems, prohibiting the use of their GPAI model in a manner that infringes on copyright.

Designating a Point of Contact

Signatories are required to provide a point of contact for rightsholders. They must also establish a mechanism that allows rightsholders to submit complaints regarding copyright infringements.

Under the 3rd Draft, Signatories have the option to refuse to process complaints that are deemed unfounded or excessive.

The 3rd Draft, while seemingly streamlined, introduces nuances and shifts in emphasis that warrant a closer look. Let’s dissect each section further:

Publication of the Copyright Policy: From Mandate to Encouragement

The 2nd Draft’s mandate to publish the copyright policy raised concerns about competitive disadvantages and the exposure of sensitive information: detailed internal policies could reveal proprietary details of a provider’s training methods and data sources. By encouraging rather than requiring publication, the 3rd Draft acknowledges these concerns and lets providers keep their internal compliance strategies confidential while still promoting transparency. The ‘encouragement’ nonetheless places subtle pressure on providers to be open about their policies, which could lead to a de facto standard of publication over time. In short, the 3rd Draft attempts to find a middle ground between transparency and the legitimate business interests of GPAI providers.

Web Crawling: Balancing Data Access and Creators’ Rights

The explicit permission for web crawling, coupled with the requirement to respect access restrictions such as paywalls, reflects a delicate balancing act: the AI Act recognizes the importance of data for training AI models, but it also underscores that content creators have a right to control access to their work, even for TDM purposes. The exclusion of ‘piracy domains’ is a crucial addition, explicitly targeting sources that primarily distribute copyright-infringing material. It sends a strong signal that AI development should not be built on a foundation of illegally obtained data.

TDM Opt-outs: The Technical Specificity of Compliance

The 3rd Draft’s emphasis on the robots.txt protocol and other machine-readable opt-out mechanisms moves beyond general principles to concrete technical guidance. For providers, it outlines the steps their crawlers must take to respect opt-out requests; for rightsholders, it clarifies how to effectively signal their preferences regarding TDM. robots.txt, a widely used standard for controlling web crawler access, offers a clear and readily implementable compliance mechanism, while the references to ‘industry standard’ metadata and ‘widely adopted’ solutions keep the Code adaptable to an evolving landscape of opt-out technologies. This specificity is crucial for making the Code of Practice not just a set of abstract principles but a practical guide for real-world implementation.
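As one illustration of the ‘industry standard metadata’ category: the W3C Community Group’s TDM Reservation Protocol (TDMRep) lets rightsholders reserve TDM rights via a tdm-reservation signal, for instance as an HTTP response header. The helper below is our own sketch of such a check, not something prescribed by the Code, and the sample headers are hypothetical.

```python
def tdm_opt_out(headers: dict[str, str]) -> bool:
    """Return True if HTTP response headers signal a TDM rights reservation.

    Follows the TDMRep convention of a 'tdm-reservation' header whose
    value '1' reserves TDM rights. Header names are case-insensitive,
    so we normalize before looking the field up.
    """
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("tdm-reservation") == "1"

# A response carrying the reservation header opts the page out of TDM;
# a response without it (or with value '0') does not.
print(tdm_opt_out({"Content-Type": "text/html", "TDM-Reservation": "1"}))  # True
print(tdm_opt_out({"Content-Type": "text/html"}))                          # False
```

A production crawler would combine such a check with robots.txt handling and with any in-document signals (e.g. an equivalent meta tag), treating any of them as a binding opt-out.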

Non-Web-Crawled Content: Shifting Responsibility and Due Diligence

The change from ‘copyright due diligence’ to ‘reasonable efforts to obtain information’ regarding third-party datasets represents a subtle but significant shift in responsibility. The 2nd Draft placed a heavier burden on GPAI providers to actively investigate the copyright status of datasets; the 3rd Draft focuses on verifying whether the third party’s data collection respected robots.txt. This reflects the practical realities of the data supply chain: providers often rely on third-party datasets and may have no direct control over, or practical means to fully audit, how that data was acquired. They are nevertheless expected to take reasonable steps to confirm that the data provider respected TDM opt-outs during collection.

Mitigating Infringing Output: From ‘Overfitting’ to ‘Reasonable Efforts’

The move away from the term ‘overfitting’ is a welcome change. ‘Overfitting’ is a technical term in machine learning for a model that performs well on training data but poorly on new data. While overfitting can contribute to copyright infringement (e.g., by memorizing and reproducing copyrighted material), it is not the only cause. The 3rd Draft’s technology-neutral focus on ‘reasonable efforts to mitigate risk’ covers a wider range of infringement scenarios and mitigation strategies, such as output filtering and content moderation, and acknowledges that perfect prevention of infringement may be unattainable, making a risk-based approach more practical and more adaptable to evolving AI technology.

Point of Contact and Complaint Mechanism: Streamlining the Process

The designated point of contact gives rightsholders a clear channel for communicating with GPAI providers, and the complaint mechanism allows them to formally raise concerns about potential infringements. The option for Signatories to refuse ‘unfounded or excessive’ complaints is a practical addition: it protects providers from being overwhelmed by frivolous claims while keeping the mechanism a viable and efficient tool for addressing legitimate copyright concerns.

The Broader Implications and Future Considerations

The 3rd Draft of the GPAI Code of Practice represents a significant step towards operationalizing the copyright provisions of the AI Act. It provides much-needed clarity and guidance for GPAI providers, while also seeking to protect the rights of content creators. However, several broader implications and future considerations remain:

  • The ‘Reasonable Efforts’ Standard: The repeated use of the phrase ‘reasonable efforts’ introduces a degree of subjectivity. What counts as ‘reasonable’ will likely be shaped over time by court decisions, regulatory guidance, and evolving industry norms. This ambiguity creates uncertainty for providers, but it also allows flexibility and adaptation to different contexts; how the standard is interpreted will be a key factor in the Code’s effectiveness.

  • The Role of Downstream Providers: While the Code primarily targets GPAI providers, downstream providers have a vested interest in understanding its provisions. The Code sets expectations for the quality and compliance of GPAI models, which can inform contract negotiations and risk assessments. Downstream providers may also face indirect pressure to align their use of GPAI models with the Code’s principles, for example by conducting due diligence on GPAI providers and incorporating contractual clauses addressing copyright compliance.

  • The Evolution of Technology: The rapid pace of AI development means the Code of Practice will need to be a living document. New techniques for data acquisition, model training, and output generation may require updates to its provisions; the references to ‘industry standard’ metadata and ‘widely adopted’ solutions acknowledge this need for ongoing adaptation, which will require regular review and collaboration between policymakers, industry stakeholders, and technical experts.

  • International Harmonization: The EU AI Act is a pioneering piece of legislation, but it is not operating in a vacuum; other jurisdictions are also grappling with the challenges of regulating AI. The Act could serve as a model elsewhere, yet international harmonization of AI regulations, including copyright provisions, will be crucial to avoid fragmentation and ensure a level playing field for AI developers.

  • The Impact on Innovation: The Code of Practice aims to strike a balance between promoting AI innovation and protecting copyright, but the impact of these rules on the pace and direction of AI development remains to be seen. Some argue that overly strict regulation could stifle innovation; others contend that clear rules are necessary to foster responsible AI development. The long-term effect will depend on how the Code is implemented and enforced.

  • Enforcement and Monitoring: The Code’s effectiveness will depend largely on how adherence is checked. Clear mechanisms for monitoring compliance and enforcing the Code, potentially including audits, inspections, and penalties for non-compliance, will be essential to its success.

The 3rd Draft of the GPAI Code of Practice is a complex and evolving document with far-reaching implications. It represents a significant effort to address the challenges of copyright compliance in the age of AI, but it is also a work in progress. Ongoing dialogue among GPAI providers, rightsholders, policymakers, and the broader AI community will be essential to ensure that the Code achieves its intended goals, adapts to rapid technological change, and strikes a lasting balance between protecting copyright and fostering innovation.