Meta Lawsuit: AI Training & Copyright Info Removal

The lawsuit, Kadrey et al. v. Meta Platforms, centers on the allegation that Meta not only used copyrighted material without authorization to train its AI models but also actively removed copyright management information (CMI) from that material. This removal, plaintiffs argue, violates the Digital Millennium Copyright Act (DMCA). CMI includes details such as author names, copyright notices, terms of use, and other identifying information attached to copyrighted works.

The plaintiffs, authors Richard Kadrey, Sarah Silverman, and Christopher Golden, initially filed a class-action lawsuit against Meta, claiming the company’s use of their copyrighted books to train its Llama large language models was unlawful. However, the case took a significant turn in January 2025 when the plaintiffs amended their complaint to specifically address the issue of CMI removal. They asserted that Meta was aware that its AI models were being trained on copyrighted material and that the outputs of these models could potentially contain CMI. More crucially, they alleged that Meta deliberately stripped this CMI from the training data to conceal the origin of the AI-generated content and avoid copyright infringement claims.

The plaintiffs’ legal team argues that this removal of CMI is a separate and distinct violation of the DMCA, independent of the underlying copyright infringement. The DMCA prohibits the intentional removal or alteration of CMI, as well as the distribution of works with removed or altered CMI, knowing that this will induce, enable, facilitate, or conceal infringement.
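To make the technical allegation concrete: in a text-preprocessing pipeline, stripping CMI can be as simple as filtering out lines that match copyright-notice patterns before the text is fed to a model. The following is a purely hypothetical sketch; the patterns, function name, and sample text are invented for illustration and do not describe Meta's actual pipeline or any evidence in the case.

```python
import re

# Hypothetical patterns for common forms of copyright management
# information (CMI): copyright notices, "all rights reserved"
# statements, and ISBN lines. Real pipelines, if any, would differ.
CMI_PATTERNS = [
    re.compile(r"(?im)^.*copyright\s+©?\s*\d{4}.*$"),
    re.compile(r"(?im)^.*all rights reserved.*$"),
    re.compile(r"(?im)^.*isbn[\s:]*[\d-]{10,17}.*$"),
]

def strip_cmi(text: str) -> str:
    """Remove lines matching the hypothetical CMI patterns above."""
    for pattern in CMI_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the blank lines left behind by the removals.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

sample = (
    "Copyright 2013 Example Author\n"
    "All rights reserved.\n"
    "ISBN: 978-0-000-00000-0\n\n"
    "Chapter 1\n"
    "It was a dark and stormy night..."
)

print(strip_cmi(sample))
```

The point of the sketch is that such a step erases exactly the identifying information the DMCA protects, which is why the plaintiffs treat CMI removal as a violation distinct from the underlying infringement.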

Judge’s Ruling: DMCA Claim to Proceed

In a significant ruling, Judge Vince Chhabria of the San Francisco federal court allowed the plaintiffs’ DMCA claim to proceed. While acknowledging that the inference was “not particularly strong,” Judge Chhabria stated that the plaintiffs’ allegations raised a “reasonable” inference that Meta removed CMI to prevent its Llama AI models from outputting CMI and, consequently, revealing the use of copyrighted material in their training. This ruling is crucial because it allows a key aspect of the lawsuit to move forward, increasing the likelihood of the case either reaching a settlement or proceeding to a full trial.

The judge’s decision highlights the importance of CMI in protecting copyrighted works. By allowing the DMCA claim to proceed, the court is signaling that the removal of CMI is a serious issue that warrants further investigation, even in the context of AI training. This ruling could have significant implications for other AI companies that may be engaging in similar practices.

It’s important to note that Judge Chhabria did not rule on the merits of the DMCA claim itself. He determined only that the plaintiffs’ allegations were plausible enough for the claim to survive dismissal. The burden will still be on the plaintiffs to prove that Meta intentionally removed CMI and that it did so to facilitate or conceal copyright infringement.

Meta’s Admission and the Books3 Dataset

Meta’s own admissions have played a role in the progression of the case. The company has acknowledged using a dataset known as Books3 in the training of its Llama 1 large language model. Books3 is a controversial dataset, widely known to contain a vast collection of copyrighted books, many of which were likely included without the permission of the copyright holders. The existence of Books3, and Meta’s admitted use of it, lends credence to the plaintiffs’ claims that Meta knowingly used copyrighted material in its AI training.

The Books3 dataset was compiled by researcher Shawn Presser and is part of a larger collection called The Pile. It has been a subject of debate within the AI community, with some arguing that its use is essential for training powerful language models, while others condemn its inclusion of copyrighted works without proper licensing. The fact that Meta relied on this dataset raises questions about the company’s due diligence in ensuring that its training data was obtained legally and ethically.

Partial Dismissal of Claims

While the DMCA claim regarding CMI removal is moving forward, Judge Chhabria dismissed one of the plaintiffs’ other claims: that Meta’s use of unlicensed books obtained through peer-to-peer torrent networks to train Llama violated California’s Comprehensive Computer Data Access and Fraud Act (CDAFA). The judge found that the plaintiffs had not sufficiently alleged that Meta’s conduct met the statute’s specific requirements.

The dismissal of this claim, however, does not significantly weaken the overall case against Meta. The core issue of copyright infringement and the DMCA violation related to CMI removal remain, and these are arguably the stronger and more impactful claims.

Expert Opinion: DMCA Claim and Fair Use

Legal experts have weighed in on the significance of the judge’s ruling and its potential implications. Edward Lee, a professor of law at Santa Clara University, cautioned against drawing broad conclusions about fair use based solely on the DMCA claim proceeding. He pointed out that Judge Chhabria expressed skepticism about the plaintiffs’ ability to ultimately prove the DMCA claim and suggested the possibility of revisiting the issue on summary judgment.

Professor Lee emphasized that the plaintiffs’ attorneys had successfully identified a more specific factual basis for their DMCA claim, which had previously been dismissed. This suggests that the legal arguments surrounding CMI removal are evolving and becoming more refined. The focus on CMI provides a more concrete legal hook than simply arguing about the general use of copyrighted material in AI training.

The question of fair use remains a central, unresolved issue in many of these AI copyright cases. Fair use is a legal doctrine that allows limited use of copyrighted material without permission under certain circumstances, such as for criticism, commentary, news reporting, teaching, scholarship, or research. Whether the use of copyrighted material to train AI models qualifies as fair use is a complex and hotly debated topic. The courts have yet to provide clear guidance on this issue, and the outcome of cases like Kadrey et al. v. Meta Platforms could play a significant role in shaping the legal landscape.

The progression of the CMI claim against Meta, along with a previous ruling in favor of Thomson Reuters against Ross Intelligence (a case involving the use of copyrighted Westlaw headnotes to train an AI legal-research tool), suggests a potential shift in how courts are viewing the use of copyrighted material in AI training. These decisions could embolden plaintiffs in other ongoing AI-related lawsuits and potentially lead to more scrutiny of AI companies’ data acquisition and training practices.

For example, the case Tremblay et al. v. OpenAI et al. was recently amended to revive a previously dismissed DMCA claim. The amended complaint, citing new evidence uncovered during discovery, argues that OpenAI also removed CMI during the training of its large language models. This demonstrates how the legal strategies in these cases are evolving and how plaintiffs are learning from each other’s successes and failures.

The increasing focus on CMI removal could become a significant trend in AI copyright litigation. It provides a more tangible and specific legal argument than simply claiming general copyright infringement. By focusing on the deliberate removal of identifying information, plaintiffs can potentially demonstrate a more intentional and culpable act on the part of AI companies.

The legal battles surrounding AI and copyright highlight the complex challenges of balancing innovation with intellectual property rights. The rapid advancement of AI technology has outpaced the development of clear legal frameworks governing the use of copyrighted material in AI training. The indiscriminate ingestion of vast amounts of copyrighted data to train AI models has raised serious concerns about potential infringement, particularly when AI models generate outputs that closely resemble or directly reproduce copyrighted works.

The outcomes of these cases could have significant implications for the future of AI development and the use of copyrighted material in training datasets. The decisions may influence how AI companies approach data acquisition and model training, potentially leading to greater emphasis on licensing, attribution, and the protection of copyright management information.

There is a growing recognition that AI companies need to be more transparent and accountable about their data sourcing practices. The use of datasets like Books3, which contain a significant amount of copyrighted material without clear licensing, is increasingly being questioned. AI companies may need to adopt more rigorous procedures for vetting their training data and ensuring that they have the necessary rights to use the material.

The debate also extends to the broader question of how to fairly compensate creators for the use of their works in AI training. Some argue that AI companies should be required to obtain licenses for all copyrighted material used in training, while others propose alternative models, such as collective licensing or a levy system. Finding a solution that balances the interests of creators, AI developers, and the public is a complex challenge that will require ongoing dialogue and collaboration.

The legal arguments presented in these cases delve into the intricacies of copyright law, the DMCA, and the application of fair use principles in the context of AI. The plaintiffs contend that Meta’s actions constitute a deliberate attempt to circumvent copyright protections and deprive creators of their rightful recognition and compensation. Meta, on the other hand, may argue that its use of copyrighted material qualifies as fair use, or that any removal of CMI was technically necessary or simply unintentional. The courts will ultimately need to weigh these arguments and decide whether Meta’s conduct violated copyright law or the DMCA.

The cases also raise questions about the responsibility of AI developers to ensure that their models are trained on legally obtained data. As AI becomes increasingly pervasive, the need for transparency and accountability in data sourcing and model training becomes paramount. The legal outcomes of these disputes could shape industry practices and encourage the development of ethical guidelines for AI development.

The debate over copyright and AI is not limited to the legal arena. It also extends to broader societal discussions about the role of AI in creative endeavors and the potential impact on human artists and authors. Some argue that AI-generated content poses a threat to human creativity, while others view AI as a tool that can enhance and augment human capabilities. These discussions highlight the need for a nuanced understanding of the relationship between AI and human creativity and the importance of fostering a collaborative environment that benefits both creators and technology developers.

The legal battles currently underway represent a crucial step in navigating the intersection of copyright law and artificial intelligence. The decisions rendered in these cases will likely have far-reaching consequences for AI development, the protection of intellectual property, and the relationship between technology and creativity. Ongoing dialogue among legal experts, technology developers, and creators will be essential, and the courts will play a critical role in defining the boundaries of permissible use and establishing precedents for the unique challenges posed by AI-generated content.