Generative AI Copyright Fight: News Publishers Sue Cohere

AI development is once again embroiled in a legal showdown: a group of prominent news and media organizations has filed a copyright and trademark infringement lawsuit against generative AI startup Cohere. Filed in the U.S. District Court for the Southern District of New York in February 2025, the suit names over a dozen plaintiffs, including well-respected publications like Forbes, The Guardian, and the Los Angeles Times. At the heart of the matter lies Cohere’s use of Retrieval-Augmented Generation (RAG) technology, which the plaintiffs allege involves the unauthorized use of their copyrighted material to construct databases and generate outputs.

RAG Technology Under Scrutiny

Retrieval-Augmented Generation (RAG) emerged as a potential solution to some inherent challenges associated with large language models (LLMs). Proposed by Patrick Lewis and his colleagues in 2020, RAG aims to mitigate issues such as hallucination (the generation of factually incorrect or nonsensical information), outdated knowledge, and a lack of transparency in the model’s reasoning. Interestingly, Patrick Lewis himself is currently a researcher at Cohere, continuing his work on RAG technology. The adoption of RAG has been widespread, with major players like Microsoft, Google, Amazon, and NVIDIA integrating it into their AI systems.
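To make the idea concrete, the following is a minimal sketch of the retrieve-then-generate pattern that RAG describes. It is not Cohere’s implementation: the document store, the document IDs, the word-overlap ranking, and the function names are all hypothetical, and a production system would typically use a search index or vector embeddings and pass the resulting prompt to a language model.

```python
"""Toy retrieval-augmented generation (RAG) flow: retrieve, then build a prompt."""

# Hypothetical document store; the IDs and texts are invented for illustration.
DOCUMENTS = {
    "doc-budget": "The city council approved the annual budget on Tuesday.",
    "doc-weather": "Heavy rain is expected across the region this weekend.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its words as a set, without punctuation."""
    return {word.strip(".,!?").lower() for word in text.split()}


def retrieve(query: str, documents: dict[str, str], k: int = 1) -> list[tuple[str, str]]:
    """Return the k (doc_id, text) pairs sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(query_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Combine the retrieved passages and the question into one model prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    question = "What did the city council approve?"
    print(build_prompt(question, retrieve(question, DOCUMENTS)))
```

The sketch shows why retrieval matters to the dispute: the text of the retrieved source is copied into the prompt at query time, and the source ID travels with it so that an answer can be attributed.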

The lawsuit brought by the news publishers centers on several key allegations of copyright infringement against Cohere. These claims highlight the complex legal questions surrounding the use of copyrighted material in the training and operation of generative AI models. The lawsuit not only questions the current use of RAG but could also set a precedent that reshapes the legal framework surrounding AI development and copyright law. The outcome of the case is being closely watched by AI developers, media organizations, and legal experts alike, as it has the potential to significantly impact the future of AI and its relationship with copyrighted content.

The plaintiffs’ lawsuit underscores the concerns held by content creators regarding the use of their work in AI systems. It is not simply about monetary compensation but also about maintaining control over their intellectual property and ensuring the integrity of their brands. The lawsuit also addresses the ethical concerns surrounding AI, such as the transparency of the models and the ability to attribute sources accurately. These issues are becoming increasingly important as AI technology continues to develop and integrate further into society.

The plaintiffs’ allegations against Cohere can be broken down into four main categories:

1. AI Model Training

The core of the plaintiffs’ argument revolves around how Cohere trained its large language models, known as the “Command Family.” They claim Cohere engaged in extensive “scraping” of text from the internet, including copyrighted content from the plaintiffs’ publications. This scraped data was then used to create the datasets necessary for training the Command Family models. Furthermore, the plaintiffs allege that Cohere used third-party datasets such as C4, a corpus derived from Common Crawl, which contain significant amounts of their copyrighted material, without obtaining the necessary permissions.

The use of copyrighted material in AI model training has become a contentious issue. AI developers often argue that such use falls under the doctrine of “fair use,” which allows for the limited use of copyrighted material for purposes such as criticism, commentary, news reporting, teaching, scholarship, or research. However, copyright holders argue that the large-scale scraping and use of their content for commercial purposes, such as training AI models, goes beyond the scope of fair use. This legal battle will likely hinge on whether the court agrees with the plaintiffs’ assessment.

The debate over fair use is one of the most critical aspects of the case. The court will need to determine whether the use of copyrighted material for AI training constitutes a transformative use, which is a key factor in determining fair use. If the court finds that the use is not transformative and that it harms the market for the original works, it is more likely to rule against Cohere. The plaintiffs argue that the use of their copyrighted material has a direct economic impact on their businesses, as it reduces the need for users to subscribe to their publications or visit their websites.

2. Real-time Use / RAG

Another key aspect of the lawsuit focuses on how Cohere’s services, particularly its Chat interface, utilize RAG technology in real-time. The plaintiffs allege that Cohere’s models scrape content from external sources, including their websites, to generate responses to user queries. This real-time scraping, according to the plaintiffs, constitutes copyright infringement, especially when Cohere’s models bypass paywalls or ignore “robots.txt” directives, which are commands that instruct web crawlers (including those used by AI models) not to scrape specific content from a website.

The bypassing of paywalls and robots.txt directives raises serious ethical and legal questions. Paywalls are designed to protect copyrighted content and ensure that publishers are compensated for their work. Robots.txt directives are a standard mechanism for website owners to control how their content is accessed and used by web crawlers. By ignoring these safeguards, Cohere is accused of demonstrating a disregard for copyright laws and the rights of content creators.
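As an illustration of what honoring robots.txt looks like in practice, here is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the crawler names, and the example URLs are hypothetical and are not taken from the complaint; the point is only that a well-behaved crawler checks the site’s rules before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative news site: one named AI crawler
# is barred from the whole site, and every other crawler is barred only from
# the subscriber section.
ROBOTS_TXT = """\
User-agent: example-ai-bot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("example-ai-bot", "other-bot"):
        for path in ("/world/story.html", "/subscribers/story.html"):
            url = "https://news.example.com" + path
            print(agent, path, may_fetch(agent, url))
```

Note that robots.txt is a convention rather than an access control: nothing technically stops a crawler from skipping this check, which is why the plaintiffs treat ignoring it as evidence of disregard rather than as a broken lock.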

The real-time use of copyrighted material through RAG technology presents additional challenges for copyright law. Unlike traditional copyright cases, where the infringing material is often fixed in a tangible medium, RAG technology involves the dynamic and ephemeral use of copyrighted content. This makes it more difficult to track and monitor the use of copyrighted material and to determine the extent of the infringement. The plaintiffs argue that Cohere has a responsibility to implement measures to prevent its models from accessing and using copyrighted material without permission, even if the use is only temporary.

3. Infringing Outputs

The plaintiffs contend that Cohere’s services provide infringing outputs in the form of copies, substantial excerpts, or substitutional summaries of their copyrighted works in response to user queries. They cite examples of Cohere Chat outputs where the “Under the Hood” panel displays full or partial articles copied directly from the plaintiffs’ websites.

The plaintiffs argue that these outputs, whether they are verbatim copies or summaries, directly substitute for the need for users to visit the original articles. This, in turn, harms the digital subscription and advertising revenue that the plaintiffs rely on to sustain their businesses. The core of this argument is that Cohere’s AI models are essentially acting as unauthorized distributors of copyrighted content, depriving the original publishers of their rightful compensation.

The issue of infringing outputs is closely tied to the concept of substantial similarity under copyright law. The court will need to determine whether the outputs generated by Cohere’s models are sufficiently similar to the plaintiffs’ copyrighted works to constitute infringement. This analysis will involve a comparison of the original works with the AI-generated outputs, taking into account factors such as the originality of the works, the amount of material copied, and the purpose for which the material was used. The plaintiffs argue that even summaries of their works can be infringing if they are so detailed and comprehensive that they effectively replace the original works in the market.

4. Unauthorized Adaptation

In addition to displaying portions of the plaintiffs’ works in the “Under the Hood” panel, Cohere’s services also provide summaries or abstracts of these works. The plaintiffs argue that the level of detail in these summaries is so extensive that they essentially replace the original works, exceeding the boundaries of fair use.

Copyright law protects not only the verbatim reproduction of copyrighted works but also the creation of derivative works, which are adaptations or transformations of the original. The plaintiffs argue that Cohere’s summaries are so comprehensive that they constitute unauthorized derivative works, infringing on their exclusive right to create and distribute adaptations of their copyrighted material.

The unauthorized adaptation claim raises questions about the scope of copyright protection for derivative works in the context of AI. The court will need to determine whether the summaries generated by Cohere’s models are sufficiently transformative to qualify as new works, or whether they are simply derivative works that infringe on the plaintiffs’ copyrights. The plaintiffs argue that the summaries are not transformative because they are based directly on the content of the original works and do not add any significant new expression or meaning. The plaintiffs also argue that the summaries harm their market for derivative works, as they compete with their own efforts to license or create summaries of their copyrighted material.

Secondary Liability for User Actions

Beyond the claim of direct copyright infringement, the plaintiffs also argue that Cohere is secondarily liable for the infringing acts of its users. They argue that Cohere’s services facilitate the reproduction, display, and distribution of the plaintiffs’ works by users, and that Cohere cannot evade responsibility by solely attributing infringement to user actions. The basis for this claim is that Cohere’s product generates answers only after a user inputs a prompt, making the company a participant in the infringing activity.

This argument of secondary liability is significant because it seeks to hold AI developers accountable for the actions of their users, even when those users are the ones directly engaging in copyright infringement. If successful, this argument could have far-reaching implications for the development and deployment of AI technologies, as it would require developers to implement safeguards to prevent their users from infringing on copyright.

The issue of secondary liability is particularly complex in the context of AI because AI models are designed to learn and adapt based on user input. This means that it can be difficult to predict how users will use the models and whether their use will infringe on copyright. The plaintiffs argue that Cohere has a duty to monitor how its models are being used and to take steps to prevent users from engaging in infringing activity. This could include implementing filters to block access to copyrighted material, providing warnings to users about the risks of copyright infringement, or terminating the accounts of users who repeatedly infringe on copyright.

Trademark Infringement Claims

The lawsuit extends beyond copyright infringement to include claims of trademark infringement. The plaintiffs allege that Cohere’s practice of attributing sources constitutes trademark infringement because it uses the plaintiffs’ well-known trademarks without permission or associates them with AI-generated erroneous content. This, they argue, leads to damage to the plaintiffs’ brand reputation and a dilution of their distinctiveness.

Trademarks are words, phrases, symbols, or designs that identify the source of a company’s goods or services. The unauthorized use of a trademark can cause confusion among consumers and damage the brand’s reputation. The plaintiffs argue that Cohere’s use of their trademarks in conjunction with AI-generated content could mislead users into believing that the plaintiffs endorse or are affiliated with Cohere’s services, which is not the case.

The trademark infringement claim centers on the potential for consumer confusion and the dilution of the plaintiffs’ brands. The plaintiffs argue that Cohere’s use of their trademarks could lead users to believe that the plaintiffs have endorsed Cohere’s services or that they are affiliated with Cohere in some way. This could damage the plaintiffs’ reputation if users are dissatisfied with Cohere’s services or if the AI-generated content is inaccurate or misleading. The plaintiffs also argue that Cohere’s use of their trademarks dilutes the distinctiveness of their brands by associating them with AI-generated content. To prevail, the plaintiffs would need to show that Cohere’s use of their trademarks is likely to cause confusion among consumers or that it has diluted the distinctiveness of their brands.

This lawsuit against Cohere is not an isolated incident. It follows a previous copyright lawsuit in the U.S. in October 2024 that also focused on the RAG application in AI services. This growing number of cases highlights the increasing tension between AI developers and copyright holders as RAG architecture becomes more prevalent in AI services.

The legal battles surrounding RAG technology are likely to become a significant issue in the future of AI copyright law. RAG presents unique challenges because it involves the real-time retrieval and use of copyrighted material to generate outputs. This raises complex questions about the scope of fair use, the responsibility of AI developers for user actions, and the protection of intellectual property in the age of artificial intelligence.

The outcome of these lawsuits could have a profound impact on the development and deployment of AI technologies. If courts rule in favor of copyright holders, AI developers may be forced to implement stricter safeguards to prevent copyright infringement, which could increase the cost and complexity of developing AI models. On the other hand, if courts rule in favor of AI developers, copyright holders may need to find new ways to protect their intellectual property in the face of increasingly sophisticated AI technologies.

The evolving legal framework surrounding AI and copyright is critical for fostering innovation while protecting intellectual property rights. Clear guidelines and regulations are needed to address the unique challenges posed by AI technologies, such as RAG, and to provide certainty for both AI developers and content creators. This requires a collaborative approach involving courts, lawmakers, and the AI community to establish a balance between promoting innovation and ensuring that intellectual property is respected. It’s also crucial to consider international perspectives, as copyright laws vary across different jurisdictions. These differences can lead to conflicts and complexities when AI models are trained and deployed globally.

The clash between news publishers and Cohere serves as a critical juncture in the ongoing debate surrounding AI, copyright, and the future of content creation. The outcome of this case, along with others like it, will undoubtedly shape the legal landscape for generative AI and its interaction with copyrighted material for years to come. As AI continues to evolve and become more integrated into various aspects of our lives, it is essential to strike a balance between promoting innovation and protecting the rights of content creators. The courts, lawmakers, and the AI community must work together to establish clear guidelines and regulations that foster creativity while ensuring that intellectual property is respected.

The news industry, in particular, faces a unique set of challenges in the age of AI. As AI models become increasingly capable of generating news content, it is crucial that publishers are compensated for the use of their copyrighted material and that the integrity of their brands is protected. The lawsuit against Cohere represents an effort by news publishers to assert their rights and ensure that their work is not exploited by AI companies without proper authorization. The media companies seek legal safeguards for original news reporting and journalism in a rapidly changing environment. Fair competition, transparent AI sourcing and training, and ethical AI development are essential to protect the integrity of the news landscape. This action is expected to influence not only the media industry but other sectors, too, as AI transforms creation, distribution, and rights management across diverse platforms.