OpenAI's GPT-4o Scrutinized for Using Paywalled Data

The relentless march of artificial intelligence development, spearheaded by giants like OpenAI, frequently collides with long-established principles of intellectual property and data ownership. This collision has once again sparked controversy, with fresh allegations surfacing that OpenAI’s newest flagship model, GPT-4o, may have been trained using copyrighted materials sequestered behind paywalls, potentially without securing the necessary permissions. These claims originate from a newly established watchdog group, the AI Disclosures Project, adding another layer of complexity to the already intricate debate surrounding the ethical sourcing of data for training sophisticated AI systems.

The Watchdog’s Bark: Allegations from the AI Disclosures Project

Launched in 2024, the AI Disclosures Project positions itself as a non-profit entity dedicated to scrutinizing the often-opaque practices within the AI industry. Its founders include notable figures such as media entrepreneur Tim O’Reilly, the founder of O’Reilly Media, a prominent publisher of technical books, and economist Ilan Strauss. This connection to O’Reilly Media is particularly relevant, as the project’s initial bombshell report focuses specifically on the alleged presence of O’Reilly’s paywalled book content within GPT-4o’s training dataset.

The central assertion of their study is provocative: despite the absence of any known licensing agreement between OpenAI and O’Reilly Media, the GPT-4o model exhibits a markedly high level of familiarity with content derived directly from O’Reilly’s copyrighted books. This familiarity, the report contends, strongly suggests that these paywalled materials were incorporated into the vast corpus of data used to build the model’s capabilities. The study highlights a significant difference compared to older OpenAI models, particularly GPT-3.5 Turbo, implying a potential shift or expansion in data acquisition practices leading up to the development of GPT-4o.

The implications are substantial. If proprietary, paid-for content is being ingested by AI models without authorisation or compensation, it raises fundamental questions about copyright law in the age of generative AI. Publishers and authors rely on subscription or purchase models, predicated on the exclusivity of their content. The alleged use of this material for training could be seen as undermining these business models, potentially devaluing the very content that requires significant investment to create. This specific accusation moves beyond the scraping of publicly available websites, venturing into the territory of accessing content explicitly intended for paying customers.

Peering Inside the Black Box: The Membership Inference Attack

To substantiate their claims, the researchers at the AI Disclosures Project employed a sophisticated technique known as a “membership inference attack,” specifically using a method they call DE-COP. The core idea behind this approach is to test whether an AI model has “memorised” or at least developed a strong familiarity with specific pieces of text. In essence, the attack probes the model to see if it can reliably distinguish between original text passages (in this case, from O’Reilly books) and carefully constructed paraphrased versions of those same passages, generated by another AI.

The underlying logic is that if a model consistently shows a higher-than-random ability to identify the original human-authored text compared to a close paraphrase, it implies the model has encountered that original text before – likely during its training phase. It’s akin to testing if someone recognizes a specific, lesser-known photograph they claim never to have seen; consistent recognition suggests prior exposure.
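To make the idea concrete, a stripped-down version of such a probe might look like the sketch below. This is purely illustrative and not the AI Disclosures Project's actual harness: the ask_model callable, the prompt wording, and the scoring are assumptions standing in for whatever tooling the researchers used, and the published DE-COP method reportedly adds further steps such as varying answer order and correcting for guessing bias.

```python
import random

def decop_style_probe(ask_model, original: str, paraphrases: list[str]) -> bool:
    """Present the model with one verbatim passage and several AI-generated
    paraphrases in shuffled order, and check whether it picks the original.

    `ask_model` is a hypothetical callable that sends a prompt to the model
    under test and returns its text reply; the real DE-COP harness is more
    elaborate than this single multiple-choice question.
    """
    options = [original] + paraphrases
    random.shuffle(options)
    labels = "ABCDEFGH"[: len(options)]

    prompt = (
        "One of the following passages is the verbatim original; "
        "the others are paraphrases. Answer with a single letter.\n\n"
        + "\n\n".join(f"({label}) {text}" for label, text in zip(labels, options))
    )
    reply = ask_model(prompt).strip().upper()

    correct_label = labels[options.index(original)]
    return reply.startswith(correct_label)

# Aggregated over many excerpts, a hit rate well above the 1/len(options)
# chance level suggests the model has encountered the originals before.
```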

The scale of the AI Disclosures Project’s test was considerable. They utilized 13,962 distinct paragraph excerpts drawn from 34 different O’Reilly Media books. These excerpts represented the kind of specialized, high-value content typically found behind the publisher’s paywall. The study then measured the performance of both GPT-4o and its predecessor, GPT-3.5 Turbo, on this differentiation task.

The results, as presented in the report, were striking. GPT-4o demonstrated a significantly heightened ability to recognize the paywalled O’Reilly content. Its performance was quantified using an AUROC (Area Under the Receiver Operating Characteristic curve) score, a common metric for evaluating the performance of binary classifiers. GPT-4o achieved an AUROC score of 82%. In contrast, GPT-3.5 Turbo scored just above 50%, which is essentially equivalent to random guessing – indicating little to no specific recognition of the tested material. This stark difference, the report argues, provides compelling, albeit indirect, evidence that the paywalled content was indeed part of GPT-4o’s training diet. An 82% score suggests a strong signal, well beyond what would be expected by chance or generalised knowledge.
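For readers unfamiliar with the metric, AUROC captures how often a randomly chosen "member" excerpt receives a higher recognition score than a randomly chosen "non-member" one, so 50% corresponds to coin-flipping and 100% to perfect separation. The snippet below shows how such a score is typically computed; the numbers are invented for illustration, and scikit-learn is simply one common tool for the calculation, not necessarily what the study used.

```python
# Illustrative only: invented scores, not the study's data.
from sklearn.metrics import roc_auc_score

# 1 = excerpt from a book suspected to be in the training data,
# 0 = control excerpt the model should not have seen
y_true = [1, 1, 1, 0, 0, 0]

# Per-excerpt recognition scores, e.g. how confidently the model
# singled out the verbatim original in the multiple-choice probe
y_score = [0.91, 0.84, 0.40, 0.35, 0.62, 0.22]

# ~0.5 means random guessing (GPT-3.5 Turbo's regime); values near 1.0
# indicate strong separation, as with GPT-4o's reported 82%
print(f"AUROC = {roc_auc_score(y_true, y_score):.2f}")
```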

Necessary Caveats and Unanswered Questions

While the findings present a compelling narrative, the co-authors of the study, including AI researcher Sruly Rosenblat, commendably acknowledge potential limitations inherent in their methodology and the complex nature of AI training. One significant caveat they raise is the possibility of indirect data ingestion. It’s conceivable, they note, that users of ChatGPT (OpenAI’s popular interface) might have copied and pasted excerpts from paywalled O’Reilly books directly into the chat interface for various purposes, such as asking questions about the text or requesting summaries. If this occurred frequently enough, the model could have learned the content indirectly through user interactions, rather than through direct inclusion in the initial training dataset. Disentangling direct training exposure from indirect learning via user prompts remains a significant challenge in AI forensics.

Furthermore, the study’s scope did not extend to OpenAI’s latest or more specialised model iterations, developed or released concurrently with or subsequent to GPT-4o’s main training cycle. Models such as GPT-4.5 and reasoning-focused models like o3-mini and o1 were not subjected to the same membership inference attacks. This leaves open the question of whether data sourcing practices have evolved further, or whether these newer models exhibit similar patterns of familiarity with paywalled content. The rapid iteration cycles in AI development mean that any snapshot analysis risks being outdated almost immediately.

These limitations do not necessarily invalidate the study’s core findings, but they add crucial layers of nuance. Proving definitively what resides within the terabytes of data used to train a foundation model is notoriously difficult. Membership inference attacks offer probabilistic evidence, suggesting likelihood rather than offering absolute certainty. OpenAI, like other AI labs, guards its training data composition closely, citing proprietary concerns and competitive sensitivities.

The allegations levelled by the AI Disclosures Project do not exist in a vacuum. They represent the latest skirmish in a much broader, ongoing conflict between AI developers and creators over the use of copyrighted material for training purposes. OpenAI, along with other prominent players like Google, Meta, and Microsoft, finds itself embroiled in multiple high-profile lawsuits. These legal challenges, brought by authors, artists, news organisations, and other rights holders, generally allege widespread copyright infringement stemming from the unauthorised scraping and ingestion of vast amounts of text and images from the internet to train generative AI models.

The core defense often mounted by AI companies hinges on the doctrine of fair use (in the United States) or similar exceptions in other jurisdictions. They argue that using copyrighted works for training constitutes a “transformative” use – the AI models are not merely reproducing the original works but are using the data to learn patterns, styles, and information to generate entirely new outputs. Under this interpretation, the training process itself, aimed at creating a powerful new tool, should be permissible without requiring licenses for every piece of data ingested.

However, rights holders vehemently contest this view. They argue that the sheer scale of the copying involved, the commercial nature of the AI products being built, and the potential for AI outputs to directly compete with and supplant the original works weigh heavily against a finding of fair use. The contention is that AI companies are building multi-billion dollar enterprises on the back of creative work without compensating the creators.

Against this litigious backdrop, OpenAI has proactively sought to mitigate some risks by striking licensing deals with various content providers. Agreements have been announced with major news publishers (like the Associated Press and Axel Springer), social media platforms (such as Reddit), and stock media libraries (like Shutterstock). These deals provide OpenAI with legitimate access to specific datasets in exchange for payment, reducing its reliance on potentially infringing web-scraped data. The company has also reportedly hired journalists, tasking them with helping to refine and improve the quality and reliability of its models’ outputs, suggesting an awareness of the need for high-quality, curated input.

The Ripple Effect: Content Ecosystem Concerns

The AI Disclosures Project’s report extends its concerns beyond the immediate legal implications for OpenAI. It frames the issue as a systemic threat that could negatively impact the health and diversity of the entire digital content ecosystem. The study posits a potentially damaging feedback loop: if AI companies can freely use high-quality, professionally created content (including paywalled material) without compensating the creators, it erodes the financial viability of producing such content in the first place.

Professional content creation – whether it’s investigative journalism, in-depth technical manuals, fiction writing, or academic research – often requires significant time, expertise, and financial investment. Paywalls and subscription models are frequently essential mechanisms for funding this work. If the revenue streams supporting these efforts are diminished because the content is effectively being used to train competing AI systems without remuneration, the incentive to create high-quality, diverse content could decline. This could lead to a less informed public, a reduction in specialised knowledge resources, and potentially an internet dominated by lower-quality or AI-generated content lacking human expertise and verification.

Consequently, the AI Disclosures Project advocates strongly for greater transparency and accountability from AI companies regarding their training data practices. They call for the implementation of robust policies and potentially regulatory frameworks that ensure content creators are fairly compensated when their work contributes to the development of commercial AI models. This echoes broader calls from creator groups worldwide who seek mechanisms – whether through licensing agreements, royalty systems, or collective bargaining – to ensure they receive a share of the value generated by AI systems trained on their intellectual property. The debate centers on finding a sustainable equilibrium where AI innovation can flourish alongside a thriving ecosystem for human creativity and knowledge generation. The resolution of ongoing legal battles and the potential for new legislation or industry standards will be critical in shaping this future balance. The question of how to track data provenance and attribute value in massive, complex AI models remains a significant technical and ethical hurdle.