The term ‘open source’ carries a powerful resonance in the world of technology. It evokes images of collaborative innovation, shared knowledge, and a fundamental belief in transparency. This spirit was vividly embodied half a century ago with the formation of the Homebrew Computer Club in Menlo Park, California. This collective of enthusiasts and tinkerers didn’t just build machines; they built a culture grounded in freely exchanging ideas and software, laying foundational stones for the open-source movement that would revolutionize computing. Yet, today, this hard-won legacy and the very definition of openness are facing a subtle but significant challenge, particularly within the rapidly expanding domain of artificial intelligence. A growing number of companies developing sophisticated AI models are eagerly branding their creations as ‘open source,’ but a closer look reveals that this label is often applied superficially, masking a reality that falls short of the movement’s core tenets. This dilution of meaning is not merely a semantic quibble; it poses a genuine threat to the principles of transparency and replicability that are paramount, especially within the scientific community.
Understanding the Genuine Spirit of Open Collaboration
To grasp the current predicament, one must first appreciate what ‘open source’ truly signifies. It’s more than just free-of-charge software; it’s a philosophy rooted in collective progress and verifiable trust. The bedrock of this philosophy rests on four essential freedoms:
- The freedom to run the program for any purpose.
- The freedom to study how the program works and change it so it does your computing as you wish. Access to the source code is a precondition for this.
- The freedom to redistribute copies so you can help others.
- The freedom to distribute copies of your modified versions to others. By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
These freedoms, typically enshrined in licenses like the GNU General Public License (GPL), MIT License, or Apache License, have historically centered on source code. Source code – the human-readable instructions written by programmers – is the blueprint of traditional software. Making this code openly available allows anyone to inspect it, understand its logic, identify potential flaws, adapt it to new needs, and share those improvements.
This model has been an extraordinary catalyst for innovation and scientific advancement. Consider the impact of tools readily available to researchers worldwide:
- Statistical analysis: The R language and environment (often used through the RStudio IDE) provides a powerful, transparent, and extensible platform for statistical computing and graphics, and has become a cornerstone of data analysis in countless scientific fields. Its openness allows for peer review of methods and the development of specialized packages.
- Computational fluid dynamics: OpenFOAM offers a sophisticated library for simulating fluid flows, crucial in fields ranging from aerospace engineering to environmental science. Its open nature enables customization and verification of complex simulations.
- Operating systems: Linux and other open-source operating systems form the backbone of much of the world’s computing infrastructure, including scientific high-performance computing clusters, valued for their stability, flexibility, and transparency.
The benefits extend far beyond mere cost savings. Open source fosters reproducibility, a cornerstone of the scientific method. When the tools and code used in research are open, other scientists can replicate the experiments, verify the findings, and build upon the work with confidence. It promotes global collaboration, breaking down barriers and allowing researchers from diverse backgrounds and institutions to contribute to shared challenges. It ensures longevity and avoids vendor lock-in, protecting research investments from the whims of proprietary software companies. It accelerates discovery by allowing rapid dissemination and iteration of new ideas and techniques. The open-source ethos is fundamentally aligned with the scientific pursuit of knowledge through transparency, scrutiny, and shared progress.
Artificial Intelligence: A Different Beast Entirely
The established open-source paradigm, built securely around the accessibility of source code, encounters significant turbulence when applied to the realm of artificial intelligence, particularly large-scale models like foundational large language models (LLMs). While these AI systems certainly involve code, their functionality and behaviour are shaped by far more complex and often opaque elements. Simply releasing the architectural code for a neural network doesn’t equate to genuine openness in the way it does for traditional software.
An AI model, especially a deep learning model, is typically composed of several key ingredients (a code sketch after this list shows which of them a typical ‘open’ release actually includes):
- Model Architecture: This is the structural design of the neural network – the arrangement of layers, neurons, and connections. Companies often do release this information, presenting it as evidence of openness. It’s akin to sharing the blueprint of an engine.
- Model Weights (Parameters): These are the numerical values, often billions of them, within the network that have been adjusted during the training process. They represent the learned patterns and knowledge extracted from the training data. Releasing the weights allows others to use the pre-trained model. This is like providing the fully assembled engine, ready to run.
- Training Data: This is perhaps the most critical and most frequently obscured component. Foundational models are trained on colossal datasets, often scraped from the internet or sourced from proprietary or private collections (like medical records, which raise significant privacy concerns). The composition, curation, filtering, and potential biases within this data profoundly influence the model’s capabilities, limitations, and ethical behaviour. Without detailed information about the training data, understanding why a model behaves the way it does, or assessing its suitability and safety for specific applications, becomes incredibly difficult. This is the secret fuel mixture and the precise conditions under which the engine was run-in.
- Training Code and Process: This includes the specific algorithms used for training, the optimization techniques, the chosen hyperparameters (settings that control the learning process), the computational infrastructure employed, and the significant energy consumed. Minor variations in the training process can lead to different model behaviours, making reproducibility challenging even if the architecture and data were known. This represents the detailed engineering specifications, tooling, and factory conditions used to build and tune the engine.
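To make the distinction concrete, the following minimal sketch (in PyTorch, with placeholder names and values rather than any real release) marks which of these four ingredients a typical ‘open source’ AI release actually ships:

```python
import torch
import torch.nn as nn

# 1. Architecture: usually published -- the blueprint of the engine.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.block(self.embed(tokens)))

model = TinyLM()

# 2. Weights: often published -- the assembled engine, ready to run.
#    ("weights.pt" is a placeholder path, not a real artifact.)
# model.load_state_dict(torch.load("weights.pt"))

# 3. Training data: typically withheld. The corpus, its curation, and
#    its filtering are unknown, so its biases cannot be audited.

# 4. Training code and process: typically withheld. Without the exact
#    optimizer, hyperparameters, and schedule that produced the weights,
#    no one can recreate or verify them.
# optimizer = torch.optim.AdamW(model.parameters(), lr=...)  # lr unknown
```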
Many systems currently marketed as ‘open source’ AI primarily offer access to the model architecture and the pre-trained weights. While this allows users to run the model and perhaps fine-tune it on smaller datasets, it critically fails to provide the necessary transparency regarding the training data and process. This severely curtails the ability to truly study the model’s fundamental properties or to modify it in deeply meaningful ways that require retraining or understanding its origins. The freedoms to study and modify, central to the open-source definition, are significantly hampered when the crucial elements of data and training methodology remain hidden. Replicating the model’s creation from scratch – a key test of scientific understanding and verification – becomes virtually impossible.
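What a weights-plus-architecture release does permit is illustrated below: loading and fine-tuning the model, here sketched with the Hugging Face transformers library. The repository id is a hypothetical placeholder, not a real model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "example-org/open-weights-lm"  # hypothetical repository id

# Possible with a weight-only release: running and fine-tuning.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

batch = tokenizer("Protein folding depends on", return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # one gradient step nudges the released weights

# Not possible from the release alone: recreating the weights from
# scratch, auditing the original training corpus, or independently
# verifying any published claims about how the model was trained.
```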
The Troubling Trend of ‘Openwashing’ in AI
This gap between the label and the reality has given rise to a practice known as ‘openwashing.’ This term describes the act of companies leveraging the positive reputation and perceived benefits of ‘open source’ for marketing and strategic advantage, while simultaneously withholding access to critical components like detailed training data information or the code used for the training itself. They cloak their systems in the language of openness without fully embracing its demanding principles of transparency and community access.
Several prominent AI models, despite being widely used and sometimes carrying an ‘open’ designation, fall short when measured against the comprehensive definition of open source championed by organizations like the Open Source Initiative (OSI). An analysis by the OSI, which has been working diligently since 2022 to clarify the meaning of open source in the AI context, highlighted concerns with several popular models:
- Llama 2 & Llama 3.x (Meta): While the model weights and architecture are available, restrictions on use and incomplete transparency regarding the full training dataset and process limit their alignment with traditional open-source values.
- Grok (xAI): Similarly, while made available, the lack of comprehensive information about its training data and methodology raises questions about its true openness.
- Phi-2 (Microsoft): Though often described as an ‘open model,’ it offers only limited transparency regarding its creation process and training data.
- Mixtral (Mistral AI): Though parts are released, it doesn’t meet the full criteria for open source due to limitations in access to all necessary components for study and modification.
These examples stand in contrast to efforts that strive for greater adherence to open-source principles:
- OLMo (Allen Institute for AI): Developed by a non-profit research institute, OLMo was explicitly designed with openness in mind, releasing not just weights but also training code and details about the data used.
- LLM360’s CrystalCoder: A community-driven project aiming for full transparency across the model’s lifecycle, including data, training procedures, and evaluation metrics.
Why engage in openwashing? The motivations are multifaceted:
- Marketing and Perception: The ‘open source’ label carries significant goodwill. It suggests collaboration, ethical practices, and a commitment to the broader community, which can attract users, developers, and positive press.
- Ecosystem Building: Releasing model weights, even without full transparency, encourages developers to build applications on top of the AI system, potentially creating a dependent ecosystem that benefits the originating company.
- Regulatory Arbitrage: This is a particularly concerning driver. Upcoming regulations, such as the European Union’s AI Act (2024), are expected to impose stricter requirements on certain high-risk AI systems. However, exemptions or lighter scrutiny are often proposed for ‘free and open-source software.’ By applying the ‘open source’ label – even if inaccurately according to established definitions – companies might hope to navigate these regulations more easily, avoiding potentially costly compliance burdens associated with proprietary, high-risk systems. This strategic labelling exploits a potential loophole, undermining the regulation’s intent to ensure safety and transparency.
This practice ultimately devalues the term ‘open source’ and creates confusion, making it harder for users, developers, and researchers to discern which AI systems genuinely offer the transparency and freedoms the label implies.
Why True Openness Matters Urgently for Science
For the scientific community, the stakes in this debate are exceptionally high. Science thrives on transparency, reproducibility, and the ability for independent verification. The increasing integration of AI into research – from analyzing genomic data and modelling climate change to discovering new materials and understanding complex biological systems – makes the nature of these AI tools critically important. Relying on ‘black box’ AI systems, or those masquerading as open without providing genuine transparency, introduces profound risks:
- Impaired Reproducibility: If researchers cannot access or understand the training data and methodology behind an AI model used in a study, replicating the results becomes impossible. This fundamentally undermines a core pillar of the scientific method. How can findings be trusted or built upon if they cannot be independently verified?
- Hidden Biases and Limitations: All AI models inherit biases from their training data and design choices. Without transparency, researchers cannot adequately assess these biases or understand the model’s limitations. Using a biased model unknowingly could lead to skewed results, flawed conclusions, and potentially harmful real-world consequences, especially in sensitive areas like medical research or social science.
- Lack of Scrutiny: Opaque models evade rigorous peer review. The scientific community cannot fully interrogate the model’s inner workings, identify potential errors in its logic, or understand the uncertainties associated with its predictions. This hinders the self-correcting nature of scientific inquiry.
- Dependence on Corporate Systems: Reliance on closed or semi-closed AI systems controlled by corporations creates dependencies. Research agendas could be subtly influenced by the capabilities and limitations of available corporate tools, and access could be restricted or become costly, potentially stifling independent research directions and widening the gap between well-funded institutions and others.
- Stifled Innovation: True open source allows researchers not just to use tools but to dissect, modify, improve, and repurpose them. If key components of AI models remain inaccessible, this crucial avenue for innovation is blocked. Scientists are prevented from experimenting with novel training techniques, exploring different data combinations, or adapting models for specific, nuanced research questions that the original developers did not anticipate.
The scientific community cannot afford to passively accept the dilution of the term ‘open source.’ It must actively advocate for clarity and demand genuine transparency from AI developers, especially when these tools are employed in research contexts. This involves:
- Promoting Clear Standards: Supporting efforts, like those by the OSI, to establish clear, rigorous definitions for what constitutes ‘open-source AI,’ definitions that encompass transparency regarding architecture, weights, training data, and training processes.
- Prioritizing Verifiable Tools: Favoring the use of AI models and platforms that meet these high standards of transparency, even if they are initially less performant or require more effort than readily available opaque alternatives.
- Demanding Transparency: Insisting that publications involving AI include detailed disclosures about the models used, including comprehensive information about training data provenance, processing, and potential biases, as well as training methodologies (see the sketch after this list for one form such a disclosure could take).
- Supporting Truly Open Projects: Contributing to and utilizing community-driven projects and initiatives from institutions committed to genuine openness in AI development.
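To illustrate the transparency disclosures called for above, a publication-time disclosure could even be made machine-readable. The following is a hypothetical sketch, not an existing standard; every field name here is invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ModelDisclosure:
    """Hypothetical per-paper summary of an AI model's openness."""
    name: str
    version: str
    weights_available: bool         # can reviewers obtain the exact weights?
    training_data_documented: bool  # provenance, curation, filtering described?
    training_code_available: bool   # scripts and hyperparameters released?
    license: str
    known_limitations: list[str] = field(default_factory=list)

# Example entry for a fictional model used in a study.
disclosure = ModelDisclosure(
    name="ExampleLM",
    version="1.0",
    weights_available=True,
    training_data_documented=False,  # a reproducibility red flag
    training_code_available=False,
    license="custom, use-restricted",
    known_limitations=["training corpus undisclosed; biases unaudited"],
)
```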
The spirit of the Homebrew Computer Club – one of shared knowledge and collaborative building – is essential for navigating the complexities of the AI era responsibly. Reclaiming and defending the true meaning of ‘open source’ for artificial intelligence is not just about terminological purity; it’s about safeguarding the integrity, reproducibility, and continued progress of science itself in an increasingly AI-driven world. The path forward requires vigilance and a collective commitment to ensuring that the powerful tools of AI are developed and deployed in a manner consistent with the principles of open inquiry that have served science so well for centuries.