Ethical AI: A Sci-Fi Dream Realized

The Herculean Task of Ethical Data Sourcing

The journey to this ethically trained model was far from easy. As the researchers readily admit, the true bottleneck wasn't computational power but sheer human effort. Assembling the Common Pile v0.1, a dataset exceeding eight terabytes, demanded painstaking manual cleaning and reformatting to make it suitable for AI training. Imagine sifting through a virtually endless pile of digital information, hunting for any error that might corrupt the dataset.

But the real challenge lay in the meticulous double-checking of copyright status. In the chaotic realm of the internet, rampant mislicensing is the norm, turning copyright verification into a Sisyphean task.

“This isn’t a thing where you can just scale up the resources that you have available,” study coauthor Stella Biderman told WaPo. “We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that’s just really hard.”

Sifting through terabytes of data for copyright issues is not a problem that yields to more hardware; the researchers could not simply add more computer chips and hope for a solution. Instead, every item had to be manually verified and annotated. Given the sheer number of files involved, determining the appropriate license for each item, or confirming that its copyright had expired, took significant manpower.

The team had to develop creative workflows to sift through the collection and flag problematic files. They also worked with legal experts and copyright organizations to properly interpret the status of files carrying ambiguous licenses.

Further complicating the task, the data came from a variety of sources, each with its own method of storing and organizing files, so everything also had to be reorganized and standardized before it could be used to train the AI model.
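The screening-and-standardization workflow described above can be sketched in code. This is a minimal illustration, not the actual Common Pile tooling: the license allow-list, the `RawDocument` record, and the `screen`/`normalize` helpers are all hypothetical names invented for this example.

```python
# Hypothetical sketch of a license-screening and normalization pass.
# ALLOWED_LICENSES, RawDocument, screen, and normalize are illustrative
# names, not the project's real tooling.
from dataclasses import dataclass

# Only licenses on an explicit allow-list pass automated screening;
# ambiguous tags are routed to a human reviewer, consistent with the
# team's heavy reliance on manual annotation.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "public-domain"}

@dataclass
class RawDocument:
    source: str       # which collection the file came from
    license_tag: str  # license string as reported by that source
    text: str

def screen(doc: RawDocument) -> str:
    """Return 'keep', 'review', or 'drop' for a single document."""
    tag = doc.license_tag.strip()
    if tag in ALLOWED_LICENSES:
        return "keep"    # clearly permissive
    if tag in ("", "unknown"):
        return "review"  # ambiguous: needs manual annotation
    return "drop"        # explicitly restrictive or incompatible

def normalize(doc: RawDocument) -> dict:
    """Reshape source-specific records into one common format."""
    return {"source": doc.source, "license": doc.license_tag,
            "text": doc.text.strip()}

docs = [
    RawDocument("loc-books", "public-domain", " Call me Ishmael. "),
    RawDocument("web-crawl", "unknown", "some page text"),
    RawDocument("code-host", "proprietary", "a snippet"),
]
kept = [normalize(d) for d in docs if screen(d) == "keep"]
```

The point of the sketch is the division of labor: automation handles the unambiguous cases, while everything uncertain is escalated to a person, which is exactly where the manpower cost came from.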

Triumph Over Adversity: The Birth of an Ethical AI

Despite the daunting obstacles, Biderman and her dedicated team persevered. Once the arduous work of building the Common Pile was complete, they used it to train a seven-billion-parameter large language model (LLM). The resulting AI not only held its own against comparable industry models such as Meta’s Llama 1 and Llama 2 7B, but did so with a clean ethical conscience.

But the AI research landscape evolves at breakneck speed. Meta released Llama 1 and Llama 2 a couple of years ago, a relative eternity in the world of AI, so matching those models after only a short development time is a credit to the team. Other industry LLMs, such as Google’s PaLM or OpenAI’s GPT series, have far more parameters and thus far higher training costs. Competing with Llama 1 and Llama 2 7B suggests this approach may be able to compete at the highest levels as the technology improves.

That a lean, determined team could achieve comparable results with limited resources is a testament to their ingenuity. One particularly inspired find was a treasure trove of more than 130,000 previously overlooked English-language books in the Library of Congress. The Library of Congress holds a wide variety of other overlooked public-domain sources that could likewise be useful for training an ethically sourced AI.

The team’s limited resources also made it critical to optimize the data they sourced. By being selective about what went into the dataset, they could approach the performance of models trained on vast datasets without the ethical baggage.

Copyright remains a thorny ethical and legal issue in the age of AI. Industry giants like OpenAI and Google have amassed vast datasets by devouring everything in sight, from news articles to personal social media posts. This practice has drawn criticism from all sides. Authors have even filed lawsuits alleging the illegal use of copyrighted books to train AI models, often demanding that their works be removed from training data and that the companies pay for what they used.

The tech industry contends that such practices constitute fair use, arguing that the development of AI would be “impossible” without unfettered access to data. Some experts counter that this is akin to claiming one must steal in order to innovate. This latest research delivers a stinging rebuke to that Silicon Valley narrative.

While this achievement marks a significant step forward, it doesn’t eliminate every ethical concern. Large language models still raise fundamental questions about the future of labor: the automation they provide may put people out of work, and society must decide how to help those displaced. Retraining workers for other jobs is one option, but determining what jobs people are capable of doing can be a complex calculation. Furthermore, the use of public-domain works may not sit well with everyone, particularly those whose creative contributions are now being regurgitated by AI.

Even in a hypothetical future where AI firms must seek permission or provide compensation for data usage, copyright holders may still face undue pressure to allow AI training. Given the immense resources large AI firms can bring to bear, most rights holders would struggle to resist that pressure. This raises the ethical question of whether copyright holders retain any meaningful right to say no to the use of their data.

Towards Transparency and Accountability in AI

Biderman, however, remains pragmatic. She harbors no illusions that companies like OpenAI will suddenly embrace ethical data sourcing. Instead, she hopes her work will encourage greater transparency about data usage: which datasets were used to train which AI products? Knowing the answer could have significant implications for the future of AI. It may even become possible to trace AI-generated output back to its original sources by seeing which training data most influenced it.

“Even partial transparency has a huge amount of social value and a moderate amount of scientific value,” she told WaPo.

Currently, the exact datasets used to train a given AI are closely guarded secrets. The only ways to replicate a model are to be told exactly how it was created or to reverse-engineer it, which can take enormous time and effort. Without full transparency about the data, some organizations simply cannot compete with those that developed the original model.

The models also tend to work as black boxes, making it hard to explain exactly why a model produced a particular decision.

A Paradigm Shift in AI Development

The implications of this research extend far beyond the realm of AI ethics. It signals a fundamental shift in how AI can be developed, demonstrating that ethical considerations and technological advancement need not be mutually exclusive. By prioritizing transparency, responsible data sourcing, and human oversight, we can forge a future where AI serves humanity rather than the other way around: developed to promote human values, not merely short-term profit. A focus on ethics also invites adoption by people from more backgrounds, reducing the risk of an AI winter in which interest in building or using these models collapses.

Addressing Ethical Concerns and Societal Impacts

The tech industry’s argument that ethical data usage is an insurmountable obstacle has now been decisively challenged; this project shows that AI models can be built on a solid ethical foundation, and the old assumptions of impossibility no longer hold. However, the ethical dimensions of AI development extend beyond copyright. The socio-economic impacts of AI, including job displacement and algorithmic bias, demand careful consideration as well.

The ethical considerations that affect AI models go beyond sourcing. We must also verify that the data is not biasing models toward or against any segment of the population. An AI model trained predominantly on data from one country, for example, will likely misrepresent or misunderstand people from other backgrounds. Such bias has wide ramifications, because AI is used to make decisions across a variety of business and personal contexts, and it can be difficult to identify, because the logic underlying a model is difficult to ascertain.
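One simple form the "verify the data" step above can take is checking a dataset's metadata for skew before training. The sketch below is a hypothetical illustration: the `country` field, the sample records, and the 50% threshold are all assumptions, not values from the study.

```python
# Illustrative skew check on a dataset's metadata. The field name,
# sample records, and threshold are invented for this example.
from collections import Counter

def dominance_ratio(records, field):
    """Fraction of records accounted for by the single most common value."""
    counts = Counter(r[field] for r in records)
    return counts.most_common(1)[0][1] / len(records)

# Toy corpus: 8 of 10 documents come from one country.
records = [{"country": "US"}] * 8 + [{"country": "IN"}, {"country": "BR"}]

ratio = dominance_ratio(records, "country")
if ratio > 0.5:  # arbitrary threshold, chosen only for illustration
    print(f"warning: one group supplies {ratio:.0%} of the data")
```

A real audit would look at many fields (language, region, time period) and use domain-appropriate thresholds, but even a crude check like this surfaces the one-country skew the paragraph warns about.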

Promoting Transparency and Accountability

To foster trust and ensure responsible innovation, the AI industry must embrace transparency and accountability. Companies should be open about the data sources used to train their models and the methods employed to mitigate bias; this openness will increase trust in, and use of, AI models. Independent audits and external oversight can further enhance accountability and prevent ethical lapses.

Transparency can be used to verify that datasets are diverse enough to avoid biasing the model, and accountability can be implemented through external audits that check for potential ethical lapses. This also means making models more explainable, clarifying the logic underlying their decisions, and giving people a clear path to appeal a model’s outputs.
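One concrete shape such transparency could take is a machine-readable training-data disclosure, in the spirit of datasheets and model cards. This is a speculative sketch: the model name, the second dataset entry, and all field names are invented, though the 130,000-book Library of Congress figure comes from the article above.

```python
# Hypothetical machine-readable training-data disclosure. All names and
# fields are illustrative; only the 130,000-book figure is from the
# article. Real disclosure formats (e.g. model cards) differ.
import json

disclosure = {
    "model": "example-7b",
    "datasets": [
        {"name": "loc-public-domain-books",
         "license": "public-domain",
         "documents": 130_000},
        {"name": "openly-licensed-web-text",   # invented entry
         "license": "CC-BY-4.0",
         "documents": 2_400_000},
    ],
    "audits": [
        {"auditor": "external-reviewer",       # invented entry
         "scope": "license spot-check"},
    ],
}

# Publishing even this much alongside a model would let outsiders see
# which datasets trained which product, the question Biderman raises.
print(json.dumps(disclosure, indent=2))
```

Even partial disclosures of this kind would deliver the "partial transparency" Biderman argues has substantial social value.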

Collaboration and Open Source Solutions

The development of ethically sourced AI requires collaboration and open-source solutions. By sharing datasets, methodologies, and best practices, researchers and developers can accelerate progress and collectively address the challenges of ethical AI development. Datasets can be shared using tools compliant with current data-protection and privacy rules, and open-source code gives people the opportunity to verify that the software behind an AI model was developed ethically. Open-source initiatives can also empower smaller organizations and individuals to participate in the AI revolution, ensuring that the benefits of this technology are shared more equitably.

Greater collaboration will mean more innovation, but this innovation must be steered in an ethical direction.

The Promise of a Brighter Future

The creation of an AI model trained entirely on ethically sourced data represents a milestone in the quest for responsible and beneficial AI. This groundbreaking achievement not only proves that ethical AI development is possible but also provides a roadmap for others to follow. By embracing transparency, collaboration, and a commitment to ethical principles, we can unlock the full potential of AI while safeguarding human values and promoting a more just and equitable future. And alongside the ethical considerations, AI can bring significant improvements to humanity, such as creating new medicines and accelerating innovation.

The commitment to ethical principles must be at the forefront of our effort to develop AI tools.