AI needs cultural policies, not just regulation

Last Updated: 02/08/2024

The full potential of AI can only be realized, and its benefits equitably distributed, if fair and broad access to data is provided.

Data Race and Ethical Concerns: Because AI systems, especially Large Language Models (LLMs), require enormous amounts of high-quality training data, competition for data has intensified.
- This demand raises ethical concerns: it could encourage the use of pirated or substandard datasets, such as the controversial 'Books3' collection of pirated texts.
Large Language Models (LLMs): These are sophisticated AI systems capable of understanding and generating human-like text by learning from extensive data, supporting various language-related applications.
Feedback Loops and Biases: The dependence on existing datasets can create feedback loops that reinforce biases present in the data.
- If AI models are trained on flawed datasets, they may perpetuate and even amplify these biases, leading to skewed outputs that often reflect a narrow, Anglophone-centric perspective.

Lack of Primary Sources: Current LLMs are mainly trained on secondary sources, which often lack the depth and richness of primary cultural artifacts.
- Essential primary sources, like archival documents and oral traditions, are frequently overlooked, limiting the diversity of data available for AI training.
Underutilization of Cultural Heritage: Many cultural heritage repositories, such as state archives, remain underutilized in AI training.
- These archives hold vast amounts of linguistic and cultural data that could significantly enhance AI's understanding of humanity's diverse history and knowledge.
Digital Divide: The digitization of cultural heritage is often given low priority, resulting in limited access to valuable data that could benefit AI development.
- This data gap disproportionately affects smaller companies and startups, hindering their ability to compete with larger tech firms.