AI needs cultural policies, not just regulation
- The full potential of AI can only be realized, and its benefits equitably distributed, if fair and broad access to data is provided.
Key Highlights:
- Data Race and Ethical Concerns: AI systems, especially Large Language Models (LLMs), require enormous amounts of high-quality data for training, so competition for data has intensified. This demand raises ethical concerns, including fears that it could lead to the use of pirated or substandard datasets, such as the controversial 'Books3' collection of pirated texts.
- Large Language Models (LLMs): These are sophisticated AI systems capable of understanding and generating human-like text by learning from extensive data, supporting various language-related applications.
- Feedback Loops and Biases: Dependence on existing datasets can create feedback loops that reinforce the biases already present in the data. Models trained on flawed datasets may perpetuate and even amplify these biases, producing skewed outputs that often reflect a narrow, Anglophone-centric perspective.
Challenges:
- Lack of Primary Sources: Current LLMs are mainly trained on secondary sources, which often lack the depth and richness of primary cultural artifacts. Essential primary sources, like archival documents and oral traditions, are frequently overlooked, limiting the diversity of data available for AI training.
- Underutilization of Cultural Heritage: Many cultural heritage repositories, such as state archives, remain underutilized in AI training. These archives hold vast amounts of linguistic and cultural data that could significantly enhance AI's understanding of humanity's diverse history and knowledge.
- Digital Divide: The digitization of cultural heritage is often given low priority, which limits access to valuable data that could benefit AI development. The resulting data gap disproportionately affects smaller companies and startups, hindering their ability to compete with larger tech firms.

