AI decreases human-generated content, limiting data for training AI

The use of ChatGPT has led to a decrease in human-generated content, with people asking and answering fewer questions online, according to new research from Corvinus University of Budapest.

People use online content and discussions to learn new things and solve problems, and this material is also essential for training AI, particularly large language models (LLMs) like ChatGPT.

Johannes Wachs, Associate Professor at Corvinus University, and colleagues from UCL and LMU Munich investigated the impact of ChatGPT on the generation of open data on Stack Overflow, an online Q&A platform for computer programmers and an essential source of training data for LLMs.

The researchers found that, after the introduction of ChatGPT, there was a sharp decrease in human content creation: ChatGPT users are less likely to post questions and answers on the platform or visit the platform regularly.

As people turn to ChatGPT instead of online knowledge databases or discussion platforms, it displaces the human behaviour that generates the data it is trained on, and the quality and quantity of data available for training future AI decrease.

“The decreased production of open data will limit the training of future models. LLM-generated content itself is likely an ineffective substitute for training data generated by humans to train new models. Training an LLM on LLM-generated content is like making a photocopy of a photocopy, providing successively less satisfying results,” says Professor Wachs.

The researchers argue that we should prioritise encouraging people to exchange information and knowledge with each other online, rather than relying only on AI and LLMs.

These findings were published in the journal PNAS Nexus.