According to a recent Proof News investigation co-published with Wired, large tech companies, including Apple, NVIDIA, Salesforce and Anthropic, have allegedly been using YouTube videos as training material for their AI systems. The companies reportedly used subtitles from more than 170,000 YouTube videos without permission, which raises moral and legal questions.

The dataset and its use

Transcripts from more than 48,000 channels, including well-known content creators like MrBeast and Marques Brownlee as well as news organizations like ABC News and The New York Times, are part of the YouTube Subtitles dataset and were taken without permission from the original authors. Marques Brownlee, known as MKBHD, commented on the issue, stating, “This is going to be an evolving problem for a long time.”

The dataset is a component of a larger collection called The Pile, from the nonprofit organization EleutherAI. The Pile also includes datasets from Wikipedia, books and other sources. Several major tech companies, including Apple, NVIDIA and Salesforce, have publicly stated that they use The Pile to train their AI models.

The use of YouTube content for AI training has generated a lot of discussion. Neal Mohan, the CEO of YouTube, has said in the past that training AI on video content, including transcripts, is against the platform's terms of service. Sundar Pichai, the CEO of Google, echoed this view and emphasized how crucial it is to follow YouTube’s terms and conditions.

Creators are particularly concerned about the unauthorized use of their content. David Pakman, a political commentator with over two million YouTube subscribers, expressed his frustration, saying, “This is my livelihood, and I put time, resources, money, and staff time into creating this content.” He believes that creators should be compensated if their work is used for AI training.

In an effort to bring some transparency to the issue, Proof News has developed an interactive tool that lets creators check whether their content was included in the YouTube Subtitles dataset.