A new investigation by “The Atlantic” shows that many YouTube videos from both large media channels and individual creators have been included in datasets used for training generative AI. The report includes an online tool that allows anyone to check which creators’ videos appear in those datasets.
Data found using the tool
Atlantic reporter Alex Reisner found that many of the videos included in these training sets come from educational and news channels such as TED and the BBC. Nearly 50,000 TED videos and more than 33,000 BBC videos were identified. The investigation also located videos from smaller creators, some of which appeared in datasets without clear attribution. The Atlantic’s tool extracts identifiers from public datasets and matches them with YouTube uploads. Creators such as Jon Peters were among those whose content was listed.
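The report does not describe how the tool is built, so the following is only a minimal sketch of the general idea of pulling video identifiers out of dataset records and matching them against known uploads. The function names, the input shapes (a list of URLs and a channel-to-IDs mapping) and the assumption that video IDs are 11 characters long are illustrative choices, not details taken from The Atlantic's tool.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: YouTube video IDs are 11 characters drawn from this alphabet.
_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_video_id(url):
    """Return the video ID found in a YouTube URL, or None if there is none."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    candidate = None
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard watch links carry the ID in the "v" query parameter.
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]

    if candidate and _ID_RE.match(candidate):
        return candidate
    return None


def match_dataset_to_uploads(dataset_urls, uploads_by_channel):
    """Map each channel to the sorted list of its video IDs found in the dataset.

    dataset_urls: iterable of URL strings taken from a (hypothetical) dataset.
    uploads_by_channel: dict of channel name -> iterable of that channel's video IDs.
    Channels with no matches are left out of the result.
    """
    dataset_ids = {vid for vid in map(extract_video_id, dataset_urls) if vid}
    matches = {}
    for channel, upload_ids in uploads_by_channel.items():
        found = dataset_ids & set(upload_ids)
        if found:
            matches[channel] = sorted(found)
    return matches
```

Using a set intersection keeps the lookup fast even when a dataset lists millions of records, which matters at the scale the investigation describes.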
YouTube introduced a setting in December 2024 that lets creators decide whether their videos can be used for third-party AI training. Videos are excluded by default unless creators opt in. The option, which appears in YouTube Studio, allows channels to select specific companies or open access more broadly.
Transparency issues with the use of content for AI training
Some creators said they were surprised to see their content in the datasets because they believed the default opt-out would prevent this. Others view selective participation as a way to influence how AI systems incorporate their work. Legal experts note that the findings raise questions about copyright enforcement, terms of service and what rights creators have over videos already ingested into training sets.
The release of the search tool may increase pressure on YouTube and AI developers to provide greater transparency. Some creators and advocacy groups are calling for stronger policies, stricter prevention of unauthorized scraping and possible compensation models. It remains unclear how quickly platforms and regulators will respond to these demands.
