Apple, Anthropic and Other AI Firms Have Reportedly Trained AI Models on Thousands of YouTube Videos

By: gadgets.ndtv.com

Jul 17 2024

Apple, Anthropic, and other major artificial intelligence (AI) firms have reportedly trained AI models on data from more than 170,000 YouTube videos. A new report claims that multiple AI companies used a publicly available dataset called the Pile, which contained the plain text of the videos' subtitles without any video imagery. The data was collected from popular YouTube creators such as MrBeast, Marques Brownlee, and PewDiePie, as well as Indian YouTube creators such as CarryMinati, BB ki Vines, and Ashish Chanchlani.

Multiple AI Models Reportedly Trained on YouTube Videos

An investigation by Proof News found that subtitles from 173,536 YouTube videos, drawn from more than 48,000 channels, were included in the dataset. According to the report, EleutherAI, a non-profit AI research lab, curated the dataset, which was later used by companies including Apple, Anthropic, Nvidia, and Salesforce. Notably, the lab published a research paper detailing the dataset.

EleutherAI created an 800GB data repository called the Pile and made it publicly available for those who wanted to train AI models but could not afford large datasets. Most of the dataset was taken from publicly available sources such as English Wikipedia and e-books. However, it also contained the subtitles from all the videos compiled in a dataset called YouTube Subtitles.

Based on the description in the model's research paper, the report claimed that the Pile was used to train Apple's OpenELM AI model. Research papers for AI models from Salesforce, Nvidia, and Anthropic also reportedly mention use of the dataset.

Anthropic spokesperson Jennifer Martinez told the publication in a statement, “The Pile includes a very small subset of YouTube subtitles. YouTube's terms cover direct use of its platform, which is distinct from use of the Pile dataset. On the point about potential violations of YouTube's terms of service, we'd have to refer you to the Pile authors.”

Notably, YouTube's terms of service prohibit anyone from accessing videos on the platform using automated means such as robots, botnets, or scrapers. The YouTube Subtitles dataset would fall under the scraping category. A Google spokesperson told Proof News in an email response that the tech giant has taken “action over the years to prevent abusive, unauthorised scraping.” However, the spokesperson did not comment on AI firms' use of the data.

In a post on X (formerly known as Twitter), Marques Brownlee called out Apple for sourcing data from companies whose datasets included transcripts of his videos. He also noted that the iPhone maker technically avoids fault, since it did not collect the data itself.

Apple has sourced data for their AI from several companies

One of them scraped tons of data/transcripts from YouTube videos, including mine

Apple technically avoids "fault" here because they're not the ones scraping

But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024

While this dataset was collected and distributed publicly, there could be other instances of data scraping on platforms such as YouTube. With AI firms scrambling to find more data to train their large language models (LLMs), data procurement might continue to enter similar legally grey areas.
