OpenAI Trained GPT-4 Using a Million Hours of YouTube Transcriptions

Earlier this week, The Wall Street Journal reported on the difficulty AI companies face in acquiring high-quality training data. Today, The New York Times detailed how some of those companies have dealt with the problem, often by operating in the legal gray area of AI copyright law.

The story begins with OpenAI, which, short on training data, reportedly built its Whisper audio transcription model to get around the problem. According to The New York Times, the company transcribed more than a million hours of YouTube videos to train GPT-4, its most advanced large language model. OpenAI knew the approach was legally questionable but believed it qualified as fair use. The Times reports that OpenAI president Greg Brockman was personally involved in collecting the videos that were used.

In an email to The Verge, OpenAI spokesperson Lindsay Held said the company curates "distinctive" datasets for each of its models to improve their understanding of the world and to keep its research globally competitive. Held added that OpenAI draws on a variety of sources, including publicly available data and partnerships for non-public data, and that it is exploring generating its own synthetic data.

According to The Times, OpenAI had exhausted its supply of useful data by 2021 and, with other sources used up, discussed transcribing YouTube videos, podcasts, and audiobooks. By then, the company had already trained its models on a range of datasets, including computer code from GitHub, databases of chess moves, and educational content from Quizlet.

Google spokesperson Matt Bryant told The Verge by email that the company has seen "unconfirmed reports" of OpenAI's activity. Bryant said that both Google's robots.txt files and its Terms of Service prohibit unauthorized scraping or downloading of YouTube content, and that Google takes "technical and legal measures" to prevent such unauthorized use when it has a clear legal or technical basis to do so. YouTube CEO Neal Mohan made similar remarks this week about the possibility that OpenAI used YouTube to train its Sora video-generating model.

The Times' sources also said that Google itself collected transcripts from YouTube. Bryant confirmed that the company has trained its models "on some YouTube content," and said this was done in accordance with its agreements with YouTube creators.

According to The Times, Google's legal department asked its privacy team to revise the language of its policy to expand what the company could do with consumer data, including data from office tools like Google Docs. The updated policy was reportedly released on July 1st on purpose, to take advantage of the distraction of the Independence Day holiday weekend.

Meta likewise ran up against the limits of available high-quality training data. In recordings obtained by The Times, Meta's AI team discussed its unauthorized use of copyrighted works as it tried to catch up with OpenAI. Having gone through nearly every accessible English-language book, essay, poem, and news article on the internet, the company reportedly considered paying for book licenses or even buying a major publisher outright. Meta also appeared to be limited in how it could use consumer data because of privacy-focused changes it made after the Cambridge Analytica scandal.

Google, OpenAI, and the wider AI training community are grappling with a quickly shrinking supply of training data for their models, which improve as they take in more data. According to The Journal's report this week, companies could be consuming data faster than new content is created by 2028.

In its Monday report, The Journal discussed potential solutions to this problem, including training models on "synthetic" data generated by their own models and using "curriculum learning." The latter involves feeding models high-quality data in a structured order, in the hope that they can make "smarter connections between concepts" from less information. Neither approach has been proven yet. The other option is for companies to use whatever data they can find, with or without permission, which carries its own risks, as the multiple lawsuits filed in the past year or so show.