Nvidia accused of scraping ‘A Human Lifetime’ of videos per day to train AI
The global average human lifespan of 73 years, multiplied by 365 days and 16 waking hours per day, equals 426,320 hours of scraped video per day.
Nvidia is being accused of scraping millions of videos online to train its own AI products. Sources say the videos weren’t just intended for research but were supposed to be used for the company’s products, including its Omniverse 3D world generator, self-driving car systems, and Digital Humans avatar generator. These reports allegedly came from an anonymous former Nvidia employee who shared the data with 404 Media.
According to the outlet, several employees were instructed to download videos to train Nvidia’s AI. Many have raised concerns about the legality and ethics of the move, but project managers have consistently reassured them. Ming-Yu Liu, vice president of Research at Nvidia, allegedly responded to one question with, “This is an executive decision. We have an umbrella approval for all of the data.”
It isn’t the first time an AI tech company has been accused of scraping online content without permission. Several lawsuits exist against AI companies like OpenAI, Stability AI, Midjourney, DeviantArt, and Runway. Nvidia isn’t affected at the moment, as it’s primarily known for supplying AI chips to data centers, which helped make it one of the most valuable companies in the world.
However, it seems that Nvidia also wants to get into the data processing game by creating foundational AI models that other companies can build upon. To gain an edge in the highly competitive AI market, Nvidia is allegedly aiming to train its systems on a massive library of online video data.
“We are finalizing the v1 data pipeline and securing the necessary computing resources to build a video data factory that can yield a human lifetime visual experience worth of training data per day,” said Liu in an email.
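For a sense of scale, the "human lifetime" in Liu's email can be converted into hours with some back-of-envelope arithmetic. The sketch below is only an illustration: the 73-year lifespan and 16 waking hours per day are the figures used in this article's subheading, while the 365-day year and the function name are assumptions made here, not numbers from Nvidia or 404 Media.

```python
# Back-of-envelope estimate of "a human lifetime" of video, in hours.
# Assumed inputs (from this article's subheading, not from Nvidia):
LIFESPAN_YEARS = 73          # global average lifespan
WAKING_HOURS_PER_DAY = 16    # hours awake per day
DAYS_PER_YEAR = 365          # leap days ignored for simplicity


def waking_hours_in_lifetime(years=LIFESPAN_YEARS,
                             hours_per_day=WAKING_HOURS_PER_DAY,
                             days_per_year=DAYS_PER_YEAR):
    """Return the total number of waking hours in one lifetime."""
    return years * days_per_year * hours_per_day


lifetime_hours = waking_hours_in_lifetime()
print(f"Waking hours in a lifetime: {lifetime_hours:,}")  # 426,320

# Collecting that much footage every 24 hours means gathering video
# this many times faster than it plays back in real time.
speedup = lifetime_hours / 24
print(f"Hours of video per real-time hour: {speedup:,.0f}")  # 17,763
```

Under these assumptions, a pipeline hitting that target would need to collect roughly 17,763 hours of footage for every hour that passes.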
Some sources report that Nvidia used publicly available videos, data licensed exclusively for non-commercial research, YouTube videos, and even movies and shows from Netflix. It was even suggested that the company would have someone watch the movies while using screen capture technology to record from Netflix, although we cannot ascertain if this was a joke. “We should get a lot of high-quality face videos from this,” adds Liu.
The Nvidia team working on its AI training should also consider capturing gameplay video and tapping the GeForce Now team to help them get it. However, Jim Fan, a senior research scientist at Nvidia, said, “We don’t have yet have statistics or video files yet, because the infras [sic] is not yet set up to capture lots of live game videos & actions. They’re both engineering & regulatory hurdles to hop through. But we will add cleaned & processed GFN (GeForce Now) data to team-vfm as soon (as) they arrive.”
404 Media says the AI project, dubbed Cosmos, started in February 2024. By March, the team had downloaded 100,000 videos, and in May, an email said that they had compiled 38.5 million URLs, with almost 40% of them coming from cinematic videos.
It’s unclear how far the Cosmos project extends within Nvidia, but 404 Media has quoted Nvidia CEO Jensen Huang responding to an email about it with, “Great update. Many companies have to build video FM [foundational models]. We can offer a fully accelerated pipeline.”
Nvidia is likely rushing to build its model while copyright and other AI training issues haven’t yet been settled, leaving a massive legal gray area. At the moment, there is no specific law that deals with AI training, but legislators have already taken notice. Several bills in Congress specifically tackle this, like the AI Foundation Model Transparency Act and the Generative AI Copyright Disclosure Act.
Google argues that AI scraping is ‘Fair Use,’ but we don’t know where these laws will take us. So, while nothing is yet in black and white, many companies want to get the most out of online data to gain a leg up on the competition.
Jowi Morales is a tech enthusiast with years of experience working in the industry. He’s been writing for several tech publications since 2021, with a focus on tech hardware and consumer electronics.
ThomasKinsley More reasons to hate AI. CoPilot finally appeared on my W10 machine. Thankfully I was able to uninstall the abomination.
vanadiel007 This article shows how short our lives are. Each hour ticking away until the number reaches 0...
hotaru251 Many have raised concerns about the legality and ethics of the move
This is why we need a court system to go over it (ideally one that has an understanding of the issue of its importance) sooner rather than later and if it is indeed breaking rules and stealing content either pay people or (ideally) scrap it all & force em to start over.