AI Companies Seeking AI-Produced Data for Recursive Training

(Image: two symbols of people talking in a loop, sharing the same ideas repeatedly. Credit: Shutterstock)

It seems that AI companies including Microsoft, OpenAI, and Cohere are doing everything they can to secure synthetic data with which to train their AI products. Citing the limited availability of "organic," human-generated data on the world wide web, these companies aim to use AI-generated (synthetic) data in a sort of feedback loop, training new models on data that earlier models have already generated.

“If you could get all the data that you needed off the web, that would be fantastic,” Aidan Gomez, chief executive of the $2 billion LLM start-up Cohere, told the Financial Times. “In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need.”

There's also the matter of cost: human-generated data is, according to Gomez, "extremely expensive." That demand has already led to the founding of dedicated synthetic-data companies, such as Gretel.ai, which specializes in producing synthetic datasets and selling them for training purposes.

The problem of data availability and provenance is one of the biggest limiting factors in the current era of AI, and there are real risks in training AI networks on synthetic data that has already been "chewed" and generated by AIs themselves. For one, there's the issue of compounding deficiencies in the base training data: if the original, non-synthetic training dataset already suffered from biases, those same biases will be absorbed, reproduced, and amplified in each subsequent training iteration, becoming ever more prominent.
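To get a feel for how that snowballing works, here's a deliberately simplified toy simulation in Python. It's a sketch of the dynamic, not a description of how any real model is trained: a dataset starts with a modest 60/40 class imbalance, and every generation a "model" fits the class frequencies, then over-samples its most likely class slightly before producing the next dataset. The imbalance grows with every round.

import numpy as np

rng = np.random.default_rng(0)

def sharpen(p, temperature=0.7):
    # Sharpen a categorical distribution, mimicking a generator's
    # tendency to over-sample its most likely outputs.
    q = p ** (1.0 / temperature)
    return q / q.sum()

# Generation 0: a mildly biased "human" dataset, 60% class A, 40% class B.
data = rng.choice([0, 1], size=10_000, p=[0.6, 0.4])

for gen in range(1, 6):
    # "Train": estimate the class frequencies from the current dataset.
    freqs = np.bincount(data, minlength=2) / len(data)
    # "Generate": sample the next dataset from a sharpened version of them.
    data = rng.choice([0, 1], size=10_000, p=sharpen(freqs))
    print(f"generation {gen}: class A share grew to {np.mean(data == 0):.2f}")

In this toy run, class A's 60% share climbs past 90% within five generations; the minority class is steadily crowded out, even though nothing "new" was ever added to the data.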

But another, perhaps much more impactful issue stems from a recently discovered limit: output quality degrades severely after five training rounds on AI-generated, synthetic data. Whether this "MAD" (Model Autophagy Disorder) condition presents a soft or a hard limit on AI training is a question at the heart of Microsoft and OpenAI's intention to recursively train their AI networks.

This is a space that will likely see a flurry of studies. Microsoft Research, for instance, has published papers on recursively generated short stories (meaning that a model was trained on stories generated by another model) and on a coding AI network trained on AI-generated documentation for Python programming. Verifying the risks of data degeneration in these and other, larger models (such as the 70B-parameter Llama 2, recently released openly by Meta) will be key to how far, and how fast, AI evolves in the foreseeable future.
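For readers who want a concrete feel for that kind of degeneration, here's another stripped-down numerical sketch. It assumes, as a simplification of the sampling bias such studies describe, that each generation of a generative model slightly under-represents the tails of its training data; the "model" here is just a Gaussian fitted to the previous generation's samples, and the diversity of the data visibly shrinks round after round.

import numpy as np

rng = np.random.default_rng(1)

# Generation 0: "real" data, drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=5_000)

for gen in range(1, 6):
    # "Train" a generative model: here, simply fit a Gaussian to the data.
    mu, sigma = data.mean(), data.std()
    # "Generate" the next training set from the fitted model. The 0.95 factor
    # encodes the assumption that generators under-sample the tails slightly.
    data = rng.normal(loc=mu, scale=0.95 * sigma, size=5_000)
    print(f"generation {gen}: fitted std = {sigma:.3f}")

After five self-consuming rounds the spread of the data has dropped by roughly a quarter; extend the loop and it keeps collapsing toward a single point, which is the flavor of failure the "MAD" work warns about.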

With AI-geared companies clamoring for more and more data, it makes sense that they'd try to recursively generate high-quality datasets. This can be done in multiple ways, but perhaps the approach with the greatest likelihood of success is simply letting two AI networks interact with one another, one taking the role of a tutor and the other the role of a student. Human intervention will still be necessary, however, to cull lower-quality data points and to keep "hallucinations" (confident AI statements that aren't truthful) in check.
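As a rough illustration of what that tutor-and-student setup might look like in code, here's a minimal Python sketch. The call_llm helper is a hypothetical stand-in for whichever chat-model API a team might use, and nothing here reflects how Microsoft, OpenAI, or Cohere actually build their synthetic datasets; the reviewer hook at the end is where the human culling and hallucination-checking described above would sit.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    # Hypothetical placeholder: route the prompts to whatever chat model
    # you have access to and return its reply as a string.
    raise NotImplementedError("plug a real model call in here")

def generate_synthetic_pairs(topics, reviewer=None):
    # Have a "tutor" model pose questions and a "student" model answer them,
    # keeping only the pairs that pass a human (or human-defined) review step.
    dataset = []
    for topic in topics:
        question = call_llm(
            "You are a tutor writing one clear exam question.",
            f"Write a question about: {topic}",
        )
        answer = call_llm(
            "You are a diligent student. Answer concisely and factually.",
            question,
        )
        # Human-in-the-loop filter: discard low-quality or hallucinated pairs.
        if reviewer is None or reviewer(question, answer):
            dataset.append({"question": question, "answer": answer})
    return dataset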

There are still obstacles on the road to the technocratic dream of a self-evolving, self-teaching AI: models that can hold internal discussions, make internal discoveries, and produce new knowledge that isn't mere mixing and matching (even though mixing and matching is, after all, one of the hallmarks of creative output).

Of course, we do have to keep in mind that not all dreams are pleasant. We already have trouble dealing with human-induced nightmares; there's no telling how impactful a machine's "nightmares" might be.

Francisco Pires
Freelance News Writer

Francisco Pires is a freelance news writer for Tom's Hardware with a soft spot for quantum computing.