OpenAI is scaling up synthetic data generation
"This interaction between models foreshadows a recursive self-improvement loop."
In a new interview, Mark Chen said that OpenAI is aggressively scaling up several bets, including a synthetic-data effort that “OpenAI talked a lot about” when GPT-5 was launched.
This is a reference to Sébastien Bubeck’s brief cameo during the GPT-5 launch, in which he said that OpenAI had developed “new training techniques” whereby o3 generated synthetic data used to train GPT-5, teaching it in a way that “raw web data just never could”. The point was not to generate a large volume of data cheaply, but to generate genuinely useful training data.
“This interaction between models foreshadows a recursive self-improvement loop”, Bubeck said, adding: “Here at OpenAI we cracked pre-training, then reasoning, and now we are seeing their interactions significantly deepened. In the future, AI systems will move far beyond our current pre-training and post-training pipelines we’ve been used to and we are seeing the first steps towards this right now and right here.”
And OpenAI is now aggressively scaling it up.
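OpenAI has not published any details of this pipeline, but the general pattern Bubeck describes, a stronger reasoning model writing candidate training examples that are then filtered for usefulness rather than volume, can be sketched with the public Chat Completions API. Everything below is an assumption for illustration only: the model names, prompts, and the use of a second model as a grader are not taken from OpenAI’s actual setup.

```python
# Illustrative sketch only: OpenAI has not described its real pipeline.
# Pattern: a strong "teacher" model writes candidate training examples,
# and a filtering step keeps only the useful ones. The model names,
# prompts, and grader below are all assumptions.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEACHER_MODEL = "o3"          # assumed teacher; any strong reasoning model
GRADER_MODEL = "gpt-4o-mini"  # assumed grader used to filter weak samples


def generate_candidate(topic: str) -> str:
    """Ask the teacher model to write one worked example on a topic."""
    response = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Write one self-contained problem and a fully worked "
                f"solution on this topic: {topic}"
            ),
        }],
    )
    return response.choices[0].message.content or ""


def keep_example(example: str) -> bool:
    """Quality filter: keep only examples the grader judges correct and useful."""
    verdict = client.chat.completions.create(
        model=GRADER_MODEL,
        messages=[
            {"role": "system",
             "content": ("Reply with KEEP or DISCARD only: is this worked "
                         "example correct, non-trivial, and useful as "
                         "training data?")},
            {"role": "user", "content": example},
        ],
    )
    return "KEEP" in (verdict.choices[0].message.content or "").upper()


# Generate many candidates, keep only the survivors, and hand them to a
# separate fine-tuning / training run (not shown). The emphasis is on
# usefulness, not raw volume.
topics = ["multi-step unit conversions", "off-by-one errors in loops"]
dataset = []
for topic in topics:
    candidate = generate_candidate(topic)
    if keep_example(candidate):
        dataset.append({"topic": topic, "text": candidate})
```

In a sketch like this, the filtering step is where the “useful data, not cheap data” emphasis shows up: most of the work goes into deciding what to discard, not into producing more text.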
