OpenAI is scaling up synthetic data generation
"This interaction between models foreshadows a recursive self-improvement loop."
In a new interview, Mark Chen said that OpenAI is aggressively scaling up several bets, including a synthetic-data effort that “OpenAI talked a lot about” when GPT-5 was launched.
This is a reference to Sébastien Bubeck’s brief cameo during the GPT-5 launch, in which he said that OpenAI had developed “new training techniques” whereby o3 generated synthetic data used to train GPT-5, teaching it in a way that “raw web data just never could”. The point was not to generate a large volume of data cheaply, but to generate genuinely useful training data.
“This interaction between models foreshadows a recursive self-improvement loop”, Bubeck said, adding: “Here at OpenAI we cracked pre-training, then reasoning, and now we are seeing their interactions significantly deepened. In the future, AI systems will move far beyond our current pre-training and post-training pipelines we’ve been used to and we are seeing the first steps towards this right now and right here.”
And OpenAI is now aggressively scaling it up.
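OpenAI has not published any details of this pipeline, but the general pattern Bubeck describes, a stronger reasoning model writing candidate training examples that are then filtered for usefulness rather than volume, can be sketched with the public Chat Completions API. Everything below is an assumption for illustration only: the model names, prompts, and the use of a second model as a grader are not taken from OpenAI’s actual setup.

```python
# Illustrative sketch only: OpenAI has not described its real pipeline.
# Pattern: a strong "teacher" model writes candidate training examples,
# and a filtering step keeps only the useful ones. The model names,
# prompts, and grader below are all assumptions.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEACHER_MODEL = "o3"          # assumed teacher; any strong reasoning model
GRADER_MODEL = "gpt-4o-mini"  # assumed grader used to filter weak samples


def generate_candidate(topic: str) -> str:
    """Ask the teacher model to write one worked example on a topic."""
    response = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Write one self-contained problem and a fully worked "
                f"solution on this topic: {topic}"
            ),
        }],
    )
    return response.choices[0].message.content or ""


def keep_example(example: str) -> bool:
    """Quality filter: keep only examples the grader judges correct and useful."""
    verdict = client.chat.completions.create(
        model=GRADER_MODEL,
        messages=[
            {"role": "system",
             "content": ("Reply with KEEP or DISCARD only: is this worked "
                         "example correct, non-trivial, and useful as "
                         "training data?")},
            {"role": "user", "content": example},
        ],
    )
    return "KEEP" in (verdict.choices[0].message.content or "").upper()


# Generate many candidates, keep only the survivors, and hand them to a
# separate fine-tuning / training run (not shown). The emphasis is on
# usefulness, not raw volume.
topics = ["multi-step unit conversions", "off-by-one errors in loops"]
dataset = []
for topic in topics:
    candidate = generate_candidate(topic)
    if keep_example(candidate):
        dataset.append({"topic": topic, "text": candidate})
```

In a sketch like this, the filtering step is where the “useful data, not cheap data” emphasis shows up: most of the work goes into deciding what to discard, not into producing more text.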
