We're running low on data to train AI. The good news is there's a fix for that

- The well of untapped data that fuelled the last wave of AI breakthroughs is running dry, leaving AI models in limbo.
- The ability to create new data and datasets to train models on is available to us, but obstacles remain.
- Companies and economies that drop the barriers for the creation of novel data will earn a competitive advantage.
A generation ago, the rise of the World Wide Web ignited the Big Data era and helped drive the AI revolution we are seeing today. At first, more data translated to deeper insights, leading to advances in risk modeling, targeted advertising, business intelligence and other data-driven innovations. However, despite the world’s data doubling every three to four years, experts now say AI models are running out of data, which will significantly hamper their growth and effectiveness.
The reality is that AI can ingest and synthesize data faster than we can generate “new” data it hasn’t seen before. For example, once AI has absorbed all the knowledge in a scientific textbook, no new insights can be gained until a new edition is published. Even then, the subject matter is largely the same, so AI knowledge expansion is incremental. Although the amount of data keeps increasing, it is the lack of variety and novelty that is holding AI back.
Regardless of technological advances, Large Language Models (LLMs) all trained on the same existing data will eventually generate the same commoditized outputs. Without new data, AI cannot help us solve more advanced business, scientific and societal challenges.
This is particularly evident in quantitative industries such as pharmaceuticals, financial services and chemicals, where the endless possible permutations – drug combinations, market conditions, atomic structures – far outpace our capacity to model them and extract new data and insights. In the past, the only way to explore these permutations was to conduct physical experiments for each one – an impossibly slow, prohibitively expensive or simply unfeasible process.
The well of untapped data that fuelled the last wave of AI breakthroughs is running dry, leaving these increasingly powerful AI models in limbo. To unlock AI’s true potential, we need new sources of quantitative data that will enable us to model and explore the future, not just analyse the past.
The good news is that we have ways of creating new "synthetic" data.
How we can generate novel data
Rapidly generating novel datasets for complex AI systems can be approached in two ways: automation or computation.
Automation leverages robotics and advanced sensors to conduct physical experiments around the clock. While this increases output, deploying thousands of lab-bots is not cost effective and robs scientists of critical hands-on learning and knowledge-gathering through experimentation.
Computation combines diverse datasets, physical laws and deep computational models to digitally simulate complex systems – from biochemical reactions to mechanical stress equations – more accurately, rapidly, safely and cost-effectively.
Computation forms the backbone of Large Quantitative Models (LQMs). Unlike LLMs that extrapolate from historical text riddled with biases, errors and misinformation, LQMs are trained on first principles equations governing physics, chemistry, biology and other quantitative domains. These models not only simulate outcomes, but generate auditable, causal explanations and new data that does not exist in current literature.
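To make the idea concrete, here is a minimal, purely illustrative sketch of simulation-generated data: a single first-principles equation (first-order reaction kinetics, dC/dt = -kC) is swept across parameter combinations that were never measured in a lab, producing novel, labelled curves a model could train on. The reaction, parameter ranges and noise level are invented for illustration and are not drawn from any particular LQM.

```python
# Illustrative sketch only: generating "synthetic" training data from a
# first-principles model, here simple first-order reaction kinetics.
# The reaction, rate constants and noise level are hypothetical examples.
import numpy as np

def simulate_decay(c0: float, k: float, t: np.ndarray) -> np.ndarray:
    """Concentration over time for a first-order reaction A -> B,
    governed by dC/dt = -k * C, with closed form C(t) = C0 * exp(-k * t)."""
    return c0 * np.exp(-k * t)

rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 50)  # time points (arbitrary units)

# Sweep parameter combinations that were never measured in a lab.
records = []
for c0 in rng.uniform(0.5, 2.0, size=100):     # initial concentrations
    for k in rng.uniform(0.05, 1.0, size=30):  # rate constants
        curve = simulate_decay(c0, k, t)
        noisy = curve + rng.normal(0.0, 0.01, size=t.shape)  # sensor-like noise
        records.append({"c0": c0, "k": k, "curve": noisy})

print(f"Generated {len(records)} synthetic experiments from one equation.")
```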
The impact across industries is considerable. For example, instead of new drugs taking decades of costly lab experimentation to develop, LQMs can rapidly model drug interactions, likely mechanisms of action, or toxicity for thousands of potential candidates simultaneously before lab experiments begin. These virtual experiments make physical research vastly more targeted and efficient, and minimize lengthy, costly trial-and-error drug discovery techniques. They also help researchers explore chemical spaces that have yet to be cataloged or tackle complex, “undruggable” conditions where progress has stalled due to a lack of new data.
Synthetic data for a new innovation ecosystem
While this capability to generate synthetic data is transformative, its impact hinges on a broader innovation ecosystem. Today, essential datasets needed for medical or industrial breakthroughs can be difficult to access. This creates a barrier to AI innovation. However, a shift is underway.
Data for foundational models can be provisioned via platforms or dedicated access programmes, making groundbreaking AI capabilities more broadly accessible without compromising intellectual property or data privacy, governance and security standards. The cost and expertise required to build these models remain nontrivial, but platform-based access democratizes participation, allowing collective intelligence to flourish.
For example, sharing data resources such as virtual patient populations or protein-ligand combinations could allow researchers to explore novel therapies in virtual trials across broader genetic backgrounds. This approach could advance therapies for challenging conditions like cancer, vaccines for global pandemics or personalized medicine. Similar models are emerging in finance, materials science and energy, each leveraging democratized access to accelerate problem-solving at scale.
Generating data for a competitive advantage
The capacity to generate data is not merely a technical advantage; it is a strategic one. Companies or governments that can pivot from slow, expensive “design-build-test” cycles to rapid, iterative “simulate-refine-validate” workflows will take the lead in transforming their industries or countries. Here’s what that might look like in some of the world’s biggest industries:
- Life sciences: Synthetic patient data and virtual cell models can substantially reduce drug development timelines and costs and help predict clinical trial outcomes before patient enrolment.
- Finance: Simulating complex market scenarios enables robust portfolio stress-testing, delving beyond historical crises to prepare for truly novel risks (see the sketch after this list).
- Manufacturing: Digitally modelling atomic structures and physical properties can identify novel compounds, chemicals, catalysts or alloys faster and produce superior products that meet precise performance criteria.
- Energy: Turning captured carbon, waste materials, or low-value by-products into high-value products improves sustainability and creates new revenue streams. Accelerating R&D for advanced batteries supports global electrification efforts.
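As a companion to the finance example above, the following sketch shows the kind of scenario generation it alludes to: a simple Monte Carlo simulation of hypothetical market paths used to estimate a portfolio's value-at-risk. The asset weights, return and covariance assumptions and the risk threshold are illustrative placeholders, not figures from the article.

```python
# Illustrative sketch only: Monte Carlo simulation of hypothetical market
# scenarios for portfolio stress-testing. Weights, return assumptions and
# the covariance matrix are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)

weights = np.array([0.6, 0.4])                  # hypothetical equity/bond split
mean_daily = np.array([0.0004, 0.0002])         # assumed mean daily returns
cov_daily = np.array([[2.5e-4, 3.0e-5],
                      [3.0e-5, 4.0e-5]])        # assumed daily covariance

n_scenarios, horizon = 10_000, 250              # one trading year per scenario

# Simulate novel scenarios rather than replaying historical crises.
daily = rng.multivariate_normal(mean_daily, cov_daily, size=(n_scenarios, horizon))
portfolio_returns = np.prod(1.0 + daily @ weights, axis=1) - 1.0  # annual P&L

var_99 = np.percentile(portfolio_returns, 1)    # 99% value-at-risk estimate
print(f"Simulated {n_scenarios} scenarios; 99% VaR of {var_99:.1%}")
```

The design choice worth noting is that risk is estimated from simulated scenarios the portfolio has never experienced, rather than from a replay of past crises alone.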
The coming wave of AI-powered economic and scientific leadership will be determined by those who master data generation methodologies, not those who stockpile historical datasets. Generating causal, trustworthy data will be the key to accelerating innovation across every sector.
Architecting the data-first future
We now stand at an inflection point: we can remain dependent on finite, historical data, or embrace the opportunities afforded by generated data. Business and government leaders must prioritize investment in new data generation methods to create AI and industrial ecosystems that are more innovative, competitive and resilient. Implementing a data-first architecture is essential to unlock the power of collective AI intelligence and drive future breakthroughs.
Leaders must champion this transition, cultivating the talent, partnerships and frameworks to propel their organizations into a new era where innovation is fuelled not by the scarcity of what’s been observed in the past, but by the abundance of what’s possible in the future.