As AI blurs the lines between real and synthetic data, strong governance is essential

In most organizations, synthetic data governance is created by many different groups working together. (Image: Unsplash/Leif Christoph Gottwald)
Arun Sundararajan
Harold Price Professor, Entrepreneurship and Technology, NYU Stern School of Business

- Synthetic data – artificially generated information that mimics real-world data – can fill gaps, protect privacy and enable the testing of new scenarios.
- The lines between synthetic and real data are blurring, creating new and significant opportunities, as well as serious risks.
- Business leaders must prioritize oversight and compliance, building robust traceability and provenance systems to govern synthetic data use.
Once a niche tool used to address data gaps or safeguard privacy, synthetic data (artificially generated information that mimics real-world data) is transforming the use of artificial intelligence (AI) in many industries. It can fill data gaps, protect privacy and enable the testing of new scenarios, providing a scalable and cost-effective alternative when real-world data is limited or sensitive.
But as synthetic data proliferates, the line between real and artificial blurs, threatening trust, distorting knowledge and embedding systemic risks.
The opportunities for using synthetic data are vast, but success will rely on strong governance, inclusive and high-quality data practices and transparent collaboration among developers, scientists, policy-makers and organizational leaders.
The evolution of synthetic data
Synthetic data emerged as a solution when high-quality, representative real-world data was unavailable due to incompleteness, bias or privacy restrictions. It became particularly valuable in cases where the required data simply did not exist. Today, synthetic data continues to supplement real-world datasets covering underrepresented languages, health conditions and demographic groups. It enhances equity in settings ranging from clinical trials and criminal justice to financial inclusion.
But the landscape of synthetic data has expanded dramatically since those early days. Synthetic data is no longer just a tool of necessity; it is a driver of innovation. A recent strategic brief by the World Economic Forum’s Global Future Council on Data Frontiers addressed the range of new methods for generating synthetic data that have spawned myriad novel application areas.
Entire urban environments can now be replicated for autonomous vehicle testing, for example, as self-driving car company Waymo has done. Media companies like ByteDance can generate massive new synthetic training data sets for their recommendation systems. In healthcare, synthetic patient data is being used to test treatment plans at scale without exposing medical records.
Simulated data has become particularly powerful, offering controlled environments for stress-testing financial markets, modelling climate impacts or running “digital twin” scenarios for infrastructure planning.
Synthetic data's novel challenges
This promise comes with new risks. Because synthetic data is pervasive, realistic and foundational in shaping AI systems, it can become indistinguishable from authentic data sources. This creates several risks:
Bias or error amplification
If the underlying data used to generate synthetic data is biased or incorrect, the results may reinforce inequities rather than reduce them – especially if the process of generating the synthetic data is itself biased. This challenge grows as datasets are increasingly created specifically for training AI systems.
AI autophagy
As AI systems are trained on AI-generated outputs, accuracy and reliability degrade, undermining performance across domains like computer vision or natural language processing. Beyond the widely documented context of “model collapse” for generative AI, this risk can present in more subtle ways. For example, computer vision systems may be compromised if trained on rapidly proliferating AI-generated image and video data in which lighting, motion or overlapping objects are not rendered realistically.
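To see why this feedback loop is corrosive, consider a deliberately simple sketch (an illustration of the general phenomenon, not an example from the Forum brief): a one-dimensional statistical model repeatedly re-fitted to its own synthetic output. Each generation bakes the previous generation's estimation noise back into the model, and the fitted distribution tends to drift away from the original data.

```python
# Toy illustration (not from the Forum brief) of "AI autophagy" in one dimension:
# a Gaussian is repeatedly re-fitted to samples drawn from its own previous fit.
# Estimation noise compounds across generations, and on average the fitted
# spread shrinks, a miniature analogue of model collapse.
import numpy as np

rng = np.random.default_rng(seed=0)

# Generation 0: "real-world" data.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(25):
    mu, sigma = data.mean(), data.std()   # fit the model to the current data
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}, std={sigma:.3f}")
    # The next generation is trained only on synthetic samples from that fit.
    data = rng.normal(loc=mu, scale=sigma, size=100)
```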
Erosion of trust
From deepfakes to identity theft through voice cloning, unauthorized synthetic media threatens public trust in data authenticity itself. If people no longer believe what they see, hear or read, the consequences ripple far beyond technical systems.
These risks, while familiar, are magnified by the difficulty of distinguishing between AI-generated and real-world data. The benefits of synthetic data become a liability when governance is weak.
Collaborating on synthetic data governance
Synthetic data can create better outcomes when organizations prioritize robust governance, transparency and multi-stakeholder collaboration. Success means bridging two worlds: the developers and end users who build and apply the technology, and the executives, lawyers and policy advisors who shape its use. Each stakeholder plays a distinct governance role that no other can fulfil.
Developers and end users can drive stronger technical governance, for example, enhancing the quality and transparency of the models that generate their synthetic datasets and championing safeguards like watermarking and dataset nutrition labels.
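To make the "dataset nutrition label" idea concrete, here is a minimal sketch of what such a label could look like as machine-readable metadata. There is no single mandated schema; the field names and example values below are assumptions for illustration, not a published standard.

```python
# Illustrative sketch of a "dataset nutrition label" as machine-readable metadata.
# Every field name and example value here is a hypothetical, chosen for illustration.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetNutritionLabel:
    name: str
    version: str
    synthetic_fraction: float                      # share of records that are AI-generated
    generator: str                                 # model or simulator that produced them
    source_datasets: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)
    intended_uses: list[str] = field(default_factory=list)
    watermarked: bool = False                      # do synthetic records carry a watermark?


label = DatasetNutritionLabel(
    name="claims-triage-training-set",             # hypothetical dataset
    version="2025.06",
    synthetic_fraction=0.4,
    generator="internal tabular generative model (hypothetical)",
    source_datasets=["claims-2019-2023 (de-identified)"],
    known_limitations=["rural policyholders underrepresented"],
    intended_uses=["model prototyping", "bias testing"],
    watermarked=True,
)

# Published alongside the dataset, the label tells downstream users how much
# of the data is synthetic and how it was produced.
print(json.dumps(asdict(label), indent=2))
```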
But perhaps the most important intervention is investing in data traceability. Robust provenance systems allow organizations to identify how and when synthetic data was introduced, aiding accountability and reducing risks like bias and AI autophagy. Given the high cost of retroactive tracing, upfront investment in provenance infrastructure should be a business priority.
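As a rough sketch, a provenance system can be as simple as an append-only log of records that link each dataset version to its sources, its generation method and a content hash. The schema below is hypothetical and meant only to show the kind of information such a system captures; production systems might instead build on standards such as W3C PROV or a data catalogue's lineage features.

```python
# Minimal sketch of an append-only provenance log, assuming a simple
# home-grown schema (hypothetical, for illustration only).
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(dataset_bytes: bytes, dataset_id: str, parents: list[str],
                      method: str, is_synthetic: bool) -> dict:
    """One lineage entry recording how and when a dataset version was created."""
    return {
        "dataset_id": dataset_id,
        "parents": parents,                        # upstream dataset IDs
        "method": method,                          # e.g. "survey", "simulation", "generative model"
        "is_synthetic": is_synthetic,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "content_hash": hashlib.sha256(dataset_bytes).hexdigest(),
    }


# Hypothetical example: a synthetic dataset derived from a de-identified real one.
log = [
    provenance_record(b"...raw survey export...", "survey-v1", [], "survey", False),
    provenance_record(b"...generated rows...", "survey-v1-synth", ["survey-v1"],
                      "tabular generative model", True),
]

# Walking the parent links later shows exactly where synthetic data entered the pipeline.
print(json.dumps(log, indent=2))
```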
Prioritizing synthetic data governance
As is so often the case, however, technical governance is not enough. Executive and policy leadership must treat synthetic data governance as a strategic priority in its own right, rather than simply folding it into broader AI governance discussions.
They should pursue tailored approaches to synthetic data governance, including:
- Developing context-aware standards that recognize the unique properties of synthetic and simulated data.
- Collaborating closely with privacy and AI regulators to ensure alignment with evolving frameworks.
- Promoting education within the organization about opportunities, risks and best practices.
Realizing the benefits of synthetic data while mitigating known risks is a shared responsibility among engineers, policy advisors, executives and users, working collaboratively and proactively. Together, these groups can safely unlock the immense potential of this new generation of synthetic data.
License and Republishing
World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.
The views expressed in this article are those of the author alone and not the World Economic Forum.