AI can unlock cancer’s complexities — if we build the data infrastructure first

AI is helping cancer researchers to make breakthroughs.

Today, most cancer research still happens lab by lab, dataset by dataset. Image: Unsplash/National Cancer Institute

Alicia Zhou
CEO, Cancer Research Institute
  • Enormous shared datasets enabled modern AI to master complex tasks like coding and human reasoning.
  • But fragmented data silos currently prevent machine learning from unlocking life-saving breakthroughs in cancer immunotherapy.
  • Standardized infrastructure and global collaboration could accelerate drug discovery by creating AI-ready research networks.

Large language models learned to write, code and reason because they were trained on enormous, shared datasets — everything from Shakespeare to software repositories. Scale, standardization and open access made modern AI possible.

Cancer research deserves the same treatment.

AI models can already detect patterns across billions of variables. Applied to medicine, these systems could predict which patients will respond to treatment, uncover why therapies fail, and simulate drug combinations before they ever reach a clinical trial. In immunotherapy — where outcomes depend on millions of dynamic interactions between immune cells and tumors — that kind of pattern recognition could be transformative.

Machine learning systems only perform as well as the data they are trained on.

The science is ready. Unfortunately, the data infrastructure is not. Today, most cancer research still happens lab by lab, dataset by dataset. Valuable information sits in silos — locked behind institutional firewalls, scattered across supplementary files or stored in incompatible formats. Even when findings are published, the underlying data is often incomplete (biased towards only positive outcomes) or impossible to reproduce.

Machine learning systems only perform as well as the data they are trained on. Fragmented, inconsistent datasets produce fragmented, inconsistent insights. Without shared standards and pooled data, AI cannot help us unlock the complexity of cancer treatment — no matter how powerful the algorithms become.

If we want AI to accelerate cures, we first have to build the right foundation of data for it to train on.

Why shared data matters

This moment is uniquely consequential. On one side, biology has entered a new era. Single-cell and spatial technologies now let us observe the immune system with extraordinary resolution — not just which cells are present, but where they are in space, how they interact and how they evolve over time. We can measure cancer (and its treatment) as a living, dynamic system. On the other side, AI architectures have matured to ingest exactly this type of multimodal data — genomic, spatial and longitudinal — at scales humans simply cannot process.

For the first time, the measurement tools and the computational tools are aligned. But without coordinated infrastructure, we risk missing an immense opportunity.

The consequences are not theoretical. Research that can’t be reproduced wastes an estimated $28 billion a year in the United States alone, and the problem starts with access. When the Center for Open Science set out to verify 193 experiments from the most influential cancer studies, they could not obtain enough information to even attempt most of them. Of the 50 experiments they managed to complete over eight years, fewer than half produced the same results. The data was either locked behind paywalls, buried in file drawers or simply never shared. A study in BMC Medicine found that just 16% of oncology data is publicly available — a figure that drops below 1% when checked against standards that would allow other researchers to actually use it.

In a field where lives depend on speed, this inefficiency is unacceptable. And at a time when AI has the potential to accelerate discoveries, that inefficiency has become our biggest obstacle.

Building the foundation: CRI Discovery Engine

At the Cancer Research Institute, we recently launched the CRI Discovery Engine to address this gap — not as a proprietary database, but as shared infrastructure for the entire field.

Working alongside researchers from Stanford University School of Medicine, the University of Pennsylvania Perelman School of Medicine and Memorial Sloan Kettering Cancer Center, as well as technology partner 10x Genomics, we are standardizing how immunotherapy data is generated, structured and shared. The goal is simple: create a large, harmonized, AI-ready dataset that any qualified researcher can use. Participating scientists have committed to breaking down the silos of academic research by seeding the database with their own initial findings. After the initial phase, outside researchers across the globe will be able to add their data, creating a living resource that continually grows more valuable. We aim to create a common language for cancer immunotherapy research that makes results reproducible, comparable and AI-accessible.

Importantly, this kind of effort only works when incentives are aligned. Companies understandably protect intellectual property. Individual labs compete for recognition and funding. But diseases like cancer do not respect institutional boundaries. Precompetitive collaboration — where data infrastructure is shared even while therapies compete — is essential.

This is where nonprofit and public-private partnerships can play a critical role: convening stakeholders, setting standards and building assets no single entity could justify creating alone.

What comes next

The next breakthroughs in cancer will not come from one lab or one algorithm. They will come from networks: scientists, clinicians, technologists and policy-makers working from the same foundation.

Imagine AI models trained on harmonized data from thousands of cancer and treatment combinations. Researchers could test hypotheses in simulated experiments before running real ones. Clinicians could identify likely responders before treatment begins. Discoveries made in one institution could immediately accelerate progress in another.

This is not a moonshot. It is infrastructure. And like any infrastructure project — roads, power grids, the internet — it requires coordination, standards and collective investment.

AI will help us decode cancer’s complexity. But algorithms alone will not save lives. The real work is building the shared foundation that allows intelligence (both human and artificial) to learn together. If we get that right, we can compress decades of discovery into years.

For patients, that time isn’t a trivial metric. It’s survival.

License and Republishing

World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.

The views expressed in this article are those of the author alone and not the World Economic Forum.

© 2026 World Economic Forum