Moving beyond AI proofs-of-concept
Towards the industrial - and actual - use of AIML in biotech
Last week, J&J announced a shift in their AI strategy, described as "The company is making a shift to focus on only the highest-value GenAI use cases and shut down pilots that were redundant or underdelivering". It would be easy to make fun of this - 'stop doing things that don't work and do more things that do work' sounds like the sort of cheap business advice that you'd find in an airport bookshop1. But there’s something very real and widespread here:
The majority of biomedical AI innovations rarely move beyond proof of concept into production, with most prototypes failing to translate into real business value.
Most anyone in pharma who has worked with or around AI has seen examples of this:
IBM Watson for Oncology was supposed to help doctors recommend cancer treatments. It performed well in demos but failed in real hospitals, recommending inappropriate treatments. Watson also stumbled when used to mine scientific literature for drug discovery.
Many, many AI-led drug discovery companies have struggled to produce assets that move into trials or even beyond basic pre-clinical tests.
BERG Health's platform for finding oncology biomarkers and targets "discovered" biomarkers that weren’t reproducible in third-party studies.
Many and endless statements have been made about how AI is going to replace radiographers / pathologists / etc. “next year”. To pick one example, Pathai promised to speed pathology slide interpretation with AI, but early examples couldn’t handle variability across labs and processes (stains, scanners, etc.)
DeepMind’s Streams app was meant to predict acute kidney injury but integrated poorly with NHS workflows and ran aground due to ethical considerations, having played fast and loose with patient data.
And these are the examples that made it to production, only to fail visibly in public. We don’t see the remainder of the failure iceberg, where projects shamble along for years, always on the cusp of delivery, before being quietly cancelled.
Let's excavate this problem. As usual, note that I believe firmly in the potential and power of AI to revolutionise drug development and biomedicine. But potential doesn't mean practical or proven. Why is there a gap?
Is there a problem?
The concern is not new. Four years ago, Vibhor Gupta and I held a workshop on this very problem. The response was muted. Most everyone saw the problem, agreed there was an issue, but felt it was minor or transient, or that someone else, somewhere else, would work it out. But is the gap real? In fact, it’s been studied multiple times across multiple industries:
Only 15% of AI projects make it to production (McKinsey)
3 out of 4 AI projects fail to deliver ROI. 75% of executives say their AI projects did not yield substantial business gains (Boston Consulting Group / MIT Sloan Management Review)
85% of AI projects deliver “no measurable value” (Gartner)
80%+ of pharma leaders see AI as strategic, but say very few projects move beyond pilot phase (Deloitte)
AI adoption plateaus around the proof-of-concept stage (MIT Sloan AI Adoption Reports)
85% of AI pilots in pharma are never deployed (BenchSci)
88% of AI pilots fail to reach production (IDC / Lenovo)
42% of businesses scrapped most of their AI initiatives in 2024, up from 17% in the previous year (S&P Global Market Intelligence)
There's a possible riposte to this, asserting that most projects or innovation initiatives of any kind fail, and this is just the natural attrition of experimentation. That's perhaps true, if non-falsifiable. But many of these experiments have promised much, consumed large amounts of resources over many years, been scheduled for production use … only to run into endless delays, excuses, reduced expectations and finally quiet cancellation. Even if you believe this failure rate is appropriate and just the cost of experimentation, it’s valuable to study what separates the successes and failures, to better understand and improve the process.
Why do so many AI projects fail?
Here’s my one-liner summary:
Biomedical ML/AI is performed by many different types of people with different bodies of knowledge, different skills, different incentives and different goals, leading to systemic misalignment.
Let’s step through this.
AI projects are usually misaligned
Most AI proofs of concept fail to translate into real value because they start life disconnected from business problems. It’s easy for projects to be initiated by the tech end of the business, wondering what’s possible rather than what’s useful, with poor scoping and requirements gathering. What is the actual problem this piece of software solves?
This can happen because the data science and AI teams are often stuck away in a silo, isolated from the actual stakeholders. Development takes place 'over there' and is done by 'computer people'2. So they end up solving problems that no one actually cares about, or solving them in ways that don’t actually reflect how the business does things.3
Healthcare and pharma add further complexity because data is not abundant or easily available: small patient populations, silo’d info-systems, legacy infrastructure, changing formats, etc. etc. Or the data is available but has to be accessed through or plugged into particular workflows or systems. We often say that pharma is data-rich. In reality, it isn’t.
In addition, compliance is mandatory and often complicated. You can’t just share data casually; that data has to run on validated and audited systems with stringent security requirements. This clashes with the blissful utopia that most AI apps are developed in, where data is freely available, limitless and yet also impactless, of no value or possible harm to anyone.
Many years ago, I was working with a group of epidemiologists who were obsessed with mosquito wings. There was a logic to it: different species of mosquitoes can be identified from characteristic patterns of veins in their wings. Different species carry different diseases. QED, epidemiologists would trap mosquitoes and then consult “The Big Book of Wings” to find out what they were dealing with.
I saw an opportunity. Why not digitise the wing images, use some fancy AI to match a new wing with the pre-existing corpus? Much better and faster than looking things up in some old-fashioned book, amiright?
The epidemiologists were polite, but demurred. They liked looking up wings in the Big Book. It didn’t take too long - a few hours of page-turning, which they enjoyed. Nor was the answer massively time-critical; a few hours was as good as a few minutes. And to match a wing in my fancy 21st Century All-AI system, they would have had to photograph it, import it into the system, run the software …
No one needed my brilliant wing-recogniser system.
No one cares
Even a good model will die if users don’t trust, understand, or care about using it. And a lot of the time, no one has thought about change management - adoption and transition, how do we get people onto the new system? The actual users keep defaulting to the old systems they know and understand, while the new tool gathers dust.
Even if the end users were impressed by the prototype, if there’s no business unit, sponsor or budget committed to moving it forward, it's not going to happen. This happens so often in big pharma - an R&D or innovation team puts together a valuable prototype or sometimes even a finished system, but there’s nowhere for it to go and live, no one to be responsible for it. So the R&D team ends up taking care of it, even though they don’t have the skills or resources to do so.
We're just plain bad at writing software
The pandemic triggered a firestorm of applications, code and analyses, from all sorts of people, from all sorts of backgrounds. Some of them wanted to help, others just wanted “to help”, others to grab some of the limelight, or to latch onto some funding. The net result was unimpressive. Several papers surveyed the results.
Laure Wynants et al. (BMJ 2020, Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal) surveyed 232 models that variously promised to help diagnose COVID, predict patient outcomes, etc., and assessed them for clinical suitability. In short, could these systems actually deliver any clinical benefit?
Of the 232 models, only two could be argued to hold any promise.
The reasons were multitudinous and varied. The system was opaque. It used data that was difficult or impractical to gather. It targeted the wrong or inappropriate patients. It was developed based on inappropriate data. It was never validated.
Mike Roberts et al. (Nature Machine Intelligence 2021) did a couple of similar studies, looking at using ML over medical images to diagnose or prognosticate COVID. They found much the same issues, plus the added problem of some datasets containing duplicated images because the authors had simply combined datasets haphazardly:
Of 62 studies that could be adequately assessed, none of the models identified were of use due to methodological flaws and/or underlying biases.
And this is a sad statement on the quality of much of the code that underlies AI models. It’s poorly written. PoCs are often literally that, thrown together in Jupyter notebooks with no consideration for production deployment. You could understand this problem in academic software - a lot of the critical software in science is built by people who aren’t programmers and aren’t primarily interested in delivering robust software to be used by other people. But even informatics professionals can write janky, barely-functioning code. The drive for interesting results and flashy demos overrides good software engineering and data science practice. There are few incentives to do things right. At a recent industry roundtable, a speaker sighed, “It’s no one’s job to make systems run right. You don’t get rewarded for being careful.”
At AstraZeneca, we used to talk about “university quality code” - runs on the author’s local machine in their local account, kinda works, kinda shows something, not really reproducible. In the real-world mess of legacy systems, data pipelines, and workflows, will you be able to get the data you need and make a system that runs? Unclear.
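As a caricature of the difference, here is a short, hypothetical Python sketch: the commented-out lines are the notebook version described above (hard-coded personal path, no split, no seed), followed by the same step with the minimum hygiene a production system would expect. File names, column names and the model choice are all invented for illustration.

```python
# --- The "university quality" version: runs on one laptop, once ---
# df = pd.read_csv("C:/Users/alice/Desktop/final_data_v3_FIXED.csv")
# model.fit(df.drop("label", axis=1), df["label"])   # no seed, no held-out set, no checks

# --- The same step with minimal hygiene ---
import argparse
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train(data_path: Path, label_col: str = "label", seed: int = 0) -> float:
    df = pd.read_csv(data_path)
    if label_col not in df.columns:
        raise ValueError(f"Expected a '{label_col}' column in {data_path}")
    X, y = df.drop(columns=[label_col]), df[label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)   # reproducible held-out accuracy

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("data_path", type=Path)   # no hard-coded personal paths
    args = parser.parse_args()
    print(f"Held-out accuracy: {train(args.data_path):.3f}")
```

None of this is sophisticated, which is rather the point: the gap between the two versions is not AI expertise, it’s basic software and data hygiene.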
This is important. Bad biomedical AI models are not zero-cost: every bad one consumes space and attention from useful models.
Is this just an IT problem?
Similar complaints can be found about IT and software development projects, going back years if not decades. But it's not just an IT problem: AI-centric projects fail at twice the rate of non-AI projects, inheriting many of the problems of IT initiatives and layering on complications of their own. AI-driven systems have their own unique challenges:
Model outputs may be probabilistic, which is fundamentally alien to work functions that expect answers to be definitive.
Model outputs need to be explained or justified, especially in regulated or patient-centric spaces, which can be difficult with black-box systems.
AI systems have a greater and more singular dependency on data. In a very real way, the data makes the system and changes what it does. Data can be biased or wrong in many subtle ways. Or it can be right but in the wrong way, based on irrelevant characteristics (the “wolf-husky problem”). Furthermore, as the data or population changes, performance can decay over time (see the sketch after this list).
Arguably, AIML systems are used on more complex and sensitive issues (e.g. patient selection, treatment choice, diagnosis), which makes the potential impact of biased or incorrect models catastrophic.
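To make the data-dependency point concrete, here’s a minimal sketch - in Python, and not drawn from any of the systems mentioned above - of one cheap way to watch for population drift in a deployed model: compare the distribution of an input feature at training time against what the model sees in production, using the Population Stability Index. The feature, sample sizes and thresholds are all invented for illustration.

```python
# Minimal drift-monitoring sketch: compare a feature's training-time
# distribution against what a deployed model currently sees, via the
# Population Stability Index (PSI). All numbers are illustrative.
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    edges = np.histogram_bin_edges(np.concatenate([expected, observed]), bins=n_bins)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    o_frac = np.histogram(observed, bins=edges)[0] / len(observed)
    # Guard against empty bins before taking logs.
    e_frac = np.clip(e_frac, 1e-6, None)
    o_frac = np.clip(o_frac, 1e-6, None)
    return float(np.sum((o_frac - e_frac) * np.log(o_frac / e_frac)))

rng = np.random.default_rng(0)
train_age = rng.normal(55, 10, 5_000)    # population the model was trained on
current_age = rng.normal(62, 12, 5_000)  # older, more varied population it now sees

print(f"PSI = {psi(train_age, current_age):.2f}")
# Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate / retrain.
```

A rising PSI (or a sagging rolling AUC, where outcomes are available) won’t tell you what broke, but it’s a cheap tripwire - and one that most proof-of-concept notebooks never include.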
We may be running a massive multiple-hypothesis test
Consider:
Maybe millions of researchers working on similar problems
Using different approaches and assumptions
Using different data, processed differently
Using different software stacks
Using different tunings & hyper-parameters on these models
Throwing out models that “don’t work”
How many of our results, our “good models” are due to simple, dumb chance?4
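To see how easily chance alone can manufacture “good models”, here’s a toy simulation (every number in it - team count, validation size, accuracy bar - is made up for illustration): thousands of teams each evaluate a model with no real predictive power on a small validation set, and only those that happen to clear an impressive-looking accuracy threshold get talked about.

```python
# Toy multiple-comparisons simulation: models with NO real signal,
# evaluated on small validation sets, filtered by a "publishable" bar.
import numpy as np

rng = np.random.default_rng(42)

n_teams = 10_000   # independent groups trying models
n_val = 100        # size of each team's validation set
baseline = 0.5     # the models are literally coin flips
threshold = 0.60   # accuracy that looks worth reporting

# Each team's observed accuracy is Binomial(n_val, 0.5) / n_val.
accuracies = rng.binomial(n_val, baseline, size=n_teams) / n_val
winners = accuracies >= threshold

print(f"{winners.sum()} of {n_teams} useless models look 'good' "
      f"({winners.mean():.1%}); best observed accuracy = {accuracies.max():.0%}")
```

With these made-up numbers, a few hundred utterly useless models still clear the bar - and in real life, those are exactly the ones that get written up, demoed and funded.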
The solution …?
If the solutions were easy or obvious, everyone would be doing them. But that's a topic for another time.
There's a Dilbert strip where the pointy-haired boss suggests his team would be more effective if they just did the right thing first. Alas, my Google-fu has failed me.
As was once said to me, “The worst thing you can do with an AI project is treat it as an IT project, to be developed by IT.”
It’s not all the fault of the informatics and computer side. It can be damn difficult to get biologists and medics collaborating in a useful way.
Lauren Oakden-Rayner raised this point about the winners of Kaggle competitions: “AI competitions don’t produce useful models … [someone asserted] the proposed solutions are never intended to be applied directly”