Key challenges for delivering clinical impact with artificial intelligence

January 8, 2020

Artificial intelligence (AI) research in healthcare is accelerating rapidly, with potential applications being demonstrated across various domains of medicine, including algorithms for interpreting chest radiographs, detecting cancer in mammograms, predicting the development of Alzheimer’s disease from positron emission tomography, and identifying cancerous skin lesions. Several companies are developing platforms that harness AI to identify genetic variants at the root of rare diseases, or even to identify genetic conditions from facial appearance.

However, there are currently limited examples of such techniques being successfully deployed into clinical practice. Even as software using AI-labeled techniques continues to yield tremendous improvements in most software categories, both academics and clinicians have observed that such algorithms fall far short of what can be reasonably considered intelligent.

Artificial intelligence today is commonly described as computer programs capable of learning and reasoning about the world, and its successes are given stories that extrapolate far beyond the way the programs really work. Data scientists usually proclaim that their algorithm learned a new task, rather than that it merely induced a set of statistical patterns from a manually selected and labelled set of training data, under the direct supervision of a software developer who chose the algorithms, parameters and workflows used to build it.

As such, a neural network that correctly distinguishes dogs from cats is said to have learned the innate biological characteristics of those animals, while, in reality, it may merely have detected that all examples of dogs wore collars in the training dataset. In fact, the underlying neural network doesn’t actually understand what a “dog” or a “cat” or a “collar” is. It merely associates specific spatial groupings of colors and textures with particular strings of text. Stray too far from the examples it has seen in the past and it fails, with disastrous consequences if it is screening for cancer or neurodegenerative diseases.

How does AI work?

To some extent, the artificial intelligences of today no more “learn” or “reason” about the world than the linear regressions of the past. They merely induce patterns through statistics. It’s just a curve-fitting exercise, albeit a complex and nontrivial one. The availability of large amounts of data and new hardware architectures makes them capable of representing more complex statistical phenomena than historical approaches, and current deep learning techniques can identify previously hidden patterns, extrapolate trends and predict results across a broad spectrum of problems. But they are still merely mathematical abstractions, no matter how spectacular their results.

Deep learning is a class of computational algorithms based on neural networks which iteratively “learn” an approximation to some function, and we can easily identify three common components that make up almost every machine learning algorithm: representation, evaluation, and optimization.

-Representation involves the transformation of inputs from one space to another, more useful space which can be more easily interpreted. Think of this in the context of a Convolutional Neural Network, the most common architecture for image processing. Raw pixels are not useful for distinguishing a dog from a cat, so we transform them to a more useful representation (e.g., per-class scores, or logits, which a softmax turns into probabilities) which can be interpreted and evaluated.

-Evaluation is essentially the loss function. How effectively did our algorithm transform data to a more useful space? How closely did our output resemble our expected labels (classification)? Did the system correctly predict the next word in the text sequence (Recurrent Neural Networks)? How far does the latent distribution of our data diverge from a unit Gaussian (Variational Autoencoder)? These questions tell us how well our representation function is working; more importantly, they define what it will learn to do.

-Optimization is the last piece of the puzzle. Once we have the evaluation component, we can optimize the representation function in order to improve our evaluation metric. In neural networks, this usually means using some variant of stochastic gradient descent to update the parameters (weights and biases) of our network according to some defined loss function.
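The three components above can be sketched in a few lines. This is a minimal illustration, not a deep network: it assumes a plain logistic regression in NumPy, trained on a synthetic two-feature dataset. The function names (`represent`, `evaluate`, `optimize`, `make_toy_data`) and all hyperparameters are invented for this example.

```python
import numpy as np


def represent(X, w, b):
    """Representation: map raw inputs to a more useful space (one logit per example)."""
    return X @ w + b


def evaluate(logits, y):
    """Evaluation: binary cross-entropy between predicted probabilities and labels."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12  # avoids log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))


def optimize(X, y, steps=500, lr=0.5, batch_size=32, seed=0):
    """Optimization: mini-batch stochastic gradient descent on weights and bias."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-represent(xb, w, b)))
        # Gradient of the cross-entropy loss with respect to w and b.
        w -= lr * (xb.T @ (p - yb) / batch_size)
        b -= lr * np.mean(p - yb)
    return w, b


def make_toy_data(n=200, seed=0):
    """Synthetic, linearly separable data (an assumption made for illustration)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    return X, y


X, y = make_toy_data()
loss_before = evaluate(represent(X, np.zeros(2), 0.0), y)
w, b = optimize(X, y)
loss_after = evaluate(represent(X, w, b), y)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Swapping the loss in `evaluate` changes what the model learns to do, which is the point made above about evaluation defining the task.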

Deep mining electronic health records

The promise of digital medicine stems in part from the hope that, by digitizing health data, we might more easily leverage computer information systems to understand and improve care. In fact, routinely collected patient healthcare data is now approaching the genomic scale in volume and complexity.

Machine learning is currently applied to health records to predict which patients have a higher risk of readmission to the hospital, or a higher chance of not showing up for an appointment or not adhering to prescribed medications. But the potential applications extend much further, to diagnostics, research, drug development and clinical trials.

Despite the rich data now digitized, predictive models built with EHR data rarely use more than 20 or 30 variables, and mostly rely on traditional generalized linear models. In clinical practice, simpler models are most commonly deployed, or even single-parameter warning scores.

A key advantage of machine learning techniques is that investigators do not need to specify which potential predictor variables to consider and in what combinations. Instead, deep networks, together with natural language processing (NLP) techniques to analyse clinical notes from doctors and nurses, are able to learn representations of the key factors and interactions from the data itself. This opens the door to a future where actionable data sources and advanced analytics can be used to generate hypotheses that motivate the development of innovative diagnostics and therapies.

Current challenges

Electronic health records (EHRs) are tremendously complicated. Each health system customizes its EHR system, making the data collected at one hospital look different from data on a similar patient receiving similar care at another hospital. Even a temperature measurement has a different meaning depending on the context. So before we can apply machine learning techniques on a broader scale, we need a consistent way to represent patient records, such as the open Observational Medical Outcomes Partnership (OMOP) Common Data Model standard.

The OMOP Common Data Model allows for the systematic analysis of disparate observational databases. The idea is to transform the data contained within those databases into a common format (data model) and a common representation (terminologies, vocabularies, coding schemes), and then to perform systematic analyses using a library of standard analytic routines written against that common format. This allows for a robust feature generation process and gives better control over adjustment for potential confounders.
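To make the idea of a common format and common representation concrete, here is a small sketch using the temperature example. Everything in it is hypothetical: the site names, the local codes and the `Measurement` record are invented for illustration and do not reproduce the actual OMOP tables or concept identifiers.

```python
from dataclasses import dataclass

# Hypothetical mapping from (site, local code) to a shared concept and its
# source unit. Placeholder names only, not real OMOP concept identifiers.
LOCAL_TO_COMMON = {
    ("hospital_a", "TEMP_F"): ("body_temperature", "fahrenheit"),
    ("hospital_b", "temp_c"): ("body_temperature", "celsius"),
}


@dataclass(frozen=True)
class Measurement:
    """One record in the shared format: same fields and units for every site."""
    person_id: str
    concept: str
    value: float
    unit: str


def to_common(site, person_id, local_code, value):
    """Translate a site-specific reading into the shared representation."""
    concept, source_unit = LOCAL_TO_COMMON[(site, local_code)]
    if source_unit == "fahrenheit":
        value = (value - 32.0) * 5.0 / 9.0  # convert to Celsius
    return Measurement(person_id, concept, round(value, 1), "celsius")


print(to_common("hospital_a", "p1", "TEMP_F", 98.6))
print(to_common("hospital_b", "p2", "temp_c", 37.0))
```

Once both sites emit the same record type, a single analysis routine can run unchanged against either dataset.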

This is also important to avoid overfitting, a problem that appears when the learned model becomes too attuned to the data on which it was trained and therefore loses its applicability to any other dataset. It is a particular risk with curve-fitting approaches such as deep learning, which are very good at representing a given dataset, because the algorithm can fail to recognize normal fluctuations in the data and end up being whipsawed by noise or detecting spurious correlations between variables.
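Overfitting is easy to reproduce with ordinary curve fitting. The sketch below assumes a synthetic linear trend with Gaussian noise and compares a straight line with a degree-9 polynomial fitted to the same ten points. The data, the noise level and the function name `train_and_test_error` are all choices made for this illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def train_and_test_error(degree, seed, n_train=10, noise=0.5):
    """Fit a polynomial to noisy samples of a linear trend.

    Returns (train_mse, test_mse). The test error is measured against the
    noise-free trend on a dense grid the model never saw during fitting.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    y_train = 2.0 * x_train + 1.0 + rng.normal(scale=noise, size=n_train)
    x_test = np.linspace(0.0, 1.0, 200)
    y_test = 2.0 * x_test + 1.0

    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = float(np.mean((model(x_train) - y_train) ** 2))
    test_mse = float(np.mean((model(x_test) - y_test) ** 2))
    return train_mse, test_mse


for degree in (1, 9):
    # Average over several noise draws so one lucky sample does not mislead.
    results = [train_and_test_error(degree, seed) for seed in range(20)]
    train = np.mean([r[0] for r in results])
    test = np.mean([r[1] for r in results])
    print(f"degree {degree}: train MSE {train:.4f}, test MSE {test:.4f}")
```

The flexible model passes through almost every training point, noise included, which is why its training error is lower while its error on unseen inputs is higher.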

Besides this data heterogeneity problem, differing legislation on patients’ medical data makes it almost impossible to move those data out of the hospital. One possible solution is not to take the data to the place where the analysis is done, but to take the analysis techniques to the hospitals and collect the results in a federated model, for which the use of standards like OMOP is also needed.
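The simplest form of this federated idea can be sketched as follows: each hospital runs the same routine locally and shares only aggregates, which a coordinator then combines. This is an illustrative toy, assuming a single numeric variable and invented function names (`local_summary`, `combine`). A real deployment would need further safeguards, since aggregates over very small groups can still reveal information about individuals.

```python
import numpy as np


def local_summary(values):
    """Runs inside one hospital: returns aggregates only, never patient-level rows."""
    values = np.asarray(values, dtype=float)
    return {
        "n": int(values.size),
        "sum": float(values.sum()),
        "sum_sq": float((values ** 2).sum()),
    }


def combine(summaries):
    """Runs at the coordinator: pooled mean and population variance across sites."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean ** 2
    return mean, variance


# Hypothetical readings held at two sites; only the summaries leave each site.
site_a = local_summary([1.0, 2.0, 3.0])
site_b = local_summary([4.0, 5.0, 6.0, 7.0])
mean, variance = combine([site_a, site_b])
print(mean, variance)
```

The combined result matches what would be computed on the pooled data, which only works if every site encodes the variable the same way. That is where a shared standard such as OMOP comes in.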

Intertwined with the issue of generalisability is that of discriminatory bias. Blind spots in machine learning training data can reflect societal biases, with a risk of unintended or unknown differences in accuracy for minority subgroups, and there is fear over the potential for amplifying biases present in the historical data. Studies indicate that many AI systems disproportionately affect groups that are already disadvantaged by factors such as race, gender and socioeconomic background. In medicine, examples include hospital mortality prediction algorithms whose accuracy varies by ethnicity, and algorithms that classify images of benign and malignant moles with accuracy similar to that of board-certified dermatologists but underperform on images of lesions in skin of colour, because they were trained on open datasets of predominantly fair-skinned patients. Detecting and correcting these biases requires relying on varied and rich data sources, so that the breadth and depth of the data expose the subtleties corresponding to each patient group.
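A first step towards detecting this kind of bias is simply to report performance per subgroup instead of as a single overall figure. The sketch below uses made-up labels, predictions and group names purely for illustration, and the helper `accuracy_by_group` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy example: overall accuracy is 4/6, which hides the gap between groups.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b"]
print(accuracy_by_group(y_true, y_pred, groups))
```

In practice the same breakdown would be applied to clinically relevant metrics such as sensitivity and specificity, not only to accuracy.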

Algorithmic interpretability is another important issue and a trade-off exists between performance and explainability. The best performing models (e.g. deep learning) are often the least explainable, whereas models with poorer performance (e.g. linear regression, decision trees) are the most explainable.

While AI approaches in medicine have yielded some impressive practical successes to date, their effectiveness is limited by their inability to ‘explain’ their decision-making in an understandable way. This is potentially problematic for medical applications, where there is particular demand for approaches that are not only well-performing, but also trustworthy, transparent, interpretable and explainable.

Understanding and confronting these challenges is crucial if we want to take full advantage of the opportunities new technologies have to offer for improving the experience of health care, improving the health of populations, and reducing per capita costs of healthcare.

Further reading:

[1] Key challenges for delivering clinical impact with artificial intelligence, BMC Medicine

[2] Scalable and accurate deep learning with electronic health records, Nature

Alberto Labarga

Senior Data Engineer