AI tools performed worse on data outside original health system

Deep-learning models trained to detect pneumonia from chest X-rays performed worse when tested on X-rays from outside their original hospital systems, suggesting AI tools should undergo a wide range of testing before being used in clinical settings.

Convolutional neural networks (CNNs) designed to screen for pneumonia performed better on internal than on external test data in three out of five natural comparisons, according to a recent study published in PLOS Medicine.

“The performance of CNNs in diagnosing diseases on X-rays may reflect not only their ability to identify disease-specific imaging findings on X-rays but also their ability to exploit confounding information,” the authors stated. “Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.”

With interest growing in using CNNs for computer-aided diagnosis in healthcare, a research team led by Mount Sinai Hospital in New York decided to assess how well deep-learning models trained at one hospital system generalize to external hospital systems.

The study was conducted at the Icahn School of Medicine at Mount Sinai. Researchers trained and evaluated the deep-learning models using more than 158,000 chest X-rays from three institutions: the National Institutes of Health Clinical Center, Mount Sinai Hospital and the Indiana University Network for Patient Care.

According to a press release from Mount Sinai, the internal performance of the CNNs “significantly exceeded” external performance in most of the comparisons. The deep-learning models were also able to “detect the hospital system where an X-ray was acquired with a high-degree of accuracy, and cheated at their predictive task based on the prevalence of pneumonia at the training institution.”
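To make that shortcut concrete, here is a minimal Python sketch, not the study's code, of a "model" that ignores the image entirely and scores every X-ray by the pneumonia prevalence of the hospital it came from. The two hospitals, their case counts and the helper names (`auc`, `shortcut_predictions`) are all invented for illustration.

```python
# Hypothetical cohorts: (hospital name, pneumonia cases, normal cases).
# The counts are invented for illustration and are not taken from the study.
COHORTS = [("Hospital A", 300, 700), ("Hospital B", 20, 980)]


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shortcut_predictions(n_pos, n_neg):
    """Labels and scores from a 'model' that knows only the site's prevalence."""
    prevalence = n_pos / (n_pos + n_neg)
    labels = [1] * n_pos + [0] * n_neg
    scores = [prevalence] * (n_pos + n_neg)  # same score for every X-ray at the site
    return labels, scores


pooled_labels, pooled_scores, per_site_auc = [], [], {}
for name, n_pos, n_neg in COHORTS:
    labels, scores = shortcut_predictions(n_pos, n_neg)
    per_site_auc[name] = auc(labels, scores)  # evaluated within one hospital
    pooled_labels += labels
    pooled_scores += scores

# Evaluated on the mixed data from both hospitals.
pooled_auc = auc(pooled_labels, pooled_scores)

print(f"Pooled AUC: {pooled_auc:.3f}")
for name, value in per_site_auc.items():
    print(f"{name} AUC: {value:.3f}")
```

With these invented numbers, the pooled score comes out at about 0.76, while the score within each hospital is exactly 0.50, the same as guessing. A model that merely recognizes where an X-ray was taken can therefore look useful on mixed data while offering nothing at any single site.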

Based on the results, the researchers believe AI platforms should be thoroughly assessed in a variety of real-world situations to ensure their accuracy.

“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” Eric Oermann, MD, senior author and neurosurgery instructor at the Icahn School of Medicine, said in a statement. “Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”