Better together: Radiology researchers see improved results when combining AI models

Ensemble learning, in which the predictions of multiple AI models trained for the same task are combined, can lead to better overall results, according to new research published in Radiology: Artificial Intelligence.

The study’s authors noted that diversity is crucial when researchers experiment with different combinations.

“Ensembles tend to perform best when each of the individual models performs well in its own right and the correlation among individual model predictions is relatively low,” wrote Ian Pan, of Brown University in Providence, Rhode Island, and colleagues. “Achieving such diversity can be facilitated by developing a number of models using a variety of techniques and selecting the most accurate combination. Because ensembles benefit from low correlation between model predictions, the greater the underlying differences in approach, the greater the improvement, as long as they achieve similar performance.”

Pan et al. explored data from the 2017 RSNA Pediatric Bone Age Machine Learning Challenge for their research. For that competition, a total of 48 teams submitted models designed to estimate bone age from pediatric hand x-rays.

While the best single model achieved a mean absolute deviation (MAD) of 4.55 months, the best-performing ensemble combined four different models and achieved a MAD of 3.79 months. Combining only the highest-ranking AI models yielded a MAD of 3.93 months.
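The mechanism behind this improvement can be sketched in a few lines. The example below is a minimal illustration, not the authors' actual method: the bone-age values and per-model predictions are hypothetical, chosen so the two models' errors are decorrelated. Averaging the predictions lets the errors partially cancel, producing a lower MAD than either model alone.

```python
def mad(preds, truth):
    """Mean absolute deviation between predictions and ground truth, in months."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)

# Hypothetical ground-truth bone ages (months) and two models' predictions;
# the models err in roughly opposite directions on each case.
truth   = [120, 84, 150, 96, 132]
model_a = [125, 80, 154, 92, 137]   # errors: +5, -4, +4, -4, +5
model_b = [116, 89, 147, 100, 128]  # errors: -4, +5, -3, +4, -4

# Simple prediction-level ensemble: average the two models per case.
ensemble = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(mad(model_a, truth))   # 4.4
print(mad(model_b, truth))   # 4.0
print(mad(ensemble, truth))  # 0.4 -- the decorrelated errors mostly cancel
```

If the two models instead made the same mistakes on the same cases (high correlation), the average would inherit those mistakes and the ensemble would gain little, which is the point of the authors' emphasis on diversity.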

“Our results call attention to a concept that has substantial practical implications as computer vision and other machine learning algorithms begin to move from research to the clinical environment, namely that the best results are likely to be achieved by combining multiple accurate and diverse models rather than from single models alone,” the authors wrote. “Thus, practitioners aiming to incorporate machine learning algorithms into their workflow would benefit from having predictions obtained from different models, similar to how the accuracy of a radiologic interpretation can be bolstered with multiple readers.”

The team also noted that these findings emphasize how important it is for researchers to participate in open competitions such as the 2017 RSNA Pediatric Bone Age Machine Learning Challenge, “as they provide a standardized use case, a common training set and an objective assessment method applied equally to all models.”

“This approach has the benefit of encouraging the development of a diverse set of models and then highlighting not only the best-performing models but also those with the best potential to be combined into high-performing ensembles if the organizers choose to utilize that aspect of the challenge,” the authors wrote.

Ensemble learning in a clinical setting

In a related editorial, radiologist Eliot L. Siegel, MD, chief of imaging for the VA Maryland Healthcare System, noted that the research by Pan et al. “demonstrated multiple important points and potential strategies to advance the current state of the art in artificial intelligence applications in medical imaging.”

Siegel also noted that this concept of combining multiple models could prove effective in a clinical setting. If healthcare providers have access to numerous AI models for identifying intracranial hemorrhage, for example, they could work to develop ensembles of their own that “would likely outperform any individual commercial or research algorithm’s performance.”

“It is becoming increasingly clear that the ultimate partnership in diversity will be between humans and machines and both will inevitably have and develop very different but complementary approaches to challenges in diagnostic imaging and health care in general,” Siegel wrote. “Freeing ourselves of the constraints associated with trying to develop computer algorithms that merely emulate human approaches to problem solving will accelerate the arrival and efficacy of our next generation of artificial intelligence applications.”