Machine learning approaches including deep learning and random forest greatly improved a University of Nottingham team’s ability to predict premature death in a study of half a million U.K. Biobank participants, according to research published in PLOS One.
The study, spearheaded by assistant professor and research scientist Stephen F. Weng, PhD, sought to integrate machine learning into traditional epidemiological work by developing and reporting novel prognostic models to supplement existing techniques. Two years ago, the same team reported that machine learning models could improve the accuracy of cardiovascular disease prediction by around 3.6 percent.
“In the era of big data, there is great optimism that machine learning can potentially revolutionize care, offer approaches for diagnostic assessment and personalize therapeutic decisions on par with, or superior to, clinicians,” Weng and co-authors wrote. “The challenge for applications and algorithms developed using machine learning is to not only enhance what can be achieved with traditional methods, but to also develop and report them in a similarly transparent and replicable way.”
For their current work, the researchers considered 502,628 adults aged 40 to 69 years whose health information was logged in the U.K. Biobank between 2006 and 2010. Using demographic data and taking into account biometric, clinical and lifestyle factors, they developed predictive mortality models using deep learning, random forest and Cox regression.
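The study's actual modeling pipeline is not reproduced in this article, but the general shape of the comparison can be sketched. The example below is a simplified, hypothetical illustration on synthetic data: it treats mortality as a binary classification problem rather than full survival analysis, and it uses logistic regression as a stand-in for the Cox model (scikit-learn has no Cox implementation) and a small multilayer perceptron as a stand-in for deep learning. The feature names and risk formula are invented for illustration only.

```python
# Hypothetical sketch, NOT the study's pipeline: fit three model families
# (a regression stand-in, random forest, and a small neural network) on a
# synthetic cohort and compare their discrimination via AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000

# Invented predictors loosely mirroring the study's demographic,
# biometric and lifestyle inputs.
age = rng.uniform(40, 69, n)
bmi = rng.normal(27, 4, n)
smoker = rng.integers(0, 2, n).astype(float)
X = np.column_stack([age, bmi, smoker])

# Invented mortality outcome: risk rises with age, BMI and smoking.
logit = -9.0 + 0.10 * age + 0.05 * (bmi - 27) + 0.8 * smoker
died = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, died, random_state=0)

models = {
    "logistic regression (Cox stand-in)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "neural network (deep learning stand-in)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500, random_state=0),
    ),
}

# Fit each model and score its discrimination on held-out data.
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Each model's AUC summarizes how well it ranks higher-risk individuals above lower-risk ones; this is the same discrimination metric the study uses to compare its Cox, random forest and deep learning models.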
Nearly 3 percent of the study population died during a cumulative follow-up of 3,508,454 person-years, according to the team’s results, and mortality data were corroborated with national records. The age- and gender-based Cox model—a conventional approach to risk prediction—was the least predictive, with an area under the curve (AUC) of 0.689, followed by the multivariate Cox regression model, which improved discrimination by 6.2 percentage points for an AUC of 0.751.
Applying random forest further improved discrimination by 3.2 percentage points, reaching an AUC of 0.783, and the deep learning model was most successful, improving discrimination by 3.9 percentage points over the multivariate Cox regression approach for an AUC of 0.790.
Relative to the bare-bones age-and-gender Cox regression model, the two machine learning algorithms improved discrimination by 9.4 percentage points (random forest) and 10.1 percentage points (deep learning). While both machine learning methods achieved similar levels of discrimination and were well calibrated, the Cox regression models consistently over-predicted risk.
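Calibration is distinct from discrimination: a model can rank patients correctly yet still systematically inflate everyone's absolute risk, which is the over-prediction the team observed in its Cox models. A minimal sketch of the check, on synthetic data with invented risk numbers, might compare mean predicted risk against the observed event rate within risk-sorted groups:

```python
# Hypothetical sketch: measuring calibration as the average gap between
# predicted risk and observed event rate across risk deciles. A gap near
# zero means well-calibrated; a positive gap means over-prediction.
import numpy as np

rng = np.random.default_rng(1)
n = 20000

true_risk = rng.beta(1, 20, n)            # true event probabilities
events = rng.random(n) < true_risk        # observed binary outcomes

well_calibrated = true_risk               # predictions match the truth
over_predicting = np.clip(true_risk * 1.5, 0.0, 1.0)  # inflated risks

def calibration_gap(pred, events, bins=10):
    """Mean (predicted - observed) difference across risk-sorted deciles."""
    order = np.argsort(pred)
    gaps = []
    for chunk in np.array_split(order, bins):
        gaps.append(pred[chunk].mean() - events[chunk].mean())
    return float(np.mean(gaps))

gap_good = calibration_gap(well_calibrated, events)
gap_over = calibration_gap(over_predicting, events)
```

Here `gap_good` hovers near zero while `gap_over` is clearly positive, mirroring the study's finding that the machine learning models tracked observed risk while the Cox models overshot it.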
“The study shows the value of using machine learning to explore a wide array of individual clinical, demographic, lifestyle and environmental risk factors to produce a novel and holistic model that was not possible to achieve using standard approaches,” Weng et al. said. “This work suggests that use of machine learning should be more routinely considered when developing models for prognosis or diagnosis.”
The authors said next steps include validating these approaches in broader populations and integrating them into healthcare systems, as well as exploring how other machine learning models, such as support vector machines or gradient boosting, could contribute to risk prediction.
“The intriguing variations in machine learning model composition may enable new hypothesis generation for potentially significant risk factors that would otherwise not have been detected,” they wrote. “Epidemiological studies could then be designed specifically, and powered accordingly, to verify these signals.”