Language processing algorithm detects disease symptoms in clinical records with 97.6% accuracy

Researchers have developed a natural language processing algorithm, trained on clinical record data, capable of detecting disease symptoms with high accuracy, according to a study published June 15 in JMIR Medical Informatics.

Clinical records contain valuable information regarding the symptoms of many diseases, but the data need to be codified by natural language processing algorithms. In this study, researchers outlined the development of Clinical History Extractor for Syndromic Surveillance (CHESS), a natural language processing algorithm.


“By grouping symptoms identified into specific syndromes based on the presentation of the illness, we may potentially identify illness clusters that would not otherwise be suspected, particularly when leveraging off other routinely available information in electronic health records, such as demographic and geolocation data,” wrote first author Antony Hardjojo, PhD, and colleagues. “However, to capture clinical presentation as syndromes requires additional intervention.”

CHESS uses keywords found in patients’ electronic medical records to extract 48 signs and symptoms suggesting respiratory infections, gastrointestinal infections and other diseases. The researchers evaluated CHESS on 1,680 notes, with half used for training and half for validation.
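
To give a rough sense of what keyword-based symptom extraction can look like, here is a minimal Python sketch. It is not the CHESS implementation: the `SYMPTOM_KEYWORDS` lexicon and the `extract_symptoms` function are hypothetical names invented for illustration, and the three symptoms shown stand in for the 48 the study describes.

```python
import re

# Hypothetical mini-lexicon for illustration only; the study's actual
# keyword lists are not reproduced here.
SYMPTOM_KEYWORDS = {
    "cough": ["cough", "coughing"],
    "fever": ["fever", "febrile"],
    "diarrhea": ["diarrhea", "diarrhoea", "loose stools"],
}


def extract_symptoms(note: str) -> set:
    """Return the symptoms whose keywords appear as whole words in the note."""
    text = note.lower()
    found = set()
    for symptom, keywords in SYMPTOM_KEYWORDS.items():
        for keyword in keywords:
            # Word boundaries keep "febrile" from matching inside "afebrile".
            if re.search(r"\b" + re.escape(keyword) + r"\b", text):
                found.add(symptom)
                break
    return found


if __name__ == "__main__":
    print(extract_symptoms("Pt c/o fever and coughing x 3 days"))
```

A bare keyword match like this cannot tell "fever" from "no fever", which is why a practical system also needs to track whether each mention is affirmed, negated or suspected.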

Results showed CHESS reached 96.7 percent precision and 97.6 percent recall on the training dataset and 96 percent precision and 93.1 percent recall on the validation dataset. The tool's overall accuracy was 97.6 percent, and it was able to identify symptom duration in 81.2 percent of records.
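
For readers less familiar with these metrics: precision is the share of extracted symptoms that are correct, and recall is the share of true symptoms that were extracted. The short Python sketch below shows the standard calculation on made-up data; the `precision_recall` function and the example records are illustrative assumptions, not figures or code from the study.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Compute precision and recall for a set of extracted items.

    precision = true positives / everything the algorithm extracted
    recall    = true positives / everything a human annotator marked
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up (note, symptom) pairs for illustration only.
    predicted = {("n1", "fever"), ("n1", "cough"), ("n2", "fever"), ("n3", "rash")}
    actual = {("n1", "fever"), ("n1", "cough"), ("n2", "fever"),
              ("n2", "vomiting"), ("n3", "diarrhea")}
    print(precision_recall(predicted, actual))  # 3 of 4 extracted correct, 3 of 5 found
```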

“We have developed a natural language processing algorithm dubbed CHESS that achieves good performance in extracting signs and symptoms from primary care free-text clinical records,” concluded Hardjojo and colleagues. “In addition to the presence of symptoms, our algorithm can also accurately distinguish affirmed, negated, and suspected assertion statuses and extract symptom durations.”
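
As a simplified illustration of what distinguishing assertion statuses and extracting durations can involve, the Python sketch below checks the few words before a symptom keyword for cue terms and pulls a duration with a regular expression. This is a generic cue-word approach assumed for illustration, not the method described in the study; the cue lists, `assertion_status` and `extract_duration` are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical cue words; a real system would need far richer lists.
NEGATION_CUES = ("no", "denies", "without")
SUSPICION_CUES = ("possible", "suspected", "likely")

DURATION_PATTERN = re.compile(r"(\d+)\s*(day|week|month)s?\b")


def assertion_status(sentence: str, keyword: str) -> Optional[str]:
    """Classify a symptom mention as affirmed, negated or suspected.

    Looks at up to three words preceding the keyword. Returns None
    when the keyword does not occur in the sentence.
    """
    text = sentence.lower()
    match = re.search(r"\b" + re.escape(keyword) + r"\b", text)
    if match is None:
        return None
    preceding = re.findall(r"[a-z]+", text[:match.start()])[-3:]
    if any(cue in preceding for cue in NEGATION_CUES):
        return "negated"
    if any(cue in preceding for cue in SUSPICION_CUES):
        return "suspected"
    return "affirmed"


def extract_duration(sentence: str) -> Optional[Tuple[int, str]]:
    """Return (count, unit) for the first duration such as '3 days', if any."""
    match = DURATION_PATTERN.search(sentence.lower())
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(assertion_status("Denies fever or chills", "fever"))
    print(extract_duration("Cough for 3 days"))
```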