Machine learning reidentifies private health data of children, adults

Deidentified health data may not be as private as assumed. Researchers used machine-learning techniques to reidentify the health data of some children and adults, and the findings could signal a need for legislation that protects the privacy of people’s health data.

“The findings of this study suggest that current practices for deidentifying physical activity data are insufficient for privacy and that deidentification should aggregate the physical activity data of many people to ensure individuals’ privacy,” a study published in JAMA said. The study was authored by Liangyuan Na, a graduate student researcher for the Operations Research Center at the Massachusetts Institute of Technology, and colleagues.

For the study, researchers analyzed several datasets of deidentified physical activity data collected from the National Health and Nutrition Examination Survey (NHANES) during 2003-2004 and 2005-2006, and explored whether that data could be reidentified. According to the researchers, there is concern that deidentified physical activity data collected from wearable devices can be reidentified.

Data from 4,720 adults and 2,427 children in the 2003-2004 dataset and 4,765 adults and 2,539 children in the 2005-2006 dataset was evaluated in the study. By using a random-forest algorithm, researchers successfully reidentified the demographic and physical activity data of 94.9 percent of adults and 87.4 percent of children in the 2003-2004 dataset. The algorithm also successfully reidentified the demographic and physical activity data of 93.8 percent of adults and 85.5 percent of children in the 2005-2006 dataset.

By using a linear support vector machine algorithm, researchers successfully reidentified the demographic and physical activity data of 85.6 percent of adults and 69.8 percent of children in the 2003-2004 dataset. It also successfully reidentified the data of 84.8 percent of adults and 67.2 percent of children in the 2005-2006 dataset.
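To illustrate the general idea of matching a deidentified activity record back to a person, here is a minimal Python sketch. It is not the researchers' method: it swaps the random-forest and support vector machine models for a simple nearest-neighbor match, and it runs on synthetic data rather than NHANES. The function names, the 24 hourly activity bins and the noise level are all assumptions made for illustration.

```python
import random


def make_person(rng, n_bins=24):
    """Hypothetical person: a stable activity level for each hourly bin."""
    return [rng.uniform(0, 100) for _ in range(n_bins)]


def observe(profile, rng, noise=10.0):
    """One noisy observation period of a person's activity profile."""
    return [max(0.0, level + rng.gauss(0, noise)) for level in profile]


def nearest(record, known):
    """Index of the known record closest to `record` (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(known)), key=lambda i: dist(record, known[i]))


def reidentification_rate(n_people=200, seed=0):
    """Fraction of released records matched back to the correct person."""
    rng = random.Random(seed)
    people = [make_person(rng) for _ in range(n_people)]
    # Assumption: the attacker already holds one labeled period per person.
    known = [observe(p, rng) for p in people]
    # The "deidentified" release: a different period, with names removed.
    released = [observe(p, rng) for p in people]
    hits = sum(nearest(r, known) == i for i, r in enumerate(released))
    return hits / n_people


rate = reidentification_rate()
print(f"matched {rate:.1%} of released records")
```

In this toy setup, each person's activity pattern is distinctive enough that removing names does little to hide who a record belongs to. Real activity data is far noisier, which is part of why the study relied on trained models rather than a simple distance match.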

“This study suggests that current practices for de-identification of accelerometer-measured (physical activity) data might be insufficient to ensure privacy. This finding has important policy implications because it appears to show the need for de-identification that aggregates the (physical activity) data of multiple individuals to ensure privacy for single individuals,” Na et al. wrote.
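The aggregation the authors describe could take many forms, and the article does not specify one. As a rough sketch under that caveat, the hypothetical Python function below averages activity records across groups of at least k people, so that each released row describes a group rather than one individual. The function name, the grouping by position and the sample numbers are assumptions for illustration only.

```python
def aggregate_groups(records, k):
    """Average activity records over groups of at least k people.

    `records` is a list of equal-length numeric lists, one per person.
    Grouping by position is an assumption made for simplicity.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need at least k records and k >= 1")
    groups = [records[i:i + k] for i in range(0, len(records), k)]
    # Fold a short trailing group into the previous one so that
    # no released row describes fewer than k people.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    # Column-wise mean within each group.
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


# Five hypothetical people, two activity measurements each.
sample = [[10, 20], [30, 40], [50, 60], [70, 80], [90, 100]]
released = aggregate_groups(sample, k=2)
```

Here `released` holds two rows, one for the first two people and one for the remaining three. The trade-off is that larger groups offer more privacy but leave less detail for researchers who use the data.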

According to the researchers, policymakers have expressed concerns that individuals or their actions could be identified from activity data, even though device manufacturers and exercise-focused social networks have stated that sharing deidentified physical activity data poses no privacy risks to individuals. Based on the findings, the researchers suggested that current health privacy laws and regulations should be updated.

“These findings show that sharing deidentified physical activity data may constitute a serious privacy risk, which is problematic because employers, advertisers and other groups may receive deidentified physical activity data,” Na et al. concluded.