‘Exaggerated hopes and exaggerated fears’: 4 ways big data complicates research

Using big data and AI-driven prediction models can be clinically useful, but it’s also important to learn about that data and the processes involved in collecting it, according to work published in JAMA Psychiatry.

Gregory E. Simon, MD, MPH, used his background in psychiatry to scrutinize the use of big data in mental health care in the article, published Feb. 27. Simon, a senior investigator at the Kaiser Permanente Washington Health Research Institute in Seattle, said tech innovations like computerizing clinician notes, identifying symptom patterns with machine learning and predicting responses to certain antidepressants have been hugely beneficial in the mental health world, but there’s also an inherent hesitancy about the rapid nature of those changes.

“In practice, machine-learned algorithms already help to identify veterans at highest risk of suicide and medical patients at risk of rapid decompensation,” Simon wrote in JAMA Psychiatry. “But these early sightings of predictive analytics in healthcare have prompted both exaggerated hopes and exaggerated fears that big data and machine learning will supplant human clinical intelligence.

“Both the hopes and the fears stem from some overly optimistic expectations—that data quantity outweighs data quality and that artificial intelligence can spin straw into gold.”

Simon cautioned researchers and physicians alike to be wary of these four factors:

Not all big data are trustworthy or useful

While data drawn from real-world health records are practical and scientifically meaningful, Simon said, a large sample is no guarantee of quality: even the biggest record databases cannot always substitute for smaller-scale studies focused on specific conditions in niche populations.

When pulling records from national databases, he said, physicians should consider several questions: whether researchers accurately translated what clinicians recorded (diagnosis codes and procedure codes, for example); whether patients of interest would typically present for care in the setting of that database; how records systems or financial incentives might have influenced results; and whether clinicians in the database would accurately recognize and assess the clinical condition in question.

“Slippage at any of these points could distort the relationship between clinical reality and the image of that reality in a dataset,” Simon wrote. “Different types of error (e.g. false-positive versus false-negative diagnoses in electronic health records) would have different implications depending on the specific scientific or clinical question.”
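Simon's point about error types can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration, not from the editorial; the prevalence, sensitivity and specificity figures are assumed. It shows how false negatives and false positives distort an observed prevalence in opposite directions.

```python
# Illustrative only: how imperfect diagnosis coding distorts observed prevalence.
# All numbers below are assumed for the example, not taken from the editorial.

def observed_prevalence(true_prev, sensitivity, specificity):
    """Fraction of records carrying the diagnosis code, given that true
    cases are coded with probability `sensitivity` and non-cases are
    mis-coded with probability 1 - `specificity`."""
    return true_prev * sensitivity + (1 - true_prev) * (1 - specificity)

true_prev = 0.05  # assume 5% of patients truly have the condition

# False negatives only (some true cases never get coded): prevalence understated.
under = observed_prevalence(true_prev, sensitivity=0.80, specificity=1.00)

# False positives only (some non-cases get coded anyway): prevalence overstated.
over = observed_prevalence(true_prev, sensitivity=1.00, specificity=0.98)

print(f"true {true_prev:.3f}, false negatives {under:.3f}, false positives {over:.3f}")
```

Which direction of error matters more depends on the question being asked of the data, which is exactly Simon's point.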

Sample sizes are growing into the millions—and that might be too big

Both sample sizes and the number of variables involved in big data collection are growing, Simon wrote, but as more data are included in studies “nearly any comparison of interest becomes statistically significant.”

The problem is that even when a result is statistically significant, it still might not have meaningful implications for clinical practice or policy. Simon explained that as data grow longer, that is, as sample sizes grow larger, the P values that are so critical to clinical research “will lose value as indicators of true significance.” For example, a difference in the probability of refilling an initial antidepressant prescription of 70.4 percent versus 71.6 percent is clinically trivial, yet in a database of millions of records it can reach statistical significance even at a P value threshold of .00000000001.
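Simon's warning about P values can be illustrated with a quick two-proportion z-test. In the sketch below, only the 70.4 versus 71.6 percent refill figures come from the article; the sample sizes are assumed. The same 1.2-point difference flips from non-significant to vanishingly small P as the sample grows.

```python
import math

def two_proportion_p(p1, p2, n):
    """Two-sided P value for a two-proportion z-test with n subjects
    per group (normal approximation)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

p1, p2 = 0.704, 0.716  # refill probabilities from the article

p_small = two_proportion_p(p1, p2, n=1_000)      # assumed modest trial
p_big = two_proportion_p(p1, p2, n=1_000_000)    # assumed records database

print(f"n=1,000:     P = {p_small:.2f}")   # not significant
print(f"n=1,000,000: P = {p_big:.1e}")     # far below .00000000001
```

The effect size never changes; only the sample does, which is why P values alone say little about clinical importance at this scale.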

“As data grow wider to include hundreds of potentially correlated predictors, conventional variable selection methods can overfit to chance or idiosyncratic associations,” Simon wrote. “More sophisticated variable selection or machine learning methods become necessary to identify meaningful or generalizable associations among hundreds or thousands of comparisons.”
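The overfitting risk Simon describes is easy to reproduce. In the simulation below (entirely synthetic data, no real records), the outcome and all 500 candidate predictors are pure noise, yet naive "pick the strongest correlation" selection finds an association that looks real in the training sample and evaporates in a holdout sample.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n_train, n_test, n_predictors = 200, 200, 500

# Outcome and predictors are all independent noise: every true association is zero.
y_train = [rng.gauss(0, 1) for _ in range(n_train)]
y_test = [rng.gauss(0, 1) for _ in range(n_test)]
predictors = [
    ([rng.gauss(0, 1) for _ in range(n_train)],
     [rng.gauss(0, 1) for _ in range(n_test)])
    for _ in range(n_predictors)
]

# Naive selection: keep the predictor with the strongest training correlation.
best = max(predictors, key=lambda p: abs(pearson_r(p[0], y_train)))
r_train = pearson_r(best[0], y_train)
r_test = pearson_r(best[1], y_test)

print(f"best training |r| = {abs(r_train):.3f}, holdout |r| = {abs(r_test):.3f}")
```

Screening hundreds of comparisons guarantees some will look meaningful by chance, which is why Simon argues that more careful variable selection and validation become necessary as data grow wider.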

It’s becoming difficult to establish and explain causal relationships

When analyzing big data, Simon said, it’s important to distinguish between two primary tasks: prediction and explanation (or hypothesis testing). He said a model optimized for prediction would consider antidepressant use, for example, in the context of hundreds of potential predictors and interactions, but such a model cannot distinguish a causal relationship from confounding.

“But if our goal is pure prediction, we would not care,” he wrote. “If instead our goal is explanation, then our analyses would attempt to isolate estimated effects of antidepressant use from all other predictors or potential confounders. We might still use variable selection or machine learning methods to select those potential confounders, but with the goal of isolating one potential relationship.”

The development of prediction models is often informed by theory, Simon said, but models optimized for prediction aren’t well-suited to analyzing causal relationships. On the other hand, analytic methods that are more appropriate for testing theories aren’t the best tools for prediction.
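The prediction-versus-explanation distinction can be sketched in a few lines. The simulation below is my own illustration with synthetic data, not Simon's: illness severity stands in for a confounder that drives both treatment and outcome. Treatment has no true effect, yet its unadjusted association with the outcome is strong; adjusting for the confounder, as an explanatory analysis would, recovers a coefficient near zero.

```python
import random

rng = random.Random(1)
n = 5_000

# Synthetic example: severity drives both treatment and outcome;
# treatment itself has NO true effect on the outcome.
severity = [rng.gauss(0, 1) for _ in range(n)]
treatment = [s + rng.gauss(0, 1) for s in severity]
outcome = [2 * s + rng.gauss(0, 1) for s in severity]

def slope(x, y):
    """Simple regression slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

def adjusted_slope(x, y, z):
    """Coefficient of x in a regression of y on x and confounder z,
    via the Frisch-Waugh approach: residualize x and y on z first."""
    bxz, byz = slope(z, x), slope(z, y)
    mz = sum(z) / len(z)
    x_resid = [a - bxz * (c - mz) for a, c in zip(x, z)]
    y_resid = [b - byz * (c - mz) for b, c in zip(y, z)]
    return slope(x_resid, y_resid)

unadjusted = slope(treatment, outcome)                   # looks like a strong "effect"
adjusted = adjusted_slope(treatment, outcome, severity)  # near zero: the truth

print(f"unadjusted slope = {unadjusted:.2f}, adjusted slope = {adjusted:.2f}")
```

A pure prediction model would happily keep the confounded association because it predicts well; only the explanatory analysis cares that the "effect" isn't causal.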

It’s harder than ever to be transparent

Because health system records don’t necessarily permit researchers to share detailed patient-level data, given the risk of identifying patients and exposing their medical information, Simon said researchers need to be especially transparent about their data processing procedures.

“Every line of processing code should be available for inspection, testing and reuse in other settings,” he said. “A similar level of transparency is necessary regarding complex analytic methods. While machine learning models are often described as black boxes, development and validation of those models can still be reported in detail.”

Anyone using a similar model should have access to descriptive statistics for each variable in the dataset, details of model performance across a range of cut points and even the original computer code, according to Simon. He said complex analyses that aren’t transparently described probably shouldn’t be trusted, and that proprietary prediction models are “inherently problematic.”
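The kind of reporting Simon calls for, performance across a range of cut points, is straightforward to produce. The sketch below uses toy scores and labels assumed for illustration; it tabulates sensitivity and specificity at each threshold of a hypothetical risk score.

```python
def performance_at_cutpoint(scores, labels, threshold):
    """Sensitivity and specificity when flagging scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: a hypothetical risk score and true outcomes for eight patients.
scores = [0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

for t in (0.05, 0.35, 0.50, 0.75, 0.95):
    sens, spec = performance_at_cutpoint(scores, labels, t)
    print(f"cut point {t:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Publishing a table like this, alongside the code that produced it, lets outside readers judge a model's trade-offs without ever seeing patient-level records.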