Correlations and Research Results: Do They Match Up? (Part 1 of 2)

Eight years ago, a widely discussed issue of WIRED Magazine proclaimed cockily that current methods of scientific inquiry, dating back to Galileo, were becoming obsolete in the age of big data. Running controlled experiments on limited samples just have too many limitations and take too long. Instead, we will take any data we have conveniently at hand–purchasing habits for consumers, cell phone records for everybody, Internet-of-Things data generated in the natural world–and run statistical methods over them to find correlations.

Correlations were spotlighted at the annual conference of the Petrie-Flom Center for Health Law Policy, Biotechnology, and Bioethics at Harvard Law School. Although the speakers expressed a healthy respect for big data techniques, they pinpointed their limitations and affirmed the need for human intelligence in choosing what to research, as well as how to use the results.

Petrie-Flom annual 2016 conference

Petrie-Flom annual 2016 conference

A word from our administration

A new White House report also warns that “it is a mistake to assume [that big data techniques] are objective simply because they are data-driven.” The report highlights the risks of inherent discrimination in the use of big data, including:

  • Incomplete and incorrect data (particularly common in credit rating scores)

  • “Unintentional perpetuation and promotion of historical biases,”

  • Poorly designed algorithmic matches

  • “Personaliziaton and recommendation services that narrow instead of expand user options”

  • Assuming that correlation means causation

The report recommends “bias mitigation” (page 10) and “algorithmic systems accountability” (page 23) to overcome some of these distortions, and refers to a larger FTC report that lays out the legal terrain.

Like the WIRED articles mentioned earlier, this gives us some background for discussions of big data in health care.

Putting the promise of analytical research under the microscope

Conference speaker Tal Zarsky offered both fulsome praise and specific cautions regarding correlations. As the WIRED Magazine issue suggested, modern big data analysis can find new correlations between genetics, disease, cures, and side effects. The analysis can find them much cheaper and faster than randomized clinical trials. This can lead to more cures, and has the other salutory effect of opening a way for small, minimally funded start-up companies to enter health care. Jeffrey Senger even suggested that, if analytics such as those used by IBM Watson are good enough, doing diagnoses without them may constitute malpractice.

W. Nicholson Price, II focused on the danger of the FDA placing too many strict limits on the use of big data for developing drugs and other treatments. Instead of making data analysts back up everything with expensive, time-consuming clinical trials, he suggested that the FDA could set up models for the proper use of analytics and check that tools and practices meet requirements.

One of exciting impacts of correlations is that they bypass our assumptions and can uncover associations we never would have expected. The poster child for this effect is the notorious beer-and-diapers connection found by one retailer. This story has many nuances that tend to get lost in the retelling, but perhaps the most important point to note is that a retailer can depend on a correlation without having to ascertain the cause. In health, we feel much more comfortable knowing the cause of the correlation. Price called this aspect of big data search “black box” medicine.” Saying that something works, without knowing why, raises a whole list of ethical concerns.

A correlation stomach pain and disease can’t tell us whether the stomach pain led to the disease, the disease caused the stomach pain, or both are symptoms of a third underlying condition. Causation can make a big difference in health care. It can warn us to avoid a treatment that works 90% of the time (we’d like to know who the other 10% of patients are before they get a treatment that fails). It can help uncover side effects and other long-term effects–and perhaps valuable off-label uses as well.

Zarsky laid out several reasons why a correlation might be wrong.

  • It may reflect errors in the collected data. Good statisticians control for error through techniques such as discarding outliers, but if the original data contains enough apples, the barrel will go rotten.

  • Even if the correlation is accurate for the collected data, it may not be accurate in the larger population. The correlation could be a fluke, or the statistical sample could be unrepresentative of the larger world.

Zarsky suggests using correlations as a starting point for research, but backing them up by further randomized trials or by mathematical proofs that the correlation is correct.

Isaac Kohane described, from the clinical side, some of the pros and cons of using big data. For instance, data collection helps us see that choosing a gender for intersex patients right after birth produces a huge amount of misery, because the doctor guesses wrong half the time. However, he also cited times when data collection can be confusing for the reasons listed by Zarsky and others.

Senger pointed out that after drugs and medical devices are released into the field, data collected on patients can teach developers more about risks and benefits. But this also runs into the classic risks of big data. For instance, if a patient dies, did the drug or device contribute to death? Or did he just succumb to other causes?

We already have enough to make us puzzle over whether we can use big data at all–but there’s still more, as the next part of this article will describe.

About the author

Andy Oram

Andy Oram

Andy Oram writes and edits documents about many aspects of computing, ranging in size from blog postings to full-length books. Topics cover a wide range of computer technologies: data science and machine learning, programming languages, Web performance, Internet of Things, databases, free and open source software, and more. My editorial output at O'Reilly Media included the first books ever published commercially in the United States on Linux, the 2001 title Peer-to-Peer (frequently cited in connection with those technologies), and the 2007 title Beautiful Code. He is a regular correspondent on health IT and health policy for He also contributes to other publications about policy issues related to the Internet and about trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business.