Correlations and Research Results: Do They Match Up? (Part 2 of 2)

The previous part of this article described the benefits of big data analysis, along with some of the formal, inherent risks of using it. We’ll go even more into the problems of real-life use now.

More hidden bias

Jeffrey Skopek pointed out that correlations can perpetuate bias as much as they undermine it. Everything in data analysis is affected by bias, ranging from what we choose to examine and what data we collect to who participates, what tests we run, and how we interpret results.

The potential for seemingly objective data analysis to create (or at least perpetuate) discrimination on the basis of race and other criteria was highlighted recently by a Bloomberg article on Amazon Price deliveries. Nobody thinks that any manager anywhere said, “Let’s not deliver Amazon Prime packages to black neighborhoods.” But that was the natural outcome of depending on data about purchases, incomes, or whatever other data was crunched by the company to produce decisions about deliveries. ( quickly promised to eliminate the disparity.)

At the conference, Sarah Malanga went over the comparable disparities and harms that big data can cause in health care. Think of all the ways modern researchers interact with potential subjects over mobile devices, and how much data is collected from such devices for data analytics. Such data is used to recruit subjects, to design studies, to check compliance with treatment, and for epidemiology and the new Precision Medicine movement.

In all the same ways that the old, the young, the poor, the rural, ethnic minorities, and women can be left out of commerce, they can be left out of health data as well–with even worse impacts on their lives. Malanga reeled out some statistics:

  • 20% of Americans don’t go on the Internet at all.

  • 57% of African-Americans don’t have Internet connections at home.

  • 70% of Americans over 65 don’t have a smart phone.

Those are just examples of ways that collecting data may miss important populations. Often, those populations are sicker than the people we reach with big data, so they need more help while receiving less.

The use of electronic health records, too, is still limited to certain populations in certain regions. Thus, some patients may take a lot of medications but not have “medication histories” available to research. Ameet Sarpatwari said that the exclusion of some populations from research make post-approval research even more important; there we can find correlations that were missed during trials.

A crucial source of well-balanced health data is the All Payer Claims Databases that 18 states have set up to collect data across the board. But a glitch in employment law, highlighted by Carmel Shachar, releases self-funding employers from sending their health data to the databases. This will most likely take a fix from Congress. Unless they do so, researchers and public health will lack the comprehensive data they need to improve health outcomes, and the 12 states that have started their own APCD projects may abandon them.

Other rectifications cited by Malanga include an NIH requirement for studies funded by it to include women and minorities–a requirement Malanga would like other funders to adopt–and the FCC’s Lifeline program, which helps more low-income people get phone and Internet connections.

A recent article at the popular TechCrunch technology site suggests that the inscrutability of big data analytics is intrinsic to artificial intelligence. We must understand where computers outstrip our intuitive ability to understand correlations.

About the author

Andy Oram

Andy Oram

Andy Oram writes and edits documents about many aspects of computing, ranging in size from blog postings to full-length books. Topics cover a wide range of computer technologies: data science and machine learning, programming languages, Web performance, Internet of Things, databases, free and open source software, and more. My editorial output at O'Reilly Media included the first books ever published commercially in the United States on Linux, the 2001 title Peer-to-Peer (frequently cited in connection with those technologies), and the 2007 title Beautiful Code. He is a regular correspondent on health IT and health policy for He also contributes to other publications about policy issues related to the Internet and about trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business.