Is Healthcare Big Data Biased?

Have you ever wondered whether YOUR healthcare data is included in the “big data” everyone’s talking about? After all, healthcare big data analytics are going to change the world; shouldn’t those changes be representative of the population they will impact?

To answer that question, we have to identify the sources of the healthcare big data being used to effect change, and consider the likelihood that your data may have been captured and consumed by one of the reporting organizations. So let’s start with the “capture” part of that equation.

Have you received some type of healthcare service this year? That includes, but is not limited to: hospital visit, physical therapy, doctor visit, chiropractor visit, urgent care visit, e-visit or phone consultation, health risk assessment or health fair.

Have you purchased or requested any regulated healthcare product this year, such as prescription drugs?

Do you have private health insurance?

Are you enrolled in Medicare or Medicaid?

If you answered yes to any of the above, and to the last question in particular, then YES, your data is included in the “big data” analytics currently shaping policy. It is likely that each billable product and service is attached to your Electronic Health Record, available for review and reporting by each involved party, from your PCP (Primary Care Provider) to your friendly insurance call center agent. Your individual collection of data points is aggregated into a larger population, then sliced and diced to provide insights for groundbreaking research efforts. Congratulations! But does that inclusion mean that the conclusions driven by healthcare big data are representative?

In principle, data-driven insights become more reliable as the population, and the number of data points, grows. But what if the outliers for the general population are the norm for your data set? Are your conclusions skewed?

What if you represent a population segment that is recognized as underserved? Consider the following, from the first Health Disparities and Inequalities Report, prepared in 2011 by the CDC (Centers for Disease Control and Prevention): “Increasingly, the research, policy, and public health practice literature report substantial disparities in life expectancy, morbidity, risk factors, and quality of life, as well as persistence of these disparities among segments of the population…defined by race/ethnicity, sex, education, income, geographic location, and disability status.”

If your access to healthcare is limited by any of the factors indicated above, your data may not be captured unless/until there is an acute episode which requires medical intervention. In the report, the CDC acknowledges the challenge of capturing national data to support health initiatives for these populations; it is widely accepted as a barrier to healthcare equality that must be overcome.

What if you’re healthy? I’ll use myself as an example. I don’t go to the doctor unless it’s urgent, and I haven’t visited my PCP in over a year. I’ve injured my shoulder and my back over the past year, both of which required MRI and CAT scans to diagnose severity; however, I do not follow any medically supervised treatment plan for rehabilitation. I don’t take any routine prescription medication. I’m an exercise enthusiast who works out intensely 5-6 days/week, and I sleep 8-9 hours a night. Yes, I do sleep that much. And no, my putting all this information into a blog does not constitute the data being captured for use in healthcare big data analytics. Because I haven’t needed to go to my PCP lately, don’t take routine prescription medication, and am neither old enough for Medicare nor within the income limits for Medicaid, the only current healthcare data available for analysis for me is orthopedic in nature and revolves around imaging data, not traditional clinical measures. Someone like me who had NOT experienced an acute care episode would have no current data available for consumption and reporting as part of a larger population.

Could it be that much, if not most, healthcare big data cited for research purposes consists primarily of a triangle of outlier population segments: 1) oldest, 2) poorest, and 3) sickest?
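For the numerically inclined, here is a minimal sketch of how that kind of skew could arise. Every figure in it is invented purely for illustration: the three segments, their sizes, their "capture rates" (the share of people with at least one billable encounter), and their chronic-condition rates are all assumptions, not real statistics. It also assumes that, within a segment, being captured is unrelated to being sick, which is probably generous.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One slice of a made-up population (all values are hypothetical)."""
    name: str
    people: int          # number of people in the segment
    capture_rate: float  # share with at least one billable encounter
    chronic_rate: float  # share with a chronic condition


# Invented numbers, chosen only to make the arithmetic easy to follow.
SEGMENTS = [
    Segment("healthy, rarely seen", 600, 0.20, 0.05),
    Segment("routine care", 300, 0.80, 0.20),
    Segment("oldest / poorest / sickest", 100, 1.00, 0.70),
]


def true_rate(segments):
    """Chronic-condition rate across everyone, captured or not."""
    total = sum(s.people for s in segments)
    sick = sum(s.people * s.chronic_rate for s in segments)
    return sick / total


def captured_rate(segments):
    """Chronic-condition rate among only the people whose data was captured."""
    captured = sum(s.people * s.capture_rate for s in segments)
    sick = sum(s.people * s.capture_rate * s.chronic_rate for s in segments)
    return sick / captured


if __name__ == "__main__":
    print(f"Whole population: {true_rate(SEGMENTS):.1%}")
    print(f"Captured records: {captured_rate(SEGMENTS):.1%}")
```

With these made-up inputs, the captured records suggest a chronic-condition rate of roughly 27%, while the rate across the whole population is 16%. Making the data set bigger would not close that gap; only changing who gets captured would.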

Perhaps. So, when reading about advances in healthcare big data analytics, ask yourself whether that “big data” means “YOUR data”.

PS – For those of you curious about defining “big data” in healthcare, read Dr. Graham Hughes’s blog post for SAS, “How Big Is Big Data In Healthcare?”, detailing the nuances of the term as it relates to data size, complexity, and usage. Also, I’d like to thank the good folks at Vanderbilt University for compiling a fairly comprehensive list of healthcare data resources; it has been highly educational. Finally, if you’d like to read the complete CDC report, you can find it here.

About the author

Mandi Bishop

Mandi Bishop is a hardcore health data geek with a Master's in English and a passion for big data analytics, which she brings to her role as Dell Health’s Analytics Solutions Lead. She fell in love with her PCjr at 9 when she learned to program in BASIC. Individual accountability zealot, patient engagement advocate, innovation lover and ceaseless dreamer. Relentless in pursuit of answers to the question: "How do we GET there from here?" More byte-sized commentary on Twitter: @MandiBPro.


  • For a wacky perspective at 50,000 ft…your examples of key data points are at least noteworthy, and their possible use interesting. Ultimately the state needs access to all physicians’ notes, which contain key patient data not currently available to the collective. This “Big Data” is the actual purpose of the Affordable Care Act (i.e., Obamacare) and its immediate push, and the cost of obtaining this information is irrelevant…it’s a must-have for the state, and have it they will. “We must pass the bill to know what’s in it.” What?
    EHRs were never about patient savings as spun, but about having total access to all information about you and me. Many continue to believe EMRs and their various implementations are about providing efficiencies. Well, I ask: efficiencies for whom? The NSA’s Utah Data Center is about the state gathering every single “byte” of digital information created yesterday, today, and tomorrow, and storing it for later analysis if/when necessary…when completed, this facility will have the capacity to hold 100 years of the world’s data. What does this mean? I can’t even fathom, so I’ll let you think about it! What we currently think is big data is really laughably small…but ever wonder why those notes have such value?

  • Thanks for the interest, Carl! I take pride in my “wacky perspective”. I must say, I find your take on my “spin” a bit wacky, as well. EHR as a mandate has always been about improving the volume, accuracy, and transmission of patient data – to ALL stakeholder parties, government included. If I intimated otherwise, it was erroneous on my part.

    Ostensibly, this data collection would lead to improved diagnosis, treatment adherence, and finally outcomes. That’s the noble goal.

    Information is power. The insights we could gain with 100 years of clinical note data – if properly correlated to other relevant factors – could be exactly what the world needs to cure cancer. That’s the optimist’s answer to, “why those notes have such value”.

    Personally, I’m with Spock: “The needs of the many outweigh the needs of the few – or the one.”

  • Carl, I think you and I would have some fascinating conversations IRL.

    I love a good conspiracy theory as much as the next person, but the point of the article is whether healthcare policy is based on outlier populations who do not necessarily represent you or ME.

    I find it sad that an overwhelming majority of Americans feel a sense of entitlement to use our civil liberties to adopt and maintain unhealthy lifestyles – simply because they can. If we all took accountability for our own health (eating clean, exercising regularly, sleeping), how much of an impact would we make on healthcare cost, efficiency, and ultimately outcomes – without any technological influence?
