Is Claims Data Really So Bad For Health Care Analytics?

Two commonplaces heard in the health IT field are that the data in EHRs is aimed at billing, and that billing data is unreliable input to clinical decision support or other clinically related analytics. These statements form two premises to a syllogism for which you can fill in the conclusion. But at two conferences last week–the Health Datapalooza and the Health Privacy Summit–speakers indicated that smart analysis can derive a lot of value from claims data.

The Healthcare Cost and Utilization Project (HCUP), run by the government’s Agency for Healthcare Research and Quality (AHRQ), is based on hospital release data. Major elements include the payer, diagnoses, procedures, charges, length of stay, etc., along with potentially richer information such as patients’ ages, genders, and income levels. A separate Clinical Content Enhancement Toolkit does allow states to add clinical data, while American Hospital Association Linkage Files let hospitals upload data about their facilities.

But basically, HCUP data revolves around the claims from all-payer databases. It is currently collected from 47 states and varies from state to state, depending on what data each state allows to be released. HCUP goes back to 2006 and powers a lot of research, notably to improve outreach to underserved racial and ethnic groups.

During an interview at the Health Privacy Summit, Lucia Savage, Chief Privacy Officer at ONC, mentioned that one can use claims data to determine what treatments doctors offer for various conditions (such as mammograms, which tend to be underused, and antibiotics, which tend to be overused). Thus, analysts can target providers who fail to adhere to standards of care and theoretically improve outcomes.
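
The kind of analysis described here can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (provider, diagnosis, treatment), the sample values, and the 0.5 threshold are all hypothetical, and real claims carry standardized diagnosis and procedure codes rather than plain words.

```python
from collections import defaultdict

# Hypothetical claim records: (provider_id, diagnosis, treatment).
# Names and values are illustrative only, not a real claims schema.
CLAIMS = [
    ("dr_a", "bronchitis", "antibiotic"),
    ("dr_a", "bronchitis", "antibiotic"),
    ("dr_a", "bronchitis", "none"),
    ("dr_b", "bronchitis", "none"),
    ("dr_b", "bronchitis", "antibiotic"),
    ("dr_b", "bronchitis", "none"),
    ("dr_b", "bronchitis", "none"),
]


def treatment_rates(claims, diagnosis, treatment):
    """Return {provider: share of that provider's claims for `diagnosis`
    that carry `treatment`}."""
    totals = defaultdict(int)
    treated = defaultdict(int)
    for provider, dx, tx in claims:
        if dx != diagnosis:
            continue
        totals[provider] += 1
        if tx == treatment:
            treated[provider] += 1
    return {provider: treated[provider] / totals[provider] for provider in totals}


def flag_outliers(rates, threshold):
    """List providers whose rate exceeds an assumed threshold."""
    return sorted(p for p, rate in rates.items() if rate > threshold)


rates = treatment_rates(CLAIMS, "bronchitis", "antibiotic")
flagged = flag_outliers(rates, threshold=0.5)
print(rates, flagged)
```

A rate like this says only what was billed, not whether the treatment was appropriate for a given patient, so a flagged provider is a starting point for review rather than a finding.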

M1, a large data analytics company serving a number of industries, bases a number of products in the health care space on claims data. For instance, medical device companies contract with M1 to find out which devices doctors are ordering. Insurance companies use it to sniff out fraud.

M1’s business model, incidentally, is a bit different from that pursued by most analytics organizations in the health care arena. Most firms contract with some institution–an insurer, for instance–to analyze its data and provide it with unique findings. But M1 goes around buying up data from multiple institutions and combining it for deeper insights. It then sells results back to these institutions, often paying out to and taking in payment from the same company.

In short, smart organizations are shelling out money for data about billing and claims. It looks like, if you have a lot of this data, you can reliably lower costs, improve marketing, and–most important of all–improve care. But we mustn’t lose sight of the serious limitations and weaknesses of this data.

  • A scandalous amount of it is just clinically wrong. Doctors “upcode” to extract the largest possible reimbursement for what they treat. A number of them go further and assign codes that have no justification whatsoever. And that doesn’t even count outright fraud, which reaches into the billions of dollars each year and therefore must leave a lot of bad data in the system.

  • Data is atomized, each claim standing on its own. A researcher will find it difficult or even impossible (if patient identifiers are totally stripped out) to trace a sequence of visits that tells you about the progress of treatment.

  • Data is relatively impoverished. Clinical records flesh out the diagnosis with related conditions, demographic information, and other things that make the difference between correct and incorrect treatments.
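
The atomization problem can be made concrete with a small sketch. It assumes a hypothetical claim layout with `patient_id`, `service_date`, and `code` fields (none of these names come from a real dataset). Claims that keep an identifier can be ordered into a treatment sequence, while claims whose identifier has been stripped cannot be linked to anything.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claims; a patient_id of None stands for a stripped identifier.
CLAIMS = [
    {"patient_id": "p1", "service_date": date(2015, 3, 10), "code": "follow-up"},
    {"patient_id": "p2", "service_date": date(2015, 1, 5), "code": "office visit"},
    {"patient_id": "p1", "service_date": date(2015, 1, 20), "code": "office visit"},
    {"patient_id": None, "service_date": date(2015, 2, 1), "code": "imaging"},
]


def visit_sequences(claims):
    """Group claims by patient and order each group by service date.

    Returns (sequences, unlinked): sequences maps each patient_id to its
    chronological list of codes; unlinked holds the claims that have no
    identifier and therefore cannot be placed in any sequence.
    """
    by_patient = defaultdict(list)
    unlinked = []
    for claim in claims:
        if claim["patient_id"] is None:
            unlinked.append(claim)
        else:
            by_patient[claim["patient_id"]].append(claim)
    sequences = {
        pid: [c["code"] for c in sorted(group, key=lambda c: c["service_date"])]
        for pid, group in by_patient.items()
    }
    return sequences, unlinked


sequences, unlinked = visit_sequences(CLAIMS)
print(sequences, len(unlinked))
```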

But on the other hand, to go beyond billing data and reach the data utopia that reformers dream about, we’d have to slurp up a lot of complex and sensitive patient data. This has pitfalls of its own. Little clinical data is structured, and the doctors who do make the effort to enter it into structured fields do so inconsistently. Privacy concerns also raise their threatening heads when you get deep into patient conditions and demographics. So perhaps we should see how far we can get with claims data.

About the author

Andy Oram

Andy Oram writes and edits documents about many aspects of computing, ranging in size from blog postings to full-length books. Topics cover a wide range of computer technologies: data science and machine learning, programming languages, Web performance, Internet of Things, databases, free and open source software, and more. His editorial output at O'Reilly Media included the first books ever published commercially in the United States on Linux, the 2001 title Peer-to-Peer (frequently cited in connection with those technologies), and the 2007 title Beautiful Code. He is a regular correspondent on health IT and health policy. He also contributes to other publications about policy issues related to the Internet and about trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business.


  • For me the questions of data quality are secondary to the whole question of building models on data that’s an abstraction of an abstraction. Billing codes and diagnosis codes are an abstraction of the information exchanged and the therapeutic interaction during an episode of care. The codes represent a simplified model of that episode of care. Rolling up millions, billions or zillions of billing codes to inform a model for performance analysis or policy formation results in a model (an “abstraction”) built on these simplified codes.

    This is not science nor a good driver of policy and ICD-10 isn’t going to fix it.

  • The issue of data quality and thus its usability is critical in this context. The current EHRs and extensive coding requirements take time away from patient care, on the assumption that the resulting data will improve patient care in the future and thus repay the costs.

    Is there a study that tests the basic premise that EHRs improve patient outcomes? Such as: decreased number of visits, improved control of diabetes, … Who has time to perform quality control of the data entered into the EHR?

    I would like to know.
