Exchange Value: A Review of Our Bodies, Our Data by Adam Tanner (Part 2 of 3)

The previous part of this article summarized the evolution of data brokering in patient information and how it was justified ethically and legally, partly because most data is de-identified. Now we’ll take a look at just what that means.

The identified patient

Although doctors can be individually and precisely identified when they prescribe medicines, patient data is supposedly de-identified so that none of us can be stigmatized when trying to buy insurance, rent an apartment, or apply for a job. The effectiveness of anonymization or de-identification is one of the most hotly debated topics in health IT, and in the computer field more generally.

I have found a disturbing split between experts on this subject. Computer science experts don’t just criticize de-identification, but speak of it as something of a joke, assuming that it can easily be overcome by those with a will to do so. But those who know de-identification best (such as the authors of a book I edited, Anonymizing Health Data) point out that intelligent, well-designed de-identified databases have been resistant to cracking, and that the highly publicized successes in re-identification have used databases that were de-identified unprofessionally and poorly. That said, many entities (including the South Korean institutions whose practices are described in Chapter 10, page 110 of Tanner’s book) don’t call on the relatively rare experts in de-identification to do things right, and therefore fall into the category of unprofessional and poor de-identification.

Tanner accurately pinpoints specific vulnerabilities in patient data, such as the inclusion of genetic information (Chapter 9, page 96). A couple of companies promise de-identified genetic data (Chapter 12, page 130, and Conclusion, page 162), which all the experts agree is impossible due to the wide availability of identified genomes out in the field for comparison (Conclusion, page 162).

Tanner has come down on the side of easy re-identification, having done research in many unconventional areas lacking professional de-identification. However, he occasionally misses a nuance, as when describing the re-identification of people in the Personal Genome Project (Chapter 8, page 92). The PGP is a uniquely idealistic initiative. People who join this project relinquish interest in anonymity (Chapter 9, page 96), declaring their willingness to risk identification in pursuit of the greater good of finding new cures.

In the US, no legal requirement for anonymization interferes with selling personal data collected on social media sites, from retailers, from fitness devices, or from genetic testing labs. For most brokers, no ethical barriers to selling data exist either, although Apple HealthKit bars it (Chapter 14, page 155). So more and more data about our health is circulating widely.

With all these data sets floating around, some supposedly anonymized, some tightly tied to your identity, is anonymization dead? Every anonymized data set already contains a few individuals who can be theoretically re-identified; determining this number is part of the technical process of de-identification. Will more and more of us fall into this category as time goes on, victims of advanced data mining and the “mosaic effect” (combining records from different data sets)? This is a distinct possibility for the future, but in the present, there are no examples of re-identifying data that is anonymized properly, the word “properly” being all-important here. (The authors of Anonymizing Health Data talk of defensible anonymization, meaning you can show you used research-vetted processes.) Even Latanya Sweeney, whom Tanner tries to portray in Chapter 9 as a relentless attacker who strips away the protections of supposedly de-identified data, believes that data can be shared safely and anonymously.
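
To make the idea of counting theoretically re-identifiable individuals concrete, here is a minimal sketch in Python. It is not the method described in Anonymizing Health Data or in Tanner’s book; it is a simplified check in the style of k-anonymity, and the records, the field names, and the choice of quasi-identifiers (age band, three-digit ZIP prefix, sex) are all hypothetical. A record counts as at risk when fewer than k records share its combination of quasi-identifiers.

```python
from collections import Counter

# Hypothetical toy data set; field names and values are illustrative assumptions.
records = [
    {"age_band": "30-39", "zip3": "021", "sex": "F", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "021", "sex": "F", "diagnosis": "diabetes"},
    {"age_band": "30-39", "zip3": "021", "sex": "F", "diagnosis": "migraine"},
    {"age_band": "40-49", "zip3": "021", "sex": "M", "diagnosis": "asthma"},
    {"age_band": "40-49", "zip3": "021", "sex": "M", "diagnosis": "hypertension"},
    {"age_band": "70-79", "zip3": "945", "sex": "F", "diagnosis": "arthritis"},
]

# Assumed quasi-identifiers: fields an outsider might match against another data set.
QUASI_IDENTIFIERS = ("age_band", "zip3", "sex")


def at_risk_records(rows, quasi_identifiers, k):
    """Return the rows whose quasi-identifier combination is shared by fewer than k rows."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    class_sizes = Counter(key(row) for row in rows)
    return [row for row in rows if class_sizes[key(row)] < k]


if __name__ == "__main__":
    for k in (2, 3):
        risky = at_risk_records(records, QUASI_IDENTIFIERS, k)
        print(f"k={k}: {len(risky)} of {len(records)} records are at risk")
```

A professional de-identification effort goes well beyond a count like this, weighing who might attack the data and which outside data sets could be combined with it, but the count shows why the number of exposed individuals is something that can be measured rather than guessed.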

To address people’s fretting over anonymization, I invoke the analogy of encryption. We know that our secret keys can be broken, given enough computing power. Over the decades, as Moore’s Law and the growth of large computing clusters have increased computing power, the recommended size of keys has also grown. But someday, someone will assemble the power (or find a new algorithm) that cracks our keys. We know this, yet we haven’t stopped using encryption. Why give up the benefits of sharing anonymized data, then? What hurts us is the illegal data breaches that happen on average more than once a day, not the hypothetical re-identification of patients.

To me, the more pressing question is what the data is being used for. No technology can be assessed outside of its economic and social context.

Almighty capitalism

One lesson I take from the creation of a patient data market, but which Tanner doesn’t discuss, is its existence as a side effect of high costs and large inefficiencies in health care generally. In countries that put more controls on doctors’ leeway to order drugs, tests, and other treatments, there is less wiggle room for the marketing of unnecessary or ineffective products.

Tanner does touch on the tendency of the data broker market toward monopoly or oligopoly. Once a company such as IMS Health builds up an enormous historical record, competing is hard. Although Tanner does not explore the effect of size on costs, it is reasonable to expect that low competition fosters padding in the prices of data.

Thus, I believe the inflated health care market leaves lots of room for marketing, and generally props up the companies selling data. The use of data for marketing may actually hinder its use for research, because marketers are willing to pay so much more than research facilities (Conclusion, pages 163-164).

Not everybody sells the data they collect. In Chapter 13, Tanner documents a complicated spectrum for anonymized data, ranging from unpublicized sales to requiring patient consent to forgoing all data sales (for instance, footnote 6 to Chapter 13 lists claims by and Surescripts not to sell patient information). Tenuous as trust in reputation may seem, it does offer some protection to patients. Companies that want to be reputable make sure not to re-identify individual patients (Chapter 7, page 72, Chapter 9, pages 88-90, and Chapter 9, page 99). But data is so valuable that even companies reluctant to enter that market struggle with that decision.

The medical field has also pushed data collectors to make data into a market for all comers. The popular online EHR, Practice Fusion, began with a stable business model offering its service for a monthly fee (Chapter 13, page 140). But it couldn’t persuade doctors to use the service until it moved to an advertising and data-sharing model, giving away the service supposedly for free. The American Medical Association, characteristically, has also found a way to extract profit from sale of patient data, and therefore has colluded in marketing to doctors (Chapter 5, page 41, and Chapter 6, page 54).

Thus, a Medivo executive makes a good argument (Chapter 13, page 147) that the medical field benefits from research without paying for the dissemination of data that makes research possible. Until doctors pony up for this effort, another source of funds has to support the collection and research use of data. And if you believe that valuable research insights come from this data (Chapter 14, page 154, and Conclusion, page 166), you are likely to develop some appreciation for the market they have created. Another obvious option is government support for the collection and provision of data for research, as is done in Britain and some Nordic countries, and to a lesser extent in the US (Chapter 14, pages 158-159).

But another common claim, aired in this book by a Cerner executive (Chapter 13, page 143), is that giving health data to marketers reduces costs across the system, much as supermarkets grant discounts to shoppers willing to have their purchases tracked. I am not convinced that costs are reduced in either case. In the case of supermarkets, their discounts may persuade shoppers to spend more money on expensive items than they would have otherwise. In health care, the data goes to very questionable practices. These become the topic of the last part of this article.

About the author

Andy Oram

Andy Oram writes and edits documents about many aspects of computing, ranging in size from blog postings to full-length books. Topics cover a wide range of computer technologies: data science and machine learning, programming languages, Web performance, Internet of Things, databases, free and open source software, and more. His editorial output at O'Reilly Media included the first books ever published commercially in the United States on Linux, the 2001 title Peer-to-Peer (frequently cited in connection with those technologies), and the 2007 title Beautiful Code. He is a regular correspondent on health IT and health policy for He also contributes to other publications about policy issues related to the Internet and about trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business.