De-Identification of Data in Healthcare

Today I had a chance to sit down with Khaled El Emam, PhD, CEO and Founder of Privacy Analytics, to talk about healthcare data and the de-identification of that healthcare data. Data is at the center of the future of healthcare IT and so I was interested to hear Khaled’s perspectives on how to manage the privacy and security of that data when you’re working with massive healthcare data sets.

Khaled and I started off the conversation talking about whether healthcare data could indeed be de-identified or not. My favorite Patient Privacy Rights advocate, Deborah C. Peel, MD, has often made the case for why supposedly de-identified healthcare data is not really private or secure since it can be re-identified. So, I posed that question to Khaled and he suggested that Dr. Peel is only telling part of the story when she references stories where healthcare data has been re-identified.

Khaled makes the argument that in all of the cases where healthcare data has been re-identified, it was because those organizations did a poor job of de-identifying the data. He acknowledges that many healthcare organizations don’t do a good job de-identifying healthcare data and so it is a major problem that Dr. Peel should be highlighting. However, just because one organization does a poor job de-identifying data, that doesn’t mean that proper de-identification of healthcare data should be thrown out.

This kind of reminds me of when people ask me if EHR software is secure. My answer is always that EHR software can be more secure than paper charts. However, it depends on how well the EHR vendor and the healthcare organization’s staff have done at implementing security procedures. When it’s done right, an EHR is very secure. When it’s done wrong, an EHR could be very insecure. Khaled is making a similar argument when it comes to de-identified health data.

Khaled did acknowledge that the risks are never going to be zero. However, if you de-identify healthcare data using proper techniques, the risks are small enough that they are similar to the risks we take every day with our healthcare data. I think this is an important point since the reality is that organizations are going to access and use healthcare data. That is not going to stop. I really don’t think there’s any debate on this. Therefore, our focus should be on minimizing the risks associated with this healthcare data sharing. Plus, we should hold organizations accountable for the healthcare data sharing they’re doing.
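To make the idea of a small but non-zero risk a little more concrete, here is a minimal Python sketch of one common way to put a number on it: group records by their quasi-identifiers (fields like age, ZIP code, and sex that could be combined to single someone out) and treat each record's risk as 1 divided by the number of records that share its values. This is only an illustration under my own assumptions, not the methodology Khaled or Privacy Analytics uses. The field names, the sample records, and the `generalize` helper are all made up for the example, and a real risk assessment looks at far more than this.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Return (max_risk, avg_risk) where a record's risk is 1 / size of
    the group of records sharing its quasi-identifier values."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1 / class_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)


def generalize(record):
    """Hypothetical generalization: bucket age by decade, keep 3-digit ZIP."""
    out = dict(record)
    out["age"] = f"{(record['age'] // 10) * 10}s"  # e.g. 47 becomes "40s"
    out["zip"] = record["zip"][:3]
    return out


# Made-up sample records, for illustration only.
records = [
    {"age": 34, "zip": "84101", "sex": "F"},
    {"age": 37, "zip": "84105", "sex": "F"},
    {"age": 52, "zip": "84302", "sex": "M"},
    {"age": 58, "zip": "84321", "sex": "M"},
]
quasi_identifiers = ["age", "zip", "sex"]

raw_max, raw_avg = reidentification_risk(records, quasi_identifiers)
generalized = [generalize(r) for r in records]
gen_max, gen_avg = reidentification_risk(generalized, quasi_identifiers)

print(f"raw data:         max risk {raw_max:.2f}, average risk {raw_avg:.2f}")
print(f"generalized data: max risk {gen_max:.2f}, average risk {gen_avg:.2f}")
```

In this toy data every raw record is unique, so the measured risk is as high as it can be. After generalizing, each record shares its values with one other record and the measured risk is cut in half. The risk never reaches zero, but it can be measured and pushed down to a level an organization has decided is acceptable.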

Khaled also suggested that one of the challenges the healthcare industry faces with de-identifying healthcare data is that there’s a shortage of skilled professionals who know how to do it properly. I’d suggest that many who are faced with de-identifying data have the right intent, but likely lack the skills needed to ensure that the healthcare data de-identification is done properly. This isn’t a problem that will be solved easily, but it should improve as data security and privacy become more important.

What do you think of de-identification in healthcare? Is the way it’s being done a problem today? I see no end to the use of data in healthcare, and so we really need to make sure we’re de-identifying healthcare data properly.

About the author

John Lynn


John Lynn is the Founder of, a network of leading Healthcare IT resources. The flagship blog, Healthcare IT Today, contains over 13,000 articles with over half of the articles written by John. These EMR and Healthcare IT related articles have been viewed over 20 million times.

John manages Healthcare IT Central, the leading career Health IT job board. He also organizes the first of its kind conference and community focused on healthcare marketing, Healthcare and IT Marketing Conference, and a healthcare IT conference,, focused on practical healthcare IT innovation. John is an advisor to multiple healthcare IT companies. John is highly involved in social media, and in addition to his blogs can be found on Twitter: @techguy.


  • Thank you for the article. You mentioned that many lack the skills necessary to de-identify data. Do you have any good sources to point to as to the proper way to de-identify data?

  • I’ll ask Khaled to respond with good sources. Although, one of the keys to doing it right is keeping up with the ever changing industry. The resource that’s great today might not be good tomorrow. So, you have to create a network of great people and sources in order to keep up with what’s happening.

  • Hello, Jeanine

    Thanks for your great question. There are several good sources for understanding how to de-identify data. I am not sure what level of detail you are looking for, so I have provided a few examples:

    We have a couple of white papers that may help to give a general overview of de-identification methods: De-identification 101 and Perspectives on Health Data De-identification. De-identification 101 provides more of the background on de-identification and its methods, while Perspectives on Health Data De-identification also explains de-identification methods and why masking is not enough for certain analyses.

    If you are looking for a more detailed look at the methodology and base algorithms, I recommend reading this paper titled De-identification Methods for Open Health Data. This is a case study describing a global data mining competition. The purpose of the competition was to de-identify claims data that would be used to predict the number of days patients would be hospitalized the following year. I recommend reading the full report if you have the time, but my colleagues and I specifically talk about the methodology and algorithm in the section titled, “Methods for the De-identification of the HHP Dataset.” Additionally, a risk-based methodology similar to that used by Privacy Analytics has recently been published in the Institute of Medicine’s report Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk.

    Here are the links:

    De-identification 101:
    Perspectives on Health Data De-identification:
    De-identification Methods for Open Health Data:
    Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk:

    Please let me know if there is more you would like to know. I would be happy to answer any further questions.

  • Thank you both so much for the information provided! This will be a very helpful resource going forward. I’ll look forward to any other information coming from this site.
