De-identified Healthcare Data – Is It Really Unidentifiable?

There’s always been some really interesting discussion about EHR vendors selling the data from their EHR software. It turns out that many EHR vendors and other healthcare entities are selling de-identified healthcare data now, but I haven’t heard much public outcry about it. Is it because the public just doesn’t realize it’s happening, or because the public is OK with de-identified data being sold? I’ve heard many argue that they’re happy to have their de-identified data sold if it improves public health or if it gives them better service at a lower cost.

However, a study coming out of Canada has some interesting results when it comes to uniquely identifying people from de-identified data. The only data they used was date of birth, gender, and full postal code data. “When the full date of birth is used together with the full postal code, then approximately 97% of the population are unique with only one year of data.”
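
To make the idea of "unique in the data" concrete, here is a minimal sketch of how one might measure it. This is not the study's method or data: the `unique_fraction` helper, the field names, and the six toy records are all made up for illustration. It simply counts how many records have a combination of fields that nobody else in the data set shares.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Share of records whose combination of `keys` appears exactly once."""
    if not records:
        return 0.0
    # Build one tuple of quasi-identifier values per record.
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    unique = sum(1 for c in combos if counts[c] == 1)
    return unique / len(records)


# Hypothetical toy records (not real people, not from the study).
people = [
    {"dob": "1980-03-14", "gender": "F", "postal": "K1A 0B1"},
    {"dob": "1980-03-14", "gender": "F", "postal": "K2P 1L4"},
    {"dob": "1975-11-02", "gender": "M", "postal": "K1A 0B1"},
    {"dob": "1975-11-02", "gender": "M", "postal": "M5V 2T6"},
    {"dob": "1990-07-21", "gender": "F", "postal": "K1A 0B1"},
    {"dob": "1990-07-21", "gender": "F", "postal": "K1A 0B1"},
]

with_postal = unique_fraction(people, ["dob", "gender", "postal"])
without_postal = unique_fraction(people, ["dob", "gender"])

print(f"Unique with postal code:    {with_postal:.0%}")     # 67%
print(f"Unique without postal code: {without_postal:.0%}")  # 0%
```

In this toy data, adding postal code takes the share of unique records from none to four out of six. Real populations will look different, but the same count is what a uniqueness figure like the one quoted above is describing.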

One thing that concerns me a little about this study is that a full postal code is already a pretty specific identifier. Take out postal code and you’ll find very different results. Why? Because a lot of people share the same birthday and gender. However, the article does offer a reasonable suggestion based on the results of the study:

“Most people tend to think twice before reporting their year of birth [to protect their privacy] but this report forces us all to think about the combination or the totality of data we share,” said Dr. El Emam. “It calls out the urgency for more precise and quantitative approaches to measure the different ways in which individuals can be re-identified in databases – and for the general population to think about all of the pieces of personal information which in combination can erode their anonymity.”

To me, this is the key point. It’s not about creating unfounded fear and uncertainty, but about considering more fully how multiple pieces of personal information in de-identified patient data affect patient privacy.

About the author

John Lynn

John Lynn is the Founder of, a network of leading Healthcare IT resources. The flagship blog, Healthcare IT Today, contains over 13,000 articles with over half of the articles written by John. These EMR and Healthcare IT related articles have been viewed over 20 million times.

John manages Healthcare IT Central, the leading career Health IT job board. He also organizes the first of its kind conference and community focused on healthcare marketing, Healthcare and IT Marketing Conference, and a healthcare IT conference,, focused on practical healthcare IT innovation. John is an advisor to multiple healthcare IT companies. John is highly involved in social media, and in addition to his blogs can be found on Twitter: @techguy.


  • Well, if an academic study can find out identities from de-identified records, you can bet insurance underwriters certainly can.

  • Here’s something unique about electronic dental records. Even if someone tracked down the owners, who cares?

  • D. Kellus Pruitt,
    I thought you were the security and privacy nazi or am I remembering wrong?

    Turns out, the same is generally true for patient records too. Although, there are a few cases where it does matter. In fact, in a few cases it matters a lot. The most common is the insurance companies abusing you based on your medical history.

  • That’s me, John. The privacy nazi. I like that, actually.

    I think you agree that without privacy, EHRs will never be trusted by doctors or patients. And if EHRs aren’t trusted – and it’s pretty obvious that they are not – will they ever reach their potential?

    I’ve always maintained that because of the fundamental difference in content, dental records without PHI and health histories can be shared freely virtually without risk. The same can never be said about even de-identified medical records.

    It’s the “one size fits all” nazis who are stubbornly holding dentistry back from interoperable Practice-Based Research. If HHS could think laterally for once, solutions to painful and even life-threatening diseases of dental origin just might become available long before EMRs safely give anything back to society.

    If PHI were removed from EDRs – including medical histories – it would virtually eliminate all liability. Such risk-free interoperability will never happen with medical records.

  • Leon Rodriguez, formerly chief of staff and deputy assistant attorney general for the Department of Justice Civil Rights Division, became director of HHS’ Office for Civil Rights in early September.

    In an article posted on just now, Howard Anderson says Rodriguez emphasizes that privacy and security are issues that “really matter to me personally and really matter to the secretary [of HHS]. So we’re going to be serious about our enforcement work and no less serious about making sure that we educate everybody out there, both covered entities and patients, about what the requirements are for health information privacy.”

    It doesn’t look like I’m the only privacy nazi.

  • Oh, there are plenty of privacy nazis out there. You’re far from alone in this. I like them because they keep us honest about it.

  • I am interested in knowing how readers answer John’s question re position on use of de-identified data. My guess is that people don’t know it’s going on and will object to it happening in principle.

    Securing PHI feels a lot like Y2K. No doubt breaches occur, and, when they do, they are certainly costly for the offending HCO, but how many examples are there of leaked information being used to harm someone? Seems like the same proscriptions vs. extortion, blackmail, and libel would prevent individuals from using illegally obtained PHI to harm patients.

    In fact, the odds that there is a Person A who wishes to harm Person B AND who somehow comes up with Person B’s sensitive PHI AND is able to use it to harm Person B without Person B having ample legal recourse against Person A are hopelessly LONG. Breaches of thousands/hundreds of thousands/millions of records are too large and unspecific to be “used” for nefarious purposes.

    We need to secure PHI, but we are hoisting ourselves on our own petards if we let legitimate concerns about the use of patient data block or slow our adoption of EMRs and HCIT for ACOs and PCMHs. Just as there are real benefits associated with use of de-id’ed patient data, there are (significant, hidden) costs with not sharing health data.

    The irony here is that the most common, undeniably harmful use of sensitive PHI has been to deny coverage to patients with pre-existing conditions. Kind of makes sense. It is, after all, health information.

  • ip-doctor,
    I love the comparison to Y2K. I’m going to quote you in a future post. It’s really quite an interesting comparison.

  • As you may or may not know, HIPAA laws render the scenario detailed in the Canadian study exceptionally unlikely in the U.S. OptumInsight has deep expertise in working with PHI and de-identified data, and so we recently published a practical guide to help physicians better understand the issues and take steps to secure data used in their practices. Here is a link:

    Following is a relevant excerpt from that Guide, detailing HIPAA regulations around de-identified data:

    “Like PHI, the use of de-identified information is also governed by HIPAA. De-identified information is “health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual.” De-identified information is not PHI under the HIPAA Privacy Rule. As a result, de-identified information may be shared without restriction…

    … “HIPAA defines two routes by which a Covered Entity may properly de-identify data:
    (1) the “Safe Harbor” method, and (2) through professional statistical analysis.

    “Using the Safe Harbor method, 18 specific identifiers must be removed, including all geographic subdivisions smaller than a state; all elements of dates, except the year; telephone numbers; fax numbers; electronic mail addresses; social security numbers; medical record numbers; health plan beneficiary numbers; account numbers; certificate/license numbers; vehicle identifiers and serial numbers, including license plate numbers; device identifiers and serial numbers; web addresses; IP address numbers; biometric identifiers, including finger- and voiceprints; full-face photographic images and any comparable images; and any other unique identifying number, characteristic, or code.

    “Removing all 18 of these identifiers, as specified by the Safe Harbor method, makes the data much less useful for analyzing health trends over time or for surveillance of health conditions, such as influenza outbreaks or cancer clusters that occur in smaller geographic areas.

    “The second method of de-identification permitted under HIPAA requires that a qualified statistician determine and certify that the likelihood of conclusively re-identifying any single person in the data set is “very small” (less than 4 percent) using the information alone or in combination with other reasonably available information. These findings must be certified by a statistician who has appropriate knowledge and experience of generally accepted scientific principles and methods for rendering information not identifiable.”

    I hope this is helpful.

  • Kyle,
    That was incredibly good information. Glad that HIPAA is covering that. I wonder how many people know about the last part about getting a statistician to find that the likelihood of re-identifying someone is so small. I have a feeling many are missing that part in their de-identification process.
