De-Identification of Data in Healthcare

January 14, 2015

John Lynn

Today I had a chance to sit down with Khaled El Emam, PhD, CEO and Founder of Privacy Analytics, to talk about healthcare data and how that data gets de-identified. Data is at the center of the future of healthcare IT, and so I was interested to hear Khaled’s perspectives on how to manage the privacy and security of that data when you’re working with massive healthcare data sets.

Khaled and I started off the conversation talking about whether healthcare data could indeed be de-identified or not. My favorite Patient Privacy Rights advocate, Deborah C. Peel, MD, has often made the case for why supposedly de-identified healthcare data is not really private or secure since it can be re-identified. So, I posed that question to Khaled and he suggested that Dr. Peel is only telling part of the story when she references stories where healthcare data has been re-identified.

Khaled makes the argument that in all of the cases where healthcare data has been re-identified, it was because those organizations did a poor job of de-identifying the data. He acknowledges that many healthcare organizations don’t do a good job de-identifying healthcare data, and so it is a major problem that Dr. Peel should be highlighting. However, just because one organization does a poor job de-identifying data, that doesn’t mean that proper de-identification of healthcare data should be thrown out.

This kind of reminds me of when people ask me if EHR software is secure. My answer is always that EHR software can be more secure than paper charts. However, it depends on how well the EHR vendor and the healthcare organization’s staff have done at implementing security procedures. When it’s done right, an EHR is very secure. When it’s done wrong, an EHR could be very insecure. Khaled is making a similar argument when it comes to de-identified health data.

Khaled did acknowledge that the risks are never going to be zero. However, if you de-identify healthcare data using proper techniques, the risks are small enough that they are similar to the risks we take every day with our healthcare data. I think this is an important point since the reality is that organizations are going to access and use healthcare data. That is not going to stop. I really don’t think there’s any debate on this. Therefore, our focus should be on minimizing the risks associated with this healthcare data sharing. Plus, we should hold organizations accountable for the healthcare data sharing they’re doing.
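
To make the idea of a "small enough" risk a bit more concrete, here is a minimal Python sketch of one common way to estimate re-identification risk: group records by their quasi-identifiers and treat each record's risk as one divided by the number of records that share its combination. This is only an illustration, not the method Khaled or Privacy Analytics uses. The field names (`birth_year`, `zip3`, `sex`), the sample records, and the `generalize` helper are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical quasi-identifiers: fields that are not direct identifiers
# but could be combined to single out a patient.
QUASI_IDENTIFIERS = ("birth_year", "zip3", "sex")


def reidentification_risk(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return (max_risk, average_risk) for a list of record dicts.

    Each record's risk is 1 / (number of records sharing the same
    quasi-identifier values), so a unique record has a risk of 1.0.
    """
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    per_record = [1.0 / class_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)


def generalize(record):
    """Coarsen a record by rounding the birth year down to its decade."""
    out = dict(record)
    out["birth_year"] = record["birth_year"] // 10 * 10
    return out


# Made-up sample data for illustration only.
records = [
    {"birth_year": 1961, "zip3": "841", "sex": "F"},
    {"birth_year": 1964, "zip3": "841", "sex": "F"},
    {"birth_year": 1972, "zip3": "841", "sex": "M"},
    {"birth_year": 1978, "zip3": "841", "sex": "M"},
]

raw_max, raw_avg = reidentification_risk(records)
gen_max, gen_avg = reidentification_risk([generalize(r) for r in records])

print(f"raw data:         max risk {raw_max:.2f}, average risk {raw_avg:.2f}")
print(f"generalized data: max risk {gen_max:.2f}, average risk {gen_avg:.2f}")
```

In this toy data set every raw record is unique, so the risk is as high as it can be. Rounding birth years to the decade puts each record in a group of two and cuts the measured risk in half. The risk never reaches zero, which matches Khaled's point, and a real assessment would also have to consider who receives the data and what other data they could link it to.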

Khaled also suggested that one of the challenges the healthcare industry faces with de-identifying healthcare data is that there’s a shortage of skilled professionals who know how to do it properly. I’d suggest that many who are faced with de-identifying data have the right intent, but likely lack the skills needed to ensure that the healthcare data de-identification is done properly. This isn’t a problem that will be solved easily, but it should improve as data security and privacy become more important.

What do you think of de-identification in healthcare? Is the way it’s being done a problem today? I see no end to the use of data in healthcare, and so we really need to make sure we’re de-identifying healthcare data properly.

About the author

John Lynn

John Lynn is the Founder of HealthcareScene.com, a network of leading Healthcare IT resources. The flagship blog, Healthcare IT Today, contains over 13,000 articles with over half of the articles written by John. These EMR and Healthcare IT related articles have been viewed over 20 million times.

John manages Healthcare IT Central, the leading career Health IT job board. He also organizes the first-of-its-kind conference and community focused on healthcare marketing, Healthcare and IT Marketing Conference, and a healthcare IT conference, EXPO.health, focused on practical healthcare IT innovation. John is an advisor to multiple healthcare IT companies. John is highly involved in social media, and in addition to his blogs can be found on Twitter: @techguy.

5 Comments

Jeanine says:

January 17, 2015 at 6:58 am

Thank you for the article. You mentioned that many lack the skills necessary to de-identify data. Do you have any good sources to point to as to the proper way to de-identify data?

John Lynn says:

January 21, 2015 at 7:20 pm

I’ll ask Khaled to respond with good sources. Although, one of the keys to doing it right is keeping up with the ever-changing industry. The resource that’s great today might not be good tomorrow. So, you have to create a network of great people and sources in order to keep up with what’s happening.

Khaled El Emam says:

January 26, 2015 at 12:52 pm

Hello, Jeanine

Thanks for your great question. There are several good sources for understanding how to de-identify data. I am not sure what level of detail you are looking for, so I have provided a few examples:

We have a couple of white papers that may help to give a general overview of de-identification methods: De-identification 101 and Perspectives on Health Data De-identification. De-identification 101 provides more of the background on de-identification and its methods, while Perspectives on Health Data De-identification will also explain de-identification methods and why masking is not enough for certain analyses.

If you are looking for a more detailed look at the methodology and base algorithms, I recommend reading the paper titled De-identification Methods for Open Health Data. This is a case study describing a global data mining competition. The purpose of the competition was to de-identify claims data that would be used to predict the number of days patients would be hospitalized the following year. I recommend reading the full report if you have the time, but my colleagues and I specifically talk about the methodology and algorithm in the section titled “Methods for the De-identification of the HHP Dataset.” Additionally, a risk-based methodology similar to that used by Privacy Analytics has recently been published in the Institute of Medicine’s report Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk.

Here are the links:

De-identification 101: http://bit.ly/1ATCunI
Perspectives on Health Data De-identification: http://bit.ly/1GByT63
De-identification Methods for Open Health Data: http://1.usa.gov/1zziYBx
Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk: http://bit.ly/15CJi1c

Please let me know if there is more you would like to know. I would be happy to answer any further questions.

John Lynn says:

January 26, 2015 at 2:06 pm

Thanks Khaled for providing the extra resources.

Jeanine says:

January 29, 2015 at 12:33 pm

Thank you both so much for the information provided! This will be a very helpful resource going forward. I’ll look forward to any other information coming from this site.
