The Challenge of Unstructured Data in Healthcare

This semester I’m taking two classes, one of which I’ve been looking forward to since I started the HI Master’s program at UIC. It’s an independent study course in which I’ll write a 30-page paper on a health informatics topic of my choice. My independent study adviser is Andrew D. Boyd, MD, Assistant Professor in UIC’s Department of Biomedical and Health Information Sciences. I’m just getting started fleshing out the focus, outline and references for the paper.

I’ve been interested in the topic of unstructured textual data in healthcare for a few years. Because my background is in business analysis and analytics, I am quite accustomed to working with structured data in relational databases. While this data has challenges in itself, like data quality, completeness, availability and consistency, it is much easier to work with than unstructured textual data.

From the first time I worked with health data within an electronic medical record (EMR), I realized how much of this information is not in a structured, relational format. Text or natural language data resides in many fields within an EMR, including multiple notes fields such as physician notes, nursing notes, surgical notes, radiology notes, pathology reports, admission notes, etc. All of these fields may hold valuable information about the patient, including diagnosis, history, family history, complaints, statistics, and opinions. This information may not be available elsewhere in the EMR.

There are many challenges to creating analytics from the textual data collected within an EMR or other healthcare system. When fields are not structured, users can write in any manner they like. They might use full English sentences, abbreviate, use templates with headings and outlines, use medical jargon, or follow their own personal method of documenting information. In industry we talk about unstructured data or big data. Often big data includes all sorts of additional data sources like social media, emails, blogs, or intranet systems. In my paper, I’ll be focusing on EMRs and other healthcare systems. Within the academic community this work is called natural language processing. Whatever the label, the same problems arise in translating text into a form that can be analyzed by computer software.

The benefit of being able to analyze unstructured data along with structured data is that the whole can provide a much fuller picture of the patient’s history, diagnosis, treatment, and outcome. If details about the pathology of a patient’s tumor are recorded only within the pathology note, then analysis cannot include such things as genomics, margin reports, laterality, size, shape or perhaps even the stage of the tumor. Including that information along with trends for an individual patient or an entire population could be extremely valuable. Additionally, combining that information with data about the treatment and outcome of a patient, possibly available within textual notes fields, can provide a rich field for research and, in turn, results-driven treatment.
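To make the idea concrete, here is a minimal sketch in Python of pulling a few of those details (laterality, size, margin status) out of a pathology note with simple pattern matching. The note text, the field names, and the `extract_fields` helper are all my own invention for illustration; they do not come from any real EMR or product. Simple patterns like these break down quickly on abbreviations, negation, and personal writing styles, which is exactly why more capable natural language processing is needed.

```python
import re

# Hypothetical pathology note, invented purely for illustration.
NOTE = (
    "Specimen: left breast, lumpectomy. "
    "Invasive ductal carcinoma, 2.3 cm in greatest dimension. "
    "Margins negative."
)

# Very simple patterns; real notes vary far more than these allow for.
LATERALITY = re.compile(r"\b(left|right|bilateral)\b", re.IGNORECASE)
SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)
MARGINS = re.compile(r"\bmargins?\s+(?:are\s+)?(negative|positive)\b", re.IGNORECASE)


def extract_fields(note):
    """Return a dict of structured fields found in a free-text note.

    Only the first match of each pattern is used; a missing field is None.
    """
    laterality = LATERALITY.search(note)
    size = SIZE.search(note)
    margins = MARGINS.search(note)

    size_cm = None
    if size:
        value = float(size.group(1))
        # Normalize millimeters to centimeters so sizes are comparable.
        size_cm = value / 10 if size.group(2).lower() == "mm" else value

    return {
        "laterality": laterality.group(1).lower() if laterality else None,
        "size_cm": size_cm,
        "margins": margins.group(1).lower() if margins else None,
    }


if __name__ == "__main__":
    print(extract_fields(NOTE))
```

Once fields like these sit in columns next to the structured data, they can be queried and trended like any other relational data. The hard part is making the extraction reliable across the many ways people actually write.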

So you can see why, as an informatician, I’m excited about the possibilities of incorporating data that has not yet been easily accessible for analysis. Academic research on natural language processing has been under way for years, but few commercial products are yet pursuing this opportunity. I’m looking forward to becoming more knowledgeable about this work and finishing my paper. I’m hoping I’ll be able to leverage this as I continue my career.

About the author


Yvette Desmarais

Yvette is a Master’s student in the University of Illinois at Chicago’s (UIC) Health Informatics Program. She works for Hewlett-Packard as a consulting project manager in their Information Management and Analytics group, where she focuses on Health and Life Science analytics and data warehousing.