Mayo Developing Tools To Extract Medical Data From All EMRs

Here’s some interesting and potentially important news. According to several recent news items, Mayo Clinic investigators are putting the finishing touches on a suite of tools that can identify and sort medical data contained in any electronic medical record.

Mayo investigators are working under a federal grant, the $60 million Strategic Health IT Advanced Research Projects (SHARP) program, which is funded by the Office of the National Coordinator for Health IT (ONC).

According to a piece in Government HealthIT, the researchers have used natural language processing tools to isolate health data from about 30 digital medical records of patients with diabetes.  So far, so good. When the extracted data is run through specialized systems developed with IBM’s Watson Research Center, the 30 patient records “explode” into 134 *billion* individual pieces of information, Government HealthIT reports.
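To make the idea concrete, here’s a toy sketch of what “isolating health data from free-text records” can look like in its simplest form. This is my own illustration using basic pattern matching, not Mayo’s actual pipeline; the note text, field names, and patterns are all invented for the example, and real clinical NLP is vastly more sophisticated.

```python
import re

# An invented free-text clinical note (illustrative only)
NOTE = (
    "Patient is a 58-year-old male with type 2 diabetes. "
    "HbA1c 8.2% on follow-up. Current meds: metformin 500 mg BID."
)

def extract_facts(note: str) -> dict:
    """Pull a few structured facts out of a free-text note with regexes."""
    facts = {}
    m = re.search(r"(\d+)-year-old", note)
    if m:
        facts["age"] = int(m.group(1))
    m = re.search(r"HbA1c\s+([\d.]+)%", note)
    if m:
        facts["hba1c_pct"] = float(m.group(1))
    m = re.search(r"type\s+(\d)\s+diabetes", note, re.IGNORECASE)
    if m:
        facts["diabetes_type"] = int(m.group(1))
    m = re.search(r"(\w+)\s+(\d+)\s*mg", note)
    if m:
        facts["medication"] = {"name": m.group(1), "dose_mg": int(m.group(2))}
    return facts

print(extract_facts(NOTE))
```

Each extracted fact could then be normalized, coded, and cross-referenced, which hints at how a handful of charts might multiply into a much larger store of derived data points.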

Unfortunately, none of the sources I’ve seen explains what specific data pieces make up this total, which sounds extremely high to me: spread across just 30 patients, it works out to roughly 4.5 billion pieces of information per patient. It’s hard for me to imagine that mundane details of care represent even multiple thousands of data points, unless you’re dealing with decades of care. (Perhaps the information involved includes the coding needed to extract the data. Readers, can you clarify this for me?)

While I can’t testify as to how realistic the Mayo researchers’ claims are, I have to think that if they’re on target, something very big is in the works.  After all, to date I’ve heard little of tools that can effectively, fluidly extract clinical data from an entire EMR-based patient chart regardless of format or data organization. Concepts like natural language processing are far from new, but it seems they haven’t been up to the job.

Not only would such capabilities allow virtually any set of institutions to share data, a giant leap in and of itself, they would also allow providers to do unprecedented levels of clinical analysis and ultimately improve care.

On the other hand, it’s not clear how practical this approach will be. If it only takes 30 records to generate that much data, just imagine how much data a single mid-sized hospital would have to wrangle!  If I’m reading things right, this technology may remain stuck at the research stage, as it’s hard to imagine most institutions could manage terabytes of new data.
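A quick back-of-envelope calculation shows why the volume worries me. The 134 billion and 30-record figures are from the Government HealthIT report; the 100,000-chart hospital figure is my own placeholder assumption, not a sourced number.

```python
# Figures reported by Government HealthIT
pieces = 134_000_000_000   # "individual pieces of information"
records = 30               # patient records processed

per_record = pieces / records  # roughly 4.5 billion pieces per chart

# Hypothetical mid-sized hospital chart count (my assumption, not sourced)
hospital_records = 100_000
hospital_pieces = per_record * hospital_records

print(f"{per_record:.2e} pieces per record")
print(f"{hospital_pieces:.2e} pieces for {hospital_records:,} charts")
```

At that rate even a modest chart archive balloons into hundreds of trillions of data points, which is why I suspect this stays in the research lab for now.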

Still, there’s clearly much to learn here. I’m eager to find out whether Mayo’s SHARP technology turns out to be usable in everyday clinical life.



About the author

Anne Zieger

Anne Zieger is a healthcare journalist who has written about the industry for 30 years. Her work has appeared in all of the leading healthcare industry publications, and she's served as editor in chief of several healthcare B2B sites.


  • I completely agree with your interest in figuring out where 134 billion pieces of data are coming from. Seems like the devil must lie in those details somewhere.

  • “When run through computing systems developed in partnership with IBM’s Watson Research Center, those 30 patient records explode into 134 billion individual pieces of information to be organized and stored.”

    125 GB of scanned files? The extracted structured data is likely much smaller. Perhaps there’s a pointer from each extracted canonical representation back to its location within each document image, which might be useful for quality assurance and post-editing, or, since it’s research, for a human to tweak the parsing/meaning-assignment software. Though that’s a guess.

  • Thanks for your comments folks!

    Dr. West:

    Yeah, wow, huh? I’d love to think that we have a real breakthrough on our hands. My gut feeling, as I noted, is that what we have is an impressive but not too practical research accomplishment. But you have to start somewhere.


    Thanks for the suggestions re: why we’re talking about such a large amount of data. Do you agree that given the volume of information, it’s unlikely that this research will be transferable to everyday providers just yet?

  • Not necessarily. Perhaps it’s like debugging a program. Sometimes there’s code or other resources used while developing software that is stripped out before it’s shipped. At this point we’re just guessing and speculating. I’m looking forward to the actual research reports.
