Open Source Tool Offers “Synthetic” Patients For Hospital Big Data Projects

As readers will know, using big data in healthcare comes with a host of security and privacy problems, many of which are thorny.

For one thing, the more patient data you accumulate, the bigger the disaster if and when the database is hacked. Another important concern is that if you decide to share the data, there’s always the chance that your partner will use it inappropriately, violating the terms of whatever consent to disclose you obtained. Then there’s the issue of working with incomplete or corrupted data which, if extensive enough, can interfere with your analysis or even lead to inaccurate results.

But now, there may be a realistic alternative, one which allows you to experiment with big data models without taking all of these risks. A unique software project is underway which gives healthcare organizations a chance to scope out big data projects without using real patient data.

The software, Synthea, is an open source synthetic patient generator that models the medical history of synthetic patients. It was built by The MITRE Corporation, a not-for-profit research and development organization sponsored by the U.S. federal government. (This page offers a list of other open source projects in which MITRE is or has been involved.)

Synthea is built on a Generic Module Framework which allows it to model varied diseases and conditions that play a role in the medical history of these patients. The Synthea modules create synthetic patients using not only clinical data, but also real-world statistics collected by agencies like the CDC and NIH. MITRE kicked off the project using models based on the top ten reasons patients see primary care physicians and the top ten conditions that shorten years of life.
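To make the module idea a bit more concrete, here is a minimal sketch of how a disease module could be expressed as a set of states with weighted transitions. This is not Synthea's actual module format or API; the name `TOY_MODULE`, the state names, and every probability below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy module: each state maps to a list of (next_state, weight)
# pairs. The states and weights are invented for illustration and are NOT
# taken from Synthea, the CDC, or the NIH.
TOY_MODULE = {
    "Initial": [("Healthy", 1.0)],
    "Healthy": [("Onset", 0.1), ("Healthy", 0.9)],
    "Onset": [("Diagnosed", 1.0)],
    "Diagnosed": [("Treated", 0.8), ("Diagnosed", 0.2)],
    "Treated": [],  # terminal state: no outgoing transitions
}


def step(module, state, rng):
    """Pick the next state using the weights attached to the current state."""
    transitions = module[state]
    targets = [target for target, _ in transitions]
    weights = [weight for _, weight in transitions]
    return rng.choices(targets, weights=weights, k=1)[0]


def run_module(module, rng, max_steps=100):
    """Walk the module from 'Initial' until a terminal state or max_steps."""
    state = "Initial"
    path = [state]
    steps = 0
    while module[state] and steps < max_steps:
        state = step(module, state, rng)
        path.append(state)
        steps += 1
    return path


if __name__ == "__main__":
    # A seeded generator makes the walk repeatable.
    print(run_module(TOY_MODULE, random.Random(0)))
```

In a sketch like this, the weights are where real-world statistics would plug in: swapping an invented 0.1 for a published incidence rate changes how often synthetic patients develop the condition without touching the rest of the code.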

Its makers were so thorough that each patient’s medical experiences are simulated independently from their “birth” to the present day. Each profile contains a full medical history, including medication lists, allergies, physician encounters and social determinants of health. The data can be shared using C-CDA, HL7 FHIR, CSV and other formats.
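As a rough illustration of the birth-to-present approach, the sketch below steps one synthetic patient through life a year at a time and writes the resulting events out as CSV. It is a simplified assumption of how such a simulation might look, not MITRE's code: the function names, the `ANNUAL_ONSET` table, the column names, and the probabilities are all hypothetical.

```python
import csv
import io
import random

# Hypothetical yearly onset probabilities, invented for illustration only.
ANNUAL_ONSET = {"hypertension": 0.02, "diabetes": 0.01}

FIELDS = ["patient_id", "year", "age", "condition"]


def simulate_patient(patient_id, birth_year, current_year, rng):
    """Simulate one patient from birth to the current year, one year per step."""
    events = []
    active = set()  # conditions this patient has already developed
    for year in range(birth_year, current_year + 1):
        for condition, probability in ANNUAL_ONSET.items():
            if condition not in active and rng.random() < probability:
                active.add(condition)
                events.append({
                    "patient_id": patient_id,
                    "year": year,
                    "age": year - birth_year,
                    "condition": condition,
                })
    return events


def to_csv(events):
    """Serialize events to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


if __name__ == "__main__":
    all_events = []
    for patient_id in range(1, 4):
        # Each patient gets its own seeded generator, so every history is
        # simulated independently of the others.
        rng = random.Random(patient_id)
        all_events.extend(simulate_patient(patient_id, 1950, 2020, rng))
    print(to_csv(all_events))
```

Giving each patient a separate random generator is one simple way to keep the histories independent and reproducible; a fuller simulation would also need encounters, medications and the other record types the article lists.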

On its site, MITRE says its intent in creating Synthea is to provide “high-quality, synthetic, realistic but not real patient data and associated health records covering every aspect of healthcare.” As MITRE notes, having a batch of synthetic patient data on hand can be pretty, well, handy in evaluating new treatment models, care management systems, clinical support tools and more. It’s also a convenient way to predict the impact of public health decisions quickly.

This is such a good idea that I’m surprised nobody else has done something comparable. (Well, at least as far as I know no one has.) Not only that, it’s great to see the software being made available freely via the open source distribution model.

Of course, in the final analysis, healthcare organizations want to work with their own data, not synthetic substitutes. But at least in some cases, Synthea may offer hospitals and health systems a nice head start.

About the author

Anne Zieger


Anne Zieger is a healthcare journalist who has written about the industry for 30 years. Her work has appeared in all of the leading healthcare industry publications, and she's served as editor in chief of several healthcare B2B sites.

1 Comment

  • Very cool tool. I wonder if the synthetic patients also simulate the errors that a provider will make in documenting the patient. Does it make some men women and things like that which we commonly see happen in health data? Seems like that would be a necessary feature to properly test an analytics tool too, no?
