How Much Time Do You Spend Cleaning Your Data?

I recently came across this really great blog post talking about data scientists wasting their time. Here’s a quote from the article (which quotes the NYT):

“Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in [the] mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”
– Steve Lohr, NYT

Then, they have this extraordinary quote from Monica Rogati, VP for Data Science at Jawbone:

“Data scientists are forced to act more like data janitors than actual scientists.”

Every data scientist will tell you this is a problem. They spend far too much time cleaning up the data, and they all wish they could spend more of it actually looking at the data to find insights. I’ve seen this all over healthcare. In fact, I’d say we have more data janitors than data scientists in healthcare. Sadly, many healthcare data projects clean up the data and then don’t have any budget left to actually do something with it.

The solution to this problem is easy to write and much harder to do. The solution is to create an expectation and a culture of clean data in your organization.

I predict that over the next 5-10 years, healthcare data is going to become the backbone of healthcare decision making. Those organizations whose houses are a mess are going to be torn down and sold off to the hospital that’s kept a clean house. Is your hospital data clean or dirty?

About the author

John Lynn

John Lynn is the Founder of the, a network of leading Healthcare IT resources. The flagship blog, Healthcare IT Today, contains over 13,000 articles with over half of the articles written by John. These EMR and Healthcare IT related articles have been viewed over 20 million times.

John manages Healthcare IT Central, the leading career Health IT job board. He also organizes the first of its kind conference and community focused on healthcare marketing, Healthcare and IT Marketing Conference, and a healthcare IT conference,, focused on practical healthcare IT innovation. John is an advisor to multiple healthcare IT companies. John is highly involved in social media, and in addition to his blogs can be found on Twitter: @techguy.


  • This isn’t just true of healthcare, but is true of data science in general. No one tells that newly minted PhD that he/she will not have clean, complete time series data to work with. Plus, to really clean data you have to become somewhat familiar with the domain which in and of itself has a learning curve.

  • Anshuman,
    Very true. Clean data is a problem everywhere. Although in some other industries they’ve been cleaning data for a while.
