I am not a high-end technologist, so I’m not always up on the latest trends in database management, and the following may not be news to everyone who reads this. For me, though, the notion of a “data lake” is a new one, and I think it is a valuable idea that could hold a lot of promise for managing unruly healthcare data.
Here is a definition of the term from KDnuggets, a site that focuses on data mining, analytics, big data and data science:
A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured and unstructured data. The data structure and requirements are not defined until the data is needed.
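To make the "not defined until the data is needed" part concrete, here is a rough Python sketch of the idea as I understand it. It is only an illustration, not how any particular data lake product works: the sample records, the field names and the `read_heart_rates` function are all made up for this example. The point is that everything is stored as it arrived, and a structure is applied only when someone asks a specific question.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (all values invented).
RAW_LAKE = [
    '{"source": "wearable", "patient_id": "p1", "heart_rate": 72}',
    '{"source": "wearable", "patient_id": "p1", "heart_rate": 110}',
    '{"source": "telemedicine", "patient_id": "p2", "video_uri": "consults/0001.mp4"}',
    'free-text note: patient reports dizziness',
]


def read_heart_rates(raw_records):
    """Apply a structure at read time: pull out only wearable heart-rate readings."""
    rows = []
    for line in raw_records:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Unstructured content is skipped by this reader, but it stays in the lake.
            continue
        if not isinstance(record, dict):
            continue
        if record.get("source") == "wearable" and "heart_rate" in record:
            rows.append((record.get("patient_id"), int(record["heart_rate"])))
    return rows


if __name__ == "__main__":
    for patient_id, heart_rate in read_heart_rates(RAW_LAKE):
        print(patient_id, heart_rate)
```

Nothing was thrown away or reshaped when the records were stored. A different question, say about telemedicine video, would simply get its own reader over the same raw data.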
According to article author Tamara Dull, a data warehouse contains structured, processed data that is expensive to store, relies on a fixed configuration and is used by business professionals. A data lake, by contrast, contains everything from raw to structured data, is designed for low-cost storage (made possible largely because it relies on Hadoop, open source software that can be installed on cheaper commodity hardware), can be configured and reconfigured as needed and is typically used by data scientists. It’s no secret where she comes down as to which model is more exciting.
Perhaps the only downside she identifies with data lakes is that security may still be a concern, at least when compared to data warehouses. “Data warehouse technologies have been around for decades,” Dull notes. “Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake.” But this issue is likely to receive attention in the near future, as the big data industry has been tightly focused on security of late, and to her it’s not a question of if security will mature but when.
It doesn’t take much to envision how the data lake model might benefit healthcare organizations. After all, it may make sense to collect data whose uses we don’t yet fully understand. Wearables data comes to mind, as does video from telemedicine consults, but there are probably many other examples you could supply.
On the other hand, one could always counter that there’s not much value in storing data for which you don’t have an immediate use, and which isn’t structured for handy analysis by business analysts on the fly. So even if data lake technology is less costly than data warehousing, it may or may not be worth the investment.
For what it’s worth, I’d come down on the side of the data-lake boosters. Given the growing volume of heterogeneous data being generated by healthcare organizations, it’s worth asking whether deploying a healthcare data lake makes sense. With a data lake in place, healthcare leaders can at least catalog and store large volumes of un-normalized data, and that’s probably a good thing. After all, it seems inevitable that we will have to wring value out of such data at some point.