Newly Released Open Source Libraries for Health Analytics from Health Catalyst

I celebrate and try to report on each addition to the pool of open source resources for health care. Some, of course, are more significant than others, and I suspect the new libraries released by the Health Catalyst organization will prove to be one of the significant offerings. One can do a search for health care software on sites such as GitHub and turn up thousands of hits (of which many are probably under open and free licenses), but for a company with the reputation and accomplishments of Health Catalyst to open up the tools it has been using internally gives great legitimacy from the start.

According to Health Catalyst’s Director of Data Science Levi Thatcher, the main author of the project, these tools are tried and tested. Many of them are based on popular free software libraries in the general machine learning space: he mentions in particular the Python Scikit-learn library and the R language’s caret and data.table libraries. The contribution of Health Catalyst is to build on these general tools to produce libraries tailored for the needs of health care facilities, with their unique populations, workflows, and billing needs. The company has used the libraries to deploy models related to operational, financial, and clinical questions. Eventually, Thatcher says, most of Health Catalyst’s applications will use predictive analytics based on these libraries, and now other programmers can too.

Currently, Health Catalyst is providing libraries for R and Python. Moving them from internal projects to open source was not particularly difficult, according to Thatcher: the team mainly had to improve the documentation and broaden the range of usable data connections (ODBC and more). The packages can be installed in the manner common to free software projects in these languages. The documentation includes guidelines for submitting changes, so that an ecosystem of developers can build up around the software. When I asked about RESTful APIs, Thatcher answered, “We do plan on using RESTful APIs in our work, mainly as a way of integrating these tools with ETL processes.”

I asked Thatcher one more general question: why did Health Catalyst open the tools? What benefit do they derive as a company by giving away their creative work? Thatcher answers, “We want to elevate the industry and educate it about what’s possible, because a rising tide will lift all boats. With more data publicly available each year, I’m excited to see what new and open clinical or socio-economic datasets are used to optimize decisions related to health.”

About the author

Andy Oram

Andy Oram writes and edits documents about many aspects of computing, ranging in size from blog postings to full-length books. Topics cover a wide range of computer technologies: data science and machine learning, programming languages, Web performance, Internet of Things, databases, free and open source software, and more. His editorial output at O'Reilly Media included the first books ever published commercially in the United States on Linux, the 2001 title Peer-to-Peer (frequently cited in connection with those technologies), and the 2007 title Beautiful Code. He is a regular correspondent on health IT and health policy. He also contributes to other publications about policy issues related to the Internet and about trends affecting technical innovation and its effects on society. Print publications where his work has appeared include The Economist, Communications of the ACM, Copyright World, the Journal of Information Technology & Politics, Vanguardia Dossier, and Internet Law and Business.