Presentation
codefinder: optimising Stata for the analysis of large, routinely collected healthcare data
Jonathan Batty, Marlous Hall
13 September 2024
Session
Routinely collected healthcare data (including electronic healthcare records and administrative data) are increasingly available at the whole-population scale, and may span decades of data collection. These data may be analysed as part of clinical, pharmacoepidemiologic and health services research, producing insights that improve future clinical care. However, the analysis of healthcare data on this scale presents a number of unique challenges. These include the storage of diagnosis, medication and procedure codes using a number of discordant systems (including ICD-9 and 10, SNOMED-CT, Read codes, etc.) and the inherently relational nature of the data (each patient has multiple clinical contacts, during which multiple codes may be recorded). Pre-processing and analysing these data using optimised methods has a number of benefits, including minimisation of computational requirements, analytic time, carbon footprint and cost.
We will focus on one of the main issues faced by the healthcare data analyst: how to most efficiently collapse multiple, disparate diagnosis codes (stored as strings across a number of variables) into a discrete disease entity, using a pre-defined code list. A number of approaches (including the use of Boolean logic, the inlist function, string functions and regular expressions) will be sequentially benchmarked in a large, real-world healthcare dataset (n = 192 million hospitalisation episodes during a 12-year period; approximately 1 terabyte of data). The time and space complexity of each approach (in addition to its carbon footprint), will be reported. The most efficient strategy has been implemented into our newly-developed Stata command: codefinder, which will be discussed.
Validate your login
Sign In
Create New Account