Addressing Missing Data in a Study on Predictors of Adherence to HIV Treatment among Young Women in Kisumu, Kenya (2023 DSSG)
Authors: RJ Southward and Lila Mack
Summary:
This summer, our team worked on the analysis of a study that aimed to better understand the determinants of adherence to antiretroviral treatment (ART) for HIV among a group of 309 adolescent girls and young women living in and around Kisumu, Kenya.
Our graduate mentor, Jonathan Altamirano, and faculty advisor, Mike Baiocchi, helped orient us to the purpose and design of the initial study. They explained that the study's outcome variable, adherence to ART, was a function of a participant’s net viral load, and that, due to supply chain constraints surrounding COVID-19, viral load counts were gathered for only about 50% of the study participants. Our challenge for the summer was to help address this missing data on the outcome variable using imputation.
We started to assess the different variables and familiarize ourselves with the dataset … but quickly got off track. During our daily check-ins, we found some interesting columns of free-response doctors’ complaints and began discussing the host of possible analyses we could do. “This is cool,” we said, and our mentor wholeheartedly agreed, so we spent the next two weeks collaborating to thematically categorize the written complaints.
Thoroughly tired of the thematic analysis, we returned to the initial task at hand. The project had two major steps: understanding the factors that influence Missingness, and understanding the factors that influence Adherence. Missingness refers to whether or not a participant has a recorded viral load within the past year. For those with a recent viral load, Adherence refers to whether it was <200 (adherent) or ≥200 (non-adherent). To understand these two outcomes, we created descriptive tables, conducted bivariate analyses, and built a multivariate model one variable at a time to control for confounding. We arrived at the following set of variables, shown below in the context of both Missingness and Adherence:
We had some interesting results. For example, the variable indicating that a participant's mother was alive had a substantial protective effect against both missingness and non-adherence, more so than having a father alive. Physical partner violence, by contrast, had a negligible effect on missingness but the largest effect on non-adherence!
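For readers who want to see the two outcomes concretely, here is a minimal sketch of how Missingness and Adherence, as defined above, could be coded. It is an illustration only, not the study's code: the function names, the `as_of` reference date, and the reading of "within the past year" as 365 days are all assumptions. Only the 200 cutoff comes from the text.

```python
from datetime import date, timedelta
from typing import Optional

ADHERENCE_THRESHOLD = 200            # cutoff stated in the text
RECENT_WINDOW = timedelta(days=365)  # assumption: "past year" = 365 days


def is_missing(last_viral_load_date: Optional[date], as_of: date) -> bool:
    """True when no viral load was recorded within the past year."""
    if last_viral_load_date is None:
        return True
    return (as_of - last_viral_load_date) > RECENT_WINDOW


def is_adherent(viral_load: Optional[float],
                last_viral_load_date: Optional[date],
                as_of: date) -> Optional[bool]:
    """Adherence is only defined for participants with a recent viral load.

    Returns None when the outcome is missing, True when the viral load is
    below the threshold, and False otherwise.
    """
    if viral_load is None or is_missing(last_viral_load_date, as_of):
        return None
    return viral_load < ADHERENCE_THRESHOLD
```

Returning `None` rather than `False` for missing outcomes keeps "not adherent" and "not measured" separate, which matters once the missing values are imputed.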
With these findings, we began the process of imputing missing data. This process first involved creating 100,000 bootstrapped datasets of size 309. We then fit a logistic model with an outcome of non-adherence to each of the datasets, using the final set of selected variables as covariates, and averaged the coefficients across all datasets. The imputation model had limited predictive power and a high level of uncertainty, but we successfully implemented a pipeline that could be used in future iterations. Lila notes that while learning about the process of imputation she came across some prominent political scientists: Matthew Blackwell and Gary King. As a political science major herself, Lila was excited by the beauty and relevance of data science, as the same methodologies developed to address problems in political science are just as useful in the context of epidemiology!
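The bootstrap-and-average step described above can be sketched in plain Python. This is a hedged illustration, not the pipeline the team built: the function names are hypothetical, the logistic model is fit with a simple gradient ascent in place of a statistics library, and covariates are assumed to be 0/1 indicators or otherwise on comparable scales. It also assumes that participants with no recorded outcome (`None`) are dropped from each resample before fitting, which the text does not specify.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, steps=200, lr=0.5):
    """Fit a logistic regression by batch gradient ascent.

    X is a list of covariate rows and y a list of 0/1 outcomes.
    Returns [intercept, coef_1, ..., coef_p].
    """
    n = len(X)
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)  # prepend the intercept term
            err = yi - sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j, v in enumerate(row):
                grad[j] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def bootstrap_mean_coefficients(X, y, n_boot=100, seed=0):
    """Average logistic coefficients over bootstrap resamples.

    y may contain None for participants with no recorded outcome
    (assumption: those rows are excluded from each fit).
    """
    rng = random.Random(seed)
    n = len(X)
    totals = [0.0] * (len(X[0]) + 1)
    fitted = 0
    for _ in range(n_boot):
        # Resample n participants with replacement, keep observed outcomes.
        sample = [rng.randrange(n) for _ in range(n)]
        rows = [i for i in sample if y[i] is not None]
        ys = [y[i] for i in rows]
        if len(set(ys)) < 2:
            continue  # a one-class resample cannot be fit; skip it
        beta = fit_logistic([X[i] for i in rows], ys)
        totals = [t + b for t, b in zip(totals, beta)]
        fitted += 1
    if fitted == 0:
        raise ValueError("no resample contained both outcome classes")
    return [t / fitted for t in totals]


def predict_probability(beta, x):
    """Predicted probability of non-adherence for one covariate row."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))
```

With the averaged coefficients in hand, `predict_probability` gives a predicted probability of non-adherence for each participant whose outcome is missing. The default of 100 resamples only keeps the sketch fast; the analysis described above used 100,000.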