Date of Award

August 2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Biomedical and Health Informatics

First Advisor

Jake Luo

Abstract

Social determinants of health (SDoH) significantly impact health outcomes, including critical biomarkers such as glycemic control (HbA1c), blood pressure (BP), and low-density lipoprotein (LDL) cholesterol levels. Despite their importance, gaps persist in understanding the predictive role of SDoH in these biomarkers. This study aims to address these gaps by assessing machine learning (ML) models for predicting HbA1c, BP, and LDL levels. Leveraging data from electronic health records (EHRs) and employing techniques such as feature selection, algorithm selection, and hyperparameter tuning, the study seeks to create ML models that incorporate patient demographics and SDoH factors alongside patient comorbidity data. The project also emphasizes model interpretability through explainable artificial intelligence (XAI) techniques, enabling clinicians to understand and trust model predictions. By evaluating the robustness of both the data and the models, the research aims to provide actionable insights into the impact of SDoH on biomarkers, ultimately informing targeted interventions and personalized treatment plans to improve patient outcomes and promote health equity.
Methods: Deidentified electronic health records of adult patients aged 18 years and above, with complete SDoH documentation recorded between February 2021 and February 2024 during encounters at a tertiary academic medical center in Southeast Wisconsin, were extracted from the Clinical Research Data Warehouse (CRDW) database. The dataset comprised medical records of N=21,707 patients with HbA1c lab values, N=32,668 patients with recorded blood pressure, and N=23,493 patients with LDL lab values. The records included demographic information (age, gender, race/ethnicity, and financial class), a general adult risk score, comorbidities, and SDoH factors, taken from each patient's most recent complete SDoH record.
For each target, descriptive statistics were computed to visualize the data distribution. For preliminary analysis, ML models from the scikit-learn library (RandomForestClassifier, LogisticRegression, GaussianNB, KNeighborsClassifier) and from the XGBoost and CatBoost libraries (XGBClassifier, CatBoostClassifier) were chosen with default hyperparameters and combined with several encoders (OneHotEncoder, CatBoostEncoder, and GLLEncoder) and simple feature imputation methods. The CatBoost classifier performed best with default parameters, so its hyperparameters were tuned using automated techniques, together with data preprocessing steps that included imputation, scaling, handling outliers by Winsorization, encoding categorical features, handling class imbalance with the Synthetic Minority Over-sampling Technique (SMOTE), generating polynomial features, selecting features based on variance and univariate tests, and filtering correlated features. To aggregate performance across cross-validation folds, the minimum, maximum, and average values of each metric were computed. The hyperparameter-optimized classifier was evaluated on the test dataset using classification metrics: ROC-AUC score, precision, recall, F1-score, and the confusion matrix. For the hyperparameter-optimized models, SHAP (SHapley Additive exPlanations) values and feature importances were calculated to provide insight into feature contributions. Finally, to enhance the interpretability of model predictions, local interpretation analyses were conducted using SHAP, generating visual explanations such as SHAP waterfall plots for individual instances. All data extraction, transformation, and machine learning experimentation was carried out in R and Python.
Results indicate promising performance, particularly for CatBoost classifier models enhanced with Optuna hyperparameter optimization and SMOTE oversampling, which significantly improved predictions across all biomarkers, with average ROC-AUC scores of 0.92, 0.82, and 0.88 for predicting HbA1c, BP, and LDL levels, respectively. Feature importance analysis underscored the critical roles of age, SDoH, and general health status in predicting biomarker levels. SHAP analysis provided detailed insights into feature impacts, revealing the multifaceted nature of HbA1c, BP, and LDL predictions. Social determinants consistently emerged as influential predictors, emphasizing their importance in chronic disease management. Local interpretation highlighted individual predictors of unhealthy biomarker levels, suggesting targeted interventions involving lifestyle modifications and community resources. In summary, the substantial contribution of SDoH-related features enhances the interpretability of the models and deepens our understanding of the underlying data dynamics. Consequently, our findings underscore the importance of incorporating SDoH factors into predictive modeling initiatives aimed at informing interventions and policies targeting glycemic control and broader health outcomes.
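
The fold-level aggregation described in the methods (minimum, maximum, and average of each metric across cross-validation folds) can be illustrated with a minimal sketch. This is not the dissertation's code: the function name `aggregate_fold_metrics`, the metric keys, and the per-fold scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def aggregate_fold_metrics(fold_metrics):
    """Summarize each metric across folds as its min, max, and mean.

    fold_metrics: list of dicts, one per cross-validation fold, each
    mapping a metric name to that fold's score (hypothetical layout).
    """
    summary = {}
    for name in fold_metrics[0]:
        values = [fold[name] for fold in fold_metrics]
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
    return summary


# Invented per-fold scores, for illustration only (not study results).
example_folds = [
    {"roc_auc": 0.80, "f1": 0.70},
    {"roc_auc": 0.90, "f1": 0.75},
    {"roc_auc": 0.85, "f1": 0.80},
]

summary = aggregate_fold_metrics(example_folds)
for name, stats in summary.items():
    print(f"{name}: min={stats['min']:.2f} "
          f"max={stats['max']:.2f} mean={stats['mean']:.2f}")
```

Reporting the minimum and maximum alongside the mean shows how much a metric varies from fold to fold, which an average alone would hide.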

Available for download on Saturday, July 18, 2026
