Date of Award

August 2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy

Department

Engineering

First Advisor

Jun JZ Zhang

Second Advisor

Rodney RS Sparapani

Committee Members

Jun JZ Zhang, Rodney RS Sparapani, Yi YH Hu, Zeyun ZY Yu, Jacquelyn JK Kulinski

Keywords

Bayesian approach, Biomedical data, C++, machine learning, R, statistics

Abstract

Advances in Artificial Intelligence (AI), machine learning, and statistics have expanded the potential of health care. In this dissertation, I explored methods to enhance the quantity and quality of medical datasets and improve the accuracy of statistical and machine learning analyses. For this purpose, I addressed three main research questions: (1) "Extracting Medical Terms from Cardiovascular Imaging Reports", (2) "Statistical Analysis for Quantitative and Qualitative Research in Clinical Studies", with sub-questions (2.1) "Statistical Comparison of Heart Rate Variability Measurements Between Devices: Chest Strap Versus Finger Probe" and (2.2) "Qualitative Analysis of Reasons Given by Those Who Were not Interested in Participating: The Singing and Cardiovascular Health in Older Adults Study (NCT04121741)", and (3) "Bayesian Nonparametric Machine Learning Methodology and Variable Selection".

First project: Despite numerous studies on clinical data extraction, little progress has been made in relying entirely on regular expressions to match and extract specific medical terms from unstructured cardiovascular imaging reports. To close this gap, I extracted ejection fraction (EF) and strain values from reports of several cardiac imaging modalities with a Python script built on regular expressions. After validation and error correction, the script reached high accuracy (98% overall). It extracts data faster and more accurately than manual methods and makes more medical datasets available for AI.
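As a rough illustration of regular-expression extraction of ejection fraction values, the sketch below pulls EF percentages (single values or ranges) out of free text. It is not the dissertation's script: the pattern `EF_PATTERN`, the helper `extract_ef`, and the sample sentences are all hypothetical, and real reports are likely to need many more phrasing variants.

```python
import re

# Hypothetical pattern (an assumption, not the dissertation's actual regex):
# an EF term, a short run of non-digit filler, then a value or range in percent.
EF_PATTERN = re.compile(
    r"\b(?:LV\s*EF|EF|ejection\s+fraction)\b"    # the medical term
    r"[^\d\n]{0,40}?"                            # filler such as " is estimated at "
    r"(\d{1,2}(?:\.\d+)?)"                       # single value or lower bound
    r"(?:\s*(?:-|to)\s*(\d{1,2}(?:\.\d+)?))?"    # optional upper bound of a range
    r"\s*%",
    re.IGNORECASE,
)


def extract_ef(report):
    """Return a list of (low, high) EF percentages found in the report text.

    A single value such as "35%" is returned as (35.0, 35.0).
    """
    results = []
    for match in EF_PATTERN.finditer(report):
        low = float(match.group(1))
        high = float(match.group(2)) if match.group(2) else low
        results.append((low, high))
    return results


# Made-up sentences in the style of an imaging report.
sample = (
    "Left ventricular ejection fraction is estimated at 55-60%.\n"
    "LVEF: 35 %\n"
    "No pericardial effusion."
)
ef_values = extract_ef(sample)
```

Here `ef_values` holds one tuple per matched mention. The word boundaries keep the pattern from firing on words such as "effusion" that merely begin with "ef".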

The first aspect of the second project: The finger probe is more convenient than the chest strap for both the medical expert and the subject. However, the Polar H7 chest strap serves as the ground truth because it has been validated in previous studies. I wanted to see whether the CorSense finger probe is compatible with the Polar H7 chest strap for heart rate variability (HRV) measurements. For this purpose, I compared HRV measurements (root mean square of successive RR interval differences (rMSSD), standard deviation of NN intervals (SDNN), percentage of successive RR intervals that differ by more than 50 ms (PNN50), low-frequency power (LF Power), and high-frequency power (HF Power)) taken with both devices by applying the Wilcoxon rank-sum test (WSRT). The SDNN and LF Power measurements had similar values for both devices, but their p-values still indicated statistically significant differences. Therefore, all the pairs were statistically different and the devices are not interchangeable. The results can help researchers design their studies with better insight.
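The three time-domain HRV measures named above can be computed directly from a series of RR intervals. The sketch below is a minimal illustration under stated assumptions: the function name `hrv_time_domain` and the sample intervals are hypothetical, SDNN is computed as the sample standard deviation (n - 1 denominator), and no artifact or ectopic-beat filtering is applied, which a real analysis would usually need.

```python
import math


def hrv_time_domain(rr_ms):
    """Compute rMSSD, SDNN, and pNN50 from RR intervals given in milliseconds."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")

    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]

    # rMSSD: root mean square of the successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))

    # SDNN: standard deviation of the intervals (sample form, an assumption).
    mean_rr = sum(rr_ms) / len(rr_ms)
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (len(rr_ms) - 1))

    # pNN50: percentage of successive differences larger than 50 ms.
    pnn50 = 100.0 * sum(1 for d in diffs if abs(d) > 50) / len(diffs)

    return {"rmssd": rmssd, "sdnn": sdnn, "pnn50": pnn50}


# Made-up RR intervals (ms), for illustration only.
metrics = hrv_time_domain([800, 860, 800, 810])
```

Running the same function on recordings from each device would give the paired values that a statistical comparison then operates on.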

The second aspect of the second project: It is crucial in a clinical study to recruit enough subjects within the planned time frame. Some studies explore reasons for refusal, but their scope is different. I categorized the refusal reasons given by 306 elderly adults with coronary artery disease (CAD) through content analysis, aiming for the most accurate categorization. The proportion of women who refused to participate in the singing study (NCT04121741) was higher than that of men. Women had more family responsibilities and could not participate in the study. This study gives insight into recruiting more subjects for future work.

Final project: Medical datasets often have many features per patient, and the number of features may be greater or smaller than the number of patients. In these situations, it is hard to apply machine learning techniques accurately. I addressed this problem with the nonparametric failure time Dirichlet additive regression trees (NFT DART) model. To build it, I extended the nonparametric failure time Bayesian additive regression trees (NFT BART) model with Dirichlet priors and implemented the model in software (R, C++, and Rcpp). I also developed the nftdart R package to run NFT DART. I validated the model on lung cancer data with added noise. It captured the Gaussian and uniform noise well but not as well as binary noise.

Available for download on Saturday, August 29, 2026
