Date of Award

May 2018

Degree Type


Degree Name

Doctor of Philosophy


Biomedical and Health Informatics

First Advisor

Mary Shimoyama

Committee Members

Susan McRoy, Christine C.T. Cheng, Melinda Dwinell


Clustering, Data Mining, Data Visualization, Meta-analysis, Reproducibility



The laboratory rat has been widely used as an animal model in biomedical research. There are many strains exhibiting a wide variety of phenotypes. Capturing these phenotypes in a centralized database provides researchers with an easy method for choosing the appropriate strains for their studies. Existing resources such as NBRP and PhysGen represent preliminary work on rat phenotype databases. However, both projects have drawbacks: (1) NBRP uses a small number of animals (6 rats); (2) the NBRP project is a one-time effort for each strain; (3) the PhysGen web interface only supports queries within a single study, so data comparison and integration are not possible; (4) PhysGen lacks a data standardization process, so the measurement method, experimental condition, and age of the rats used are hidden. Therefore, there is a need for better data integration and visualization methods that give users more insight into phenotype differences across rat strains. The Rat Genome Database (RGD) PhenoMiner tool has taken the first step in this effort by standardizing and integrating data from individual studies as well as from NBRP and PhysGen.


Our work involved the following key steps: (1) we developed a meta-analysis pipeline that automatically integrates data from heterogeneous sources and produces expected ranges (standardized phenotype ranges) for different strains and different phenotypes under different experimental conditions; (2) we created tools to visualize expected ranges for individual strains and strain groups; (3) we clustered substrains into sub-populations according to phenotype correlations.


We developed a meta-analysis pipeline and an interactive web interface that summarizes and visualizes the expected ranges the pipeline produces. Automation of the pipeline allows for updates as additional data become available. The interactive web interface provides researchers with a platform for identifying and validating expected ranges for a variety of quantitative phenotypes. In addition, we performed a preliminary cluster analysis that enables researchers to examine similarities among strains, substrains, and different sex or age groups of strains on a multi-dimensional scale using multiple phenotype features.


The data resources and the data mining and visualization tools will promote an understanding of rat disease models, guide researchers in choosing optimal strains for their research needs, and encourage data sharing among research hubs. Such resources also help to promote research reproducibility. The data produced and the interactive platforms created in this project will continue to provide a valuable resource for translational research efforts.