Date of Award

August 2014

Degree Type


Degree Name

Doctor of Philosophy



First Advisor

Susan McRoy

Second Advisor

Hong Yu

Committee Members

Hong Yu, Susan McRoy, Christine Cheng, Rohit J. Kate, Peter J. Tonellato


Keywords

Citation Network, Conditional Random Fields, Graph Analysis, Machine Learning, Protein-protein Interaction, Social Network


As more of the literature is digitized, computational approaches are uncovering an increasing amount of knowledge. This thesis addresses the problem of link prediction in co-authorship networks and protein-protein interaction networks derived from the literature. These networks, like most others, grow over time, and we assume that a machine can learn from past link creations by examining the state of the network at the time each link was created. Our goal is to create a computationally efficient approach that recommends new links for a node in a network (e.g., new collaborations in co-authorship networks and new interactions in protein-protein interaction networks).

We treat edges in a network that satisfy certain criteria as training instances for the machine learning algorithms. We analyze the neighborhood structure of each node to derive topological features. In addition, each node carries rich semantic information once it is linked to the literature, and we use this information to derive semantic features. Using both types of features, we train machine learning models to predict the probability that a new node pair will become connected.
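
As a rough illustration of the two feature types described above, the following Python sketch computes a few neighborhood-based scores and one text-based score for a candidate node pair. It is a minimal sketch under stated assumptions, not the feature set used in the thesis: common neighbors, the Jaccard coefficient, and the Adamic-Adar index are standard topological measures chosen here for illustration, cosine similarity over bag-of-words counts stands in for a semantic feature, and the function names, the toy graph, and the texts are all hypothetical.

```python
import math
from collections import Counter


def topological_features(adj, u, v):
    """Neighborhood-based features for the candidate pair (u, v).

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: shared neighbors with fewer links contribute more.
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1
        ),
    }


def semantic_similarity(text_u, text_v):
    """Cosine similarity between bag-of-words counts of two nodes' texts."""
    a = Counter(text_u.lower().split())
    b = Counter(text_v.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy co-authorship graph: A and D have not yet collaborated.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
# Hypothetical text associated with each author (e.g., words from paper titles).
texts = {
    "A": "protein interaction network",
    "D": "protein interaction prediction",
}

features = topological_features(adj, "A", "D")
features["semantic"] = semantic_similarity(texts["A"], texts["D"])
print(features)
```

A feature dictionary of this kind, computed for each training pair on the network as it stood before the link appeared, could then be passed to any probabilistic classifier to estimate the likelihood of connection for unseen pairs.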

We apply our link prediction approach to two distinct networks: a co-authorship network and a protein-protein interaction network. We demonstrate that the novel features we derive from both network topology and literature content improve link prediction accuracy. We also analyze the factors involved in establishing new links and recurrent connections.