Date of Award

December 2022

Degree Type


Degree Name

Doctor of Philosophy



First Advisor


Committee Members

Habib Tabatabai, Yin Wang, Rohit J Kate, Susan McRoy


Keywords

crash narrative, deep learning, natural language processing, text analytics, traffic safety, web tool


Despite significant advances in vehicle technologies, safety data collection and analysis, and engineering, tens of thousands of Americans die every year in motor vehicle crashes. Alarmingly, the trend in fatal and serious injury crashes appears to be heading in the wrong direction. In 2021, the actual rate of fatalities exceeded the predicted rate. This worrisome trend necessitates the development of advanced and holistic approaches to determining the causes of a crash, particularly one involving fatal or major injuries. These approaches include analyzing problems from multiple perspectives, utilizing available data sources, and employing the most suitable tools and technologies from within and outside the traffic safety domain.
The primary source for traffic safety analysis is the structured (also called tabular) data collected from crash reports. However, structured data may be insufficient because of missing information, incomplete sequences of events, and misclassified crash types, among other issues. Crash narratives, a form of free text recorded by police officers to describe the unique aspects and circumstances of a crash, are commonly used by safety professionals to supplement structured data fields. Because narratives are unstructured, engineers have to review every one of them manually. Thanks to rapid developments in natural language processing (NLP) and machine learning (ML) techniques, text mining and analytics have become popular tools for accelerating information extraction and analysis for unstructured text data. The primary objective of this dissertation is to discover and develop the necessary tools, techniques, and algorithms to facilitate traffic safety analysis using crash narratives.
This objective is accomplished in three areas: enhancing data quality by recovering missed crashes through text classification, uncovering complex characteristics of collision generation through information extraction and pattern recognition, and facilitating crash narrative analysis by developing a web-based tool. First, a variety of NoisyOR classifiers were developed to identify and investigate work zone (WZ), distracted driving (DD), and inattentive driving (ID) crashes. In addition, various ML models, including multinomial naive Bayes (MNB), logistic regression (LGR), support vector machine (SVM), k-nearest neighbor (K-NN), random forest (RF), and gated recurrent unit (GRU), were developed and compared with NoisyOR. The comparison shows that NoisyOR is simple, computationally efficient, and theoretically sound, and that it is among the best-performing models. Furthermore, a novel neural network architecture named Sentence-based Hierarchical Attention Network (SHAN) was developed to classify crashes, and its performance exceeds that of NoisyOR, GRU, Hierarchical Attention Network (HAN), and the other ML models. SHAN handled noisy or irrelevant parts of narratives effectively, and its results can be visualized through attention weights. Because a crash often comprises a series of actions and events, breaking the chain of events could prevent a crash from reaching its most dangerous stage. With the objectives of creating crash sequences, discovering patterns of crash events, and finding missing events, the Part-of-Speech Tagging (PT), Pattern Matching with POS Tagging (PMPT), Dependency Parser (DP), and Hybrid Generalized (HGEN) algorithms were developed and thoroughly tested using crash narratives. The top performer, HGEN, uses predefined events and event-related action words from crash narratives to find new events not captured in the data fields. In addition, association analysis reveals the complex interrelations between events within a crash.
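To make the noisy-OR idea concrete, the following is a minimal sketch of the generic textbook formulation, not the dissertation's actual implementation. It assumes each word w carries an estimated probability p_w that a narrative containing it belongs to the target class (for example, work zone), and combines the words of a narrative as 1 - prod(1 - p_w). The function names, the whitespace tokenizer, the min_count cutoff, and the toy narratives are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_word_probs(narratives, labels, min_count=2):
    """Estimate p_w = P(target class | word w appears) from labeled narratives.

    Words seen in fewer than `min_count` narratives are dropped
    (an illustrative smoothing choice, not taken from the source).
    """
    total = Counter()
    positive = Counter()
    for text, label in zip(narratives, labels):
        # Count each word once per narrative (presence, not frequency).
        for word in set(text.lower().split()):
            total[word] += 1
            if label:
                positive[word] += 1
    return {w: positive[w] / n for w, n in total.items() if n >= min_count}


def noisy_or_score(text, word_probs):
    """Noisy-OR combination: 1 - prod(1 - p_w) over known words in the text."""
    prob_no_word_fires = 1.0
    for word in set(text.lower().split()):
        prob_no_word_fires *= 1.0 - word_probs.get(word, 0.0)
    return 1.0 - prob_no_word_fires


# Toy, made-up narratives; label 1 marks a work zone crash.
toy_narratives = [
    "vehicle struck cone in work zone",
    "driver rear ended vehicle at signal",
    "work zone flagger stopped vehicle",
    "vehicle ran off road",
]
toy_labels = [1, 0, 1, 0]

word_probs = train_word_probs(toy_narratives, toy_labels)
score = noisy_or_score("vehicle entered work zone", word_probs)
```

One appeal of this formulation is that the per-word probabilities are directly inspectable, so a reviewer can see which words drove a high score. A common word such as "vehicle" also receives a nonzero probability in this toy setup, which is why a realistic version would need stop-word handling or better probability estimates.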
Finally, the Crash Information Extraction, Analysis, and Classification Tool (CIEACT), a simple and flexible online web tool, was developed to analyze crash narratives using text mining techniques. The tool is built with the Python-based Django web framework, HTML, and a relational database (PostgreSQL), which together enable concurrent model development and analysis. The tool provides built-in classifiers by default and can also train a model in real time on user-supplied data. The interface is user friendly, and the results can be displayed in a tabular format or on an interactive map. The tool also lets users download the words with their probability scores, as well as the results, as CSV files. The advantages and limitations of each proposed methodology are discussed, and several future research directions are outlined. In summary, the methodologies and tools developed as part of this dissertation can assist transportation engineers and safety professionals in extracting valuable information from narratives, recovering missed crashes, classifying new crashes, and expediting their review process on a large scale. Thus, this research can be used by transportation agencies to analyze crash records, identify appropriate safety solutions, and inform policy making to improve the safety of our highway transportation system.