Characterizing Clustering Models of High-dimensional Remotely Sensed Data Using Subsampled Field-subfield Spatial Cross-validated Random Forests
Clustering models are regularly used to construct meaningful groups of observations within complex datasets, and they are an exceptional tool for spatial exploratory analysis. The clusters detected in a recent spatio-temporal cluster analysis of leaf area index (LAI) in the Columbia River Basin (CRB) require further investigation because they were derived from a single greenness metric. It is of great interest to understand how greening indices can be used to separate sites across an array of remotely sensed environmental attributes. That prior work detected highly localized minority clusters that were the most dissimilar from the remaining clusters, as determined by annual variation in remotely sensed LAI. The objective of this study is to discern which other environmental factors are important predictors of cluster allocation in that cluster analysis and, secondarily, to construct a predictive model that prioritizes the minority clusters. A random forest classification is used to examine the importance of various site attributes in predicting cluster allocation. To satisfy these objectives, I propose an application-specific process that integrates spatial sub-sampling and cross-validation to improve the interpretability and utility of random forests for response variables that are spatially autocorrelated, highly localized, and unbalanced in class size. The final random forest model shows that the cluster allocation, although based only on LAI, separates sites significantly across many other environmental attributes, and that elevation, slope, and water storage potential are the most important predictors of cluster allocation. Most importantly, the clusters that the cluster model detected as most dissimilar have the lowest class error rates, which fulfills the secondary objective of aligning the priorities of a predictive model with those of a prior cluster model.
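The abstract does not spell out the field-subfield procedure, so the following is only a minimal sketch of the general idea behind spatial sub-sampling combined with spatially blocked cross-validation, not the author's method. It assumes sites are given as hypothetical `(x, y, cluster_label)` tuples, uses square grid blocks as a stand-in for fields, caps the number of sites kept per class within each block so that majority clusters do not swamp minority ones, and holds out whole blocks together so that spatially autocorrelated neighbors do not appear in both training and test sets. All function names and parameters here are illustrative assumptions.

```python
import random
from collections import defaultdict


def assign_blocks(sites, block_size):
    """Group site indices by the square grid block containing their (x, y)."""
    blocks = defaultdict(list)
    for i, (x, y, _label) in enumerate(sites):
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append(i)
    return blocks


def subsample_blocks(sites, blocks, max_per_class, rng):
    """Within each block, keep at most max_per_class sites of each class."""
    kept = {}
    for key, idxs in blocks.items():
        by_class = defaultdict(list)
        for i in idxs:
            by_class[sites[i][2]].append(i)
        chosen = []
        for members in by_class.values():
            rng.shuffle(members)
            chosen.extend(members[:max_per_class])
        kept[key] = sorted(chosen)
    return kept


def spatial_folds(blocks, n_folds, rng):
    """Yield (train, test) index lists; whole blocks are held out together."""
    keys = sorted(blocks)
    rng.shuffle(keys)
    for f in range(n_folds):
        test_keys = set(keys[f::n_folds])
        train = [i for k in keys if k not in test_keys for i in blocks[k]]
        test = [i for k in keys if k in test_keys for i in blocks[k]]
        yield train, test


# Synthetic, unbalanced example data (purely illustrative, not from the study).
rng = random.Random(0)
sites = [
    (rng.uniform(0, 100), rng.uniform(0, 100), rng.choice("AAAB"))
    for _ in range(500)
]
blocks = subsample_blocks(sites, assign_blocks(sites, 25.0), max_per_class=10, rng=rng)
folds = list(spatial_folds(blocks, n_folds=4, rng=rng))
```

In a workflow of this kind, a classifier such as a random forest would be fitted on each `train` index list and its per-class error rates evaluated on the matching `test` list; the block size and per-class cap are tuning choices that depend on the range of spatial autocorrelation in the data.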
Whetten, Andrew B.
"Characterizing Clustering Models of High-dimensional Remotely Sensed Data Using Subsampled Field-subfield Spatial Cross-validated Random Forests,"
International Journal of Geospatial and Environmental Research: Vol. 8: No. 3, Article 4.
Available at: https://dc.uwm.edu/ijger/vol8/iss3/4
Applied Statistics Commons, Data Science Commons, Earth Sciences Commons, Environmental Monitoring Commons, Geography Commons, Statistical Methodology Commons