The motivation for this post is to see if it is possible to use bird species abundance across space to infer types of habitat. I will use bird species abundance rasters to create clusters in an unsupervised learning context. I focus on Pennsylvania because that is habitat I am personally familiar with, but the approach could be generalized to other areas that the data has been collected for.
Get data from {ebirdst}
The species abundance data I use is from the {ebirdst} package. There is a great series of tutorial videos on Youtube that explain the underlying theory and application. The TLDR is that the authors use citizen science observations from eBird, temporal data, and environmental data (geographic indicators such as elevation, topography, and land cover) to predict the relative abundance of many bird species across space and time. While there is some risk of circular logic here (using estimates partially based on environmental data to make inferences about the environment), I think the signal in spatiotemporal patterns of bird abundance adds enough extra information to justify the approach.
The full paper behind the estimates is available here.
Matthew Strimas-Mackey, Shawn Ligocki, Tom Auer, Daniel Fink (2023). ebirdst: Access and Analyze eBird Status and Trends Data Products. R package version 3.2022.0. https://ebird.github.io/ebirdst/
For this analysis I focus on the proportion of the population for a given species in a 3x3 kilometer area. For example, if 10% of the Pennsylvania population of Yellow Warbler exist in 3x3 km tile “A”, that tile’s proportion of population for that species is 10%. I focus on migratory birds that the Cornell team has high confidence estimates for.
I wrote a convenience function to handle the process of getting the data and subsetting it to my area of interest. The process is:
Download the raster data for the entire world
Crop and mask the raster to just Pennsylvania
Calculate the proportion of the Pennsylvania population that each tile represents for a given species + week of the year
Since the rasters have multiple layers for each week of the year, I take the highest value for the entire year for each raster tile. This means that each abundance estimate represents the maximum proportion of population for that species across the year for a given tile.
There are 166 distinct species across 7,692 distinct tiles in the dataset.
Exploratory data analysis
Problem of scale
While the distribution of some birds’ abundance across space is normally distributed, some distributions are extremely skewed. This makes it more difficult to analyze across species.
This shows the distribution of various species’ abundance, transformed various ways to attempt to normalize the skewed distributions. Unfortunately the species with more skew in their distribution are extremely zero-inflated, i.e. there are many values of 0 abundance.
These transformations are not very effective with this type of skew. Using a scale function to transform the data relative to its mean is not effective for zero-inflated data because the mean is very close to the lower limit (0) of the data. While the maps that use scaled and rescaled may look “better”, I don’t think the output is statistically sensible.
The Snow Goose in Pennsylvania is a good example of a highly zero-inflated distribution because it is either absent or hugely abundant, and only in specific areas.
Snow Goose abundance map from eBird
One Snow Goose:
Many Snow Geese:
Snow Goose make distribution skew go brrrrrrrrrr
My potato photo, to calibrate your expections of my photography skills:
Clustering
Kmeans
First I will use kmeans to cluster the data. This code (adapted from tidymodels) calculates the total within sum of squared error across increasing amounts of clusters.
This code calculates the SSE ratio and silhouette score for the clustering. I will use this to compare the performance against hierarchical clustering later.
This code creates a map where each tile is colored by the cluster it belongs to. I think kmeans misses a fair amount of the variation in the data. The clusters seem way too general, based on my experience in the regions.
Code
cluster_geo <- bird_df_wide |>bind_cols(extract_cluster_assignment(kclust_fit)) |>select(.cluster, everything())custom_palette_kmeans <- RColorBrewer::brewer.pal(nclust_kmeans, "Paired")cluster_geo |>ggplot(aes(x, y, fill = .cluster)) +geom_tile() +scale_fill_manual(values = custom_palette_kmeans) +coord_fixed() +labs(title ="Types of habitat inferred from species proportion of population",subtitle ="Clusters determined by kmeans algorithm",caption ="Data from eBird Status and Trends",fill ="Cluster") +theme_void() +theme(plot.background =element_rect(fill ="grey"))
Cluster 1 indicates fields, forests, and areas with more intensive agriculture.
Cluster 2 contains the shoreline of Lake Erie and the Susquehanna River.
Cluster 3 indicates the peaks of the Allegheny Mountain Range.
Cluster 4 is a unique area in southeastern Pennsylvania that attracts very large flocks of blackbirds.
Code
#faceted mapblank_tiles <- bird_df_wide |>distinct(x, y)cluster_geo |>select(x, y, .cluster) |>ggplot(aes(x, y, fill = .cluster)) +geom_tile(data = blank_tiles, aes(x, y), fill ="grey", inherit.aes =FALSE) +geom_tile() +scale_fill_manual(values = custom_palette_kmeans) +coord_fixed() +labs(title ="Types of habitat inferred from species proportion of population",subtitle ="Clusters determined by kmeans algorithm",caption ="Data from eBird Status and Trends") +guides(fill ="none") +facet_wrap(vars(.cluster)) +theme_void()
Next I will compare the proportion of population for species in each cluster to the statewide average for that species.
Code
#compare clusters to global averageglobal_avg <- bird_df |>summarize(prop_pop_global_mean =mean(prop_pop_max_log10),.by =c(common_name))
This table shows which birds are more represented in a cluster compared to the statewide average:
Code
#which birds are more common in a cluster than average?centroid_vs_global_kmeans |>group_by(.cluster) |>slice_max(order_by = diff, n =5) |>select(.cluster, common_name, diff) |>arrange(.cluster, desc(diff)) |>select(-diff) |>pivot_wider(names_from = .cluster, values_from = common_name) |>unnest(everything()) |>gt()
Cluster_1
Cluster_2
Cluster_3
Cluster_4
Willow Flycatcher
Bonaparte's Gull
Cerulean Warbler
Common Grackle
Swamp Sparrow
Ring-billed Gull
Worm-eating Warbler
Red-winged Blackbird
American Woodcock
Double-crested Cormorant
Broad-winged Hawk
Brown-headed Cowbird
Gray Catbird
Lesser Scaup
Louisiana Waterthrush
American Crow
Mourning Warbler
Canvasback
Hooded Warbler
Tree Swallow
This shows species that are less represented than the statewide average:
Code
#what birds are less common in a cluster than average?centroid_vs_global_kmeans |>group_by(.cluster) |>slice_min(order_by = diff, n =5) |>select(.cluster, common_name, diff) |>arrange(.cluster, desc(diff)) |>select(-diff) |>pivot_wider(names_from = .cluster, values_from = common_name) |>unnest(everything()) |>gt()
Cluster_1
Cluster_2
Cluster_3
Cluster_4
Common Grackle
American Woodcock
Gray Catbird
Worm-eating Warbler
Ring-billed Gull
Ovenbird
Ring-billed Gull
American Woodcock
Worm-eating Warbler
Wood Thrush
Swamp Sparrow
Wood Thrush
Bonaparte's Gull
Blue-winged Warbler
Willow Flycatcher
Cerulean Warbler
Cerulean Warbler
Cerulean Warbler
Common Grackle
Blue-winged Warbler
These patterns match what I would have expected. Gulls and ducks are more likely to be in habitat with lots of water. Species that prefer steep mountains (Worm-eating Warbler, Cerulean Warbler) are much more abundant in that habitat compared to other clusters.
Hierarchical
This code uses the total within sum of squares to show the optimal number of clusters using hierarchical clustering. As more clusters are used, there are diminishing returns in the compactness of the clusters. This makes sense because of how the algorithm divides the data into nested clusters. There is a slight elbow at 6 clusters, so I chose that as the cutoff.
Code
hclust_wss_plot <-fviz_nbclust(bird_df_wide_noloc, FUN = hcut, method ="wss", k.max =10)hclust_wss_plot
This shows the clusters from the hierarhical algorithm on the map of Pennsylvania. To my eye, this algorithm identifies many more unique areas.
Code
hclust_geo <- bird_df_wide |>select(x, y) |>bind_cols(hc_assignments) |>mutate(.cluster =fct_infreq(.cluster))custom_palette_hclust <- RColorBrewer::brewer.pal(nclust_hclust, "Paired")hclust_geo |>ggplot(aes(x, y, fill = .cluster)) +geom_tile() +scale_fill_manual(values = custom_palette_hclust) +coord_fixed() +labs(title ="Types of habitat inferred from species proportion of population",subtitle ="Clusters determined by hclust algorithm",caption ="Data from eBird Status and Trends",fill ="Cluster") +theme_void() +theme(plot.background =element_rect(fill ="grey"))
Code
#faceted maphclust_geo |>ggplot(aes(x, y, fill = .cluster)) +geom_tile(data = blank_tiles, aes(x, y), fill ="grey", inherit.aes =FALSE) +geom_tile() +scale_fill_manual(values = custom_palette_hclust) +facet_wrap(vars(.cluster)) +coord_fixed() +labs(title ="Types of habitat inferred from species proportion of population",subtitle ="Clusters determined by hclust algorithm",caption ="Data from eBird Status and Trends") +guides(fill ="none") +theme_void()
Cluster 1 indicates areas with more open tree canopy and less water. This area is notably smaller compared to the same habitat cluster from the kmeans algorithm.
Cluster 2 contains the shoreline of Lake Erie and the Susquehanna River. This is the same as in kmeans.
Cluster 3 contains the “deep forest” habitat.
Cluster 4 is the habitat dominated by agricultural land use.
Cluster 5 contains habitat with steep mountains and forests.
Cluster 6 has the habitat in the southeast dominated by large flocks of blackbirds.
These tables compare the abundance of species within each cluster to the state-wide average.
Which birds are more abundant in a cluster than average?
Code
#what birds are more common in a cluster than average?centroid_vs_global_hc |>group_by(.cluster) |>slice_max(order_by = diff, n =5) |>select(.cluster, common_name, diff) |>arrange(.cluster, desc(diff)) |>select(-diff) |>pivot_wider(names_from = .cluster, values_from = common_name) |>unnest(everything()) |>gt()
Cluster_1
Cluster_2
Cluster_3
Cluster_4
Cluster_5
Cluster_6
Rusty Blackbird
Bonaparte's Gull
Blue-winged Warbler
Canada Warbler
Cerulean Warbler
Common Grackle
Willow Flycatcher
Ring-billed Gull
Blue-headed Vireo
American Woodcock
Worm-eating Warbler
Red-winged Blackbird
American Crow
Double-crested Cormorant
Mourning Warbler
Wood Thrush
Broad-winged Hawk
Brown-headed Cowbird
Eastern Meadowlark
Lesser Scaup
Blackburnian Warbler
Black-throated Blue Warbler
Louisiana Waterthrush
American Woodcock
Wilson's Snipe
Canvasback
Black-throated Green Warbler
Ovenbird
Ovenbird
Ring-billed Gull
Which birds are less abundant in a cluster than average?
Code
#what birds are less common in a cluster than average?centroid_vs_global_hc |>group_by(.cluster) |>slice_min(order_by = diff, n =5) |>select(.cluster, common_name, diff) |>arrange(.cluster, desc(diff)) |>select(-diff) |>pivot_wider(names_from = .cluster, values_from = common_name) |>unnest(everything()) |>gt()
Cluster_1
Cluster_2
Cluster_3
Cluster_4
Cluster_5
Cluster_6
Ovenbird
American Woodcock
Ring-billed Gull
Red-winged Blackbird
Swamp Sparrow
Scarlet Tanager
Black-throated Blue Warbler
Ovenbird
Wood Thrush
Ring-billed Gull
Bonaparte's Gull
Worm-eating Warbler
Worm-eating Warbler
Wood Thrush
Worm-eating Warbler
Hooded Warbler
Ring-billed Gull
Hooded Warbler
Canada Warbler
Blue-winged Warbler
Cerulean Warbler
Bonaparte's Gull
Willow Flycatcher
Blue-winged Warbler
Cerulean Warbler
Cerulean Warbler
Common Grackle
Cerulean Warbler
Common Grackle
Cerulean Warbler
Conclusion
Code
bind_rows(kmeans_metrics, hclust_metrics) |>gt()
algorithm
sse_ratio
silhouette_score
kmeans
0.4583
0.5905
hclust
0.8053
0.1178
The kmeans clusters have a higher silhouette score and a lower SSE ratio, which means that they are more distinct from each other and the observations within each cluster are more similar to each other than to observations from other clusters. The hierarchical method produced clusters that are less distinctive and less internally cohesive.
This is surprising to me, because the clusters from the hierarchical clustering algorithm seem more natural, according to my own limited experience birding across Pennsylvania. The hierarchical approach is also more appealing because it more naturally reflects the structure of habitats. For example, there are multiple types of forest (light, medium, deep) and topology (flat land, hills, steep mountains) that attract different types of birds.
I see a few natural extensions of this analysis:
Try different linkage methods for the hierarchical method to see if that produces higher quality clusters
Increase the geographic scope of the analysis. Pennsylvania does not have as many high-use migration paths compared to neighboring states like Ohio and New Jersey. If it is computationally feasible on my laptop, I could try expanding the coverage to many more states and Canadian provinces. It would be interesting to see what clusters Pennsylvania habitat are assigned to, in the context of a wider geographic area.
Try other algorithms (DBSCAN, network graph community detection etc.)
Perform PCA on the proportion of population data and use the components as the inputs to the clustering algorithm, instead of directly using the proportion of population data
Add data on non-migratory (resident) birds
Appendix
Pairwise scatter plot
This compares the maximum proportion of population in a tile between species among 10 randomly selected species, stratified by mean proportion.
PC1 is the “Deep Forest” index. Higher values indicate that the bird prefers deeper forest habitat. PC2 indicates how open the tree canopy of the habitat is. Higher values indicate that the bird prefers habitat with less forest.
PC1 indicates the presence/absence of water. The Susquehanna River (southeast) and Ohio River (southwest) are highlighted.
Code
tile_df |>left_join(tile_pca_rot) |>ggplot(aes(x, y, fill = PC1)) +geom_tile() +scale_fill_viridis_c() +coord_fixed() +theme_void()
PC2 indicates how open the tree canopy of the habitat is, largely driven by agricultural land use. The area in the southeast where the Susquehanna River bisects the Allegheny Mountains is known for its farmland.
Code
tile_df |>left_join(tile_pca_rot) |>ggplot(aes(x, y, fill = PC2)) +geom_tile() +scale_fill_viridis_c() +coord_fixed() +theme_void()
PC3 indicates how steep the mountains are.
Code
tile_df |>left_join(tile_pca_rot) |>ggplot(aes(x, y, fill = PC3)) +geom_tile() +scale_fill_viridis_c() +coord_fixed() +theme_void()