Inferring habitat clusters with eBird data

Author

Conor Tompkins

Published

August 13, 2025

Intro

The motivation for this post is to see if it is possible to use bird species abundance across space to infer types of habitat. I will use bird species abundance rasters to create clusters in an unsupervised learning context. I focus on Pennsylvania because that is habitat I am personally familiar with, but the approach could be generalized to other areas that the data has been collected for.

Get data from {ebirdst}

The species abundance data I use is from the {ebirdst} package. There is a great series of tutorial videos on Youtube that explain the underlying theory and application. The TLDR is that the authors use citizen science observations from eBird, temporal data, and environmental data (geographic indicators such as elevation, topography, and land cover) to predict the relative abundance of many bird species across space and time. While there is some risk of circular logic here (using estimates partially based on environmental data to make inferences about the environment), I think the signal in spatiotemporal patterns of bird abundance adds enough extra information to justify the approach.

The full paper behind the estimates is available here.

Matthew Strimas-Mackey, Shawn Ligocki, Tom Auer, Daniel Fink (2023). ebirdst: Access and Analyze eBird Status and Trends Data Products. R package version 3.2022.0. https://ebird.github.io/ebirdst/

For this analysis I focus on the proportion of the population for a given species in a 3x3 kilometer area. For example, if 10% of the Pennsylvania population of Yellow Warbler exist in 3x3 km tile “A”, that tile’s proportion of population for that species is 10%. I focus on migratory birds that the Cornell team has high confidence estimates for.

I wrote a convenience function to handle the process of getting the data and subsetting it to my area of interest. The process is:

  • Download the raster data for the entire world
  • Crop and mask the raster to just Pennsylvania
  • Calculate the proportion of the Pennsylvania population that each tile represents for a given species + week of the year
  • Since the rasters have multiple layers for each week of the year, I take the highest value for the entire year for each raster tile. This means that each abundance estimate represents the maximum proportion of population for that species across the year for a given tile.

Organize data

Code
library(tidyverse)
library(sf)
library(broom)
library(parsnip)
library(tidyclust)
library(factoextra)
library(clValid)
library(tune)
library(gt)
library(ggrepel)
library(patchwork)
library(GGally)
library(hrbrthemes)
library(scales)
library(viridis)

options(scipen = 999, digits = 4)

theme_set(theme_bw())

set.seed(1234)
Code
bird_df_raw <- read_csv("post_data/species_abundance_present_in_pa.csv") |>
  rename(prop_pop_max = prop_pop)

glimpse(bird_df_raw)
Rows: 1,276,872
Columns: 4
$ common_name  <chr> "Acadian Flycatcher", "Acadian Flycatcher", "Acadian Flyc…
$ x            <dbl> -177180, -173232, -169285, -165338, -161391, -177180, -17…
$ y            <dbl> 151850, 151850, 151850, 151850, 151850, 147903, 147903, 1…
$ prop_pop_max <dbl> 0.000000000, 0.000000000, 0.000002311, 0.000000000, 0.000…
Code
distinct_species <- bird_df_raw |> 
  distinct(common_name) |> 
  nrow()

distinct_tiles <- bird_df_raw |> 
  distinct(x, y) |> 
  nrow() |> 
  comma()

There are 166 distinct species across 7,692 distinct tiles in the dataset.

Exploratory data analysis

Problem of scale

While the distribution of some birds’ abundance across space is normally distributed, some distributions are extremely skewed. This makes it more difficult to analyze across species.

Code
bird_df_scale_compare <- bird_df_raw |> 
  mutate(scaled = prop_pop_max |> scale() |> as.numeric(),
         rescaled = prop_pop_max |> rescale(to = c(-1, 1)),
         log = log(prop_pop_max + 1),
         log10 = log10(prop_pop_max + 1),
         .by = common_name) |> 
  rename(original_scale = prop_pop_max)

This shows the distribution of various species’ abundance, transformed various ways to attempt to normalize the skewed distributions. Unfortunately the species with more skew in their distribution are extremely zero-inflated, i.e. there are many values of 0 abundance.

These transformations are not very effective with this type of skew. Using a scale function to transform the data relative to its mean is not effective for zero-inflated data because the mean is very close to the lower limit (0) of the data. While the maps that use scaled and rescaled may look “better”, I don’t think the output is statistically sensible.

Code
test_sp <- c("Dark-eyed Junco", "Yellow Warbler", "Snow Goose")

bird_df_scale_compare <- bird_df_scale_compare |> 
  filter(common_name %in% test_sp) |> 
  select(common_name, x, y, original_scale, scaled, rescaled, log, log10) |> 
  pivot_longer(cols = c(original_scale, scaled, rescaled, log, log10)) |> 
  mutate(name = factor(name, levels = c("original_scale", "scaled", "rescaled", "log", "log10")))

graph_abundance <- function(x, y){
  
  x |> 
    ggplot(aes(value, fill = name)) +
    geom_density() +
    facet_wrap(vars(name), ncol = 5, scales = "free") +
    scale_x_continuous(labels = NULL) +
    scale_y_continuous(labels = NULL) +
    labs(title = y)
  
}

bird_df_scale_compare |> 
  group_nest(common_name) |> 
  mutate(plot = map2(data, common_name, graph_abundance)) |> 
  pull(plot) |> 
  wrap_plots(ncol = 1, guides = "collect")

Code
bird_df_scale_compare |>
  pivot_wider(names_from = name, values_from = value) |> 
  filter(common_name %in% test_sp) |> 
  ggplot(aes(x, y, fill = original_scale)) +
  geom_tile() +
  facet_wrap(vars(common_name), ncol = 1) +
  scale_fill_viridis_c() +
  coord_fixed() +
  labs(title = "Original scale") +
  theme_void()

Code
bird_df_scale_compare |>
  pivot_wider(names_from = name, values_from = value) |> 
  filter(common_name %in% test_sp) |> 
  ggplot(aes(x, y, fill = scaled)) +
  geom_tile() +
  facet_wrap(vars(common_name), ncol = 1) +
  scale_fill_viridis_c() +
  coord_fixed() +
  labs(title = "Scaled") +
  theme_void()

Code
bird_df_scale_compare |>
  pivot_wider(names_from = name, values_from = value) |> 
  filter(common_name %in% test_sp) |> 
  ggplot(aes(x, y, fill = rescaled)) +
  geom_tile() +
  facet_wrap(vars(common_name), ncol = 1) +
  scale_fill_viridis_c() +
  coord_fixed() +
  labs(title = "Rescaled") +
  theme_void()

Code
bird_df_scale_compare |>
  pivot_wider(names_from = name,
              values_from = value) |> 
  filter(common_name %in% test_sp) |> 
  ggplot(aes(x, y, fill = log)) +
  geom_tile() +
  facet_wrap(vars(common_name), ncol = 1) +
  scale_fill_viridis_c() +
  coord_fixed() +
  labs(title = "log scale") +
  theme_void()

Code
bird_df_scale_compare |>
  pivot_wider(names_from = name, values_from = value) |> 
  filter(common_name %in% test_sp) |> 
  ggplot(aes(x, y, fill = log10)) +
  geom_tile() +
  facet_wrap(vars(common_name), ncol = 1) +
  scale_fill_viridis_c() +
  coord_fixed() +
  labs(title = "log10 scale") +
  theme_void()

I ended up choosing the log10 transformation, but more testing would be needed to really justify this choice.

Code
bird_df <- bird_df_raw |> 
  mutate(prop_pop_max_log10 = log10(prop_pop_max + 1),
         .by = common_name) |> 
  select(common_name, x, y, prop_pop_max_log10)

bird_df_wide <- bird_df |> 
  pivot_wider(names_from = common_name, values_from = prop_pop_max_log10)

bird_df_wide_noloc <- select(bird_df_wide, -c(x, y))

The Snow Goose in Pennsylvania is a good example of a highly zero-inflated distribution because it is either absent or hugely abundant, and only in specific areas.

Snow Goose abundance map from eBird

One Snow Goose:

Many Snow Geese:

Snow Goose make distribution skew go brrrrrrrrrr

My potato photo, to calibrate your expections of my photography skills:

Clustering

Kmeans

First I will use kmeans to cluster the data. This code (adapted from tidymodels) calculates the total within sum of squared error across increasing amounts of clusters.

Code
nclust_global <- 10

kclusts <- tibble(k = 1:nclust_global) |>
  mutate(
    kclust = purrr::map(k, ~kmeans(bird_df_wide_noloc, .x)),
    tidied = purrr::map(kclust, tidy),
    glanced = purrr::map(kclust, glance),
    augmented = purrr::map(kclust, augment, bird_df_wide)
  )

clusters <- kclusts |>
  unnest(cols = c(tidied))

assignments <- kclusts |>
  unnest(cols = c(augmented))

clusterings <- kclusts |>
  unnest(cols = c(glanced))

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(1:nclust_global))

tot.withinss stops decreasing around 4 clusters, which means that the clusters stop becoming significantly more compact after 4 clusters.

This code fits the kmeans algorithm against the data using 4 clusters.

Code
nclust_kmeans <- 4

kmeans_spec <- k_means(num_clusters = nclust_kmeans) |> 
  set_engine("stats")

kclust_fit <- kmeans_spec |> 
  fit(~ .,
      data = bird_df_wide_noloc
  )

Metrics

This code calculates the SSE ratio and silhouette score for the clustering. I will use this to compare the performance against hierarchical clustering later.

Code
kmeans_summary <- extract_fit_summary(kclust_fit)

dists <- bird_df_wide_noloc |> 
  as.matrix() |> 
  dist()

kmeans_silhouette_score <- silhouette_avg(kclust_fit, dists = dists)

kmeans_metrics <- tibble(algorithm = "kmeans",
                         sse_ratio = sse_ratio(kclust_fit) |> pull(.estimate),
                         silhouette_score = kmeans_silhouette_score |> pull(.estimate))

gt(kmeans_metrics)
algorithm sse_ratio silhouette_score
kmeans 0.4583 0.5905

Maps

This code creates a map where each tile is colored by the cluster it belongs to. I think kmeans misses a fair amount of the variation in the data. The clusters seem way too general, based on my experience in the regions.

Code
cluster_geo <- bird_df_wide |> 
  bind_cols(extract_cluster_assignment(kclust_fit)) |> 
  select(.cluster, everything())

custom_palette_kmeans <- RColorBrewer::brewer.pal(nclust_kmeans, "Paired")

cluster_geo |> 
  ggplot(aes(x, y, fill = .cluster)) +
  geom_tile() +
  scale_fill_manual(values = custom_palette_kmeans) +
  coord_fixed() +
  labs(title = "Types of habitat inferred from species proportion of population",
       subtitle = "Clusters determined by kmeans algorithm",
       caption = "Data from eBird Status and Trends",
       fill = "Cluster") +
  theme_void() +
  theme(plot.background = element_rect(fill = "grey"))

  • Cluster 1 indicates fields, forests, and areas with more intensive agriculture.
  • Cluster 2 contains the shoreline of Lake Erie and the Susquehanna River.
  • Cluster 3 indicates the peaks of the Allegheny Mountain Range.
  • Cluster 4 is a unique area in southeastern Pennsylvania that attracts very large flocks of blackbirds.
Code
#faceted map
blank_tiles <- bird_df_wide |>
  distinct(x, y)

cluster_geo |> 
  select(x, y, .cluster) |> 
  ggplot(aes(x, y, fill = .cluster)) +
  geom_tile(data = blank_tiles, aes(x, y), fill = "grey", inherit.aes = FALSE) +
  geom_tile() +
  scale_fill_manual(values = custom_palette_kmeans) +
  coord_fixed() +
  labs(title = "Types of habitat inferred from species proportion of population",
       subtitle = "Clusters determined by kmeans algorithm",
       caption = "Data from eBird Status and Trends") +
  guides(fill = "none") +
  facet_wrap(vars(.cluster)) +
  theme_void()

Next I will compare the proportion of population for species in each cluster to the statewide average for that species.

Code
#compare clusters to global average
global_avg <- bird_df |> 
  summarize(prop_pop_global_mean = mean(prop_pop_max_log10),
            .by = c(common_name))
Code
centroids_long_kmeans <- kclust_fit |>
  extract_centroids() |> 
  pivot_longer(-.cluster, names_to = "common_name", values_to = "cluster_mean") |> 
  mutate(common_name = str_remove_all(common_name, "\\`"))

centroid_vs_global_kmeans <- centroids_long_kmeans |> 
  left_join(global_avg) |> 
  mutate(diff = cluster_mean - prop_pop_global_mean,
         diff_abs = abs(diff))

This table shows which birds are more represented in a cluster compared to the statewide average:

Code
#which birds are more common in a cluster than average?
centroid_vs_global_kmeans |> 
  group_by(.cluster) |> 
  slice_max(order_by = diff, n = 5) |> 
  select(.cluster, common_name, diff) |> 
  arrange(.cluster, desc(diff)) |> 
  select(-diff) |> 
  pivot_wider(names_from = .cluster, values_from = common_name) |> 
  unnest(everything()) |> 
  gt()
Cluster_1 Cluster_2 Cluster_3 Cluster_4
Willow Flycatcher Bonaparte's Gull Cerulean Warbler Common Grackle
Swamp Sparrow Ring-billed Gull Worm-eating Warbler Red-winged Blackbird
American Woodcock Double-crested Cormorant Broad-winged Hawk Brown-headed Cowbird
Gray Catbird Lesser Scaup Louisiana Waterthrush American Crow
Mourning Warbler Canvasback Hooded Warbler Tree Swallow

This shows species that are less represented than the statewide average:

Code
#what birds are less common in a cluster than average?
centroid_vs_global_kmeans |> 
  group_by(.cluster) |> 
  slice_min(order_by = diff, n = 5) |> 
  select(.cluster, common_name, diff) |> 
  arrange(.cluster, desc(diff)) |> 
  select(-diff) |> 
  pivot_wider(names_from = .cluster, values_from = common_name) |> 
  unnest(everything()) |> 
  gt()
Cluster_1 Cluster_2 Cluster_3 Cluster_4
Common Grackle American Woodcock Gray Catbird Worm-eating Warbler
Ring-billed Gull Ovenbird Ring-billed Gull American Woodcock
Worm-eating Warbler Wood Thrush Swamp Sparrow Wood Thrush
Bonaparte's Gull Blue-winged Warbler Willow Flycatcher Cerulean Warbler
Cerulean Warbler Cerulean Warbler Common Grackle Blue-winged Warbler

These patterns match what I would have expected. Gulls and ducks are more likely to be in habitat with lots of water. Species that prefer steep mountains (Worm-eating Warbler, Cerulean Warbler) are much more abundant in that habitat compared to other clusters.

Hierarchical

This code uses the total within sum of squares to show the optimal number of clusters using hierarchical clustering. As more clusters are used, there are diminishing returns in the compactness of the clusters. This makes sense because of how the algorithm divides the data into nested clusters. There is a slight elbow at 6 clusters, so I chose that as the cutoff.

Code
hclust_wss_plot <- fviz_nbclust(bird_df_wide_noloc, FUN = hcut, method = "wss", k.max = 10)

hclust_wss_plot

Code
nclust_hclust <- 6

hc_spec <- hier_clust(
  num_clusters = nclust_hclust,
  linkage_method = "ward"
)

hc_fit <- hc_spec |>
  fit(~ ., data = bird_df_wide_noloc)

This shows the full dendrogram of the hierarchical cluster output

Code
hclust_tree <- as.dendrogram(hc_fit$fit)

plot(hclust_tree, leaflab = "none")

This shows the same dendrogram cut at 6 clusters (branches).

Code
tree_cut <- hclust_tree |> cut(h = .009)

tree_cut$upper |> plot()

Code
hc_summary <- hc_fit |> extract_fit_summary()

Metrics

This calculates the metrics I will compare later.

Code
hc_silhouette_score <- silhouette_avg(hc_fit, dists = dists)

hclust_metrics <- tibble(algorithm = "hclust",
                         sse_ratio = sse_ratio(hc_fit) |> pull(.estimate),
                         silhouette_score = hc_silhouette_score |> pull(.estimate))

gt(hclust_metrics)
algorithm sse_ratio silhouette_score
hclust 0.8053 0.1178
Code
hc_assignments <- extract_cluster_assignment(hc_fit)

Maps

This shows the clusters from the hierarhical algorithm on the map of Pennsylvania. To my eye, this algorithm identifies many more unique areas.

Code
hclust_geo <- bird_df_wide |> 
  select(x, y) |> 
  bind_cols(hc_assignments) |> 
  mutate(.cluster = fct_infreq(.cluster))

custom_palette_hclust <- RColorBrewer::brewer.pal(nclust_hclust, "Paired")

hclust_geo |> 
  ggplot(aes(x, y, fill = .cluster)) +
  geom_tile() +
  scale_fill_manual(values = custom_palette_hclust) +
  coord_fixed() +
  labs(title = "Types of habitat inferred from species proportion of population",
       subtitle = "Clusters determined by hclust algorithm",
       caption = "Data from eBird Status and Trends",
       fill = "Cluster") +
  theme_void() +
  theme(plot.background = element_rect(fill = "grey"))

Code
#faceted map
hclust_geo |> 
  ggplot(aes(x, y, fill = .cluster)) +
  geom_tile(data = blank_tiles, aes(x, y), fill = "grey", inherit.aes = FALSE) +
  geom_tile() +
  scale_fill_manual(values = custom_palette_hclust) +
  facet_wrap(vars(.cluster)) +
  coord_fixed() +
  labs(title = "Types of habitat inferred from species proportion of population",
       subtitle = "Clusters determined by hclust algorithm",
       caption = "Data from eBird Status and Trends") +
  guides(fill = "none") +
  theme_void()

  • Cluster 1 indicates areas with more open tree canopy and less water. This area is notably smaller compared to the same habitat cluster from the kmeans algorithm.
  • Cluster 2 contains the shoreline of Lake Erie and the Susquehanna River. This is the same as in kmeans.
  • Cluster 3 contains the “deep forest” habitat.
  • Cluster 4 is the habitat dominated by agricultural land use.
  • Cluster 5 contains habitat with steep mountains and forests.
  • Cluster 6 has the habitat in the southeast dominated by large flocks of blackbirds.

These tables compare the abundance of species within each cluster to the state-wide average.

Code
#compare clusters to global average
centroids_long_hc <- hc_fit |>
  extract_centroids() |> 
  pivot_longer(-.cluster, names_to = "common_name", values_to = "cluster_mean") |> 
  mutate(common_name = str_remove_all(common_name, "\\`"))

centroid_vs_global_hc <- centroids_long_hc |> 
  left_join(global_avg) |> 
  mutate(diff = cluster_mean - prop_pop_global_mean,
         diff_abs = abs(diff))

Which birds are more abundant in a cluster than average?

Code
#what birds are more common in a cluster than average?
centroid_vs_global_hc |> 
  group_by(.cluster) |> 
  slice_max(order_by = diff, n = 5) |> 
  select(.cluster, common_name, diff) |> 
  arrange(.cluster, desc(diff)) |> 
  select(-diff) |> 
  pivot_wider(names_from = .cluster, values_from = common_name) |> 
  unnest(everything()) |> 
  gt()
Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6
Rusty Blackbird Bonaparte's Gull Blue-winged Warbler Canada Warbler Cerulean Warbler Common Grackle
Willow Flycatcher Ring-billed Gull Blue-headed Vireo American Woodcock Worm-eating Warbler Red-winged Blackbird
American Crow Double-crested Cormorant Mourning Warbler Wood Thrush Broad-winged Hawk Brown-headed Cowbird
Eastern Meadowlark Lesser Scaup Blackburnian Warbler Black-throated Blue Warbler Louisiana Waterthrush American Woodcock
Wilson's Snipe Canvasback Black-throated Green Warbler Ovenbird Ovenbird Ring-billed Gull

Which birds are less abundant in a cluster than average?

Code
#what birds are less common in a cluster than average?
centroid_vs_global_hc |> 
  group_by(.cluster) |> 
  slice_min(order_by = diff, n = 5) |> 
  select(.cluster, common_name, diff) |> 
  arrange(.cluster, desc(diff)) |> 
  select(-diff) |> 
  pivot_wider(names_from = .cluster, values_from = common_name) |> 
  unnest(everything()) |> 
  gt()
Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 Cluster_6
Ovenbird American Woodcock Ring-billed Gull Red-winged Blackbird Swamp Sparrow Scarlet Tanager
Black-throated Blue Warbler Ovenbird Wood Thrush Ring-billed Gull Bonaparte's Gull Worm-eating Warbler
Worm-eating Warbler Wood Thrush Worm-eating Warbler Hooded Warbler Ring-billed Gull Hooded Warbler
Canada Warbler Blue-winged Warbler Cerulean Warbler Bonaparte's Gull Willow Flycatcher Blue-winged Warbler
Cerulean Warbler Cerulean Warbler Common Grackle Cerulean Warbler Common Grackle Cerulean Warbler

Conclusion

Code
bind_rows(kmeans_metrics, hclust_metrics) |> 
  gt()
algorithm sse_ratio silhouette_score
kmeans 0.4583 0.5905
hclust 0.8053 0.1178

The kmeans clusters have a higher silhouette score and a lower SSE ratio, which means that they are more distinct from each other and the observations within each cluster are more similar to each other than to observations from other clusters. The hierarchical method produced clusters that are less distinctive and less internally cohesive.

This is surprising to me, because the clusters from the hierarchical clustering algorithm seem more natural, according to my own limited experience birding across Pennsylvania. The hierarchical approach is also more appealing because it more naturally reflects the structure of habitats. For example, there are multiple types of forest (light, medium, deep) and topology (flat land, hills, steep mountains) that attract different types of birds.

I see a few natural extensions of this analysis:

  • Try different linkage methods for the hierarchical method to see if that produces higher quality clusters
  • Increase the geographic scope of the analysis. Pennsylvania does not have as many high-use migration paths compared to neighboring states like Ohio and New Jersey. If it is computationally feasible on my laptop, I could try expanding the coverage to many more states and Canadian provinces. It would be interesting to see what clusters Pennsylvania habitat are assigned to, in the context of a wider geographic area.
  • Try other algorithms (DBSCAN, network graph community detection etc.)
  • Perform PCA on the proportion of population data and use the components as the inputs to the clustering algorithm, instead of directly using the proportion of population data
  • Add data on non-migratory (resident) birds

Appendix

Pairwise scatter plot

This compares the maximum proportion of population in a tile between species among 10 randomly selected species, stratified by mean proportion.

Code
#ggpairs plot
my_fn <- function(data, mapping, ...){
  # Using default ggplot density function
  
  p <- ggplot(data = data, mapping = mapping) + 
    stat_density2d(aes(fill=after_stat(density)), geom="tile", contour = FALSE) +
    scale_fill_gradientn(colours=viridis(100, option="viridis")) +
    theme(axis.text = element_blank(),
          strip.text = element_text(size = 1))
  p
}

sampled_sp <- bird_df |>
  select(common_name, prop_pop_max_log10) |>
  summarize(mean = mean(prop_pop_max_log10), .by = common_name) |>
  slice_sample(n = 10, weight_by = mean)

bird_df_wide_noloc |>
  select(all_of(pull(sampled_sp, common_name))) |> 
  ggpairs(lower = list(continuous = my_fn))

This is not a feasible analysis tool at scale, but it confirms some basic intuitive findings.

PCA

On birds

This code calculates principal components using each species’ prop_pop_max_log10 as inputs.

Code
#tidy PCA code from https://clauswilke.com/blog/2020/09/07/pca-tidyverse-style/
pca_fit <- bird_df_wide_noloc |> 
  prcomp(scale = TRUE) # do PCA on scaled data

pca_fit |>
  tidy(matrix = "eigenvalues") |>
  filter(PC <= 10) |> 
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(breaks = c(1:10)) +
  scale_y_percent() +
  labs(title = "Percent of variance explained")

Code
# plot rotation matrix
pca_rot <- pca_fit |>
  tidy(matrix = "rotation") |>
  pivot_wider(names_from = "PC", names_prefix = "PC", values_from = "value") |> 
  rename(common_name = column)

# define arrow style for plotting
arrow_style <- arrow(
  angle = 20, ends = "first", type = "closed", length = grid::unit(3, "pt")
)

pca_outliers <- pca_rot |>
  filter(PC1 == max(PC1) | PC2 == max(PC2) | PC3 == max(PC3) | PC1 == min(PC1) | PC2 == min(PC2) | PC3 == min(PC3))

pca_outliers_max <- pca_rot |> 
  select(common_name, PC1:PC3) |> 
  pivot_longer(-common_name) |> 
  group_by(name) |> 
  slice_max(order_by = value, n = 3) |> 
  ungroup() |> 
  distinct(common_name)

pca_outliers_min <- pca_rot |> 
  select(common_name, PC1:PC3) |> 
  pivot_longer(-common_name) |> 
  group_by(name) |> 
  slice_min(order_by = value, n = 3) |> 
  ungroup() |> 
  distinct(common_name)

pca_outliers <- bind_rows(pca_outliers_max, pca_outliers_min) |> 
  left_join(pca_rot)

PC1 is the “Deep Forest” index. Higher values indicate that the bird prefers deeper forest habitat. PC2 indicates how open the tree canopy of the habitat is. Higher values indicate that the bird prefers habitat with less forest.

Code
pca_rot |>
  ggplot(aes(PC1, PC2)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style, lwd = .2) +
  geom_point(
    data = pca_outliers,
    color = "#904C2F"
  ) +
  geom_label_repel(
    data = pca_outliers,
    aes(label = common_name),
    color = "#904C2F"
  ) +
  coord_obs_pred()

PC3 indicates the extent of water in the habitat. Lower values indicate that the species prefers habitat with more water.

Code
pca_rot |>
  ggplot(aes(PC1, PC3)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style, lwd = .2) +
  geom_point(
    data = pca_outliers,
    color = "#904C2F"
  ) +
  geom_label_repel(
    data = pca_outliers,
    aes(label = common_name),
    color = "#904C2F"
  ) +
  coord_obs_pred()

Code
pca_rot |>
  ggplot(aes(PC2, PC3)) +
  geom_segment(xend = 0, yend = 0, arrow = arrow_style, lwd = .2) +
  geom_point(
    data = pca_outliers,
    color = "#904C2F"
  ) +
  geom_label_repel(
    data = pca_outliers,
    aes(label = common_name),
    color = "#904C2F"
  ) +
  coord_obs_pred()

On tiles

This code applies PCA on the tiles instead of the birds.

Code
tile_df <- bird_df |> 
  distinct(x, y) |> 
  mutate(tile_id = row_number() |> as.character())

tile_pca_fit <- bird_df |> 
  left_join(tile_df) |> 
  select(tile_id, common_name, prop_pop_max_log10) |> 
  pivot_wider(names_from = tile_id, values_from = prop_pop_max_log10) |> 
  select(-common_name) |> 
  prcomp(scale = TRUE)

tile_pca_fit |>
  tidy(matrix = "eigenvalues") |>
  filter(PC <= 10) |> 
  ggplot(aes(PC, percent)) +
  geom_col() +
  scale_x_continuous(breaks = c(1:10)) +
  scale_y_percent() +
  labs(title = "Percent of variance explained")

Code
tile_pca_rot <- tile_pca_fit |> 
  tidy("rotation") |> 
  rename(tile_id = column) |> 
  pivot_wider(names_from = PC, values_from = value, names_prefix = "PC")

PC1 indicates the presence/absence of water. The Susquehanna River (southeast) and Ohio River (southwest) are highlighted.

Code
tile_df |> 
  left_join(tile_pca_rot) |> 
  ggplot(aes(x, y, fill = PC1)) +
  geom_tile() +
  scale_fill_viridis_c() +
  coord_fixed() +
  theme_void()

PC2 indicates how open the tree canopy of the habitat is, largely driven by agricultural land use. The area in the southeast where the Susquehanna River bisects the Allegheny Mountains is known for its farmland.

Code
tile_df |> 
  left_join(tile_pca_rot) |> 
  ggplot(aes(x, y, fill = PC2)) +
  geom_tile() +
  scale_fill_viridis_c() +
  coord_fixed() +
  theme_void()

PC3 indicates how steep the mountains are.

Code
tile_df |> 
  left_join(tile_pca_rot) |> 
  ggplot(aes(x, y, fill = PC3)) +
  geom_tile() +
  scale_fill_viridis_c() +
  coord_fixed() +
  theme_void()