Networking USL Club Similarity With Euclidean Distance

Euclidean distance is the straight-line distance between two points. If each team is described by a set of numeric variables, the same measure shows how similar two teams are: the smaller the distance, the more alike they are. In this post, I use Euclidean distance to calculate the similarity between USL clubs and map the results to a network graph. The inputs are the offensive and defensive ratings from the 538 Soccer Power Index (SPI) data.

Setup

library(tidyverse)
library(broom)
library(ggraph)
library(tidygraph)
library(viridis)

set_graph_style()

set.seed(1234)

Download data

This code downloads the data from 538’s GitHub repo, keeps only the United Soccer League clubs, and renames Arizona United to Phoenix Rising.

read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_global_rankings.csv", progress = FALSE) %>% 
  filter(league == "United Soccer League") %>% 
  mutate(name = str_replace(name, "Arizona United", "Phoenix Rising")) -> df

df
## # A tibble: 35 x 7
##     rank prev_rank name                   league                 off   def   spi
##    <dbl>     <dbl> <chr>                  <chr>                <dbl> <dbl> <dbl>
##  1   263       257 Phoenix Rising         United Soccer League  1.59  1.77  42.4
##  2   419       428 San Antonio FC         United Soccer League  1.17  1.82  32.0
##  3   460       475 Pittsburgh Riverhounds United Soccer League  0.98  1.69  29.7
##  4   465       454 Tampa Bay Rowdies      United Soccer League  0.96  1.67  29.7
##  5   478       482 Reno 1868 FC           United Soccer League  1.06  1.92  27.9
##  6   498       496 Indy Eleven            United Soccer League  0.81  1.66  26.2
##  7   505       489 Orange County SC       United Soccer League  0.86  1.76  25.8
##  8   520       518 Louisville City FC     United Soccer League  0.85  1.84  24.2
##  9   533       528 New Mexico United      United Soccer League  0.9   2.01  22.9
## 10   534       532 Sacramento Republic FC United Soccer League  0.75  1.79  22.9
## # … with 25 more rows

Calculate Euclidean distance

This code measures the pairwise distance between clubs using the 538 offensive and defensive ratings. The club names are moved to row names so that each distance is labeled with the pair of clubs it belongs to.

df %>% 
  select(name, off, def) %>% 
  column_to_rownames(var = "name") -> df_dist

#df_dist
#rownames(df_dist) %>% 
#  head()

# Pairwise Euclidean distance between all clubs (lower triangle only, no diagonal)
df_dist <- dist(df_dist, method = "euclidean", upper = FALSE, diag = FALSE)
#head(df_dist)

df_dist %>% 
  tidy() %>% 
  arrange(desc(distance)) -> df_dist

#df_dist %>% 
#  count(item1, sort = TRUE) %>% 
#  ggplot(aes(item1, n)) +
#  geom_point() +
#  coord_flip() +
#  theme_bw()
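
As a sanity check on what dist() computes, here is a minimal sketch that works out the distance for one pair of clubs by hand. The two rows are copied from the first lines of the table printed earlier. The object names (ratings, by_hand, via_dist) are mine and are not part of the pipeline above.

```r
# Offensive and defensive ratings copied from the table printed earlier
ratings <- rbind(
  "Phoenix Rising" = c(off = 1.59, def = 1.77),
  "San Antonio FC" = c(off = 1.17, def = 1.82)
)

# Euclidean distance by hand: square root of the summed squared differences
by_hand <- sqrt(sum((ratings["Phoenix Rising", ] - ratings["San Antonio FC", ])^2))

# The same pair via dist(), the function used in the pipeline above
via_dist <- as.numeric(dist(ratings, method = "euclidean"))

# Both approaches should agree (roughly 0.42 for this pair)
all.equal(by_hand, via_dist)
```

Almost all of the gap between these two clubs comes from the offensive rating (a difference of 0.42) rather than the defensive rating (a difference of 0.05).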

Network graph

In this snippet I set a threshold for how similar clubs need to be to warrant a connection: only pairs whose squared distance is at most 0.5 get an edge. Then I graph the result using tidygraph and ggraph. Teams that are closer together on the graph are more similar. Darker and thicker lines indicate higher similarity, that is, a smaller distance.

distance_filter <- .5

df_dist %>% 
  mutate(distance = distance^2) %>% 
  filter(distance <= distance_filter) %>%
  as_tbl_graph(directed = FALSE) %>% 
  mutate(community = as.factor(group_edge_betweenness())) %>%
  ggraph(layout = "kk", maxiter = 1000) +
    geom_edge_fan(aes(edge_alpha = distance, edge_width = distance)) + 
    geom_node_label(aes(label = name, color = community), size = 5) +
    scale_color_discrete("Group") +
    scale_edge_alpha_continuous("Euclidean distance ^2", range = c(.2, 0)) +
    scale_edge_width_continuous("Euclidean distance ^2", range = c(2, 0)) +
    labs(title = "United Soccer League clubs",
       subtitle = "Euclidean distance (offensive rating, defensive rating)^2",
       x = NULL,
       y = NULL,
       caption = "538 data, @conor_tompkins")
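
Because the filter above is applied after squaring, a cutoff of 0.5 on the squared distance is the same as a cutoff of sqrt(0.5), about 0.71, on the raw distance. The sketch below illustrates this with a toy table. The clubs A, B and C and their distances are hypothetical and are not taken from the 538 data.

```r
# Hypothetical pairwise distances, for illustration only (not from the 538 data)
toy <- data.frame(
  item1 = c("A", "A", "B"),
  item2 = c("B", "C", "C"),
  distance = c(0.30, 0.70, 0.90)
)

distance_filter <- 0.5

# Filter on the squared distance, as in the graph code above
keep_squared <- toy$distance^2 <= distance_filter

# Equivalent filter expressed on the raw distance
keep_raw <- toy$distance <= sqrt(distance_filter)

# Both filters keep the same pairs: A-B and A-C stay, B-C is dropped
identical(keep_squared, keep_raw)
toy[keep_squared, ]
```

Raising distance_filter adds more edges and makes the graph denser. Lowering it leaves only the most similar pairs connected.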
