library(tidyverse)
library(broom)
library(ggraph)
library(tidygraph)
library(viridis)
set_graph_style()
set.seed(1234)
Euclidean distance is a simple way to measure the distance between two points. It can also be used to measure how similar two sports teams are, given a set of variables. In this post, I use Euclidean distance to calculate the similarity between USL clubs and map that data to a network graph. I will use the 538 Soccer Power Index data to calculate the distance.
Setup
Download data
This code downloads the data from 538’s GitHub repo and does some light munging.
read_csv("https://projects.fivethirtyeight.com/soccer-api/club/spi_global_rankings.csv", progress = FALSE) %>%
filter(league == "United Soccer League") %>%
mutate(name = str_replace(name, "Arizona United", "Phoenix Rising")) -> df
df
Calculate Euclidean distance
This is the code that measures the distance between the clubs. It uses the 538 offensive and defensive ratings.
%>%
df select(name, off, def) %>%
column_to_rownames(var = "name") -> df_dist
#df_dist
#rownames(df_dist) %>%
# head()
<- dist(df_dist, "euclidean", upper = FALSE, diag = FALSE)
df_dist #head(df_dist)
%>%
df_dist tidy() %>%
arrange(desc(distance)) -> df_dist
#df_dist %>%
# count(item1, sort = TRUE) %>%
# ggplot(aes(item1, n)) +
# geom_point() +
# coord_flip() +
# theme_bw()
Network graph
In this snippet I set a threshhold for how similar clubs need to be to warrant a connection. Then I graph it using tidygraph and ggraph. Teams that are closer together on the graph are more similar. Darker and thicker lines indicate higher similarity.
<- .5
distance_filter
%>%
df_dist mutate(distance = distance^2) %>%
filter(distance <= distance_filter) %>%
as_tbl_graph(directed = FALSE) %>%
mutate(community = as.factor(group_edge_betweenness())) %>%
ggraph(layout = "kk", maxiter = 1000) +
geom_edge_fan(aes(edge_alpha = distance, edge_width = distance)) +
geom_node_label(aes(label = name, color = community), size = 5) +
scale_color_discrete("Group") +
scale_edge_alpha_continuous("Euclidean distance ^2", range = c(.2, 0)) +
scale_edge_width_continuous("Euclidean distance ^2", range = c(2, 0)) +
labs(title = "United Soccer League clubs",
subtitle = "Euclidean distance (offensive rating, defensive rating)^2",
x = NULL,
y = NULL,
caption = "538 data, @conor_tompkins")