Analyzing major commuter routes in Allegheny County

Intro
In this post I will use the Mapbox API to calculate metrics for major commuter routes in Allegheny County. The API will provide the distance and duration of the trip, as well as turn-by-turn directions. The route duration should be considered a “minimum duration” because it does not consider traffic. Then I will estimate the duration of the trips with a linear model and compare that to the actual duration from the Mapbox API. I will use the difference between the actual and estimated duration to identify neighborhoods that experience longer or shorter commutes than expected.
library(tidyverse)
library(tidymodels)
library(mapboxapi)
library(tidycensus)
library(janitor)
library(lehdr)
library(tigris)
library(sf)
library(hrbrthemes)
options(tigris_use_cache = TRUE,
scipen = 999,
digits = 4)
theme_set(theme_ipsum())
Gather data
The first step is to download the census tract shapefiles for the county:
#get tracts
allegheny_county_tracts <- tracts(state = "PA", county = "Allegheny", cb = TRUE) %>%
select(GEOID)
st_erase <- function(x, y) {
st_difference(x, st_union(y))
}
ac_water <- area_water("PA", "Allegheny", class = "sf")
allegheny_county_tracts <- st_erase(allegheny_county_tracts, ac_water)
Then I download the “Origin-Destination” LODES file from the Census for Pennsylvania in 2017 and subset it to commuters within Allegheny County.
#load od tract-level data
lodes_od_ac_main <- grab_lodes(state = "pa", year = 2017,
lodes_type = "od", job_type = "JT00",
segment = "S000", state_part = "main",
agg_geo = "tract") %>%
select(state, w_tract, h_tract, S000, year) %>%
rename(commuters = S000) %>%
mutate(intra_tract_commute = h_tract == w_tract) %>%
semi_join(allegheny_county_tracts, by = c("w_tract" = "GEOID")) %>%
semi_join(allegheny_county_tracts, by = c("h_tract" = "GEOID"))
This analysis only considers routes where the commuter changed census tracts. 96% of commuters in Allegheny County change census tracts.
lodes_od_ac_main %>%
group_by(intra_tract_commute) %>%
summarize(commuters = sum(commuters)) %>%
ungroup() %>%
mutate(pct = commuters / sum(commuters))
## # A tibble: 2 x 3
## intra_tract_commute commuters pct
## <lgl> <dbl> <dbl>
## 1 FALSE 460256 0.961
## 2 TRUE 18750 0.0391
Get directions
This is the code that identifies the center of each tract, geocodes those centroids to get an address, and gets the turn-by-turn directions and route data for each pair of home and work addresses. I will focus on the top 20% of these routes (in terms of cumulative percent of commuters) because the Mapbox API is not designed for the size of query I would need to get directions for all combinations of census tracts.
Note that I manually replaced the geocoded address for the Wexford area because the geocoder was returning a result outside of the county, probably because the center of the tract is directly on I-76 (Pennsylvania Turnpike).
#filter out rows where commuter doesn't change tracts
combined_tract_sf <- lodes_od_ac_main %>%
arrange(desc(commuters)) %>%
filter(w_tract != h_tract)
#calculate cumulative pct of commuters, keep only top 20%
combined_tract_sf_small <- combined_tract_sf %>%
select(h_tract, w_tract, commuters) %>%
arrange(desc(commuters)) %>%
mutate(id = row_number(),
pct_commuters = commuters / sum(commuters),
cumulative_pct_commuters = cumsum(pct_commuters)) %>%
filter(cumulative_pct_commuters < .2) %>%
select(h_tract, w_tract, commuters)
#add census centroid geometry
combined_tract_sf_small <- combined_tract_sf_small %>%
left_join(st_centroid(allegheny_county_tracts), by = c("h_tract" = "GEOID")) %>%
rename(h_tract_geo = geometry) %>%
left_join(st_centroid(allegheny_county_tracts), by = c("w_tract" = "GEOID")) %>%
rename(w_tract_geo = geometry) %>%
select(h_tract, h_tract_geo, w_tract, w_tract_geo, commuters)
#get addresses for tract centroids
tract_od_directions <- combined_tract_sf_small %>%
mutate(home_address = map_chr(h_tract_geo, mb_reverse_geocode),
work_address = map_chr(w_tract_geo, mb_reverse_geocode))
#replace bad address with good address
wexford_good_address <- "3321 Wexford Rd, Gibsonia, PA 15044"
tract_od_directions <- tract_od_directions %>%
mutate(home_address = case_when(h_tract == "42003409000" ~ wexford_good_address,
h_tract != "42003409000" ~ home_address),
work_address = case_when(w_tract == "42003409000" ~ wexford_good_address,
w_tract != "42003409000" ~ work_address))
#define error-safe mb_directions function
mb_directions_possibly <- possibly(mb_directions, otherwise = NA)
#geocode addresses, get directions
tract_od_directions <- tract_od_directions %>%
mutate(home_address_location_geocoded = map(home_address, mb_geocode),
work_address_location_geocoded = map(work_address, mb_geocode)) %>%
mutate(directions = map2(home_address, work_address, ~ mb_directions_possibly(origin = .x,
destination = .y,
steps = TRUE,
profile = "driving"))) %>%
select(h_tract, h_tract_geo, home_address, home_address_location_geocoded,
w_tract, w_tract_geo, work_address, work_address_location_geocoded,
directions, commuters)
The core of the above code is combining map2
and mb_directions_possibly
. This maps the mb_directions_possibly
function against two inputs (the home address and work address).
The result is a dataframe with a row per turn-by-turn direction for each commuter route.
tract_od_directions %>%
select(h_tract, w_tract, instructions, distance, duration, commuters) %>%
st_drop_geometry() %>%
as_tibble()
## # A tibble: 8,450 x 6
## h_tract w_tract instructions distance duration commuters
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 42003409… 4200302… Drive west. 0.0339 0.264 619
## 2 42003409… 4200302… Turn left. 0.173 0.965 619
## 3 42003409… 4200302… Turn right onto Wexford Road/… 8.50 10.7 619
## 4 42003409… 4200302… Turn left to take the PA 910 … 0.636 0.949 619
## 5 42003409… 4200302… Merge left onto I 79 South. 2.44 1.62 619
## 6 42003409… 4200302… Keep left to take I 279 South… 18.2 10.9 619
## 7 42003409… 4200302… Keep right to take I 579 Sout… 1.55 1.14 619
## 8 42003409… 4200302… Take the exit toward 6th Aven… 0.571 0.408 619
## 9 42003409… 4200302… Keep left toward 6th Avenue. 0.154 0.159 619
## 10 42003409… 4200302… Merge left onto Bigelow Boule… 0.068 0.231 619
## # … with 8,440 more rows
This summarizes the data so there is one row per commuter route and creates summarized route data.
#summarize direction data
tract_od_stats <- tract_od_directions %>%
unnest(directions) %>%
group_by(h_tract, home_address, w_tract, work_address) %>%
summarize(duration = sum(duration),
distance = sum(distance),
steps = n(),
commuters = unique(commuters)) %>%
ungroup()
As expected, route duration and distance are highly correlated. The median duration of a trip is 16.7 minutes.
#graph od stats
tract_od_stats %>%
ggplot(aes(distance, duration, size = commuters)) +
geom_point(alpha = .3) +
geom_abline(linetype = 2, color = "red") +
coord_equal() +
theme_ipsum() +
labs(title = "Commutes between census tracts",
subtitle = "Allegheny County, PA",
x = "Distance in KM",
y = "Duration in minutes",
size = "Commuters")
median_duration <- tract_od_stats %>%
uncount(weights = commuters) %>%
summarize(median_duration = median(duration)) %>%
pull(median_duration)
tract_od_stats %>%
uncount(weights = commuters) %>%
ggplot(aes(duration)) +
geom_density(fill = "grey") +
geom_vline(xintercept = median_duration, lty = 2, color = "red") +
annotate("text", x = 21, y = .05, label = "median", color = "red") +
theme_ipsum() +
labs(title = "Trip duration",
x = "Duration in minutes",
y = "Density of observations")
This map shows the main roads that commuter routes use I-376, I-279, and Route 28 are major arteries, as expected.
#map routes
tract_od_stats %>%
ggplot() +
geom_sf(data = allegheny_county_tracts, size = .1, fill = "black") +
geom_sf(aes(alpha = commuters, size = commuters), color = "#ffcc01", alpha = .02) +
guides(size = guide_legend(override.aes= list(alpha = 1))) +
scale_size_continuous(range = c(.3, 4)) +
theme_void() +
labs(title = "Commuter routes between Allegheny County census tracts",
subtitle = "Driving routes",
size = "Commuters")
A high-resolution image of this map is available here. An animation of the routes is here.
People that live closer to downtown Pittsburgh have shorter commutes, on average.
allegheny_county_tracts %>%
st_drop_geometry() %>%
left_join(tract_od_stats %>%
select(h_tract, w_tract, duration) %>%
pivot_longer(contains("tract")) %>%
group_by(name, value) %>%
summarize(avg_duration = mean(duration)) %>%
ungroup(),
by = c("GEOID" = "value")) %>%
complete(GEOID, name) %>%
filter(!is.na(name)) %>%
left_join(allegheny_county_tracts) %>%
mutate(name = case_when(name == "h_tract" ~ "Origin tract",
name == "w_tract" ~ "Destination tract"),
name = as.factor(name) %>% fct_rev()) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = avg_duration), size = .1) +
facet_wrap(~name, ncol = 1) +
scale_fill_viridis_c(na.value = "grey90") +
labs(title = "Average commute duration",
fill = "Minutes") +
theme_void()
Model
The next step is to create a model that estimates the duration of a given commute. I will use the number of steps in the turn-by-turn directions and the distance as predictors. Additionally, I will calculate which rivers a commute route crosses and use those as logical variables in the model.
This collects the geometry for the main rivers in the county.
main_rivers <- ac_water %>%
group_by(FULLNAME) %>%
summarize(AWATER = sum(AWATER)) %>%
arrange(desc(AWATER)) %>%
slice(1:4)
This code calculates whether a given commuter route crosses a river.
tract_od_stats_rivers <- tract_od_stats %>%
mutate(intersects_ohio = st_intersects(., main_rivers %>%
filter(FULLNAME == "Ohio Riv")) %>% as.logical(),
intersects_allegheny = st_intersects(., main_rivers %>%
filter(FULLNAME == "Allegheny Riv")) %>% as.logical(),
intersects_monongahela = st_intersects(., main_rivers %>%
filter(FULLNAME == "Monongahela Riv")) %>% as.logical(),
intersects_youghiogheny = st_intersects(., main_rivers %>%
filter(FULLNAME == "Youghiogheny Riv")) %>% as.logical()) %>%
replace_na(list(intersects_ohio = FALSE,
intersects_allegheny = FALSE,
intersects_monongahela = FALSE,
intersects_youghiogheny = FALSE)) %>%
st_drop_geometry()
glimpse(tract_od_stats_rivers)
## Rows: 709
## Columns: 12
## $ h_tract <chr> "42003010300", "42003020100", "42003020100", …
## $ home_address <chr> "1400 Locust Street, Pittsburgh, Pennsylvania…
## $ w_tract <chr> "42003020100", "42003040200", "42003982200", …
## $ work_address <chr> "Piatt Place, 301 5th Ave, Pittsburgh, Pennsy…
## $ duration <dbl> 6.939, 9.275, 12.943, 8.551, 6.729, 9.528, 2.…
## $ distance <dbl> 2.5773, 4.5330, 5.8670, 2.5988, 1.6500, 3.416…
## $ steps <int> 10, 11, 9, 6, 5, 4, 4, 9, 7, 10, 8, 10, 12, 8…
## $ commuters <dbl> 63, 64, 70, 112, 155, 73, 81, 134, 87, 124, 1…
## $ intersects_ohio <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ intersects_allegheny <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ intersects_monongahela <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ intersects_youghiogheny <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
tract_od_stats_rivers <- tract_od_stats_rivers %>%
mutate(od_id = str_c("h_tract: ", h_tract, ", ", "w_tract: ", w_tract, sep = ""))
First I set the seed and split the data into training and testing sets.
set.seed(1234)
#split data
splits <- initial_split(tract_od_stats_rivers, prop = .75)
training_data <- training(splits)
testing_data <- testing(splits)
Then I use {tidymodels}
to define a linear model, cross-validate it, and extract the coefficients.
#recipe
model_recipe <- recipe(duration ~ .,
data = training_data) %>%
update_role(od_id, new_role = "id") %>%
step_rm(h_tract, home_address, w_tract, work_address, commuters) %>%
step_normalize(distance, steps) %>%
step_zv(all_predictors())
model_recipe %>%
prep() %>%
summary()
## # A tibble: 7 x 4
## variable type role source
## <chr> <chr> <chr> <chr>
## 1 distance numeric predictor original
## 2 steps numeric predictor original
## 3 intersects_ohio logical predictor original
## 4 intersects_allegheny logical predictor original
## 5 intersects_monongahela logical predictor original
## 6 od_id nominal id original
## 7 duration numeric outcome original
model_recipe_prep <- model_recipe %>%
prep()
#apply cv to training data
training_vfold <- vfold_cv(training_data, v = 10, repeats = 2)
#model specification
lm_model <- linear_reg(mode = "regression") %>%
set_engine("lm")
#linear regression workflow
lm_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(lm_model)
#fit against training resamples
keep_pred <- control_resamples(save_pred = TRUE)
lm_training_fit <- lm_workflow %>%
fit_resamples(training_vfold, control = keep_pred) %>%
mutate(model = "lm")
#get results from training cv
lm_training_fit %>%
collect_metrics()
## # A tibble: 2 x 6
## model .metric .estimator mean n std_err
## <chr> <chr> <chr> <dbl> <int> <dbl>
## 1 lm rmse standard 3.22 20 0.0691
## 2 lm rsq standard 0.822 20 0.00730
The model averaged an R-squared of .82 on the training data, which is pretty good.
The predictions from the training set fit the actual duration pretty well.
#graph predictions from assessment sets
lm_training_fit %>%
collect_predictions() %>%
ggplot(aes(duration, .pred)) +
geom_point(alpha = .3) +
geom_abline(linetype = 2, color = "red") +
coord_equal() +
labs(x = "Actual duration",
y = "Predicted duration")
Next I fit the model against the test data to extract the coefficients. Holding the other variables constant, distance is by far the most influential variable in the model. For every kilometer increase in distance, the duration of the commute can be expected to increase by around 5 minutes. Crossing the Monongahela will add around 2 minutes to a commute, while crossing the Allegheny and Ohio actually decrease commute times. This is probably related to the bridge that the commuter uses.
#variable importance
lm_workflow %>%
fit(testing_data) %>%
pull_workflow_fit() %>%
tidy() %>%
filter(term != "(Intercept)") %>%
mutate(term = fct_reorder(term, estimate)) %>%
ggplot(aes(estimate, term)) +
geom_col(fill = "grey", color = "black")
This fits the model to the full dataset and plots the predicted duration against the actual duration. The fit is tighter than just plotting distance vs. duration.
#final model
tract_od_pred <- lm_workflow %>%
fit(testing_data) %>%
predict(tract_od_stats_rivers) %>%
bind_cols(tract_od_stats_rivers) %>%
select(h_tract, w_tract, distance, steps, duration, .pred, commuters)
tract_od_pred %>%
ggplot(aes(duration, .pred, size = commuters)) +
geom_point(alpha = .3) +
geom_abline(lty = 2, color = "red") +
coord_equal() +
labs(x = "Duration in minutes",
y = "Predicted duration",
size = "Number of commuters")
This calculates how far off the model’s estimation of duration was for each census tract in the dataset (origin and destination). Commuters originating from neighborhoods between State Route 51 and the Monongahela River experience longer than expected commutes.
allegheny_county_tracts %>%
st_drop_geometry() %>%
left_join(tract_od_pred %>%
mutate(.resid = duration - .pred) %>%
select(h_tract, w_tract, .resid) %>%
pivot_longer(contains("tract")) %>%
group_by(name, value) %>%
summarize(avg_resid = mean(.resid)) %>%
ungroup(),
by = c("GEOID" = "value")) %>%
complete(GEOID, name) %>%
filter(!is.na(name)) %>%
left_join(allegheny_county_tracts) %>%
mutate(name = case_when(name == "h_tract" ~ "Origin tract",
name == "w_tract" ~ "Destination tract"),
name = as.factor(name) %>% fct_rev()) %>%
st_sf() %>%
ggplot() +
geom_sf(aes(fill = avg_resid), size = .1) +
facet_wrap(~name, ncol = 1) +
scale_fill_viridis_c(na.value = "grey90") +
labs(title = "Commute duration above/below expected",
fill = "Minutes") +
theme_void()