Forecasting Healthy Ride Ridership With Prophet

R
Pittsburgh
Healthy Ride
Author

Conor Tompkins

Published

August 3, 2019

This post is about predicting demand for the Healthy Ride bike system in Pittsburgh. I wanted to try out Facebook’s prophet package and try to do some time series forecasting.

As usual, load the required packages and set up the environment:

library(tidyverse)
library(lubridate)
library(prophet)
library(janitor)
library(hrbrthemes)

options(scipen = 999)

theme_set(theme_bw())
df <- read_csv("post_data/combined_ride_data.csv") %>%
  filter(date < "2019-08-01") %>% 
  count(date)

This code loads the data and formats the date column so the prophet package can interface with it. I use dir() to find all the CSV files in the folder and then set_names() and map_df(read_csv()) to read each of the CSV files into memory.

files <- dir("data", ".csv")

data <- str_c("data/", files) %>% 
  set_names() %>% 
  map_df(read_csv) %>% 
  clean_names()

df <- data %>% 
  filter(date < "2019-08-01") %>% 
  count(date)

# df %>% 
#   ggplot(aes(date, n)) +
#   geom_point()
# 
# last(df$date)

The data I will use contains the number of rides per day and also includes the month and year. prophet will identify the time series patterns (“seasonality”) in the data and identify the “true” pattern

prophet has a plug-and-play workflow that is easy to use, but it has more stringent requirements for how the data has to be shaped. The date data has to be named ds and the target variable has to be named y. I set the floor to zero because there cannot be fewer than 0 rides in a day. prophet requires a cap

max_rides <- df %>% 
  summarize(max_rides = max(n) * 3) %>% 
  pull()

df <- df %>% 
  mutate(n = log(n),
         cap = log(max_rides)) %>% 
  filter(!is.na(date)) %>% 
  rename(ds = date,
         y = n) %>% 
  mutate(floor = 0)



df %>% 
  filter(is.na(ds))
# A tibble: 0 × 4
# ℹ 4 variables: ds <date>, y <dbl>, cap <dbl>, floor <dbl>
glimpse(df)
Rows: 1,518
Columns: 4
$ ds    <date> 2015-05-31, 2015-06-01, 2015-06-02, 2015-06-03, 2015-06-04, 201…
$ y     <dbl> 6.173786, 4.836282, 4.934474, 4.875197, 5.361292, 5.613128, 5.94…
$ cap   <dbl> 8.013343, 8.013343, 8.013343, 8.013343, 8.013343, 8.013343, 8.01…
$ floor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

This creates the set of holidays I use in the model.

us_holidays <- prophet::generated_holidays %>% 
  as_tibble() %>% 
  filter(country == "US") %>% 
  mutate(ds = as.Date(ds))

This code fits a model to the dataset.

m <- prophet(df, growth = 'logistic', holidays = us_holidays)

make_future_dataframe() creates the dataframe that prophet uses to make its forecast. In this case, I have it create a dataframe with 365 days of additional rows to predict onto.

future <- make_future_dataframe(m, periods = 365, freq = "day") %>% 
  mutate(floor = 0,
         cap = unique(df$cap))

This code performs the forecast on the future dataset.

forecast <- predict(m, future) %>% 
  as_tibble()

The output is a dataframe with the date, the predicted ridership, and the upper and lower bounds of the prediction.

forecast
# A tibble: 1,883 × 66
   ds                  trend   cap floor additive_terms additive_terms_lower
   <dttm>              <dbl> <dbl> <dbl>          <dbl>                <dbl>
 1 2015-05-31 00:00:00  5.13  8.01     0          0.720                0.720
 2 2015-06-01 00:00:00  5.13  8.01     0          0.658                0.658
 3 2015-06-02 00:00:00  5.13  8.01     0          0.724                0.724
 4 2015-06-03 00:00:00  5.13  8.01     0          0.766                0.766
 5 2015-06-04 00:00:00  5.13  8.01     0          0.733                0.733
 6 2015-06-05 00:00:00  5.13  8.01     0          0.784                0.784
 7 2015-06-06 00:00:00  5.13  8.01     0          0.844                0.844
 8 2015-06-07 00:00:00  5.13  8.01     0          0.754                0.754
 9 2015-06-08 00:00:00  5.13  8.01     0          0.691                0.691
10 2015-06-09 00:00:00  5.13  8.01     0          0.755                0.755
# ℹ 1,873 more rows
# ℹ 60 more variables: additive_terms_upper <dbl>, `Christmas Day` <dbl>,
#   `Christmas Day_lower` <dbl>, `Christmas Day_upper` <dbl>,
#   `Christmas Day (Observed)` <dbl>, `Christmas Day (Observed)_lower` <dbl>,
#   `Christmas Day (Observed)_upper` <dbl>, `Columbus Day` <dbl>,
#   `Columbus Day_lower` <dbl>, `Columbus Day_upper` <dbl>, holidays <dbl>,
#   holidays_lower <dbl>, holidays_upper <dbl>, `Independence Day` <dbl>, …

plot automatically plots the forecast data:

plot(m, forecast)

prophet also decomposes the various seasonal effects.

prophet_plot_components(m, forecast)

We can of course use ggplot to manually plot the data.

df_aug <- forecast %>% 
  mutate(ds = ymd(ds)) %>% 
  left_join(df) %>% 
  mutate(yhat = exp(yhat),
         yhat_lower = exp(yhat_lower),
         yhat_upper = exp(yhat_upper),
         y = exp(y))

df_aug %>% 
  ggplot(aes(x = ds)) +
    geom_ribbon(data = df_aug %>% filter(ds > last(df$ds)), 
                aes(ymin = yhat_lower, ymax = yhat_upper), alpha = .2, fill = "blue") +
    geom_line(data = df_aug %>% filter(ds > last(df$ds)), 
              aes(y = yhat), color = "blue") +
    geom_point(aes(y = y), alpha = .5) +
    geom_hline(aes(yintercept = unique(floor)), linetype = 2) +
    labs(x = NULL,
         y = "Number of rides") +
    scale_y_comma() +
  theme_bw(base_size = 20)

prophet also provides functions for cross-validation.

df_cv <- cross_validation(m, horizon = 30, units = 'days')

performance_metrics(df_cv) %>% 
  as_tibble() %>% 
  gather(metric, measure, -horizon) %>% 
  ggplot(aes(horizon, measure)) +
  geom_line() +
  facet_wrap(~metric, scales = "free_y",
             ncol = 1) +
  labs(x = "Horizon",
       y = "Measure") +
  theme_bw()

We can also inspect the impact of holidays on the prediction.

df_holiday_impact <- forecast %>% 
  clean_names() %>% 
  select(ds, christmas_day:washingtons_birthday) %>% 
  select(-contains("upper"), -contains("lower")) %>% 
  pivot_longer(-ds, names_to = "holiday", values_to = "value") %>% 
  mutate(year = as.factor(year(ds))) %>% 
  filter(holiday != "holidays",
         value != 0)

df_holiday_impact %>% 
  arrange(ds) %>% 
  mutate(holiday = as.factor(holiday)) %>% 
  ggplot(aes(holiday, value, color = year)) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_jitter() +
  coord_flip() +
  labs(x = "Holiday",
       y = "Impact") +
  scale_color_discrete("Year") +
  theme_ipsum()

Documentation and references

  • https://facebook.github.io/prophet/
  • https://data.wprdc.org/dataset/healthyride-trip-data