library(tidyverse)
library(lubridate)
library(prophet)
library(janitor)
library(hrbrthemes)
options(scipen = 999)
theme_set(theme_bw())
This post is about predicting demand for the Healthy Ride bike system in Pittsburgh. I wanted to try out Facebook’s prophet
package and try to do some time series forecasting.
As usual, load the required packages and set up the environment:
<- read_csv("post_data/combined_ride_data.csv") %>%
df filter(date < "2019-08-01") %>%
count(date)
This code loads the data and formats the date column so the prophet
package can interface with it. I use dir()
to find all the CSV files in the folder and then set_names()
and map_df(read_csv())
to read each of the CSV files into memory.
<- dir("data", ".csv")
files
<- str_c("data/", files) %>%
data set_names() %>%
map_df(read_csv) %>%
clean_names()
<- data %>%
df filter(date < "2019-08-01") %>%
count(date)
# df %>%
# ggplot(aes(date, n)) +
# geom_point()
#
# last(df$date)
The data I will use contains the number of rides per day and also includes the month and year. prophet
will identify the time series patterns (“seasonality”) in the data and identify the “true” pattern
prophet
has a plug-and-play workflow that is easy to use, but it has more stringent requirements for how the data has to be shaped. The date data has to be named ds
and the target variable has to be named y
. I set the floor to zero because there cannot be fewer than 0 rides in a day. prophet
requires a cap
<- df %>%
max_rides summarize(max_rides = max(n) * 3) %>%
pull()
<- df %>%
df mutate(n = log(n),
cap = log(max_rides)) %>%
filter(!is.na(date)) %>%
rename(ds = date,
y = n) %>%
mutate(floor = 0)
%>%
df filter(is.na(ds))
# A tibble: 0 × 4
# ℹ 4 variables: ds <date>, y <dbl>, cap <dbl>, floor <dbl>
glimpse(df)
Rows: 1,518
Columns: 4
$ ds <date> 2015-05-31, 2015-06-01, 2015-06-02, 2015-06-03, 2015-06-04, 201…
$ y <dbl> 6.173786, 4.836282, 4.934474, 4.875197, 5.361292, 5.613128, 5.94…
$ cap <dbl> 8.013343, 8.013343, 8.013343, 8.013343, 8.013343, 8.013343, 8.01…
$ floor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
This creates the set of holidays I use in the model.
<- prophet::generated_holidays %>%
us_holidays as_tibble() %>%
filter(country == "US") %>%
mutate(ds = as.Date(ds))
This code fits a model to the dataset.
<- prophet(df, growth = 'logistic', holidays = us_holidays) m
make_future_dataframe()
creates the dataframe that prophet
uses to make its forecast. In this case, I have it create a dataframe with 365 days of additional rows to predict onto.
<- make_future_dataframe(m, periods = 365, freq = "day") %>%
future mutate(floor = 0,
cap = unique(df$cap))
This code performs the forecast on the future dataset.
<- predict(m, future) %>%
forecast as_tibble()
The output is a dataframe with the date, the predicted ridership, and the upper and lower bounds of the prediction.
forecast
# A tibble: 1,883 × 66
ds trend cap floor additive_terms additive_terms_lower
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2015-05-31 00:00:00 5.13 8.01 0 0.720 0.720
2 2015-06-01 00:00:00 5.13 8.01 0 0.658 0.658
3 2015-06-02 00:00:00 5.13 8.01 0 0.724 0.724
4 2015-06-03 00:00:00 5.13 8.01 0 0.766 0.766
5 2015-06-04 00:00:00 5.13 8.01 0 0.733 0.733
6 2015-06-05 00:00:00 5.13 8.01 0 0.784 0.784
7 2015-06-06 00:00:00 5.13 8.01 0 0.844 0.844
8 2015-06-07 00:00:00 5.13 8.01 0 0.754 0.754
9 2015-06-08 00:00:00 5.13 8.01 0 0.691 0.691
10 2015-06-09 00:00:00 5.13 8.01 0 0.755 0.755
# ℹ 1,873 more rows
# ℹ 60 more variables: additive_terms_upper <dbl>, `Christmas Day` <dbl>,
# `Christmas Day_lower` <dbl>, `Christmas Day_upper` <dbl>,
# `Christmas Day (Observed)` <dbl>, `Christmas Day (Observed)_lower` <dbl>,
# `Christmas Day (Observed)_upper` <dbl>, `Columbus Day` <dbl>,
# `Columbus Day_lower` <dbl>, `Columbus Day_upper` <dbl>, holidays <dbl>,
# holidays_lower <dbl>, holidays_upper <dbl>, `Independence Day` <dbl>, …
plot
automatically plots the forecast
data:
plot(m, forecast)
prophet
also decomposes the various seasonal effects.
prophet_plot_components(m, forecast)
We can of course use ggplot
to manually plot the data.
<- forecast %>%
df_aug mutate(ds = ymd(ds)) %>%
left_join(df) %>%
mutate(yhat = exp(yhat),
yhat_lower = exp(yhat_lower),
yhat_upper = exp(yhat_upper),
y = exp(y))
%>%
df_aug ggplot(aes(x = ds)) +
geom_ribbon(data = df_aug %>% filter(ds > last(df$ds)),
aes(ymin = yhat_lower, ymax = yhat_upper), alpha = .2, fill = "blue") +
geom_line(data = df_aug %>% filter(ds > last(df$ds)),
aes(y = yhat), color = "blue") +
geom_point(aes(y = y), alpha = .5) +
geom_hline(aes(yintercept = unique(floor)), linetype = 2) +
labs(x = NULL,
y = "Number of rides") +
scale_y_comma() +
theme_bw(base_size = 20)
prophet
also provides functions for cross-validation.
<- cross_validation(m, horizon = 30, units = 'days')
df_cv
performance_metrics(df_cv) %>%
as_tibble() %>%
gather(metric, measure, -horizon) %>%
ggplot(aes(horizon, measure)) +
geom_line() +
facet_wrap(~metric, scales = "free_y",
ncol = 1) +
labs(x = "Horizon",
y = "Measure") +
theme_bw()
We can also inspect the impact of holidays on the prediction.
<- forecast %>%
df_holiday_impact clean_names() %>%
select(ds, christmas_day:washingtons_birthday) %>%
select(-contains("upper"), -contains("lower")) %>%
pivot_longer(-ds, names_to = "holiday", values_to = "value") %>%
mutate(year = as.factor(year(ds))) %>%
filter(holiday != "holidays",
!= 0)
value
%>%
df_holiday_impact arrange(ds) %>%
mutate(holiday = as.factor(holiday)) %>%
ggplot(aes(holiday, value, color = year)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_jitter() +
coord_flip() +
labs(x = "Holiday",
y = "Impact") +
scale_color_discrete("Year") +
theme_ipsum()
Documentation and references
- https://facebook.github.io/prophet/
- https://data.wprdc.org/dataset/healthyride-trip-data