House Price Estimator Dashboard

Click here to view the full dashboard

Model discussion

I trained three models on the assessment data:

  • Linear model
  • Random Forest
  • Bagged Tree

I chose the Bagged Tree model because it performed about as well as the Random Forest but generates predictions on new data much faster. Prediction speed is important because the dashboard UI has to update very quickly.
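A prediction-speed comparison like the one behind this choice can be sketched with {bench} (the object names `rf_fit`, `bag_fit`, and `new_data` are illustrative stand-ins for the fitted workflows and a sample of new observations, not the project's actual objects):

```r
library(bench)

# Time predictions from each fitted model against the same new data.
# check = FALSE because the two models' predictions will not be identical.
bench::mark(
  random_forest = predict(rf_fit, new_data),
  bagged_tree   = predict(bag_fit, new_data),
  check = FALSE
)
```

`bench::mark()` reports median time and memory per expression, which makes the latency gap between the two ensembles easy to compare directly.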

Total observations: 160,181

Training set metrics (75% of total observations)

I used 10-fold cross-validation to assess model performance against the training set:

train_metrics %>% 
  select(model_name, id, .metric, .estimate) %>% 
  pivot_wider(names_from = .metric, values_from = .estimate) %>% 
  ggplot(aes(rmse, rsq, color = model_name)) +
  geom_point() +
  scale_x_continuous(labels = dollar) +
  labs(x = "Root Mean Squared Error",
       y = "R^2")
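A minimal sketch of the split and resampling setup that produces metrics like these, assuming a standard tidymodels pipeline (the data and object names are illustrative):

```r
library(tidymodels)

set.seed(1234)

# 75/25 train/test split of the assessment data
house_split <- initial_split(assessment_data, prop = 0.75)
house_train <- training(house_split)
house_test  <- testing(house_split)

# 10-fold cross-validation resamples from the training set
house_folds <- vfold_cv(house_train, v = 10)
```

Fitting each candidate workflow with `fit_resamples()` against `house_folds` and calling `collect_metrics(summarize = FALSE)` yields per-fold estimates in the shape plotted above.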

Test set metrics (25% of total observations)

test_metrics %>% 
  select(.metric, .estimate)
## # A tibble: 3 x 2
##   .metric .estimate
##   <chr>       <dbl>
## 1 rmse    54470.   
## 2 rsq         0.870
## 3 mape       22.2
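Test-set metrics in this shape can be produced with a yardstick metric set. A sketch, assuming the final fitted workflow is `final_fit`, the holdout is `house_test`, and predictions are on the log10 scale (consistent with the `10^.pred` back-transform used below):

```r
library(tidymodels)

# rmse, rsq, and mape in one call
house_metrics <- metric_set(rmse, rsq, mape)

final_fit %>% 
  predict(house_test) %>% 
  bind_cols(house_test) %>% 
  mutate(.pred_dollars = 10^.pred) %>%   # back-transform from log10 to dollars
  house_metrics(truth = sale_price_adj, estimate = .pred_dollars)
```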

Observed vs. Predicted (full dataset)

full_model_results %>% 
  ggplot(aes(sale_price_adj, 10^.pred)) +
  geom_density_2d_filled() +
  geom_abline(lty = 2, color = "white") +
  scale_x_log10(labels = dollar) +
  scale_y_log10(labels = dollar) +
  coord_cartesian(xlim = c(30000, 10^6),
                  ylim = c(30000, 10^6)) +
  labs(x = "Actual sale price",
       y = "Predicted sale price",
       title = "Observed vs. Predicted sale price",
       fill = "Density of observations")

Variable Importance

var_imp %>% 
  mutate(term = fct_reorder(term, value)) %>% 
  ggplot(aes(value, term)) +
  geom_point() +
  scale_x_comma() +
  labs(x = "Importance",
       y = NULL)
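An importance table with `term` and `value` columns matches the shape that {baguette} stores on a fitted bagged-tree model. A sketch of extracting it, assuming the fitted workflow is `bag_fit` (an illustrative name):

```r
library(workflows)
library(purrr)

# baguette's bagger objects carry an `imp` tibble (term, value, std.error, used)
var_imp <- bag_fit %>% 
  extract_fit_engine() %>% 
  pluck("imp")
```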

Model performance by geo_id

geo_id_shapes %>% 
  left_join(geo_id_rsq, by = "geo_id") %>% 
  ggplot() +
  geom_sf(aes(fill = .estimate)) +
  scale_fill_viridis_c() +
  labs(fill = "R-squared") +
  theme_void()
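Per-geography R² values like the ones mapped above can be computed by grouping the full-dataset predictions before calling the metric, since yardstick metrics respect `group_by()`. A sketch (column names are assumptions based on the code above):

```r
library(tidymodels)

geo_id_rsq <- full_model_results %>% 
  mutate(.pred_dollars = 10^.pred) %>%   # back-transform from log10
  group_by(geo_id) %>% 
  rsq(truth = sale_price_adj, estimate = .pred_dollars)
```

The result has one row per `geo_id` with the R² in `.estimate`, which is what the choropleth's fill aesthetic maps.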