# House Price Estimator Dashboard
Click here to view the full dashboard
## Model discussion
I trained three models against the assessment data:
- Linear model
- Random Forest
- Bagged Tree
I chose the Bagged Tree model because it performed about as well as the Random Forest but predicts on new data much faster. Prediction speed matters because the dashboard UI has to update very quickly.
Total observations: 160,181
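The observations are split 75/25 into training and test sets (the post likely uses `rsample::initial_split()`; the sketch below is a hypothetical base-R illustration on toy data, not the real assessment set):

```r
set.seed(123)

# Toy stand-in for the assessment data (the real set has 160,181 rows)
assessments <- data.frame(id = 1:1000,
                          sale_price = exp(rnorm(1000, mean = 12, sd = 0.5)))

# Draw 75% of row indices for training; the remainder is the test set
train_idx <- sample(nrow(assessments), size = floor(0.75 * nrow(assessments)))
train <- assessments[train_idx, ]
test  <- assessments[-train_idx, ]

nrow(train)  # 750
nrow(test)   # 250
```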
## Training set metrics (75% of total observations)
I used 10-fold cross-validation to assess model performance against the training set:
```r
train_metrics %>%
  select(model_name, id, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  ggplot(aes(rmse, rsq, color = model_name)) +
  geom_point() +
  scale_x_continuous(labels = dollar) +
  # coord_cartesian(xlim = c(65000, 79000)) +
  labs(x = "Root Mean Squared Error",
       y = "R^2")
```
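The 10-fold cross-validation itself is presumably handled by `rsample`/`tune`; a minimal base-R sketch of the idea, using a linear model on the built-in `mtcars` data rather than the assessment set:

```r
set.seed(42)

# Randomly assign each row to one of 10 folds
k <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: fit on the other 9 folds, predict the held-out fold,
# and record the held-out RMSE
rmse_by_fold <- sapply(1:k, function(i) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
})

# Average held-out error across the 10 folds
mean(rmse_by_fold)
```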
## Test set metrics (25% of total observations)
```r
test_metrics %>%
  select(.metric, .estimate)
## # A tibble: 3 x 2
##   .metric .estimate
##   <chr>       <dbl>
## 1 rmse       54470.
## 2 rsq            0.870
## 3 mape          22.2
```
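These three metrics are simple to compute by hand; a self-contained sketch on toy vectors, with `rsq` as the squared correlation (which matches `yardstick`'s definition) and `mape` as mean absolute percentage error:

```r
truth    <- c(100000, 250000, 175000, 300000, 90000)
estimate <- c(110000, 240000, 150000, 320000, 100000)

# Root mean squared error, in dollars
rmse <- sqrt(mean((truth - estimate)^2))

# R-squared as the squared correlation between truth and estimate
rsq <- cor(truth, estimate)^2

# Mean absolute percentage error, in percent
mape <- mean(abs((truth - estimate) / truth)) * 100

c(rmse = rmse, rsq = rsq, mape = mape)
```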
## Observed vs. Predicted (full dataset)
```r
full_model_results %>%
  ggplot(aes(sale_price_adj, 10^.pred)) +
  geom_density_2d_filled() +
  geom_abline(lty = 2, color = "white") +
  scale_x_log10(labels = dollar) +
  scale_y_log10(labels = dollar) +
  coord_cartesian(xlim = c(30000, 10^6),
                  ylim = c(30000, 10^6)) +
  labs(x = "Actual sale price",
       y = "Predicted sale price",
       title = "Observed vs. Predicted sale price",
       fill = "Density of observations")
```
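The `10^.pred` above suggests the model was fit on `log10(sale_price)` and its predictions are back-transformed to dollars for plotting. A toy illustration of that pattern (hypothetical `sqft` predictor, not the post's actual feature set):

```r
set.seed(1)

# Simulated sale prices that are roughly log-linear in square footage
df <- data.frame(sqft = runif(200, 800, 4000))
df$sale_price <- 10^(4 + 0.0004 * df$sqft + rnorm(200, 0, 0.05))

# Fit on the log10 scale, as the dashboard model appears to do
fit <- lm(log10(sale_price) ~ sqft, data = df)

# Back-transform predictions to dollars before plotting or reporting
pred_dollars <- 10^predict(fit)
head(pred_dollars)
```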
## Variable Importance
```r
var_imp %>%
  mutate(term = fct_reorder(term, value)) %>%
  ggplot(aes(value, term)) +
  geom_point() +
  scale_x_comma() +
  labs(x = "Importance",
       y = NULL)
```
## Model performance by `geo_id`
```r
geo_id_shapes %>%
  left_join(geo_id_rsq) %>%
  ggplot() +
  geom_sf(aes(fill = .estimate)) +
  scale_fill_viridis_c() +
  labs(fill = "R-squared") +
  theme_void()
```
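The per-geography R² values joined onto the shapes above (`geo_id_rsq` in the post) can be computed by grouping the model results on `geo_id`; a base-R sketch on simulated observed/predicted pairs:

```r
set.seed(7)

# Simulated model results across three hypothetical geographies
results <- data.frame(
  geo_id   = rep(c("A", "B", "C"), each = 50),
  observed = rnorm(150, mean = 12, sd = 1)
)
results$predicted <- results$observed + rnorm(150, 0, 0.3)

# R-squared (squared correlation) within each geo_id
rsq_by_geo <- sapply(split(results, results$geo_id), function(d)
  cor(d$observed, d$predicted)^2)

rsq_by_geo
```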