Supervised machine learning
In the “cluster of six”, we used unsupervised machine learning to reveal hidden structure in unlabelled data and analyse the voting patterns of Labour Members of Parliament. In this blog post, we’ll use supervised machine learning to see how well we can predict crime in London. Not specific crimes, perhaps, but we can use recorded crime summary data at London borough level (non-personal aggregated data licensed under the Open Government Licence) to predict crime counts with some degree of accuracy.
Along the way, we’ll see the pay-off from an exploration of multiple models.
Why might one want to predict crime counts? Perhaps we have the responsibility to deploy, or plan a budget for, police resources. Maybe we are considering where we might invest in additional CCTV infrastructure. Robust models can support decision-making by providing predictions grounded in facts, and are especially useful where complexity in the data is otherwise harder to unpick.
No “one size fits all”
There is a range of modelling techniques available. At one end of the spectrum, we have the more intuitive and interpretable models: given conditions A and B, we can anticipate outcome X. At the other end, we have more powerful and complex models where one accepts the hidden nature of their inner machinery in exchange for potentially greater predictive power.
There is no “one size fits all”, and the predictive power of each model will vary depending on the data being modelled. That alone is a good reason to consider multiple models. And I don’t mind admitting that I encountered another very good reason whilst preparing my five-model analysis for this post.
I almost abandoned the model that ultimately delivered the lowest prediction error. It was because the other models were generating stronger predictions that I questioned my execution of the fifth. A fuller, but by no means exhaustive, account of the methodology, including code, is available here.
Building familiarity with the data is an important first step. We’ll begin with 32 mini-plots, one for each London borough. Each shows the crime trends across nine major crime categories. What does this tell us?
Borough is likely to be a key predictor given the considerable variation in crime counts associated with this categorical variable. Contrast, for example, the vertical scaling for “Westminster” with “Sutton”, in the bottom-right corner.
Major crime category will also likely be a key predictor, with “theft & handling” and “violence against the person” associated with significantly more crime across all London boroughs.
There is also a possible interplay between borough and crime category which we may need to account for in models sensitive to interaction. This is evident where more affluent boroughs, or those attracting more visitors, such as “Kensington & Chelsea”, and “Westminster”, have significantly higher counts for “theft & handling”. Contrast these boroughs with, for example, “Lewisham”, where “violence against the person” plays a more significant role.
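One way to check for this kind of interplay is to compare a model with separate borough and category effects against one that also includes their interaction. The sketch below is illustrative only: the data are synthetic, the column names (`borough`, `category`, `log_count`) are hypothetical, and a linear model on log counts is my assumption rather than the approach used in the supporting documentation.

```r
# Synthetic stand-in for the borough-level data (all values are made up).
set.seed(42)
crime <- expand.grid(
  borough  = c("Westminster", "Lewisham", "Sutton"),
  category = c("theft & handling", "violence against the person", "burglary"),
  month    = 1:12,
  stringsAsFactors = TRUE
)

# Baseline level per borough and per category, on the log scale.
borough_effect  <- c(Westminster = 1.0, Lewisham = 0.5, Sutton = 0.0)
category_effect <- c("theft & handling" = 1.0,
                     "violence against the person" = 0.8,
                     "burglary" = 0.0)

# Extra theft & handling in Westminster only: this is the interaction.
extra <- ifelse(crime$borough == "Westminster" &
                  crime$category == "theft & handling", 1.0, 0.0)

crime$log_count <- 4 +
  unname(borough_effect[as.character(crime$borough)]) +
  unname(category_effect[as.character(crime$category)]) +
  extra + rnorm(nrow(crime), sd = 0.1)

# Additive model versus a model with borough-by-category terms.
additive         <- lm(log_count ~ borough + category, data = crime)
with_interaction <- lm(log_count ~ borough * category, data = crime)

# F-test: do the interaction terms improve the fit?
comparison <- anova(additive, with_interaction)
print(comparison)
```

If the interaction terms reduce the residual sum of squares substantially, as they should on this synthetic data by construction, that suggests the effect of crime category differs between boroughs. Tree-based models can pick up such effects through successive splits, whereas a linear model needs the interaction terms stated explicitly.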
A summary of each potential predictor also exposes its possible influence, for example, the growth in crime count over time. (I may dedicate a future post to time-series forecasting.)
Do all four potential predictors matter?
One way to address this question is to use recursive partitioning to create a tree diagram. At the top of the tree we have 100% of the more than eleven thousand observations. The first, and most important, split is based on the major crime category: 23% of the observations are partitioned off to the right (to node 3) for “theft & handling” (abbreviated as T&H) and “violence against the person” (VATP), with the balance branching left.
Similarly, borough appears early in the recursive partitioning where node 3 splits based on this variable.
We could go to lower levels of granularity, but our purpose here is a preliminary assessment of the most important variables. This shows that month is of lesser significance. However, we’ll keep it in our initial modelling to see if it’s significant enough to influence our models’ predictive power.
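As a rough illustration of this step, the sketch below fits a regression tree with the `rpart` package cited at the end of this post. The data are synthetic and the column names (`borough`, `category`, `month`, `year`, `count`) are hypothetical stand-ins, so the splits it finds are not those of the real analysis.

```r
library(rpart)

# Synthetic stand-in data: counts driven mainly by category, then borough.
set.seed(1)
crime <- expand.grid(
  borough  = c("Westminster", "Lewisham", "Sutton"),
  category = c("T&H", "VATP", "Burglary", "Drugs"),
  month    = 1:12,
  year     = 2015:2017,
  stringsAsFactors = TRUE
)
category_mean <- c("T&H" = 200, VATP = 150, Burglary = 20, Drugs = 10)
borough_scale <- c(Westminster = 3, Lewisham = 1.5, Sutton = 1)

# Month and year deliberately have no effect in this made-up data.
crime$count <- pmax(0, round(
  unname(category_mean[as.character(crime$category)] *
           borough_scale[as.character(crime$borough)]) +
    rnorm(nrow(crime), sd = 5)
))

# Regression tree ("anova" method) on all four candidate predictors.
fit <- rpart(count ~ borough + category + month + year,
             data = crime, method = "anova")

print(fit)                                    # text view of the splits
root_variable <- as.character(fit$frame$var[1])  # variable used at the top split
importance    <- fit$variable.importance         # named vector, largest first
print(importance)
```

The variable chosen for the first split, and the ordering in `variable.importance`, give a quick preliminary read on which predictors matter most before any heavier modelling.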
Training and testing our models
Cross-validation is a comparatively simple and popular method for estimating prediction error. We’ll use repeated cross-validation to train the models on randomly selected cuts of the data, and validate them on the remaining cut. Because each model is scored on observations it was not trained on, this approach is designed to strengthen the models’ ability to perform well on as-yet-unseen observations.
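To make the mechanics concrete, here is a minimal base R sketch of repeated k-fold cross-validation. It is not the code behind this post: the data are synthetic, the helper `repeated_cv_rmse` is a hypothetical name, and a plain linear model stands in for the five models compared later.

```r
set.seed(123)

# Synthetic data: a numeric outcome driven by one categorical and one numeric predictor.
n <- 200
dat <- data.frame(
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  x     = runif(n)
)
dat$y <- 10 + 5 * (dat$group == "b") + 12 * (dat$group == "c") +
  3 * dat$x + rnorm(n)

# Hypothetical helper: returns one RMSE per held-out fold, k * repeats in total.
repeated_cv_rmse <- function(formula, data, k = 5, repeats = 3) {
  response <- all.vars(formula)[1]
  fold_rmse <- numeric(0)
  for (r in seq_len(repeats)) {
    # Fresh random assignment of rows to folds on every repeat.
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (i in seq_len(k)) {
      train <- data[fold != i, , drop = FALSE]
      test  <- data[fold == i, , drop = FALSE]
      fit   <- lm(formula, data = train)       # train on k - 1 folds
      pred  <- predict(fit, newdata = test)    # predict the held-out fold
      fold_rmse <- c(fold_rmse, sqrt(mean((test[[response]] - pred)^2)))
    }
  }
  fold_rmse
}

fold_rmse <- repeated_cv_rmse(y ~ group + x, dat)
print(mean(fold_rmse))   # average held-out error across all folds and repeats
```

In practice a package does this bookkeeping for you. With caret, which appears in the toolkit table below, repeated cross-validation is requested via `trainControl(method = "repeatedcv", number = ..., repeats = ...)` and passed to `train` through its `trControl` argument.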
There are many modelling choices we can make to enhance the models’ performance, for example: the initial selection of models, how we pre-process the data, and how we use tuning parameters. The choices made for this post (and I by no means explored every possibility) are discussed in the supporting documentation.
For the purposes of this article, we’ll jump to assessing the models’ predictions versus the known actuals to see how they performed.
Comparing predictive power
Optimal predictions sit close to, or on, the dashed line in the graphic below, i.e. where the prediction for each observation equals the actual. The Root Mean Squared Error (RMSE) is the square root of the average squared difference between prediction and actual, so should be as small as possible. R-squared measures how closely predictions and actuals track each other, where 0 reflects no linear relationship and 1 a perfect one.
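Both measures are simple to compute by hand. The sketch below uses made-up numbers and hypothetical function names, and shows two common definitions of R-squared, since packages differ in which one they report.

```r
# Made-up actual and predicted counts for six observations.
actual    <- c(120, 80, 45, 300, 210, 15)
predicted <- c(110, 95, 40, 280, 230, 20)

# Root Mean Squared Error: square root of the mean squared difference.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared as the squared correlation between prediction and actual.
r2_correlation <- function(actual, predicted) {
  cor(actual, predicted)^2
}

# R-squared as the share of variation in the actuals accounted for by the predictions.
r2_explained <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

print(rmse(actual, predicted))
print(r2_correlation(actual, predicted))
print(r2_explained(actual, predicted))
```

RMSE is expressed in the same units as the outcome (here, crime counts), which makes it easy to interpret, and because the differences are squared before averaging, a few large misses weigh more heavily than many small ones.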
Our supervised machine learning outcomes from the CART and GLM models have weaker (higher) RMSEs, and visually exhibit some dispersion in the predictions at higher counts. Stochastic Gradient Boosting, Cubist and Random Forest have handled the higher counts better, as we see from the visually tighter clustering.
It was Random Forest that produced the smallest prediction error, by a narrow margin. And it was a parameter unique to the Random Forest model which almost tripped me up, as discussed in the supporting documentation.
The moral of the story reinforces the value of exploring multiple models. One can’t be certain which is best adapted to the data in hand. And model comparison also provides a very helpful check and balance from which the ultimate outcome may be all the stronger.
|Package|Functions|
|---|---|
|dplyr|mutate; select; group_by; summarise; filter; rename|
|stringr|str_c; str_pad; str_replace_all; str_wrap; str_detect|
|caret|trainControl; train; varImp|
|modelr|gather_residuals; spread_predictions; rmse; rsquare|
|ggplot2|geom_line; geom_smooth; facet_wrap; geom_point; geom_abline; geom_hline; geom_text; geom_col|
Citations / Attributions
Contains public sector information licensed under the Open Government Licence v3.0.
R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Max Kuhn and Ross Quinlan (2017). Cubist: Rule- And Instance-Based Regression Modeling. R package version 0.2.1. https://CRAN.R-project.org/package=Cubist
Terry Therneau, Beth Atkinson and Brian Ripley (2017). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-11.
Marvin N. Wright, Andreas Ziegler (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi:10.18637/jss.v077.i01
Greg Ridgeway with contributions from others (2017). gbm: Generalized Boosted Regression Models. R package version 2.1.3. https://CRAN.R-project.org/package=gbm
Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2017). caret: Classification and Regression Training. R package version 6.0-78. https://CRAN.R-project.org/package=caret