Machine Learning algorithms can be applied to geospatial datasets to solve problems that require extensive classification, clustering, or predictive processing. The Random Forest algorithm, developed by Leo Breiman and Adele Cutler, is a popular supervised 'Ensemble'-based Machine Learning technique which can perform such processing workflows at scale with speed and accuracy. The methodology initially involves 'training' the algorithm on sample datasets. Thereafter, the algorithm sifts through vast quantities of unknown data by creating 'decision trees' based on the parameters it was trained on - which eventually helps it home in on a well-calculated prediction - 'scientific guesswork', in a manner of speaking. The Forest-based Classification and Regression tool, which this post is about, is an adaptation of the Random Forest algorithm specifically designed to work on spatial information.
Below is an excellent explainer -
Video 1: Forest based Classification & Regression Machine Learning Algorithm explained
Source: Esri's Spatial Data Science MOOC
Like what you've seen? See another, slightly longer (5-minute) video explainer here. You'll be able to relate better to the examples highlighted below.
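For the programmatically inclined, here is a minimal sketch of the train-then-predict cycle described above. It uses scikit-learn's RandomForestClassifier on purely synthetic data (the workflows in this post use ArcGIS Pro's tool, which is an adaptation of the same underlying algorithm):

```python
# A minimal train-then-predict sketch of a Random Forest workflow.
# The data here is synthetic; a real workflow would use labelled
# geospatial samples as the training set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# 500 samples with 3 features each (think: three raster bands per pixel)
X = rng.normal(size=(500, 3))
# A made-up rule generates the labels the forest has to learn
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 100 decision trees 'vote' on every prediction
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)          # 'training' on supervised observations
predictions = model.predict(X_test)  # the 'scientific guesswork'

print(f"Accuracy on held-out data: {accuracy_score(y_test, predictions):.2f}")
```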
The image on the left in the depiction below shows Radar Imagery over a cross-section of the Chaco forest region in northern Paraguay - an area heavily affected by deforestation. The jet-black pixels are indicative of vegetation - some of them have been verified as such in-situ (those overlaid with blue polygons), and these form the supervised observations on which the algorithm will be trained. The grainy black-and-white pixels are indicative of barren land, i.e. areas stripped of tree cover - the supervised observations for these have been marked with yellow polygons.
The algorithm is deployed and learns from this training data. It then predicts the type of terrain for every pixel in the imagery, classifying each as either forested (green pixels) or deforested (gray pixels) in the image on the right of the depiction below. The output, as you'd observe, appears very accurate. This is a simple workflow - but imagine training the model on several parameters and running it over imagery of a much larger extent - it is a highly effective way of automating classification. Actually, you don't need to imagine - just have a look at the next workflow😊.
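In code terms, the workflow above boils down to treating each pixel as a row of band values and the verified polygons as the source of labels. Here is a hedged sketch - a random array stands in for the radar imagery so that it runs as-is; a real workflow would read the backscatter bands with a library such as rasterio:

```python
# Sketch: classifying every pixel of an image as forested / deforested.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

bands, height, width = 2, 100, 100
image = np.random.rand(bands, height, width)       # stand-in for radar imagery

# Pixels inside the blue (forest) and yellow (barren) polygons would form
# the training set: one feature row per pixel plus a 0/1 label.
train_pixels = np.random.rand(200, bands)          # stand-in for sampled pixels
train_labels = np.r_[np.ones(100), np.zeros(100)]  # 1 = forest, 0 = barren

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train_pixels, train_labels)

# Flatten the image to (n_pixels, n_bands), predict, reshape back to a map
flat = image.reshape(bands, -1).T
classified = model.predict(flat).reshape(height, width)
print(classified.shape)  # (100, 100): one forest/deforested label per pixel
```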
A slightly more complex workflow - here the training data comprises several features. The supervised observations classify parcels of agricultural land in Seville (Spain) by the crop type growing on them - Tomato, Wheat, Corn, and so on. The Forest-based algorithm is deployed to predict the crop type growing over each pixel and classify the entire extent of optical imagery over Seville, as depicted below, based on what it has learnt from the training samples.
Refer to the output generated in Figure 2 & Figure 3 - in-situ observations would likely confirm the accuracy of the predictions generated by the algorithm. I must emphasize that the higher the quality of the training dataset, the more accurate the algorithm's predictions are likely to be. The degree of randomness in the training dataset also plays an important role in the overall quality of the predictions - it helps reduce 'overfitting' - this aspect is highlighted in Video 1 above from the 01:40 mark.
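The same pattern extends to the multi-class case - the labels are simply crop names instead of a forest/barren flag. Another illustrative sketch (the crop names, band count, and data are all made up):

```python
# Multi-class variant: the forest votes among several crop types per pixel.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

crops = ["Tomato", "Wheat", "Corn"]
n_bands = 4                              # e.g. four optical bands per pixel

# Stand-ins for training samples drawn from the labelled parcels
X_train = np.random.rand(300, n_bands)
y_train = np.random.choice(crops, size=300)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict a crop label for every pixel of the (flattened) scene
scene_pixels = np.random.rand(10_000, n_bands)
crop_map = model.predict(scene_pixels)
print(crop_map[:5])                      # e.g. ['Wheat' 'Corn' ...]

# predict_proba exposes how confident the trees' vote was for each class
print(model.classes_, model.predict_proba(scene_pixels[:1]))
```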
One can use the algorithm to 'classify' any surface, including Urban areas. Isn't this fascinating?
Workflow 3: Using the Forest-based Algorithm to estimate the intensity of the variables affecting Voter Turnout during Elections
While the two workflows highlighted previously in this post involved applying the ML algorithm to spatial data in 'Raster' form, this workflow involves deploying it on spatial data in 'Vector' form.
In this workflow, I will highlight the application of the Forest-based Machine Learning algorithm in slightly more detail. The algorithm was deployed to estimate the potency of some of the parameters affecting Voter Turnout during national elections in the USA.
The purpose behind this study is as follows - the researchers want to form insights on how the 2020 National Elections in the USA will fare in terms of Voter Turnout, i.e. what percentage of the population in each county is expected to cast a vote. Information on the variables impacting Voter Turnout behaviour was obtained via supervised surveys conducted in select counties during 2019 - this forms the random subset, i.e. the dataset on which the algorithm will be trained and which will subsequently be used to estimate the Voter Turnout percentage across all counties in the USA for the 2020 National Elections. How well the algorithm fares will obviously be known only after the election concludes the following year; however, as a measure of the algorithm's accuracy, its predictions will be compared with the actual Voter Turnout data from the previous election, i.e. the 2016 National Elections in the USA.
The variables on which information was captured and aggregated at a county level via the supervised surveys are listed below (a code sketch follows the list) -
a. Percentage of the population with at most a High School education,
b. Median Age,
c. Per Capita Income,
d. Percent of population who own a selfie stick (A whacky variable😊 ), and
e. Distance to the nearest City Class
(There are 10 city classes for which data has been obtained, each corresponding to a population band. A city class of 10 measures the distance from a voter in the county to the nearest city with more than 100,000 residents, whereas a city class of 5 measures the distance to the nearest city with between 50,000 and 60,000 residents. Essentially, the researchers aim to study how intensely proximity to each category of urban centre impacts Voter Turnout behaviour.)
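Before we get to the tool itself, here is roughly what training a forest on these five variables might look like outside ArcGIS Pro - a minimal scikit-learn sketch in which synthetic rows stand in for the 2019 survey feed (all column names and values are illustrative, not the actual study data):

```python
# Sketch: regressing county-level Voter Turnout on the five survey variables.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n = 400  # pretend 400 counties were surveyed
df = pd.DataFrame({
    "pct_high_school_at_most":    rng.uniform(10, 60, n),
    "median_age":                 rng.uniform(25, 55, n),
    "per_capita_income":          rng.uniform(20_000, 80_000, n),
    "pct_selfie_stick_owners":    rng.uniform(0, 40, n),
    "dist_to_nearest_city_class": rng.uniform(0, 200, n),
})
# Fabricated target, purely so the sketch runs end-to-end
df["voter_turnout_pct"] = (
    40 + 0.3 * df["median_age"] + df["per_capita_income"] / 4_000
    + rng.normal(0, 5, n)
).clip(0, 100)

X, y = df.drop(columns="voter_turnout_pct"), df["voter_turnout_pct"]

# Regression flavour of the same forest: the trees average instead of vote
model = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=7)
model.fit(X, y)
print(f"Out-of-bag R-squared: {model.oob_score_:.3f}")
```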
On Esri's ArcGIS Pro software, the 'Forest-based Classification and Regression' geoprocessing tool allows the researchers to train the namesake algorithm on the supervised observations captured from the 2019 surveys -
Upon running the tool, the model predicted Voter Turnout for 2020 (as a percentage of each county's population) as below -
How accurate would the algorithm's prediction for the 2016 National Elections be when compared to the actual Voter Turnout data from that election?
The regression diagnostic results comparing the model's prediction for 2016 with the actual 2016 Voter Turnout data show a Coefficient of Determination (R-squared) of 61.9%. This implies that if the relationships the algorithm learnt from the variables were used to predict Voter Turnout behaviour in 2016, its estimates would be reasonably good - the model explains roughly 62% of the variation in the actual Voter Turnout figures.
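For reference, the diagnostic itself is simple to reproduce once predictions and actuals sit side by side - the turnout figures below are made up purely to show the calculation:

```python
# Computing the Coefficient of Determination from predictions vs actuals.
from sklearn.metrics import r2_score

actual_2016    = [55.2, 61.0, 47.8, 70.1, 66.3]  # hypothetical turnout %
predicted_2016 = [57.0, 59.5, 50.2, 68.4, 64.9]

print(f"R-squared: {r2_score(actual_2016, predicted_2016):.3f}")
# An R-squared of 0.619 would mean ~62% of the variance in actual
# turnout is accounted for by the model's predictions.
```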
Which variables would be the most accurate predictors of 2016 National Election Voter Turnout?
Among the five variables considered, 'Per Capita Income' and 'High School Education' were the best predictors of actual Voter Turnout in 2016. The 'Distance to nearest city class' variable wasn't found to be a reliable predictor, i.e. it had little influence on Voter Turnout. You'll be surprised to know that owning a selfie stick was actually a better predictor / influencer of Voter Turnout than proximity to the nearest city!
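In scikit-learn terms, such a ranking would come straight out of the fitted forest's feature_importances_ attribute - continuing the illustrative regression sketch from earlier (so the printed ordering reflects that synthetic data, not the study's):

```python
# Ranking the variables by how much they reduce impurity across the forest.
# Assumes 'model' and 'X' from the earlier regression sketch.
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# In the study's results, per_capita_income and the education variable
# would sit near the top, with dist_to_nearest_city_class trailing even
# the selfie-stick variable.
```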
Next, we'll see how well the algorithm predicts Voter Turnout for the 2016 National Elections at a 'Census Tract' level. Essentially, a Census Tract is representative of a 'Neighborhood' in the USA. The researchers want to re-train the algorithm on the same variable information captured for the random subset during the 2019 surveys, the difference being that the survey feedback will now be aggregated over a smaller geographic extent (a Census Tract) compared to what was done initially (at a County, i.e. regional, level). Essentially, this is to see how well the model predicts Voter Turnout behaviour for the 2016 National Elections at a more granular geographic level using the same variable information.
Spare a moment here and think about what impact making predictions at a more micro level would have on the algorithm's effectiveness.
Read on now -
Hang on! Notice that we have not included the Selfie Stick Ownership information in the Variables this time. Why?
It is because this piece of information was initially captured as a County-level estimate, i.e. it was not captured at the individual level, which would have allowed it to be re-aggregated to the Census Tract level as part of the training dataset.
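To make the aggregation point concrete, here is a toy pandas sketch of rolling individual survey responses up to two different geographic levels - and why a county-only estimate like selfie-stick ownership cannot be pushed down to tracts (every field below is hypothetical):

```python
# Aggregating individual-level survey responses to two geographic levels.
import pandas as pd

responses = pd.DataFrame({
    "county":       ["A", "A", "A", "B"],
    "census_tract": ["A-1", "A-1", "A-2", "B-1"],
    "age":          [34, 41, 29, 52],                  # illustrative fields
    "income":       [31_000, 54_000, 42_000, 38_000],
})

county_level = responses.groupby("county").mean(numeric_only=True)
tract_level  = responses.groupby("census_tract").mean(numeric_only=True)

# Selfie-stick ownership was only ever recorded as one estimate per county,
# so there is no individual-level column to group down to tract level.
print(tract_level)
```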
So how well did the algorithm's predictions compare to the Actual Voter Turnout at a Census Tract level?
A: a 62.9% Coefficient of Determination. The algorithm's accuracy increased by one percentage point at the Census Tract level when compared to the County-level predictions.
Earlier, I had urged you to think about what impact reducing the geographic level of aggregation of the supervised observations would have on the algorithm's output. What had you thought, and how did it compare to this result? I'll tell you what I had expected - I did not think the algorithm would be able to give a better prediction at a more micro level; the possibility appeared very counter-intuitive to me. But as it turned out, the results indicate that the four variables selected are actually very good indicators / influencers / predictors of Voter Turnout - each Census Tract within a County has diverse characteristics, and aggregating the information at a County level distorted the algorithm's ability to classify and predict Voter Turnout accurately. The more I think about it, the more convinced I am! What about you?
How potent are the Variables now, relative to each other, when comparing the Actual 2016 National Elections Voter Turnout Results at a Census Tract level with the Algorithm's predictions for the same?
There have been only minor fluctuations when compared to the previous box plot diagram aggregated at the County level in Figure 8. The relative importance of each variable to the model is still largely the same, even at a Census Tract level. The keyword here is 'relative', if you think about it - we've already deduced that the predictive power of the variables is actually stronger at a more granular geographic aggregation.
From the graph above, one can infer that the Confidence Intervals for Census Tracts with low Voter Turnout percentages are much larger than those for Census Tracts with high Voter Turnout percentages. You can interpret this to mean that the four variables considered for the study, aggregated at the Census Tract level, are much better at predicting high-turnout Census Tracts than low-turnout ones.
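If you are wondering where such intervals can come from in a Random Forest, one common proxy is the spread of the individual trees' predictions - a rough sketch, reusing the fitted 'model' and 'X' from the earlier illustrative regression (this approximates the idea, it does not reproduce the tool's own diagnostic):

```python
# Per-tree spread as a rough uncertainty band around each prediction.
import numpy as np

new_tracts = X.iloc[:5]  # stand-in for five census tracts to score

# Each tree in the ensemble predicts independently...
per_tree = np.stack(
    [tree.predict(new_tracts.values) for tree in model.estimators_])

# ...and the 5th-95th percentile range across trees gives a crude interval.
low, high = np.percentile(per_tree, [5, 95], axis=0)
mean = per_tree.mean(axis=0)
for m, lo, hi in zip(mean, low, high):
    print(f"predicted turnout {m:5.1f}%  interval [{lo:5.1f}, {hi:5.1f}]")
```

The narrower the band, the more the trees agree - which is exactly the kind of agreement the graph above shows for the high-turnout tracts.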
There are several ways to improve the algorithm's overall predictive effectiveness - adding more relevant variables, removing less relevant variables, increasing the number of validation runs, and increasing the number of decision trees being some of them.
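In scikit-learn terms, those levers map onto familiar knobs - below is an illustrative tuning sketch over tree count and per-split feature sampling, cross-validated five ways (again reusing the synthetic X and y from earlier):

```python
# Nudging predictive effectiveness: more trees plus cross-validated tuning.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestRegressor(random_state=7),
    param_grid={
        "n_estimators": [100, 500, 1000],  # more decision trees
        "max_features": ["sqrt", 1.0],     # how many variables each split sees
    },
    cv=5,              # five validation folds per candidate
    scoring="r2",
)
grid.fit(X, y)
print(grid.best_params_, f"best CV R-squared: {grid.best_score_:.3f}")
```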
I hope you've come to appreciate the Forest-based Machine Learning algorithm's ability to predict a complex phenomenon such as Voter Turnout using multiple variables. It is able to sift through swathes of geospatial information within minutes!
For stats enthusiasts wondering about the difference in predictive effectiveness between an ensemble technique (Random Forest) and a more traditional methodology (Linear Regression), here are some answers.
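If you'd like to run a quick comparison of your own, cross-validating both models on the same data is a reasonable starting point (once more using the illustrative X and y from the sketches above):

```python
# Head-to-head: ensemble (Random Forest) vs traditional (Linear Regression).
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

candidates = [
    ("Random Forest", RandomForestRegressor(n_estimators=300, random_state=7)),
    ("Linear Regression", LinearRegression()),
]
for name, estimator in candidates:
    scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    print(f"{name:>18}: mean R-squared {scores.mean():.3f}")
```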
What other applications / uses of Random Forest algorithm can you think of?
One can determine which roads are more accident-prone, which places tourists are more likely to visit, or which ads / social media content viewers are more likely to watch, besides numerous other applications. Can you think of some unique applications?
ABOUT US
Intelloc Mapping Services | Mapmyops.com is based in Kolkata, India and engages in providing Mapping solutions that can be integrated with Operations Planning, Design and Audit workflows. These include but are not limited to - Drone Services, Subsurface Mapping Services, Location Analytics & App Development, Supply Chain Services, Remote Sensing Services and Wastewater Treatment. The services can be rendered pan-India, some even globally, and will aid an organization to meet its stated objectives especially pertaining to Operational Excellence, Cost Reduction, Sustainability and Growth.
Broadly, our area of expertise can be split into two categories - Geographic Mapping and Operations Mapping. The Infographic below highlights our capabilities.
Our 'Mapping for Operations'-themed workflow demonstrations can be accessed from the firm's Website / YouTube Channel and an overview can be obtained from this flyer. Happy to address queries and respond to documented requirements. Custom Demonstration, Training & Trials are facilitated only on a paid-basis. Looking forward to being of service.
Regards,
Many thanks to RUS Copernicus & Esri for the training material