Random Forest Machine Learning algorithm utilized for Geospatial Data Analysis

Arpit Shah

Nov 8, 2020 - 12 min read

Updated: Mar 8

Workflows demonstrated in this post (Hyperlinked to Sections)

Classifying Radar Satellite Imagery to map Deforestation

Classifying Optical Satellite Imagery to map Agricultural Land Use

Classifying the potency of five parameters influencing Voter Turnout during Elections

Yes, Machine Learning algorithms can be applied to Geospatial datasets - they aid workflows that require extensive classification, clustering, or predictive data-processing.

The Random Forest algorithm, developed by Leo Breiman and Adele Cutler, is a popular, supervised Ensemble-based technique within Machine Learning that can perform such processing tasks at scale, and with speed and accuracy. The methodology first entails training the algorithm on a pre-classified sample dataset, during which it builds many decision trees from the explanatory variables. Subsequently, the algorithm is exposed to unknown data that one seeks to classify - it runs each observation through those decision trees and combines their outputs into a single prediction. The quality and randomness of the training data and the potency of the chosen explanatory variables help this popular and powerful algorithm home in on an accurate prediction - a form of scientific guesswork at computing speeds, in a manner of speaking. Refer to the video explanation from this time-stamped url.
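To make the train-then-classify idea concrete, here is a minimal sketch in plain Python. It is not the implementation used by SNAP or ArcGIS Pro: it assumes a heavily simplified forest in which every "tree" is a one-split decision stump trained on a bootstrap sample and one randomly chosen explanatory variable. The function names (`train_stump`, `train_forest`, `predict`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """Fit a one-split tree (a stump) on a single explanatory variable."""
    values = sorted({row[feature] for row in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # Only one distinct value in the sample: always predict the majority class.
        label = majority(labels)
        return (feature, values[0], label, label)
    return (feature, best[1], best[2], best[3])


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random variable."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # draw with replacement
        feature = rng.randrange(len(rows[0]))              # random explanatory variable
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Classify one unseen observation by majority vote across all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical pre-classified samples: two explanatory variables per observation.
training_rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (9, 9), (8, 8)]
training_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

forest = train_forest(training_rows, training_labels)
low_sample = predict(forest, (1.5, 1.5))    # unseen observation near the "A" samples
high_sample = predict(forest, (8.5, 8.5))   # unseen observation near the "B" samples
```

Real implementations grow much deeper trees and evaluate several candidate variables at each split, but the two ingredients shown here (random resampling of the training data and a vote across many trees) are the core of the method.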

The Forest-based Classification and Regression Tool, which I will utilize in Workflow 3, is an adaptation of the Random Forest Algorithm specifically designed to work with geospatial data. Below is an excellent explainer-

Video 1: Forest-based Classification & Regression Machine Learning Algorithm explained. Source: Esri's Spatial Data Science MOOC

Workflow 1: Classifying Radar Satellite Imagery to map Deforestation

The image on the left in Slider 1 below shows Synthetic Aperture Radar imagery over a cross-section of the Chaco forest region situated in the north of Paraguay - an area plagued by deforestation. The jet-black pixels are representative of vegetation - the blue polygons on some of them mark areas that have been verified as such. The tiny, grainy black-and-white pixels are representative of barren land, i.e. land stripped of vegetation - the in-situ verifications for these are highlighted with yellow polygons. Together, these form the supervised data points which the Random Forest Machine Learning algorithm will be trained on. Once the training is complete, the algorithm will utilize its learning to classify the terrain for each and every pixel in the imagery - it will classify it as either forested (green) or deforested (grey), as can be seen in the classified output on the right.

Slider 1: Supervised observations within Chaco Forest (left) and Random Forest Machine Learning Algorithm's predicted output (right) derived using European Space Agency's SNAP software

The Random Forest algorithm's classified output appears largely accurate based on visual evidence. (Refer to this video demonstration in case you'd like to view the actual processing steps.)

While this is a demonstration of a simple workflow, it shouldn't be difficult for you to imagine a scaled-up version of the same - training the algorithm on multiple parameters and setting it loose on a much larger dataset spanning acres of land - yes, that would be a highly effective and time-saving way of classifying land use! Actually, don't imagine - the next workflow delves into exactly this😊.

Workflow 2: Classifying Optical Satellite Imagery to map Agricultural Land Use

Figure 1: Study Area as seen in Sentinel-2 Optical Imagery over Seville, Spain (2017). The multi-colored polygons correspond to the verified land use parameters i.e. the crop-type growing on it. Source: RUS Copernicus

As mentioned, this workflow is more complex than the previous one - it entails training the Random Forest Machine Learning algorithm on multiple parameters, and the area that needs to be classified is much larger too, spanning acres of agricultural land in Seville, Spain. The satellite imagery view in Figure 1 above highlights the in-situ observations with multi-color polygons (based on the land use/type of crop verified to be growing on it - tomato, wheat, corn and several others). These observations form the Training dataset, and upon learning from it the algorithm was deployed to predict the crop-type for each and every pixel in the dataset.

Sharing its output below-

(Refer to this video demonstration in case you'd like to view the actual processing steps)

Figure 2: Crop-type classification over the entire study area generated by the Random Forest Machine Learning Algorithm

Figure 3: Zoomed-in view of the classified output

The higher the quality of the training dataset (the number, accuracy and randomness of in-situ observations), the better the algorithm's predictions would become. The degree of randomness helps to reduce overfitting - this aspect is explained in Video 1 at the 01:40 mark.
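One source of randomness inside the algorithm itself is bootstrap sampling: each tree is trained on observations drawn with replacement from the training data, so every tree sees a slightly different subset. The sketch below, which uses only Python's standard library and a hypothetical count of 1,000 in-situ observations, estimates how large a share of the observations a single tree never sees.

```python
import random


def bootstrap_sample(n_observations, rng):
    """Draw n indices with replacement, as is done for each tree."""
    return [rng.randrange(n_observations) for _ in range(n_observations)]


rng = random.Random(42)
n_observations = 1000   # hypothetical number of in-situ observations
n_trees = 50

left_out_shares = []
for _ in range(n_trees):
    drawn = set(bootstrap_sample(n_observations, rng))
    # Share of observations that this tree never saw during training.
    left_out_shares.append(1 - len(drawn) / n_observations)

average_left_out = sum(left_out_shares) / n_trees
print(f"Average share of observations left out per tree: {average_left_out:.1%}")
```

On average a little over a third of the observations are left out of any one tree, so no single tree can memorize the full training set, and the errors of individual trees tend to average out when their votes are combined.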

ESA's SNAP software was used to run the Random Forest Algorithm on this Raster type of geospatial dataset (as applicable to Workflow 1 as well). One can use the algorithm to classify virtually any raster surface, including Urban areas.

Workflow 3: Classifying the potency of five parameters affecting Voter Turnout during Elections

While the previous workflows involved applying the Random Forest Machine Learning algorithm to the Raster form of geospatial data, this workflow involves deploying a variant of the algorithm - the Forest-based Classification and Regression Tool - on the Vector form of geospatial data, with the objective of testing the potency of a few parameters that are deemed to affect Voter Turnout at the National Elections in USA. The obtained insights will help to fine-tune these explanatory variables in a bid to accurately predict the Voter Turnout for the upcoming National Elections in 2020.

Surveys were conducted in select counties within the country in 2019 to obtain supervised data for these test parameters affecting Voter Turnout - the recorded responses make for a geospatial/location dataset which will be used to train the Forest-based Classification and Regression Machine Learning Algorithm that I will deploy using Esri ArcGIS Pro GIS software. In order to determine the potency of the test parameters, the algorithm's predictions will be compared to the actual Voter Turnout in the previous 2016 National Elections - this is what I will set out to demonstrate below.

Credits: Esri Learn ArcGIS, Esri ArcGIS Pro

The individual responses to the surveys aggregated at the county-level were for these parameters-

Percentage of Population with at least High School Education,
Median Age of Population,
Per Capita Income of Population,
Percentage of Population who own a Selfie-Stick (a whacky variable😊), and
Distance of the County to the nearest City Class - There are ten city classes in all - a city class of 1 corresponds to the distance of the county to the nearest city which has up to 10,000 residents, a city class of 2 corresponds to the distance of the county to the nearest city which has between 10,001 and 20,000 residents, and so on in intervals of 10,000 residents, up to a city class of 10 for cities with up to 100,000 residents. Essentially, what is being captured is proximity to the nearest urban centre and how it impacts Voter Turnout behaviour.
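The city class banding described in the last parameter can be written as a small function. This is a sketch of the banding as stated above, not code from the Esri lesson; the function name `city_class` is hypothetical, and because the text does not say how cities with more than 100,000 residents are treated, the sketch returns `None` for them.

```python
def city_class(population):
    """Map a city's population to the city class described above (1 to 10)."""
    if population <= 0:
        raise ValueError("population must be positive")
    if population > 100_000:
        # Assumption: the text does not say how larger cities are classed.
        return None
    # Integer ceiling of population / 10,000, so 10,000 -> 1 and 10,001 -> 2.
    return (population + 9_999) // 10_000


for population in (7_500, 10_000, 10_001, 20_000, 100_000):
    print(population, "-> city class", city_class(population))
```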

Figure 4: USA National Elections 2016 - Actual Voter Turnout aggregated at County-level and color-coded based on standard deviation from the national mean

On Esri's ArcGIS Pro software, using the Forest-based Classification and Regression geoprocessing tool in Train and Predict mode allows me to do all three of these things at once-

train the Machine Learning algorithm on the Training dataset (the recorded responses gathered from surveys at select counties conducted in 2019)
deploy the trained algorithm to obtain predictions (for Voter Turnout across all the counties of USA)
assess the prediction performance using a Validation dataset (actual Voter Turnout during the 2016 National Elections aggregated at County-level)

Figure 5: Snapshot of the Forest-based Classification and Regression geoprocessing tool which runs the namesake Machine Learning algorithm in ArcGIS Pro GIS software

Figure 6: Output of the Forest-based Classification and Regression Machine Learning Algorithm - Predicted Voter Turnout aggregated at County-level and expressed in percentage for all the Counties (3,244) of USA

How did the algorithm's prediction fare against the validation dataset (actual Voter Turnout dataset from the 2016 National Election aggregated at County-level)?

Figure 7: Regression Diagnostics output - comparing the Algorithm's predicted Voter Turnout aggregated at County-level based on the five test parameters to the Validation Dataset (Actual Voter Turnout at the 2016 USA National Elections aggregated at County-level)

The Validation Data: Regression Diagnostics output indicates that the Coefficient of Determination (R-squared) is 61.9%, i.e. the algorithm's predictions account for roughly 61.9% of the variation in Actual Voter Turnout across counties. This implies that the Actual Voter Turnout in the 2016 USA National Elections was moderately (not excessively) explained by the test parameters as a whole.
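For readers unfamiliar with the metric, R-squared compares the squared prediction errors with the squared deviations of the actual values from their mean: R² = 1 - SS_residual / SS_total. The sketch below computes it in plain Python on made-up turnout figures (not the values from this analysis); the function name `r_squared` is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Made-up turnout percentages for three counties.
actual_turnout = [50.0, 60.0, 70.0]

perfect_model = r_squared(actual_turnout, [50.0, 60.0, 70.0])   # predicts every county exactly
mean_only_model = r_squared(actual_turnout, [60.0, 60.0, 60.0]) # always predicts the mean
```

A model that predicts every value exactly scores 1, while one that only ever predicts the mean scores 0, so a value of 0.619 sits in between: better than the mean alone, but with a sizeable share of the variation left unexplained.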

Which parameters from the five are the most reliable predictors of Voter Turnout based on the algorithm's performance vis-a-vis the Validation dataset (Actual Voter Turnout at the 2016 National Elections)?

Figure 8: Snapshot of the Distribution of Variable Importance Box Plot - Predicted Voter Turnout based on five test parameters v/s Validation Dataset (Actual Voter Turnout in 2016 USA National Elections aggregated at County-level)

Of the five parameters considered for the analysis, Per Capita Income and High School Education are the best predictors of Voter Turnout, relative to the other parameters, based on the algorithm's performance on the Validation dataset. The Distance to nearest City Class variable wasn't found to be a reliable predictor. Owning a Selfie-Stick is actually a better predictor of Voter Turnout than Proximity to the nearest Urban centre!
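One generic way to build intuition for what "variable importance" means is permutation importance: shuffle one variable's values across the rows and see how much the prediction error rises. This is a different technique from whatever the ArcGIS Pro tool uses internally to produce its box plot, and everything in the sketch below (the toy model, the data, the function names) is hypothetical.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared difference between model predictions and targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, repeats=20, seed=0):
    """Average rise in error when one column's values are shuffled across rows."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [row[:column] + (value,) + row[column + 1:]
                    for row, value in zip(rows, shuffled)]
        increases.append(mean_squared_error(model, permuted, targets) - baseline)
    return sum(increases) / repeats


# Hypothetical county rows: (per-capita income in $1000s, distance to city in km).
rows = [(20, 5), (25, 40), (30, 12), (35, 33), (40, 8), (45, 27)]
# Hypothetical turnout that depends on income only.
targets = [40 + 0.5 * income for income, _ in rows]


def toy_model(row):
    """Stand-in for a trained forest: uses income, ignores distance."""
    return 40 + 0.5 * row[0]


income_importance = permutation_importance(toy_model, rows, targets, column=0)
distance_importance = permutation_importance(toy_model, rows, targets, column=1)
```

Shuffling a variable the model relies on degrades its predictions, whereas shuffling one it ignores changes nothing, which is the sense in which one variable is "more important" than another.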

How would the performance of the test parameters change if we were to have the algorithm make predictions at a Census Tract-level instead of County-level i.e. by performing the analysis at a more granular level?

A Census Tract is representative of a neighborhood in USA (this geographical division is used while gathering responses for the decennial U.S. Census). It is possible to re-train the algorithm on data aggregated at Census Tract-level as the responses to the Surveys held in 2019 at select counties were captured at an individual-resident level (the smallest possible unit). Therefore, all that one needs to do is to aggregate the responses at Census Tract-level (84,414 in total) instead of at County-level (3,244 in total) as done in the previous scenario.
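Because the responses were captured per resident, switching the level of aggregation is just a matter of grouping the same records by a different key. A minimal sketch with made-up records follows; the field layout, the county and tract names, and the use of a simple mean are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical individual survey responses: (county, census tract, age in years).
responses = [
    ("County A", "Tract 1", 30),
    ("County A", "Tract 1", 40),
    ("County A", "Tract 2", 60),
    ("County B", "Tract 3", 50),
]


def mean_by(records, key_index, value_index=2):
    """Group records by the chosen field and average the chosen value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[value_index])
    return {key: sum(values) / len(values) for key, values in groups.items()}


by_county = mean_by(responses, key_index=0)  # coarser: one value per county
by_tract = mean_by(responses, key_index=1)   # finer: one value per census tract
```

The tract-level result keeps differences that the county-level average smooths over, such as the gap between the two tracts within County A.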

Figure 9: Census Tracts in USA (84,414) dataset. Source: Esri Learn ArcGIS / Living Atlas

What do you think would be the impact of making predictions at a geographically more micro-level for the Machine Learning algorithm? Would the predicted Voter Turnout output become more reliable or less reliable? And how will the five test parameters fare? Would any/all of them develop an increase/decrease in potency? Spare a minute here to think about it before reading on....

Figure 10: Deploying the Forest-based Classification and Regression Machine Learning geoprocessing tool again - the namesake algorithm is being trained on the same Survey responses, albeit which are now aggregated at a more micro Census Tract-level.

Notice in Figure 10 that I have not included the Percentage of population who own a Selfie-Stick as a test parameter this time - this is because in reality, this question did not form a part of the Survey at all - rather, an estimate of the same at County-level was obtained from a research institute. As individual responses were not recorded, and also because an estimate of it at Census Tract-level is not available, this parameter is being omitted altogether (even the Distance from the County to the nearest City Class parameter was not recorded during the survey, rather it was derived using geospatial analysis. However, deriving this proximity information at a Census Tract-level is also possible through geospatial analysis and hence, I've included it as part of the test parameters).

Figure 11: Predicted Voter Turnout Output at Census Tract-level

How did the algorithm's prediction fare against the new Validation dataset (Actual Voter Turnout data aggregated at Census Tract-level from the 2016 election)?

Figure 12: Statistics from the Forest-based Classification and Regression algorithm's new output

The Coefficient of Determination is 62.9%, which implies that the algorithm's reliability has increased by one percentage point by using Training data that was aggregated at Census Tract-level instead of at County-level.

Did you anticipate this? Personally, I felt the algorithm would have become less reliable i.e. have more variation from Actual Voter Turnout, as it would be making significantly more predictions (>80,000) than it did on the previous occasion. But as it turns out, narrowing the level of aggregation perhaps made the training data (responses to the test parameters) more directly attributable to the geographic region the algorithm was supposed to make its predictions for, so much so that the benefit arising from being able to make better predictions nullified the drawback from having to make significantly more predictions. This is just one of the many theories that could be possible - it could also be that the potency of the parameters remained largely unchanged even at the narrow-level of geospatial aggregation (perhaps because they are by and large agnostic to geospatial extent as such or because the parameters chosen aren't really causative influencers to Voter Turnout - after all correlation doesn't imply causation). Couple this with the possibility that the total number of predictions to be made has little downward impact on an algorithm's overall reliability because the variances level-out and suddenly, this theory becomes worth considering. What do you think? Feel free to share.

I know what some of you may be thinking - it would help to know if there was a discernible change in potency of the test parameters at the Census Tract-level. Let's explore the box plot-

Figure 13: Snapshot of the Distribution of Variable Importance Box Plot - Predicted Turnout vs Validation Dataset (Actual Turnout aggregated at Census Tract-level in 2016 USA National Elections)

Unfortunately, there is very little to distinguish between this box plot and the previous one (Figure 8). High School Education and Per Capita Income still remain the most potent parameters affecting Voter Turnout at Census Tract-level. Their potency has increased, albeit only marginally, relative to the other test parameters. Therefore, to make any new inferences about the test parameters' effectiveness at a narrow level of geospatial aggregation is not possible - perhaps they are actually immune to geospatial extent!

There is another piece of statistics which reveals an interesting insight though-

Figure 13: Prediction Interval graph generated by the Forest-based Classification and Regression geospatial tool

The Y-axis of this Prediction Interval graph shows the Predicted Voter Turnout percentage, and the X-axis shows the individual Census Tracts, sorted from low to high by their Predicted Voter Turnout percentage. P05 and P95 represent the lower and upper bounds of the Prediction Interval respectively, i.e. the 5th and 95th percentiles of the predicted values. In other words, given the four test parameters utilized in the Training Dataset, the graph shows the range within which roughly 90 percent of the predicted Voter Turnout values for each Census Tract are expected to fall.

Observe in Figure 13 that for Census Tracts where the algorithm has predicted a low Voter Turnout percentage (below 50%), the gap between P05 and P95 is wide, i.e. the prediction could plausibly fall anywhere within a broad range of percentages. However, for Census Tracts with a high predicted Voter Turnout percentage (above 50%), the gap between P05 and P95 is much narrower. This implies that the Training Dataset (the responses to High School Education, Age, Income and Distance to nearest city class as submitted by individual residents during the Surveys held at select counties and aggregated at Census Tract-level) has led the Machine Learning algorithm to be more confident about its High Voter Turnout predictions than about its Low Voter Turnout predictions, i.e. the test parameters are much better explanatory variables for High Voter Turnout Census Tracts than they are for Low Voter Turnout Census Tracts.

Isn't this interesting!
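To see how a P05 to P95 band can be wide for one tract and narrow for another, consider the sketch below. It assumes the bounds are read off the spread of the individual trees' predictions for a tract; the ArcGIS Pro tool's exact procedure may differ, and the per-tree values here are made up.

```python
import statistics


def prediction_interval(tree_predictions):
    """Return (P05, P95) of the values predicted by the individual trees."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(tree_predictions, n=20, method="inclusive")
    return cuts[0], cuts[-1]


# Made-up per-tree turnout predictions (%) for two census tracts.
low_turnout_tract = [28, 33, 37, 41, 45, 48, 52, 55, 58, 62]
high_turnout_tract = [66, 67, 67, 68, 68, 69, 69, 70, 70, 71]

low_p05, low_p95 = prediction_interval(low_turnout_tract)
high_p05, high_p95 = prediction_interval(high_turnout_tract)

low_width = low_p95 - low_p05     # trees disagree a lot: wide band
high_width = high_p95 - high_p05  # trees largely agree: narrow band
```

When the trees disagree about a tract, the band is wide; when they largely agree, it is narrow, which is the pattern described for the low and high turnout tracts above.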

Some of the ways to improve the prediction accuracy of the algorithm would be to-

obtain more survey responses, preferably from other Counties, i.e. increase the quantity of the Training data
add more potentially relevant Variables and remove less relevant ones, i.e. increase the quality of the Test parameters
increase the number of Validation Runs in the GIS software, i.e. tweak the Model parameters
increase the number of Decision Trees that the algorithm generates, i.e. tweak the Model parameters
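The last point can be illustrated with a small simulation. It treats each tree as an independent, noisy estimate of the true turnout, which is a simplifying assumption (real trees are correlated, so the gains level off sooner); the numbers and function names are hypothetical.

```python
import random
import statistics


def forest_prediction(n_trees, rng, true_value=60.0, tree_noise=10.0):
    """Average of n_trees noisy, independent tree predictions."""
    return statistics.fmean([rng.gauss(true_value, tree_noise)
                             for _ in range(n_trees)])


def spread_of_forest(n_trees, repeats=300, seed=1):
    """Standard deviation of the forest's prediction across repeated trainings."""
    rng = random.Random(seed)
    return statistics.stdev([forest_prediction(n_trees, rng)
                             for _ in range(repeats)])


spread_small = spread_of_forest(n_trees=5)
spread_large = spread_of_forest(n_trees=100)
print(f"Spread with 5 trees:   {spread_small:.2f}")
print(f"Spread with 100 trees: {spread_large:.2f}")
```

Averaging over more trees makes the forest's prediction more stable from one training run to the next, at the cost of longer processing time.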

I hope that through this demonstration, you've come to appreciate the utility of the Forest-based Machine Learning Algorithm, which was able to predict a really complex phenomenon such as Voter Turnout relatively accurately using a few test parameters. While I did not make a screencast of the processing steps on the GIS software, know that the algorithm was able to sift through swathes of geospatial training data and make meaningful predictions from what it learned within a matter of just a few minutes!

With the Random Forest algorithm or its Classification and Regression variant, one can determine which Roads are more accident-prone, which Places tourists are likely to visit, and which Online content viewers are likely to watch, besides numerous other applications. Which other applications can you think of? Feel free to share with me.

Regards,

Arpit Shah
