Random Forest (RF) hyperspectral data classification¶
In this notebook, you will train and apply a common Machine Learning classifier - Random Forest - to classify hyperspectral data from Bílá Louka in the Krkonoše Mountains, Czechia. Please start by reading about the dataset and area of interest here.
Prerequisites (This notebook can be run either online using Google Colab or on your local machine)
- A Google account for accessing Google Colab (link to notebook). If running the exercise in Google Colab, please copy the notebook to your own Google Drive and follow the exercise; you don't need to download the dataset.
or
- A Python environment with the necessary libraries (manual). Download this notebook (click the download button in the top right corner of the page) and follow the exercise.
Downloaded data (module4.zip/theme4_exercise_machine_learning). The dataset consists of:
- Hyperspectral RPAS imagery of Bílá Louka, Czechia (50.728N, 15.682E), acquired in August 2020 and resampled to 54 spectral bands with a ground sampling distance of 9 cm: BL_202008_imagery.tif
- A raster with reference data: BL_202008_reference.tif
- Pretrained models and corresponding classified rasters: /sample_results/*
Tasks
- Preprocess imagery for Machine Learning
- Classify the hyperspectral image using RF
- Observe how hyperparameter values alter classification results
- Evaluate your results and compare to our pretrained classifier
- Optional: Classify an urban scene
0. Load external libraries and set paths¶
First, we need to import external libraries:
numpy - Arrays to hold our data
matplotlib.pyplot - Draw images
sklearn.ensemble - Random forest classifier
sklearn.model_selection - Cross-validation and hyperparameter tuning implemented in scikit-learn
sklearn.metrics - Compute accuracy metrics using scikit-learn
sklearn.preprocessing - Normalizing input data using scikit-learn
time.perf_counter - Track how long individual functions take to run
os.path - Path manipulation
tqdm - Show progress bars for long-running loops
joblib - Saving and loading trained classifiers
etrainee_m4_utils.image_preprocessing - Our library holding functions for image tiling, preprocessing, etc.
etrainee_m4_utils.inference_utils - Our library for correctly exporting classified images
etrainee_m4_utils.visualisation_utils - Our library for visualising the data
Two external libraries are not imported directly in this notebook, but are used by functions in image_preprocessing and inference_utils:
- gdal - Manipulates spatial data
- scipy.io - Reads .mat files
import numpy as np
import matplotlib.pyplot as plt
from time import perf_counter
from os.path import join
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from tqdm import notebook as tqdm
from joblib import dump, load
from etrainee_m4_utils import image_preprocessing
from etrainee_m4_utils import inference_utils
from etrainee_m4_utils import visualisation_utils
# GLOBAL SETTINGS
plt.rcParams['figure.figsize'] = [5, 5]
np.set_printoptions(precision=2, suppress=True) # Array print precision
# Set dataset name (used by visualisation functions) - 'bila_louka' or 'pavia_centre'
# default: 'pavia_centre'
ds_name = 'bila_louka'
# Get a list of class names
_, class_names = visualisation_utils._create_colorlist_classnames(ds_name=ds_name)
Please fill in the correct paths to your training and reference rasters (pointing the root_path variable to the project folder should suffice):
root_path = 'C:/folder/where/this/project/is/saved'
# PATHS TO TRAINING DATA
# Bila Louka
imagery_path = join(root_path, 'BL_202008_imagery.tif')
reference_path = join(root_path, 'BL_202008_reference.tif')
# Pavia
#imagery_path = join(root_path, 'Pavia.mat')
#reference_path = join(root_path, 'Pavia_gt.mat')
# PATH TO SAVE MODELS
model_save_folder = join(root_path, 'models')
# PATH TO SAVE CLASSIFIED IMAGE
out_path = join(root_path, 'results/Krkonose_RF.tif')
# PATH TO THE SAMPLE RESULTS
sample_result_path = join(root_path, 'sample_results/RF_sample_result.tif')
1. Load and preprocess training data¶
1.1. Data loading into NumPy¶
Let's start by reading an image into a NumPy array; this is done in the background using GDAL.
The result of our function is a dictionary named loaded_raster, which contains two numpy arrays under the keys imagery and reference. As we can see, the loaded hyperspectral dataset has 1088 by 1088 pixels with 54 spectral bands. The raster containing our reference data has the same height and width.
For loading most raster datasets, we created a read_gdal() function in the image_preprocessing module. But loading .mat files for the Pavia City Centre requires a specific function (read_pavia_centre()). Both read_pavia_centre() and read_gdal() return a dictionary containing two numpy arrays with keys imagery and reference.
If using the Pavia City Centre dataset, you may notice that the original image has a shape of (1096, 1096, 102), but to make the data easier to tile for neural networks, we crop the image to (1088, 1088, 102) here.
loaded_raster = image_preprocessing.read_gdal(imagery_path, reference_path)
#loaded_raster = image_preprocessing.read_pavia_centre(imagery_path,
# reference_path, out_shape=(1088, 1088, 102))
print(f'Tiled imagery shape {loaded_raster["imagery"].shape}')
print(f'Tiled reference shape {loaded_raster["reference"].shape}')
visualisation_utils.show_img_ref(loaded_raster["imagery"][:, :, [25, 15, 5]],
loaded_raster["reference"], ds_name=ds_name)
1.2. Flatten array¶
We will be classifying individual pixels, therefore we can transform the 3D image (height, width, spectral bands) to a 2D array (length, spectral bands). This transformation discards spatial relationships within the image; however, the classifier only uses 1D feature vectors anyway, and it simplifies the next step (filtering out NoData).
orig_shape = loaded_raster['imagery'].shape
flat_arrs = {}
flat_arrs['imagery'] = loaded_raster['imagery'].reshape(
orig_shape[0]*orig_shape[1], orig_shape[2])
flat_arrs['reference'] = loaded_raster['reference'].reshape(
orig_shape[0]*orig_shape[1])
print(f'The flat imagery array has shape {flat_arrs["imagery"].shape}')
1.3. Filter out NoData pixels¶
We can only train the classifier on pixels with a reference value, therefore we remove all pixels belonging to class 0 (NoData). This operation reduces our training dataset from ~1.18 million to ~130 thousand pixels. We then visualise the spectral curve of an individual pixel.
filtered_arrs = {}
filtered_arrs['imagery'] = flat_arrs['imagery'][flat_arrs['reference'] > 0]
filtered_arrs['reference'] = flat_arrs['reference'][flat_arrs['reference'] > 0]
print(f'The filtered array has shape {filtered_arrs["imagery"].shape}')
pixel_number = 1000
visualisation_utils.show_spectral_curve(filtered_arrs, pixel_number,
ds_name=ds_name)
1.4. Data scaling¶
After filtering the training data, we can move on to data scaling. In Machine Learning, it is common to scale all features before classification, because many classifiers assume that all features vary on comparable scales and have values close to zero.
# Fit the scaler on the full (unfiltered) image
scaler = StandardScaler()
scaler.fit(flat_arrs['imagery'])
# Apply the scaling to the filtered training pixels
scaled_arrs = {
    'imagery': scaler.transform(filtered_arrs['imagery']),
    'reference': filtered_arrs['reference']}
pixel_number = 5
visualisation_utils.show_spectral_curve(scaled_arrs, pixel_number,
ds_name=ds_name)
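As a quick sanity check (a minimal sketch, not part of the original workflow), you can confirm that the scaled bands are roughly centred around zero with a standard deviation near one. They will not be exactly zero-mean, because the scaler was fitted on the full image but applied only to the filtered pixels:
# Per-band statistics of the scaled training pixels (first five bands shown)
print(f'Band means: {scaled_arrs["imagery"].mean(axis=0)[:5]} ...')
print(f'Band std devs: {scaled_arrs["imagery"].std(axis=0)[:5]} ...')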
1.5. Splitting data for training/testing¶
Our reference dataset has to be split into three groups:
training / validation / test
In this step, we divide the data into train+val and test groups. The variable train_fraction determines what fraction of the reference dataset is used for training and validation. By default, a third of the reference data is used for training and validation, while the other two thirds are used for testing model performance. The splitting is done using the scikit-learn function train_test_split, which here performs stratified sampling across classes (via the stratify argument).
Please experiment with different fractions of training data.
train_fraction = 1/3
X_train, X_test, y_train, y_test = train_test_split(
    scaled_arrs['imagery'], scaled_arrs['reference'],
    train_size=train_fraction, stratify=scaled_arrs['reference'])
train = {'imagery': X_train, 'reference': y_train}
test = {'imagery': X_test, 'reference': y_test}
# How many training samples do we have for each class?
unique, counts = np.unique(train['reference'], return_counts=True)
print(f'The individual classes contain {counts} training pixels.')
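To see which class each count belongs to, you can pair the counts with the class names (a small sketch, assuming label i corresponds to class_names[i], with index 0 reserved for NoData):
# Pair each class label with its name and number of training pixels
for label, count in zip(unique, counts):
    print(f'{class_names[int(label)]}: {count} training pixels')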
2. Random Forest definition and training¶
After preprocessing our data, we can move on to defining a machine learning model. You can either train your own classifier or use the one we have already trained for you (sample_results/RF_sample_trained.joblib). If you are using the pretrained RF, skip ahead to section 2.3.
This training uses a Random Forest implementation from scikit-learn, a popular Machine Learning library for Python. The documentation is available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
2.1. Find most suitable parameters¶
To perform well, RF needs suitable values for its hyperparameters. A common approach is to train classifiers with different hyperparameter values and select the most suitable combination.
Scikit-learn makes this easy with RandomizedSearchCV and GridSearchCV. These functions train the classifier multiple times with different hyperparameter values and determine the most suitable combination. Each combination of hyperparameter values is evaluated several times using cross-validation (out-of-sample testing).
Run either the cell with RandomizedSearchCV or the one with GridSearchCV. While grid search may find better hyperparameters, randomized search will likely also find a good solution in a much shorter time.
# define potential parameter values for the RF
parameters_rf = {
'n_estimators': [50, 100, 250, 500, 750], # Number of trees in the forest
'max_depth': [3, 5, 10, 20, 50], # Maximum depth of a tree
}
# Create the optimizer and run the optimization
opt = RandomizedSearchCV(RandomForestClassifier(), parameters_rf, cv=5,
                         scoring="jaccard_micro", n_iter=8, refit=False,
                         n_jobs=-2, verbose=4)
opt.fit(X=train['imagery'], y=train['reference'])
print(f'The optimisation process identified these parameters as the most suitable: {opt.best_params_}')
# Create the optimizer and run the GridSearch optimization
opt = GridSearchCV(RandomForestClassifier(), parameters_rf, cv=5,
                   scoring="jaccard_micro", refit=False, n_jobs=-2, verbose=4)
opt.fit(X=train['imagery'], y=train['reference'])
print(f'The optimisation process identified these parameters as the most suitable: {opt.best_params_}')
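If you want to look beyond the single best combination, both search objects expose a cv_results_ dictionary. The following sketch prints the mean cross-validated score for every tested hyperparameter combination:
# Mean cross-validated score for each tested hyperparameter combination
for params, score in zip(opt.cv_results_['params'],
                         opt.cv_results_['mean_test_score']):
    print(f'{params}: mean CV score = {score:.3f}')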
2.2. Fit¶
The best hyperparameter values identified during cross-validation are then used for training the model on the whole training dataset.
# Train a Random Forest with the best hyperparameters on the whole training set
rf = RandomForestClassifier(**opt.best_params_)
rf.fit(X=train['imagery'], y=train['reference'])
# Mean accuracy on the training data
rf.score(train['imagery'], train['reference'])
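A trained Random Forest also exposes feature_importances_, which can hint at which spectral bands drive the classification. This is optional exploration, not required for the exercise:
# Show the ten spectral bands the forest relies on most
importances = rf.feature_importances_
top_bands = np.argsort(importances)[::-1][:10]
print('Most important bands (0-based index):', top_bands)
print('Their importances:', importances[top_bands])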
2.3. Save/load trained RF for potential future use¶
# save using joblib.dump(object, filename)
model_path = join(model_save_folder, 'RF.joblib')
dump(rf, model_path)
# load using joblib.load(filename)
rf = load(model_path)
3. Model application & evaluation¶
3.1. Loading and preprocessing the data¶
Load a raster to classify. This can be the one that we used for training, but it can also be a different raster with the same number of bands.
By default, the training raster (imagery_path) is used.
# Load raster
raster = image_preprocessing.read_gdal_with_geoinfo(imagery_path, (0,0))
# Flatten the spatial dimensions of the raster
raster_shape = raster['imagery'].shape
raster_flat = raster['imagery'].reshape(raster_shape[0]*raster_shape[1],
raster_shape[2])
# Scale the input data
scaler = StandardScaler()
scaler.fit(raster_flat)
raster_scaled = scaler.transform(raster_flat)
3.2. Applying the classifier¶
The following snippet applies the classifier to the loaded imagery, and then transforms the flattened array back into a raster.
predicted_flat = rf.predict(raster_scaled)
predicted_raster = predicted_flat.reshape(raster_shape[0], raster_shape[1])
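For larger rasters, predicting all ~1.18 million pixels at once can be memory-hungry. A possible alternative (a sketch, not the approach used above) is to predict in chunks and track progress with tqdm:
# Alternative: predict in chunks to limit memory use and show a progress bar
chunks = np.array_split(raster_scaled, 20)
predicted_chunks = [rf.predict(chunk) for chunk in tqdm.tqdm(chunks)]
predicted_flat = np.concatenate(predicted_chunks)
predicted_raster = predicted_flat.reshape(raster_shape[0], raster_shape[1])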
You can also visualise the result:
visualisation_utils.show_classified(loaded_raster['imagery'][:, :, [25, 15, 5]],
loaded_raster['reference'],
predicted_raster, ds_name=ds_name)
3.3. Export resulting raster¶
Export the resulting classified raster into out_path for distribution or further analysis (e.g. validation in GIS).
inference_utils.export_result(out_path, predicted_raster, raster['geoinfo'])
4. Evaluate Classification Result¶
4.1. Apply classifier to the test dataset¶
test_predicted = rf.predict(test['imagery'])
4.2. Compute accuracy metrics¶
print(classification_report(test['reference'], test_predicted,
target_names=class_names[1:]))
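Besides the per-class report, you may also want a single summary number. A short sketch using overall accuracy and Cohen's kappa from scikit-learn (these imports are not part of the notebook's import cell above):
# Overall accuracy and Cohen's kappa as single-number summaries
from sklearn.metrics import accuracy_score, cohen_kappa_score
print(f'Overall accuracy: {accuracy_score(test["reference"], test_predicted):.3f}')
print(f"Cohen's kappa: {cohen_kappa_score(test['reference'], test_predicted):.3f}")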
4.3. Show Confusion Matrix¶
visualisation_utils.show_confusion_matrix(test['reference'], test_predicted,
ds_name=ds_name)
5. Sample Solution¶
We generated this result using the following training parameters (note that using the same parameters will not necessarily yield an identical result, since training is partly random):
- Number of trees: 750
- Maximum tree depth: 50
# Read test reference (test_path should point to the raster holding your test
# reference data; it is not defined in the paths section above, so set it first)
test_arr = image_preprocessing.read_gdal(imagery_path, test_path)
test_flat = test_arr['reference'].reshape(
test_arr['reference'].shape[0]*test_arr['reference'].shape[1])
test_filtered = test_flat[test_flat > 0]
# Read sample result
sample_arr = image_preprocessing.read_gdal(imagery_path, sample_result_path)
sample_flat = sample_arr['reference'].reshape(
sample_arr['reference'].shape[0] * sample_arr['reference'].shape[1])
sample_filtered = sample_flat[test_flat > 0]
# Visualise the sample result
visualisation_utils.show_classified(loaded_raster['imagery'][:, :, [25, 15, 5]],
loaded_raster['reference'],
sample_arr['reference'],
ds_name=ds_name)
# Print a classification report for the sample result
print(classification_report(test_filtered, sample_filtered,
target_names=class_names[1:]))
# Show a Confusion matrix for the sample result
visualisation_utils.show_confusion_matrix(test_filtered, sample_filtered,
ds_name=ds_name)
Optional: Classify an urban scene¶
Try using this notebook to classify an urban scene (Pavia City Centre). Reflect on how the different landscape structure and clearer class definitions influence the classification result.
Pavia City Centre is a common benchmark for hyperspectral data classification and can be obtained from http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes#Pavia_Centre_and_University. You will need to change the paths to the input data and use the read_pavia_centre function to load the MATLAB matrices into numpy.
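A minimal sketch of the required changes (the Pavia paths and the read_pavia_centre call already appear as comments in the cells above):
# Switch the dataset name and the input paths to Pavia City Centre
ds_name = 'pavia_centre'
imagery_path = join(root_path, 'Pavia.mat')
reference_path = join(root_path, 'Pavia_gt.mat')
# Load the .mat files and crop to a tileable shape
loaded_raster = image_preprocessing.read_pavia_centre(
    imagery_path, reference_path, out_shape=(1088, 1088, 102))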
Return to exercises