Open Data Science Europe workshop 2022

Carmelo Bonannella

Carmelo has an MSc in forest systems sciences and technologies, with a specialization in forest resources monitoring and management through geospatial data science applications and time series analysis.
Carmelo is a PhD candidate at Wageningen University and Research (WUR) in the Geo-information Science and Remote Sensing program and works as a research assistant at the OpenGeoHub Foundation.


Sessions

06-13
13:30
90min
Forest species distribution modelling and spatial planning in R
Carmelo Bonannella

In this workshop you will learn how to use ensemble machine learning to predict the realized distribution of forest tree species over Europe in space and time (2000–2020). The lecture will introduce basic concepts of Species Distribution Modeling (SDM), focusing on how to prepare and clean a dataset of target species occurrences, how to select and include absence data in your model, and how to avoid spatial clustering due to preferential sampling.
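One common way to reduce spatial clustering caused by preferential sampling is grid-based thinning: overlay a regular grid and keep at most one occurrence per cell. The sketch below illustrates that idea only; it is not necessarily the method taught in the workshop (which uses R), and the function name `thin_by_grid`, the cell size and the coordinates are all hypothetical.

```python
import random
from collections import defaultdict

def thin_by_grid(points, cell_size, seed=0):
    """Keep at most one (x, y) point per square grid cell of side cell_size."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        # Integer index of the cell containing the point (floor division).
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y))
    # Draw one random point from every occupied cell, in a stable cell order.
    return [rng.choice(cells[key]) for key in sorted(cells)]

# Hypothetical projected coordinates (metres): a dense cluster of three
# points plus two isolated points.
occurrences = [
    (10.0, 12.0), (15.0, 18.0), (40.0, 35.0),  # clustered in one cell
    (950.0, 20.0),
    (120.0, 880.0),
]
thinned = thin_by_grid(occurrences, cell_size=100.0)
print(len(occurrences), "->", len(thinned), "points after thinning")
```

A larger cell size thins more aggressively; choosing it is a trade-off between removing sampling bias and discarding genuine occurrences.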
The ensemble strategy used in this lecture is stacked generalization (Wolpert, 1992): the predictions of all the component models are used to train a meta-learner, which then produces the final predictions. After fitting the model and generating predictions, the lecture will provide additional notes on how to calculate variable importance and the uncertainty of the ensemble model.
Extrapolation/model transferability (i.e. predictions outside the spatiotemporal range used for model calibration) will not be discussed.
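The stacking idea described in this abstract can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the workshop's R code: the two base learners (a gradient-descent logistic regression and k-nearest neighbours), the logistic meta-learner, the 5-fold split and the synthetic presence/absence data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Gradient-descent logistic regression; returns a predict function."""
    w = [0.0] * (len(X[0]) + 1)  # bias followed by one weight per feature
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return lambda x: sigmoid(w[0] + sum(a * b for a, b in zip(w[1:], x)))

def fit_knn(X, y, k=5):
    """k-nearest neighbours: probability = share of presences among neighbours."""
    def predict(x):
        nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))[:k]
        return sum(y[i] for i in nearest) / len(nearest)
    return predict

def stack(X, y, base_fitters, n_folds=5, seed=0):
    """Train a meta-learner on out-of-fold predictions of the base learners."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    meta_X = [None] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        models = [fit([X[i] for i in train], [y[i] for i in train])
                  for fit in base_fitters]
        for i in fold:
            # Each held-out point is scored by models that never saw it.
            meta_X[i] = [m(X[i]) for m in models]
    meta = fit_logistic(meta_X, y)
    base = [fit(X, y) for fit in base_fitters]  # refit base learners on all data
    return lambda x: meta([m(x) for m in base])

# Synthetic data: "presence" is more likely when the first covariate is high.
rng = random.Random(42)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if x[0] + 0.1 * rng.gauss(0, 1) > 0.5 else 0 for x in X]
predict = stack(X, y, [fit_logistic, fit_knn])
print(predict([0.95, 0.5]), predict([0.05, 0.5]))
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for the data the base learners were fitted to, keeps it from simply rewarding the base learner that overfits most.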

Workshop room 2 - C223
06-16
11:45
20min
Combining Machine learning and Earth Observation data for high resolution tree mapping: building a dynamic forest atlas for Europe
Carmelo Bonannella

The talk will describe a data-driven framework based on spatio-temporal ensemble machine learning to produce distribution maps for 16 tree species at high spatial resolution (30 m). A total of 3 million tree occurrence points were used to train different Machine Learning (ML) algorithms: random forest, gradient-boosted trees, generalized linear models, k-nearest neighbors, CART and an artificial neural network. A stack of 585 coarse- and high-resolution covariates representing spectral reflectance, different biophysical conditions and biotic competition was used as predictors for realized distributions, while potential distribution was modeled with environmental predictors only. AUC, logloss and computing time were used to select the three best algorithms to train, for each species, an ensemble model based on stacking with a logistic regressor as the meta-learner. Probability and model uncertainty maps were produced for each species using a time window of 4 years, for a total of 6 distribution maps per species. The ensemble model outperformed or performed as well as the best individual model for all potential species distributions, while for ten species it performed worse than the best individual model in modeling realized distributions. The framework shows how combining continuous and consistent Earth Observation time series data with state-of-the-art ML can be used to derive dynamic distribution maps.
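To make the model selection step concrete, the sketch below computes AUC and logloss for a few candidate models and keeps the best three. The abstract does not say how AUC, logloss and computing time were combined, so the ranking rule here (highest AUC first, then lowest logloss, then shortest computing time) is an assumption, and all predictions and timings are made-up illustrative values.

```python
import math

def log_loss(y_true, p, eps=1e-15):
    """Mean negative log-likelihood of binary labels under probabilities p."""
    total = 0.0
    for yi, pi in zip(y_true, p):
        pi = min(max(pi, eps), 1 - eps)  # keep log() finite
        total -= yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total / len(y_true)

def auc(y_true, p):
    """Chance that a random presence outscores a random absence (ties count half)."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation labels, predicted probabilities and fit times (s).
y_val = [1, 1, 1, 0, 0, 0]
candidates = {
    "random_forest": ([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], 120.0),
    "glm":           ([0.8, 0.6, 0.4, 0.5, 0.3, 0.2], 5.0),
    "knn":           ([0.6, 0.6, 0.4, 0.6, 0.4, 0.2], 30.0),
    "cart":          ([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], 10.0),
}

scores = {
    name: (auc(y_val, p), log_loss(y_val, p), seconds)
    for name, (p, seconds) in candidates.items()
}
# Assumed ranking rule: high AUC, then low logloss, then low computing time.
ranked = sorted(scores, key=lambda n: (-scores[n][0], scores[n][1], scores[n][2]))
best_three = ranked[:3]
print(best_three)
```

The three selected models would then serve as the base learners of the stacked ensemble.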

Conference room - C202