See preliminary results below.
This software builds individual and combined suitability maps for any species captured in the iNaturalist dataset by pulling a rich set of climate, geological, and topographic data from other freely available datasets. The maps show possible habitat ranges for plants, which is useful if you want to develop land with many kinds of species, e.g. for permaculture or industrial agroforestry.
This project requires ~1 TB of data to run yourself:
- bulk global iNaturalist species occurrences via GBIF.org
- TerraClimate 12-month global climate summaries (latest or multi-year)
- SoilGrids 2.0 global grids
- Global TWI
- Global DEM via HydroSHEDS + derived tiles (scripts supplied)
- Global GLiM lithology
- MCD12Q1 land cover classification map (latest)
This system is memory-optimized to run on laptops. I ran it from a USB drive, deliberately making it extra slow so that I had to focus on optimization. As a result, much of the data has to be converted to Cloud Optimized GeoTIFF (COG) format or indexed for quick CSV lookup. All scripts are provided, including an enrichment script that lets you select any taxonomic levels from iNaturalist and build enriched cumulative CSVs to run through our suitability mapping programs. All of this still needs documentation; in the meantime, you can feed the files into a good LLM to get the workflow spelled out for you.
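To illustrate what "indexing for quick CSV lookup" can look like, here is a minimal sketch using only the Python standard library. It records the byte offset of every row per key, so later lookups can seek straight to the rows they need instead of loading the whole file. The function names (`build_offset_index`, `read_rows`) and the `species` column are hypothetical and are not taken from the scripts in this repository. The sketch also assumes that no quoted field contains an embedded newline.

```python
import csv


def build_offset_index(path, key_column):
    """Scan a CSV once and map each key value to the byte offsets of its rows.

    Assumes UTF-8 text and no newlines inside quoted fields.
    """
    index = {}
    with open(path, "rb") as f:
        header = next(csv.reader([f.readline().decode("utf-8")]))
        key_pos = header.index(key_column)
        while True:
            offset = f.tell()          # byte position of the row about to be read
            line = f.readline()
            if not line:
                break                  # end of file
            row = next(csv.reader([line.decode("utf-8")]))
            if len(row) <= key_pos:
                continue               # skip blank or truncated lines
            index.setdefault(row[key_pos], []).append(offset)
    return header, index


def read_rows(path, header, offsets):
    """Fetch only the rows at the given byte offsets, as dicts keyed by header."""
    rows = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            values = next(csv.reader([f.readline().decode("utf-8")]))
            rows.append(dict(zip(header, values)))
    return rows
```

Only the index (keys and integer offsets) stays in memory, which is the point on a laptop: a call such as `read_rows(path, header, index["Picea glauca"])` touches just that species' rows.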
See Collection.md and each folder for how to start gathering the data.
See the iNaturalistOccurrences and Suitability folders for how to start processing that data. Lots of assembly required.
Note that our results use an "artist's touch": we manually adjust the eta-squared priors and manually control the blending between the ML and eta-squared results. The actual realism is species-dependent and requires more survey data, but this is already a great result for fairly naive assumptions.
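As a rough illustration of a manually controlled blend, the sketch below mixes an eta-squared suitability score with a classifier probability through a single hand-set weight. The linear formula, the function names, and the `ml_weight` parameter are assumptions made for this example; the blending in the actual scripts may differ.

```python
def blend_scores(eta_score, ml_prob, ml_weight=0.5):
    """Linear blend of two scores in [0, 1]; ml_weight is the hand-tuned knob."""
    if not 0.0 <= ml_weight <= 1.0:
        raise ValueError("ml_weight must be between 0 and 1")
    return ml_weight * ml_prob + (1.0 - ml_weight) * eta_score


def blend_grid(eta_grid, ml_grid, ml_weight=0.5):
    """Blend two equally shaped grids (lists of rows); None marks a nodata cell."""
    blended = []
    for eta_row, ml_row in zip(eta_grid, ml_grid):
        out_row = []
        for eta_score, ml_prob in zip(eta_row, ml_row):
            if eta_score is None or ml_prob is None:
                out_row.append(None)   # keep nodata where either input lacks a value
            else:
                out_row.append(blend_scores(eta_score, ml_prob, ml_weight))
        blended.append(out_row)
    return blended


# Hypothetical 2x2 example: lean mostly on the eta-squared map.
eta = [[0.2, 0.9], [None, 0.5]]
ml = [[0.8, 0.7], [0.4, 0.1]]
result = blend_grid(eta, ml, ml_weight=0.25)
```

With `ml_weight=0` the output is the pure eta-squared map and with `ml_weight=1` it is the pure classifier map, so the weight can be tuned per species.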
Fairbanks North Star Borough: White Spruce (green) and Poplar/Cottonwood (red) blended suitability maps, mixed in QGIS. The more limited poplar range is likely due to sparse iNaturalist records, as poplars are found everywhere in town.
Where you find good mixes of these, you find one of the best edible mushrooms :P We validated parts of this map by asking local foragers about their previous season. Not bad for a shot in the dark.
Blending eta-squared with the classifier probabilities gives the most convincing results, e.g. the deserts truly are not suitable for most of the plants here:

Eta-squared style empirically-weighted suitability scoring, with stress and reliability modifiers, compared to actual habitat ranges. We next added a machine learning habitat classifier to blend with this.
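For readers unfamiliar with eta-squared, here is a minimal sketch of how it can turn into variable weights. Eta-squared is the share of a variable's total variance explained by group membership (between-group sum of squares divided by total sum of squares). In this example the groups are assumed to be presence sites versus background sites, and a variable that separates them strongly receives a larger weight. The function names and the presence/background grouping are assumptions for illustration, and the stress and reliability modifiers are not sketched here.

```python
def eta_squared(groups):
    """Eta-squared = SS_between / SS_total for a list of groups of numbers."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0                     # constant variable explains nothing
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups if g
    )
    return ss_between / ss_total


def normalize_weights(eta_by_variable):
    """Scale eta-squared values so the weights sum to 1."""
    total = sum(eta_by_variable.values())
    if total == 0:
        n = len(eta_by_variable)
        return {name: 1.0 / n for name in eta_by_variable}
    return {name: eta / total for name, eta in eta_by_variable.items()}


def weighted_suitability(scores, weights):
    """Weighted mean of per-variable suitability scores in [0, 1]."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical values: annual precipitation at presence vs. background sites.
presence_precip = [1.0, 2.0, 3.0]
background_precip = [7.0, 8.0, 9.0]
precip_eta = eta_squared([presence_precip, background_precip])
```

The weights only say how much each variable matters; the per-variable scores passed to `weighted_suitability` still have to come from somewhere else, for example from how close a cell's value is to the range observed at presence sites.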

Initial results overlap well with known habitat, using well-known Oregon species as our test case. Future results will show "agroforestry" profiles where we have valid overlap for dozens of useful cultivable species.
Model performance:
The low F1 scores here are due more to extrapolation than to the original classification accuracy, because we deliberately weaken the classifier to get a larger suitability area.
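The trade-off can be seen on toy data. One way to weaken a classifier is to lower its probability cutoff, which is an assumption here since the exact mechanism used in this project is not spelled out. A lower cutoff marks more cells as suitable, which adds false positives against the occurrence labels and pulls precision, and with it F1, down. The function name and the numbers below are made up for illustration.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when every probability >= threshold is predicted as suitable."""
    tp = fp = fn = 0
    for prob, label in zip(probs, labels):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1                    # suitable and observed
        elif predicted and not label:
            fp += 1                    # suitable but no observation (extrapolated)
        elif label:
            fn += 1                    # observed but predicted unsuitable
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 1 = species observed in the cell, 0 = background cell.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
strict = f1_at_threshold(probs, labels, 0.7)    # small predicted area
loose = f1_at_threshold(probs, labels, 0.25)    # larger predicted area
widest = f1_at_threshold(probs, labels, 0.1)    # everything predicted suitable
```

On this toy data F1 falls as the predicted area grows, even though the underlying probabilities never change. Some of those "false positives" may be suitable cells that simply have no records yet, which is why a lower F1 is not automatically a worse map.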
Leaky XGBoost model that overfits around the actual observation sites (minus coordinates), useful for blending more closely with the occurrence data ground truth:
This was our first attempt. It did not do background sampling correctly, but it is still useful as a narrower model.
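For context, background (pseudo-absence) sampling usually means drawing random points from the study area that are not right next to a known occurrence, so the classifier has something to contrast presences against. The sketch below shows one common approach under stated assumptions: the bounding box, the 5 km buffer, and the function names are illustrative and are not taken from this repository's scripts.

```python
import math
import random


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def sample_background(presences, bbox, n, min_km=5.0, seed=0, max_tries=100000):
    """Draw up to n random points in bbox at least min_km from every presence.

    bbox is (lat_min, lat_max, lon_min, lon_max). Sampling is uniform in
    degrees, not in area, which is a simplification that matters at high
    latitudes or for very large boxes.
    """
    rng = random.Random(seed)          # seeded for reproducible samples
    lat_min, lat_max, lon_min, lon_max = bbox
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        lat = rng.uniform(lat_min, lat_max)
        lon = rng.uniform(lon_min, lon_max)
        far_enough = all(
            haversine_km(lat, lon, p_lat, p_lon) >= min_km
            for p_lat, p_lon in presences
        )
        if far_enough:
            points.append((lat, lon))
    return points


# Hypothetical presence record and study area in western Oregon.
presences = [(44.0, -123.0)]
background = sample_background(presences, (43.0, 45.0, -124.0, -122.0), n=50)
```

The linear scan over all presences is fine for a sketch but slow for millions of records; a spatial index or a rasterized exclusion mask would be the usual replacement.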
Leaky model (overfits around observation sites), with very high F1 due to less extrapolation:

We also built a community model on top of this to look at multi-species probabilities; we are still testing it.
Numeric comparisons:
Raw occurrence data. The ML model overlaps very strongly with the clusters here:

Pseudotsuga menziesii (Douglas Fir) is the most generalized, Alnus rubra (Red Alder) likes the hills, Arbutus menziesii (Pacific Madrone) favors the savannahs in the Willamette Valley, and Kopsiopsis strobilacea is parasitic on the Pacific Madrone but found only in southwestern Oregon and northwestern California. Overall we get great overlap with their actual ranges and habitat preferences, with distinct hill-favoring and valley-favoring species, and the parasitic species occupying a subsection of the wider Madrone range.
This is free to use for any reason under an MIT License. It's not trivial to set up by any means but we may streamline it more as we continue exploring this framework. If you find any interest in using or improving it, feel free to fork or contribute to this repository. We're all in this together!
