Preparing occurrence data for modelling

For this module, you’ll need the following packages:

sf
terra
dplyr
rgbif
rnaturalearth
rnaturalearthdata

In Module 5 we established that GBIF data is a mix of ecological signal and observation effort. In Module 6 we assembled our environmental layers. Now we need to bridge the two: prepare the occurrence data so that it’s clean, spatially appropriate, and paired with environmental values — ready to feed into a statistical model.

This step is often glossed over in tutorials. That’s a mistake. As the saying goes in data science: garbage in, garbage out. An SDM built on uncleaned data can perform well statistically while being ecologically meaningless (Hijmans & Elith 2019).

Load the data

library(sf)
library(terra)
library(dplyr)

load('data/region.RData')
env_stack <- terra::rast('data/env_stack.tif')

And re-download the howler monkey records if they’re not already in your environment:

library(rgbif)

species <- occ_search(
  scientificName = "Alouatta palliata",
  country = "CR",
  limit = 2000,
  hasCoordinate = TRUE
)

howler_sf <- st_as_sf(
  species$data,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

nrow(howler_sf)

## [1] 2000

Step 1: Remove duplicate records

GBIF aggregates records from many sources, and the same observation can appear multiple times — submitted by different institutions, transcribed from the same museum specimen, or simply repeated. Duplicates inflate the apparent importance of heavily recorded locations.

We’ll round coordinates to three decimal places (roughly 100 m resolution) and remove any records that share identical rounded positions.

howler_sf <- howler_sf %>%
  mutate(
    lon_round = round(st_coordinates(.)[, 1], 3),
    lat_round = round(st_coordinates(.)[, 2], 3)
  ) %>%
  distinct(lon_round, lat_round, .keep_all = TRUE)

nrow(howler_sf)

## [1] 1399

Step 2: Remove records with imprecise coordinates

Some GBIF records have whole-number coordinates — e.g. exactly 9.0°N, 84.0°W. These often represent rough centroids or poorly georeferenced historical specimens rather than real GPS locations. We’ll flag and remove them.

coords <- as.data.frame(st_coordinates(howler_sf))
names(coords) <- c("lon", "lat")

# Flag records where both coordinates are whole numbers
imprecise <- coords$lon == round(coords$lon) & coords$lat == round(coords$lat)
cat(sum(imprecise), "records with whole-number coordinates removed\n")

## 0 records with whole-number coordinates removed

howler_sf <- howler_sf[!imprecise, ]

# Also remove any records at exactly 0, 0 — a classic georeferencing error
zero_coords <- coords$lon == 0 & coords$lat == 0
howler_sf   <- howler_sf[!zero_coords, ]

nrow(howler_sf)

## [1] 1399

Step 3: Clip to the study region

Although we filtered by country = "CR" when downloading, GBIF coordinates are not always precise enough to guarantee every point actually falls inside the country polygon. Some records sit just offshore, straddle a border, or have coordinates that place them in a neighbouring country due to georeferencing error. Before doing anything spatial, we should remove these.

# Keep only points that fall within the Costa Rica polygon
howler_sf <- howler_sf[st_within(howler_sf, region, sparse = FALSE)[, 1], ]

nrow(howler_sf)

## [1] 1125

st_within() returns TRUE only for points that fall strictly inside the polygon. Any points on the coast, in the ocean, or across the border are silently removed. Always do this before thinning — there’s no point thinning data you’d discard anyway.

Step 4: Spatial thinning

Even after removing duplicates, GBIF data is spatially uneven. Decades of research near La Selva Biological Station and Corcovado National Park mean those areas are far more densely sampled than remote highland zones. If we feed this directly into a model, the model will effectively learn “this species is associated with areas that ecologists like to visit” — not what we’re after.

Spatial thinning addresses this by keeping at most one record per raster grid cell. It’s worth being clear about why this is appropriate here, and when it might not be.

The rasters we’re using are at 10-minute resolution — each cell covers roughly 18 × 18 km at the equator. Two occurrence points falling in the same cell will always receive identical environmental predictor values when we extract from those rasters, because they occupy the same pixel. Keeping both adds nothing to the model: it’s the same information twice, just inflating the sample size and the apparent weight of heavily sampled locations. One record per cell is therefore not discarding ecological information — it’s discarding redundancy.

If you were working with higher-resolution rasters (1 km WorldClim, LiDAR-derived canopy height, etc.), finer-scale thinning would be appropriate. If you were using very coarse rasters (0.5° or lower), thinning to one point per cell would be extremely aggressive and genuinely wasteful. The right thinning resolution is the raster resolution.

ref_grid <- env_stack[[1]]

# Find which raster cell each presence point falls in
howler_vect    <- vect(howler_sf)
cells          <- terra::cellFromXY(ref_grid, geom(howler_vect)[, c('x', 'y')])
howler_sf$cell <- cells

# Keep one record per cell
howler_thinned <- howler_sf %>%
  group_by(cell) %>%
  slice(1) %>%
  ungroup() %>%
  select(-cell)

nrow(howler_thinned)

## [1] 121

# Visualise before and after
par(mfrow = c(1, 2))
plot(st_geometry(region), main = paste("Before thinning\n(n =", nrow(howler_sf), ")"))
plot(st_geometry(howler_sf), add = TRUE, col = "tomato", pch = 19, cex = 0.4)

plot(st_geometry(region), main = paste("After thinning\n(n =", nrow(howler_thinned), ")"))
plot(st_geometry(howler_thinned), add = TRUE, col = "tomato", pch = 19, cex = 0.4)

par(mfrow = c(1, 1))

The thinned dataset should look less bunched. We’ve traded record volume for spatial representativeness, and because the raster resolution sets the limit on what environmental information is recoverable anyway, we haven’t actually lost anything the model could have used (Boria et al. 2014).

Step 4: Generate background points

Our model needs a reference: what does “not specifically selected by the species” look like environmentally? For this, we generate background points — random locations across the study region.

Background points are sometimes called pseudo-absences. They’re not true absences (we don’t know the species is absent there), but they represent the range of environmental conditions available across the landscape. Without them, a model has no baseline to compare presence conditions against.

We sample from within the region polygon — the same approach we used in Module 5 to generate random points.

set.seed(42)
n_bg  <- nrow(howler_thinned) * 5   # 5x presences is a common rule of thumb
bg_sf <- st_sample(region, size = n_bg)
bg_sf <- st_sf(geometry = bg_sf)

plot(st_geometry(region), main = "Presence (red) and background (grey) points")
plot(st_geometry(bg_sf), add = TRUE, col = adjustcolor("grey50", 0.4), pch = 19, cex = 0.3)
plot(st_geometry(howler_thinned), add = TRUE, col = "tomato", pch = 19, cex = 0.5)
legend("bottomleft", legend = c("Presence", "Background"),
       col = c("tomato", "grey50"), pch = 19, bty = "n")

The 5× ratio is a reasonable starting point; some studies use 10× or more. The key is that background points sample broadly from available environmental space, not just areas near presences (Phillips et al. 2009).

Step 5: Extract environmental values

Now we extract environmental values at both presences and background points, creating the data frame that the model will actually learn from.

# Presence points
pres_vect   <- vect(howler_thinned)
pres_env    <- terra::extract(env_stack, pres_vect)
pres_coords <- as.data.frame(st_coordinates(howler_thinned))
names(pres_coords) <- c("x", "y")
pres_df <- data.frame(pres_coords, occ = 1, pres_env[, -1])

# Background points
bg_vect   <- vect(bg_sf)
bg_env    <- terra::extract(env_stack, bg_vect)
bg_coords <- as.data.frame(st_coordinates(bg_sf))
names(bg_coords) <- c("x", "y")
bg_df <- data.frame(bg_coords, occ = 0, bg_env[, -1])

# Combine and clean
howler_sdm <- rbind(pres_df, bg_df)
howler_sdm <- na.omit(howler_sdm)

nrow(howler_sdm)

## [1] 725

table(howler_sdm$occ)

## 
##   0   1 
## 605 120

head(howler_sdm)

##           x        y occ    bio01    bio02    bio03    bio04    bio05    bio06
## 1 -85.73189 11.03608   1 26.31638 7.191382 71.24990 69.52401 31.83011 21.73693
## 2 -85.59351 11.12194   1 25.75554 7.619792 71.51208 72.19042 31.62700 20.97175
## 3 -84.72051 11.02987   1 26.20991 7.753187 73.59457 74.72079 31.76125 21.22625
## 4 -85.73476 10.95183   1 26.51744 8.294092 72.65295 73.01652 32.79776 21.38172
## 5 -85.61836 10.83835   1 25.83055 8.521063 72.60619 74.77908 32.34800 20.61200
## 6 -85.49022 10.85402   1 22.91079 8.188958 74.82088 69.78344 28.70600 17.76125
##      bio07    bio08    bio09    bio10    bio11 bio12 bio13 bio14    bio15 bio16
## 1 10.09318 26.10009 26.91695 27.35275 25.56411  1739   345     5 87.74148   882
## 2 10.65525 25.55842 26.35017 26.80375 24.94571  1978   328     6 71.11481   873
## 3 10.53500 26.40621 26.67200 27.18171 25.27837  2607   361    43 53.89819  1030
## 4 11.41604 26.16182 26.59708 27.61306 25.78750  1658   340     1 94.26595   868
## 5 11.73600 25.49813 25.87858 26.93763 25.04779  1749   331     2 87.81650   865
## 6 10.94475 22.86354 23.31437 23.80442 22.05571  2694   366    39 50.67036  1013
##   bio17 bio18 bio19 elev
## 1    33   210   193  116
## 2    55   224   338  221
## 3   163   282   629   38
## 4     9   223   129  144
## 5    17   240   165  259
## 6   180   296   675  683

The occ column is what our model will predict: 1 = presence, 0 = background. Every other column is an environmental predictor.

Save your work

save(howler_thinned, bg_sf, howler_sdm, file = 'data/module7_occdata.RData')

Exercise

How many records were lost at each cleaning step — duplicates, imprecise coordinates, clipping to polygon, spatial thinning? Which step had the largest effect?
Try running the clipping step (st_within) before the duplicate removal step. Does the order matter? Why or why not?
Plot the thinned presence points on top of the bio12 (annual precipitation) raster. Do howler records appear to cluster in wetter areas, drier areas, or show no obvious pattern?
Change n_bg to 10× the number of thinned presences. How does table(howler_sdm$occ) change? Why might a very high ratio of background to presence points cause problems for a statistical model?

References

Boria, R.A., Olson, L.E., Goodman, S.M. & Anderson, R.P. (2014). Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecological Modelling, 275, 73–77.

Hijmans, R.J. & Elith, J. (2019). Species Distribution Modeling with R. CRAN vignette. Available at: https://rspatial.org/sdm/

Phillips, S.J., Dudík, M., Elith, J., Graham, C.H., Lehmann, A., Leathwick, J. & Ferrier, S. (2009). Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19, 181–197.

← Previous Next →