Opening and exploring spatial data in R

For this module, you’ll need the following packages:

  • sf
  • rgbif

So far, we’ve talked about what spatial data is and why it matters. Now comes the hands-on part — actually getting it into R and having a look around.

When you load spatial data into R, it isn’t stored as a plain table like a spreadsheet. Instead, it lives in special objects that keep track of both the data (species names, counts, measurements) and the geometry (where those things are). Think of it as a spreadsheet with a built-in map attached, which is essentially what it is.

If we had data from a primate survey, it might look like this:

Site ID   Species seen   Temperature   Location
1         Yes            27.4°C        (x₁, y₁)
2         No             24.1°C        (x₂, y₂)

The Location column isn’t plain text — it’s geometry. It contains the coordinates that allow R to map and spatially analyse those data. This is the fundamental structure of a simple features object, which is what the sf package works with (Pebesma 2018). The “simple features” standard is an international specification for how geographic vector data should be represented, and sf is its R implementation.
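
To make that structure concrete, here is a minimal sketch that builds the two-row survey table as an sf object. The column names and the longitude/latitude values are invented for illustration (the table above only gives placeholder coordinates), and WGS84 (EPSG:4326) is assumed as the coordinate system.

```r
library(sf)

# Hypothetical survey table; the coordinates are made up for illustration
survey <- data.frame(
  site_id      = c(1, 2),
  species_seen = c("Yes", "No"),
  temperature  = c(27.4, 24.1),      # degrees Celsius
  lon          = c(-84.01, -83.59),
  lat          = c(10.43, 9.75)
)

# Promote the lon/lat columns to a single geometry column (assumed WGS84)
survey_sf <- st_as_sf(survey, coords = c("lon", "lat"), crs = 4326)

class(survey_sf)        # "sf" "data.frame"
names(survey_sf)        # attribute columns plus "geometry"
st_geometry(survey_sf)  # the geometry column on its own
```

The result is still a data frame, so ordinary column access such as survey_sf$temperature works as usual; the geometry simply travels along with it.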

Where does our species data come from?

We’ll use the Global Biodiversity Information Facility (GBIF) — a huge open-access database of biodiversity observations compiled from museum collections, citizen science platforms, research surveys, and automated monitoring. It currently holds over 2 billion occurrence records and is one of the primary data sources in contemporary macroecology and conservation biology (Chamberlain & Boettiger 2017). The rgbif package lets us query it directly from R, which means we can pull real data, from a real database, and work with it immediately.

That said, GBIF data comes with important caveats. Records are contributed by many sources with wildly varying standards of accuracy and completeness. Coordinates can be imprecise, duplicated, or wrong. Sampling effort is deeply uneven — well-studied areas near roads and research stations are heavily over-represented. We will deal with all of this carefully in Module 7. For now, we load and explore.

Loading packages

# install.packages(c("sf", "rgbif"))  # only run once
library(sf)
library(rgbif)

Downloading howler monkey records

Let’s download occurrence records for the mantled howler monkey (Alouatta palliata). To keep things quick, we’ll start with a modest download of just 300 records. We’ll work with a fuller, geographically filtered dataset from Module 4 onwards.

howler <- occ_search(
  scientificName = "Alouatta palliata",
  limit = 300,
  hasCoordinate = TRUE  # only return records with usable coordinates
)

Turning it into spatial data

Right now howler$data is a plain data frame (a tibble): rows and columns, nothing spatial. To make it spatial, we tell st_as_sf() which columns hold the coordinates and which coordinate reference system (CRS) they use. We’ll discuss coordinate systems properly in Module 3; for now, crs = 4326 refers to WGS84, the standard global system used by GPS receivers and GBIF.

howler_sf <- st_as_sf(
  howler$data,
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)
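
One practical detail: by default st_as_sf() stops with an error if the coordinate columns contain missing values. Our query sidesteps this with hasCoordinate = TRUE, but tables from other sources may not be so tidy. The sketch below uses a small made-up table (not GBIF data) to show one way to drop incomplete rows before converting.

```r
library(sf)

# Hypothetical stand-in for an occurrence table; values are made up
occ <- data.frame(
  species          = "Alouatta palliata",
  decimalLongitude = c(-84.01, NA, -79.85),
  decimalLatitude  = c(10.43, 9.75, 9.15)
)

# Keep only rows where both coordinates are present
has_xy <- complete.cases(occ[, c("decimalLongitude", "decimalLatitude")])

occ_sf <- st_as_sf(
  occ[has_xy, ],
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

nrow(occ)     # rows in the original table
nrow(occ_sf)  # rows that survive as points
```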

howler_sf
## Simple feature collection with 300 features and 105 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -94.50629 ymin: -3.743768 xmax: -77.36822 ymax: 18.37124
## Geodetic CRS:  WGS 84
## # A tibble: 300 × 106
##    key        scientificName  issues datasetKey publishingOrgKey installationKey
##  * <chr>      <chr>           <chr>  <chr>      <chr>            <chr>          
##  1 5938036987 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  2 5938118541 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  3 5938193003 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  4 5938235647 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  5 5938269984 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  6 5938276988 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  7 5938279221 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  8 5938302462 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
##  9 5938375614 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
## 10 5938407216 Alouatta palli… cdc,c… 50c9509d-… 28eb1a3f-1c15-4… 997448a8-f762-…
## # ℹ 290 more rows
## # ℹ 100 more variables: hostingOrganizationKey <chr>, publishingCountry <chr>,
## #   protocol <chr>, lastCrawled <chr>, lastParsed <chr>, crawlId <int>,
## #   basisOfRecord <chr>, occurrenceStatus <chr>, taxonKey <int>,
## #   kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
## #   familyKey <int>, genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>,
## #   acceptedScientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>, …

You’ll see rows of observations, columns like country, stateProvince, and eventDate, and, crucially, a geometry column of type POINT. That geometry column is what makes this a spatial object. Everything else in the sf package flows from it.
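
The header lines of that printout can also be pulled out one at a time. The sketch below uses a tiny stand-in object with made-up points so that it runs without a download; the same calls should work on howler_sf.

```r
library(sf)

# Small stand-in for howler_sf; coordinates are illustrative only
pts <- st_as_sf(
  data.frame(
    countryCode      = c("CR", "PA", "MX"),
    decimalLongitude = c(-84.01, -79.85, -94.50),
    decimalLatitude  = c(10.43, 9.15, 18.37)
  ),
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

st_crs(pts)$epsg               # EPSG code of the coordinate system
st_bbox(pts)                   # xmin, ymin, xmax, ymax across all points
unique(st_geometry_type(pts))  # geometry type(s) present
st_drop_geometry(pts)          # plain data frame with the geometry removed
```

st_drop_geometry() is handy when you want to pass the attribute table to code that does not understand sf objects.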

summary(howler_sf)
##      key            scientificName        issues           datasetKey       
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  publishingOrgKey   installationKey    hostingOrganizationKey
##  Length:300         Length:300         Length:300            
##  Class :character   Class :character   Class :character      
##  Mode  :character   Mode  :character   Mode  :character      
##                                                              
##                                                              
##                                                              
##                                                              
##  publishingCountry    protocol         lastCrawled         lastParsed       
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     crawlId      basisOfRecord      occurrenceStatus      taxonKey      
##  Min.   :196.0   Length:300         Length:300         Min.   :2436649  
##  1st Qu.:594.0   Class :character   Class :character   1st Qu.:2436649  
##  Median :594.0   Mode  :character   Mode  :character   Median :2436649  
##  Mean   :579.2                                         Mean   :2830703  
##  3rd Qu.:594.0                                         3rd Qu.:2436649  
##  Max.   :594.0                                         Max.   :7342629  
##                                                                         
##    kingdomKey   phylumKey     classKey      orderKey     familyKey      
##  Min.   :1    Min.   :44   Min.   :359   Min.   :798   Min.   :3239607  
##  1st Qu.:1    1st Qu.:44   1st Qu.:359   1st Qu.:798   1st Qu.:3239607  
##  Median :1    Median :44   Median :359   Median :798   Median :3239607  
##  Mean   :1    Mean   :44   Mean   :359   Mean   :798   Mean   :3239607  
##  3rd Qu.:1    3rd Qu.:44   3rd Qu.:359   3rd Qu.:798   3rd Qu.:3239607  
##  Max.   :1    Max.   :44   Max.   :359   Max.   :798   Max.   :3239607  
##                                                                         
##     genusKey         speciesKey      acceptedTaxonKey  acceptedScientificName
##  Min.   :2436647   Min.   :2436649   Min.   :2436649   Length:300            
##  1st Qu.:2436647   1st Qu.:2436649   1st Qu.:2436649   Class :character      
##  Median :2436647   Median :2436649   Median :2436649   Mode  :character      
##  Mean   :2436647   Mean   :2436649   Mean   :2830703                         
##  3rd Qu.:2436647   3rd Qu.:2436649   3rd Qu.:2436649                         
##  Max.   :2436647   Max.   :2436649   Max.   :7342629                         
##                                                                              
##    kingdom             phylum             order              family         
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     genus             species          genericName        specificEpithet   
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   taxonRank         taxonomicStatus    iucnRedListCategory dateIdentified    
##  Length:300         Length:300         Length:300          Length:300        
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  coordinateUncertaintyInMeters  continent         stateProvince     
##  Min.   :     1.0              Length:300         Length:300        
##  1st Qu.:     8.0              Class :character   Class :character  
##  Median :    39.5              Mode  :character   Mode  :character  
##  Mean   :  5498.2                                                   
##  3rd Qu.:   724.0                                                   
##  Max.   :125919.0                                                   
##  NA's   :50                                                         
##       year          month            day         eventDate        
##  Min.   :2026   Min.   :1.000   Min.   : 1.00   Length:300        
##  1st Qu.:2026   1st Qu.:1.000   1st Qu.: 6.75   Class :character  
##  Median :2026   Median :1.000   Median :12.50   Mode  :character  
##  Mean   :2026   Mean   :1.277   Mean   :13.83                     
##  3rd Qu.:2026   3rd Qu.:2.000   3rd Qu.:22.00                     
##  Max.   :2026   Max.   :2.000   Max.   :31.00                     
##                                                                   
##  startDayOfYear   endDayOfYear     modified         lastInterpreted   
##  Min.   : 1.00   Min.   : 1.00   Length:300         Length:300        
##  1st Qu.:10.00   1st Qu.:10.00   Class :character   Class :character  
##  Median :23.00   Median :23.00   Mode  :character   Mode  :character  
##  Mean   :22.41   Mean   :22.41                                        
##  3rd Qu.:32.00   3rd Qu.:32.00                                        
##  Max.   :54.00   Max.   :54.00                                        
##                                                                       
##   references          license          isSequenced      identifier       
##  Length:300         Length:300         Mode :logical   Length:300        
##  Class :character   Class :character   FALSE:300       Class :character  
##  Mode  :character   Mode  :character                   Mode  :character  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##     facts            relations         isInCluster     datasetName       
##  Length:300         Length:300         Mode :logical   Length:300        
##  Class :character   Class :character   FALSE:300       Class :character  
##  Mode  :character   Mode  :character                   Mode  :character  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##   recordedBy        identifiedBy       nucleotideSequence geodeticDatum     
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     class           countryCode        recordedByIDs      identifiedByIDs   
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   gbifRegion          country          publishedByGbifRegion rightsHolder      
##  Length:300         Length:300         Length:300            Length:300        
##  Class :character   Class :character   Class :character      Class :character  
##  Mode  :character   Mode  :character   Mode  :character      Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  identifier.1       http...unknown.org.nick verbatimEventDate 
##  Length:300         Length:300              Length:300        
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  http...unknown.org.crawl_attempt http...unknown.org.status verbatimLocality  
##  Length:300                       Length:300                Length:300        
##  Class :character                 Class :character          Class :character  
##  Mode  :character                 Mode  :character          Mode  :character  
##                                                                               
##                                                                               
##                                                                               
##                                                                               
##     gbifID          collectionCode     occurrenceID         taxonID         
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  http...unknown.org.captive_cultivated catalogNumber      institutionCode   
##  Length:300                            Length:300         Length:300        
##  Class :character                      Class :character   Class :character  
##  Mode  :character                      Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   eventTime         identificationID       name           informationWithheld
##  Length:300         Length:300         Length:300         Length:300         
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##   projectId         recordedByIDs.type recordedByIDs.value occurrenceRemarks 
##  Length:300         Length:300         Length:300          Length:300        
##  Class :character   Class :character   Class :character    Class :character  
##  Mode  :character   Mode  :character   Mode  :character    Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##      gadm               sex             lifeStage         dynamicProperties 
##  Length:300         Length:300         Length:300         Length:300        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    vitality         identifiedByIDs.type identifiedByIDs.value
##  Length:300         Length:300           Length:300           
##  Class :character   Class :character     Class :character     
##  Mode  :character   Mode  :character     Mode  :character     
##                                                               
##                                                               
##                                                               
##                                                               
##  infraspecificEpithet individualCount samplingProtocol     habitat         
##  Length:300           Min.   :1.000   Length:300         Length:300        
##  Class :character     1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character     Median :1.000   Mode  :character   Mode  :character  
##                       Mean   :2.667                                        
##                       3rd Qu.:5.000                                        
##                       Max.   :7.000                                        
##                       NA's   :291                                          
##    locality         http...unknown.org.griddedDataset
##  Length:300         Length:300                       
##  Class :character   Class :character                 
##  Mode  :character   Mode  :character                 
##                                                      
##                                                      
##                                                      
##                                                      
##  identificationVerificationStatus vernacularName    
##  Length:300                       Length:300        
##  Class :character                 Class :character  
##  Mode  :character                 Mode  :character  
##                                                     
##                                                     
##                                                     
##                                                     
##  http...unknown.org.omitFromScheduledCrawl http...unknown.org.declaredCount
##  Length:300                                Length:300                      
##  Class :character                          Class :character                
##  Mode  :character                          Mode  :character                
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##           geometry  
##  POINT        :300  
##  epsg:4326    :  0  
##  +proj=long...:  0  
##                     
##                     
##                     
## 
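
With more than a hundred columns, the full summary is hard to read. A common approach is to pick out a few columns of interest first. The sketch below uses a small made-up object with GBIF-style column names; the point is that selecting columns from an sf object keeps the geometry column attached.

```r
library(sf)

# Small stand-in with GBIF-style column names; values are made up
pts <- st_as_sf(
  data.frame(
    countryCode                   = c("CR", "CR", "PA"),
    year                          = c(2026, 2026, 2026),
    coordinateUncertaintyInMeters = c(8, NA, 724),
    decimalLongitude              = c(-84.01, -83.59, -79.85),
    decimalLatitude               = c(10.43, 9.75, 9.15)
  ),
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

# Selecting columns by name keeps the geometry column automatically
small <- pts[, c("countryCode", "coordinateUncertaintyInMeters")]
names(small)

# Drop the geometry first if only the attribute summary is wanted
summary(st_drop_geometry(small))
```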

A first map

plot(st_geometry(howler_sf),
     main = "Mantled howler monkey — GBIF occurrences",
     pch = 19, cex = 0.5, col = "tomato")

Note that we use st_geometry() here rather than just plot(howler_sf). The latter would draw a separate small map for each attribute column (by default only the first few, with a warning that the rest were skipped), which is technically valid R but not what we want right now.
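
If you do want the points coloured by an attribute, a middle road is to select a single column before plotting. Here is a minimal sketch with made-up points; countryCode is stored as a factor so that it is treated as a categorical variable.

```r
library(sf)

# Small stand-in for howler_sf; coordinates are illustrative only
pts <- st_as_sf(
  data.frame(
    countryCode      = factor(c("CR", "PA", "MX")),
    decimalLongitude = c(-84.01, -79.85, -94.50),
    decimalLatitude  = c(10.43, 9.15, 18.37)
  ),
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326
)

# Geometry only: one panel, no attribute colouring
plot(st_geometry(pts), pch = 19, col = "tomato")

# One named attribute: a single panel coloured by that column
plot(pts["countryCode"], pch = 19)
```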

Even this basic map is informative. Records cluster in certain areas, and large parts of the species’ potential range look empty. This could reflect genuine habitat preferences. For example, howlers are forest-dependent and associate strongly with lowland and mid-elevation tropical forest (Fedigan & Jack 2001). But it could equally reflect where researchers happened to work, or where road access made surveys feasible.

This ambiguity (signal versus sampling) is one of the central challenges of biodiversity informatics, and we’ll address it directly in Module 5.

Exploring the data

Let’s look at a few useful fields to get a feel for what we’re working with.

# Which countries do records come from?
table(howler_sf$countryCode)
## 
##  CR  EC  HN  MX  NI  PA 
## 239   5   3   4  12  37
# Date range of records
# (eventDate is stored as text; full ISO 8601 timestamps sort chronologically)
range(howler_sf$eventDate, na.rm = TRUE)
## [1] "2026-01-01T08:34:35" "2026-02-23T06:09:37"
# How many records lack a date?
sum(is.na(howler_sf$eventDate))
## [1] 0

This 300-record sample spans six countries but only the first two months of 2026, so it shows little of GBIF’s time depth. The database as a whole aggregates historical museum specimens alongside recent field surveys: a specimen collected in 1952 and a camera trap photo from last year sit in the same dataset, treated equivalently unless you filter by date or some other criterion. For modelling present-day distributions, this is worth keeping in mind.
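
Filtering by date first requires turning the text in eventDate into real dates. The sketch below works on a short made-up vector rather than the downloaded records, and assumes each value starts with a YYYY-MM-DD date. Anything that does not fit that pattern becomes NA rather than raising an error. The year-2000 cut-off is arbitrary.

```r
# Hypothetical eventDate strings, with and without a time component
event_date <- c("2026-01-01T08:34:35", "1952-07-14", "2026-02-23T06:09:37", NA)

# Keep the first 10 characters (YYYY-MM-DD) and parse them as dates
obs_date <- as.Date(substr(event_date, 1, 10), format = "%Y-%m-%d")

range(obs_date, na.rm = TRUE)  # earliest and latest parsed date
table(format(obs_date, "%Y"))  # records per year

# Logical index for records from 2000 onwards (cut-off chosen arbitrarily)
recent <- !is.na(obs_date) & obs_date >= as.Date("2000-01-01")
sum(recent)
```

The same logical index could be used to subset an sf object, for example howler_sf[recent, ], once obs_date has been computed from its eventDate column.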

A note on data quality

It’s tempting to treat downloaded data as ready to use. It isn’t, at least not yet. GBIF records have been shown to contain substantial proportions of coordinate errors, including points placed in the ocean, at country centroids (a common georeferencing shortcut), and at 0°, 0° in the Gulf of Guinea (Maldonado et al. 2015). These are not rare edge cases; they’re common enough to meaningfully bias analyses if left unchecked.
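
As a taste of what such checks look like, here is a minimal sketch that flags three simple problems in a made-up coordinate table: points at exactly 0°, 0°, coordinates outside the valid longitude/latitude range, and repeated coordinate pairs. It is not the Module 7 workflow, and it does not catch ocean points or country centroids, which need reference data to detect.

```r
# Hypothetical occurrence table with deliberately planted problems
occ <- data.frame(
  key              = c("a", "b", "c", "d", "e"),
  decimalLongitude = c(-84.01, 0, -84.01, -200, -79.85),
  decimalLatitude  = c(10.43, 0, 10.43, 9.15, 9.15)
)

lon <- occ$decimalLongitude
lat <- occ$decimalLatitude

occ$zero_zero    <- lon == 0 & lat == 0              # the 0°, 0° placeholder
occ$out_of_range <- abs(lon) > 180 | abs(lat) > 90   # impossible coordinates
occ$duplicate_xy <- duplicated(occ[, c("decimalLongitude", "decimalLatitude")])

# Number of records raised by each flag
colSums(occ[, c("zero_zero", "out_of_range", "duplicate_xy")])
```

Flagging rather than deleting keeps the decision about what to remove separate from the detection step, which makes it easier to review borderline cases.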

For now, we’re just getting familiar with how spatial data looks and behaves in R. Cleaning and quality control come in Module 7.

References

Chamberlain, S. & Boettiger, C. (2017). R Python, and Ruby clients for the GBIF species occurrence API. PeerJ Preprints, 5, e3304v1.

Fedigan, L.M. & Jack, K. (2001). Neotropical primates in a regenerating Costa Rican dry forest: a comparison of howler and capuchin population patterns. International Journal of Primatology, 22, 689–713.

Maldonado, C., Molina, C.I., Zizka, A., Persson, C., Taylor, C.M., Albán, J., Chilquillo, E., Rønsted, N. & Antonelli, A. (2015). Estimating species diversity and distribution in the era of Big Data. Global Ecology and Biogeography, 24, 1305–1317.

Pebesma, E. (2018). Simple features for R: standardized support for spatial vector data. The R Journal, 10, 439–446.
