Media API Demo

Code here written by Erica Krimmel.

General Overview

In this demo we will cover how to:

  1. Write a query to search for specimens using idig_search_media
  2. Download media records

Load Packages

# Load core libraries; install these packages if you have not already
library(ridigbio)
library(tidyverse)

# Load library for making nice HTML output
library(kableExtra)

Write a query to search for specimen records

First, you need to find all the media records for which you are interested in downloading media files. Do this using the idig_search_media function from the ridigbio package, which allows you to search for media records based on data contained in linked specimen records, like species or collecting locality. You can learn more about this function from the iDigBio API documentation and ridigbio documentation. In this example, we want to search for images of herbarium specimens of species in the genus Acer that were collected in the United States.

# Edit the fields (e.g. `genus`) and values (e.g. "manis") in `list()` 
# to adjust your query and the fields (e.g. `uuid`) in `fields` to adjust the
# columns returned in your results; edit the number after `limit` to adjust the
# number of records you will retrieve images for
records <- idig_search_media(rq = list(genus = "acer",
                                       country = "united states"), 
            fields = c("uuid",
                       "accessuri",
                       "rights",
                       "format",
                       "records"),
            limit = 10)

records$accessuri <- if_else(grepl("^http://", records$accessuri),
  gsub("^http://", "", records$accessuri),
  records$accessuri
)
records$accessuri <- if_else(grepl("https://mam.ansp.org", records$accessuri),
  gsub("https://mam.ansp.org", "mam.ansp.org", records$accessuri),
  records$accessuri
)
records$accessuri <- if_else(grepl("https://ibss-images.calacademy.org", records$accessuri),
  gsub("https://ibss-images.calacademy.org", "ibss-images.calacademy.org", records$accessuri),
  records$accessuri
)

The result of the code above is a data frame called records:

uuid accessuri rights format records
0000b146-6fd2-4a6a-bf78-9e709cc995e9 mam.ansp.org/image/CM/Fullsize/345/CM345773.jpg BY-NC-SA image/jpeg 8af932f6-24c1-4c9c-9963-10b9b584b632
0000b511-67fd-4485-adea-56df8f7e4c66 https://sernecportal.org/imglib/seinet/sernec/NCU_VascularPlants/NCU00418/NCU00418780_01.JPG BY-NC-SA image/jpeg 21de8a6f-bb99-4e95-93d9-881fe1b6a73b
0000d1cd-8211-45c6-8dfa-bd4a9a001aad mam.ansp.org/image/TAWES/Fullsize/0005/TAWES0005954.jpg BY-NC-SA image/jpeg 0a18cad9-2e85-4ae1-8274-68274b058b61
000244a4-6c5c-4e75-9c2e-db585b377935 ibss-images.calacademy.org:80/static/botany/originals/12/df/12dff2af-8a4b-4845-9964-79152fb7ce71.jpg CC0 NA 4aa31647-7fcd-45f0-b039-d1a0a27637d6
000277e9-659b-4e0c-a61b-c5262d33969b https://www.pnwherbaria.org/images/jpeg.php?Image=WTU-V-023351.jpg NA image/jpeg ac4d39f6-8775-48ea-b1d1-cd14c6f60e08
0002acf9-13d4-4318-a53e-4c00e9361a07 https://cch2.org/imglib/cch2/CHSC_VascularPlants/CHSC021/CHSC021794_lg.jpg BY-NC-SA image/jpeg b0d09b22-e30f-4b89-ab03-d4c863b4727a
000495be-df5c-4c01-a951-5656a3fe5ef5 bgbaseserver.eeb.uconn.edu/DATABASEIMAGES/CONN00075691.JPG BY-NC-SA image/jpeg 2642a89e-bcda-4c8c-8b20-3753b37ab990
0005475a-c2c9-4908-9a27-ba9e555d0c43 https://cdn.floridamuseum.ufl.edu/Herbarium/945bc71c-a736-42e2-8b41-7a00bf9529fe BY-NC image/jpeg 197ea7d4-de1f-4c93-ac98-8c81b53fbece
00056e02-50b6-4c62-a975-306cc870dd83 https://data.cyverse.org/dav-anon/iplant/projects/magnoliagrandiFLORA/images/specimens/MISS0038464/MISS0038464.JPG BY-NC-SA image/jpeg 04764068-999e-4902-b30b-4e1d0d23d214
000654a2-02f3-4c14-b28a-7567cb55aa57 https://api.idigbio.org/v2/media/793b0495f464a5403db6918811399c1d BY-NC-SA image/jpeg b8f0b19c-2ea7-4f40-971f-4e92f0f6fd98

Generate a list of URLs

Now that we know what media records are of interest to us, we need to isolate the URLs that link to the actual media files so that we can download them. In this example, we will demonstrate how to download files that are cached on the iDigBio server, as well as the original files hosted externally by the data provider. You likely do not need to download two sets of images, so can choose to comment out the steps related to either “_iDigBio” or “_external” depending on your preference.

# Assemble a vector of iDigBio server download URLs from `records`
mediaurl_idigbio <- records %>% 
  mutate(mediaURL = paste("https://api.idigbio.org/v2/media/", uuid, sep = "")) %>% 
  select(mediaURL) %>% 
  pull()

# Assemble a vector of external server download URLs from `records`
mediaurl_external <- records$accessuri %>% 
  str_replace("\\?size=fullsize", "")

These vectors look like this:

mediaurl_idigbio
##  [1] "https://api.idigbio.org/v2/media/0000b146-6fd2-4a6a-bf78-9e709cc995e9"
##  [2] "https://api.idigbio.org/v2/media/0000b511-67fd-4485-adea-56df8f7e4c66"
##  [3] "https://api.idigbio.org/v2/media/0000d1cd-8211-45c6-8dfa-bd4a9a001aad"
##  [4] "https://api.idigbio.org/v2/media/000244a4-6c5c-4e75-9c2e-db585b377935"
##  [5] "https://api.idigbio.org/v2/media/000277e9-659b-4e0c-a61b-c5262d33969b"
##  [6] "https://api.idigbio.org/v2/media/0002acf9-13d4-4318-a53e-4c00e9361a07"
##  [7] "https://api.idigbio.org/v2/media/000495be-df5c-4c01-a951-5656a3fe5ef5"
##  [8] "https://api.idigbio.org/v2/media/0005475a-c2c9-4908-9a27-ba9e555d0c43"
##  [9] "https://api.idigbio.org/v2/media/00056e02-50b6-4c62-a975-306cc870dd83"
## [10] "https://api.idigbio.org/v2/media/000654a2-02f3-4c14-b28a-7567cb55aa57"
mediaurl_external
##  [1] "mam.ansp.org/image/CM/Fullsize/345/CM345773.jpg"                                                                   
##  [2] "https://sernecportal.org/imglib/seinet/sernec/NCU_VascularPlants/NCU00418/NCU00418780_01.JPG"                      
##  [3] "mam.ansp.org/image/TAWES/Fullsize/0005/TAWES0005954.jpg"                                                           
##  [4] "ibss-images.calacademy.org:80/static/botany/originals/12/df/12dff2af-8a4b-4845-9964-79152fb7ce71.jpg"              
##  [5] "https://www.pnwherbaria.org/images/jpeg.php?Image=WTU-V-023351.jpg"                                                
##  [6] "https://cch2.org/imglib/cch2/CHSC_VascularPlants/CHSC021/CHSC021794_lg.jpg"                                        
##  [7] "bgbaseserver.eeb.uconn.edu/DATABASEIMAGES/CONN00075691.JPG"                                                        
##  [8] "https://cdn.floridamuseum.ufl.edu/Herbarium/945bc71c-a736-42e2-8b41-7a00bf9529fe"                                  
##  [9] "https://data.cyverse.org/dav-anon/iplant/projects/magnoliagrandiFLORA/images/specimens/MISS0038464/MISS0038464.JPG"
## [10] "https://api.idigbio.org/v2/media/793b0495f464a5403db6918811399c1d"

Download media

We can use the download URLs that we assembled in the step above to go and download each media file. For clarity, we will place files in two different folders, based on whether we downloaded them from the iDigBio server or an external server. We will name each file based on its unique identifier.

# Create new directories to save media files in
dir.create("jpgs_idigbio")
dir.create("jpgs_external")

# Assemble another vector of file paths to use when saving media downloaded 
# from iDigBio
mediapath_idigbio <- paste("jpgs_idigbio/", records$uuid, ".jpg", sep = "")

# Assemble another vector of file paths to use when saving media downloaded
# from external servers; please note that it's probably not a great idea to
# assume these files are all jpgs, as we're doing here...
mediapath_external <- paste("jpgs_external/", records$uuid, ".jpg", sep = "")

# Add a check to deal with URLs that are broken links
possibly_download.file = purrr::possibly(download.file, 
                                         otherwise = "cannot download")

#"mode" argument (="wb") in the walk function to download.file. 

# Iterate through the action of downloading whatever file is at each
# iDigBio URL
purrr::walk2(.x = mediaurl_idigbio,
             .y = mediapath_idigbio, possibly_download.file)

# Iterate through the action of downloading whatever file is at each
# external URL
purrr::walk2(.x = mediaurl_external,
             .y = mediapath_external, possibly_download.file)

You should now have two folders, each with ten images downloaded from iDigBio and external servers, respectively. Note that we only downloaded ten images here for brevity’s sake, but you can increase this using the limit argument in the first step. Here is an example of one of the images we downloaded:

[1] “Herbarium specimen of an Acer species collected in the United States