• Download portal_mammals.sqlite.
  • Make sure the copy you are going to use in class does not have the SpeciesCounts table or view.
  • We can work with data in databases directly from R
  • No need to export files from the database
  • Lets the database do the heavy lifting
    • Faster
    • No RAM limits
  • We can do this either directly using SQL and the DBI package
  • Or work with the data just like we’ve done before using dplyr

Installation

  • DBI for general database functionality
  • RSQLite for translating between R and SQLite
  • dbplyr for integrating databases with dplyr
install.packages(c("DBI", "dbplyr", "RSQLite"))

Connect

library(DBI)
library(RSQLite)
library(dplyr)

portaldb <- dbConnect(SQLite(), "portal.sqlite")
  • It’s most common for this code to be written as
portaldb <- dbConnect(RSQLite::SQLite(), "portal.sqlite")
portaldb

To avoid needing to load the RSQLite package using library()

Check out database structure

  • Once connected to a database we can list the tables
dbListTables(portaldb)
  • We can also look at the details of individual tables
dbListFields(portaldb, "plots")

Connecting to tables

  • We can also connect to individual tables
surveys <- tbl(portaldb, "surveys")
surveys
  • The data is still in the database, not in R, so we can’t tell how many rows it has or view it in the same way we would view a data frame.
  • If we want to load all of the data from the table into an R data frame we use the collect() function
surveys_df <- collect(surveys)

Write a query

  • We can interact with the data in the database in two ways
  • First, we can write queries in SQL
  • Write a query to extract counts for each species_id
count_query <- "SELECT species_id, COUNT(*)
                FROM surveys
                GROUP BY species_id"

dbGetQuery(portaldb, count_query)
  • This uses DBI to run the query and return it to the R as a data frame

  • Alternatively we can use the tbl() function from dplyr to create a table based on the query

tbl(portaldb, sql(count_query))
  • The table is still stored in the database
  • Number of rows is unknown as shown by ??
  • When we have the results we want we can use collect() to load them into R as a data frame
count_data <- tbl(portaldb, sql(count_query)) %>%
              collect()

Using dplyr pipelines with databases

  • We can also use dplyr commands directly on databases
  • To obtain the same results as our previous query using dplyr we use group_by() and summarize()
species_counts <- surveys %>%
               group_by(species_id) %>%
               summarize(count = n())
  • All of the calculation still happens in the databases
  • So outside of RAM calculations are possible
  • We can then bring the resulting data into R using collect() for further analysis
species_counts <- surveys %>%
               group_by(species_id) %>%
               summarize(count = n()) %>%
               collect()

Do Links to Databases.

Write new information to database

  • Can also move data we created in R into the database using copy_to()
  • We can see that the current version of the database only has the three original tables
dbListTables(portaldb)
  • If we wanted to store our new species_counts table in the database
copy_to(portaldb, species_counts, temporary=FALSE, 
        name="SpeciesCounts")
dbListTables(portaldb)