Learning Objectives

Start-up challenge

To refresh our memories from yesterday, let’s all begin with an exercise:

  1. Load the “survey” data into R as a data.frame
  2. Show the first 5 rows of the the data frame.
  3. Print the last 20 values of the column named “sex”

Getting a modified version of a data frame

Subsetting data with the subset function

In particular for larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. (Are species names in column 5 or 7? oh, right… they are in column 6). In some cases, in which column the variable will be can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.

You can do operations on a particular column, by selecting it using the $ sign. In this case, the entire column is a vector. For instance, to extract all the weights from our datasets, we can use: surveys$wgt. You can use names(surveys) or colnames(surveys) to remind yourself of the column names.

In some cases, you may want to select more than one column. You can do this using square brackets: surveys[, c("wgt", "sex")].

When analyzing data, though, we often want to look at partial statistics, such as the maximum value of a variable per species or the average value per plot.

We’ve already seen how to select data with logical indexing, using a logical expression such as surveys[surveys$species=="DO",].

Another nice way to do this is the subset() function. For instance, if we just want to look at the animals of the species “DO”:

surveys_DO <- subset(surveys, species == "DO")

Challenge

  1. What does the following do?
## What does the following do? (Try to guess without executing it)
surveys_DO$month[2] <- 8
  1. Use the function subset twice to create a data.frame that contains all individuals of the species “DM” that were collected in 2002.

Question: How many individuals of the species “DM” were collected in 2002?

  1. The logical operators in R are & (to mean AND), | (to mean OR), and ! (to mean NOT). Use the & operator to find the same data subset as in the previous question, without needing to use the subset() function more than once.

Adding a column to our dataset

Sometimes, you may have to add a new column to your dataset that represents a new variable. You can add columns to a data.frame using the function cbind() (column bind). Beware, the additional column must have the same number of elements as there are rows in the data.frame.

In our survey dataset, the species are represented by a 2-letter code (e.g., “AB”), however, we would like to include the species name. The correspondance between the 2 letter code and the names are in the file species.csv. In this file, one column includes the genus and another includes the species. First, we are going to import this file in memory:

species <- read.csv("data/species.csv")

We are then going to use the function match() to create a vector that contains the genus names for all our observations. The function match() takes at least 2 arguments: the values to be matched (in our case the 2 letter code from the surveys data frame held in the column species), and the table that contains the values to be matched against (in our case the 2 letter code in the species data frame held in the column species_id). The function returns the position of the matches in the table, and this can be used to retrieve the genus names:

surveys_spid_index <- match(surveys$species, species$species_id)
surveys_genera <- species$genus[surveys_spid_index]

Now that we have our vector of genus names, we can add it as a new column to our surveys object:

surveys <- cbind(surveys, genus=surveys_genera)

Challenge

Use the same approach to also include the species names in the surveys data frame.

## and check out the result
head(surveys)
##   record_id month day year plot species sex wgt       genus species_name
## 1         1     7  16 1977    2      NL   M  NA     Neotoma     albigula
## 2         2     7  16 1977    3      NL   M  NA     Neotoma     albigula
## 3         3     7  16 1977    2      DM   F  NA   Dipodomys     merriami
## 4         4     7  16 1977    7      DM   M  NA   Dipodomys     merriami
## 5         5     7  16 1977    3      DM   M  NA   Dipodomys     merriami
## 6         6     7  16 1977    1      PF   M  NA Perognathus       flavus

Adding rows

Let’s create a data.frame that contains the information only for the species “DO” and “DM”. We know how to create the data set for each species with the function subset():

surveys_DO <- subset(surveys, species == "DO")
surveys_DM <- subset(surveys, species == "DM")

Similarly to cbind() for columns, there is a function rbind() (row bind) that puts together two data.frame. With rbind() the number of columns and their names must be identical between the two objects:

surveys_DO_DM <- rbind(surveys_DO, surveys_DM)

Challenge

Using a similar approach, construct a new data.frame that only includes data for the years 2000 and 2001.

Removing columns

Just like you can select columns by their positions in the data.frame or by their names, you can remove them similarly.

To remove it by column number:

surveys_noDate <- surveys[, -c(2:4)]
colnames(surveys)
##  [1] "record_id"    "month"        "day"          "year"        
##  [5] "plot"         "species"      "sex"          "wgt"         
##  [9] "genus"        "species_name"
colnames(surveys_noDate)
## [1] "record_id"    "plot"         "species"      "sex"         
## [5] "wgt"          "genus"        "species_name"

The easiest way to remove by name is to use the subset() function. This time we need to specify explicitly the argument select as the default is to subset on rows (as above). The minus sign indicates the names of the columns to remove (note that the column names should not be quoted):

surveys_noDate2 <- subset(surveys, select=-c(month, day, year))
colnames(surveys_noDate2)
## [1] "record_id"    "plot"         "species"      "sex"         
## [5] "wgt"          "genus"        "species_name"

Removing rows

Typically rows are not associated with names, so to remove them from the data.frame, you can do:

surveys_missingRows <- surveys[-c(10, 50:70), ] # removing rows 10, and 50 to 70

Writing to CSV

Now that we’ve made changes to our data frame, let’s see how to write it back out to a CSV file. We read the file in with read.csv(), and we can write the data out with write.csv() (or more generally, write.table()).

Let’s try to write out our data frame with some rows removed:

write.csv(surveys_missingRows, 'data/surveys_missingrows.csv')

This will create the file data/surveys_missingrows.csv and write the data frame to it.

Note that, unlike read.csv(), write.csv() needs to take both the data frame and the filename as arguments.

Open your new file data/surveys_missingrows.csv in a text editor. Does it look as you expected? Compare it to surveys.csv.

You may notice that there is an extra column. This is because data frames in R have not only column names, but also row names. By default, these row names will be written out to the CSV file. You can change this option with the row.names argument to write.csv()

There are several different options when writing a CSV file. You can.use the help functions to see the details of other arguments.

Exercise: Try to write out your data frame without row names, so that you get a CSV file of the same style as the original.

Previous: Data Frames Next: Analyzing data