Introduction to R

Learning Objectives

Familiarize participants with R syntax

Understand the concepts of objects and assignment

Understand the concepts of vector and data types

Get exposed to a few functions

Creating variables

You can get output from R simply by typing math in the console:

3 + 5
12/7

However, to do useful and interesting things, we need to assign values to variables. To create a variable, we need to give it a name followed by the assignment operator <- and the value we want to give it:

weight_kg <- 55

Variables can be given almost any name, such as x, current_temperature, or subject_id. You want your object names to be explicit so they help you remember what they represent.

You don’t want variable names to be too long, but they can be somewhat long. RStudio and many other programs have a feature called tab completion. Start typing the name of a function or variable and hit TAB, and RStudio will complete the name if the object is stored in memory. If the name is ambiguous, it will give you a list of options.

Tab completion lets you err on the side of more transparent names.

Variable names cannot start with a number (2x is not valid but x2 is). R is case-sensitive (e.g., weight_kg is different from Weight_kg). Some names cannot be used because they are reserved words in R (e.g., if, else, for; see ?Reserved for a complete list). In general, even if it’s allowed, it’s best not to reuse the names of existing functions and objects (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use. It’s also best to avoid dots (.) within a variable name, as in my.dataset. It is also recommended to use nouns for variable names and verbs for function names.
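The naming rules above can be tried out in the console. The following is a minimal sketch: the values are arbitrary, and make.names() and exists() are base R functions used here only to probe names, not something the lesson itself relies on.

```r
# R is case-sensitive: these are two distinct objects
weight_kg <- 55
Weight_kg <- 60
weight_kg == Weight_kg   # FALSE

# make.names() shows how R would repair an invalid name:
# a name starting with a digit gets an "X" prefix
make.names("2x")         # "X2x"
make.names("x2")         # "x2" (already valid)

# exists() reports whether a name is already in use, e.g. by a function
exists("mean")           # TRUE
```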

When assigning a value to an object, R does not print anything. You can force R to print the value by wrapping the assignment in parentheses or by typing the name of the object:

(weight_kg <- 55)
weight_kg

Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight to pounds (weight in pounds is 2.2 times the weight in kg):

2.2 * weight_kg

We can also change a variable’s value by assigning it a new one:

weight_kg <- 57.5
2.2 * weight_kg

This means that assigning a value to one variable does not change the values of other variables. For example, let’s store the animal’s weight in pounds in a variable.

weight_lb <- 2.2 * weight_kg

and then change weight_kg to 100.

weight_kg <- 100

What do you think is the current content of the object weight_lb? 126.5 or 200?
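The same behaviour can be seen with a different pair of variables. This is only a sketch: length_cm and length_in are made-up names that do not appear elsewhere in the lesson.

```r
# Hypothetical example: store a length, then derive a second value from it
length_cm <- 10
length_in <- length_cm / 2.54   # computed once, from the current length_cm

# Reassigning length_cm does not touch length_in
length_cm <- 20
length_in                       # still 10 / 2.54

# To refresh the derived value, run the calculation again
length_in <- length_cm / 2.54
length_in                       # now 20 / 2.54
```

R stores the result of the calculation, not the formula, so a derived variable only changes when you assign to it again.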

Exercise

What are the values after each statement in the following?

mass <- 47.5           # mass?
age  <- 122            # age?
mass <- mass * 2.0     # mass?
age  <- age - 20       # age?
massIndex <- mass/age  # massIndex?

Commenting

Use # signs to comment. Comment liberally in your R scripts. Anything to the right of a # is ignored by R.
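A short sketch of the comment styles you are likely to meet. The variable height_cm is a made-up example.

```r
# A whole-line comment: R skips this line entirely
height_cm <- 170       # a trailing comment: R runs only the code on the left
# height_cm <- 1000    # commented-out code is not run
height_cm              # 170
```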

Assignment operator

<- is the assignment operator. It assigns the value on its right to the object on its left, like an arrow that points from the value to the object.

Many people use =, which works almost the same way. Some programmers have strong opinions about which is better. In most situations it’s really a matter of preference.

= is also used to specify the values of arguments in functions, for instance read.csv(file="data/some_data.csv").
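One place where the two operators behave differently is inside a function call. The sketch below uses mean() and assumes that no object called x exists yet in your session.

```r
# Inside a function call, = names an argument; it does not create an object
mean(x = c(1, 2, 3))            # 2
exists("x", inherits = FALSE)   # FALSE, assuming no x was defined earlier

# <- inside a call assigns first, then passes the value on
mean(x <- c(4, 5, 6))           # 5
x                               # 4 5 6
```

This is one reason many style guides suggest keeping <- for assignment and = for arguments.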

In RStudio, typing Alt + - (push Alt, the key next to your space bar at the same time as the - key) will write <- in a single keystroke.

Functions and their arguments

Let’s look at a simple function call:

surveys <- read.csv(file="data/surveys.csv")

The file= part inside the parentheses is called an argument, and most functions use arguments. Arguments modify the behavior of the function. Typically, they take some input (e.g., some data, an object) and other options to change what the function will return, or how to treat the data provided.

Most functions can take several arguments, but most of these have default values, so you don’t have to enter them. To see these default values, you can either type args(read.csv) or look at the help for this function (e.g., ?read.csv).

args(read.csv)

## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", 
##     fill = TRUE, comment.char = "", ...) 
## NULL
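If you want to look up a single default rather than read the whole signature, base R also provides formals(). This is a small sketch of that alternative; read_csv_defaults is a made-up object name.

```r
# formals() returns the arguments of a function together with their defaults
read_csv_defaults <- formals(read.csv)

names(read_csv_defaults)     # the same argument names that args(read.csv) shows
read_csv_defaults$header     # TRUE
read_csv_defaults$sep        # ","
```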

If you provide the arguments in exactly the same order as they are defined, you don’t have to name them:

read.csv(file="data/surveys.csv", header=TRUE) # is identical to:
read.csv("data/surveys.csv", TRUE)

However, this is usually not recommended: it requires remembering the argument order, and if you share code that uses less well-known functions, unnamed arguments make it difficult to read. (It is fine, however, to omit argument names for basic functions like mean, min, etc.)

Another advantage of naming arguments is that the order doesn’t matter:

read.csv(file="data/surveys.csv", header=TRUE) # is identical to:
read.csv(header=TRUE, file="data/surveys.csv")
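You can check this without a file on disk. The sketch below makes one assumption: a tiny inline dataset stands in for data/surveys.csv and is passed through the text argument, which read.csv() hands on to read.table(). The column names and values are made up.

```r
# Assumption: made-up inline data replaces the surveys file
csv_text <- "plot_id,weight\n1,40\n2,48"

by_name     <- read.csv(text = csv_text, header = TRUE)
other_order <- read.csv(header = TRUE, text = csv_text)

# Both calls produce the same data frame
identical(by_name, other_order)   # TRUE
```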

Vectors and data types

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It’s a group of values, mainly either numbers or characters. You can assign this list of values to a variable, just like you would for one item. For example we can create a vector of animal weights:

weights <- c(50, 60, 65, 82)
weights

A vector can also contain characters:

animals <- c("mouse", "rat", "dog")
animals

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

length(weights)
length(animals)

class() indicates the class (the type of element) of an object:

class(weights)
class(animals)

The function str() provides an overview of the object and the elements it contains. It is a really useful function when working with large and complex objects:

str(weights)
str(animals)

You can add elements to your vector simply by using the c() function:

weights <- c(weights, 90) # adding at the end
weights <- c(30, weights) # adding at the beginning
weights

Here we take the original vector weights and add one item to the end, then another item to the beginning. We can do this over and over again to build a vector or a dataset. As we program, this can be useful for updating results as we collect or calculate them.
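As a rough sketch of building a vector step by step, the loop below appends one result per iteration. The name squares and the calculation are made up for illustration.

```r
# Hypothetical example: collect squared values one at a time
squares <- c()                   # start with an empty vector
for (i in 1:4) {
  squares <- c(squares, i^2)     # append the newest result at the end
}
squares                          # 1 4 9 16
length(squares)                  # 4
```

For short vectors this pattern is fine. With very large results, growing a vector one element at a time can become slow.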

We just saw 2 of the 6 data types that R uses: "character" and "numeric". The other 4 are:

* "logical" for TRUE and FALSE (the boolean data type)
* "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
* "complex" to represent complex numbers with real and imaginary parts (e.g., 1+4i) and that’s all we’re going to say about them
* "raw" that we won’t discuss further
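A quick way to see each of these types is to pass a single value to class(). The values below are arbitrary examples.

```r
class("mouse")      # "character"
class(50)           # "numeric"
class(TRUE)         # "logical"
class(2L)           # "integer"
class(1+4i)         # "complex"
class(as.raw(255))  # "raw"

# Without the L suffix, a whole number is still stored as numeric
class(2)            # "numeric"
```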

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame) and factors (factor).
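To give a first impression of these structures, here is a small sketch that creates one of each. All object names and values are made up for illustration.

```r
# Made-up values, used only to show what each structure looks like
a_list   <- list(name = "mouse", weight = 50)    # elements may differ in type
a_matrix <- matrix(1:6, nrow = 2)                # 2 rows, 3 columns
a_df     <- data.frame(animal = c("mouse", "rat"),
                       weight = c(50, 60))       # a table made of columns
a_factor <- factor(c("low", "high", "low"))      # categorical values

is.list(a_list)        # TRUE
dim(a_matrix)          # 2 3
is.data.frame(a_df)    # TRUE
levels(a_factor)       # "high" "low"
```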

We will use our “surveys” dataset to explore the data.frame data structure.

R syntax example

Here is an example of a script, which your instructor will discuss.

### This function checks that all the plot IDs used in the survey file (`surveys.csv`)
### are defined in the plots file (`plots.csv`). If all the plot IDs are found, the
### function shows a message and returns `TRUE`, otherwise the function emits a 
### warning, and returns `FALSE`
check_plots <- function(survey_file="data/biology/surveys.csv",
                        plot_file="data/biology/plots.csv") {
  ## load files
  srvy <- read.csv(file=survey_file, stringsAsFactors=FALSE)
  plts <- read.csv(file=plot_file, stringsAsFactors=FALSE)
  
  ## Get unique plot_id
  unique_srvy_plots <- unique(srvy$plot_id)
  
  if (all(unique_srvy_plots %in% plts$plot_id)) {
    message("Everything looks good.")
    return(TRUE)
  } else {
    warning("Something is wrong: some plot IDs not defined.")
    return(FALSE)
  }
}

# Call the function, using the default value for plot_file
check_plots(survey_file="latest_surveys.csv")

# Find the number of rows and columns in the survey data
surveys <- read.csv(file="data/biology/surveys.csv", stringsAsFactors=FALSE)
nrow(surveys)
ncol(surveys)
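The core test inside check_plots combines unique(), %in% and all(). The sketch below runs the same logic on two short made-up vectors, so it works without the data files. The names survey_plots and defined_plots are hypothetical stand-ins for the plot_id columns.

```r
# Hypothetical plot IDs standing in for the plot_id columns of the two files
survey_plots  <- c(1, 2, 2, 3, 5)
defined_plots <- c(1, 2, 3, 4)

unique(survey_plots)                           # 1 2 3 5
unique(survey_plots) %in% defined_plots        # TRUE TRUE TRUE FALSE
all(unique(survey_plots) %in% defined_plots)   # FALSE, because 5 is not defined

# setdiff() lists the IDs that are used but not defined
setdiff(survey_plots, defined_plots)           # 5
```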
