Here is an analogy to start us off. If you were a pilot, R is an airplane. You can use R to go places! With practice you’ll gain skills and confidence; you can fly further distances and get through tricky situations. You will become an awesome pilot and can fly your plane anywhere.

And if R were an airplane, RStudio is the airport. RStudio provides support! It offers runways, communication, and other services, and just makes your overall life easier. So although you can fly your plane without an airport and we could learn R without RStudio, that’s not what we’re going to do.

We are learning R together with RStudio and its many supporting features.

Something else to start us off is to mention that you are learning a new language here. It’s an ongoing process, it takes time, you’ll make mistakes, it can be frustrating, but it will be overwhelmingly awesome in the long run. We all speak at least one language; it’s a similar process, really. And no matter how fluent you are, you’ll always be learning, you’ll be trying things in new contexts, etc, just like everybody else. And just like any form of communication, there will be miscommunications but hands down we are all better off because of it.

While language is a familiar concept, programming languages are in a different context from spoken languages, but you will get to know this context with time. For example: you have a concept that there is a first meal of the day, and there is a name for that: in English it’s “breakfast”. So if you’re learning Spanish, you could expect there is a word for this concept of a first meal. (And you’d be right: ‘desayuno’). We will get you to expect that programming languages also have words (called functions in R) for concepts as well. You’ll soon expect that there is a way to order values numerically. Or alphabetically. Or search for patterns in text. Or calculate the median. Or reorganize columns to rows. Or subset exactly what you want. We will get you to increase your expectations and learn to ask and find what you’re looking for.

OK, let’s get going.

To learn R and RStudio we will be using Dr. Jenny Bryan’s lectures from STAT545 at UBC. I have modified them slightly here for our purposes; to see them in their full and awesome entirety, visit stat545-ubc.github.io. Specifically, we’ll be using these lectures:

Something we won’t cover today but that will be helpful to you in the future is:

The many flavors of R objects

I’ve modified them in part with my own text and in part with text from Software Carpentry’s R for reproducible scientific analysis, specifically:

Seeking help

1 R basics, workspace and working directory, RStudio projects

(modified from Jenny Bryan’s STAT545)

1.1 R at the command line, RStudio goodies

Launch RStudio/R.

Notice the default panes:

Console (entire left)
Environment/History (tabbed in upper right)
Files/Plots/Packages/Help (tabbed in lower right)

FYI: you can change the default location of the panes, among many other things: Customizing RStudio.

There are other great features we don’t really have time for today as we walk through the IDE together. (IDE stands for integrated development environment.) Check out the webinar and RStudio IDE cheatsheet for more. (And this is my blog post about RStudio Awesomeness).

Go into the Console, where we interact with the live R process.

Make an assignment and then inspect the object you just created.

x <- 3 * 4
x

## [1] 12

In my head I hear, e.g., “x gets 12”.

All R statements where you create objects – “assignments” – have this form: objectName <- value.

I’ll write it in the command line with a hashtag #, which is the way R comments so it won’t be evaluated.

# objectName <- value

Object names cannot start with a digit and cannot contain certain other characters such as a comma or a space. You will be wise to adopt a convention for demarcating words in names.

# i_use_snake_case
# other.people.use.periods
# evenOthersUseCamelCase

Make an assignment

this_is_a_really_long_name <- 2.5

To inspect this variable, instead of typing it, we can press the up arrow key and call your command history, with the most recent commands first. Let’s do that, and then delete the assignment:

this_is_a_really_long_name

## [1] 2.5

Another way to inspect this variable is to begin typing this_…and RStudio will automagically have suggested completions for you that you can select by hitting the tab key, then press return.

And another way to inspect this is by looking at the Environment pane in RStudio.

Now, let’s make another assignment

this_is_shorter <- 2 ^ 3

To inspect this, try out RStudio’s completion facility: type the first few characters, press TAB, add characters until you disambiguate, then press return.

this_is_shorter

## [1] 8

One more:

jenny_rocks <- 2

Let’s try to inspect:

jennyrocks

## Error in eval(expr, envir, enclos): object 'jennyrocks' not found

Implicit contract with the computer / scripting language: Computer will do tedious computation for you. In return, you will be completely precise in your instructions. Typos matter. Case matters. Get better at typing.
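To see just how literal R is about names, you can ask it whether an object exists under a given spelling. This is only a small sketch using base R’s exists(); the misspelled variants are made up for illustration.

```r
# Create the object, then ask R which spellings it recognizes.
jenny_rocks <- 2

exists("jenny_rocks")   # the exact name we assigned
exists("jennyrocks")    # missing underscore: a different name to R
exists("Jenny_rocks")   # different case: also a different name to R
```

Only the first spelling matches the object we created; to R the other two are simply names it has never heard of.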

Remember that this is a language, not dissimilar to English! There are times you aren’t understood – it’s going to happen. There are different ways this can happen. Sometimes you’ll get an error. This is like someone saying ‘What?’ or ‘Pardon?’ Error messages can also be more useful, like when they say ‘I didn’t understand this specific part of what you said, I was expecting something else’. That is a great type of error message. Error messages are your friend. Google them (copy-and-paste!) to figure out what they mean.

And also know that there are errors that can creep in more subtly, when you are giving information that is understood, but not in the way you meant. Like if I’m telling a story about pants and suspenders. My story makes sense; I won’t get any ‘pardon’s or errors, but my husband (from Yorkshire) may jump in and say ’I think she means trousers and braces’. This would be a warning message. It’s nice when they happen! Warning messages don’t always happen–like for example if my husband isn’t there to clarify. This can leave me thinking I’ve gotten something across that the listener (or R) interpreted very differently. And as I continue telling my story you get more and more confused… So write clean code and check your work as you go to minimize these circumstances!
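Here is a rough sketch of those three situations in R: an error, a warning, and a silent misunderstanding. It uses tryCatch(), which we won’t cover today, only so that the messages can be captured and shown without stopping the code. The object name no_such_object and the numbers are invented for illustration.

```r
# An error: R cannot carry out the request at all.
error_text <- tryCatch(
  no_such_object + 1,
  error = function(e) conditionMessage(e)
)
error_text

# A warning: R still returns something (here NA) but flags that it looks off.
warning_text <- tryCatch(
  as.numeric("twelve"),
  warning = function(w) conditionMessage(w)
)
warning_text

# No message at all: valid code that may not mean what you intended.
mean(c(1, 2, 30))  # did you mean 3 rather than 30? R cannot know.
```

The last case is the one to watch for: nothing complains, so checking your work as you go is the only safeguard.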

A moment about logical operators and expressions. We can ask questions about the objects we just made.

== means ‘is equal to’
!= means ‘is not equal to’
< means ‘is less than’
> means ‘is greater than’
<= means ‘is less than or equal to’
>= means ‘is greater than or equal to’

jenny_rocks == 2

## [1] TRUE

jenny_rocks <= 30

## [1] TRUE

jenny_rocks != 5

## [1] TRUE

Shortcuts: You will make lots of assignments and the operator <- is a pain to type. Don’t be lazy and use =, although it would work, because it will just sow confusion later. Instead, utilize RStudio’s keyboard shortcut: Alt + - (the minus sign). Notice that RStudio automagically surrounds <- with spaces, which demonstrates a useful code formatting practice. Code is miserable to read on a good day. Give your eyes a break and use spaces. RStudio offers many handy keyboard shortcuts. Also, Alt+Shift+K brings up a keyboard shortcut reference card.

My most common shortcuts include command-Z (undo), and arrow keys in combination with shift/option/command (moving quickly up, down, sideways, with or without highlighting).

1.2 R functions, help pages

R has a mind-blowing collection of built-in functions that are accessed like so

# functionName(name_of_argument1 = value1, name_of_argument2 = value2, and so on)

Let’s try using seq() which makes regular sequences of numbers and, while we’re at it, demo more helpful features of RStudio.

Type se and hit TAB. A pop up shows you possible completions. Specify seq() by typing more to disambiguate or using the up/down arrows to select. Notice the floating tool-tip-type help that pops up, reminding you of a function’s arguments. If you want even more help, press F1 as directed to get the full documentation in the help tab of the lower right pane.

Type the arguments 1, 10 and hit return.

seq(1, 10)

##  [1]  1  2  3  4  5  6  7  8  9 10

We could probably infer that the seq() function makes a sequence, but let’s learn for sure. Type (and you can autocomplete) and let’s explore the help page:

?seq 
help(seq) # same as ?seq
seq(from = 1, to = 10) # same as seq(1, 10); R assumes by position

##  [1]  1  2  3  4  5  6  7  8  9 10

seq(from = 1, to = 10, by = 2)

## [1] 1 3 5 7 9

The above also demonstrates something about how R resolves function arguments. You can always specify in name = value form. But if you do not, R attempts to resolve by position. So above, it is assumed that we want a sequence from = 1 that goes to = 10. Since we didn’t specify step size, the default value of by in the function definition is used, which ends up being 1 in this case. For functions I call often, I might use this resolve by position for the first argument or maybe the first two. After that, I always use name = value.
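A small sketch of the same idea, using only seq(): the four calls below should all describe the same sequence, whether the arguments are matched by position, by name, or a mix of both. The variable names are just labels for this example.

```r
by_position <- seq(1, 10, 3)                   # from, to, by matched by position
by_name     <- seq(from = 1, to = 10, by = 3)  # every argument named
reordered   <- seq(by = 3, to = 10, from = 1)  # named arguments can come in any order
mixed       <- seq(1, 10, by = 3)              # first two by position, the rest by name

by_position
identical(by_position, by_name)
identical(by_position, reordered)
identical(by_position, mixed)
```

Naming the later arguments, as in the last call, tends to be the easiest version to read back in a few weeks.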

The help page tells you the name of the package in the top left, and is broken down into sections:

Description: An extended description of what the function does.
Usage: The arguments of the function and their default values.
Arguments: An explanation of the data each argument is expecting.
Details: Any important details to be aware of.
Value: The data the function returns.
See Also: Any related functions you might find useful.
Examples: Some examples for how to use the function.

The examples can be copy-pasted into the console for you to understand what’s going on. Remember we were talking about expecting there to be a function for something you want to do? Let’s try it. min(), max(), log()…
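As a quick illustration of ‘there is probably a function for that’, here are a few base R functions applied to a made-up vector. The numbers and words are invented; the functions are all part of base R.

```r
values <- c(12, 3, 45, 7, 20)

sort(values)                      # order values numerically
sort(values, decreasing = TRUE)   # ...or from largest to smallest
median(values)                    # the middle value
min(values)
max(values)

sort(c("pear", "apple", "fig"))   # order text alphabetically
grepl("ap", c("pear", "apple"))   # search for a pattern in text
```

Each of these has a help page (?sort, ?median, ?grepl) with more examples to try.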

Exercise: Talk to your neighbor(s) and look up the help file for a function you know. Try the examples, see if you learn anything new. (need ideas? ?getwd(), ?plot(), ?mean(), ?log()).

Help for when you only sort of remember the function name: double-questionmark:

??install

Not all functions have (or require) arguments:

date()

## [1] "Mon Jul 11 23:07:51 2016"

Now look at your workspace – in the upper right pane. The workspace is where user-defined objects accumulate. You can also get a listing of these objects with commands:

objects()

## [1] "jenny_rocks"                "this_is_a_really_long_name"
## [3] "this_is_shorter"            "x"

ls()

## [1] "jenny_rocks"                "this_is_a_really_long_name"
## [3] "this_is_shorter"            "x"

If you want to remove the object named x, you can do this:

rm(x)

To remove everything:

rm(list = ls())

or click the broom in RStudio’s Environment pane.

Exercise: Clear your workspace, then create a few new variables. Create a variable that is the mean of a sequence of 1-20. What’s a good name for your variable? Does it matter what your ‘by’ argument is? Why?

1.3 Working directories, RStudio projects, R scripts

So we will talk about scripts in a moment, but first let’s talk about where they should live.

We’re not going to cover workspaces today, but this is another alternative to scripts. You can learn about it in this RStudio article: Working Directories and Workspaces.

1.3.1 Working directory

Any process running on your computer has a notion of its “working directory”. In R, this is where R will look, by default, for files you ask it to load. It is also where, by default, any files you write to disk will go. You have a sense of this because whenever you go to save a Word doc or download, it asks where. You often have to navigate to put it exactly where you want it. There are a few ways to check your working directory in RStudio.

You can explicitly check your working directory with:

getwd()

It is also displayed at the top of the RStudio console.

As a beginning R user, it’s OK to let your home directory or any other weird directory on your computer be R’s working directory. Very soon, I urge you to evolve to the next level, where you organize your analytical projects into directories and, when working on Project A, set R’s working directory to Project A’s directory.

You can set R’s working directory at the command line like so. You could also do this in a script.

setwd("~/myCoolProject")

But there’s a better way. A way that also puts you on the path to managing your R work like an expert.

1.3.2 RStudio projects

Keeping all the files associated with a project organized together – input data, R scripts, analytical results, figures – is such a wise and common practice that RStudio has built-in support for this via its projects. More here: Using Projects.

Let’s make one to use for the rest of today.

Do this: File > New Project … New Directory > Empty Project. The directory name you choose here will be the project name. Call it whatever you want (or follow me for convenience).

I created a directory and, therefore RStudio project, called software-carpentry in a folder called tmp in my home directory, FYI. What do you notice about your RStudio pane? Look in the right corner–‘software-carpentry’.

Now check that the “home” directory for your project is the working directory of our current R process:

getwd()
# "/Users/julialowndes/tmp/software-carpentry"

About paths: The above is the absolute path: it starts at the root of the file system (/Users/...) and is specific to my computer (julialowndes) and where I saved the project. So if I did an analysis with this filepath, it wouldn’t work on your computer until you altered the filepath.

But with an RStudio project, your paths within this project can be relative, which means they start in the software-carpentry folder, wherever it is. So we can just use filepaths within our project from a relative place, and it can work on your computer or mine, without worrying about the absolute paths. (Analogy: I can give you directions from this building to the pub, since we’re all here in this shared space already. I can’t give you all directions from your home to this building and then the pub, because you all live in different places. But I can give directions relative to this building).
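A minimal sketch of the difference, using base R’s file.path() and getwd(). The relative path below is the one we will use later for the gapminder data; the absolute path it builds depends entirely on your own computer.

```r
# A relative path: interpreted starting from the current working directory.
relative_path <- file.path("data", "gapminder.csv")
relative_path

# The matching absolute path on *this* computer; it will differ on yours.
absolute_path <- file.path(getwd(), relative_path)
absolute_path

# Is the file where R is looking? (FALSE until the data folder is in place.)
file.exists(relative_path)
```

Only the relative version is worth saving in a script, since it keeps working when the project folder moves.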

OK.

Let’s enter a few commands in the Console, as if we are just beginning a project. Since we’re learning a new language here, an example is often the best way to see how things work. So we’re going to make an introductory plot using the cars dataset that is loaded into R. We can pretend this is data just given to us by a collaborator and we’re trying to see if it’s useful for us.

cars

plot(cars)  
z <- line(cars)
abline(coef(z), col = "purple")
dev.print(pdf, "toy_line_plot.pdf")

Exercise with your neighbor: discuss all of the functions we’ve just used here (plot, line, abline, coef, dev.print.) Talk through what is happening. What is the col argument? What does dev.print do? and where?

1.4 Our first script!

Let’s say this is a good start of an analysis and you’re ready to start preserving the logic and code. Visit the History tab of the upper right pane. Select these commands. Click “To Source”. Now you have a new pane containing a nascent R script. Click on the floppy disk to save. Give it a name ending in .R or .r (I used toy-line.r) and note that, by default, it will go in the directory associated with your project. It is traditional to save R scripts with a .R or .r suffix.

A few things:

Let’s comment our script: Comments start with one or more # symbols. Use them. RStudio helps you (de)comment selected lines with Ctrl+Shift+C (windows and linux) or Command+Shift+C (mac).
Walk through line by line by keyboard shortcut (command + enter) or mouse (click Run in the upper right corner of editor pane).
Source the entire document – equivalent to entering source('toy-line.r') in the Console – by keyboard shortcut (shift command S) or mouse (click Source in the upper right corner of editor pane or select from the mini-menu accessible from the associated down triangle).

## toy-line.r
## a simple plot of the cars dataset
## J Lowndes lowndes@nceas.uscb.edu

## plots R's cars data with a fitted line ----
plot(cars)  
z <- line(cars)
abline(coef(z), col = "purple")

## save as .pdf
dev.print(pdf, "toy_line_plot.pdf")

Notice that the notation with ---- in a comment also enables us to ‘jump’ to it in RStudio

This workflow will serve you well in the future:

Create an RStudio project for an analytical project
Keep inputs there (although we just used the base cars data – we’ll soon talk about importing)
Keep scripts there; edit them, run them in bits or as a whole from there
Keep outputs there (like the PDF written above)

Avoid using the mouse for pieces of your analytical workflow, such as loading a dataset or saving a figure. This is terribly important for reproducibility and for making it possible to retrospectively determine how a numerical table or PDF was actually produced (searching on local disk on filename, among .R files, will lead to the relevant script).
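One thing to be aware of: dev.print() copies whatever is currently on a screen device, so it generally needs an interactive session. A common alternative, sketched here, is to open a PDF device with pdf(), draw into it, and close it with dev.off(). This is a different technique from the one in our script, and the temporary output location is just an assumption to keep the example tidy.

```r
# Write to a temporary folder so this sketch leaves no files in your project;
# in a real project you would use a relative path such as "toy_line_plot.pdf".
out_file <- file.path(tempdir(), "toy_line_plot.pdf")

pdf(out_file)                     # open a PDF graphics device
plot(cars)                        # draw the scatterplot into the PDF
z <- line(cars)                   # fit a robust line, as before
abline(coef(z), col = "purple")   # add the fitted line
dev.off()                         # close the device, which finishes the file

file.exists(out_file)
```

Either way, the figure is produced by code you can rerun, not by clicks you have to remember.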

To do before coffee: download the data folder from the course website here; save it in your RStudio project folder.

2 Basic care and feeding of data in R

(modified from Jenny Bryan’s STAT545)

Let’s start fresh.

You should clean out your workspace. In RStudio, click on the “Clear” broom icon from the Environment tab or use Session > Clear Workspace. You can also enter rm(list = ls()) in the Console to accomplish the same.

Now restart R. In RStudio, use Session > Restart R. Otherwise, quit R (from the ‘File’ menu or by typing q() in the Console) and re-launch it.

Why do we do this? So that the code you write is complete and re-runnable.

root out hidden dependencies where one snippet of code only works because it relies on objects created by code saved elsewhere or, much worse, never saved at all.
expose any usage of packages that have not been explicitly loaded.

Let’s check our working directory:

getwd()

Finally, let’s create a new R script from scratch. We will develop and run our code from there. We’ll be using this script today and tomorrow.

In RStudio, use File > New File > R Script. Save this script with a name ending in .r or .R, containing no spaces or other funny stuff, and that evokes whatever it is we’re doing today. Example: swc-data-explore.r.

2.1 Meet your first data.frame: gapminder

We will work with some of the data from the Gapminder project. Have a look at data/gapminder.csv by navigating to it in the RStudio file pane and looking at it (RStudio is also a text editor so you can read this file right here).

## read gapminder csv
gapminder <- read.csv('data/gapminder.csv')
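If read.csv() reports that it cannot open the file, the usual cause is that the file is not at that path relative to your working directory. The sketch below shows the check-then-read pattern on a tiny stand-in file written to a temporary folder; the file name toy.csv and its values are invented for illustration.

```r
# A tiny stand-in for data/gapminder.csv (values are invented for this sketch).
toy <- data.frame(
  country = c("Mexico", "Panama"),
  year    = c(2007, 2007),
  pop     = c(100, 3)
)
toy_path <- file.path(tempdir(), "toy.csv")
write.csv(toy, toy_path, row.names = FALSE)

# Check that the file is where we think it is, then read it.
file.exists(toy_path)
toy_read <- read.csv(toy_path)
toy_read
```

For the real data, file.exists('data/gapminder.csv') together with getwd() should tell you quickly whether the path or the working directory is the problem.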

Let’s inspect:

## explore the gapminder dataset
gapminder # this is super long! Let's inspect in different ways

Let’s use head and tail:

head(gapminder) # shows first 6
tail(gapminder) # shows last 6

head(gapminder, 10) # shows first X that you indicate
tail(gapminder, 12) # guess what this does!

str() will provide a sensible description of almost anything: when in doubt, just str() some of the recently created objects to get some ideas about what to do next.

str(gapminder) # ?str - displays the structure of an object

gapminder is a data.frame. We aren’t going to get into the other types of data receptacles today (‘arrays’, ‘matrices’), because data.frames are what you should primarily use. Why?

data.frames package related variables neatly together, great for analysis
most functions, including the latest and greatest packages actually require that your data be in a data.frame
data.frames can hold variables of different flavors such as
- character data (country or continent names; “Factors”)
- quantitative data (years, population; “Integers (int)” or “Numeric (num)”)
- categorical information (male vs. female)

We can also see the gapminder variable in RStudio’s Environment pane (top right)

More ways to learn basic info on a data.frame.

names(gapminder)
dim(gapminder)    # ?dim dimension
ncol(gapminder)   # ?ncol number of columns; same as dim(gapminder)[2]
nrow(gapminder)   # ?nrow number of rows; same as dim(gapminder)[1]

We can combine using c() to reverse-engineer dim()! Just a side-note here, but I wanted to introduce you to c(): we’ll use it later.

c(nrow(gapminder), ncol(gapminder)) # ?c combines values into a vector or list.
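To check the order for yourself without needing the gapminder file, here is the same idea on the built-in cars dataset: dim() reports rows first, then columns.

```r
dim(cars)        # rows first, then columns
dim(cars)[1]     # number of rows
nrow(cars)
dim(cars)[2]     # number of columns
ncol(cars)

identical(dim(cars), c(nrow(cars), ncol(cars)))
```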

A statistical overview can be obtained with summary()

summary(gapminder)

2.1.1 Look at the variables inside a data.frame

To specify a single variable from a data.frame, use the dollar sign $. The $ operator is a way to extract or replace parts of an object–check out the help menu for $. It’s a common operator you’ll see in R.

gapminder$lifeExp # very long! hard to make sense of...
head(gapminder$lifeExp) # can do the same tests we tried before
str(gapminder$lifeExp) # it is a single numeric vector
summary(gapminder$lifeExp) # same information, just formatted slightly differently
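We have only used $ to extract so far. Here is a small sketch of the ‘replace’ side on a made-up data.frame, so nothing happens to gapminder itself; the name toy, the column pop_doubled, and the numbers are all invented.

```r
toy <- data.frame(
  country = c("Mexico", "Panama", "Ecuador"),
  pop     = c(100, 3, 14)   # invented numbers
)

toy$pop                          # extract a column as a vector
toy$pop_doubled <- toy$pop * 2   # create a new column from an existing one
toy$pop[2] <- 4                  # replace a single value in a column
toy
```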

We’ll spend tomorrow talking more about visualization, but it’s so important for smell-testing a dataset that we will make a few figures anyway. Here we use only base R graphics, which are very basic.

## plot gapminder
plot(gapminder$year, gapminder$lifeExp) # ?plot
plot(gapminder$gdpPercap, gapminder$lifeExp)

These plots can tell us some basic things, like minimum lifeExp has generally increased over time, or lifeExp is largely related to GDP. But there are big outliers that would be great to investigate, and it might be nice to be able to dive deeper into the data to learn more.

So let’s build up to that. (skip the rest of this section if low on time)

Let’s explore a numeric variable: life expectancy.

## explore numeric variable
summary(gapminder$lifeExp)
hist(gapminder$lifeExp)

Let’s explore a categorical variable (stored as a factor in R): continent.

## explore factor variable
summary(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
hist(gapminder$continent) # whaaaa!?

This error is because of what factors are ‘under the hood’: R is really storing integer codes 1, 2, 3 here, but represents them as text to us. Factors can be problematic to us because of this, but you can learn to navigate with them. There are resources to learn how to properly care for and feed factors.

One thing you’ll learn is how to visualize factors with which functions/packages.

class(gapminder$continent) # ?class returns the class type of the object
table(gapminder$continent) # ?table builds a table based on factor levels 
class(table(gapminder$continent)) # this has morphed the factor...
hist(table(gapminder$continent)) # so we can plot!
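To peek under the hood yourself, here is a sketch with a tiny hand-made factor (the values are invented). One caveat worth knowing: since R 4.0.0, read.csv() keeps text columns as character unless you ask for stringsAsFactors = TRUE, so on a recent R you may need factor() or that argument to reproduce the factor behavior described here.

```r
continent <- factor(c("Asia", "Africa", "Asia", "Europe"))

levels(continent)         # the labels, in alphabetical order by default
as.integer(continent)     # the integer codes stored under the hood
as.character(continent)   # the text each code stands for

continent_counts <- table(continent)   # how many times each level occurs
continent_counts
# barplot(continent_counts)  # in an interactive session: one bar per level
```

A bar plot of the counts is usually what people are after when they try hist() on a factor.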

I don’t want us to get too bogged down with what’s going on with table() and plotting factors, but I want to expose you to these situations because you will encounter them. Googling the error messages you get, and knowing how to look for good responses is a critical skill. (I tend to look for responses from stackoverflow.com that are recent and have green checks, and ignore snarky comments).

Exercise with your neighbor: Explore gapminder$gdpPercap. What kind of data is it? So which commands do you use?

2.2 Subsetting data

You will want to isolate bits of your data; maybe you want to just look at a single country or a few years. R calls this subsetting. There are several ways to do this. We’ll go through a few options in base R so that you’re familiar with them, and know how to read them. But then we’ll move on to a new, better, intuitive, and game changing way with the dplyr package afterwards.

Remember your logical expressions from this morning? We’ll use == here.

2.2.1 subsetting with base `[rows, columns]` notation

This notation is something you’ll see a lot in base R. The brackets [ ] allow you to extract parts of an object. Within the brackets, the comma separates rows from columns.

## subset numeric data
gapminder[gapminder$lifeExp <32, ] # don't forget this comma! 

## subset factors
gapminder[gapminder$country == "Uruguay", ] # don't forget this comma!

So our notation is saying ‘select these rows, and all columns’.
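It can help to look at what goes in the ‘rows’ position on its own. The logical expression produces one TRUE or FALSE per row, and the brackets keep the rows marked TRUE. A sketch on a made-up data.frame (the name toy and its numbers are invented):

```r
toy <- data.frame(
  country = c("Mexico", "Panama", "Uruguay", "Uruguay"),
  year    = c(2007, 2007, 2002, 2007),
  lifeExp = c(70, 31, 75, 76)   # invented numbers
)

keep <- toy$lifeExp < 32   # one TRUE/FALSE per row
keep
toy[keep, ]                # rows where keep is TRUE, all columns

toy[toy$country == "Uruguay", ]   # same idea, written in one step
```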

We could also select which columns to keep using the c() function:

gapminder[gapminder$country == "Uruguay",
                     c("country", "year", "lifeExp")] # ?c: combines values into a vector or list

Contrast the above command with this one accomplishing the same thing:

gapminder[1621:1632, ] # No idea what we are inspecting. Don't do this.

gapminder[1621:1632, c(1, 3, 4)] # Ditto.

Yes, these both return the same result. But the second command is horrible for these reasons:

It contains Magic Numbers. The reason for keeping rows 1621 to 1632 will be non-obvious to someone else and that includes you in a couple of weeks.
It is fragile. If the rows of gapminder are reordered or if some observations are eliminated, these rows may no longer correspond to the Uruguay data.

In contrast, the first command is somewhat self-documenting; one does not need to be an R expert to take a pretty good guess at what’s happening. It’s also more robust. It will still produce the correct result even if gapminder has undergone some reasonable set of transformations (what if it were in reverse alphabetical order?)

2.2.2 subsetting with base `subset()` function

But we can improve how we subset by using the base subset() function, which can isolate pieces of an object for inspection or assignment. subset()’s main argument is also (unfortunately) called subset.

## subset gapminder
?subset
subset(gapminder, subset = country == "Mexico") # Ah, inspecting Mexico. Self documenting!

This returns all the columns.

But what if you just want a few of the columns? You can also subset the columns you want. You can use subset = and select = together to simultaneously subset rows and columns or variables.

subset(gapminder, subset = country == "Mexico", 
       select = c(country, year, lifeExp)) # ?c: combines values into a vector or list
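To convince yourself that the bracket notation and subset() are two spellings of the same request, here is a sketch comparing them on a made-up data.frame (toy and its values are invented).

```r
toy <- data.frame(
  country = c("Mexico", "Panama", "Mexico"),
  year    = c(2002, 2007, 2007),
  lifeExp = c(74, 75, 76)   # invented numbers
)

with_brackets <- toy[toy$country == "Mexico", c("country", "year")]
with_subset   <- subset(toy, subset = country == "Mexico",
                        select = c(country, year))

with_brackets
all.equal(with_brackets, with_subset)
```

Notice that inside subset() the column names are written bare, without toy$ or quotes.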

2.3 Repeating operations with for loops

Let’s say we want to subset a few countries and plot pop through time. We could do it the way above, which would look like the following:

## plot population of some countries
mexico <- subset(gapminder, subset = country == "Mexico")
plot(mexico$year, mexico$pop)
dev.print(pdf, "mexico.pdf")

panama <- subset(gapminder, subset = country == "Panama") 
plot(panama$year, panama$pop)
dev.print(pdf, "panama.pdf")

ecuador <- subset(gapminder, subset = country == "Ecuador")
plot(ecuador$year, ecuador$pop)
dev.print(pdf, "ecuador.pdf")

But you can see already it’s a lot of text, which means it’s typo-prone and hard to read. Even if you copy-paste each one, that’s a lot of copy-pasting, which invites its own typos. Plus, what if you wanted to instead plot lifeExp? You’d have to remember to change it each time…it gets messy quick. And we’re just doing it with 3 countries here; what if we wanted to do it to all 142 countries? Eek.

Better with a for loop. This will let us cycle through and do what we want to each thing in turn. If you want to iterate over a set of values, and perform the same operation on each, a for loop will do the job.

The basic structure of a for loop is:

for(iterator in set of values){
  do a thing
}
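Here is about the smallest runnable version of that structure I can think of; the iterator name number and the values are arbitrary.

```r
for (number in c(1, 2, 3)) {
  print(number * 10)   # runs once for each value in the set
}

number   # after the loop, the iterator still holds the last value
```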

Let’s paste from what we had before, and modify it. Also, the set of values is the list of countries (country_list), and we want to iterate through each country (let’s spell it cntry so it’s distinctive).

for (cntry in country_list) {
  mexico <- subset(gapminder, subset = country == "Mexico") 
  plot(mexico$year, mexico$pop)
}

We can’t call it mexico anymore, but we could call it something more general. And let’s comment the plot() line out while we build this, and add a print statement to see if it’s behaving like we think it is.

for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry)  
  # plot(mexico$year, mexico$pop)
  print(cntry_subset)
}

Question: what is the variable cntry_subset right now, after running the for loop?

Is this doing what we think it’s doing? Let’s create the country list and print the results each time to test our progress:

country_list <- c("Mexico", "Panama", "Ecuador") # identify the thing to loop through
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry)  
  # plot(mexico$year, mexico$pop)
  print(cntry_subset)
}

Excellent. Let’s move on with the plot.

country_list <- c("Mexico", "Panama", "Ecuador") 
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  plot(cntry_subset$year, cntry_subset$pop)
  dev.print(pdf, paste0(cntry,".pdf")) # ?paste0() will paste a string
}

Great! And it doesn’t matter if we just use these three countries or all the countries–let’s try it.

First let’s create a figure directory and make sure it saves there since it’s going to get out of hand quickly:

dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) # ?unique() returns the unique values
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  plot(cntry_subset$year, cntry_subset$pop)
  dev.print(pdf, paste0("figures/", cntry,".pdf")) # don't forget the `/`: it's a path!
}

So that took a little longer than just the 3, but still super fast. For loops are sometimes just the thing you need to iterate over many things in your analyses.

Now let’s say we also want to record the mean population of each country. We’d add a line to the for loop, and comment out all the plotting for now (to save time, you could also just leave it):

dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) 
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  # plot(cntry_subset$year, cntry_subset$pop)
  # dev.print(pdf, paste0("figures/", cntry,".pdf"))
  
  pop_mean <- mean(cntry_subset$pop)
  print(paste('mean pop for', cntry, 'is', pop_mean))
}

We know it worked since it printed correctly. But we didn’t capture it: after the loop, pop_mean and cntry_subset hold only the values for the last country, Zimbabwe. Let’s create an object outside the loop and add to it each time.

dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) # ?unique() returns the unique values
country_pop_mean <- data.frame()

for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  # plot(cntry_subset$year, cntry_subset$pop)
  # dev.print(pdf, paste0("figures/", cntry,".pdf")) 
  
  pop_mean <- mean(cntry_subset$pop)
  # print(paste('mean pop for', cntry, 'is', pop_mean))
  country_pop_mean <- rbind(country_pop_mean, data.frame(cntry, pop_mean))
}

This approach can be useful, but ‘growing your results’ (building the result object incrementally) is computationally inefficient, so avoid it when you are iterating through a lot of values.
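One common way around growing the result is to create the result object at its full size before the loop and fill it in by position. This is a sketch of that alternative on a made-up data.frame, not a replacement for what we did above; toy, pop_means, and the numbers are invented for illustration.

```r
toy <- data.frame(
  country = c("Mexico", "Mexico", "Panama", "Panama"),
  pop     = c(100, 110, 3, 5)   # invented numbers
)

country_list <- unique(toy$country)
pop_means <- numeric(length(country_list))   # result created once, at full size

for (i in seq_along(country_list)) {
  cntry_subset <- subset(toy, subset = country == country_list[i])
  pop_means[i] <- mean(cntry_subset$pop)     # fill slot i instead of growing
}

# Build the data.frame a single time, after the loop.
country_pop_mean <- data.frame(country = country_list, pop_mean = pop_means)
country_pop_mean
```

The loop now iterates over positions (1, 2, ...) so that each result has a slot waiting for it.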

For loops can also lead to temporary variables that you don’t need. But they can be really useful at times.

2.4 conditional statements with `if` and `else`

Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met.

# if
if (condition is true) {
  do something
}

# if ... else
if (condition is true) {
  do something
} else {  # that is, if the condition is false,
  do something different
}
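And a small runnable version, assuming a single made-up value to test. The variable names continent and message_text are just for this sketch; try changing "Asia" to something else and rerunning.

```r
continent <- "Asia"

if (continent == "Asia") {
  message_text <- "plot life expectancy"
} else if (continent == "Africa") {   # checked only if the first condition is FALSE
  message_text <- "plot GDP per capita"
} else {                              # everything else ends up here
  message_text <- "nothing extra to plot"
}

message_text
```

Note that if() expects a single TRUE or FALSE, which is why we test one value here.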

Say, for example, that in addition to saving population figures for all countries, we want to save life expectancy figures for countries in Asia only.

dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) # ?unique() returns the unique values
country_pop_mean <- data.frame()
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  # plot(cntry_subset$year, cntry_subset$pop)
  # dev.print(pdf, paste0("figures/", cntry,".pdf")) 
  
  pop_mean <- mean(cntry_subset$pop)
  # print(paste('mean pop for', cntry, 'is', pop_mean))
  country_pop_mean <- rbind(country_pop_mean, data.frame(cntry, pop_mean))
  
  ## if Asia, plot lifeExp over time
  if (unique(cntry_subset$continent) == "Asia") { # read: if (the continent is Asia) {then}
    plot(cntry_subset$year, cntry_subset$lifeExp) 
    dev.print(pdf, paste0("figures/", cntry, "_lifeExp.pdf")) # change the filename
  }
}

And if the country is in Africa, let’s plot GDP per capita instead.

dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) # ?unique() returns the unique values
country_pop_mean <- data.frame()
for (cntry in country_list) {
  cntry_subset <- subset(gapminder, subset = country == cntry) 
  # plot(cntry_subset$year, cntry_subset$pop)
  # dev.print(pdf, paste0("figures/", cntry,".pdf")) 
  
  pop_mean <- mean(cntry_subset$pop)
  # print(paste('mean pop for', cntry, 'is', pop_mean))
  country_pop_mean <- rbind(country_pop_mean, data.frame(cntry, pop_mean))
  
  ## if Asia, plot lifeExp over time
  if (unique(cntry_subset$continent) == "Asia") { # read: if (the continent is Asia) {then}
    plot(cntry_subset$year, cntry_subset$lifeExp) 
    dev.print(pdf, paste0("figures/", cntry, "_lifeExp.pdf")) 
  } else if (unique(cntry_subset$continent) == "Africa") {
    plot(cntry_subset$year, cntry_subset$gdpPercap) 
    dev.print(pdf, paste0("figures/", cntry, "_gdpPercap.pdf")) # change the filename
  }
}
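One detail worth knowing: `if` expects a single TRUE or FALSE. The condition in the loop works because each country sits in exactly one continent, so unique() returns one value; recent versions of R stop with an error when the condition has more than one element. The sketch below makes that assumption explicit with a guard. The toy `cntry_subset` values and the names `cntry_continent` and `result` are made up for illustration.

```r
# Toy stand-in for one country's rows (made-up values, for illustration only)
cntry_subset <- data.frame(
  year      = c(2002, 2007),
  continent = c("Asia", "Asia"),
  stringsAsFactors = FALSE
)

# unique() collapses the repeated continent label to a single value
cntry_continent <- unique(cntry_subset$continent)

# Guard: stop early if a subset ever spans more than one continent
stopifnot(length(cntry_continent) == 1)

if (cntry_continent == "Asia") {
  result <- "plot lifeExp"
} else if (cntry_continent == "Africa") {
  result <- "plot gdpPercap"
} else {
  result <- "no extra plot"
}

print(result)
```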

2.5 clean up and save your .r script

OK, let’s clean up and save your .r script, so it’s a good resource for you! Restart R. This will ensure you don’t have any packages loaded from previous calls to library(). In RStudio, use Session > Restart R. Otherwise, quit R with q() and re-launch it.

You can also delete your ‘figures’ folder so it doesn’t take up space. You can always regenerate the figures by rerunning the code.
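If you would rather delete the folder from R than from a file browser, base R’s unlink() can do it. The sketch below works inside a temporary directory so it is safe to run as is; `demo_dir` and the placeholder file are assumptions for illustration. In your project the equivalent call would be `unlink('figures', recursive = TRUE)`, which deletes without asking, so double-check the path first.

```r
# Work inside a temporary directory so the sketch does not touch your project
demo_dir <- file.path(tempdir(), "figures")
dir.create(demo_dir)
file.create(file.path(demo_dir, "example.pdf"))  # empty placeholder file

# recursive = TRUE is needed to delete a directory and its contents
unlink(demo_dir, recursive = TRUE)

# Check whether the folder is still there
dir.exists(demo_dir)
```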

Run through each line of code again, make sure your comments are good, delete anything you don’t need. Your script might look like this:

## explore the gapminder dataset ----
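gapminder <- read.csv('data/gapminder.csv', stringsAsFactors = TRUE) # read strings as factors (not the default since R 4.0), so levels() works below
str(gapminder) # displays the structure of an object
head(gapminder) # shows first 6 by default
tail(gapminder, 12) # shows last X that you indicate, or 6 by default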
names(gapminder)
dim(gapminder)    # ?dim dimension
ncol(gapminder)   # ?ncol number of columns
nrow(gapminder)   # ?nrow number of rows
length(gapminder) # ?length length; although better for vectors
summary(gapminder)

## plot gapminder
plot(lifeExp ~ year, gapminder)
plot(lifeExp ~ gdpPercap, gapminder)

## explore numeric variable
head(gapminder$lifeExp)
summary(gapminder$lifeExp)
hist(gapminder$lifeExp)

## explore numeric variable that functions like a categorical variable
head(gapminder$year)
summary(gapminder$year)

## explore factor variable
class(gapminder$continent)
summary(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
barplot(table(gapminder$continent))

## subset gapminder. Self documenting!
subset(gapminder, subset = country == "Mexico",
       select = c(country, year, lifeExp)) # ?c: combines values

## plot gapminder
plot(gapminder$year, gapminder$lifeExp) # ?plot
plot(gapminder$gdpPercap, gapminder$lifeExp)

## explore numeric variable
summary(gapminder$lifeExp)
hist(gapminder$lifeExp)

## explore factor variable
summary(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
class(gapminder$continent) # ?class returns the class type of the object
table(gapminder$continent) # ?table builds a table based on factor levels 
class(table(gapminder$continent)) # this has morphed the factor...
barplot(table(gapminder$continent)) # so we can plot!

## subsetting with base `[rows, columns]` notation
uruguay <- gapminder[gapminder$country == "Uruguay", ] # don't forget this comma! 
uruguay <- gapminder[gapminder$country == "Uruguay",
                     c("country", "year", "lifeExp")] 
## don't subset by numeric rows or columns.
# uruguay2 <- gapminder[1621:1632, ] # No idea what we are inspecting. Don't do this.
# uruguay2 <- gapminder[1621:1632, c(1, 3, 4)] # Ditto. 


## subsetting with base `subset()` function
mexico <- subset(gapminder, subset = country == "Mexico") # self documenting. returns all columns.
mexico <- subset(gapminder, subset = country == "Mexico", 
                 select = c(country, year, lifeExp)) # 'select' returns only the columns we identify.

# clunky, error-prone way of plotting multiple countries (for loops are better)
mexico <- subset(gapminder, subset = country == "Mexico")
plot(mexico$year, mexico$pop)
dev.print(pdf, "mexico.pdf")

panama <- subset(gapminder, subset = country == "Panama") 
plot(panama$year, panama$pop)
dev.print(pdf, "panama.pdf")

ecuador <- subset(gapminder, subset = country == "Ecuador")
plot(ecuador$year, ecuador$pop)
dev.print(pdf, "ecuador.pdf")

## For loops and if statements----

## For loop structure
    # for(iterator in set of values){
    #   do a thing
    # }

## If statement structure
    # if (condition is true) {
    #   do something
    # } else {  # that is, if the condition is false,
    #   do something different
    # }

## Our for loop
dir.create('figures') # this will be: software-carpentry/figures

country_list <- unique(gapminder$country) # list unique country names to plot
country_pop_mean <- data.frame() # create an empty data frame to store mean values
for (cntry in country_list) { # loop through cntry in country_list
  cntry_subset <- subset(gapminder, subset = country == cntry) # subset the cntry 
  plot(cntry_subset$year, cntry_subset$pop)
  dev.print(pdf, paste0("figures/", cntry,".pdf"))
  
  ## calculate mean and save in growing country_pop_mean variable
  pop_mean <- mean(cntry_subset$pop)
  print(paste('mean pop for', cntry, 'is', pop_mean))
  country_pop_mean <- rbind(country_pop_mean, data.frame(cntry, pop_mean))
  
  ## if Asia, do something. if Africa, do something else.
  if (unique(cntry_subset$continent) == "Asia") { # read: if (the continent is Asia) {then}
    plot(cntry_subset$year, cntry_subset$lifeExp) 
    dev.print(pdf, paste0("figures/", cntry, "_lifeExp.pdf")) 
  } else if (unique(cntry_subset$continent) == "Africa") {
    plot(cntry_subset$year, cntry_subset$gdpPercap) 
    dev.print(pdf, paste0("figures/", cntry, "_gdpPercap.pdf")) # change the filename
  }
}

Introduction to R and RStudio

Julie Lowndes

July 12, 2016

1 R basics, workspace and working directory, RStudio projects

1.1 R at the command line, RStudio goodies

1.2 R functions, help pages

1.3 Working directories, RStudio projects, R scripts

1.3.1 Working directory

1.3.2 RStudio projects

1.4 Our first script!

2 Basic care and feeding of data in R

2.1 Meet your first data.frame: gapminder

2.1.1 Look at the variables inside a data.frame

2.2 Subsetting data

2.2.1 subsetting with base `[rows, columns]` notation

2.2.2 subsetting with base `subset()` function

2.3 Repeating operations with for loops

2.4 conditional statements with `if` and `else`

2.5 clean up and save your .r script
