Hadley Wickham, RStudio’s Chief Scientist, has been building R packages for data wrangling and visualization based on the idea of tidy data. Great resources include RStudio’s data wrangling cheatsheet (screenshots below are from this cheatsheet) and data wrangling webinar.
Tidy data has a simple convention: put variables in the columns and observations in the rows.
Our gapminder
dataset is pretty tidy. This enables you to work with it the way you’d want, for your analyses, plots, etc.
Right now we are going to use dplyr
to wrangle this tidyish data set (the transform part of the cycle), and then come back to tidying messy data using tidyr
once we’ve had some fun wrangling.
We’ll do this in a new R script in the software-carpentry
folder we created this morning.
Here’s what to do:
software-carpentry
repo (and if not, get there)gapminder-wrangle.r
Today’s materials are again borrowing from some excellent sources, including:
dplyr
overviewThere are five dplyr
functions that you will use to do the vast majority of data manipulations:
filter()
),
select()
),
mutate()
),
summarise()
),
arrange()
)These can all be used in conjunction with group_by()
which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the verbs for a language of data manipulation.
All verbs work similarly:
$
.Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
dplyr
Packages are bundles of functions, along with help pages and other goodies that make them easier for others to use, (ie. vignettes).
So far we’ve been using packages included in ‘base R’; they are ‘out-of-the-box’ functions. You can also install packages from online. The most traditional is CRAN, the Comprehensive R Archive Network. This is where you went to download R originally, and will go again to look for updates.
You don’t need to go to CRAN’s website to install packages, we can do it from within R with the command install.packages("package-name-in-quotes")
.
## from CRAN:
#install.packages("dplyr") ## do this once only to install the package on your computer.
library(dplyr) ## do this every time you restart R and need it
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
What’s the difference between install.packages()
and library()
? Why do you need both? Here’s my analogy:
install.packages()
is setting up electricity for your house. Just need to do this once (let’s ignore monthly bills).library()
is turning on the lights. You only turn them on when you need them, otherwise it wouldn’t be efficient. And when you quit R, it turns the lights off, but the electricity lines are still there. So when you come back, you’ll have to turn them on again with library()
, but you already have your electricity set up.dplyr::filter()
to subset data row-wise.First let’s read in the gapminder data. Earlier, we read in a .csv
of the gapminder data. But Jenny Bryan has also released this as an R package, so you could also install it from CRAN and load it into as extra practice for loading packages.
# install.packages('gapminder') # instead of reading in the csv
library(gapminder) # this is the package name
str(gapminder) # there's still a data frame named 'gapminder'
filter()
takes logical expressions and returns the rows for which all are TRUE
.
Visually, we are doing this (thanks RStudio for your cheatsheet):
filter(gapminder, lifeExp < 29)
filter(gapminder, country == "Mexico")
filter(gapminder, country %in% c("Mexico", "Afghanistan"))
Compare with some base R code to accomplish the same things
gapminder[gapminder$lifeExp < 29, ] ## repeat `gapminder`, [i, j] indexing is distracting
subset(gapminder, country == "Mexico") ## almost same as filter ... but wait ...
Before we go any further, we should exploit the new pipe operator that dplyr
imports from the magrittr
package by Stefan Bache. This is going to change your data analytical life. You no longer need to enact multi-operation commands by nesting them inside each other. This new syntax leads to code that is much easier to write and to read.
Here’s what it looks like: %>%
. The RStudio keyboard shortcut: Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).
Let’s demo then I’ll explain:
gapminder %>% head()
This is equivalent to head(gapminder)
. This pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side – literally, drops it in as the first argument.
Never fear, you can still specify other arguments to this function! To see the first 3 rows of Gapminder, we could say head(gapminder, 3)
or this:
gapminder %>% head(3)
I’ve advised you to think “gets” whenever you see the assignment operator, <-
. Similary, you should think “then” whenever you see the pipe operator, %>%
.
You are probably not impressed yet, but the magic will soon happen.
dplyr::select()
to subset the data on variables or columns.Back to dplyr
…
Use select()
to subset the data on variables or columns.
Visually, we are doing this (thanks RStudio for your cheatsheet):
Here’s a conventional call:
select(gapminder, year, lifeExp)
But using what we just learned, with a pipe, we can do this:
gapminder %>% select(year, lifeExp)
Let’s write it again but using multiple lines so it’s nicer to read. And let’s add a second pipe operator to pipe through head
:
gapminder %>%
select(year, lifeExp) %>%
head(4)
Think: “Take gapminder
, then select the variables year and lifeExp, then show the first 4 rows.”
Let’s do a little analysis where we calculate the mean gdp for Cambodia.
Here’s the gapminder data for Cambodia, but only certain variables:
gapminder %>%
filter(country == "Cambodia") %>%
# select(country, year, pop, gdpPercap) ## entering 4 of the 6 columns is tedious
select(-continent, -lifeExp) # you can use - to deselect columns
and what a typical base R call would look like:
gapminder[gapminder$country == "Cambodia", c("country", "year", "pop", "gdpPercap")]
or, possibly?, a nicer look using base R’s subset()
function:
subset(gapminder, country == "Cambodia", select = c(country, year, pop, gdpPercap))
mutate()
to add new variablesImagine we wanted to recover each country’s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Let’s add a new column and multiply them together.
Visually, we are doing this (thanks RStudio for your cheatsheet):
gapminder %>%
mutate(gdp = pop * gdpPercap)
Exercise: add this new gdp column to our Cambodia example above.
# Solution (no peeking!)
gapminder %>%
filter(country == "Cambodia") %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap)
group_by
and summarize
Great! And now we want to calculate the mean gdp across all years (Let’s pretend that’s a good idea statistically)
Visually, we are doing this (thanks RStudio for your cheatsheet):
gapminder %>%
filter(country == "Cambodia") %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap) %>%
group_by(country) %>%
summarize(mean_gdp = mean(gdp)) %>%
ungroup() # if you use group_by, also use ungroup() to save heartache later
And how would you then do this for every country, not just Cambodia? Well, yesterday we would have been thinking about putting this whole analysis inside a for loop, replacing “Cambodia” with a new name each time we iterated through the loop. But today, we have it already, just need to delete one line from our analysis–we don’t need to filter out Cambodia anymore!!
Visually, we are doing this (thanks RStudio for your cheatsheet):
gapminder %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap) %>%
group_by(country) %>%
summarize(mean_gdp = mean(gdp)) %>%
ungroup() # if you use group_by, also use ungroup() to save heartache later
So we have done a pretty incredible amount of work in a few lines. Our whole analysis is this. Imagine the possibilities from here. It’s very readable: you see the data as the first thing, it’s not nested. Then, you can read the verbs. This is the whole thing:
library(dplyr)
library(gapminder)
gapminder %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap) %>%
group_by(country) %>%
summarize(mean_gdp = mean(gdp)) %>%
ungroup() # if you use group_by, also use ungroup() to save heartache later
(In base-R, this would be a nightmare to read, and you would also have to make many temporary variables since you couldn’t use the %>% operator. Have a look at this blog to get an idea: How dplyr replaced my most common R idioms).
Exercise: with your neighbor, find the maximum life expectancy for countries in Asia. What is the earliest year you enounter? The latest? Hint: you can use or
base::max
anddplyr::arrange()
…
## possible result--no peeking!!
gapminder %>%
filter(continent == 'Asia') %>%
group_by(country) %>%
filter(lifeExp == max(lifeExp)) %>%
arrange(desc(year))
tidyr
overviewOften, data must be reshaped for it to become tidy data. What does that mean? There are four main verbs we’ll use, which are essentially pairs of opposites:
gather()
),spread()
),separate()
),unite()
)You use spread()
and gather()
to transform or reshape data between ‘wide’ to ‘long’ formats. ‘long’ format is the tidy data we are after, where:
In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). Data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R
‘s functions have been designed assuming you have ’long’ format data.
These data formats mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to it’s shape. However, the long format is more machine readable and is closer to the formating of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.
Question: Is gapminder a purely long, purely wide, or some intermediate format?
Sometimes, as with the gapminder dataset, we have multiple types of observed data. It is somewhere in between the purely ‘long’ and ‘wide’ data formats:
continent
, country
, year
)pop
,lifeExp
,gdpPercap
).It’s pretty common to have data in this intermediate format in most cases despite not having ALL observations in 1 column, since all 3 observation variables have different units. But we can play with switching it to long format and wide to show what that means (i.e. long would be 4 ID variables and 1 Observation variable).
Note: Generally, mathematical operations are better in long format, although some plotting functions actually work better with wide format.
tidyr
, investigate gapminder dataFirst install and load tidyr
:
#install.packages("tidyr")
library("tidyr") # warning messages are OK
gather()
We’ve been working with pretty tidy data. So for practice, now let’s work with these data in a wider format, maybe the way you or your colleague entered it in a spreadsheet. We’ll work with it to get it back to the way we like it.
gap_wide <- read.csv('data/gapminder_wide.csv')
head(gap_wide)
str(gap_wide)
While wide format is nice for data entry, it’s not nice for calculations. What if you were asked for the mean population after 1990 in Algeria? Possible, but ugly. Let’s tidy it back to the intermediate format we’ve been using.
Question: let’s talk this through together. If we’re trying to turn the
gap_wide
format intogapminder
format, what structure does it have that we like? And that we want to change?
tidyr
calls this a ‘key’), and 1 column for the data (tidyr
calls this the ’value’).gdpPercap
, lifeExp
, and pop
.Let’s get it to long format. We’ll have to do this in 2 steps. The first step is to take all of those column names and make them a variable in a new column, and transfer the values into another column.
Let’s have a look at gather()
’s help:
?gather
Question: What is our key-value pair?
We need to name two new variables in the key-value pair, one for the key, one for the value. Let’s name them obstype_year
and obs_value
.
Here’s the start of what we’ll do:
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values)
## Warning: attributes are not identical across measure variables; they will
## be dropped
str(gap_long)
head(gap_long)
tail(gap_long)
So this has reshaped our dataframe, but really not how we wanted. Very important to check, and listen to that warning message–dropping attributes seems very suspicious. Like suspenders.
What went wrong? Notice that it didn’t know that we wanted to keep continent
and country
untouched; we need to give it more information about which columns we want reshaped. We can do this in several ways.
One way: identify the column numbers you want to use. Not ideal because column numbers could change, but it does exactly what we want! And illustrates it nicely.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
3:38) # could also do -1, -2: 'not column one, not column two
str(gap_long)
head(gap_long)
tail(gap_long)
Better way: identify the columns by name. Listing them out by explicit name can be a good approach if there are a few. But I’m not going to list them out here, and way too much potential for error if you tried gdpPercap_1952, gdpPercap_1957, gdpPercap_1962… So let’s use some of dplyr
’s awesome helper functions.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
dplyr::starts_with('pop'),
dplyr::starts_with('lifeExp'),
dplyr::starts_with('gdpPercap'))
str(gap_long)
head(gap_long)
tail(gap_long)
Success! And you could also do it this way.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country)
str(gap_long)
head(gap_long)
tail(gap_long)
To recap:
Inside gather()
we first name the new column for the new ID variable (obstype_year
), the name for the new amalgamated observation variable (obs_value
), then the names of the old observation variable. We could have typed out all the observation variables, but as in the select()
function (see dplyr
lesson), we can use the starts_with()
argument to select all variables that starts with the desired character string. Gather also allows the alternative syntax of using the -
symbol to identify which variables are not to be gathered (i.e. ID variables)
OK, but we’re not done yet. obstype_year
actually contains 2 pieces of information, the observation type (pop
,lifeExp
, or gdpPercap
) and the year
. We can use the separate()
function to split the character strings into multiple variables
?separate
–> we’ll tell it which column we want separated, name new columns that we want to create, and tell it what we want it to separate by. Since the obstype_year
variable has observation types and years separated by a _
, we’ll use that.
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country) %>%
separate(obstype_year,
into = c('obs_type','year'),
sep="_")
str(gap_long)
head(gap_long)
tail(gap_long)
Excellent. This is long format: every row is a unique observation. Yay!
Exercise: Using
gap_long
, calculate the mean life expectancy, population, and gdpPercap for each continent. Hint: use thegroup_by()
andsummarize()
functions we learned in thedplyr
lesson
# solution (no peeking!)
gap_long %>%
group_by(continent, obs_type) %>%
summarize(means = mean(obs_values))
Now just to double-check our work, let’s use the opposite of gather()
to spread our observation variables back to the original format with the aptly named spread()
. You pass spread()
the key and value pair, which is now obs_type
and obs_values
.
gap_normal <- gap_long %>%
spread(obs_type, obs_values)
dim(gap_normal)
dim(gapminder)
names(gap_normal)
names(gapminder)
Now we’ve got an intermediate dataframe gap_normal
with the same dimensions as the original gapminder
, but the order of the variables is different. Let’s fix that before checking if they are all.equal()
.
Exercise: reorder the columns in gap_normal to match gapminder.
# Solution, no peeking!
# one way with dplyr (also nice because can chain this from gap_normal creation)
gap_normal <- gap_normal %>%
select(country, continent, year, lifeExp, pop, gdpPercap)
# another way with base R
gap_normal <- gap_normal[,names(gapminder)]
Now let’s check if they are all.equal (?all.equal
) is a handy test
all.equal(gap_normal,gapminder)
## [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
## [3] "Component \"country\": Attributes: < Component \"levels\": 6 string mismatches >"
## [4] "Component \"country\": 1704 string mismatches"
## [5] "Component \"continent\": 1212 string mismatches"
## [6] "Component \"year\": Modes: character, numeric"
## [7] "Component \"year\": target is character, current is numeric"
## [8] "Component \"lifeExp\": Mean relative difference: 0.203822"
## [9] "Component \"pop\": Mean relative difference: 1.634504"
## [10] "Component \"gdpPercap\": Mean relative difference: 1.162302"
head(gap_normal)
head(gapminder)
We’re almost there, the original was sorted by country
, continent
, then year
.
gap_normal <- gap_normal %>% arrange(country,continent,year)
all.equal(gap_normal,gapminder)
## [1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
## [2] "Attributes: < Component \"class\": 1 string mismatch >"
## [3] "Component \"country\": Attributes: < Component \"levels\": 6 string mismatches >"
## [4] "Component \"country\": 72 string mismatches"
## [5] "Component \"year\": Modes: character, numeric"
## [6] "Component \"year\": target is character, current is numeric"
str(gap_normal)
str(gapminder)
Mine shows a slight difference because one is a data.frame and one is a tbl_df, which is similar to a data.frame. We won’t get into this difference now, I fm feeling good atout these data sets! We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.
Exercise: convert gap_long all the way back to gap_wide. Hint: you’ll need to create appropriate labels for all our new variables (time*metric combinations) with
tidyr::unite()
.
# Solution, no peeking:
head(gap_long) # remember the columns
gap_wide_new <- gap_long %>%
# first unite obs_type and year into a new column called var_names. Separate by _
unite(col = var_names, obs_type, year, sep = "_") %>%
# then spread var_names out by key-value pair.
spread(key = var_names, value = obs_values)
str(gap_wide_new)
Spend some time cleaning up and saving gapminder-wrangle.r
Restart R. In RStudio, use Session > Restart R. Otherwise, quit R with q()
and re-launch it.
Your final R script could look something like this:
## install, load dplyr
#install.packages("dplyr") ## do this once only to install the package on your computer.
library(dplyr)
## load gapminder
library(gapminder) # this is the package name
str(gapminder) # there's still a data frame named 'gapminder'
## practice dplyr::filter()
filter(gapminder, lifeExp < 29)
filter(gapminder, country == "Mexico")
filter(gapminder, country %in% c("Mexico", "Afghanistan"))
## base R alternatives to dplyr::filter()
gapminder[gapminder$lifeExp < 29, ] ## repeat `gapminder`, [i, j] indexing is distracting
subset(gapminder, country == "Mexico") ## almost same as filter ... but wait ...
## pipe operator %>%
gapminder %>% head # this...
head(gapminder) # ...is the same as this!
gapminder %>% head(3) # can pass arguments! this...
head(gapminder, 3) # ...is the same as this!
## practice dplyr::select() with %>%
select(gapminder, year, lifeExp) # this...
gapminder %>% select(year, lifeExp) # ...is the same as this!
## practice with %>% chains
gapminder %>%
select(year, lifeExp) %>%
head(4) # doesn't have to be a dplyr function
gapminder %>%
filter(country == "Cambodia") %>%
select(-continent, -lifeExp) # same as select(country, year, pop, gdpPercap)
# compare to base R, hard to read!
gapminder[gapminder$country == "Cambodia", c("country", "year", "pop", "gdpPercap")]
subset(gapminder, country == "Cambodia", select = c(country, year, pop, gdpPercap))
## dplyr::mutate() adds new columns
gapminder %>%
filter(country == "Cambodia") %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap)
## dplyr::summarize() or summarise() adds new column when grouping
gapminder %>%
filter(country == "Cambodia") %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap) %>%
group_by(country) %>%
summarize(mean_gdp = mean(gdp)) %>%
ungroup() # if you use group_by, also use ungroup() to save heartache later
## summarize for all countries (replaces our for loop!)
gapminder %>%
select(-continent, -lifeExp) %>%
mutate(gdp = pop * gdpPercap) %>%
group_by(country) %>%
summarize(mean_gdp = mean(gdp)) %>%
ungroup() # if you use group_by, also use ungroup() to save heartache later
## install tidyr
install.packages("tidyr")
library(tidyr)
## load wide data
gap_wide <- read.csv('data/gapminder_wide.csv')
head(gap_wide)
str(gap_wide)
## practice tidyr::gather() wide to long
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country)
# or
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
dplyr::starts_with('pop'),
dplyr::starts_with('lifeExp'),
dplyr::starts_with('gdpPercap'))
# or (but always be wary of numerics because they could change silently)
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
3:38) # could also do -1, -2: 'not column one, not column two
## gather() and separate() to create our original gapminder
gap_long <- gap_wide %>%
gather(key = obstype_year,
value = obs_values,
-continent, -country) %>%
separate(obstype_year,
into = c('obs_type','year'),
sep="_")
## practice: can still do calculations in long format
gap_long %>%
group_by(continent, obs_type) %>%
summarize(means = mean(obs_values))
## spread() from normal to wide
gap_normal <- gap_long %>%
spread(obs_type, obs_values) %>%
select(country, continent, year, lifeExp, pop, gdpPercap)
# or
gap_normal <- gap_long %>%
spread(obs_type, obs_values)
gap_normal <- gap_normal[,names(gapminder)]
## check that all.equal()
all.equal(gap_normal,gapminder)
## unite() and spread(): convert gap_long to gap_wide
head(gap_long) # remember the columns
gap_wide_new <- gap_long %>%
# first unite obs_type and year into a new column called var_names. Separate by _
unite(col = var_names, obs_type, year, sep = "_") %>%
# then spread var_names out by key-value pair.
spread(key = var_names, value = obs_values)
str(gap_wide_new)
complete()
We don’t have time to discuss importing data, but here are some packages by Hadley Wickham to explore and use. Remember you’ll use install.packages("package-name-in-quotes")
and then library(package-name)
, and then you can explore the help or vignettes. And also, of course, Google to see how to use them!
readr
to read in .csv filesreadxl
to read in Excel filesstringr
to work with stringslubridate
to work with dates