Chapter 7 dplyr and Pivot Tables

7.1 Summary

Pivot tables are powerful tools in Excel for summarizing data in different ways. We will create these tables using the group_by and summarize functions from the dplyr package (part of the Tidyverse). We will also learn how to format tables and practice creating a reproducible report using RMarkdown and sharing it with GitHub.

7.2 Objectives

In R, we can use dplyr for pivot tables by using 2 main verbs in combination: group_by and summarize. We will also continue to emphasize reproducibility in all our analyses.

  • Discuss pivot tables in Excel
  • Introduce group_by() %>% summarize() from the dplyr package
  • Practice our reproducible workflow with RMarkdown and GitHub

7.4 Pivot table overview

Wikipedia describes a pivot table as a “table of statistics that summarizes the data of a more extensive table…This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.” Fun fact: it also says that “Although pivot table is a generic term, Microsoft trademarked PivotTable in the United States in 1994.”

Pivot tables are a really powerful tool for summarizing data, and we can have similar functionality in R — as well as nicely automating and reporting these tables. We will learn about this using data about lobsters and will go back and forth between R and Excel as we learn.

Let’s start off in R, and have a look at the data.

7.5 RMarkdown setup

Let’s start a new RMarkdown file in our repo, at the top-level (where it will be created by default in our Project). I’ll call mine pivot_lobsters.Rmd.

In the setup chunk, let’s attach our libraries and read in our lobster data. In addition to the tidyverse package we will also use the skimr package. You will have to install it, but don’t want it to be installed every time you write your code. The following is a nice convention for having the install instructions available (on the same line) as the library() call.

Let’s add a code chunk and explore the data in a few ways.

## # A tibble: 6 x 7
##    year month date    site  transect replicate size_mm
##   <dbl> <dbl> <chr>   <chr>    <dbl> <chr>       <dbl>
## 1  2012     8 8/20/12 ivee         3 A              70
## 2  2012     8 8/20/12 ivee         3 B              60
## 3  2012     8 8/20/12 ivee         3 B              65
## 4  2012     8 8/20/12 ivee         3 B              70
## 5  2012     8 8/20/12 ivee         3 B              85
## 6  2012     8 8/20/12 ivee         3 C              60

head() gives us a look at the first rows of the data (6 by default). I like this because I can see the column names and get a sense of the shape of the data. I can also see the class of each column (double or character)

In this data set, every row is a unique observation. This is called “uncounted” data; you’ll see there is no row for how many lobsters were seen because each row is an observation, or an “n of 1”.

##       year          month           date               site          
##  Min.   :2012   Min.   :8.000   Length:6366        Length:6366       
##  1st Qu.:2015   1st Qu.:8.000   Class :character   Class :character  
##  Median :2017   Median :8.000   Mode  :character   Mode  :character  
##  Mean   :2016   Mean   :8.017                                        
##  3rd Qu.:2018   3rd Qu.:8.000                                        
##  Max.   :2018   Max.   :9.000                                        
##                                                                      
##     transect      replicate            size_mm      
##  Min.   :1.000   Length:6366        Min.   : 18.00  
##  1st Qu.:2.000   Class :character   1st Qu.: 65.00  
##  Median :4.000   Mode  :character   Median : 75.00  
##  Mean   :3.806                      Mean   : 73.02  
##  3rd Qu.:5.000                      3rd Qu.: 82.00  
##  Max.   :9.000                      Max.   :183.00  
##                                     NA's   :6

summary gives us summary statistics for each variable (column). I like this for numeric columns, but it doesn’t give a lot of useful information for non-numeric data. To have a look there I like using the skimr package:

Table 7.1: Data summary
Name lobsters
Number of rows 6366
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 6 7 0 38 0
site 0 1 4 4 0 5 0
replicate 0 1 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2016.24 1.66 2012 2015 2017 2018 2018 ▁▁▂▂▇
month 0 1 8.02 0.13 8 8 8 8 9 ▇▁▁▁▁
transect 0 1 3.81 2.17 1 2 4 5 9 ▇▆▅▃▂
size_mm 6 1 73.02 13.61 18 65 75 82 183 ▁▇▂▁▁

This skimr:: notation is a reminder to me that skim is from the skimr package. It is a nice convention: it’s a reminder to others (especially you!).

skim lets us look more at each variable. I particularly like looking at missing data. There are 6 missing values in the size_mm variable.

We can also make a quick plot to have a look at these data, and use our new ggplot2 skills. Let’s make a bar chart by year for each site

(geom_bar() counts things and geom_col() is for values within the data (mean))

7.5.1 Our task

So this is all great to get a quick look. But what if we needed to report to someone about how the average size of lobsters has changed over time across sites?

To answer this we need to do a pivot table in Excel, or data wrangling in R.

Let’s start by having a quick look at what pivot tables can do in Excel.

7.6 Pivot table demo

Let’s make a pivot table with our lobster data.

Let’s start off with how many lobsters were counted each year. I want a count of rows by year.

So to do this in Excel we would initiate the Pivot Table Process:



And it will do its best to find the data I would like to include in my Pivot Table (it can have difficulty with non-rectangular or “non-tidy” data), and suggest we make this in a new sheet:



And then we’ll get a little wizard to help us create the Pivot Table.

7.6.1 pivot one variable

I want to summarize by year, so I drag “year” down into the “Rows” box, and to get the counts by year I actually drag the same variable, “year” into the “Values” box. And it will create a Pivot Table for me! But “sum” as the default summary statistic, so I can click the little “I” icon to change this to count.



A few things to note:

  • The pivot table is separate entity from our data (it’s on a different sheet); the original data has not been affected
  • The pivot table only shows the variables we requested; we don’t see other columns (like date, month, or site).
  • Notice that in Excel we retain the overall totals for each site (in bold, on the same line with the site name). This is nice for communicating about data. But it can be problematic for further analyses, because it could be easy to take a total of this column and introduce errors.

So pivot tables are great because they summarize the data and keep the raw data raw — they even promote good pratice because they by default ask you if you’d like to present the data in a new sheet rather than in the same sheet.

7.6.2 pivot two variables

We can also add site as a second variable by dragging it:

And then can reverse the order by dragging:

So in terms of our final interest of average size by site and year, we are on our way! I’m going to stop here because we want to be able to do this in R.

The power of R is in the automation, and in keeping that raw data truly raw.

Let’s talk about how this looks like in R.

7.7 group_by() %>% summarize()

In R, we can create the functionality of pivot tables by using 2 main dplyr verbs in combination: group_by and summarize.

Say it with me: “pivot tables are group_by and then summarize”. And just like pivot tables, you have flexibility with how you are going to summarize. For example, we can calculate an average, or a total.

I think it’s incredibly powerful to visualize what we are talking about with our data when do do these kinds of operations. It looks like this (from RStudio’s cheatsheet; all cheatsheets available from https://rstudio.com/resources/cheatsheets):





When we were reporting by year or site, we were essentially modifying what we were grouping by (the different colors here in this figure.

Let’s do this in R.

7.7.1 group_by one variable

Let’s try this on our lobsters data, just like we did in Excel. We will count the the total number of lobster by year. In R vocabulary, we will group_by year and then summarize by counting using n(), which is a function from dplyr. n() counts the number of times an observation shows up, and since this is uncounted data, this will count each row. We’ll also use the pipe operator %>%, which you can read as “and then”.

This to me reads: “take the lobsters data and then group_by year and then summarize by count in a new column called ‘count’”

## # A tibble: 7 x 2
##    year count_by_year
##   <dbl>         <int>
## 1  2012           231
## 2  2013           243
## 3  2014           510
## 4  2015          1100
## 5  2016           809
## 6  2017          1668
## 7  2018          1805

Notice how together, group_by and summarize minimize the amount of information we see. We also saw this with the pivot table. We lose the other columns that aren’t involved here.

Question: What if you don’t group_by first? Let’s try it and discuss what’s going on.

## # A tibble: 1 x 1
##   count
##   <int>
## 1  6366

So if we don’t group_by first, we will get a single summary statistic (sum in this case) for the whole dataset.

Another question: what if we only group_by?

## # A tibble: 6,366 x 7
## # Groups:   year [7]
##     year month date    site  transect replicate size_mm
##    <dbl> <dbl> <chr>   <chr>    <dbl> <chr>       <dbl>
##  1  2012     8 8/20/12 ivee         3 A              70
##  2  2012     8 8/20/12 ivee         3 B              60
##  3  2012     8 8/20/12 ivee         3 B              65
##  4  2012     8 8/20/12 ivee         3 B              70
##  5  2012     8 8/20/12 ivee         3 B              85
##  6  2012     8 8/20/12 ivee         3 C              60
##  7  2012     8 8/20/12 ivee         3 C              65
##  8  2012     8 8/20/12 ivee         3 C              67
##  9  2012     8 8/20/12 ivee         3 D              70
## 10  2012     8 8/20/12 ivee         4 B              85
## # … with 6,356 more rows

7.7.2 RStudio Viewer

Let’s now check the lobsters variable. We can do this by clicking on lobsters in the Environment pane in RStudio.

We see that we haven’t changed any of our original data that was stored in this variable. (Just like how the pivot table didn’t affect the raw data on the original sheet).

Aside: You’ll also see that when you click on the variable name in the Environment pane, View(lobsters) shows up in your Console. View() (capital V) is the R function to view any variable in the viewer. So this is something that you can write in your RMarkdown script, although RMarkdown will not be able to knit this view feature into the formatted document. So, if you want include View() in your RMarkdown document you will need to either comment it out #View() or add eval=FALSE to the top of the code chunk so that the full line reads {r, eval=FALSE}.

7.7.3 group_by multiple variables

Great. Now let’s summarize by both year and site like we did in the pivot table. We are able to group_by more than one variable. Let’s do this together:

## # A tibble: 35 x 3
## # Groups:   site [5]
##    site   year count_by_siteyear
##    <chr> <dbl>             <int>
##  1 aque   2012                38
##  2 aque   2013                32
##  3 aque   2014               100
##  4 aque   2015                83
##  5 aque   2016                48
##  6 aque   2017                67
##  7 aque   2018                54
##  8 carp   2012                78
##  9 carp   2013                93
## 10 carp   2014                79
## # … with 25 more rows

text.

7.7.4 summarize multiple variables

We can summarize multiple variables at a time. So far we’ve done the count of lobster observations. Let’s also do the mean and standard deviation. First let’s use the mean() function to calculate the mean. We do this within the same summarize() function, but we can add a new line to make it easier to read. Notice how when you put your curser within the parenthesis and hit return, the indentation will automatically align.

## # A tibble: 35 x 4
## # Groups:   site [5]
##    site   year count_by_siteyear mean_size_mm
##    <chr> <dbl>             <int>        <dbl>
##  1 aque   2012                38         71  
##  2 aque   2013                32         72.1
##  3 aque   2014               100         76.9
##  4 aque   2015                83         68.5
##  5 aque   2016                48         68.7
##  6 aque   2017                67         73.9
##  7 aque   2018                54         71.7
##  8 carp   2012                78         74.4
##  9 carp   2013                93         76.6
## 10 carp   2014                79         NA  
## # … with 25 more rows

Aside Command-I will properly indent selected lines.

Great! But this will actually calculate some of the means as NA because one or more values in that year are NA. So we can pass an argument that says to remove NAs first before calculating the average. Let’s do that, and then also calculate the standard deviation with the sd() function:

## # A tibble: 35 x 5
## # Groups:   site [5]
##    site   year count_by_siteyear mean_size_mm sd_size_mm
##    <chr> <dbl>             <int>        <dbl>      <dbl>
##  1 aque   2012                38         71        10.2 
##  2 aque   2013                32         72.1      12.3 
##  3 aque   2014               100         76.9       9.32
##  4 aque   2015                83         68.5      12.6 
##  5 aque   2016                48         68.7      12.5 
##  6 aque   2017                67         73.9      11.9 
##  7 aque   2018                54         71.7       8.14
##  8 carp   2012                78         74.4      14.6 
##  9 carp   2013                93         76.6       8.71
## 10 carp   2014                79         79.1       8.57
## # … with 25 more rows

So we can make the equivalent of Excel’s pivot table in R with group_by and then summarize. But a powerful thing about R is that maybe we want this information to be used in further analyses. We can make this easier for ourselves by saving this as a variable. So let’s add a variable assignment to that first line:

## # A tibble: 35 x 5
## # Groups:   site [5]
##    site   year count_by_siteyear mean_size_mm sd_size_mm
##    <chr> <dbl>             <int>        <dbl>      <dbl>
##  1 aque   2012                38         71        10.2 
##  2 aque   2013                32         72.1      12.3 
##  3 aque   2014               100         76.9       9.32
##  4 aque   2015                83         68.5      12.6 
##  5 aque   2016                48         68.7      12.5 
##  6 aque   2017                67         73.9      11.9 
##  7 aque   2018                54         71.7       8.14
##  8 carp   2012                78         74.4      14.6 
##  9 carp   2013                93         76.6       8.71
## 10 carp   2014                79         79.1       8.57
## # … with 25 more rows

7.8 Oh no, our colleague sent the wrong data!

Oh no! After all our analyses and everything we’ve done, our colleague just emailed us at 4:30pm on Friday that he sent the wrong data and we need to redo all our analyses with a new .xlsx file: lobsters2.xlsx, not lobsters.xlsx. Aaaaah!

If we were doing this in Excel, this would be a bummer; we’d have to rebuild our pivot table and click through all of our logic again. And then export our figures and save them into our report.

But, since we did it in R, we are much safer. We can go back to the top of our RMarkdown file, and read in the updated dataset, and then re-knit. We will still need to check that everything outputs correctly, (and that column headers haven’t been renamed), but our first pass will be to update the filename and re-knit:

And now we can see that our plot updated as well:

## # A tibble: 35 x 6
## # Groups:   site [5]
##    site   year count_by_siteyear mean_size_mm sd_size_mm median_size_mm
##    <chr> <dbl>             <int>        <dbl>      <dbl>          <dbl>
##  1 aque   2012                38         71        10.2            70  
##  2 aque   2013                32         72.1      12.3            75  
##  3 aque   2014               100         76.9       9.32           75.5
##  4 aque   2015                83         68.5      12.6            70  
##  5 aque   2016                48         68.7      12.5            71  
##  6 aque   2017                67         73.9      11.9            75  
##  7 aque   2018                54         71.7       8.14           72  
##  8 carp   2012                78         74.4      14.6            74.5
##  9 carp   2013                93         76.6       8.71           76  
## 10 carp   2014                79         79.1       8.57           79  
## # … with 25 more rows

## Saving 7 x 5 in image

## Saving 7 x 5 in image

7.8.1 Knit, push, & show differences on GitHub

So cool.

7.8.2 dplyr::count()

Now that we’ve spent time with group_by %>% summarize, there is a shortcut if you only want to summarize by count. This is with a function called count(), and it will group_by your selected variable, count, and then also ungroup. It looks like this:

Switching gears…

7.9 mutate()


There are a lot of times where you don’t want to summarize your data, but you do want to operate beyond the original data. This is often done by adding a column. We do this with the mutate() function from dplyr. Let’s try this with our original lobsters data. The sizes are in millimeters but let’s say it was important for them to be in meters. We can add a column with this calculation:

## # A tibble: 6 x 7
##    year month date    site  transect replicate size_mm
##   <dbl> <dbl> <chr>   <chr>    <dbl> <chr>       <dbl>
## 1  2012     8 8/20/12 ivee         3 A              70
## 2  2012     8 8/20/12 ivee         3 B              60
## 3  2012     8 8/20/12 ivee         3 B              65
## 4  2012     8 8/20/12 ivee         3 B              70
## 5  2012     8 8/20/12 ivee         3 B              85
## 6  2012     8 8/20/12 ivee         3 C              60
## # A tibble: 6,366 x 8
##     year month date    site  transect replicate size_mm size_m
##    <dbl> <dbl> <chr>   <chr>    <dbl> <chr>       <dbl>  <dbl>
##  1  2012     8 8/20/12 ivee         3 A              70  0.07 
##  2  2012     8 8/20/12 ivee         3 B              60  0.06 
##  3  2012     8 8/20/12 ivee         3 B              65  0.065
##  4  2012     8 8/20/12 ivee         3 B              70  0.07 
##  5  2012     8 8/20/12 ivee         3 B              85  0.085
##  6  2012     8 8/20/12 ivee         3 C              60  0.06 
##  7  2012     8 8/20/12 ivee         3 C              65  0.065
##  8  2012     8 8/20/12 ivee         3 C              67  0.067
##  9  2012     8 8/20/12 ivee         3 D              70  0.07 
## 10  2012     8 8/20/12 ivee         4 B              85  0.085
## # … with 6,356 more rows

If we want to add a column that has the same value repeated, we can pass it just one value, either a number or a character string (in quotes). And let’s save this as a variable called lobsters_detailed

7.10 select()

We will end with one final function, select. This is how to choose, retain, and move your data by columns:


Let’s say that we want to present this data finally with only columns for date, site, and size in meters. We would do this:

## # A tibble: 6,366 x 3
##    date    site  size_m
##    <chr>   <chr>  <dbl>
##  1 8/20/12 ivee   0.07 
##  2 8/20/12 ivee   0.06 
##  3 8/20/12 ivee   0.065
##  4 8/20/12 ivee   0.07 
##  5 8/20/12 ivee   0.085
##  6 8/20/12 ivee   0.06 
##  7 8/20/12 ivee   0.065
##  8 8/20/12 ivee   0.067
##  9 8/20/12 ivee   0.07 
## 10 8/20/12 ivee   0.085
## # … with 6,356 more rows

One last time, let’s knit, save, commit, and push to GitHub.

7.11 Deep thoughts

Highly recommended read: Broman & Woo: Data organization in spreadsheets. Practical tips to make spreadsheets less error-prone, easier for computers to process, easier to share

Great opening line: “Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.”

7.12 Efficiency Tips

arrow keys with shift, option, command