Chapter 9 Dplyr and vlookups

9.1 Summary

In previous sessions, we’ve learned to do some basic wrangling and find summary information with functions in the dplyr package, which exists within the tidyverse. We’ve used:

  • count(): get counts of observations for groupings we specify
  • mutate(): add a new column, while keeping the existing ones
  • group_by(): let R know that groups exist within the dataset, by variable(s)
  • summarize(): calculate a value (that you specify) for each group, then report each group’s value in a table

In this session, we’ll expand our data wrangling toolkit using:

  • filter() to conditionally subset our data by rows, and
  • *_join() functions to merge data frames together

The combination of filter() and *_join() - to return rows satisfying a condition we specify, and merging data frames by like variables - is analogous to the useful VLOOKUP function in Excel.

9.1.1 Objectives

  • Use filter() to subset data frames, returning rows that satisfy variable conditions
  • Use full_join(), left_join(), and inner_join() to merge data frames, with different endpoints in mind
  • Use filter() and *_join() as part of a wrangling sequence

9.2 Lessons

9.2.1 Getting started - Create a new .Rmd, attach packages & get data

Create a new R Markdown document in your r-workshop project and knit to save as filter_join.Rmd. Remove all the example code (everything below the set-up code chunk).

In this session, we’ll use three packages:

  • tidyverse
  • readxl
  • here

Attach the packages in the setup code chunk in your .Rmd:

Then create a new code chunk to read in three files from your ‘data’ subfolder:

  • invert_counts_curated.xlsx
  • fish_counts_curated.csv
  • kelp_counts_curated.xlsx

We should always explore the data we’ve read in using functions like View(), names(), summary(), head() and tail() to ensure that the data we think we’ve read in is actually the data we’ve read in.

Now, let’s use filter() to decide which observations (rows) we’ll keep or exclude in new subsets, similar to using Excel’s VLOOKUP function.

9.2.2 filter() to conditionally subset by rows

Use filter() to let R know which rows you want to keep or exclude, based whether or not their contents match conditions that you set for one or more variables.

Some examples in words that might inspire you to use filter():

  • “I only want to keep rows where the temperature is greater than 90°F.”
  • “I want to keep all observations except those where the tree type is listed as unknown.”
  • “I want to make a new subset with only data for mountain lions (the species variable) in California (the state variable).”

When we use filter(), we need to let R know a couple of things:

  • What data frame we’re filtering from
  • What condition(s) we want observations to match and/or not match in order to keep them in the new subset

Here, we’ll learn some common ways to use filter().

9.2.2.1 Filter rows by matching a single character string

Let’s say we want to keep all observations from the fish_counts data frame where the common name is “garibaldi.” Here, we need to tell R to only keep rows from the fish_counts data frame when the common name (common_name variable) exactly matches garibaldi. Use == to ask R to look for matching strings:

Check out the fish_garibaldi object to ensure that only garibaldi observations remain.

You could also do this using the pipe operator %>%:

9.2.2.2 Activity

Task: Create a subset from the fish_counts data frame, stored as object fish_abur, that only contains observations from Arroyo Burro (site ‘abur’).

Solution:

Explore the subset you just created to ensure that only Arroyo Burro observations are returned.

9.2.2.3 Filter rows based on numeric conditions

Use expected operators (>, <, >=, <=, ==) to set conditions for a numeric variable when filtering. For this example, we only want to retain observations when the total_count column value is >= 50:

Or, using the pipe:

9.2.2.4 Filter to return rows that match this OR that OR that

What if we want to return a subset of the fish_counts df that contains garibaldi, blacksmith OR black surfperch?

There are several ways to write an “OR” statement for filtering, which will keep any observations that match Condition A or Condition B or Condition C. In this example, we will create a subset from fish_counts that only contains rows where the common_name is garibaldi or blacksmith or black surfperch.

Use %in% to ask R to look for any matches within a combined vector of strings:

Alternatively, you can indicate OR using the vertical line operator | to do the same thing (but you can see that it’s more repetitive when looking for matches within the same variable):

9.2.2.5 Filter to return rows that match conditions for multiple variables

In the previous examples, we set filter conditions based on a single variable (e.g. common_name). What if we want to return observations that satisfy conditions for multiple variables?

For example: We want to create a subset that only returns rows from ‘invert_counts’ where the site is “abur” or “mohk” and the common_name is “purple urchin.” In filter(), add a comma (or ampersand, &) between arguments for multiple AND conditions:

## # A tibble: 2 x 6
##   month site  common_name   `2016` `2017` `2018`
##   <chr> <chr> <chr>          <dbl>  <dbl>  <dbl>
## 1 7     abur  purple urchin     48     48     48
## 2 7     mohk  purple urchin    620    505    323

Like most things in R, there are other ways to do the same thing. For example, you could do the same thing using & (instead of a comma) between “and” conditions:

Or you could just do two filter steps in sequence:

9.2.2.5.1 Activity: combined filter conditions

Task: Create a subset from the fish_counts data frame, called low_gb_wr that only contains:

  • Observations of garibaldi and rock wrasse
  • Where the total_count is less than or equal to 10

Solution:

Sync your local project to your repo on GitHub.

9.2.2.6 Filter to return rows that do not match conditions

Sometimes we might want to exclude observations. Here, let’s say we want to make a subset that contains all rows from fish_counts except those recorded at the Mohawk Reef site (“mohk” in the site variable).

We use != to return observations that do not match a condition.

Like this:

This similarly works to exclude observations by a value.

For example, if we want to return all observations except those where the total fish count is 1, we use:

What if we want to exclude observations for multiple conditions? For example, here we want to return all rows where the fish species is not garibaldi or rock wrasse.

We can use filter(!variable %in% c("apple", "orange")) to return rows where the variable does not match “apple” or “orange”. For our fish example, that looks like this:

Which then only returns observations for the other fish species in the dataset.

## # A tibble: 6 x 4
##    year site  common_name     total_count
##   <dbl> <chr> <chr>                 <dbl>
## 1  2016 abur  black surfperch           2
## 2  2016 abur  blacksmith                1
## 3  2016 abur  senorita                 58
## 4  2016 aque  black surfperch           1
## 5  2016 aque  blacksmith                1
## 6  2016 aque  senorita                 57

9.2.2.7 Example: combining filter() with other functions using the pipe operator (%>%)

We can also use filter() in combination with the functions we previously learned for wrangling. If we have multiple sequential steps to perform, we can string them together using the pipe operator (%>%).

Here, we’ll start with the invert_counts data frame and create a subset that:

  • Converts to long(er) format with pivot_longer()
  • Only keeps observations for rock scallops
  • Calculates the total count of rock scallops by site only
## # A tibble: 11 x 2
##    site  tot_count
##    <chr>     <dbl>
##  1 abur         48
##  2 ahnd         48
##  3 aque        152
##  4 bull         48
##  5 carp       2519
##  6 golb         48
##  7 ivee        169
##  8 mohk        346
##  9 napl       6416
## 10 scdi       2390
## 11 sctw       1259

9.2.3 Merging data frames with *_join()

Excel’s VLOOKUP can also be used to merge data from separate tables or worksheets. Here, we’ll use the *_join() functions to merge separate data frames in R.

There are a number of ways to merge data frames in R. We’ll use full_join(), left_join(), and inner_join() in this session.

From R Documentation (?join):

  • full_join(): “returns all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.” Basically, nothing gets thrown out, even if a match doesn’t exist - making full_join() the safest option for merging data frames. When in doubt, full_join().

  • left_join(): “return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.”

  • inner_join(): “returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.” This will drop observations that don’t have a match between the merged data frames, which makes it a riskier merging option if you’re not sure what you’re trying to do.

Schematic (from RStudio data wrangling cheat sheet):

To clarify what the different joins are doing, let’s first make a subset of the fish_counts data frame that only contains observations from 2016 and 2017.

Take a look to ensure that only those years are included with View(fish_2016_2017). Now, let’s merge it with our kelp fronds data in different ways.

9.2.3.1 full_join() to merge data frames, keeping everything

When we join data frames in R, we need to tell R a couple of things (and it does the hard joining work for us):

  • Which data frames we want to merge together
  • Which variables to merge by

Note: If there are exactly matching column names in the data frames you’re merging, the *_join() functions will assume that you want to join by those columns. If there are no matching column names, you can specify which columns to join by manually. We’ll do both here.

When we try to do that join, we get an error message: Error: Can't join on 'year' x 'year' because of incompatible types (character / numeric)

Let’s google this. That means copying this from the console and pasting it into Google.

What’s going on here? First, there’s something fishy (ha) going on with the class of the year variable in kelp_counts_abur. Use the class() function to see how R understands that variable (remember, we use $ to return a specific column from a data frame).

## [1] "character"

So the variable is currently stored as a character. Why?

If we go back to the kelp_counts_curated.xlsx file, we’ll see that the numbers in both the year and month column have been stored as text. There are several hints Excel gives us:

  • Cells are left aligned, when values stored as numbers are right aligned
  • The green triangles in the corner indicate some formatting
  • The warning sign shows up when you click on one of the values with text formatting, and lets you know that the cell has been stored as text. We are given the option to reformat as numeric in Excel, but we’ll do it here in R so we have a reproducible record of the change to the variable class.

There are a number of ways to do this in R. We’ll use mutate() to overwrite the existing year column while coercing it to class numeric using the as.numeric() function.

Now if we check the class of the year variable in kelp_counts_abur, we’ll see that it has been coerced to ‘numeric’:

## [1] "numeric"

Question: Isn’t it bad practice to overwrite variables, instead of just making a new one? Great question, and usually the answer is yes. Here, we feel fine with “overwriting” the year column because we’re not changing anything about what’s contained within the column, we’re only changing how R understands it. Always use caution if overwriting variables, and if in doubt, add one instead!

OK, so now the class of year in the data frames we’re joining is the same. Let’s try that full_join() again:

Let’s look at the merged data frame with View(abur_kelp_join). A few things to notice about how full_join() has worked:

  1. All columns that existed in both data frames still exist.
  2. All observations are retained, even if they don’t have a match. In this case, notice that for other sites (not ‘abur’) the observation for fish still exists, even though there was no corresponding kelp data to merge with it. The kelp frond data from 2018 is also returned, even though the fish counts dataset did not have ‘year == 2018’ in it.
  3. The kelp frond data is joined to all observations where the joining variables (year, site) are a match, which is why it is repeated 5 times for each year (once for each fish species).

Because all data (observations & columns) are retained, full_join() is the safest option if you’re unclear about how to merge data frames. #### left_join() to merge data frames, keeping everything in the ‘x’ data frame and only matches from the ‘y’ data frame

Now, we want to keep all observations in fish_2016_2017, and merge them with kelp_counts_abur while only keeping observations from kelp_counts_abur that match an observation within fish_2016_2017. So when we use left_join(), any information on kelp counts from 2018 should be dropped.

## Joining, by = c("year", "site")

Notice when you look at fish_kelp_2016_2017, the 2018 data that does exist in kelp_counts_abur does not get joined to the fish_2016_2017 data frame, because left_join(df_a, df_b) will only keep observations from df_b if they have a match in df_a!

9.2.3.2 inner_join() to merge data frames, only keeping observations with a match in both

When we used left_join(df_a, df_b), we kept all observations in df_a but only observations from df_b that matched an entry in df_a (in other words, some entries from df_b were excluded).

Use inner_join() if you know that you only want to retain observations when they match across both data frames. Caution: this is built to exclude any observations that don’t match across data frames by joined variables - double check to make sure this is actually what you want to do!

For example, if we use inner_join() to merge fish_counts and kelp_counts_abur, then we are asking R to only return observations where the joining variables (year and site) have matches in both data frames. Let’s see what the outcome is:

## Joining, by = c("year", "site")
## # A tibble: 15 x 6
##     year site  common_name     total_count month total_fronds
##    <dbl> <chr> <chr>                 <dbl> <chr>        <dbl>
##  1  2016 abur  black surfperch           2 7              307
##  2  2016 abur  blacksmith                1 7              307
##  3  2016 abur  garibaldi                 1 7              307
##  4  2016 abur  rock wrasse               2 7              307
##  5  2016 abur  senorita                 58 7              307
##  6  2017 abur  black surfperch           4 7              604
##  7  2017 abur  blacksmith                1 7              604
##  8  2017 abur  garibaldi                 1 7              604
##  9  2017 abur  rock wrasse              57 7              604
## 10  2017 abur  senorita                 64 7              604
## 11  2018 abur  black surfperch           1 7             3532
## 12  2018 abur  blacksmith                1 7             3532
## 13  2018 abur  garibaldi                 1 7             3532
## 14  2018 abur  rock wrasse               1 7             3532
## 15  2018 abur  senorita                  1 7             3532

Here, we see that only observations where there is a match for year and site in both data frames are returned.

9.2.3.3 *_join() in a sequence

We can also merge data frames as part of a sequence of wrangling steps.

As an example: Starting with the invert_counts data frame, we want to:

  • First, use pivot_longer() to get year and counts each into a single column
  • Convert the class of year to numeric (so it can join with another numeric year variable)
  • Then, only keep observations for “california spiny lobster”
  • Next, join the kelp_counts_abur to the resulting subset above, only keeping observations that have a match in both data frames

That might look like this:

## Joining, by = c("month", "site", "year")
## # A tibble: 3 x 6
##   month site  common_name               year total_counts total_fronds
##   <chr> <chr> <chr>                    <dbl>        <dbl>        <dbl>
## 1 7     abur  california spiny lobster  2016           17          307
## 2 7     abur  california spiny lobster  2017           17          604
## 3 7     abur  california spiny lobster  2018           16         3532
9.2.3.3.1 Activity

Now let’s combine what we’ve learned about piping, filtering and joining!

Task: Complete the following as part of a single sequence (remember, check to see what you’ve produced after each step) to create a new data frame called my_fish_join:

  • Start with fish_counts data frame
  • Filter to only including observations for 2017 at Arroyo Burro
  • Join the kelp_counts_abur data frame to the resulting subset using left_join()
  • Add a new column that contains the ‘fish per kelp fronds’ density (total_count / total_fronds)

Solution:

Sync your project with your repo on GitHub.

9.3 Fun / kind of scary facts

How is this similar to VLOOKUP in Excel? How does it differ?

From Microsoft Office Support, “use VLOOKUP when you need to find things in a table or a range by row.”

So, both filter() and VLOOKUP look through your data frame (or spreadsheet, in Excel) to look for observations that match your conditions. But they also differ in important ways:

  1. By default VLOOKUP looks for and returns an observation for approximate matches (and you have to change the final argument to FALSE to look for an exact match). In contrast, by default filter() will look for exact conditional matches.

  2. VLOOKUP will look for and return information from the first observation that matches (or approximately matches) a condition. filter() will return all observations (rows) that exactly match a condition.

9.4 Interludes (deep thoughts/openscapes)

  • Not overusing the pipe in really long sequences. What are other options? Why is that a concern? What are some ways to always know that what’s happening in a sequence is what you EXPECT is happening in a sequence? tidylog, check intermediate data frames, sometimes write intermediate data frames, etc.

  • The risk of partial joins (& a case for full_join + drop_na instead?)

9.5 Efficiency Tips

  • Comment out multiline code with Command + Shift + C
  • Knit with Command + Shift + K