Chapter 8 Tidying

TODO: janitor: adorn and kable

8.1 Summary

In previous sessions, we learned to read in data, do some wrangling, and create a graph and table. Here, we’ll continue by reshaping data frames (converting from long-to-wide, or wide-to-long format), separating and uniting variable (column) contents, converting between explicit and implicit missing (NA) values, and cleaning up our column names with the janitor package.

8.2 Objectives

  • Reshape data frames with tidyr::pivot_wider() and tidyr::pivot_longer()
  • Convert column names with janitor::clean_names()
  • Combine or separate information from columns with tidyr::unite() and tidyr::separate()
  • Make implicit missings explicit with tidyr::complete()
  • Make explicit missings implicit with tidyr::drop_na()
  • Use our new skills as part of a bigger wrangling sequence
  • Make a customized table (TODO: or introduce Kable if not time in pivot tables chapter)

8.4 Lesson

8.4.1 Lesson Prep

8.4.1.1 Create a new R Markdown and attach packages

Within your day 2 R Project, create a new .Rmd. Attach the tidyverse, janitor and readxl packages with library(package_name). Knit and save your new .Rmd within the project folder.

8.4.1.2 Read in data

Use readxl::read_excel() to import the “invert_counts_curated.xlsx” data:

Be sure to explore the imported data a bit:

  • View()
  • names()
  • summary()

8.4.2 Reshaping with tidyr::pivot_longer() and tidyr::pivot_wider()

8.4.2.1 Wide-to-longer format with tidyr::pivot_longer()

In tidy format, each variable is contained within a single column. If we look at inverts_df, we can see that the year variable is actually split over 3 columns, so we’d say this is currently in wide format.

There may be times when you want to have data in wide format, but often with code it is more efficient to convert to long format by gathering together observations for a variable that is currently split into multiple columns.

Schematically, converting from wide to long format looks like this:

Generally, the code to gather wide columns together using tidyr::pivot_longer() looks like this:

TODO: Add pivot_longer() schematic

We’ll use tidyr::pivot_longer() to gather data from all years in inverts_df into two columns: one called year, which contains the year (as a number), and another called sp_count that contains the number of each species observed. The new data frame will be stored as inverts_long:

The outcome is the new long-format inverts_long data frame:

## # A tibble: 165 x 5
##    month site  common_name              year  sp_count
##    <chr> <chr> <chr>                    <chr>    <dbl>
##  1 7     abur  california cone snail    2016       451
##  2 7     abur  california cone snail    2017        28
##  3 7     abur  california cone snail    2018       762
##  4 7     abur  california spiny lobster 2016        17
##  5 7     abur  california spiny lobster 2017        17
##  6 7     abur  california spiny lobster 2018        16
##  7 7     abur  orange cup coral         2016        24
##  8 7     abur  orange cup coral         2017        24
##  9 7     abur  orange cup coral         2018        24
## 10 7     abur  purple urchin            2016        48
## # … with 155 more rows

Hooray, long format!

One thing that isn’t obvious at first (but would become obvious if you continued working with this data) is that since those year numbers were initially column names (characters), when they are stacked into the year column, their class wasn’t auto-updated to numeric.

Explore the class of year in inverts_long:

## [1] "character"

We’ll use dplyr::mutate() in a different way here: to create a new column (that’s how we’ve used mutate() previously) that has the same name of an existing column, in order to update and overwrite the existing column.

In this case, we’ll mutate() to add a column called year, which contains an as.numeric() version of the existing year variable:

Checking the class again, we see that year has been updated to a numeric variable:

## [1] "numeric"

8.4.2.2 Long-to-wider format with tidyr::pivot_wider()

In the previous example, we had information spread over multiple columns that we wanted to gather. Sometimes, we’ll have data that we want to spread over multiple columns.

For example, imagine that starting from inverts_long we want each species in the common_name column to exist as its own column. In that case, we would be converting from a longer to a wider format, and will use tidyr::pivot_wider() as follows:

TODO: Add pivot_wider() schematic

Specifically for our data, we write code to spread the common_name column as follows:

## # A tibble: 33 x 8
##    month site   year `california con… `california spi… `orange cup cor…
##    <chr> <chr> <dbl>            <dbl>            <dbl>            <dbl>
##  1 7     abur   2016              451               17               24
##  2 7     abur   2017               28               17               24
##  3 7     abur   2018              762               16               24
##  4 7     ahnd   2016               27               16               24
##  5 7     ahnd   2017               24               16               24
##  6 7     ahnd   2018               24               16               24
##  7 7     aque   2016             4971               48             1526
##  8 7     aque   2017             1752               48             1623
##  9 7     aque   2018             2616               48             1859
## 10 7     bull   2016             1735               24               36
## # … with 23 more rows, and 2 more variables: `purple urchin` <dbl>, `rock
## #   scallop` <dbl>

We can see that now each species has its own column (wider format). But also notice that those column headers (since they have spaces) might not be in the most coder-friendly format…

8.4.2.3 Meet the janitor package

The janitor package by Sam Firke is a brilliant collection of functions for some quick data cleaning. We recommend that you explore the different functions it contains. Like:

  • janitor::clean_names(): update column headers to a case of your choosing
  • janitor::get_dupes(): see all rows that are duplicates within variables you choose
  • janitor::remove_empty(): remove empty rows and/or columns
  • janitor::andorn_*(): jazz up frequency tables of counts (we’ll return to this for a table example in TODO: Session 8)
  • …and more!

Here, we’ll use janitor::clean_names() to convert all of our column headers to a more convenient case - the default is lower_snake_case, which means all spaces and symbols are replaced with an underscore (or a word describing the symbol), all characters are lowercase, and a few other nice adjustments.

For example, janitor::clean_names() would update these nightmare column names into much nicer forms:

  • My...RECENT-income! becomes my_recent_income
  • SAMPLE2.!test1 becomes sample2_test1
  • ThisIsTheName becomes this_is_the_name
  • 2015 becomes x2015

If we wanted to then use these columns (which we probably would, since we created them), we could clean the names to get them into more coder-friendly lower_snake_case with janitor::clean_names():

## [1] "month"                    "site"                    
## [3] "year"                     "california_cone_snail"   
## [5] "california_spiny_lobster" "orange_cup_coral"        
## [7] "purple_urchin"            "rock_scallop"

And there are other options for the case, like:

  • “snake” produces snake_case
  • “lower_camel” or “small_camel” produces lowerCamel
  • “upper_camel” or “big_camel” produces UpperCamel
  • “screaming_snake” or “all_caps” produces ALL_CAPS
  • “lower_upper” produces lowerUPPER
  • “upper_lower” produces UPPERlower

8.4.3 Combine or separate information in columns with tidyr::unite() and tidyr::separate()

Sometimes we’ll want to separate contents of a single column into multiple columns, or combine entries from different columns into a single column.

For example, the following data frame has genus and species in separate columns:
id genus species common_name
1 Scorpaena guttata sculpin
2 Sebastes miniatus vermillion
We may want to combine the genus and species into a single column, scientific_name:
id scientific_name common_name
1 Scorpaena guttata sculpin
2 Sebastes miniatus vermillion

Or we may want to do the reverse (separate information from a single column into multiple columns). Here, we’ll learn tidyr::unite() and tidyr::separate() to help us do both.

8.4.3.1 tidyr::unite() to merge information from separate columns

Use tidyr::unite() to combine (paste) information from multiple columns into a single column (as for the scientific name example above)

To demonstrate uniting information from separate columns, we’ll make a single column that has the combined information from site abbreviation and year in inverts_wide.

We need to give tidyr::unite() several arguments:

  • data: the data frame containing columns we want to combine (or pipe into the function from the data frame)
  • col: the name of the new “united” column
  • the columns you are uniting
  • sep: the symbol, value or character to put between the united information from each column
## # A tibble: 6 x 7
##   month site_year california_cone… california_spin… orange_cup_coral
##   <chr> <chr>                <dbl>            <dbl>            <dbl>
## 1 7     abur_2016              451               17               24
## 2 7     abur_2017               28               17               24
## 3 7     abur_2018              762               16               24
## 4 7     ahnd_2016               27               16               24
## 5 7     ahnd_2017               24               16               24
## 6 7     ahnd_2018               24               16               24
## # … with 2 more variables: purple_urchin <dbl>, rock_scallop <dbl>

Try updating the separator from "_" to “hello!” to see what the outcome column contains.

tidyr::unite() can also combine information from more than two columns. For example, to combine the site, common_name and year columns from inverts_long, we could use:

## # A tibble: 6 x 3
##   month year_site_name                     sp_count
##   <chr> <chr>                                 <dbl>
## 1 7     2016-abur-california cone snail         451
## 2 7     2017-abur-california cone snail          28
## 3 7     2018-abur-california cone snail         762
## 4 7     2016-abur-california spiny lobster       17
## 5 7     2017-abur-california spiny lobster       17
## 6 7     2018-abur-california spiny lobster       16

8.4.3.2 tidyr::separate() to separate information into multiple columns

While tidyr::unite() allows us to combine information from multiple columns, it’s more likely that you’ll start with a single column that you want to split up into pieces.

For example, I might want to split up a column containing the genus and species (Scorpaena guttata) into two separate columns (Scorpaena | guttata), so that I can count how many Scorpaena organisms exist in my dataset at the genus level.

Use tidyr::separate() to “separate a character column into multiple columns using a regular expression separator.”

Let’s start again with inverts_unite, where we have combined the site and year into a single column called site_year. If we want to separate those, we can use:

## Warning: Expected 2 pieces. Additional pieces discarded in 165 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

What is that warning Expected 2 pieces... telling us? If we take a look at the resulting data frame inverts_sep, we see that it only keeps the first two pieces, and gets rid of the third (name). Which is a bit concerning, because we rarely want to just throw away information in a data frame.

## # A tibble: 6 x 4
##   month my_year my_site_name sp_count
##   <chr> <chr>   <chr>           <dbl>
## 1 7     2016    abur              451
## 2 7     2017    abur               28
## 3 7     2018    abur              762
## 4 7     2016    abur               17
## 5 7     2017    abur               17
## 6 7     2018    abur               16

That’s problematic. How can we make sure we’re keeping as many different elements as exist in the united column?

We have a couple of options:

  1. Create the number of columns that are needed to retain as many elements as exist (in this case, 3, but we only created two new columns in the example above)
## Warning: Expected 3 pieces. Additional pieces discarded in 165 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

Another warning. What is that about? Let’s take a look at the resulting data frame and think about what’s missing (what are the “pieces discarded”?):

## # A tibble: 6 x 5
##   month the_year the_site the_name   sp_count
##   <chr> <chr>    <chr>    <chr>         <dbl>
## 1 7     2016     abur     california      451
## 2 7     2017     abur     california       28
## 3 7     2018     abur     california      762
## 4 7     2016     abur     california       17
## 5 7     2017     abur     california       17
## 6 7     2018     abur     california       16

Aha! Only the first word of the common name was retained, and anything else was trashed. We want to keep everything after the second dash in the new the_name column.

That’s because the default is extra = "warn", which means that if you have more pieces than columns you’re separating into, it will populate the columns that have been allotted (in our case, just 3) then drop any additional information, giving you a warning that pieces have been dropped.

To keep the extra pieces that have been dropped, add the extra = "merge" argument within tidyr::separate() to override:

No warning there about things being discarded. Explore inverts_sep_all:

## # A tibble: 165 x 5
##    month sample_year location sp_name                  sp_count
##    <chr> <chr>       <chr>    <chr>                       <dbl>
##  1 7     2016        abur     california cone snail         451
##  2 7     2017        abur     california cone snail          28
##  3 7     2018        abur     california cone snail         762
##  4 7     2016        abur     california spiny lobster       17
##  5 7     2017        abur     california spiny lobster       17
##  6 7     2018        abur     california spiny lobster       16
##  7 7     2016        abur     orange cup coral               24
##  8 7     2017        abur     orange cup coral               24
##  9 7     2018        abur     orange cup coral               24
## 10 7     2016        abur     purple urchin                  48
## # … with 155 more rows

We see that the resulting data frame has split year_site_name into three separate columns, sample_year, location, and sp_name, but now everything after the second break (“-”) remains together in sp_name instead of dropping pieces following the third word.

8.4.4 Convert between explicit and implicit missings (NAs)

An explicit missing is when every possible outcome actually appears in a data frame as a row, even if a variable of interest for that row is missing (NA).

Conversely, an implicit missing is when an observation (row) does not appear in the data frame because a variable of interest contains an NA missing value.

Consider the following data:

day animal food_choice
Monday eagle fish
Monday mountain lion squirrel
Monday toad NA
Tuesday eagle fish
Tuesday mountain lion deer
Tuesday toad flies

Notice that the row for toad still appears in the dataset for Tuesday, despite having a missing food choice for that day. This is an explicit missing because the row still appears in the data frame.

If that row was removed, the resulting dataset would look like this:

day animal food_choice
Monday eagle fish
Monday mountain lion squirrel
Tuesday eagle fish
Tuesday mountain lion deer
Tuesday toad flies

…and if your reaction is “But then how do I know there’s a toad from MONDAY?”, then you can see how it can be a bit risky to have implicit missings instead of explicit missings.

Whichever we choose, we can convert between the two forms using tidyr::drop_na() or tidyr::complete():

  • tidyr::drop_na(): removes observations (rows) that contain NA for variable(s) of interest
  • tidyr::complete(): turns implicit missing values into explicit missing values by completing a data frame with missing combinations of data

We’ll use both here, starting with the inverts_long data frame we created above.

Looking through inverts_long, we’ll see that there are NA observations for every species at site bull in 2018 - but those NA counts do show up. First, we’ll use tidyr::drop_na() to make those missings implicit (invisible) instead:

See that now, the rows that contained an NA in the sp_count column from inverts_long have been removed.

WAIT, I want them back! We can ask R to create explicit missings (by identifying which combinations of groups currently don’t appear in the data frame) using tidyr::complete():

Now you’ll see inverts_explicit_NA has those 5 “missing” observations shown in the data frame.

8.4.5 Activities

TODO

8.4.6 Fun facts / insights

TODO