Chapter 5 readxl

5.1 Summary

Check this, may need to be a block quote: The readxl package makes it easy to import tabular data from Excel spreadsheets (.xls or .xlsx files) and includes several options for cleaning data during import. readxl has no external dependencies and functions on any operating system, making it an OS- and user-friendly package that simplifies getting your data from Excel into R.

5.1.1 Objectives

  • Use read_csv() to read in a comma separated value (CSV) file
  • Use read_excel() to read in an Excel worksheet from an Excel workbook
  • Replace a specific string/value in a spreadsheet with with NA
  • Skip n rows when importing an Excel worksheet
  • Use read_excel() to read in parts of a worksheet (by cell range)
  • Specify column names when importing Excel data
  • Read and combine data from multiple Excel worksheets into a single df using purrr::map_df()
  • Write data using write_csv() or write_xlsx()

5.2 Lesson

5.2.1 Attach packages

In the .Rmd you just created within your version-controlled r-workshop R project, attach the tidyverse, readxl, writexl, and here packages.

In this lesson, we’ll read in a CSV file with the read_csv() function, so we need to have the readr package attached. Since it’s part of the tidyverse, we’ll go ahead and attach the tidyverse package below our script header using library(package_name). It’s a good idea to attach packages within the set-up chunk in R Markdown, so we’ll also attach the readxl, writexl, and here packages there.

Here’s our first code chunk:

{r setup, eval = FALSE}
knitr::opts_chunk$set(echo = TRUE)

# Attach the tidyverse, readxl, writexl and here packages:
library(tidyverse)
library(readxl)
library(writexl)
library(here)

Now, all of the packages and functions within the tidyverse and readxl, including read_csv() and read_excel(), are available for use.

5.2.2 Use read_csv() to read in data from a CSV file

There are many types of files containing data that you might want to work with in R. A common one is a comma separated value (CSV) file, which contains values with each column entry separated by a comma delimiter. CSVs can be opened, viewed, and worked with in Excel just like an .xls or .xlsx file - but let’s learn how to get data directly from a CSV into R where we can work with it more reproducibly.

The CSV we’ll read in here is called “fish_counts_curated.csv”, and contains observations for “the abundance and size of fish species as part of SBCLTER’s kelp forest monitoring program to track long-term patterns in species abundance and diversity” from the Santa Barbara Channel Long Term Ecological Research program.

Source: Reed D. 2018. SBC LTER: Reef: Kelp Forest Community Dynamics: Fish abundance. Environmental Data Initiative. https://doi.org/10.6073/pasta/dbd1d5f0b225d903371ce89b09ee7379. Dataset accessed 9/26/2019.

Read in the “fish_counts_curated.csv” file read_csv("file_name.csv"), and store it in R as an object called fish_counts:

Notice that the name of the stored object (here, fish_counts) will show up in our Environment tab in RStudio.

Click on the object in the Environment, and R will automatically run the View() function for you to pull up your data in a separate viewing tab. Now we can look at it in the spreadsheet format we’re used to.

Here are a few other functions for quickly exploring imported data:

  • summary(): summary of class, dimensions, NA values, etc.
  • names(): variable names (column headers)
  • ls(): list all objects in environment
  • head(): Show the first x rows (default is 6 lines)
  • tail(): Show the last x rows (default is 6 lines)

Now that we have our fish counts data ready to work with in R, let’s get the substrate cover and kelp data (both .xlsx files). In the following sections, we’ll learn that we can use read_excel() to read in Excel files directly.

5.2.3 Use read_excel() to read in a single Excel worksheet

First, take a look at substrate_cover_curated.xlsx in Excel, which contains a single worksheet with substrate type and percent cover observations at different sampling locations in the Santa Barbara Channel.

A few things to notice:

  • The file contains a single worksheet
  • There are multiple rows containing text information up top
  • Where observations were not recorded, there exists ‘-9999’

Let’s go ahead and read in the data. If the file is in our working directory, we can read in a single worksheet .xlsx file using read_excel("file_name.xlsx"). Note: read_excel() works for both .xlsx and .xls types.

Like this:

Tada? Not quite.

Click on the object name (substrate_cover) in the Environment to view the data in a new tab. A few things aren’t ideal:

## # A tibble: 23,942 x 9
##    `Substrate cover dataset,… ...2  ...3   ...4  ...5  ...6  ...7  ...8   ...9  
##    <chr>                      <chr> <chr>  <chr> <chr> <chr> <chr> <chr>  <chr> 
##  1 Source: https://portal.ed… <NA>  <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>  
##  2 Accessed: 9/28/2019        <NA>  <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>  
##  3 <NA>                       <NA>  <NA>   <NA>  <NA>  <NA>  <NA>  <NA>   <NA>  
##  4 year                       month date   site  tran… quad  side  subst… perce…
##  5 -9999                      -9999 -9999  carp  1     20    i     b      -9999 
##  6 2000                       9     -9999  carp  1     20    o     b      -9999 
##  7 2000                       9     9/8/00 carp  1     20    i     b      100   
##  8 2000                       9     9/8/00 carp  1     20    o     b      100   
##  9 2000                       9     9/8/00 carp  1     40    i     b      100   
## 10 2000                       9     9/8/00 carp  1     40    o     b      100   
## # … with 23,932 more rows
  • The top row of text has automatically become the (messy) column headers
  • There are multiple descriptive rows before we actually get to the data
  • There are -9999s that we want R to understand as NA instead

We can deal with those issues by adding arguments within read_excel(). Like:

  • Add skip = n to skip the first ‘n’ rows when importing data
  • Add na = "this" to replace “this” with NA when reading in spreadsheet data
## # A tibble: 23,938 x 9
##    year  month date   site  transect quad  side  substrate_type percent_cover
##    <chr> <chr> <chr>  <chr> <chr>    <chr> <chr> <chr>          <chr>        
##  1 <NA>  <NA>  <NA>   carp  1        20    i     b              <NA>         
##  2 2000  9     <NA>   carp  1        20    o     b              <NA>         
##  3 2000  9     9/8/00 carp  1        20    i     b              100          
##  4 2000  9     9/8/00 carp  1        20    o     b              100          
##  5 2000  9     9/8/00 carp  1        40    i     b              100          
##  6 2000  9     9/8/00 carp  1        40    o     b              100          
##  7 2000  9     9/8/00 carp  2        20    i     b              90           
##  8 2000  9     9/8/00 carp  2        20    o     b              80           
##  9 2000  9     9/8/00 carp  2        40    i     b              80           
## 10 2000  9     9/8/00 carp  2        40    o     b              85           
## # … with 23,928 more rows

Check out substrate_cover, and see that the first row after the 4 skipped are the column names, and all -9999s have been updated to NA. Hooray!

5.2.4 Use read_excel() to read in only part of an Excel worksheet

We always advocate for leaving the raw data raw, and writing a complete script containing all steps of data wrangling & transformation. But in some situations (be careful), you may want to specify a range of cells to read in from an Excel worksheet.

You can specify a range of cells to read in using the range = argument in read_excel(). For example, if I want to read in the rectangle from D12:I15 in substrate_cover_curated.xlsx - only observations for Carpenteria Beach (Transect 2) in September 2000 - I can use:

But yuck. Look at carp_cover_2000 and you’ll notice that the first row of that range is automatically made the column headers. To keep all rows within a range and add your own column names, add a col_names = argument:

## # A tibble: 4 x 6
##   site_name transect quad  plot_side type  coverage
##   <chr>     <chr>    <chr> <chr>     <chr> <chr>   
## 1 carp      2        20    i         b     90      
## 2 carp      2        20    o         b     80      
## 3 carp      2        40    i         b     80      
## 4 carp      2        40    o         b     85

So far we’ve read in a single CSV file using read_csv(), and an Excel file containing a single worksheet with read_excel(). Next, let’s read in data from an Excel workbook that contains multiple worksheets.

5.2.5 Use read_excel() to read in selected worksheets from a workbook

Now, we’ll read in the kelp fronds data from file kelp_counts_curated.xlsx. Open the file in Excel, and notice see that it contains multiple worksheets with giant kelp observations in the Santa Barbara Channel during July 2016, 2017, and 2018, with data collected at each site in a separate worksheet.

To read in a single Excel worksheet from a workbook we’ll again use read_excel("file_name.xlsx"), but we’ll need to let R know which worksheet to get.

Let’s read in the kelp data just like we did above, as an object called kelp_counts.

You might be thinking, “Hooray, I got all of my Excel workbook data!” But remember to always look at your data - you will see that actually only the first worksheet was read in. The default in read_excel() is to read in the first worksheet in a multi-sheet Excel workbook.

To check the worksheet names in an Excel workbook, use excel_sheets():

If we want to read in data from a worksheet other than the first one in an Excel workbook, we can specify the correct worksheet by name or position by adding the sheet argument.

Let’s read in data from the worksheet named golb (Goleta Beach) in the kelp_counts_curated.xlsx workbook:

Note that you can also specify a worksheet by position: since golb is the 6th worksheet in the workbook, we could also use the following:

## # A tibble: 3 x 4
##   year  month site  total_fronds
##   <chr> <chr> <chr>        <dbl>
## 1 2016  7     golb          2557
## 2 2017  7     golb          1575
## 3 2018  7     golb          1629

5.2.6 Read in and combine data from multiple worksheets into a data frame simultaneously with purrr::map_df()

So far, we’ve read in entire Excel worksheets and pieces of a worksheet. What if we have a workbook (like kelp_counts_curated.xlsx) that contains worksheets that contain observations for the same variables, in the same organization? Then we may want to read in data from all worksheets, and combine them into a single data frame.

We’ll use purrr::map_df() to loop through all the worksheets in a workbook, reading them in & putting them together into a single data frame in the process.

The steps we’ll go through in the code below are:

  1. Set a pathway so that R knows where to look for an Excel workbook
  2. Get the names of all worksheets in that workbook with excel_sheets()
  3. Set names of a vector with set_names()
  4. Read in all worksheets, and put them together into a single data frame with purrr::map_df()

Aside: the pipe operator (%>%)

There are many ways to use R functions in sequence. One way, that follows the order that we think about steps in sequence, is using the pipe operator (%>%). We can use the pipe operator between steps in a sequence, each place we think “and then do this.”

For example, if I would like to walk my dog and then eat a burrito, in code that might be: walk(dog) %>% eat(burrito, type = "California")

Here, we’ll use the pipe operator to complete steps 1 - 4 above in sequence:

Check out kelp_all_sites, and notice that now the data from all 11 sites is now collected into a single data frame:

## # A tibble: 32 x 4
##    year  month site  total_fronds
##    <chr> <chr> <chr>        <dbl>
##  1 2016  7     abur           307
##  2 2017  7     abur           604
##  3 2018  7     abur          3532
##  4 2016  7     ahnd          2572
##  5 2017  7     ahnd            16
##  6 2018  7     ahnd            16
##  7 2016  7     aque         11152
##  8 2017  7     aque          9194
##  9 2018  7     aque          7754
## 10 2016  7     bull          6706
## # … with 22 more rows

5.2.7 Save data frames as .csv or .xlsx with write_csv() or write_xlsx()

There are a number of reasons you might want to save (/export) data in a data frame as a .csv or Excel worksheet, including:

  • To store raw data within the project you’re working in
  • To store copies of intermediate data frames
  • To convert your data back to a format that your coworkers/clients/colleagues will be able to use it more easily

Use write_csv(object, "file_name.csv") to write a data frame to a CSV, or write_xlsx(object, "file_name.xlsx") to similarly export as a .xlsx (or .xls) worksheet.

In the previous step, we combined all of our kelp frond observations into a single data frame. Wouldn’t it make sense to store a copy?

As a CSV:

A cool thing about write_csv() is that it just quietly works without wrecking anything else you do in a sequence, so it’s great to add at the end of a piped sequence.

For example, if I want to read in the range cells C1:D3 ‘ivee’ worksheet from kelp_counts_curated.xlsx, then write that new subset to a .csv file, I can pipe all the way through:

Now I’ve created kelp_ivee_subset.csv, but the object kelp_ivee_subset also exists as an object for me to use in R.

If needed, I can also export a data frame as an Excel (.xlsx) worksheet:

5.3 Activity: Import some invertebrates!

There’s one dataset we haven’t imported or explored yet: invertebrate counts for 5 popular invertebrates (California cone snail, California spiny lobster, orange cup coral, purple urchin and rock scallops) at 11 sites in the Santa Barbara Channel. Take a look at the invert_counts_curated.xlsx data by opening it in Excel

  • Read in the invert_counts_curated.xlsx worksheet as object ‘inverts_july’, only retaining site, common_name, and 2016 and setting the existing first row in the worksheet as to column headers upon import
  • Explore the imported data frame using View, names, head, tail, etc.
  • Write ‘inverts_july’ to a CSV file in your working directory called “inverts_july.csv”

5.4 Efficiency Tips

  • Add an assignment arrow in (<-): Alt + minus (-)
  • Undo shortcut: Command + Z
  • Redo shortcut: Command + Shift + Z

5.5 Additional thoughts

  • Economist article about gene > dates issue in Excel
  • Mine: data frame of bike casualities in NC, column names are age ranges but some of them import as dates
  • Excel makes some wrong assumptions and doesn’t give you a heads up about its decision making