October 21, 2016, Hopkins Marine Station, Stanford University

Outline

  • Definitions and tools

  • My programming origin story

  • Reproducibility, collaboration and communication with open science tools
    • Lowndes et al., in prep
    • reproducibility is fundamental, but rarely tested
    • tools have changed how we do science
  • Resources and recommendations
    • exposure to tools and confidence to use them

Data science and open science

Data science:

"an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge" (Grolemund & Wickham 2016)

Open science:

"the concept of transparency at all stages of the research process, coupled with free and open access to data, code, and papers" (Hampton et al. 2014)

Open science tools

Open science workflow

My programming origin story

Photo credit: Greg Auger

Some thesis questions

  • what are Humboldt squid habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?
  • how do I work with data that's too big for Excel?
  • how do I subset years or other attributes?
  • how do I visualize this?
  • how on earth do I even think about this?

Conflated questions

Science:

  • what are their habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?

Data science:

  • how do I work with data that's too big for Excel?
  • how do I subset years or other attributes?
  • how do I visualize this?
  • how on earth do I even think about this?

I learned to program like many do

  • in a panic
  • for a single purpose (get this thesis done!)
  • in isolation*


* except for wonderful programming mentors:

Steve Haddock, Dave Foley, Ashley Booth

NCEAS, UC Santa Barbara

Ocean Health Index

method to categorize benefits that oceans provide to people

scores are modeled using existing data; data intensive

method can be tailored to different geographies

can help inform policy decisions, especially when repeated

OHI Global Assessments

2013: second annual global assessment

  • repeat methods
  • update data
  • compare between years


We expected to easily reproduce our previous work. We had planned ahead:

  • coded models
  • 130 pages of published supplemental material
  • internal documents and notes

We thought we were doing reproducible science

We struggled to reproduce our work using standard approaches

…mainly due to our approaches to data preparation (data science)

Additional challenge of managing multiple years

Overcoming three main challenges

Completed second assessment by addressing:

  1. reproducibility
  2. collaboration
  3. communication




Lowndes et al. Improving reproducibility, collaboration, and communication in environmental science using open science tools, in prep

Addressing challenges using open science tools

Reproducibility - data preparation

"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information." - NYTimes (2014)

  • transforming, rescaling, gap-filling, formatting, renaming, etc.
  • seldom mentioned but underpins the scientific process

Reproducibility - data preparation

Before

  • manually (without coding)
  • largely Microsoft Excel
  • internal documents and emails

After

  • full process coded
    • R with documentation
    • RMarkdown

Reproducibility - version control

"For scientist coders, [Git] works like a laboratory notebook for scientific computing…it keeps a lasting record of events." - Nature 2016

Reproducibility - version control

Before

  • filenames suffixed with dates, initials
    • e.g. final.csv and final_JL-2016-08-05.csv
  • email descriptions of what changed between files

After

  • version control with git
  • short messages accompany committed changes

Collaboration - communication + file sharing

Before

  • email chains (often forwarded)

After

  • GitHub issues

Collaboration - communication + file sharing

Demo link (private)

Communication - sharing data, code, methods

Ocean Health Index Today

These tools and this workflow make our science possible.

  • December 8, 2016: releasing 5th global assessment
  • Support and training for government or academic 'OHI+' assessments

All on ohi-science.org

Better science in less time

  • incremental adoption
  • always improving, learning
  • teaching and training, support

My recommendations

Get to your science questions sooner



1. Learn to code
    - in R
    - with RStudio

2. Use version control
    - git
    - with GitHub
    - through RStudio

Introduce these concepts incrementally

Great resources

Learn to program in an intentional way

  • feeling empowered, not in a panic
  • thinking ahead, not for a single purpose
  • with a community, not in isolation

Community

Thank you