This book teaches the concepts and tools behind reporting modern data analyses in a reproducible manner. Reproducibility is the idea that data analyses should be published or made available together with their data and software code so that others may verify the findings and build upon them. The need for reproducible report writing is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows people to focus on the actual content of a data analysis rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually produced the results are available. This book focuses on literate statistical analysis tools, which let one publish data and analysis in a single document so that others can easily execute the same analysis and obtain the same results.
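As a small sketch of the literate-programming approach the book covers, an R Markdown document interleaves prose with executable R code; rendering it with a tool such as knitr regenerates every result directly from the code. The filename and chunk label below are hypothetical, and the example assumes the rmarkdown/knitr packages are installed:

````markdown
---
title: "Ozone Summary"
output: html_document
---

We summarize ozone levels in R's built-in `airquality` dataset.

```{r ozone-summary}
# Mean ozone concentration, ignoring missing values
mean(airquality$Ozone, na.rm = TRUE)
```
````

Rendering this file (for example, with `rmarkdown::render("ozone.Rmd")`) produces a report in which the computed mean appears inline, so the text and the result can never drift out of sync.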

Roger D. Peng

Roger D. Peng is an Associate Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. He is a Co-Founder of the Johns Hopkins Data Science Specialization, which has enrolled over 1.5 million students, and of the Simply Statistics blog, where he writes about statistics for the general public. Roger can be found on Twitter and GitHub @rdpeng.

  • Getting Started with R
  • What is Reproducible Reporting?
  • The Data Science Pipeline
  • Literate Statistical Programming
  • Organizing a Data Analysis
  • Structure of a Data Analysis: Part 1
  • Structure of a Data Analysis: Part 2
  • Markdown
  • Using knitr for Reproducible Reports
  • Communicating Results Over E-mail
  • Reproducibility Check List
  • Evidence-based Data Analysis
  • Public Reproducibility Resources