The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. In the life sciences, data analysis is now part of practically every research project. Genomics, in particular, is being driven by new measurement technologies that permit us to observe certain molecular entities for the first time. These observations are leading to discoveries analogous to identifying microorganisms and other breakthroughs permitted by the invention of the microscope. Prime examples of these technologies are microarrays and next-generation sequencing. This book covers several of the statistical concepts and data analysis skills needed to succeed in data-driven life science research. We go from relatively basic concepts related to computing p-values to advanced topics related to analyzing high-throughput data.

While statistics textbooks focus on mathematics, this book focuses on using a computer to perform data analysis. Instead of explaining the mathematics and theory and then showing examples, we start by stating a practical data-related challenge. The book also includes the computer code that provides a solution to the problem and helps illustrate the concepts behind the solution. By running the code yourself and seeing data generation and analysis happen live, you will gain a better intuition for the concepts, the mathematics, and the theory. The book was created using R Markdown, and we make all of this code available to the reader. This means that readers can replicate all the figures and analyses used to create the book.

Rafael A Irizarry

Rafael Irizarry is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health and a Professor of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute. For the past 17 years, Dr. Irizarry’s research has focused on the analysis of genomics data. Dr. Irizarry is also one of the founders of the Bioconductor Project, an open-source and open-development software project for the analysis of genomic data.

Michael I Love

Michael Love is a postdoctoral fellow with Dr. Irizarry in the Department of Biostatistics at the Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health. Dr. Love uses statistical models to infer biologically meaningful patterns from high-throughput sequencing data, and develops open-source statistical software for the Bioconductor Project.

  • Acknowledgements

  • Introduction

    • What Does This Book Cover?

    • How Is This Book Different?

  • Getting Started

    • Installing R

    • Installing RStudio

    • Learn R Basics

    • Installing Packages

    • Importing Data into R

    • Brief Introduction to dplyr

    • Mathematical Notation

  • Inference

    • Introduction

    • Random Variables

    • The Null Hypothesis

    • Distributions

    • Probability Distribution

    • Normal Distribution

    • Populations, Samples and Estimates

    • Central Limit Theorem and t-distribution

    • Central Limit Theorem in Practice

    • t-tests in Practice

    • The t-distribution in Practice

    • Confidence Intervals

    • Power Calculations

    • Monte Carlo Simulation

    • Parametric Simulations for the Observations

    • Permutation Tests

    • Association Tests

  • Exploratory Data Analysis

    • Quantile Quantile Plots

    • Boxplots

    • Scatterplots And Correlation

    • Stratification

    • Bi-variate Normal Distribution

    • Plots To Avoid

    • Misunderstanding Correlation (Advanced)

    • Robust Summaries

    • Wilcoxon Rank Sum Test

  • Matrix Algebra

    • Motivating Examples

    • Matrix Notation

    • Solving System of Equations

    • Vectors, Matrices and Scalars

    • Matrix Operations

    • Examples

  • Linear Models

    • The Design Matrix

    • The Mathematics Behind lm()

    • Standard Errors

    • Interactions and Contrasts

    • Linear Model with Interactions

    • Analysis of variance

    • Co-linearity

    • Rank

    • Removing Confounding

    • The QR Factorization (Advanced)

    • Going Further

  • Inference For High Dimensional Data

    • Introduction

    • Inference in Practice

    • Procedures

    • Error Rates

    • The Bonferroni Correction

    • False Discovery Rate

    • Direct Approach to FDR and q-values (Advanced)

    • Basic Exploratory Data Analysis

  • Statistical Models

    • The Binomial Distribution

    • The Poisson Distribution

    • Maximum Likelihood Estimation

    • Distributions for Positive Continuous Values

    • Bayesian Statistics

    • Hierarchical Models

  • Distance and Dimension Reduction

    • Introduction

    • Euclidean Distance

    • Distance in High Dimensions

    • Dimension Reduction Motivation

    • Singular Value Decomposition

    • Projections

    • Rotations

    • Multi-Dimensional Scaling Plots

    • Principal Component Analysis

  • Basic Machine Learning

    • Clustering

    • Conditional Probabilities and Expectations

    • Smoothing

    • Bin Smoothing

    • Loess

    • Class Prediction

    • Cross-validation

  • Batch Effects

    • Confounding

    • Confounding: High-throughput Example

    • Discovering Batch Effects with EDA

    • Gene Expression Data

    • Motivation for Statistical Approaches

    • Adjusting for Batch Effects with Linear Models

    • Factor Analysis

    • Modeling Batch Effects with Factor Analysis