Roger D. Peng
Roger D. Peng is an Associate Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. He is also a co-founder of the Johns Hopkins Data Science Specialization, which has enrolled over 1.5 million students, and of the Simply Statistics blog, where he writes about statistics for the general public. Roger can be found on Twitter and GitHub @rdpeng.

Preface

Getting Started with R
 Installation
 Getting started with the R interface

Managing Data Frames with the dplyr package
 Data Frames
 The dplyr Package
 dplyr Grammar
 Installing the dplyr package
 select()
 filter()
 arrange()
 rename()
 mutate()
 group_by()
 %>%
 Summary

Exploratory Data Analysis Checklist
 Formulate your question
 Read in your data
 Check the packaging
 Run str()
 Look at the top and the bottom of your data
 Check your “n”s
 Validate with at least one external data source
 Try the easy solution first
 Challenge your solution
 Follow up questions

Principles of Analytic Graphics
 Show comparisons
 Show causality, mechanism, explanation, systematic structure
 Show multivariate data
 Integrate evidence
 Describe and document the evidence
 Content, Content, Content
 References

Exploratory Graphs
 Characteristics of exploratory graphs
 Air Pollution in the United States
 Getting the Data
 Simple Summaries: One Dimension
 Five Number Summary
 Boxplot
 Histogram
 Overlaying Features
 Barplot
 Simple Summaries: Two Dimensions and Beyond
 Multiple Boxplots
 Multiple Histograms
 Scatterplots
 Scatterplot - Using Color
 Multiple Scatterplots
 Summary

Plotting Systems
 The Base Plotting System
 The Lattice System
 The ggplot2 System
 References

Graphics Devices
 The Process of Making a Plot
 How Does a Plot Get Created?
 Graphics File Devices
 Multiple Open Graphics Devices
 Copying Plots
 Summary

The Base Plotting System
 Base Graphics
 Simple Base Graphics
 Some Important Base Graphics Parameters
 Base Plotting Functions
 Base Plot with Regression Line
 Multiple Base Plots
 Summary

Plotting and Color in R
 Colors 1, 2, and 3
 Connecting colors with data
 Color Utilities in R
 colorRamp()
 colorRampPalette()
 RColorBrewer Package
 Using the RColorBrewer palettes
 The smoothScatter() function
 Adding transparency
 Summary

Hierarchical Clustering
 Hierarchical clustering
 How do we define close?
 Example: Euclidean distance
 Example: Manhattan distance
 Example: Hierarchical clustering
 Prettier dendrograms
 Merging points: Complete
 Merging points: Average
 Using the heatmap() function
 Notes and further resources

K-Means Clustering
 Illustrating the K-means algorithm
 Stopping the algorithm
 Using the kmeans() function
 Building heatmaps from K-means solutions
 Notes and further resources

Dimension Reduction
 Matrix data
 Patterns in rows and columns
 Related problem
 SVD and PCA
 Unpacking the SVD: u and v
 SVD for data compression
 Components of the SVD - Variance explained
 Relationship to principal components
 What if we add a second pattern?
 Dealing with missing values
 Example: Face data
 Notes and further resources

The ggplot2 Plotting System: Part 1
 The Basics: qplot()
 Before You Start: Label Your Data
 ggplot2 “Hello, world!”
 Modifying aesthetics
 Adding a geom
 Histograms
 Facets
 Case Study: MAACS Cohort
 Summary of qplot()

The ggplot2 Plotting System: Part 2
 Basic Components of a ggplot2 Plot
 Example: BMI, PM2.5, Asthma
 Building Up in Layers
 First Plot with Point Layer
 Adding More Layers: Smooth
 Adding More Layers: Facets
 Modifying Geom Properties
 Modifying Labels
 Customizing the Smooth
 Changing the Theme
 More Complex Example
 A Quick Aside about Axis Limits
 Resources

Data Analysis Case Study: Changes in Fine Particle Air Pollution in the U.S.
 Synopsis
 Loading and Processing the Raw Data
 Results