Step 1 — What kind of data do you have, and what do you want to know?


Choosing a Plot

| Data type | Question | Plot |
|---|---|---|
| One quantitative variable | Shape, spread, modality | Histogram or density plot |
| One quantitative variable | Outliers, quartiles, skew | Boxplot |
| One quantitative variable | Is it Normal? | QQ-plot |
| One quantitative + one group | Compare distributions | Side-by-side boxplots |
| Two quantitative variables | Linear relationship? | Scatterplot |
| Many quantitative variables | All pairwise relationships | Scatterplot matrix |
| One categorical variable | Frequency or proportion | Bar chart |
| Two categorical variables | Association pattern | Mosaic plot or grouped bar chart |
| Categorical outcome + quantitative predictor | How probability changes | Conditional density plot |

Histogram vs density plot: a histogram depends on the choice of bin width; a density plot depends on the choice of bandwidth. In both cases, too small → spiky; too large → over-smoothed. Use both to cross-check.
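A minimal sketch of that sensitivity, using only the Python standard library. The sample data, the bin widths and the Gaussian kernel are assumptions chosen for illustration; a plotting library would normally do this work for you.

```python
import math
import random

def histogram_counts(values, bin_width):
    """Count observations in consecutive bins of the given width, starting at min(values)."""
    start = min(values)
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

def kde(values, x, bandwidth):
    """Gaussian kernel density estimate at point x (kernel choice is an assumption)."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

random.seed(1)
sample = [random.gauss(50, 10) for _ in range(200)]  # hypothetical data

# Narrow bins give many noisy counts; wide bins hide the shape.
for width in (1, 5, 20):
    print("bin width", width, "->", histogram_counts(sample, width))

# The same trade-off for the density estimate, evaluated at one point.
for bandwidth in (0.5, 5, 30):
    print("bandwidth", bandwidth, "->", round(kde(sample, 50, bandwidth), 4))
```

Comparing the printed counts across widths shows why a single histogram should not be trusted on its own.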


Reading a Distribution

| Shape | Mean vs Median | What it looks like | Implication |
|---|---|---|---|
| Symmetric (Normal-like) | Mean ≈ Median | Bell, roughly equal tails | Parametric methods safe |
| Right-skewed | Mean > Median | Long tail to the right | Outliers inflate mean; use median or robust stats |
| Left-skewed | Mean < Median | Long tail to the left | Outliers deflate mean; use median or robust stats; check for ceiling effects |
| Bimodal | N/A | Two peaks | May indicate two subgroups; split and analyse separately |
| Heavy-tailed | Mean ≈ Median, but wide | Symmetric but wide tails | SD unreliable; use IQR or MAD |

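Five-number summary shortcut: look at the gaps between Min–Q1, Q1–Median, Median–Q3, Q3–Max. Wider gaps on the upper side (Median–Q3, Q3–Max) suggest right skew; wider gaps on the lower side (Min–Q1, Q1–Median) suggest left skew; roughly equal gaps suggest symmetry.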
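The gap check can be sketched in a few lines of Python. The data set is hypothetical, and the "inclusive" quartile method is only one of several conventions, so other software may give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles (one convention of several)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def quartile_gaps(values):
    """Widths of the four segments between consecutive five-number-summary values."""
    lo, q1, med, q3, hi = five_number_summary(values)
    return {"min-q1": q1 - lo, "q1-med": med - q1, "med-q3": q3 - med, "q3-max": hi - q3}

data = [1, 2, 2, 3, 3, 4, 5, 8, 15]  # hypothetical right-skewed data

print(five_number_summary(data))
print(quartile_gaps(data))
print("mean:", statistics.mean(data), "median:", statistics.median(data))
```

For this sample the Q3–Max gap is much wider than the Min–Q1 gap and the mean exceeds the median, which are both signs of right skew.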


Step 2 — Checking Normality

Use in order: visual first, formal test last.

1. Histogram
   └─ Roughly bell-shaped, single peak? → possible Normal
   └─ Skewed, bimodal, or heavy-tailed? → likely NOT Normal

2. QQ-plot (quantile-quantile plot; theoretical quantiles on x, sample quantiles on y)
   └─ Points lie on straight diagonal line → Normal
   └─ S-curve → light tails (platykurtic)
   └─ Inverted S-curve → heavy tails (leptokurtic)
   └─ Points curve up at right end → right-skewed
   └─ Points curve down at left end → left-skewed

3. Formal test (use only for confirmation, not as sole evidence)
   └─ Shapiro-Wilk: preferred for small n
   └─ Kolmogorov-Smirnov: large n
   └─ Caution: with large n, tests reject for even trivial departures from Normality, so the visual check matters more
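What a QQ-plot actually draws can be sketched without a plotting library. The code below computes the (theoretical, sample) quantile pairs and summarises how straight they are with a correlation. The plotting positions (i − 0.5)/n and the straightness score are assumptions for illustration, not a formal test, and both samples are simulated.

```python
import math
import random
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical quantiles, ordered sample) for a Normal QQ-plot."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()  # mean 0, sd 1
    # Plotting positions (i - 0.5) / n are one common convention, assumed here.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered

def qq_straightness(values):
    """Pearson correlation of the QQ points; values near 1 mean a nearly straight line."""
    xs, ys = qq_points(values)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
normal_like = [random.gauss(0, 1) for _ in range(500)]         # simulated Normal data
right_skewed = [random.expovariate(1.0) for _ in range(500)]   # simulated skewed data

print("Normal-like:", round(qq_straightness(normal_like), 3))
print("Right-skewed:", round(qq_straightness(right_skewed), 3))
```

The skewed sample scores lower because its upper points bend away from the line. The plot itself still shows where the departure happens, which the single number cannot.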

Skewness and kurtosis numbers: skewness ≈ 0 → symmetric; excess kurtosis > 0 → fat tails (kurtosis > 3). These supplement plots; don’t rely on them alone.
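A small sketch of how these two numbers can be computed from standardised moments. It uses the population (divide-by-n) versions; statistical packages often apply small-sample corrections, so their output may differ slightly. The data sets are hypothetical.

```python
import statistics

def skewness(values):
    """Third standardised moment (population version); 0 for a symmetric sample."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def kurtosis(values):
    """Fourth standardised moment (population version); equals 3 for a Normal distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values)

symmetric = [1, 2, 3, 4, 5]                   # hypothetical symmetric data
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15]   # hypothetical right-skewed data

print("symmetric:", skewness(symmetric), kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), kurtosis(right_skewed))
```

Subtract 3 from the kurtosis to get excess kurtosis.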


Step 3 — Checking Association Between Two Quantitative Variables

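| Measure | Formula | Range | Limitation |
|---|---|---|---|
| Pearson $r$ | $r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$ | $-1 \le r \le 1$ | Linear association only |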

$r$ close to $\pm 1$ = strong linear association; $r = 0$ = no linear association (a curved relationship can exist and $r$ can still be 0).

Always plot a scatterplot before reporting $r$. Outliers can distort $r$ heavily.
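Both warnings can be checked numerically. The sketch below implements the standard Pearson correlation formula directly and applies it to three small, made-up data sets: a perfect line, a perfect parabola, and a weakly related set with one extreme point added.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by the two standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # exact straight line
curved = [x * x for x in xs]       # exact parabola: strong relationship, but not linear

print("linear:", pearson_r(xs, linear))
print("curved:", pearson_r(xs, curved))

# Hypothetical weakly related data, then the same data plus one extreme point.
base_x, base_y = [1, 2, 3, 4, 5], [3, 1, 4, 2, 3]
print("without outlier:", round(pearson_r(base_x, base_y), 3))
print("with outlier:", round(pearson_r(base_x + [30], base_y + [30]), 3))
```

The parabola shows why a correlation of zero does not mean "no relationship", and the last pair of lines shows a single point turning a weak correlation into a strong one.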

When to use a scatterplot matrix: when you have 3+ quantitative variables and want to scan all pairwise relationships simultaneously. Use a correlation heatmap alongside it to spot clusters.
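The numbers behind a correlation heatmap are just the pairwise correlations. Here is a minimal sketch that computes them for every pair of columns; the column names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pairwise_correlations(columns):
    """Map each unordered pair of column names to its Pearson correlation."""
    return {(a, b): pearson_r(columns[a], columns[b]) for a, b in combinations(columns, 2)}

columns = {  # hypothetical data set with three quantitative variables
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 3, 4, 1, 2],
}

for pair, r in pairwise_correlations(columns).items():
    print(pair, round(r, 2))
```

Each value corresponds to one panel of the scatterplot matrix, so a surprising number is a prompt to look at that panel.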


Step 4 — Outlier Detection via Boxplot Rule

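Boxplot rule (1.5 × IQR): flag an observation x as a potential outlier if x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR, where IQR = Q3 − Q1.

This is a flag, not a deletion rule. Investigate outliers: they may be data errors or genuinely extreme observations. Do not remove without justification.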
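A minimal sketch of the flagging step, assuming the usual boxplot convention of fences at 1.5 × IQR beyond the quartiles. The multiplier, the quartile method and the data are assumptions for illustration. The function only returns the flagged values and leaves the data untouched.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the usual boxplot convention."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [1, 2, 2, 3, 3, 4, 5, 8, 15]  # hypothetical data
print(boxplot_outliers(data))  # → [15]
```

Here Q1 = 2 and Q3 = 5, so the fences are −2.5 and 9.5 and only 15 is flagged for investigation.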


See also: L3 Exploring Quantitative Data · L4 Exploring Categorical Data · L1 Introduction to R · Robust Statistics Guide