Step 1 — When to use regression
| Situation | Use |
|---|---|
| Continuous $Y$, one continuous $X$ | Simple linear regression |
| Continuous $Y$, multiple $X$s (continuous or categorical) | Multiple linear regression |
| Want to quantify the relationship / make predictions | Regression |
| Want to test if a group difference exists | t-test or ANOVA (regression with only dummy predictors is equivalent) |
Regression vs ANOVA: ANOVA is a special case of regression in which all predictors are categorical indicator variables. Fitted to the same data, the two give the same F-test and the same fitted group means.
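The equivalence can be checked numerically. The sketch below uses made-up data for three hypothetical groups (`A`, `B`, `C`) and assumes NumPy is available; it computes the one-way ANOVA F statistic from sums of squares, then the overall F statistic from a regression on dummy variables with `A` as the reference.

```python
import numpy as np

# Hypothetical data: one response measured in three groups.
groups = {
    "A": [4.1, 5.0, 4.6, 5.3],
    "B": [6.2, 5.8, 6.9, 6.4],
    "C": [5.1, 4.9, 5.6, 5.2],
}
y = np.concatenate([np.array(v) for v in groups.values()])
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
n, k = len(y), len(groups)

# One-way ANOVA F statistic from between- and within-group sums of squares.
grand_mean = y.mean()
ss_between = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum(((np.array(v) - np.mean(v)) ** 2).sum() for v in groups.values())
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression with an intercept and k - 1 dummy variables (reference = "A").
X = np.column_stack([np.ones(n), labels == "B", labels == "C"]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
ss_res = ((y - fitted) ** 2).sum()
ss_reg = ((fitted - grand_mean) ** 2).sum()
f_reg = (ss_reg / (k - 1)) / (ss_res / (n - k))

print(f"ANOVA F = {f_anova:.4f}, regression F = {f_reg:.4f}")
print(f"intercept = mean of A: {beta[0]:.3f}")
```

The two F statistics agree up to floating-point error, and the intercept equals the mean of the reference group.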
Step 2 — Interpreting Outputs
| Output | What it means |
|---|---|
| $\hat\beta_0$ (intercept) | Predicted $Y$ when all $X_j = 0$; often not directly meaningful |
| $\hat\beta_j$ (slope) | Change in predicted $Y$ for a 1-unit increase in $X_j$, holding others fixed |
| $R^2$ | Proportion of variance in $Y$ explained by the model; range $[0, 1]$ |
| Adjusted $R^2$ | $R^2$ penalised for number of predictors; use this to compare models of different sizes |
| F-test p-value | Tests if at least one predictor is useful; overall model significance |
| t-test p-value (per coefficient) | Tests if that specific $\beta_j \neq 0$ (null hypothesis $\beta_j = 0$), given the other predictors in the model |
$R^2$ vs Adjusted $R^2$: $R^2$ never decreases when you add a variable (even a useless one). Adjusted $R^2$ decreases if the new variable doesn’t help enough. Use adjusted $R^2$ when comparing models.
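A small sketch of the penalty, using the standard formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$ for $n$ observations and $p$ predictors. The function name `adjusted_r2` and the numbers are illustrative assumptions, not values from a real dataset.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a model with p predictors (plus intercept) and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical numbers: adding a near-useless predictor nudges R^2 up slightly.
n = 20
small = adjusted_r2(r2=0.500, n=n, p=1)
large = adjusted_r2(r2=0.501, n=n, p=2)

print(f"1 predictor:  adjusted R^2 = {small:.3f}")
print(f"2 predictors: adjusted R^2 = {large:.3f}")
```

Here the plain value rises from 0.500 to 0.501, while the adjusted value falls, which is the signal that the extra predictor is not earning its place.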
Step 3 — Categorical Predictors (Indicator Variables)
| Rule | Detail |
|---|---|
| A categorical variable with $k$ levels needs $k - 1$ dummy variables | One level is the reference category (absorbed into intercept) |
| Dummy coefficient interpretation | Difference in mean $Y$ between that level and the reference, holding other predictors fixed |
| Interaction term $X_1 \times X_2$ | Allows slope of $X_1$ to differ across levels of $X_2$ (or vice versa) |
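Why $k - 1$ and not $k$? Using all $k$ dummies causes perfect collinearity with the intercept, because the dummy columns sum to the intercept column. The model matrix is then rank-deficient, $X^\top X$ is singular, and the coefficients are not uniquely determined.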
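The collinearity can be seen directly from the rank of the model matrix. This is a minimal sketch with a hypothetical three-level variable (levels `a`, `b`, `c`), assuming NumPy is available.

```python
import numpy as np

# Hypothetical categorical variable with k = 3 levels, 6 observations.
levels = np.array(["a", "b", "c", "a", "b", "c"])
intercept = np.ones(len(levels))
dummies = np.column_stack([(levels == lv).astype(float) for lv in ["a", "b", "c"]])

# Intercept plus all k dummies: the dummy columns sum to the intercept column.
X_full = np.column_stack([intercept, dummies])
# Intercept plus k - 1 dummies: level "a" is the reference category.
X_ref = np.column_stack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_ref = np.linalg.matrix_rank(X_ref)
print(f"all k dummies: {X_full.shape[1]} columns, rank {rank_full}")
print(f"k - 1 dummies: {X_ref.shape[1]} columns, rank {rank_ref}")
```

With all dummies included, the matrix has more columns than its rank; dropping the reference level restores full column rank.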
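Example: gender with reference = Female, so $\text{Male} = 1$ for males and $0$ for females, in the model $\hat{Y} = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2\,\text{Male}$:
- $\hat\beta_2$: males have $\hat\beta_2$ higher predicted $Y$ than females on average, at the same value of $X$
Adding an interaction, $\hat{Y} = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2\,\text{Male} + \hat\beta_3 (X \times \text{Male})$:
- Now the slope of $X$ is $\hat\beta_1$ for females, $\hat\beta_1 + \hat\beta_3$ for males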
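The interaction example can be sketched numerically. The data below are made up and noise-free, so the fitted coefficients are easy to read; the names `b0` to `b3` (intercept, $X$, Male dummy, interaction) are labels chosen for this illustration, and NumPy is assumed.

```python
import numpy as np

# Hypothetical noise-free data so the fitted coefficients are easy to read.
x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
male = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # 0 = Female (reference)
y = 1.0 + 2.0 * x + 3.0 * male + 0.5 * x * male

# Columns: intercept, X, Male dummy, X * Male interaction.
design = np.column_stack([np.ones_like(x), x, male, x * male])
b0, b1, b2, b3 = np.linalg.lstsq(design, y, rcond=None)[0]

slope_female = b1        # slope of X in the reference group
slope_male = b1 + b3     # the interaction coefficient shifts the slope for males
print(f"slope for females: {slope_female:.2f}")
print(f"slope for males:   {slope_male:.2f}")
```

Once the interaction is in the model, the Male coefficient `b2` is the male minus female difference at $X = 0$ only, not an overall average difference.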
Step 4 — Residual Diagnostics
Always run these after fitting. The p-values and confidence intervals from the model are only trustworthy if the assumptions approximately hold.
Residuals vs Fitted values plot
│
├─ Random scatter around zero, no pattern → assumptions OK ✓
├─ Curved / U-shape → non-linearity; add polynomial term (X²)
├─ Funnel / wedge (variance increases with fitted values) → heteroscedasticity
│   └─ Fix: log-transform Y, or use weighted least squares
└─ Systematic trend remaining → missing variable; add it to the model
QQ-plot of standardised residuals
├─ Points on diagonal → Normality OK ✓
└─ Deviations → Normality violated; consider transformation or robust regression
Residuals vs each X variable
└─ Pattern remaining → that variable needs a non-linear term
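Standardised residuals: $r_i = \dfrac{e_i}{s\sqrt{1 - h_{ii}}}$, where $e_i$ is the raw residual, $s$ is the residual standard error, and $h_{ii}$ is the leverage. Points with $r_i > 2$ or $r_i < -2$ are potential outliers (a common rule of thumb).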
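Leverage: $h_{ii} > 2(p+1)/n$ (a common rule of thumb, with $p$ predictors and $n$ observations) flags a high-leverage point, one that is extreme in $X$-space. High leverage alone is not a problem; high leverage plus a large residual makes an influential point.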
Cook’s distance: $D_i > 1$ (a common rule of thumb) suggests the point is likely influential, meaning that removing it would substantially change $\hat\beta$. Investigate, but don’t automatically delete.
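The three diagnostics can be computed by hand from the hat matrix $H = X(X^\top X)^{-1}X^\top$. The sketch below uses made-up data with one deliberately extreme point and assumes NumPy. The cut-offs in the code (2 for standardised residuals, twice the number of coefficients over $n$ for leverage, 1 for Cook's distance) are common rules of thumb assumed for this illustration, not fixed standards.

```python
import numpy as np

# Hypothetical data: the last point is far out in x and off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.0, 20.0])
n = len(y)

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
n_params = X.shape[1]                          # p + 1 estimated coefficients
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverages h_ii
e = y - H @ y                                  # raw residuals
s2 = (e ** 2).sum() / (n - n_params)           # residual variance estimate

r = e / np.sqrt(s2 * (1.0 - h))                # standardised residuals
cooks_d = r ** 2 * h / (n_params * (1.0 - h))  # Cook's distance

for i in range(n):
    flags = []
    if abs(r[i]) > 2:
        flags.append("large residual")
    if h[i] > 2 * n_params / n:
        flags.append("high leverage")
    if cooks_d[i] > 1:
        flags.append("influential")
    print(f"point {i}: r={r[i]:+.2f}, h={h[i]:.2f}, D={cooks_d[i]:.2f}", ", ".join(flags))

flagged = [i for i in range(n) if cooks_d[i] > 1]
```

Only the last point is flagged on all three criteria, matching the idea that high leverage combined with a large residual produces an influential point.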
Step 5 — Assumption Summary
| Assumption | How to check |
|---|---|
| Linearity | Residuals vs fitted and residuals vs each $X$: no pattern |
| Normality of errors | QQ-plot of standardised residuals; Shapiro-Wilk |
| Homoscedasticity (equal variance) | Residuals vs fitted — no funnel shape |
| Independence of observations | Mainly study design; the standard residual plots do not reveal dependence unless residuals are plotted against time or collection order |
Step 6 — Quick Flowchart
Fit model → plot residuals vs fitted
│
├─ Random scatter → check QQ-plot of residuals
│   ├─ On diagonal → model OK; report R²_adj, coefficients, p-values
│   └─ Deviations → Normality violated; consider transformation
│
├─ Curved pattern → add X² term; refit
├─ Funnel pattern → transform Y (try log); refit
└─ Trend → add missing variable; refit
After refit → repeat diagnostics until residuals are clean
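One pass through the "curved pattern → add X² → refit" branch might look like the sketch below. The data are made up, NumPy is assumed, and the correlation between residuals and $X^2$ is only a crude numeric stand-in for looking at the residual plot, which remains the real check.

```python
import numpy as np

# Hypothetical data with a truly quadratic relationship plus a small fixed wiggle.
x = np.linspace(-3.0, 3.0, 13)
wiggle = 0.1 * np.array([1, -1, 0, 1, -1, 0, 1, -1, 0, 1, -1, 0, 1])
y = 2.0 + 1.5 * x + 0.8 * x ** 2 + wiggle

def fit(design, response):
    """Least-squares fit; returns coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(design, response, rcond=None)
    return beta, response - design @ beta

# First pass: straight line only.
X_lin = np.column_stack([np.ones_like(x), x])
_, resid_lin = fit(X_lin, y)
# Crude stand-in for "curved pattern": do the residuals track x^2?
curve_lin = np.corrcoef(resid_lin, x ** 2)[0, 1]

# Refit with an added x^2 term and repeat the check.
X_quad = np.column_stack([X_lin, x ** 2])
beta_quad, resid_quad = fit(X_quad, y)
curve_quad = np.corrcoef(resid_quad, x ** 2)[0, 1]

print(f"linear fit:    corr(residuals, x^2) = {curve_lin:.3f}")
print(f"quadratic fit: corr(residuals, x^2) = {curve_quad:.3f}")
```

The straight-line residuals are almost perfectly correlated with $X^2$, and the correlation vanishes after the refit, so the next step in the flowchart would be the QQ-plot.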
See also: L9 Linear Regression · Test Selection Guide · EDA Guide