Step 1 — When to use regression

Situation | Use
--- | ---
Continuous $Y$, one continuous $X$ | Simple linear regression
Continuous $Y$, multiple $X$s (continuous or categorical) | Multiple linear regression
Want to quantify the relationship / make predictions | Regression
Want to test if a group difference exists | t-test or ANOVA (regression with only dummy predictors is equivalent)

Regression vs ANOVA: ANOVA is a special case of regression where all predictors are categorical indicator variables. Fitted to the same data, the two give the same F-test and the same fitted group means.
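The equivalence can be checked numerically. The sketch below uses made-up data for three hypothetical groups, computes the one-way ANOVA F statistic from sums of squares, then refits the same data as a regression on an intercept plus two dummy variables. NumPy is assumed; the group labels and values are illustrative only.

```python
import numpy as np

# Hypothetical data: three groups of four observations each.
groups = {
    "a": np.array([4.0, 5.0, 6.0, 5.0]),
    "b": np.array([7.0, 8.0, 6.0, 7.0]),
    "c": np.array([9.0, 10.0, 11.0, 10.0]),
}
y = np.concatenate(list(groups.values()))
n, k = y.size, len(groups)

# One-way ANOVA: F from between-group and within-group sums of squares.
grand_mean = y.mean()
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression on an intercept plus k - 1 dummies (group "a" is the reference).
labels = np.repeat(list(groups), [g.size for g in groups.values()])
X = np.column_stack([np.ones(n), labels == "b", labels == "c"]).astype(float)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_resid = ((y - X @ beta) ** 2).sum()
ss_total = ((y - grand_mean) ** 2).sum()
f_regression = ((ss_total - ss_resid) / (k - 1)) / (ss_resid / (n - k))

print(f_anova, f_regression)  # the two F statistics agree
print(beta)                   # reference-group mean, then differences from it
```

The fitted coefficients are the reference-group mean followed by each other group's difference from it, which is why the two procedures carry the same information.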


Step 2 — Interpreting Outputs

Output | What it means
--- | ---
$\hat{\beta}_0$ (intercept) | Predicted $Y$ when all $X_j = 0$; often not directly meaningful
$\hat{\beta}_j$ (slope) | Change in predicted $Y$ for a 1-unit increase in $X_j$, holding others fixed
$R^2$ | Proportion of variance in $Y$ explained by the model; range $[0, 1]$
Adjusted $R^2$ | $R^2$ penalised for number of predictors; use this to compare models of different sizes
F-test p-value | Tests if at least one predictor is useful; overall model significance
t-test p-value (per coefficient) | Tests if that specific $\beta_j$ differs from zero, given the other predictors in the model

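$R^2$ vs Adjusted $R^2$: $R^2$ never decreases when you add a variable (even a useless one). Adjusted $R^2$ decreases if the new variable doesn’t help enough to offset the penalty for the extra predictor. Use adjusted $R^2$ when comparing models with different numbers of predictors.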
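A minimal sketch of this behaviour, assuming NumPy and simulated data: a predictor that is unrelated to the response by construction is added to a simple regression, and both statistics are recomputed. The helper `r_squared` and the variable names are hypothetical, and the adjusted value uses the usual formula $1 - (1 - R^2)(n - 1)/(n - p - 1)$ with $p$ predictors.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include the intercept column."""
    n, n_coef = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_resid = ((y - X @ beta) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_resid / ss_total
    p = n_coef - 1  # number of predictors, excluding the intercept
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
useless = rng.normal(size=n)              # unrelated to y by construction
y = 2.0 + 1.5 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, useless])

r2_small, adj_small = r_squared(X_small, y)
r2_big, adj_big = r_squared(X_big, y)
print(f"R^2:      {r2_small:.4f} -> {r2_big:.4f}")   # cannot go down
print(f"adjusted: {adj_small:.4f} -> {adj_big:.4f}")  # may go down
```

Because the smaller model is nested in the larger one, the plain value cannot fall; whether the adjusted value falls depends on how much residual variance the extra column happens to absorb.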


Step 3 — Categorical Predictors (Indicator Variables)

Rule | Detail
--- | ---
A categorical variable with $k$ levels needs $k - 1$ dummy variables | One level is the reference category (absorbed into intercept)
Dummy coefficient interpretation | Difference in mean $Y$ between that level and the reference, holding other predictors fixed
Interaction term $X_1 \times X_2$ | Allows slope of $X_1$ to differ across levels of $X_2$ (or vice versa)

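Why $k - 1$ and not $k$? Using all $k$ dummies causes perfect collinearity with the intercept, because the $k$ dummy columns sum to the intercept column. The model matrix is then rank-deficient, $X^\top X$ is singular, and the coefficients cannot be uniquely estimated.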
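The rank deficiency is easy to see on a small design matrix. The sketch below assumes NumPy and a hypothetical three-level factor; it compares the column count with the matrix rank for the full dummy set and for the reference-coded set.

```python
import numpy as np

# Hypothetical three-level factor (k = 3).
levels = np.array(["low", "mid", "high", "low", "mid", "high", "low", "high"])
categories = ["low", "mid", "high"]

dummies = np.column_stack([(levels == c).astype(float) for c in categories])
intercept = np.ones((levels.size, 1))

X_all = np.hstack([intercept, dummies])          # intercept + k dummies
X_ref = np.hstack([intercept, dummies[:, 1:]])   # intercept + k - 1 dummies ("low" is the reference)

print(X_all.shape[1], np.linalg.matrix_rank(X_all))  # more columns than rank: collinear
print(X_ref.shape[1], np.linalg.matrix_rank(X_ref))  # columns equal rank: estimable

# The dummy columns add up to the intercept column, which is the source of the problem.
print(np.allclose(dummies.sum(axis=1), intercept[:, 0]))
```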

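Example: gender with reference = Female, where $\text{Male} = 1$ for males and $0$ for females:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2\,\text{Male}$$

  • $\hat{\beta}_2$: males have a predicted $Y$ that is $\hat{\beta}_2$ higher than females on average, at the same $X$

Adding an interaction:

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2\,\text{Male} + \hat{\beta}_3\,(X \times \text{Male})$$

  • Now the slope of $X$ is $\hat{\beta}_1$ for females, $\hat{\beta}_1 + \hat{\beta}_3$ for males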
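To make the interaction reading concrete, here is a small sketch assuming NumPy and the model $Y = \beta_0 + \beta_1 X + \beta_2\,\text{Male} + \beta_3\,(X \times \text{Male})$. The data are generated without noise from assumed coefficients (1, 2, 3, 0.5), chosen only for illustration, so the fit recovers them and the two group slopes can be read off directly.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
male = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # 0 = Female (reference)

# Noise-free response built from assumed, illustrative coefficients.
y = 1.0 + 2.0 * x + 3.0 * male + 0.5 * x * male

# Design matrix: intercept, X, Male dummy, X * Male interaction.
X = np.column_stack([np.ones_like(x), x, male, x * male])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

slope_female = b1        # Male = 0, so the interaction term vanishes
slope_male = b1 + b3     # Male = 1 adds b3 to the slope
print(slope_female, slope_male)
```

With the interaction in the model, the dummy coefficient is the male-female gap at $X = 0$ only; at other values of $X$ the gap also includes the interaction term.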

Step 4 — Residual Diagnostics

Always run these after fitting. The p-values and confidence intervals from the model are only trustworthy if the assumptions hold.

Residuals vs Fitted values plot
│
├─ Random scatter around zero, no pattern → assumptions OK ✓
├─ Curved / U-shape → non-linearity; add polynomial term (X²)
├─ Funnel / wedge (variance increases with fitted values) → heteroscedasticity
│   └─ Fix: log-transform Y, or use weighted least squares
└─ Systematic trend remaining → missing variable; add it to the model

QQ-plot of standardised residuals
├─ Points on diagonal → Normality OK ✓
└─ Deviations → Normality violated; consider transformation or robust regression

Residuals vs each X variable
└─ Pattern remaining → that variable needs a non-linear term
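The curved-residual case can be reproduced without any plotting. The sketch below, assuming NumPy and a noise-free quadratic relationship invented for illustration, fits a straight line, shows that the residuals are positive at both ends and negative in the middle (the U-shape), then refits with an added X² term. The helper `fit_residuals` is a hypothetical name.

```python
import numpy as np

x = np.linspace(-3, 3, 25)
y = 1.0 + 0.5 * x + 2.0 * x**2   # noise-free curved relationship (illustrative)

def fit_residuals(X, y):
    """Return OLS residuals for design matrix X (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

resid_linear = fit_residuals(np.column_stack([np.ones_like(x), x]), y)
resid_quad = fit_residuals(np.column_stack([np.ones_like(x), x, x**2]), y)

# Straight-line fit: residuals are positive at both ends, negative in the middle.
print(resid_linear[0] > 0, resid_linear[12] < 0, resid_linear[-1] > 0)

# After adding the X² term the pattern is gone (residuals are numerically zero here).
print(np.abs(resid_quad).max())
```

With real, noisy data the residuals would not vanish after the refit, but the systematic U-shape should.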

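Standardised residuals: $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})$, where $e_i$ is the raw residual, $\hat{\sigma}$ is the residual standard error, and $h_{ii}$ is the leverage. Points with $r_i > 2$ or $r_i < -2$ are potential outliers.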

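Leverage: $h_{ii} > 2p/n$ (a common rule of thumb, with $p$ the number of estimated coefficients and $n$ the sample size) flags a high-leverage point, one that is extreme in $X$-space. High leverage alone is not a problem; high leverage plus a large residual makes an influential point.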

Cook’s distance: $D_i > 1$ (a common rule of thumb) suggests the point is influential: removing it would substantially change $\hat{\beta}$. Investigate, but don’t automatically delete.
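These three quantities can all be computed from the hat matrix. The following is a sketch under stated assumptions: NumPy, a made-up data set whose last point lies far out in X, and the standard definitions $r_i = e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})$ and $D_i = \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1 - h_{ii}}$, where $p$ counts all estimated coefficients including the intercept.

```python
import numpy as np

# Hypothetical data; the last observation is extreme in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 15.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 20.0])

X = np.column_stack([np.ones_like(x), x])
n, p = X.shape   # p counts coefficients, including the intercept

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverages h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# OLS fit, raw residuals, and residual standard error.
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma_hat = np.sqrt((resid ** 2).sum() / (n - p))

# Standardised residuals and Cook's distance.
std_resid = resid / (sigma_hat * np.sqrt(1 - leverage))
cooks_d = std_resid ** 2 * leverage / (p * (1 - leverage))

for i in range(n):
    print(f"{i}: h={leverage[i]:.3f}  r={std_resid[i]:+.2f}  D={cooks_d[i]:.3f}")
```

The leverages always sum to $p$, so their average is $p/n$; rule-of-thumb cut-offs for leverage are multiples of that average.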


Step 5 — Assumption Summary

Assumption | How to check
--- | ---
Linearity | Residuals vs fitted and residuals vs each $X$: no pattern
Normality of errors | QQ-plot of standardised residuals; Shapiro-Wilk
Homoscedasticity (equal variance) | Residuals vs fitted: no funnel shape
Independence of observations | Study design (not checkable from residual plots)

Step 6 — Quick Flowchart

Fit model → plot residuals vs fitted
│
├─ Random scatter → check QQ-plot of residuals
│   ├─ On diagonal → model OK; report R²_adj, coefficients, p-values
│   └─ Deviations → Normality violated; consider transformation
│
├─ Curved pattern → add X² term; refit
├─ Funnel pattern → transform Y (try log); refit
└─ Trend → add missing variable; refit

After refit → repeat diagnostics until residuals are clean

See also: L9 Linear Regression · Test Selection Guide · EDA Guide