Description
Portuguese secondary-school student performance records (UCI ML repo). The running example in L3 Exploring Quantitative Data focuses on G3 (the final math grade). Other variables include:
- Medu: mother’s education level (0–4, ordinal)
- goout: frequency of going out with friends (1–5)
- Walc, Dalc: weekend / weekday alcohol consumption
- address, paid: categorical (urban/rural, tuition yes/no)
File: data/student/student-mat.csv (semicolon-delimited).
Numerical summaries
From Numerical Summaries
Basics: count, missing values, central tendency (mean/median), spread (sd, IQR, range).
R code
stud_perf <- read.table("data/student/student-mat.csv", sep=";",
header=TRUE)
summary(stud_perf$G3)
sum(is.na(stud_perf$G3))
Python code
import pandas as pd
stud_perf = pd.read_csv("data/student/student-mat.csv", delimiter=";")
stud_perf.G3.describe()
# stud_perf.G3.info()
Conditioning on an explanatory variable (mother’s education):
R code
round(aggregate(G3 ~ Medu, data=stud_perf, FUN=summary), 2)
table(stud_perf$Medu)
Python code
stud_perf[['Medu', 'G3']].groupby('Medu').describe()
Interpretation notes:
- mean ≈ median → roughly symmetric
- mean > median → likely right-skew (a few large values pulling mean up)
- mean < median → likely left-skew
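The rule of thumb above can be sketched in a few lines. This is only an illustration: the scores are made up (not taken from student-mat.csv), and both the helper name skew_hint and its tolerance tol are assumptions chosen for the example.

```python
import statistics

# Hypothetical scores: mostly mid-range, with a few large values in the right tail.
scores = [8, 9, 10, 10, 11, 11, 12, 12, 13, 19, 20]

mean = statistics.mean(scores)
median = statistics.median(scores)

def skew_hint(m, med, tol=0.5):
    """Rule-of-thumb label from comparing mean and median (tol is an arbitrary cut-off)."""
    if abs(m - med) <= tol:
        return "roughly symmetric"
    return "likely right-skew" if m > med else "likely left-skew"

print(round(mean, 2), median, skew_hint(mean, median))
```

The comparison is a hint rather than a test; a histogram or density plot should confirm the shape.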
Histograms
From Histograms
What to look for: overall cluster pattern, suspected outliers, modality (uni-/bi-/multimodal), symmetry/skew.
R code
hist(stud_perf$G3, main="G3 Histogram", xlab="G3 scores")
library(lattice)
histogram(~G3 | Medu, data=stud_perf, type="density",
          main="G3 scores, by Medu levels", as.table=TRUE)
Python code
ax = stud_perf.G3.hist(grid=False)   # Series.hist returns a matplotlib Axes
ax.set_title('G3 histogram')
ax.set_xlabel('G3 scores');
stud_perf.G3.hist(by=stud_perf.Medu, figsize=(15,10),
                  density=True, layout=(2,3));
Conditioning on Medu gives a panel of histograms, which is useful for seeing whether G3 shifts with the mother’s education level.
Density plots
From Density Plots
Kernel density estimate (KDE): a smoothed alternative to a histogram.
Bandwidth controls smoothness: too small → spiky; too large → over-smoothed.
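To see the bandwidth effect concretely, here is a minimal sketch of a Gaussian KDE written out by hand, using made-up grades. The helper names kde_gaussian and count_peaks, the data, and the three bandwidths are all assumptions for illustration; lattice and pandas use their own implementations and default bandwidths.

```python
import math

def kde_gaussian(data, bw):
    """Return a function x -> estimated density, using a Gaussian kernel of bandwidth bw."""
    norm = 1.0 / (len(data) * bw * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bw) ** 2) for xi in data)
    return density

def count_peaks(values):
    """Count strict local maxima in a sequence of density values."""
    return sum(1 for a, b, c in zip(values, values[1:], values[2:]) if b > a and b > c)

# Hypothetical grades (not the real G3 column).
grades = [0, 8, 9, 10, 10, 11, 12, 13, 15, 18]
grid = [i * 0.5 for i in range(-10, 61)]   # evaluation points from -5 to 30

peaks = {}
for bw in (0.3, 1.5, 6.0):
    f = kde_gaussian(grades, bw)
    peaks[bw] = count_peaks([f(x) for x in grid])
    print(f"bw={bw}: {peaks[bw]} local maxima")
```

A small bandwidth puts a separate bump on nearly every observation, while a large one merges them into a single broad curve.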
R code
densityplot(~G3, groups=Medu, data=stud_perf, auto.key=TRUE,
            main="G3 scores, by Medu", bw=1.5)
Python code
import matplotlib.pyplot as plt
f, axs = plt.subplots(2, 3, squeeze=False, figsize=(15, 6))
out2 = stud_perf.groupby("Medu")
for i, (level, group) in enumerate(out2):
    ax = axs.flat[i]                    # one panel per Medu level
    group.G3.plot(kind='kde', ax=ax)
    ax.set_title(level)
In R/lattice: | makes separate panels; groups= overlays on a single panel.
Boxplots
From Boxplots
Boxplots give a skeletal summary of a distribution (Q1, median, Q3, whiskers, outliers) and are ideal for comparing across groups. Whisker reach: the most extreme observations lying between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR. Points outside this range are suspected outliers.
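As a worked illustration of the whisker rule, the sketch below computes the conventional 1.5 × IQR fences for a small made-up sample. The data are hypothetical, and the quartile method ("inclusive") is an assumption; R, lattice and pandas each use slightly different quartile conventions, so their fences may differ a little.

```python
import statistics

# Hypothetical grades (not the real G3 column).
grades = [0, 8, 9, 10, 10, 11, 12, 13, 15, 30]

# Quartiles by linear interpolation; other packages may use other conventions.
q1, q2, q3 = statistics.quantiles(grades, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations still inside the fences.
inside = [g for g in grades if lower_fence <= g <= upper_fence]
whiskers = (min(inside), max(inside))
outliers = [g for g in grades if g < lower_fence or g > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), whiskers={whiskers}, outliers={outliers}")
```

Here the fences are 4.0 and 18.0, so the whiskers end at 8 and 15 while 0 and 30 are flagged as suspected outliers.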
R code
bwplot(G3 ~ goout, horizontal = FALSE, main="G3 scores, by goout",
xlab="No. of times the student goes out per week",
       data=stud_perf)
Python code
stud_perf.plot.box(column='G3', by='goout',
                   xlabel='No. of times student goes out per week');
Kendall’s tau (Walc and Dalc)
Both Walc (weekend alcohol consumption) and Dalc (weekday alcohol consumption) are ordinal (1–5). Kendall’s tau is a suitable measure of association between two ordinal variables (the analogue of correlation for ordered categorical data).
In SAS Studio: Tasks > Tables > Table Analysis, select the two variables, and check the Kendall’s tau option. The output format is similar to R’s DescTools::Desc; see the treatment under the Job satisfaction dataset.
See For Ordinal Variables and Gamma Tau for theory.
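For readers working outside SAS, the following is a minimal sketch of Kendall’s tau-b computed from concordant and discordant pairs. The function name kendall_tau_b and the two short lists are assumptions for illustration (the values are made up, not the real Dalc and Walc columns), and the tau-b variant is chosen here because ordinal 1–5 scales produce many ties.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both variables: ignored
        elif dx == 0:
            ties_x += 1           # tied on x only
        elif dy == 0:
            ties_y += 1           # tied on y only
        elif dx * dy > 0:
            concordant += 1       # pair ordered the same way on both
        else:
            discordant += 1       # pair ordered in opposite ways
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical ordinal responses on a 1-5 scale.
dalc = [1, 1, 1, 2, 2, 3, 1, 4]
walc = [1, 2, 3, 2, 4, 4, 1, 5]
tau = kendall_tau_b(dalc, walc)
print(round(tau, 3))
```

Values near +1 mean students who drink more on weekdays also tend to drink more at weekends; values near 0 suggest no monotone association.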
Chi-squared test (address vs paid)
Tests independence of address (urban/rural) and paid (took paid classes or not). In SAS Studio: Tasks > Table Analysis, select one as column, the other as row, and enable “Chi-square statistics” under OPTIONS.
See Chi-Square & Fisher for the formula and Chi-squared Test for Independence for the theory.
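The test statistic can also be computed by hand, which makes the role of the expected counts explicit. The sketch below uses a made-up 2×2 table (NOT the real address by paid counts) and computes the Pearson statistic without a continuity correction; the p-value shortcut via erfc is valid only for 1 degree of freedom.

```python
import math

# Hypothetical counts: rows = address (U, R), columns = paid (no, yes).
observed = [[150, 150],
            [60, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2={chi2:.3f}, df={df}, p={p_value:.3f}")
```

Software output may differ slightly from this hand calculation if a continuity correction is applied to 2×2 tables.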
See also: L3 Exploring Quantitative Data · L6 Introduction to SAS · L4 Exploring Categorical Data