Student performance

Description

From L3 Exploring Quantitative Data

Student performance records from Portuguese secondary schools (UCI Machine Learning Repository). The running example in L3 Exploring Quantitative Data focuses on G3 (the final math grade). Other variables include:

Medu — mother’s education level (0–4, ordinal)
goout — frequency of going out with friends (1–5)
Walc, Dalc — weekend / weekday alcohol consumption
address, paid — categorical (urban/rural, tuition yes/no)

File: data/student/student-mat.csv (semicolon-delimited).

Numerical summaries

From Numerical Summaries

Basics: count, missing values, central tendency (mean/median), spread (sd, IQR, range).

R code

stud_perf <- read.table("data/student/student-mat.csv", sep=";",
                        header=TRUE)
summary(stud_perf$G3)
sum(is.na(stud_perf$G3))

Python code

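import pandas as pd

# The file is semicolon-delimited
stud_perf = pd.read_csv("data/student/student-mat.csv", delimiter=";")
stud_perf.G3.describe()      # count, mean, sd, quartiles, min/max
stud_perf.G3.isna().sum()    # number of missing values, as in the R code
# stud_perf.G3.info()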

Conditioning on an explanatory variable (mother’s education):

round(aggregate(G3 ~ Medu, data=stud_perf, FUN=summary), 2)
table(stud_perf$Medu)

stud_perf[['Medu', 'G3']].groupby('Medu').describe()

Interpretation notes:

mean ≈ median → roughly symmetric
mean > median → likely right-skew (a few large values pulling mean up)
mean < median → likely left-skew
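The rule of thumb above can be sketched in a few lines of Python. The helper name `skew_hint`, its tolerance `tol`, and the toy samples are all made up for illustration; none of the values come from student-mat.csv, and the mean/median comparison is only a rough indicator, not a formal test of skewness.

```python
import statistics

def skew_hint(values, tol=0.05):
    """Rough skew direction from mean vs. median.

    tol is an arbitrary tolerance (as a fraction of the standard
    deviation) below which mean and median count as "about equal".
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = statistics.pstdev(values)
    if spread == 0 or abs(mean - median) <= tol * spread:
        return "roughly symmetric"
    return "likely right-skew" if mean > median else "likely left-skew"

# Hypothetical toy samples (not taken from the dataset)
samples = {
    "symmetric": [8, 9, 10, 11, 12],
    "right_tail": [8, 9, 10, 11, 30],   # one large value pulls the mean up
    "left_tail": [0, 9, 10, 11, 12],    # one small value pulls the mean down
}

for name, sample in samples.items():
    print(name, statistics.mean(sample), statistics.median(sample),
          skew_hint(sample))
```

A histogram or density plot should still be checked before trusting the hint, since multimodal data can have equal mean and median without being symmetric.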

Histograms

From Histograms

What to look for: overall cluster pattern, suspected outliers, modality (uni-/bi-/multimodal), symmetry/skew.

R code

hist(stud_perf$G3, main="G3 Histogram", xlab="G3 scores")
 
library(lattice)
histogram(~G3 | Medu, data=stud_perf, type="density",
          main="G3 scores, by Medu levels", as.table=TRUE)

Python code

# Series.hist returns a matplotlib Axes, not a Figure
ax = stud_perf.G3.hist(grid=False)
ax.set_title('G3 histogram')
ax.set_xlabel('G3 scores');
 
stud_perf.G3.hist(by=stud_perf.Medu, figsize=(15,10),
                  density=True, layout=(2,3));

Conditioning on Medu gives a panel of histograms — useful to see whether G3 shifts with the mother’s education level.

Density plots

From Density Plots

Kernel density estimate (KDE) — a smoothed alternative to a histogram:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_{i}}{h}\right)$$

Bandwidth $h$ controls smoothness: too small → spiky; too large → over-smoothed.
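To make the formula and the role of $h$ concrete, here is a minimal sketch that evaluates the KDE directly with a Gaussian kernel. The kernel choice, the toy grades, the grid, and the three bandwidths are assumptions for illustration only; library implementations (R's `density`, pandas' `plot(kind='kde')`) select the bandwidth and grid differently.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Evaluate f_hat(x) = 1/(n h) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical toy grades (not taken from the dataset)
grades = [5, 6, 10, 10, 11, 12, 15]

step = 0.1
grid = [i * step for i in range(-200, 401)]   # from -20 to 40

areas, n_peaks = {}, {}
for h in (0.3, 1.5, 5.0):
    density = [kde(x, grades, h) for x in grid]
    # Riemann sum: a density should integrate to about 1
    areas[h] = sum(density) * step
    # Count interior local maxima as a crude measure of "spikiness"
    n_peaks[h] = sum(
        1 for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )
    print(f"h={h}: area={areas[h]:.3f}, local maxima={n_peaks[h]}")
```

A small bandwidth puts a separate bump on nearly every distinct observation, while a large one merges them into a single broad hump; the area under the curve stays close to 1 in both cases.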

R code

densityplot(~G3, groups=Medu, data=stud_perf, auto.key=TRUE,
            main="G3 scores, by Medu", bw=1.5)

Python code

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 3, squeeze=False, figsize=(15, 6))

# One KDE panel per Medu level; any unused panel stays empty
for ax, (level, grp) in zip(axs.ravel(), stud_perf.groupby("Medu")):
    grp.G3.plot(kind='kde', ax=ax)
    ax.set_title(f"Medu = {level}")

In R/lattice: | makes separate panels; groups= overlays on a single panel.

Boxplots

From Boxplots

Boxplots give a skeletal summary of a distribution (Q1, median, Q3, whiskers, outliers) and are ideal for comparing across groups. The whiskers extend to the most extreme observations that lie within the fences $Q_{1} - 1.5 \cdot IQR$ and $Q_{3} + 1.5 \cdot IQR$, where $IQR = Q_{3} - Q_{1}$. Points outside the fences are suspected outliers.
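The 1.5 × IQR rule can be worked through by hand. The sketch below uses hypothetical grades (not from student-mat.csv) and the helper name `tukey_fences` is made up for this note. Quartiles are computed with `statistics.quantiles(..., method="inclusive")`; other software may use a different quartile convention, so the fences can differ slightly from what a given plotting function draws.

```python
import statistics

def tukey_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical toy grades (not taken from the dataset)
grades = [0, 8, 9, 10, 10, 11, 12, 13, 14, 19]

lower, upper = tukey_fences(grades)
inside = [g for g in grades if lower <= g <= upper]
outliers = [g for g in grades if g < lower or g > upper]

# Whiskers stop at the most extreme observations inside the fences
whiskers = (min(inside), max(inside))

print("fences:", (lower, upper))
print("whisker ends:", whiskers)
print("suspected outliers:", outliers)
```

Note that the whisker ends are actual data values, so they usually sit inside the fences rather than exactly on them.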

R code

# goout is a 1-5 rating (very low to very high), not a weekly count
bwplot(G3 ~ goout, horizontal = FALSE, main="G3 scores, by goout",
       xlab="Going out with friends (1 = very low, 5 = very high)",
       data=stud_perf)

Python code

# goout is a 1-5 rating (very low to very high), not a weekly count
stud_perf.plot.box(column='G3', by='goout',
                   xlabel='Going out with friends (1 = very low, 5 = very high)');

Kendall’s tau (Walc and Dalc)

From Example 6.10 (Kendall’s Tau for Walc and Dalc)

Both Walc (weekend alcohol consumption) and Dalc (weekday alcohol consumption) are ordinal (1–5), so Kendall’s $τ_{b}$, which adjusts for ties, is an appropriate measure of association (the analogue of a correlation coefficient for ordered categorical variables).

In SAS Studio: Tasks > Tables > Table Analysis, select the two variables, and check the Kendall’s $τ_{b}$ option. The output format is similar to R’s DescTools::Desc — see the treatment under the Job satisfaction dataset.

See For Ordinal Variables and Gamma Tau for theory.
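For readers without SAS, the statistic itself is easy to compute by hand. The following is a minimal sketch of $τ_{b}$ from pair counts, using hypothetical ratings rather than the real Walc/Dalc columns; the function name `kendall_tau_b` is made up for this note. `scipy.stats.kendalltau` also computes the tau-b variant by default and would be the practical choice on the full dataset.

```python
import math
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - n1) * (n0 - n2)).

    C, D = concordant / discordant pairs, n0 = total number of pairs,
    n1, n2 = pairs tied on x / tied on y.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom

# Hypothetical 1-5 ratings for six students (not taken from the dataset)
walc = [1, 2, 2, 3, 4, 5]
dalc = [1, 1, 2, 1, 2, 3]

tau = kendall_tau_b(walc, dalc)
print(f"tau_b = {tau:.3f}")
```

The double loop over pairs is quadratic in the number of observations, which is fine for a few hundred students but is the main reason to prefer a library routine on larger data.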

Chi-squared test (address vs paid)

From Example 6.9 (Chi-square Test for Independence)

Tests independence of address (urban/rural) and paid (took paid classes or not). In SAS Studio: Tasks > Table Analysis, select one as column, the other as row, and enable “Chi-square statistics” under OPTIONS.

See Chi-Square & Fisher for the formula and Chi-squared Test for Independence for the theory.
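As a rough illustration of what the test computes, the sketch below builds the expected counts under independence for a 2×2 table and derives the Pearson statistic and its p-value. The counts are hypothetical, not the real address-by-paid table, and the function name is made up for this note. No continuity correction is applied, so the result may differ from software that applies one by default for 2×2 tables. The p-value shortcut via `math.erfc` is valid only for 1 degree of freedom.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.

    Returns (statistic, p_value, expected_counts). No continuity
    correction; the p-value formula assumes df = 1.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # For df = 1, P(chi2 > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# Hypothetical counts: rows = address (urban, rural), cols = paid (yes, no)
observed = [[30, 20],
            [10, 40]]

stat, p_value, expected = chi2_independence_2x2(observed)
print("expected counts:", expected)
print(f"chi2 = {stat:.3f}, p = {p_value:.2g}")
```

On the real data, the contingency table would come from something like `pd.crosstab(stud_perf.address, stud_perf.paid)`, and the expected counts should be checked before relying on the chi-squared approximation.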
