Python: Pandas + NumPy

Pandas handles tabular data (DataFrames/Series). NumPy handles numerical arrays. They work together — pandas is built on top of NumPy.


Loading & Inspecting

| Function | Use Case |
|---|---|
| `pd.read_csv("file.csv")` | Load CSV into a DataFrame |
| `pd.read_excel("file.xlsx")` | Load Excel file |
| `df.to_csv("out.csv", index=False)` | Export DataFrame to CSV |
| `df.head(n)` / `df.tail(n)` | Preview first/last n rows |
| `df.shape` | Returns a (rows, columns) tuple |
| `df.dtypes` | Data types of each column |
| `df.describe()` | Summary stats for numeric columns |
| `df.info()` | Column names, types, non-null counts |
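A quick sketch of the load-and-inspect workflow. The column names and values here are made up, and `io.StringIO` stands in for a real file path so the snippet is self-contained:

```python
import io
import pandas as pd

# Made-up CSV contents; in practice you would pass a file path to pd.read_csv.
csv_data = io.StringIO("name,age,salary\nAda,36,52000\nGrace,41,60000\nAlan,29,48000")

df = pd.read_csv(csv_data)
print(df.shape)     # (3, 3): 3 rows, 3 columns
print(df.dtypes)    # name is object; age and salary are int64
print(df.head(2))   # first two rows
```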

Selecting & Filtering

| Function | Use Case |
|---|---|
| `df["col"]` or `df[["col1","col2"]]` | Select column(s) |
| `df.loc[rows, cols]` | Label-based selection (inclusive at both ends) |
| `df.iloc[rows, cols]` | Position-based selection (exclusive end) |
| `df[df["col"] > 5]` | Filter rows by condition |
| `df.query("col > 5 and col2 == 'A'")` | Filter using a query string |
| `df["col"].isin([1, 2, 3])` | Filter by membership |
```python
# loc vs iloc
df.loc[0:4, "age"]   # rows labelled 0–4 (inclusive), column "age"
df.iloc[0:4, 2]      # rows 0–3 (exclusive end), column at position 2
```

Cleaning & Handling Missing Data

| Function | Use Case |
|---|---|
| `df.isnull().sum()` | Count missing values per column |
| `df.dropna()` | Remove rows with any NaN |
| `df.fillna(value)` | Replace NaN with a value |
| `df.drop_duplicates()` | Remove duplicate rows |
| `df.rename(columns={"old": "new"})` | Rename columns |
| `df["col"].astype(int)` | Change the data type of a column |
| `pd.to_datetime(df["col"])` | Convert a column to datetime |
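A minimal cleaning pass over a toy frame (all values invented) that strings several of these together: count the missing values, drop a duplicate row, fill the remaining NaN, and fix the types:

```python
import pandas as pd

# Toy frame with one missing value, one duplicate row, and dates stored as text.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", None],
    "temp": ["3", "3", "7", "5"],
    "day": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03"],
})

print(df.isnull().sum())                # city has 1 missing value
df = df.drop_duplicates()               # drops the repeated Oslo row
df["city"] = df["city"].fillna("Unknown")
df["temp"] = df["temp"].astype(int)     # text -> integers
df["day"] = pd.to_datetime(df["day"])   # text -> datetime64
```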

Creating & Transforming Columns

| Function | Use Case |
|---|---|
| `df["new"] = df["a"] + df["b"]` | Create a new column from arithmetic |
| `df["col"].apply(func)` | Apply a function to every element |
| `df["col"].map({old: new})` | Map/replace values using a dict |
| `df["col"].str.lower()` | String operations on a column |
| `pd.cut(df["col"], bins=3)` | Bin continuous values into categories |
```python
# apply with lambda
df["income_k"] = df["income"].apply(lambda x: x / 1000)
```
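`map`, the `.str` accessor, and `pd.cut` are easiest to see side by side. A small sketch with made-up grade codes and scores:

```python
import pandas as pd

df = pd.DataFrame({"grade_code": ["a", "b", "a"], "score": [15, 55, 95]})

# map: replace values via a dict (values missing from the dict become NaN)
df["result"] = df["grade_code"].map({"a": "pass", "b": "fail"})

# str accessor: vectorised string operations
df["grade_code"] = df["grade_code"].str.upper()

# cut: bin a continuous column into 3 equal-width intervals with labels
df["band"] = pd.cut(df["score"], bins=3, labels=["low", "mid", "high"])
```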

Sorting & Ranking

| Function | Use Case |
|---|---|
| `df.sort_values("col")` | Sort by column (ascending by default) |
| `df.sort_values("col", ascending=False)` | Sort descending |
| `df["col"].rank()` | Rank values within a column |
| `df["col"].value_counts()` | Frequency count of unique values |
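A short sketch of sorting, counting, and ranking on invented scores (note how `rank(method="min")` handles the tie):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace", "Alan"], "score": [88, 95, 88]})

top = df.sort_values("score", ascending=False)   # Grace's 95 comes first
counts = df["score"].value_counts()              # 88 appears twice
df["rank"] = df["score"].rank(method="min")      # tied values share the lowest rank
```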

Grouping & Aggregation

| Function | Use Case |
|---|---|
| `df.groupby("col").mean()` | Group by a column and take the mean |
| `df.groupby("col").agg({"a": "sum", "b": "mean"})` | Multiple aggregations |
| `df.groupby("col").size()` | Count rows per group |
| `df.pivot_table(index="col1", values="col2", aggfunc="mean")` | Excel-style pivot table |
| `pd.crosstab(df["col1"], df["col2"])` | Cross-tabulation frequency table |
```python
# groupby + named aggregations
df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("salary", "count")
)
```

Combining DataFrames

| Function | Use Case |
|---|---|
| `pd.concat([df1, df2])` | Stack DataFrames vertically |
| `pd.concat([df1, df2], axis=1)` | Join DataFrames side by side |
| `df1.merge(df2, on="key")` | Inner join on a column |
| `df1.merge(df2, on="key", how="left")` | Left join |
| `df1.merge(df2, on="key", how="outer")` | Full outer join |
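A sketch of how the join types differ, using two made-up lookup tables (employee 3's department id has no match, so it drops out of the inner join but survives the left join with NaN):

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [10, 20, 99]})
depts = pd.DataFrame({"dept_id": [10, 20], "dept": ["Sales", "Ops"]})

inner = employees.merge(depts, on="dept_id")              # 2 rows: 99 has no match
left = employees.merge(depts, on="dept_id", how="left")   # 3 rows: dept is NaN for 99
stacked = pd.concat([employees, employees])               # 6 rows stacked vertically
```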

Reshaping

| Function | Use Case |
|---|---|
| `df.melt(id_vars=["id"], value_vars=["a","b"])` | Wide → Long format |
| `df.pivot(index="id", columns="var", values="val")` | Long → Wide format |
| `df.stack()` / `df.unstack()` | Rotate columns to rows / rows to columns |
| `df.T` | Transpose rows and columns |
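`melt` and `pivot` are inverses, which is easiest to see in a round trip. A sketch with invented monthly sales:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "jan": [100, 110], "feb": [90, 120]})

# Wide -> long: one row per (id, month) pair
long = wide.melt(id_vars=["id"], value_vars=["jan", "feb"],
                 var_name="month", value_name="sales")

# Long -> wide: back to one column per month (id becomes the index)
back = long.pivot(index="id", columns="month", values="sales")
```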

NumPy: Core Functions

NumPy operates on arrays rather than DataFrames, and is used heavily for numerical computation and for passing data to ML models.

| Function | Use Case |
|---|---|
| `np.array([1, 2, 3])` | Create a 1D array |
| `np.zeros((3,3))` / `np.ones((3,3))` | Create arrays of 0s or 1s |
| `np.arange(start, stop, step)` | Evenly spaced values (like Python's `range`) |
| `np.linspace(start, stop, n)` | n evenly spaced values between start and stop |
| `arr.reshape(rows, cols)` | Reshape an array without changing its data |
| `arr.ravel()` / `arr.flatten()` | Flatten to 1D |
| `np.concatenate([a, b], axis=0)` | Join arrays along an axis |
| `np.where(condition, val_if_true, val_if_false)` | Conditional element replacement |
| `np.sort(arr)` | Sort array values |
| `np.argsort(arr)` | Return the indices that would sort the array |
| `arr[arr > 5]` | Boolean indexing / filtering |
| `np.unique(arr)` | Unique values |
| `np.nan` / `np.isnan(arr)` | Handle NaN values |
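A few of these combined in one quick sketch with made-up values: reshape, flatten, conditional replacement, and boolean indexing:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)             # [[0, 1, 2], [3, 4, 5]]
flat = arr.ravel()                           # back to 1D: [0, 1, 2, 3, 4, 5]

clipped = np.where(flat > 3, 3, flat)        # cap every value at 3
big = flat[flat > 3]                         # boolean indexing: [4, 5]
order = np.argsort(np.array([30, 10, 20]))   # indices that would sort: [1, 2, 0]
```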

Stats functions (all accept an `axis=` argument):

| Function | Use Case |
|---|---|
| `np.mean(arr)` / `np.median(arr)` | Mean / median |
| `np.std(arr)` / `np.var(arr)` | Standard deviation / variance |
| `np.sum(arr)` / `np.cumsum(arr)` | Sum / cumulative sum |
| `np.min(arr)` / `np.max(arr)` | Min / max values |
| `np.percentile(arr, 75)` | Percentile |
```python
# axis= controls direction
a = np.array([[1, 2], [3, 4]])
np.sum(a, axis=0)   # sum down columns → [4, 6]
np.sum(a, axis=1)   # sum across rows → [3, 7]
```

Pandas ↔ NumPy conversion:

```python
arr = df["col"].to_numpy()                 # Series → NumPy array
arr = df.to_numpy()                        # DataFrame → NumPy array
df = pd.DataFrame(arr, columns=["a","b"])  # NumPy → DataFrame
```

R: Data Manipulation

The dominant workflow is the tidyverse (dplyr + tidyr). Base R functions also appear frequently.

```r
library(tidyverse)  # loads dplyr, tidyr, readr, ggplot2, etc.
```

The pipe operator `|>` (or magrittr's `%>%`) chains operations: the output of the expression on the left becomes the first argument of the function on the right.


Loading & Inspecting

| Function | Use Case |
|---|---|
| `read_csv("file.csv")` | Load CSV (tidyverse; faster than base R) |
| `read.csv("file.csv")` | Load CSV (base R) |
| `head(df, n)` / `tail(df, n)` | Preview first/last n rows |
| `glimpse(df)` | Compact column overview (tidyverse) |
| `str(df)` | Structure of an object (base R) |
| `summary(df)` | Summary stats for all columns |
| `dim(df)` | Rows and columns |
| `names(df)` / `colnames(df)` | Column names |

dplyr: The Core Verbs

These six functions handle the vast majority of manipulation tasks.

| Function | Use Case |
|---|---|
| `filter(df, col > 5)` | Keep rows meeting a condition |
| `select(df, col1, col2)` | Keep specific columns |
| `mutate(df, new_col = col1 + col2)` | Add or transform columns |
| `arrange(df, col)` | Sort rows (use `desc(col)` to reverse) |
| `summarise(df, mean_val = mean(col))` | Collapse to summary values |
| `group_by(df, col)` | Group for per-group operations |
```r
# Chaining with the pipe
df |>
  filter(age > 18) |>
  group_by(department) |>
  summarise(avg_salary = mean(salary),
            n = n())
```

Selecting Columns (select helpers)

```r
select(df, starts_with("sales_"))   # columns starting with "sales_"
select(df, contains("date"))        # columns containing "date"
select(df, where(is.numeric))       # all numeric columns
select(df, -col_to_drop)            # all except one column
```

Other Useful dplyr Functions

| Function | Use Case |
|---|---|
| `rename(df, new = old)` | Rename a column |
| `distinct(df, col)` | Unique rows (optionally by column) |
| `slice(df, 1:10)` | Select rows by position |
| `slice_sample(df, n = 50)` | Random sample of rows |
| `count(df, col)` | Frequency count per group |
| `n()` | Count rows (used inside `summarise`) |
| `case_when(...)` | Vectorised if-else (inside `mutate`) |
| `coalesce(col1, col2)` | First non-NA value across columns |
```r
# case_when example
df |> mutate(grade = case_when(
  score >= 90 ~ "A",
  score >= 70 ~ "B",
  TRUE        ~ "C"
))
```

Joining Tables

| Function | Use Case |
|---|---|
| `inner_join(df1, df2, by = "key")` | Only matching rows |
| `left_join(df1, df2, by = "key")` | All rows from df1, NAs for non-matches |
| `right_join(df1, df2, by = "key")` | All rows from df2 |
| `full_join(df1, df2, by = "key")` | All rows from both |
| `anti_join(df1, df2, by = "key")` | Rows in df1 NOT in df2 |
| `bind_rows(df1, df2)` | Stack data frames vertically |
| `bind_cols(df1, df2)` | Join data frames side by side |

tidyr: Reshaping Data

| Function | Use Case |
|---|---|
| `pivot_longer(df, cols, names_to, values_to)` | Wide → Long format |
| `pivot_wider(df, names_from, values_from)` | Long → Wide format |
| `separate(df, col, into = c("a","b"), sep = "-")` | Split one column into two |
| `unite(df, "new_col", col1, col2, sep = "_")` | Combine two columns into one |
| `drop_na(df)` / `drop_na(df, col)` | Remove rows with NAs |
| `replace_na(df, list(col = 0))` | Replace NAs with a value |
| `fill(df, col)` | Fill NAs with the previous/next value |
```r
# Wide to long
df |> pivot_longer(
  cols = c(jan, feb, mar),
  names_to = "month",
  values_to = "sales"
)
```

Base R: Still Commonly Used

| Function | Use Case |
|---|---|
| `df[rows, cols]` | Indexing, e.g. `df[1:5, c("a","b")]` |
| `df$col` | Access a single column |
| `subset(df, col > 5)` | Filter rows |
| `which(condition)` | Indices where the condition is TRUE |
| `table(df$col)` | Frequency table |
| `tapply(x, group, fun)` | Apply a function by group |
| `apply(m, margin, fun)` | Apply a function over rows (1) or columns (2) |
| `is.na(x)` / `na.omit(x)` | Check for / remove NAs |
| `order(df$col)` | Get sort-order indices |
| `paste(a, b, sep = "-")` | Concatenate strings |

Quick Comparison: Python vs R

| Task | Python (pandas) | R (dplyr) |
|---|---|---|
| Filter rows | `df[df["age"] > 18]` | `filter(df, age > 18)` |
| Select columns | `df[["a", "b"]]` | `select(df, a, b)` |
| Add column | `df["new"] = df["a"] * 2` | `mutate(df, new = a * 2)` |
| Sort | `df.sort_values("col")` | `arrange(df, col)` |
| Group + summarise | `df.groupby("g").mean()` | `group_by(df, g) \|> summarise(mean(x))` |
| Wide → Long | `df.melt(...)` | `pivot_longer(...)` |
| Long → Wide | `df.pivot(...)` | `pivot_wider(...)` |
| Join | `df1.merge(df2, on="key")` | `inner_join(df1, df2, by="key")` |