Discussion 1. Prob & Stats Review

STSCI/INFO/ILRST 3900: Causal Inference

August 27, 2025

To execute these simulations locally, download the .Rmd here

Announcements

  • Office Hours throughout the week (see Syllabus or website)
    • Filippo: Thursday 4-5 pm in 321A Computing and Information Science Building
    • Shira: Monday 5-6 pm in 329A Computing and Information Science Building

Probability and Statistics Review

  • Expectation
  • Variance
  • Conditional Expectation
  • Independence
  • Bernoulli Random Variables
  • Law of Total Expectation
  • Confidence Intervals
  • Regression (OLS, logistic)

1. Expectation

(Expected Value, Population Mean, Average)
  • Notation: \(E(X), \mu\)
  • The expected value of a random variable taking finitely many values \[\mu=E(X) := \sum_{i=1}^N x_i P(x_i) \quad \text{where} \quad P(x_i):=\text{prob}(X=x_i)\]
  • The expected value of a random variable taking countably many values, i.e. the (long-run) average \[E(X):= \sum_{i=1}^\infty x_i P(x_i)\]
  • For \(N\) independent and identically distributed (i.i.d.) random variables \(X_1,\ldots,X_N\), the sample mean is \[\bar X = \frac{1}{N}\sum_{i=1}^N X_i\]
  • Law of Large Numbers (LLN): the sample mean converges to the expected value (population mean) as \(N \to \infty\)
  • Example: \(X_i\) are random draws from \(\mathcal{N}(2,5)\) (a Normal r.v. with mean 2, variance 5)
# Packages used throughout this handout
library(ggplot2)
library(tibble)

true_mean <- 2
true_var <- 5

# Plot the population density with the population mean marked in red
x <- seq(-8, 12, length.out = 1000)
y <- dnorm(x, mean = true_mean, sd = sqrt(true_var))

ggplot() +
  geom_line(aes(x, y)) +
  geom_vline(xintercept = true_mean, color = "red") +
  theme_bw() +
  labs(y = "Density")
sample_seq <- 1:3000
means <- numeric(length(sample_seq))
vars <- numeric(length(sample_seq))


# For each sample size n, draw n i.i.d. observations and record the
# sample mean and the (1/N-denominator) sample variance
for (i in seq_along(sample_seq)) {
  n <- sample_seq[i]
  data <- rnorm(n = n, mean = true_mean, sd = sqrt(true_var))
  
  sample_mean <- mean(data)
  sample_var <- sum((data - sample_mean)^2) / length(data)
  
  means[i] <- sample_mean
  vars[i] <- sample_var
}


means <- tibble("N" = sample_seq, "Sample Mean" = means)
vars <- tibble("N" = sample_seq, "Sample Variance" = vars)

colors <- c("Sample Mean" = "lightblue", "Population Mean" = "red")

ggplot(means, aes(y = `Sample Mean`, x = N)) +
  geom_line(color = colors["Sample Mean"]) +
  geom_hline(yintercept = true_mean, color = colors["Population Mean"]) +
  theme_bw()

2. Variance

Describes the spread of the data
  • Notation: \(V(X), Var(X),\sigma^2\)
  • Variance is the average of the squared differences from the mean
  • For a random variable \(X\) with expected value \(\mu:=E(X)\), the variance is \[\sigma^2 = Var(X) := E\Big[\big(X-\mu\big)^2\Big] = E\big[X^2\big] - \mu^2\] More explicitly, \[Var(X) = \sum_{i=1}^N P(x_i)\cdot (x_i-\mu)^2 \quad \text{where} \quad P(x_i):=\text{prob}(X=x_i)\]

3. Sample (Empirical) Variance

For a finite dataset or finite sample
  • In practice, you can compute the variance of a finite dataset as \[\hat\sigma^2 = \Big(\frac{1}{N}\sum_{i=1}^N x_i^2\Big)-\bar{X}^2 \quad \text{where} \quad \bar{X} := \frac{1}{N}\sum_{i=1}^N x_i\]
  • You don’t need to memorize the formula, just be aware of it
  • You’ll likely never compute it this way by hand; use an R function instead
  • Example: \(X_i\) are random draws from \(\mathcal{N}(2,5)\) (a Normal r.v. with mean 2, variance 5)
ggplot(vars, aes(y = `Sample Variance`, x = N)) +
  geom_line(color = "lightblue") +
  geom_hline(yintercept = true_var, color = "red") +
  theme_bw()
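Note that R’s built-in var() uses the unbiased \(1/(N-1)\) denominator rather than the \(1/N\) formula above; a quick check (a sketch, with an arbitrary seed and sample size) shows the two agree closely for large \(N\):

set.seed(3900)
data <- rnorm(1000, mean = true_mean, sd = sqrt(true_var))

# 1/N formula from above
mean(data^2) - mean(data)^2
# R's var() divides by N - 1 instead; nearly identical for N = 1000
var(data)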

4. Conditional Expectation

  • Notation: \(E(X|Y)\)
  • The expected value given a set of “conditions”
  • Read as “the expectation of \(X\) given (or conditioned on) \(Y\)” \[E(X|Y) = \sum_{i=1}^n x_i \cdot P(X=x_i | Y) \quad \text{where} \quad P(X=x_i|Y) = \frac{P(X=x_i \text{ and } Y)}{P(Y)}\]
  • Example: Roll a fair die
    • Let \(A=1\) if you roll an even number, \(0\) otherwise
    • Let \(B=1\) if you roll a prime number, \(0\) otherwise
    • Then \[E[A] = \sum_{i=1}^6 a_i\cdot P(a_i) = \frac{0+1+0+1+0+1}{6} = \frac{1}{2}\] and the conditional expectation of \(A\) given \(B=1\) (i.e. we rolled 2, 3, or 5) \[E[A | B=1]= \sum_{i=1}^3 a_i\cdot P(a_i|B=1) = \frac{1 + 0 + 0}{3}= \frac{1}{3}\]
  • Visualization in R for \(E(X)=25\), \(E[X| \text{group 1}] = 20\), \(E[X| \text{group 2}] = 30\)
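The figure itself isn’t reproduced here; below is a minimal sketch of such a visualization, simulating two equal-sized groups with means 20 and 30 (the within-group spread of 3 and the seed are arbitrary choices), so that the overall mean is 25:

set.seed(3900)
n <- 1000
df <- tibble(
  group = rep(c("group 1", "group 2"), each = n),
  x = c(rnorm(n, mean = 20, sd = 3), rnorm(n, mean = 30, sd = 3))
)

ggplot(df, aes(x = x, fill = group)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = mean(df$x), color = "red") +  # E(X) = 25
  geom_vline(xintercept = 20, linetype = "dashed") +    # E(X | group 1) = 20
  geom_vline(xintercept = 30, linetype = "dashed") +    # E(X | group 2) = 30
  theme_bw()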

5. Independence

  • Notation: \(\perp, \ X \perp Y\)
  • Two random variables are independent if the outcome of one does not give any information about the outcome of the other
  • Events \(A\) and \(B\) are independent if \(P(A \text{ and } B) =P(A \cap B) = P(A)P(B)\)
  • Recall: \(P(A \cap B) = P(A | B)P(B)\)
  • If \(A \perp B\) , then \(P(A|B)=P(A) \text{ and } P(B|A)=P(B)\)
  • Example:
    • Suppose you roll two fair dice. Let \(A\) be the value of the first die and let \(B\) be the value of the second die.
    • If I say that \(A=3\), does that give you any info about what the value of \(B\) is?
    • We can show that the events \(A=3\) and \(B=3\) are independent: \[\begin{align*} P(\{A=3\} \cap \{B=3\}) &= P(\{A=3\} | \{B=3\})\cdot P(\{B=3\}) \\ &= \frac{1}{6} \cdot \frac{1}{6} \\ &= P(\{A=3\}) \cdot P(\{B=3\}) \end{align*}\]
    • To show \(A \perp B\), you would show this holds for all values of \(A\) and \(B\)
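To see this numerically, one can simulate many rolls and compare the joint probability with the product of the marginals (a sketch; the number of rolls and the seed are arbitrary):

set.seed(3900)
n_rolls <- 1e5
A <- sample(1:6, n_rolls, replace = TRUE)  # first die
B <- sample(1:6, n_rolls, replace = TRUE)  # second die

mean(A == 3 & B == 3)        # P(A = 3 and B = 3), approx 1/36
mean(A == 3) * mean(B == 3)  # P(A = 3) * P(B = 3), approx 1/36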

6. Bernoulli Random Variables

A binary/dichotomous random variable
  • Notation: \(B(p), \text{Bernoulli}(p), \mathcal{B}(p)\)
  • Takes the value \(1\) with probability (w.p.) \(p\), and the value \(0\) w.p. \(q:=1-p\)
  • Let \(X \sim B(p)\):
    • “Let \(X\) be a Bernoulli random variable with mean \(p\)”
    • \(E(X) = p \text{ and } Var(X) = p(1-p) = pq\)
  • Cool fact: \(E(X) = P(X=1) = p\)
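A quick simulation check of these facts (a sketch; \(p = 0.3\) and the seed are arbitrary choices):

set.seed(3900)
p <- 0.3
x <- rbinom(1e5, size = 1, prob = p)  # Bernoulli(p) draws

mean(x)       # approx E(X) = p
var(x)        # approx Var(X) = p(1 - p)
mean(x == 1)  # approx P(X = 1) = p, the "cool fact"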

7. Law of Total Expectation

(i.e. law of iterated expectations, tower rule)
  • Useful property (or “trick”) that will be used in class \[E(X) = E\big(E(X|Y)\big) \]
  • Don’t worry too much about the technical details, just add to your toolbox :)
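For intuition, the die example from Section 4 verifies the identity: averaging \(E(A \mid B)\) over the distribution of \(B\) recovers \(E(A)\).

# A = 1 if the face is even, B = 1 if the face is prime
faces <- 1:6
A <- as.numeric(faces %% 2 == 0)
B <- as.numeric(faces %in% c(2, 3, 5))

mean(A)  # E(A) = 1/2

# E(E(A|B)) = E(A|B=1) P(B=1) + E(A|B=0) P(B=0) = 1/3 * 1/2 + 2/3 * 1/2 = 1/2
mean(A[B == 1]) * mean(B == 1) + mean(A[B == 0]) * mean(B == 0)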

8. Confidence Intervals

  • An interval of values that contains the true parameter with probability \(1-\alpha\) (over repeated sampling)
  • Define \(CI=[L,U]\); then \(P(L \leq \mu \leq U)= 1-\alpha\)
  • Usually \(1-\alpha\) is \(95\%\) or \(99\%\)
  • Example: \(X_i\) are random draws from \(\mathcal{N}(2,5)\)
  • Estimate the expectation of a random variable using the sample mean: \[\hat E(X)=\hat\mu= \bar X =\frac{1}{N}\sum_{i=1}^N X_i\]
  • \(\bar X\) is an estimate for \(\mu\) with some uncertainty
  • Choose \(c\) such that \(P(\mu \leq \bar X -c)=P(\mu \geq \bar X +c)=\frac{\alpha}{2}\)
  • Standardizing, \(P(\mu \geq \bar X + c) = P\left(\frac{\bar X-\mu}{\sigma/\sqrt{N}}\leq \frac{-c}{\sigma/\sqrt{N}}\right)=\frac{\alpha}{2} \quad \Rightarrow \quad -c=Z_\frac{\alpha}{2}\frac{\sigma}{\sqrt{N}}\)
  • \(Z_\frac{\alpha}{2}\) is the critical value of the standard Normal distribution (for example, in R: \(\texttt{qnorm(0.025)}\) for \(\alpha = 0.05\))
  • \(CI= \bar X \pm Z_\frac{\alpha}{2} \frac{\sigma}{\sqrt{N}}\)
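Putting the formula to work on the running \(\mathcal{N}(2,5)\) example (a sketch: it treats \(\sigma^2 = 5\) as known, and \(N = 500\) and the seed are arbitrary choices):

set.seed(3900)
N <- 500
x <- rnorm(N, mean = true_mean, sd = sqrt(true_var))

alpha <- 0.05
z <- qnorm(alpha / 2)  # Z_{alpha/2}, approx -1.96

# 95% CI: X-bar -/+ |z| * sigma / sqrt(N); covers mu = 2 about 95% of the time
c(mean(x) + z * sqrt(true_var) / sqrt(N),
  mean(x) - z * sqrt(true_var) / sqrt(N))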

9. Regression

  • Estimates the relationship between \(X\) and \(Y\), where
  • \(Y\) is the dependent variable (outcome/response)
  • \(X\) is the independent variable (regressor/explanatory variable)
  • Main types of regression: linear and logistic
9.1. Linear Regression
  • Assume data was generated: \(Y_i=\alpha+\beta X_i+\varepsilon_i\) for \(i=1,\ldots,N\)
  • \(\alpha, \beta\) are the coefficients where \(\alpha\) is the intercept and \(\beta\) the slope
  • Use ordinary least squares (OLS) to estimate \(\hat Y_i=\hat\alpha+\hat\beta X_i\)
  • OLS minimizes the sum of squared errors (SSE): \((\hat \alpha,\hat \beta)=\mathrm{argmin}_{a,b} \sum_{i=1}^N\big(Y_i-(a+bX_i)\big)^2\)
  • Setting the first derivative to zero: \(\frac{\partial}{\partial a} SSE = \sum_{i=1}^N -2(Y_i-a-bX_i) = 0 \qquad \Rightarrow \qquad \hat \alpha=\bar Y-\hat \beta \bar X\)
  • Substituting \(\hat\alpha\) into the second first-order condition: \[\frac{\partial}{\partial b} SSE = \sum_{i=1}^N -2\big(Y_i-(\bar Y-b\bar X) -bX_i\big)X_i = \sum_{i=1}^N -2\big[(Y_i-\bar Y)X_i-b(X_i-\bar X)X_i \big] = 0\] \[\Rightarrow \quad \hat \beta=\frac{\sum_{i=1}^N (Y_i-\bar Y)(X_i-\bar X)}{\sum_{i=1}^N (X_i-\bar X)^2}\]
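A sketch verifying that this closed-form solution matches R’s lm() (the true coefficients \(\alpha = 1\), \(\beta = 2\), the noise scale, and the seed are arbitrary choices):

set.seed(3900)
N <- 200
X <- rnorm(N)
Y <- 1 + 2 * X + rnorm(N)  # alpha = 1, beta = 2

# Closed-form OLS estimates from the derivation above
beta_hat  <- sum((Y - mean(Y)) * (X - mean(X))) / sum((X - mean(X))^2)
alpha_hat <- mean(Y) - beta_hat * mean(X)

c(alpha_hat, beta_hat)
coef(lm(Y ~ X))  # same values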
9.2. Logistic Regression
  • \(Y_i\)- the outcome variable is binary for \(i=1,\ldots,N\)
  • Estimate \(P(Y_i=1):=p_i\) by passing the linear predictor through a function mapping \(\mathbb{R} \to (0,1)\) (the inverse link)
    • Most common: the logistic function \(\sigma(t)=\frac{1}{1+e^{-t}}\)
  • In a linear model we estimate \(\hat Y_i=\hat\alpha+\hat\beta X_i\)
  • In logistic model we estimate \(\hat p_i= \frac{1}{1+e^{-(\hat\alpha+\hat\beta X_i)}}\)
  • Equivalently, \(\alpha+\beta X_i= \ln\left(\frac{ p_i}{1- p_i}\right)\), the log-odds (logit)
  • Odds: \(\frac{ p_i}{1- p_i}=\frac{P(Y_i=1)}{P(Y_i=0)}\)
  • For example: \(\frac{P(\text{Passing exam})}{P(\text{Not passing})}=\frac{3/4}{1/4}\), i.e. odds of \(3:1\)
  • To estimate \(\hat \alpha, \hat \beta\) we use maximum likelihood estimation (MLE)
  • Likelihood function: \(L(a,b;y)= \prod_{i=1}^N P(Y_i=y_i)= \prod_{i=1}^Np_i^{y_i}(1-p_i)^{(1-y_i)}\)
  • Log likelihood: \(l(a,b;y)= \sum_{i=1}^N y_i\ln(p_i)+(1-y_i)\ln(1-p_i)=\sum_{i=1}^N \ln(1-p_i)+y_i \ln\left(\frac{p_i}{1-p_i}\right)\)
  • To find MLE we solve \(\frac{\partial}{\partial (a,b)}l(a,b;y)=0\)
  • There is no closed-form solution, so we use an iterative method such as gradient descent or Newton–Raphson
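In R, glm() with family = binomial fits this model (internally via iteratively reweighted least squares, a Newton-type method). A sketch with arbitrary true coefficients \(\alpha = -0.5\), \(\beta = 1.5\):

set.seed(3900)
N <- 500
X <- rnorm(N)
p <- 1 / (1 + exp(-(-0.5 + 1.5 * X)))  # logistic function of the linear predictor
Y <- rbinom(N, size = 1, prob = p)

fit <- glm(Y ~ X, family = binomial)
coef(fit)  # MLE estimates of (alpha, beta)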

R/RStudio Intro

  • R is an open-source programming language
  • Used for statistical computing and creating plots
  • Download and install R
  • RStudio is an open-source IDE (integrated development environment)
  • Download and install RStudio (scroll down for earlier versions)
  • install.packages("rmarkdown")
  • install.packages("knitr")
  • tinytex::install_tinytex()
  • Download this .Rmd and open in RStudio
  • Compile to a PDF (HW submission will be a PDF file)
  • Download the R Markdown tutorial and open it in RStudio
  • Subscripts and superscripts: to get \(Y_{i}^{a}\) inline use $Y_{i}^{a}$