Solutions: Parametric g-formula Lab
- The goal: estimate the average treatment effect (ATE) \(E(Y^{a=1}) - E(Y^{a=0})\).
- If I knew both potential outcomes for everyone in the population, I could just take the averages.
- The fundamental problem of causal inference: I can only observe one potential outcome for each person.
- A potential solution: create a model for the outcome
- For a person who is treated, I know their outcome under treatment. I use a model to predict what their outcome would be if they had not been treated.
- I can predict the other potential outcome for everyone in my dataset using a model!
Download the Data
Follow the steps on Ed Discussion to download the data!
## ── Attaching core tidyverse packages ──── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- readRDS("assets/discussions/d.RDS")
After you download the data and save it into the variable name d
, go up top to where it says knitr::opts_chunk$set(echo = TRUE, eval=F)
at the top of the R Markdown file and change eval
to true, i.e. set eval=T
so that when you knit the file, your code runs.
Step 0: Learn a model to predict \(Y\) given \(\{A,\vec{L}\}\)
You can learn more about the lm
function here.
Note that this function takes several arguments. The first argument is a formula, specifically a symbolic description of the model to be fitted.
First, we create the formula f
which represents the outcome y
as a function of our treatment a
and all our confounders. Then, using the lm()
function, we fit a model with linear regression.
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income + log_parent_wealth + test_percentile)
# Estimate the model with OLS
fit <- lm(formula = f, data = d)
Step 1: Create Two Copies of the Original Data
Next, we are going to create two copies of our original data. The first copy will set all the treatment values = "no_college"
and the second copy will set all the treatment values = "college"
.
Step 2: Predict both potential outcomes for everyone
Next we will use our model (estimated above) to predict both potential outcomes for each individual using the two data.frames that we just created. We will then append the estimated potential outcomes to our original data.
To see more details on the predict
function, see the
documentation here.
# Estimate potential outcomes using our model
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control)
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated)
# append the estimated potential outcomes to our original data.
conditional_average_outcomes <- d %>%
mutate(yhat0 = potential_outcomes_under_control,
yhat1 = potential_outcomes_under_treatment)
Step 3: Estimate the causal effect
Now let’s compute the causal effect by taking the difference between their potential outcomes and averaging.
One way to do this is:
conditional_average_effects <- conditional_average_outcomes %>%
mutate(effect = yhat1 - yhat0)
conditional_average_effects %>%
select(yhat1, yhat0, effect) %>%
summarise_all(.funs = mean)
## # A tibble: 1 × 3
## yhat1 yhat0 effect
## <dbl> <dbl> <dbl>
## 1 0.404 0.163 0.241
An alternative way to compute the treatment effect is below
## # A tibble: 1 × 1
## ate
## <dbl>
## 1 0.241
Extension 1: Conditional average effects
Modify the procedure above to estimate the average effect in subgroups defined by mom’s education:
- those with
sex == Male
- those with
sex == Female
Specifically this means, what is the average effect among those for whom sex == Male
?
Among those for whom sex == Female
? HINT: this should only be a slight modification
of your code above that takes the average for the whole data.frame.
conditional_average_effects %>%
select(yhat1, yhat0, effect, sex) %>%
group_by(sex) %>%
summarise(across(.fns = mean))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.fns = mean)`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated
## in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
## # A tibble: 2 × 4
## sex yhat1 yhat0 effect
## <chr> <dbl> <dbl> <dbl>
## 1 Female 0.348 0.129 0.219
## 2 Male 0.460 0.196 0.263
Extension 2: Logistic regression
In groups: Repeat the steps above with logistic regression
\[\text{log}\left(\frac{\hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}{1 - \hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}\right) = \hat\alpha + A\hat\beta + \vec{L}'\hat{\vec\gamma} + A\vec{L}'\hat{\vec\eta}\] Helpful hints:
Extension: Logistic regression
Fit a model
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
log_parent_wealth + test_percentile)
fit <- glm(formula = f, family = binomial, data = d)
Extension: Logistic regression
Predict and summarize to estimate the average effect
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control, type = "response")
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated, type = "response")
conditional_average_outcomes <- d %>%
mutate(yhat1 = potential_outcomes_under_treatment,
yhat0 = potential_outcomes_under_control,
effect = yhat1 - yhat0) %>%
select(yhat1, yhat0, effect) %>%
summarise(across(.fns = mean))
conditional_average_outcomes
## # A tibble: 1 × 3
## yhat1 yhat0 effect
## <dbl> <dbl> <dbl>
## 1 0.406 0.165 0.241