Discussion: Parametric g-formula

Download the Data

Follow the steps on Ed Discussion to download the data!

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- readRDS("d.RDS")

After you download the data and save it into the variable name d, go up top to where it says knitr::opts_chunk$set(echo = TRUE, eval=F) at the top of the R Markdown file and change eval to true, i.e. set eval=T so that when you knit the file, your code runs.

Learn a model to predict \(Y\) given \(\{A,\vec{L}\}\)

You can learn more about the lm function here. Note that this function takes several arguments. The first argument is a formula, specifically a symbolic description of the model to be fitted.

# The following formula represents Y as a function of our treatment (A) and all our confounders
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
  log_parent_wealth + test_percentile)

# Estimate the model with OLS
fit <- lm(formula = f, data = d)

Predict conditional average potential outcomes for every unit

Next, we are going to create two copies of our original data. The first copy will set all the treatment values = "no_college" and the second copy will set all the treatment values = "college".

d_all_control <- d %>%
  mutate(a = "no_college")

glimpse(d_all_control)
## Rows: 7,771
## Columns: 9
## $ a                 <chr> "no_college", "no_college", "no_college", "no_colleg…
## $ y                 <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex               <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race              <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ          <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ          <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile   <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
d_all_treated <- d %>%
  mutate(a = "college")

glimpse(d_all_treated)
## Rows: 7,771
## Columns: 9
## $ a                 <chr> "college", "college", "college", "college", "college…
## $ y                 <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex               <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race              <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ          <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ          <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile   <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…

Next we will use our linear regression estimated above to predict both potential outcomes for each individual using the two data.frames that we just created. We will then append the estimated potential outcomes to our original data. To see more details on the predict function, see the documentation here.

# Estimate potential outcomes using our model
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control)
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated)

conditional_average_outcomes <- d %>%
  mutate(yhat0 = potential_outcomes_under_control,
         yhat1 = potential_outcomes_under_treatment)

glimpse(conditional_average_outcomes)
## Rows: 7,771
## Columns: 11
## $ a                 <chr> "college", "no_college", "no_college", "no_college",…
## $ y                 <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex               <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race              <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ          <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ          <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile   <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
## $ yhat0             <dbl> 0.16631844, 0.20681372, 0.07512611, 0.15039113, 0.06…
## $ yhat1             <dbl> 0.4474505, 0.5498541, 0.3049190, 0.3745281, 0.293438…

Difference to estimate conditional average effects

Now, we will estimate the causal effect for each individual by taking the difference between their potential outcomes.

conditional_average_effects <- conditional_average_outcomes %>%
  mutate(effect = yhat1 - yhat0)

Average over units

Now, we can calculate the average treatment effect by taking the average individual effect across the whole data.frame. HINT: Use the summarise function to take the mean effect over the whole data.frame. summarise(across(.fns = mean))

conditional_average_effects %>%
  select(yhat1, yhat0, effect) %>%
  summarise_all(.funs = mean)

Alternative Solution:

conditional_average_effects %>%
  select(yhat1, yhat0, effect) %>%
  summarise(across(.fns = mean))

Extension 1: Conditional average effects

Modify the procedure above to estimate the average effect in subgroups defined by mom’s education:

  1. those with sex == Male
  2. those with sex == Female

Specifically this means, what is the average effect among those for whom sex == Male? Among those for whom sex == Female? HINT: this should only be a slight modification of your code above that takes the average for the whole data.frame.

conditional_average_effects %>%
  select(yhat1, yhat0, effect, sex) %>%
  group_by(sex) %>%
  summarise(across(.fns = mean))
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.fns = mean)`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.

Extension 2: Logistic regression

In groups: Repeat the steps above with logistic regression

\[\text{log}\left(\frac{\hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}{1 - \hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}\right) = \hat\alpha + A\hat\beta + \vec{L}'\hat{\vec\gamma} + A\vec{L}'\hat{\vec\eta}\] Helpful hints:

  • This should be a simple extension of the code above that uses lm().
  • read about using glm() to estimate logistic regression
  • when using predict(), search to find out how to predict probabilities

Extension: Logistic regression

Fit a model

f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
  log_parent_wealth + test_percentile)

fit <- glm(formula = f, family = binomial, data = d)

Extension: Logistic regression

Predict and summarize to estimate the average effect

potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control, type = "response")
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated, type = "response")

conditional_average_outcomes <- d %>%
  mutate(yhat1 = potential_outcomes_under_treatment,
         yhat0 = potential_outcomes_under_control,
         effect = yhat1 - yhat0) %>%
  select(yhat1, yhat0, effect) %>%
  summarise(across(.fns = mean))

conditional_average_outcomes