Follow the steps on Ed Discussion to download the data!
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- readRDS("d.RDS")
After you download the data and save it into the variable d, go up to the top of the R Markdown file where it says knitr::opts_chunk$set(echo = TRUE, eval=F) and change eval to true, i.e. set eval=T, so that when you knit the file, your code runs.
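For reference, and only as an illustration (your chunk options may differ slightly), the setup line at the top of the file would then read roughly:
# Setup chunk option after the change: evaluate code chunks when knitting
knitr::opts_chunk$set(echo = TRUE, eval = TRUE)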
You can learn more about the lm function here. Note that this function takes several arguments. The first argument is a formula, specifically a symbolic description of the model to be fitted.
# The following formula represents Y as a function of our treatment (A) and all our confounders
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
log_parent_wealth + test_percentile)
# Estimate the model with OLS
fit <- lm(formula = f, data = d)
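Before using the model for prediction, you can optionally inspect the fit. This is just a sanity check, not a required step:
# Optional: look at the estimated coefficients of the fitted model
summary(fit)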
Next, we are going to create two copies of our original data. The first copy will set all the treatment values to "no_college" and the second copy will set all the treatment values to "college".
d_all_control <- d %>%
mutate(a = "no_college")
glimpse(d_all_control)
## Rows: 7,771
## Columns: 9
## $ a <chr> "no_college", "no_college", "no_college", "no_colleg…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
d_all_treated <- d %>%
mutate(a = "college")
glimpse(d_all_treated)
## Rows: 7,771
## Columns: 9
## $ a <chr> "college", "college", "college", "college", "college…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
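As an optional sanity check, you can confirm that every row in each copy now carries a single treatment value:
# Optional check: each counterfactual copy should contain only one value of a
d_all_control %>% count(a)
d_all_treated %>% count(a)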
Next, we will use the linear regression estimated above to predict both potential outcomes for each individual using the two data.frames that we just created, and then append the estimated potential outcomes to our original data. Because y is logical, lm() treats it as 0/1, so these predictions are estimated probabilities. To see more details on the predict function, see the documentation here.
# Estimate potential outcomes using our model
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control)
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated)
conditional_average_outcomes <- d %>%
mutate(yhat0 = potential_outcomes_under_control,
yhat1 = potential_outcomes_under_treatment)
glimpse(conditional_average_outcomes)
## Rows: 7,771
## Columns: 11
## $ a <chr> "college", "no_college", "no_college", "no_college",…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
## $ yhat0 <dbl> 0.16631844, 0.20681372, 0.07512611, 0.15039113, 0.06…
## $ yhat1 <dbl> 0.4474505, 0.5498541, 0.3049190, 0.3745281, 0.293438…
Now, we will estimate the causal effect for each individual by taking the difference between their potential outcomes.
conditional_average_effects <- conditional_average_outcomes %>%
mutate(effect = yhat1 - yhat0)
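If you are curious, you can also look at how the estimated individual-level effects are distributed before averaging them (optional):
# Optional: distribution of the estimated individual-level effects
summary(conditional_average_effects$effect)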
Now, we can calculate the average treatment effect by taking the average individual effect across the whole data.frame.
HINT: Use the summarise function to take the mean effect over the whole data.frame, e.g. summarise(across(everything(), mean)).
conditional_average_effects %>%
select(yhat1, yhat0, effect) %>%
summarise_all(.funs = mean)
Alternative Solution:
conditional_average_effects %>%
select(yhat1, yhat0, effect) %>%
summarise(across(everything(), mean))
Modify the procedure above to estimate the average effect in subgroups defined by sex:
sex == Male
sex == Female
Specifically, what is the average effect among those for whom sex == Male? Among those for whom sex == Female? HINT: this should only be a slight modification of your code above that takes the average for the whole data.frame.
conditional_average_effects %>%
  select(yhat1, yhat0, effect, sex) %>%
  group_by(sex) %>%
  summarise(across(everything(), mean))
In groups: Repeat the steps above with logistic regression.
\[\text{log}\left(\frac{\hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}{1 - \hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}\right) = \hat\alpha + A\hat\beta + \vec{L}'\hat{\vec\gamma} + A\vec{L}'\hat{\vec\eta}\]
Helpful hints:
Instead of lm(), use glm() to estimate logistic regression.
In predict(), search to find out how to predict probabilities.
Fit a model
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
log_parent_wealth + test_percentile)
fit <- glm(formula = f, family = binomial, data = d)
Predict and summarize to estimate the average effect
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control, type = "response")
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated, type = "response")
conditional_average_outcomes <- d %>%
mutate(yhat1 = potential_outcomes_under_treatment,
yhat0 = potential_outcomes_under_control,
effect = yhat1 - yhat0) %>%
select(yhat1, yhat0, effect) %>%
summarise(across(everything(), mean))
conditional_average_outcomes
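If you also want to repeat the subgroup comparison by sex with the logistic model, one possible sketch (reusing the glm predictions computed above) is:
# Sketch: average effects by sex using the logistic-regression predictions
d %>%
  mutate(yhat1 = potential_outcomes_under_treatment,
         yhat0 = potential_outcomes_under_control,
         effect = yhat1 - yhat0) %>%
  select(sex, yhat1, yhat0, effect) %>%
  group_by(sex) %>%
  summarise(across(everything(), mean))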