Follow the steps on Ed Discussion to download the data!

`library(tidyverse)`

```
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```

`d <- readRDS("d.RDS")`

After you download the data and save it into the variable name
`d`

, go up top to where it says
`knitr::opts_chunk$set(echo = TRUE, eval=F)`

at the top of
the R Markdown file and change `eval`

to true, i.e. set
`eval=T`

so that when you knit the file, your code runs.

You can learn more about the `lm`

function **here**.
Note that this function takes several arguments. The first argument is a
formula, specifically a symbolic description of the model to be
fitted.

```
# The following formula represents Y as a function of our treatment (A) and all our confounders
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
log_parent_wealth + test_percentile)
# Estimate the model with OLS
fit <- lm(formula = f, data = d)
```

Next, we are going to create two copies of our original data. The
first copy will set all the treatment values `= "no_college"`

and the second copy will set all the treatment values
`= "college"`

.

```
d_all_control <- d %>%
mutate(a = "no_college")
glimpse(d_all_control)
```

```
## Rows: 7,771
## Columns: 9
## $ a <chr> "no_college", "no_college", "no_college", "no_colleg…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
```

```
d_all_treated <- d %>%
mutate(a = "college")
glimpse(d_all_treated)
```

```
## Rows: 7,771
## Columns: 9
## $ a <chr> "college", "college", "college", "college", "college…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
```

Next we will use our linear regression estimated above to predict
both potential outcomes for each individual using the two data.frames
that we just created. We will then append the estimated potential
outcomes to our original data. To see more details on the
`predict`

function, see the documentation
here.

```
# Estimate potential outcomes using our model
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control)
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated)
conditional_average_outcomes <- d %>%
mutate(yhat0 = potential_outcomes_under_control,
yhat1 = potential_outcomes_under_treatment)
glimpse(conditional_average_outcomes)
```

```
## Rows: 7,771
## Columns: 11
## $ a <chr> "college", "no_college", "no_college", "no_college",…
## $ y <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE…
## $ sex <chr> "Female", "Male", "Female", "Male", "Female", "Male"…
## $ race <fct> Non-Hispanic Non-Black, Hispanic, Hispanic, Hispanic…
## $ mom_educ <fct> < HS, Some college, High school, High school, High s…
## $ dad_educ <fct> College, Some college, No dad, High school, No dad, …
## $ log_parent_income <dbl> 8.914567, 6.907755, 9.367344, 9.350878, 10.060380, 8…
## $ log_parent_wealth <dbl> 8.767242, 8.100612, 6.907755, 7.920992, 8.922658, 8.…
## $ test_percentile <dbl> 45.07000, 58.48300, 37.01200, 41.99211, 22.00100, 3.…
## $ yhat0 <dbl> 0.16631844, 0.20681372, 0.07512611, 0.15039113, 0.06…
## $ yhat1 <dbl> 0.4474505, 0.5498541, 0.3049190, 0.3745281, 0.293438…
```

Now, we will estimate the causal effect for each individual by taking the difference between their potential outcomes.

```
conditional_average_effects <- conditional_average_outcomes %>%
mutate(effect = yhat1 - yhat0)
```

Now, we can calculate the average treatment effect by taking the
average individual effect across the whole data.frame.
**HINT**: Use the `summarise`

function to take
the mean effect over the whole data.frame.
`summarise(across(.fns = mean))`

```
conditional_average_effects %>%
select(yhat1, yhat0, effect) %>%
summarise_all(.funs = mean)
```

Alternative Solution:

```
conditional_average_effects %>%
select(yhat1, yhat0, effect) %>%
summarise(across(.fns = mean))
```

Modify the procedure above to estimate the average effect in subgroups defined by mom’s education:

- those with
`sex == Male`

- those with
`sex == Female`

Specifically this means, what is the average effect among those for
whom `sex == Male`

? Among those for whom
`sex == Female`

? **HINT**: this should only be a
slight modification of your code above that takes the average for the
whole data.frame.

```
conditional_average_effects %>%
select(yhat1, yhat0, effect, sex) %>%
group_by(sex) %>%
summarise(across(.fns = mean))
```

```
## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(.fns = mean)`.
## Caused by warning:
## ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
```

In groups: Repeat the steps above with logistic regression

\[\text{log}\left(\frac{\hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}{1 - \hat{P}\left(Y\mid A = a, \vec{L} = \vec\ell\right)}\right) = \hat\alpha + A\hat\beta + \vec{L}'\hat{\vec\gamma} + A\vec{L}'\hat{\vec\eta}\] Helpful hints:

- This should be a simple extension of the code above that uses
`lm()`

. - read about using
`glm()`

to estimate logistic regression - when using
`predict()`

, search to find out how to predict probabilities

Fit a model

```
f <- y ~ a*(sex + race + mom_educ + dad_educ + log_parent_income +
log_parent_wealth + test_percentile)
fit <- glm(formula = f, family = binomial, data = d)
```

Predict and summarize to estimate the average effect

```
potential_outcomes_under_control <- predict(object = fit, newdata = d_all_control, type = "response")
potential_outcomes_under_treatment <- predict(object = fit, newdata = d_all_treated, type = "response")
conditional_average_outcomes <- d %>%
mutate(yhat1 = potential_outcomes_under_treatment,
yhat0 = potential_outcomes_under_control,
effect = yhat1 - yhat0) %>%
select(yhat1, yhat0, effect) %>%
summarise(across(.fns = mean))
conditional_average_outcomes
```