# 11 Data-driven methods

## 11.1 Introduction

Nov 14

Slides.

For any given intervention, some subgroups of people will respond more than others. Ideas from machine learning can help us target human attention toward these subgroups.

**Concrete example.** Who responds most to a nudge to go for a walk? Imagine you first conduct a survey that asks people how much they love the fall, from \(X = 1\) (least) to \(X = 10\) (most). You then randomize them to a control condition (\(A = \texttt{untreated}\)) or a treatment condition (\(A = \texttt{treated}\)) that encourages them to go for a walk outside. The outcome \(Y\) is active minutes in the day, as recorded on an activity tracker.

**Simulated data.** In real data, it can be difficult to evaluate causal estimators because the truth is unknown. Today we will use data simulated from a known process in order to study the properties of estimators. The code below will prepare your R environment with a function `simulate_sample()` that will generate data with 50 observations.

```
library(tidyverse)
source("https://raw.githubusercontent.com/causal3900/causal3900.github.io/main/assets/data/simulate_sample.R")
```

Here is an example of the code to simulate data:

`simulated <- simulate_sample()`

```
##   X         A         Y
## 1 1 untreated 64.107442
## 2 1   treated  1.877372
## 3 1 untreated 22.928282
## 4 1   treated 87.447344
## 5 1 untreated 10.047032
## 6 2   treated 25.245564
```

**Causal estimands.** In this example, we would like to estimate \[\tau_x = E(\underbrace{Y^1 - Y^0}_{\substack{\text{effect of}\\\text{nudge to walk}\\\text{on active}\\\text{minutes}}}\mid \underbrace{X = x}_{\substack{\text{among those}\\\text{with love of}\\\text{fall = }x}})\]
for each value \(x = 1,\dots,10\). These estimands are the average causal effect of a nudge to walk (\(A\)) on active minutes (\(Y\)) within subgroups defined by each value of the scale for love of fall (\(X\)).

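**Identification.** In our simulated data, \(A\) is assigned at random, so there are no backdoor paths between \(A\) and \(Y\). Each \(\tau_x\) is therefore identified by the difference in mean outcomes between treated and untreated units with \(X = x\).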
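Written out, the identification argument can be sketched in a few steps. This is the standard reasoning under randomization, not a derivation taken from the course materials: the third line uses exchangeability within levels of \(X\) (which random assignment provides), and the fourth uses consistency (the observed outcome equals the potential outcome under the treatment actually received).

\[\begin{aligned}
\tau_x &= E(Y^1 - Y^0 \mid X = x) \\
&= E(Y^1 \mid X = x) - E(Y^0 \mid X = x) \\
&= E(Y^1 \mid A = \texttt{treated}, X = x) - E(Y^0 \mid A = \texttt{untreated}, X = x) \\
&= E(Y \mid A = \texttt{treated}, X = x) - E(Y \mid A = \texttt{untreated}, X = x)
\end{aligned}\]

The last line involves only observable quantities, which is what the estimator below targets with subgroup means.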

**Estimator.** An estimator is a function that takes a dataset and returns estimates. Below is a nonparametric estimator for our setting.

```
estimator <- function(data) {
  data %>%
    # Group by treatment A and subgroup variable X
    group_by(A, X) %>%
    # Average the outcome within each group
    summarize(Y = mean(Y),
              .groups = "drop") %>%
    # Reshape so each row is one value of X, with one column per treatment
    pivot_wider(names_from = "A",
                values_from = "Y",
                names_prefix = "y_") %>%
    # Estimate the effect within each value of X
    mutate(effect = y_treated - y_untreated)
}
```

You can apply this estimator as follows.

`estimate <- estimator(simulated)`

**Task.** Using a sample simulated on your computer, estimate the average causal effect of \(A\) on \(Y\) within subgroups defined by \(X\). Report two numbers to us.

- For which value of \(X\) is the estimated effect of \(A\) most positive?
- What is that effect estimate?
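One way to read off both numbers is dplyr's `slice_max()`. The sketch below applies it to a small hand-made table with the same columns that `estimator()` returns. The values are made up for illustration and are not output from `simulate_sample()`.

```r
library(dplyr)

# Hypothetical estimates, shaped like the output of estimator().
# These numbers are invented for illustration only.
estimate <- data.frame(
  X = 1:4,
  y_treated = c(40, 55, 32, 61),
  y_untreated = c(35, 30, 38, 50)
) %>%
  mutate(effect = y_treated - y_untreated)

# Keep the row with the most positive effect estimate
best <- estimate %>%
  slice_max(effect, n = 1)

best$X       # value of X with the largest estimate
best$effect  # the estimate itself
```

With your own sample, you would replace the hand-made table with `estimator(simulated)`.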

We will discuss the distribution of estimates that we get as a class.

If you are ready early, you could think about how you might evaluate the performance of this approach over many repeated simulations.
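As a starting point, here is a minimal sketch of one such evaluation. It does not use the course's `simulate_sample()`. Instead it defines a hypothetical stand-in, `toy_sample()`, in which the true effect is zero for every value of \(X\), so that any nonzero estimate is pure noise. The outcome distribution and the number of repetitions are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)

set.seed(3900)

# Hypothetical stand-in for simulate_sample() (an assumption of this sketch).
# Treatment is randomized within each X, and the true effect is 0 everywhere.
toy_sample <- function(n_per_x = 5) {
  data.frame(X = rep(1:10, each = n_per_x)) %>%
    group_by(X) %>%
    mutate(A = sample(rep(c("untreated", "treated"), length.out = n()))) %>%
    ungroup() %>%
    mutate(Y = rnorm(n(), mean = 30, sd = 30))
}

# Same nonparametric estimator as in the text
estimator <- function(data) {
  data %>%
    group_by(A, X) %>%
    summarize(Y = mean(Y), .groups = "drop") %>%
    pivot_wider(names_from = "A", values_from = "Y", names_prefix = "y_") %>%
    mutate(effect = y_treated - y_untreated)
}

# Repeat many times: simulate, estimate, keep the most positive estimate
largest_estimates <- replicate(200, {
  max(estimator(toy_sample())$effect)
})

# Every true effect is 0 here, so the average of the largest estimate
# shows how far selecting the maximum sits above the truth
mean(largest_estimates)
```

Because the truth is known in the stand-in, the gap between the average largest estimate and zero is a direct measure of how much picking the "best" subgroup overstates its effect in this toy setting.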

## 11.2 Machine learning approaches

Nov 16

Slides.

Today we generalize the ideas from Tuesday. We will discuss how sample splitting makes it easier to

- choose among many estimands
- choose among many estimators
- develop new data science approaches
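To make the idea concrete, here is a minimal sketch of sample splitting for the first item, choosing among estimands. It is one possible illustration, not necessarily the procedure used in class. One half of the data selects the subgroup with the most positive estimate, and the other half estimates the effect in that subgroup. `toy_sample()` is a hypothetical stand-in for `simulate_sample()`, with a larger sample so that every cell is non-empty in both halves.

```r
library(dplyr)
library(tidyr)

set.seed(3900)

# Hypothetical stand-in for simulate_sample() (an assumption of this sketch)
toy_sample <- function(n_per_x = 20) {
  data.frame(X = rep(1:10, each = n_per_x)) %>%
    group_by(X) %>%
    mutate(A = sample(rep(c("untreated", "treated"), length.out = n()))) %>%
    ungroup() %>%
    mutate(Y = rnorm(n(), mean = 30, sd = 30))
}

# Same nonparametric estimator as in the previous section
estimator <- function(data) {
  data %>%
    group_by(A, X) %>%
    summarize(Y = mean(Y), .groups = "drop") %>%
    pivot_wider(names_from = "A", values_from = "Y", names_prefix = "y_") %>%
    mutate(effect = y_treated - y_untreated)
}

# Randomly assign each row to a half, balanced within each (X, A) cell
sim_split <- toy_sample() %>%
  group_by(X, A) %>%
  mutate(half = sample(rep(c("select", "estimate"), length.out = n()))) %>%
  ungroup()

# Selection half: choose the subgroup with the most positive estimate
chosen_x <- estimator(filter(sim_split, half == "select")) %>%
  slice_max(effect, n = 1, with_ties = FALSE) %>%
  pull(X)

# Estimation half: estimate the effect in the chosen subgroup,
# using only rows that played no role in the selection
honest <- estimator(filter(sim_split, half == "estimate")) %>%
  filter(X == chosen_x)

honest
```

The design choice is that the rows used to pick `chosen_x` are never reused to estimate its effect, so the second estimate is not inflated by the selection step.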