The first exercise and problem set 4 use the lalonde
dataset from the following paper:
Dehejia, R. H. and Wahba, S. 1999. Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs. Journal of the American Statistical Association 94(448):1053–1062.
The paper compares methods for observational causal inference to recover an average causal effect that was already known from a randomized experiment. You do not need to read the paper; we will just use the study’s data as an illustration. We’ll load the data into R with the first code block.
To learn about the data, type ?lalonde
in your R
console.
Our goal is to estimate the effect of job training treat
on future earnings re78
(real earnings in 1978), among
those who received job training (the average treatment effect on the
treated, ATT).
matchit()
to conduct a matchingFor this part, we assume that three variables comprise a sufficient
adjustment set: race
, married
, and
nodegree
. We use matchit
with:
treat ~ race + married + nodegree
method = "exact"
to conduct exact matching, which
matches two units only if they are identical along race
,
married
, and nodegree
data = lalonde
since we are using the
lalonde
dataestimand = "ATT"
since we are targeting the average
treatment effect on the treated (ATT)We then use the summary()
function to see how many
control units and how many treatment units were matched.
exact_low <- matchit(treat ~ race + married + nodegree,
data = lalonde,
method = "exact",
estimand = "ATT")
# Note: There are multiple correct ways to extract the numbers below
summary(exact_low)$nn
## Control Treated
## All (ESS) 429.0000 185
## All 429.0000 185
## Matched (ESS) 111.5254 185
## Matched 429.0000 185
## Unmatched 0.0000 0
## Discarded 0.0000 0
Question: How many control units were matched? How many treated units?
All treated and control units were kept!
Here, we estimate a linear regression model using the match data from
2.1 using the lm()
function with the formula
re78 ~ treat + race + married + nodegree
. We pass weights
that come from the matching. Notice that for this piece, we have passed
the matched data match.data(exact_low)
. The coefficient in
front of the variable treat
in the linear regression is our
estimated effect.
fit <- lm(re78 ~ treat + race + married + nodegree,
data = match.data(exact_low),
w = weights)
print(round(coef(fit)["treat"],2))
## treat
## 1309.9
Question: What is the estimated effect of job training on earnings?
Answer. The estimate suggests that job training increases future earnings by $1309.90.
In matching, one thing we care about is balance across covariates. In
other words, we want to see that the distributions of different
covariates are about the same between the treatment and the control
groups. We can check how well the balancing has been done with the
summary()
function.
interactions
: check interaction terms too? (T or
F)un
: show statistics for unmatched data as well? (T or
F)summary(exact_low, interactions = F, un = F)$sum.matched
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## raceblack 0.84324324 0.84324324 -4.440892e-16 NA 4.440892e-16
## racehispan 0.05945946 0.05945946 -4.857226e-17 NA 4.857226e-17
## racewhite 0.09729730 0.09729730 -6.938894e-17 NA 6.938894e-17
## married 0.18918919 0.18918919 -1.387779e-16 NA 1.387779e-16
## nodegree 0.70810811 0.70810811 -3.330669e-16 NA 3.330669e-16
## eCDF Max Std. Pair Dist.
## raceblack 4.440892e-16 0
## racehispan 4.857226e-17 0
## racewhite 6.938894e-17 0
## married 1.387779e-16 0
## nodegree 3.330669e-16 0
Question: What do you notice about the means of different covariates for the treated versus control groups?
Answer: Their means are the same!
In this case, we basically have perfect balance. This doesn’t always happen. Depending on the method and parameters you use, you could have “bad” matches where the covariates are unbalanced. If you conduct a matching and the covariate balance doesn’t look good, try another matching procedure!
You will use the results from this section in Problem Set 4.
matchit()
to conduct a matchingNow suppose the adjustment set needs to also include 1974 earnings,
re74
. The adjustment set for this part is
race
, married
, nodegree
, and
re74
. Repeat exact matching as above.
exact_high <- matchit(treat ~ race + married + nodegree + re74,
data = lalonde,
method = "exact",
estimand = "ATT")
# Note: There are multiple correct ways to extract the numbers below
summary(exact_high)$nn
## Control Treated
## All (ESS) 429.00000 185
## All 429.00000 185
## Matched (ESS) 48.73116 131
## Matched 108.00000 131
## Unmatched 321.00000 54
## Discarded 0.00000 0
Question: How many control units were matched? How many treated units?
Now only 108 out of 429 control units are matched, and only 131 out of 185 treated units.
Look at the re74
values in the full data and among the
matched units.
Here is one way to do this:
select()
function to get the re74
column in the full data. Pass this to the
summary()
function to look at descriptive statistics of the
re74
values in the full data.select()
function to get the re74
column in the matched data. Pass this to the
summary()
function to look at descriptive statistics of the
re74
values in the full data. You can get the matched data
using the match.data
function.Full data:
summary(
lalonde %>%
select(re74)
)
## re74
## Min. : 0
## 1st Qu.: 0
## Median : 1042
## Mean : 4558
## 3rd Qu.: 7888
## Max. :35040
Matched data:
matched_data <- match.data(exact_high)
summary(
matched_data %>%
select(re74)
)
## re74
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
Explain what happened: What do you notice? What is
different about the values of re74
in the full data versus
the matched data? Explain what happened and why it happened. Briefly
interpret the result from 3.2: what is the drawback of using exact
matching in this setting?