The “MatchIt” package has many options for conducting matching! In this example, we walk through different ways to do matching using the NLSY data you downloaded for last week’s lab. Walking through this example will be helpful for Problem Set 4.
If you try running the block below and it gives you an error, you may need to install some packages first.
install.packages("optmatch")
install.packages("sandwich")
install.packages("MatchIt")
install.packages("marginaleffects")
First, we’ll load the data set from last week.
d <- readRDS("d.RDS")
head(d)
## # A tibble: 6 × 9
## a y sex race mom_educ dad_educ log_parent_income log_parent_wealth
## <chr> <lgl> <chr> <fct> <fct> <fct> <dbl> <dbl>
## 1 colle… FALSE Fema… Non-… < HS College 8.91 8.77
## 2 no_co… FALSE Male Hisp… Some co… Some co… 6.91 8.10
## 3 no_co… FALSE Fema… Hisp… High sc… No dad 9.37 6.91
## 4 no_co… TRUE Male Hisp… High sc… High sc… 9.35 7.92
## 5 no_co… FALSE Fema… Hisp… High sc… No dad 10.1 8.92
## 6 no_co… FALSE Male Hisp… High sc… No dad 8.21 8.92
## # ℹ 1 more variable: test_percentile <dbl>
The matchit() function

All the options can be found in the documentation (?matchit), but below we've listed some of them for convenience! (A generic template call follows the list.)

method: the matching method to use; the options used in this lab are "nearest" (greedy nearest neighbor), "optimal", and "cem" (coarsened exact matching)

distance: the distance metric to use; some options are "euclidean", "scaled_euclidean", "mahalanobis", and "robust_mahalanobis" (or "glm" for a propensity score)

replace: if TRUE, does matching with replacement; if FALSE, does matching without replacement

caliper: for methods that allow it, this specifies the width(s) of the caliper(s) to use in matching

ratio: for methods that allow it, how many control units should be matched to each treated unit in k:1 matching (should be a single integer value)
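Putting these together, a generic call looks like the sketch below. The data set my_data, treatment treat, and covariates x1 and x2 are hypothetical placeholders, and the argument values are illustrative, not recommendations:

# A template matchit() call -- placeholder names, illustrative values
m.out <- matchit(treat ~ x1 + x2,          # treatment ~ matching covariates
                 data     = my_data,       # hypothetical data frame
                 method   = "nearest",     # or "optimal", "cem", ...
                 distance = "mahalanobis", # or "euclidean", "glm", ...
                 replace  = FALSE,         # match without replacement
                 ratio    = 1)             # 1:1 matching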
Optimal matching is usually better than greedy matching, as long as your data isn’t too big.
# This is an example of running a greedy nearest neighbor matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
method = "nearest",
distance = "euclidean",
data = d)
# This is an example of running an optimal matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
method = "optimal",
distance = "euclidean",
data = d)
Ask yourself: What is different about the above two code chunks? What is the same?
The following code times the two methods! Try it with different values of n by changing the value of size:
## Optimal vs greedy with NLSY data
ind <- sample(nrow(d), size = 3000, replace = T)
# Method: greedy nearest neighbor
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
+ test_percentile,
method = "nearest", distance = "euclidean",
data = d[ind, ]))
# Method: optimal matching
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
+ test_percentile,
method = "optimal", distance = "euclidean",
data = d[ind, ]))
Ask yourself: What do you notice about the differences in the time it takes to run each method?
On the full data, using optimal matching is possible but can take a bit of time. On larger data sets, it might not be feasible:
# With full data set greedy matching takes ~ 0.5-1.5 seconds
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
+ test_percentile,
method = "nearest", distance = "euclidean",
data = d))
# Meanwhile, optimal matching takes 60-130 seconds
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
+ test_percentile,
method = "optimal", distance = "euclidean",
data = d))
The Euclidean distance metric can be absolutely fine to use, depending on which covariates you are considering.
The Mahalanobis distance metric is motivated by two principles: variables measured on larger scales should not dominate the distance, and strongly correlated variables should not be double-counted. If we know our variables have similar variances and no correlations, then using Euclidean is probably fine! For our data, the covariance (and correlation) matrix of the 3 variables we are interested in can be computed as shown below.
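The rendered matrix isn't reproduced here, but a minimal way to compute it yourself, using the column names shown in head(d) above:

# Covariance and correlation of the three matching covariates
X <- d[, c("log_parent_income", "log_parent_wealth", "test_percentile")]
cov(X, use = "complete.obs") # complete.obs guards against any missing values
cor(X, use = "complete.obs")

With that in mind, compare greedy matching under the two metrics: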
# This is an example of running a greedy nearest neighbor matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
method = "nearest",
distance = "euclidean",
data = d)
# This is an example of running a greedy nearest neighbor matching with the mahalanobis distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
method = "nearest",
distance = "mahalanobis",
data = d)
Ask yourself: What is different about the above two code chunks? What is the same? Bonus: What is the difference between using the euclidean distance metric and the mahalanobis distance metric? Why would you use one over the other?
For the remainder of the example, we'll use a portion of the data from a job training program. In particular, the treatment is whether or not someone participated in a job training program. The outcome of interest is their salary in 1978 (re78).
# Load the data
data("lalonde")
# See what's in the data
?lalonde # this opens up the "Help" tab with the documentation
head(lalonde) # this shows you the first few rows of the dataset
## treat age educ race married nodegree re74 re75 re78
## NSW1 1 37 11 black 1 1 0 0 9930.0460
## NSW2 1 22 9 hispan 0 1 0 0 3595.8940
## NSW3 1 30 12 black 0 0 0 0 24909.4500
## NSW4 1 27 11 black 0 1 0 0 7506.1460
## NSW5 1 33 8 black 0 1 0 0 289.7899
## NSW6 1 22 9 black 0 1 0 0 4056.4940
We expect income in 1974 (re74) to be highly correlated with income in 1975 (re75). re74 also has much higher variability than age.
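One quick way to check both claims on the lalonde data:

# Correlation of the two income variables, and their spread relative to age
cor(lalonde$re74, lalonde$re75)
sd(lalonde$re74)
sd(lalonde$age)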
## Suppose there are 4 individuals, with age and 1974 income observed
dat <- matrix(c(50, 5000,
20, 5900,
48, 5100,
25, 5050), ncol = 2, byrow = T)
colnames(dat) <- c("age", "re74")
dat
## age re74
## [1,] 50 5000
## [2,] 20 5900
## [3,] 48 5100
## [4,] 25 5050
Is individual 2 or 3 more similar to individual 1? To answer this, we should compute the distance between individuals 1 and 2, and between individuals 1 and 3. One way is to compute the Euclidean distances.
# Compute all pairwise Euclidean distances
dist(dat, method = "euclidean")
## 1 2 3
## 2 900.49986
## 3 100.02000 800.48985
## 4 55.90170 850.01471 55.03635
Under the Euclidean metric, individual 4 even looks closest to individual 1 (55.9) despite a 25-year age gap, because the scale of re74 swamps age. If we instead measure the Mahalanobis distance, which accounts for the different scaling of the variables, we get something more in line with what we'd expect:
# Compute the squared Mahalanobis distances of individuals 2-4 from
# individual 1, using the base R function mahalanobis()
mahalanobis(dat[-1,], dat[1,], cov(dat))
## [1] 4.91902314 0.05655527 4.42544987
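By this metric, individual 3 (0.057) is by far the most similar to individual 1, matching our intuition. (Note that mahalanobis() returns squared distances, so these values are not on the same scale as the Euclidean distances above.)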
Now let's try to run the matching procedure using the matchit() function. Let's start with Euclidean distance, even if it may not be great.
# method: nearest (i.e., greedy 1:1)
# distance: euclidean
# data: lalonde (the data set we're working with)
# replace: TRUE (T) or FALSE (F) - whether to match with or without replacement.
#   Note: when matching with replacement, greedy and optimal give the same
#   matches, so replace only needs to be specified for method = "nearest"
# ratio: how many control matches for each treated unit (i.e., k:1 matching)
# caliper: by default, the caliper width in standard deviation units (i.e., z-scores)
m.out0 <- matchit(treat ~ re74 + re75 + age + race,
method = "nearest",
distance = "euclidean",
data = lalonde,
replace = F,
ratio = 1)
We can check how well the balancing has been done. The following code reports the balance statistics shown below (and more):
?summary.matchit # Look in the Help tab (on the bottom right) for documentation on summary.matchit
# interactions: check interaction terms too? (T or F)
# un: show statistics for unmatched data as well (T or F)
summary(m.out0, interactions = F, un = F)
##
## Call:
## matchit(formula = treat ~ re74 + re75 + age + race, data = lalonde,
## method = "nearest", distance = "euclidean", replace = F,
## ratio = 1)
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## re74 2095.5737 2113.1168 -0.0036 1.1093 0.0124
## re75 1532.0553 1514.1484 0.0056 1.2864 0.0159
## age 25.8162 26.2486 -0.0604 0.4294 0.0851
## raceblack 0.8432 0.2757 1.5611 . 0.5676
## racehispan 0.0595 0.1568 -0.4114 . 0.0973
## racewhite 0.0973 0.5676 -1.5868 . 0.4703
## eCDF Max Std. Pair Dist.
## re74 0.1568 0.0490
## re75 0.1459 0.0805
## age 0.1892 0.9730
## raceblack 0.5676 1.7395
## racehispan 0.0973 0.6857
## racewhite 0.4703 1.7327
##
## Sample Sizes:
## Control Treated
## All 429 185
## Matched 185 185
## Unmatched 244 0
## Discarded 0 0
Ask yourself: How is the balance? How do the means compare between the treatment and control groups? What about the SMD and variance ratio?
We can also visually assess the matches with different plots:
## Plots the density of all variables in which.xs
plot(m.out0, type = "density", which.xs = ~age + re74 + race, interactive = F)
## Plots the empirical CDF of all variables in which.xs
plot(m.out0, type = "ecdf", which.xs = ~age + re74, interactive = F)
Let’s try using propensity scores with calipers.
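For intuition, distance = "glm" estimates each unit's propensity score with (by default) a logistic regression of the treatment on the covariates, roughly like the sketch below. MatchIt does this internally; you don't need to run it yourself:

# Roughly what distance = "glm" computes (a sketch, not MatchIt's exact code)
ps_model <- glm(treat ~ re74 + re75 + age + race + educ + married + nodegree,
                family = binomial, data = lalonde)
ps <- predict(ps_model, type = "response") # estimated propensity scores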
# distance: glm - estimate a propensity score via logistic regression
# caliper = .05 with std.caliper = F means the caliper is 0.05 on the raw
#   propensity score scale (with std.caliper = T it would be in standard deviations)
m.out0 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
method = "nearest", distance = "glm",
data = lalonde, replace = F, caliper = .05, std.caliper = F,
ratio = 1)
summary(m.out0, interactions = F, un = F)
As an alternative, let's try coarsened exact matching (CEM).
# method: cem - coarsen each covariate into bins, then match units
#   exactly on the binned values
# (not every argument is used by every method; replace and ratio are
#   kept here from the earlier calls)
m.out1 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
method = "cem",
data = lalonde, replace = F,
ratio = 1)
# We only match 65 of the original 185 treated units
summary(m.out1, interactions = F, un = F)
m.out2 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
method = "cem",
grouping = list(race = list(c("black"), c("hispan", "white"))),
data = lalonde, replace = F,
ratio = 1)
# How many of the 185 treated units are matched now?
summary(m.out2, interactions = F, un = F)
# We could group hispanic/white together for the purposes of matching
# and not enforce matching on married, educ, nodegree
m.out2 <- matchit(treat ~ re74 + re75 + age + race,
method = "cem",
grouping = list(race = list(c("black"), c("hispan", "white"))),
data = lalonde, replace = F,
ratio = 1)
# addlvariables: also report balance on variables not used in the matching
summary(m.out2, interactions = F, un = F, addlvariables = ~ married + nodegree + educ)
# Inspect the matched data; subclass identifies the matched groups
match.data(m.out2) %>%
  arrange(subclass)
Given the matching (and assuming it is good enough), we can estimate the ATT. But we can do even better by combining regression + matching.
# propensity score matched
m.out0.data <- match.data(m.out0)
# CEM matched
m.out2.data <- match.data(m.out2)
# Fitting a linear model of the outcome on the treatment works even when the
# matching weights are not all 1; we just need to feed the weights in
fit1 <- lm(re78 ~ treat, data = m.out2.data, weights = weights)
# vcov = "HC3": tells R to use robust standard errors
# newdata = the treated units only, since we are targeting the ATT
avg_comparisons(fit1, variables = "treat",
                vcov = "HC3",
                newdata = subset(m.out2.data, treat == 1),
                wts = "weights")
##
## Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 1469 1014 1.45 0.147 2.8 -517 3456
##
## Term: treat
## Type: response
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
# Now add the covariates to the outcome model (regression + matching),
# still using the matching weights
fit2 <- lm(re78 ~ treat + re74 + re75 + age + race + married + nodegree + educ,
           data = m.out2.data, weights = weights)
# vcov: tells R to use robust standard errors
avg_comparisons(fit2, variables = "treat",
                vcov = "HC3",
                newdata = subset(m.out2.data, treat == 1),
                wts = "weights")
##
## Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 1445 984 1.47 0.142 2.8 -484 3373
##
## Term: treat
## Type: response
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
# The same outcome model, now on the propensity-score matched data
fit3 <- lm(re78 ~ treat + re74 + re75 + age + race + married + nodegree + educ,
           data = m.out0.data, weights = weights)
# vcov: tells R to use robust standard errors
avg_comparisons(fit3, variables = "treat",
vcov = "HC3",
newdata = subset(m.out0.data, treat == 1),
wts = "weights")
##
## Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 1296 942 1.38 0.169 2.6 -551 3142
##
## Term: treat
## Type: response
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Ask yourself: What is the estimate and how do I interpret it?
In the problem set, we ask you to do your own matching and explain your choices. Some guiding questions to help you are: