More Examples of Matching with the “MatchIt” Package

The “MatchIt” package has many options for conducting matching! In this example, we walk through different ways to do matching using the NLSY data you downloaded for last week’s lab. Walking through this example will be helpful for Problem Set 4.

Load Data and Libraries

If you try running the block below and it gives you an error, you may need to install some packages first.
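
A minimal sketch of that setup block, assuming the packages used later in this lab (your own list may differ):

# If any of these fail, install them first, e.g. install.packages("MatchIt")
library(MatchIt)          # matchit(), match.data(), and the lalonde data
library(marginaleffects)  # avg_comparisons() for effect estimates
library(dplyr)            # %>% and arrange()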

First, we’ll load the data set from last week.

d <- readRDS("d.RDS")
head(d)
## # A tibble: 6 × 9
##   a      y     sex   race  mom_educ dad_educ log_parent_income log_parent_wealth
##   <chr>  <lgl> <chr> <fct> <fct>    <fct>                <dbl>             <dbl>
## 1 colle… FALSE Fema… Non-… < HS     College               8.91              8.77
## 2 no_co… FALSE Male  Hisp… Some co… Some co…              6.91              8.10
## 3 no_co… FALSE Fema… Hisp… High sc… No dad                9.37              6.91
## 4 no_co… TRUE  Male  Hisp… High sc… High sc…              9.35              7.92
## 5 no_co… FALSE Fema… Hisp… High sc… No dad               10.1               8.92
## 6 no_co… FALSE Male  Hisp… High sc… No dad                8.21              8.92
## # ℹ 1 more variable: test_percentile <dbl>

Different options in the matchit() function

All the options can be found in the MatchIt documentation, but below we’ve listed some of them for convenience!
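
If you want the full list from within R, the package help page is a good starting point:

?matchit  # opens the matchit() help page in the Help tab, which lists the method and distance options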

Example: Greedy Nearest Neighbor vs Optimal Matching

Optimal matching is usually better than greedy matching, as long as your data isn’t too big.

# This is an example of running a greedy nearest neighbor matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
                  method = "nearest", 
                  distance = "euclidean",
                  data = d)
# This is an example of running an optimal matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
                  method = "optimal", 
                  distance = "euclidean",
                  data = d)

Ask yourself: What is different about the above two code chunks? What is the same?

The following code times the two methods! Try it with different values of n by changing the value of size:

## Optimal vs greedy with NLSY data
ind <- sample(nrow(d), size = 3000, replace = T)

# Method: greedy nearest neighbor
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
                              + test_percentile,
                              method = "nearest", distance = "euclidean",
                              data = d[ind, ]))
# Method: optimal matching
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
                              + test_percentile,
                              method = "optimal", distance = "euclidean",
                              data = d[ind, ]))

Ask yourself: What do you notice about the differences in the time it takes to run each method?

On the full data, optimal matching is possible but can take a bit of time. On larger data sets, it might not be feasible:

# With full data set greedy matching takes ~ 0.5-1.5 seconds
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
                              + test_percentile,
                              method = "nearest", distance = "euclidean",
                              data = d))

# Meanwhile, optimal matching takes 60-130 seconds
system.time(m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth
                              + test_percentile,
                              method = "optimal", distance = "euclidean",
                              data = d))

Example: Different Distance Metrics

The Euclidean distance metric can be absolutely fine to use, depending on which covariates you are considering.

The Mahalanobis distance metric is motivated by two principles:

  • Principle 1: Address unequal variances
    • Age might range uniformly from 18 to 80
    • Education might range uniformly from 0 to 16
    • We want to correct for this so age doesn’t dominate the distance
  • Principle 2: Address correlations
    • Suppose we included age in years, age in months, and education
    • Age in years and age in months would be very highly correlated
    • We would want a distance that corrects for that correlation

If we know our variables have similar variances and no correlations, then using Euclidean distance is probably fine! For our data, it is worth looking at the covariance (and correlation) matrix of the 3 variables we are interested in.
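
A minimal sketch of that check (output not shown), assuming d still contains the three numeric covariates from the head() output above:

# Covariance and correlation of the three matching covariates
X <- d[, c("log_parent_income", "log_parent_wealth", "test_percentile")]
cov(X)
cor(X)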

# This is an example of running a greedy nearest neighbor matching with the euclidean distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
                  method = "nearest", 
                  distance = "euclidean",
                  data = d)
# This is an example of running a greedy nearest neighbor matching with the mahalanobis distance metric
m.out0 <- matchit(a == "college" ~ log_parent_income + log_parent_wealth + test_percentile,
                  method = "nearest", 
                  distance = "mahalanobis",
                  data = d)

Ask yourself: What is different about the above two code chunks? What is the same? Bonus: What is the difference between using the euclidean distance metric and the mahalanobis distance metric? Why would you use one over the other?

Matching with job training data from LaLonde (1986)

For the remainder of the example, we’ll use a portion of the data from a job training program. In particular, the treatment is whether or not someone participated in a job training program. The outcome of interest is their salary in 1978 (re78).

# Load the data
data("lalonde")

# See what's in the data
?lalonde # this opens up the "Help" tab with the documentation 
head(lalonde) # this shows you the first few rows of the dataset
##      treat age educ   race married nodegree re74 re75       re78
## NSW1     1  37   11  black       1        1    0    0  9930.0460
## NSW2     1  22    9 hispan       0        1    0    0  3595.8940
## NSW3     1  30   12  black       0        0    0    0 24909.4500
## NSW4     1  27   11  black       0        1    0    0  7506.1460
## NSW5     1  33    8  black       0        1    0    0   289.7899
## NSW6     1  22    9  black       0        1    0    0  4056.4940

We expect income in 1974 (re74) to be highly correlated with income in 1975 (re75). Both also have much higher variance than age.
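
A quick way to check this claim directly (a sketch; output not shown):

# Correlation of 1974 and 1975 earnings, and their variances relative to age
cor(lalonde$re74, lalonde$re75)
c(var(lalonde$re74), var(lalonde$re75), var(lalonde$age))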

## Suppose there are 4 individuals
dat <- matrix(c(50, 5000,
                20, 5900,
                48, 5100,
                25, 5050), ncol = 2, byrow = T)
colnames(dat) <- c("age", "re74")
dat
##      age re74
## [1,]  50 5000
## [2,]  20 5900
## [3,]  48 5100
## [4,]  25 5050

Is individual 2 or 3 more similar to individual 1? To answer this, we should compute the distances between individuals 1 and 2, and between individuals 1 and 3. One way is to use the Euclidean distance.

# Compute the pairwise Euclidean distances between the individuals
dist(dat, method = "euclidean")
##           1         2         3
## 2 900.49986                    
## 3 100.02000 800.48985          
## 4  55.90170 850.01471  55.03635

If we instead use the Mahalanobis distance, which accounts for the different scaling of the data, we get something closer to what we’d expect.

# Compute the (squared) Mahalanobis distances of individuals 2-4 from individual 1 with base R's mahalanobis()

mahalanobis(dat[-1,], dat[1,], cov(dat))
## [1] 4.91902314 0.05655527 4.42544987
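
Note that base R’s mahalanobis() returns squared distances; taking the square root puts them on the same scale as the dist() output above:

# Square roots of the squared Mahalanobis distances, comparable to dist() above
sqrt(mahalanobis(dat[-1, ], dat[1, ], cov(dat)))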

Now let’s try to run the matching procedure using the matchit() function. Let’s start with Euclidean distance, even if it may not be great.

# method: nearest (i.e. greedy 1:1)
# distance: euclidean
# data: lalonde (the data set we're working with)
# replace: TRUE (T) or FALSE (F) - whether to match with or without replacement
    # Note: if matching with replacement, greedy and optimal matching coincide,
    # so replace only needs to be specified when method = "nearest"
# ratio: how many control matches for each treated unit (i.e. k:1 matching)
# caliper: by default, the caliper width in standard units (i.e., Z-scores)
m.out0 <- matchit(treat ~ re74 + re75 + age + race,
                  method = "nearest", 
                  distance = "euclidean",
                  data = lalonde, 
                  replace = F, 
                  ratio = 1)

Assessing the matching

We can check how well the balancing has been done. The summary code below reports the following information (& more):

  • Means Treated: average value of that covariate for the treated group
  • Means Control: average value of that covariate for the control group
  • Std. Mean Diff (SMD): difference of means, standardized by the standard deviation of the treated group
  • Var. Ratio: ratio of variances in the treated and control groups. Compares the spread of the data
  • Rubin (2001) presents a rule of thumb that the SMD should be less than .25 and the variance ratio should be between .5 and 2
  • Max eCDF: Kolmogorov-Smirnov statistic; the maximum difference between the two empirical CDFs
?summary.matchit # Look in the Help tab (on the bottom right) for documentation on summary.matchit

# interactions: check interaction terms too? (T or F)
# un: show statistics for unmatched data as well (T or F)
summary(m.out0, interactions = F, un = F)
## 
## Call:
## matchit(formula = treat ~ re74 + re75 + age + race, data = lalonde, 
##     method = "nearest", distance = "euclidean", replace = F, 
##     ratio = 1)
## 
## Summary of Balance for Matched Data:
##            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## re74           2095.5737     2113.1168         -0.0036     1.1093    0.0124
## re75           1532.0553     1514.1484          0.0056     1.2864    0.0159
## age              25.8162       26.2486         -0.0604     0.4294    0.0851
## raceblack         0.8432        0.2757          1.5611          .    0.5676
## racehispan        0.0595        0.1568         -0.4114          .    0.0973
## racewhite         0.0973        0.5676         -1.5868          .    0.4703
##            eCDF Max Std. Pair Dist.
## re74         0.1568          0.0490
## re75         0.1459          0.0805
## age          0.1892          0.9730
## raceblack    0.5676          1.7395
## racehispan   0.0973          0.6857
## racewhite    0.4703          1.7327
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
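
To see what the Std. Mean Diff. column is computing, here is a minimal hand calculation for age. This is a sketch that assumes the standardization described in the bullet above (treated-group standard deviation, computed in the original sample), so it should roughly reproduce the -0.06 reported for age:

# Sketch: hand-compute the SMD for age in the matched data.
# With 1:1 matching and no replacement, all matching weights are 1,
# so plain group means are sufficient here.
md <- match.data(m.out0)
sd_treat <- sd(lalonde$age[lalonde$treat == 1])
(mean(md$age[md$treat == 1]) - mean(md$age[md$treat == 0])) / sd_treat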

Ask Yourself: How is the balance? How do the means compare between the treatment and control groups? What about SMD and variance ratio?

We can also visually assess the matches with different plots:

## Plots the density of all variables in which.xs
plot(m.out0, type = "density", which.xs = ~age + re74 + race, interactive = F)

## Plots the empirical CDF of all variables in which.xs
plot(m.out0, type = "ecdf", which.xs = ~age + re74, interactive = F)

Other options

Let’s try using propensity scores with calipers.

# distance: glm indicates a propensity score
m.out0 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
                  method = "nearest", distance = "glm",
                  data = lalonde, replace = F, caliper = .05, std.caliper = F,
                  ratio = 1)
summary(m.out0, interactions = F, un = F)

As an alternative, let’s try with coarsened exact matching.

# data: lalonde (the data set we're working with)
# replace: TRUE (T) or FALSE (F) - whether to match with or without replacement
    # Note: if matching with replacement, greedy and optimal matching coincide
# ratio: how many control matches for each treated unit
m.out1 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
                  method = "cem",
                  data = lalonde, replace = F, 
                  ratio = 1)

# We only match 65 of the original 185 treated units
summary(m.out1, interactions = F, un = F)

m.out2 <- matchit(treat ~ re74 + re75 + age + race + educ + married + nodegree,
                  method = "cem", 
                  grouping = list(race = list(c("black"), c("hispan", "white"))),
                  data = lalonde, replace = F, 
                  ratio = 1)

# Check how many of the 185 treated units are matched with this coarser race grouping
summary(m.out2, interactions = F, un = F)

# We could group hispanic/white together for the purposes of matching
# and not enforce matching on married, educ, nodegree
m.out2 <- matchit(treat ~ re74 + re75 + age + race,
                  method = "cem", 
                  grouping = list(race = list(c("black"), c("hispan", "white"))),
                  data = lalonde, replace = F, 
                  ratio = 1)
summary(m.out2, interactions = F, un = F, addlvariables = ~ married + nodegree + educ)

# Inspect the matched data; arranging by subclass puts matched units next to each other
match.data(m.out2) %>%
  arrange(subclass)

Estimating the effect

Given the matching (and assuming it is good enough), we can estimate the ATT. But we can do even better by combining regression + matching.

# propensity score matched
m.out0.data <- match.data(m.out0)
# CEM matched
m.out2.data <- match.data(m.out2)

# Fitting a linear model of the outcome on treatment works
# even when the matching weights are not all 1; we just pass them in via weights
fit1 <- lm(re78 ~ treat, data = m.out2.data, weights = weights)

# vcov: tells R to use robust standard errors
avg_comparisons(fit1, variables = "treat",
                vcov = "HC3",
                newdata = subset(m.out2.data, treat == 1),
                wts = "weights")
## 
##  Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
##      1469       1014 1.45    0.147 2.8  -517   3456
## 
## Term: treat
## Type:  response 
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
# Now combine matching with regression adjustment: add the covariates
# to the model, still passing in the matching weights
fit2 <- lm(re78 ~ treat + re74 + re75 + age + race + married + nodegree + educ,
           data = m.out2.data, weights = weights)

# vcov: tells R to use robust standard errors
avg_comparisons(fit2, variables = "treat",
                vcov = "HC3",
                newdata = subset(m.out2.data, treat == 1),
                wts = "weights")
## 
##  Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
##      1445        984 1.47    0.142 2.8  -484   3373
## 
## Term: treat
## Type:  response 
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
# Same covariate-adjusted model, now fit on the propensity-score matched data
fit3 <- lm(re78 ~ treat + re74 + re75 + age + race + married + nodegree + educ,
           data = m.out0.data, weights = weights)

# vcov: tells R to use robust standard errors
avg_comparisons(fit3, variables = "treat",
                vcov = "HC3",
                newdata = subset(m.out0.data, treat == 1),
                wts = "weights")
## 
##  Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
##      1296        942 1.38    0.169 2.6  -551   3142
## 
## Term: treat
## Type:  response 
## Comparison: mean(1) - mean(0)
## Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted

Ask yourself: What is the estimate and how do I interpret it?

Fit your own model

In the problem set, we ask you to do your own matching and explain your choices. Some guiding questions to help you are: