
Critique of IV

Instrumental variables (IVs) are appealing mainly because they provide a workaround for dealing with unobserved confounders. If IVs were easy to find and estimate, causal inference would be a lot easier, but unfortunately, this isn’t the case. In this lesson, we’ll explore some of the challenges and critiques of using IVs to estimate causal effects.

What are some of the critiques of IVs? Martens et al., Hernán and Robins, Heckman and Urzúa, and Deaton are all relevant reads if you want a thorough treatment, but here is a summary of four major critiques that these authors put forward.

Assumptions 2 and 3 are unverifiable

The most common threat to the IV approach is the selection of a bad instrument. The easiest way to end up with a bad instrument is to choose something that doesn’t actually satisfy the exclusion restriction and the non-confounding assumption. These two assumptions are a major vulnerability in IV analysis because they cannot be verified empirically. If these assumptions are not met, IV estimates will be biased, and the resulting bias may be larger than if you hadn’t used the IV in the first place.

For instance, in the returns to education example we discussed, college proximity was used as an instrument because it was assumed to be correlated with the outcome (wages) only through the treatment variable (education). But what if households with certain characteristics are more likely to live close to a college campus, as well as more likely to have children with higher or lower wages? And what if those characteristics aren’t measurable?

Assumption 3 would be violated.

Because of the difficulty in verifying these assumptions, some researchers argue in favor of estimating upper and lower bounds for the causal effect. Instead of reporting a single number as the estimate of the local average causal effect, the researcher provides a range of estimates that depends on assumed levels of correlation between the IV and the outcome variable. If the bounds are very wide, that tends to signal lower confidence in the IV.
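To make the bounding idea concrete, here is a minimal sketch with made-up numbers (not estimates from any real study): we sweep over assumed sizes of a direct effect of the instrument on the outcome, and each assumption implies a different IV estimate.

```python
# Hypothetical numbers, for illustration only.
itt = 0.20          # E(Y|Z=1) - E(Y|Z=0), effect of the instrument on the outcome
first_stage = 0.10  # E(D|Z=1) - E(D|Z=0), effect of the instrument on treatment

# Allow for a direct effect of the instrument on the outcome of size delta
# (a violation of the exclusion restriction). Each assumed delta implies a
# different IV estimate; sweeping delta traces out a band of estimates.
for delta in (-0.05, 0.0, 0.05):
    late = (itt - delta) / first_stage
    print(f"assumed direct effect {delta:+.2f} -> IV estimate {late:.2f}")
```

Even a small assumed violation moves the estimate substantially here, precisely because the first stage is small.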

LATE non est ATE (LATE is not ATE)

Heckman and Urzúa, along with Deaton, point out that the LATE estimator is often very difficult to interpret and hard to justify. Remember that LATE measures the causal effect for compliers only. As Deaton describes it, LATE is like looking for causal effects where the light is bright enough to see, but what we find isn’t really what we’re after.

The complier group, as we saw, is a hypothetical concept. We don’t actually know who is a complier and who is not, and we have to assume monotonicity to make use of the concept. It’s only in an exceptional case that LATE is equal to the average treatment effect among the treated (ATT), the case where there are no always-takers.
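The relationship between LATE and ATT can be checked on simulated data. The sketch below uses a hypothetical population with no always-takers (the exceptional case described above), so every treated unit is a complier and the two quantities coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population with only compliers and never-takers
# (i.e., the exceptional case with no always-takers).
is_complier = rng.random(n) < 0.3
z = rng.integers(0, 2, size=n)          # randomized instrument
d = is_complier & (z == 1)              # treatment received

# Potential outcomes: treatment raises the outcome by 2 for compliers only
y0 = rng.normal(0, 1, size=n)
y1 = y0 + np.where(is_complier, 2.0, 0.0)

late = y1[is_complier].mean() - y0[is_complier].mean()
att = y1[d].mean() - y0[d].mean()       # treated units are all compliers here
print(late, att)                        # both equal 2.0 (up to rounding)
```

If we added always-takers with a different treatment effect, the treated group would mix the two strata and ATT would drift away from LATE.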

When the exceptional case is what you want

Guido Imbens, a Stanford economist, argues that in some cases, LATE might be exactly what we’re looking for, depending on the policy question. He uses a paper by Joshua Angrist on the causal effect of veteran status on earnings as an example. In this paper, Angrist uses the draft during the Vietnam war as an instrument. In the early 1970s, draft numbers were randomly assigned to young men in the United States, and the numbers were highly correlated with who ended up serving in the military.

The estimates using draft lottery numbers as the instrument show the causal effect on only the draftees who would not have joined the military otherwise, and not other people who voluntarily served in the military during that time. This distinction is important because a significant share of those who served during the war were true volunteers.

Imbens argues that if the policy is focused on the experience of those who involuntarily served during the war, then LATE is appropriate. However, for a policy question where the focus is on all veterans (such as future veterans who were never subject to a draft lottery), LATE isn’t really what we’re looking for.

Weak instruments are bad

The strength of an IV is measured by its ability to predict treatment received, in the same way that the strength of an encouragement is measured by how strongly it is correlated with treatment received. A common critique of IVs is that many of the instruments that end up being used are weak instruments.

In the education-earning example, if college proximity and a college degree (or years of schooling) are weakly correlated, then college proximity is said to be a weak instrument.

The strength of an IV is something we can measure by estimating the correlation between the instrument and the treatment. We can also measure it through the share of compliers in our data:

E(D|Z=1) - E(D|Z=0)

A weak instrument is associated with a share closer to 0, whereas a strong instrument is associated with a share closer to 1, because a weak instrument means a lower correlation between Z and D. If LATE is all about the compliers, then we can see why a weak instrument is problematic. A small share of compliers means that the estimate based on our IV is only giving us information on a small sliver of our sample. The estimates of LATE will have large variances, and the confidence intervals of the estimates will likely be wide.

Also, because E(D|Z=1) - E(D|Z=0) is the denominator of the Wald estimator, a weak instrument leads to a small denominator for the LATE estimator. If there are biases in the numerator of the Wald estimator (such as biases arising from violations of assumptions 2 and 3), they can be inflated by the small denominator, and this will contribute further to the bias of the IV estimate.
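A small numeric sketch (made-up numbers, for illustration only) shows how a small denominator magnifies a fixed bias in the numerator of the Wald estimator:

```python
# Hypothetical numbers, for illustration only.
true_itt = 0.10   # true E(Y|Z=1) - E(Y|Z=0)
bias = 0.02       # bias in the numerator, e.g., from an exclusion violation

for complier_share in (0.5, 0.05):      # strong vs. weak instrument
    wald = (true_itt + bias) / complier_share
    unbiased = true_itt / complier_share
    print(f"complier share {complier_share}: "
          f"estimate {wald:.2f}, bias in LATE {wald - unbiased:.2f}")
```

The same 0.02 bias in the numerator translates into a bias of 0.04 in the LATE estimate with a strong instrument, but 0.40 with a weak one.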

So the check is to see if the instrument has a high correlation with the treatment received.

In the Vietnam draft lottery example, we know that there were a lot of volunteers who served in the military but the lottery numbers were strongly associated with who served in the military. Do you think the instrument used in the paper was strong or weak?
Strong
Weak
We do not have enough information

Back to the college proximity example

Remember in the education-earning example, we used college proximity as an instrument. However, we didn’t really check to see if the instrument we used was strong or weak. One quick check is to see how strongly the instrument and the treatment variables are correlated.

Let’s do this using software:

```r
# Finding the correlation between the instrument and the treatment
cor(card_1995$nearc4, card_1995$educ)
```

```python
import pandas as pd
import numpy as np

# This is the data from the previous lesson
card_1995 = pd.read_csv("https://bit.ly/card_1995")

# Finding the correlation between the instrument and the treatment
np.corrcoef(card_1995.nearc4, card_1995.educ)
# array([[1.        , 0.14424021],
#        [0.14424021, 1.        ]])
```

```stata
* Finding the correlation matrix between the instrument and the treatment
cor educ nearc4
```

As you can see, the correlation coefficient between college proximity and years of schooling is 0.14, which is not that high. But how do we know what counts as high and what counts as low?

Luckily, there is a systematic way of testing for weak instruments.

In the first-stage regression, we can test the hypothesis that the coefficient on the instrument is 0. If there are multiple instruments in the study (yes, that’s possible), then we can use joint hypothesis testing. We can then calculate an F-statistic from the null test.

It’s shown that 1/{F-statistic} is a good approximation of the bias between the IV estimate and the OLS estimate. Generally, the F-statistic for joint significance testing of the instrument(s) in the first-stage regression should be larger than 10. The larger the F-statistic, the smaller the bias between the IV estimate and the OLS estimate.

Let’s find this F-statistic in the example above.

```r
# Again, the first-stage regression is:
model_1s <- lm(educ ~ nearc4, data = card_1995)

# And the first-stage regression without the instrument is:
model_1s_short <- lm(educ ~ 1, data = card_1995)

# Function waldtest() from the lmtest package for a simple F-test
library(lmtest)
waldtest(model_1s, model_1s_short)$F[2]

# The following is an F-test robust to heteroskedasticity
library(sandwich)
waldtest(model_1s, model_1s_short, vcov = vcovHC(model_1s, type = "HC0"))$F[2]
```

```python
from sklearn.linear_model import LinearRegression

X = np.array(card_1995.nearc4).reshape(-1, 1)
y = np.array(card_1995.educ).reshape(-1, 1)

# Again, the first-stage regression is:
model_1s = LinearRegression()
model_1s.fit(X, y)

# And the first-stage regression without the instrument is:
model_1s_short = LinearRegression()
model_1s_short.fit(np.ones(len(y)).reshape(-1, 1), y)

# Function waldtest() for a simple F-test will be added soon...
```

```stata
* Again, the first-stage regression is:
regress educ nearc4, robust

* F-test
test nearc4
```

We can also do the following:

```r
# Using ivreg (from the AER package) to perform both stages in one command
library(AER)
iv2 <- ivreg(lwage ~ educ | nearc4, data = card_1995)
summary(iv2, vcov = sandwich, diagnostics = TRUE)
```

```python
# Python codes will be added soon
```

```stata
* Using ivregress to perform both stages in one command
ivregress 2sls lwage (educ = nearc4), robust
estat firststage
```

This returns a similar F-statistic. From the hypothesis testing above, we find that the F-statistic is roughly 60, well above 10, indicating that the instrument is not weak.
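For intuition about what these one-command tools compute: with a single binary instrument and no covariates, the 2SLS estimate reduces to the Wald ratio. The sketch below checks this on simulated stand-in data (not the card_1995 data; the true effect of 0.08 is made up), and also shows why the naive OLS slope is misleading here.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

z = rng.integers(0, 2, size=n)                    # binary instrument
u = rng.normal(0, 1, size=n)                      # unobserved confounder
d = 13 + 0.6 * z + u + rng.normal(0, 2, size=n)   # treatment (e.g., schooling)
y = 1.0 + 0.08 * d + 0.5 * u + rng.normal(0, 1, size=n)  # outcome

# With one binary instrument and no covariates, 2SLS reduces to the
# Wald ratio: reduced form divided by first stage.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
print(reduced_form / first_stage)   # close to the true effect 0.08

# The naive OLS slope is pushed upward by the confounder u
ols_slope = np.cov(d, y)[0, 1] / np.var(d)
print(ols_slope)                    # larger than 0.08
```

The Wald ratio recovers something near the true effect despite the unobserved confounder, while the naive regression does not.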

Next Lesson

Quasi-experimental designs
