
Estimating IV estimators in practice

Now that we’ve seen some examples of good instruments, let’s take a look at how to estimate LATE using IVs.

We saw that we could estimate LATE (the causal effect of the treatment received among compliers) by first estimating the causal effect of the instrument on the outcome and the causal effect of the instrument on treatment received, and then dividing the former by the latter.

$$\text{LATE} = \frac{\text{Causal effect of the IV on the outcome}}{\text{Causal effect of the IV on treatment received}}$$

or, equivalently,

$$\text{LATE} = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}$$

This method of calculating the causal effect of the treatment received on the outcome is called the Wald estimator (after Abraham Wald). It's one of the simplest ways of estimating the causal effect. If the treatment and the encouragement are binary, we can easily calculate the mean differences and estimate LATE.
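With a binary instrument and a binary treatment, the Wald estimator is just a ratio of two mean differences. Here is a minimal sketch on synthetic data; the data-generating process, variable names, and the complier effect of 2 are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)            # randomized binary instrument (encouragement)
complier = rng.random(n) < 0.5       # half of the subjects are compliers
d = np.where(complier, z, 0)         # compliers take the treatment iff encouraged
y = 2.0 * d + rng.normal(0, 1, n)    # true effect among compliers is 2

# Wald estimator: difference in outcome means over difference in treatment means
late = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
```

With this many observations, `late` lands close to the true complier effect of 2.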

However, if the treatment and/or the instrument isn't binary, we can instead regress the outcome on the instrument and take the coefficient on the instrument ($\beta_{Y|Z}$). We then regress the treatment received on the instrument and take that coefficient ($\beta_{D|Z}$). Finally, we divide the first coefficient by the second to estimate LATE:

$$\text{LATE} = \frac{\beta_{Y|Z}}{\beta_{D|Z}}$$
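The ratio-of-coefficients version can be sketched the same way. Again, the data below are synthetic and the numbers (a first-stage slope of 0.5, a treatment effect of 2) are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)               # continuous instrument
d = 0.5 * z + rng.normal(size=n)     # treatment shifted by the instrument
y = 2.0 * d + rng.normal(size=n)     # outcome driven by the treatment

b_yz = np.polyfit(z, y, 1)[0]        # slope from regressing Y on Z
b_dz = np.polyfit(z, d, 1)[0]        # slope from regressing D on Z
late = b_yz / b_dz                   # close to the true effect of 2
```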

But wait! There’s an even simpler way to find the LATE estimator! 😎

Two-stage linear regression

Two-stage least squares, or simply 2SLS, consists of the following two stages:

  • The first stage is a regression of the treatment on the IV.
  • The second stage is a regression of the outcome on the fitted (predicted) values of the treatment variable from the first stage.

Hopefully, you remember the basics of linear regression. In the first stage, we run the following regression:

$$D_i = \alpha_0 + \alpha_1 Z_i + e_i$$

The subscript $i$ stands for subject $i$. $Z_i$ is the value of the instrument for subject $i$, and $e_i$ is the error term; we assume it has mean zero and constant variance. Through randomization, $Z_i$ and $e_i$ are independent of each other, so the ordinary least squares (OLS) estimate is unbiased. After we run this regression, we compute the fitted (predicted) values of $D_i$.

The instrument $Z_i$ needs to be correlated with $D_i$ but otherwise uncorrelated with $Y_i$. Because we only want to look at the variation in the outcome caused by exogenous variation in the treatment, we use the IV to separate out the endogenous part of the treatment. That is what the regression above does.

In the second stage, we regress the outcome variable on the fitted values of $D_i$ found in the previous step. What we're doing here is examining the effect of the exogenous variation in the treatment $D_i$ on the outcome $Y_i$. In other words, we run the following regression:

$$Y_i = \beta_0 + \beta_1 \hat{D}_i + v_i$$

where $\hat{D}_i$ denotes the fitted values of $D_i$. Again, we assume that the error term $v_i$ has mean zero and constant variance.

Using the exclusion restriction, we have assumed that $Z$ affects $Y$ only through $D$. Because $\hat{D}$ is a function (projection) of $Z$ alone, it is uncorrelated with the error term in the second stage. Therefore, the second regression is unbiased, $\beta_1$ is the causal effect of the treatment on the outcome, and therefore, it is the LATE estimator.
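The two stages can be sketched by hand on synthetic data. The unobserved confounder `u` below is invented for illustration: it biases a plain OLS regression of the outcome on the treatment, while the two-stage procedure recovers the true effect of 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u = rng.normal(size=n)                 # unobserved confounder
z = rng.normal(size=n)                 # instrument, independent of u
d = 0.5 * z + u + rng.normal(size=n)   # treatment, confounded by u
y = 2.0 * d + u + rng.normal(size=n)   # outcome, also confounded by u

ols = np.polyfit(d, y, 1)[0]           # biased upward: picks up u

# Stage 1: regress D on Z and keep the fitted values
a1 = np.polyfit(z, d, 1)
d_hat = np.polyval(a1, z)

# Stage 2: regress Y on the fitted values of D
beta1 = np.polyfit(d_hat, y, 1)[0]     # close to the true effect of 2
```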

When the treatment or instrument isn't binary, or when there are other covariates, 2SLS is the standard way to estimate LATE.

An example

This example is from David Card’s paper we mentioned in the previous lesson. If you recall, David Card was interested in the effect of education on wages. He used college proximity as his instrument.

The data set used contains 3,613 observations from the 1976 National Longitudinal Survey (NLS). More information about the variables can be found here. Let’s first load a slightly modified version of the data.

```r
# Data will be automatically downloaded from the web
# The data is in .csv form
card_1995 <- read.csv("https://bit.ly/card_1995")

# Checking the first five rows of the data
head(card_1995, 5)
```

```python
import pandas as pd

# Data will be automatically downloaded from the web
# The data is in .csv form
card_1995 = pd.read_csv("https://bit.ly/card_1995")
card_1995.head(5)
```

```stata
* import delimited directly downloads data from the web
* The data is in .csv form
import delimited https://bit.ly/card_1995
```

Education is measured in years of schooling; it includes all years leading up to college. Wages are in dollars and converted to logs. We know already that we have unobserved confounders, so a simple regression of wages on years of schooling will be biased even if we include observable confounders such as family background.

```r
lm(lwage ~ educ, data = card_1995)
```

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Fixing the dimensions
X = np.array(card_1995.educ).reshape(-1, 1)
y = np.array(card_1995.lwage).reshape(-1, 1)

# A fuller workflow would split the data into train/test sets,
# but that is out of scope for this lesson. You could simply add:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# regr = LinearRegression()
# regr.fit(X_train, y_train)
# print(regr.score(X_test, y_test))

regr = LinearRegression()
regr.fit(X, y)
regr.coef_
```

```stata
regress lwage educ, robust
```

If you run the code above, you will see that the coefficient on education is 0.052, which means for every year of schooling, wages increase by 5.2 percent (the increase is in percentage terms because the dependent variable is in logs).

Because of the bias due to unobserved confounding, Card uses college proximity as an instrument. nearc4 is a binary variable that is 1 if the individual grew up in an area with a 4-year college and 0 otherwise.

As described above, we start with a first-stage linear regression of the treatment on the instrument and find the fitted (predicted) values of the treatment.

```r
# First-stage regression
model_1s <- lm(educ ~ nearc4, data = card_1995)

# Estimating the predicted values of education
card_1995$pr_education <- model_1s$fitted.values
```

```python
# Changing the X variable
X = np.array(card_1995.nearc4).reshape(-1, 1)
y = np.array(card_1995.educ).reshape(-1, 1)

# First-stage regression
model_1s = LinearRegression()
model_1s.fit(X, y)

# Estimating the predicted values of education
card_1995['pr_education'] = model_1s.predict(X)
```

```stata
* First-stage regression
regress educ nearc4, robust

* Estimating the predicted values of education
predict pr_education, xb
```

In the second-stage regression, we regress the outcome on the predicted values of the treatment.

```r
# Second-stage regression
lm(lwage ~ pr_education, data = card_1995)
```

```python
# Second-stage regression
X = np.array(card_1995.pr_education).reshape(-1, 1)
y = np.array(card_1995.lwage).reshape(-1, 1)

model_2s = LinearRegression()
model_2s.fit(X, y)
model_2s.coef_
```

```stata
* Second-stage regression
regress lwage pr_education, robust
```

And just like that, the coefficient on education changes significantly: it goes from 0.052 to 0.188. This change likely reflects the bias that existed in the absence of an instrument. Note that the regressions above are illustrative; in practice, we should also include some of the observed confounders.

If you think performing the two stages one by one is painful, then you're in luck: most software packages provide functions that perform instrumental variable regression in one line. Here is an example:

```r
# ivreg() comes from the AER package
library(AER)

# Using ivreg to perform both stages in one command
ivreg(lwage ~ educ | nearc4, data = card_1995)
```

```python
# Some Python libraries perform IV calculations, but they need to be
# installed separately. If you want, you can run "!pip install linearmodels"
# and do the following:
from linearmodels.iv import IV2SLS

iv = IV2SLS.from_formula("lwage ~ 1 + [educ ~ nearc4]", card_1995).fit()
iv.summary.tables[1]

# If you don't want to install anything, you can just paste this function
# and use it next time you don't want to do everything manually.
from sklearn.linear_model import LinearRegression
import numpy as np

def ivreg(y, X, iv, data):
    '''
    Performs a two-stage regression and returns the coefficient.
    y:    the name of the outcome variable, as a string.
    X:    the name of the treatment variable, as a string.
    iv:   the name of the IV, as a string.
    data: the pandas DataFrame with your data.
    '''
    X1 = np.array(data[iv]).reshape(-1, 1)
    y1 = np.array(data[X]).reshape(-1, 1)

    # First-stage regression
    model_1s = LinearRegression()
    model_1s.fit(X1, y1)
    data['predicted'] = model_1s.predict(X1)

    # Second-stage regression
    X2 = np.array(data['predicted']).reshape(-1, 1)
    y2 = np.array(data[y]).reshape(-1, 1)
    model_2s = LinearRegression()
    model_2s.fit(X2, y2)
    return model_2s.coef_

# Returns array([[0.1880626]])
ivreg('lwage', 'educ', 'nearc4', card_1995)
```

```stata
* Using ivregress to perform both stages in one command
ivregress 2sls lwage (educ = nearc4), robust
```

Notice you’ll get the exact same coefficients on education as you did before.

Next Lesson

Critique of IV
