
Estimating IV estimators in practice

Now that we’ve seen some examples of good instruments, let’s take a look at how to estimate LATE using IVs.

We saw that we could estimate LATE (the causal effect of the treatment received among compliers) by first estimating the causal effect of the instrument on the outcome and the causal effect of the instrument on treatment received, and then dividing the former by the latter.

$$\text{LATE} = \frac{\text{Causal effect of the IV on the outcome}}{\text{Causal effect of the IV on treatment received}}$$

or, equivalently,

$$\text{LATE} = \frac{E(Y \mid Z=1) - E(Y \mid Z=0)}{E(D \mid Z=1) - E(D \mid Z=0)}$$

This method of calculating the causal effect of the treatment received on the outcome is called the Wald estimator (after Abraham Wald). It's one of the simplest ways of estimating the causal effect. If the treatment and the encouragement are binary, we can easily calculate the mean differences and estimate LATE.
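With a binary instrument and a binary treatment, the Wald estimator is just a ratio of two mean differences. Here is a minimal sketch on synthetic data; the data-generating process, variable names, and the complier effect of 2 are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, n)            # randomized binary instrument (encouragement)
complier = rng.random(n) < 0.5       # half of the subjects are compliers
d = np.where(complier, z, 0)         # compliers take the treatment iff encouraged
y = 2.0 * d + rng.normal(0, 1, n)    # true effect among compliers is 2

# Wald estimator: difference in outcome means over difference in treatment means
late = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
```

With this many observations, `late` lands close to the true complier effect of 2.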

However, if the treatment and/or the instrument isn't binary, we can instead regress the outcome on the instrument and take the coefficient on the instrument ($\beta_{Y|Z}$). We then regress the treatment received on the instrument and take that coefficient ($\beta_{D|Z}$). Finally, we divide the first coefficient by the second to estimate LATE:

$$\text{LATE} = \frac{\beta_{Y|Z}}{\beta_{D|Z}}$$
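The ratio-of-coefficients version can be sketched the same way. Again, the data below are synthetic and the numbers (a first-stage slope of 0.5, a treatment effect of 2) are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)               # continuous instrument
d = 0.5 * z + rng.normal(size=n)     # treatment shifted by the instrument
y = 2.0 * d + rng.normal(size=n)     # outcome driven by the treatment

b_yz = np.polyfit(z, y, 1)[0]        # slope from regressing Y on Z
b_dz = np.polyfit(z, d, 1)[0]        # slope from regressing D on Z
late = b_yz / b_dz                   # close to the true effect of 2
```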

But wait! There’s an even simpler way to find the LATE estimator! 😎

Two-stage linear regression

Two-stage least squares, or simply 2SLS, consists of the following two stages:

  • The first stage is a regression of the treatment on the IV.
  • The second stage is a regression of the outcome on the fitted (predicted) values of the treatment variable from the first stage.

Hopefully, you remember the basics of linear regression. In the first stage, we run the following regression:

$$D_i = \alpha_0 + \alpha_1 Z_i + e_i$$

The subscript $i$ stands for subject $i$. $Z_i$ is the value of the instrument for subject $i$, and $e_i$ is the error term; we assume it has mean zero and constant variance. Through randomization, $Z_i$ and $e_i$ are independent of each other, so the ordinary least squares (OLS) estimate is unbiased. After we run this regression, we compute the fitted (predicted) values of $D_i$.

The instrument $Z_i$ needs to be correlated with $D_i$ but otherwise uncorrelated with $Y_i$. Because we only want to look at the variation in the outcome caused by exogenous variation in the treatment, we use the IV to separate out the endogenous part of the treatment. That is what the regression above does.

In the second stage, we regress the outcome variable on the fitted values of $D_i$ found in the previous step. What we're doing here is examining the effect of the exogenous variation in the treatment $D_i$ on the outcome $Y_i$. In other words, we run the following regression:

$$Y_i = \beta_0 + \beta_1 \hat{D}_i + v_i$$

where $\hat{D}_i$ denotes the fitted values of $D_i$. Again, we assume that the error term $v_i$ has mean zero and constant variance.

Using the exclusion restriction, we have assumed that $Z$ affects $Y$ only through $D$. Because $\hat{D}$ is a function (projection) of $Z$ alone, it is uncorrelated with the error term in the second stage. Therefore, the second regression is unbiased, $\beta_1$ is the causal effect of the treatment on the outcome, and therefore, it is the LATE estimator.
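The two stages can be sketched by hand on synthetic data. The unobserved confounder `u` below is invented for illustration: it biases a plain OLS regression of the outcome on the treatment, while the two-stage procedure recovers the true effect of 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u = rng.normal(size=n)                 # unobserved confounder
z = rng.normal(size=n)                 # instrument, independent of u
d = 0.5 * z + u + rng.normal(size=n)   # treatment, confounded by u
y = 2.0 * d + u + rng.normal(size=n)   # outcome, also confounded by u

ols = np.polyfit(d, y, 1)[0]           # biased upward: picks up u

# Stage 1: regress D on Z and keep the fitted values
a1 = np.polyfit(z, d, 1)
d_hat = np.polyval(a1, z)

# Stage 2: regress Y on the fitted values of D
beta1 = np.polyfit(d_hat, y, 1)[0]     # close to the true effect of 2
```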

When the treatment or instrument isn't binary, or when there are other covariates, 2SLS is the standard way to estimate LATE.

An example

This example is from David Card’s paper we mentioned in the previous lesson. If you recall, David Card was interested in the effect of education on wages. He used college proximity as his instrument.

The data set used contains 3,613 observations from the 1976 National Longitudinal Survey (NLS). More information about the variables can be found here. Let’s first load a slightly modified version of the data.

```r
# Data will be automatically downloaded from the web
# The data is in .csv form
card_1995 <- read.csv("https://bit.ly/card_1995")

# Checking the first five rows of the data
head(card_1995, 5)
```

```python
import pandas as pd

# Data will be automatically downloaded from the web
# The data is in .csv form
card_1995 = pd.read_csv("https://bit.ly/card_1995")
card_1995.head(5)
```

```stata
* import delimited directly downloads data from the web
* The data is in .csv form
import delimited https://bit.ly/card_1995
```

Education is measured in years of schooling; it includes all years leading up to college. Wages are in dollars and converted to logs. We know already that we have unobserved confounders, so a simple regression of wages on years of schooling will be biased even if we include observable confounders such as family background.

```r
lm(lwage ~ educ, data = card_1995)
```

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Fixing the dimensions
X = np.array(card_1995.educ).reshape(-1, 1)
y = np.array(card_1995.lwage).reshape(-1, 1)

# A fuller workflow would split the data into train/test sets,
# but that is out of scope for this lesson. You could simply add:
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# regr = LinearRegression()
# regr.fit(X_train, y_train)
# print(regr.score(X_test, y_test))

regr = LinearRegression()
regr.fit(X, y)
regr.coef_
```

```stata
regress lwage educ, robust
```

If you run the code above, you will see that the coefficient on education is 0.052, which means for every year of schooling, wages increase by 5.2 percent (the increase is in percentage terms because the dependent variable is in logs).

Because of the bias due to unobserved confounding, Card uses college proximity as an instrument. nearc4 is a binary variable that is 1 if the individual grew up in an area with a 4-year college and 0 otherwise.

As described above, we start with a first-stage linear regression of the treatment on the instrument and find the fitted (predicted) values of the treatment.

```r
# First-stage regression
model_1s <- lm(educ ~ nearc4, data = card_1995)

# Estimating the predicted values of education
card_1995$pr_education <- model_1s$fitted.values
```

```python
# Changing the X variable
X = np.array(card_1995.nearc4).reshape(-1, 1)
y = np.array(card_1995.educ).reshape(-1, 1)

# First-stage regression
model_1s = LinearRegression()
model_1s.fit(X, y)

# Estimating the predicted values of education
card_1995['pr_education'] = model_1s.predict(X)
```

```stata
* First-stage regression
regress educ nearc4, robust

* Estimating the predicted values of education
predict pr_education, xb
```

In the second-stage regression, we regress the outcome on the predicted values of the treatment.

```r
# Second-stage regression
lm(lwage ~ pr_education, data = card_1995)
```

```python
# Second-stage regression
X = np.array(card_1995.pr_education).reshape(-1, 1)
y = np.array(card_1995.lwage).reshape(-1, 1)

model_2s = LinearRegression()
model_2s.fit(X, y)
model_2s.coef_
```

```stata
* Second-stage regression
regress lwage pr_education, robust
```

And just like that, the coefficient on education changes significantly: it goes from 0.052 to 0.188. This change likely reflects the bias that existed in the absence of an instrument. Note that the regressions above are illustrative; in practice, we should also include some of the observed confounders.

If you think performing the two stages one by one is painful, then you're in luck: most software packages provide functions that perform instrumental variable regression in one line. Here is an example:

```r
# ivreg() comes from the AER package
library(AER)

# Using ivreg to perform both stages in one command
ivreg(lwage ~ educ | nearc4, data = card_1995)
```

```python
# Some Python libraries perform IV calculations, but they need to be
# installed separately. If you want, you can run "!pip install linearmodels"
# and do the following:
from linearmodels.iv import IV2SLS

iv = IV2SLS.from_formula("lwage ~ 1 + [educ ~ nearc4]", card_1995).fit()
iv.summary.tables[1]

# If you don't want to install anything, you can just paste this function
# and use it next time you don't want to do everything manually.
from sklearn.linear_model import LinearRegression
import numpy as np

def ivreg(y, X, iv, data):
    '''
    Performs a two-stage regression and returns the coefficient.
    y:    the name of the outcome variable, as a string.
    X:    the name of the treatment variable, as a string.
    iv:   the name of the IV, as a string.
    data: the pandas DataFrame with your data.
    '''
    X1 = np.array(data[iv]).reshape(-1, 1)
    y1 = np.array(data[X]).reshape(-1, 1)

    # First-stage regression
    model_1s = LinearRegression()
    model_1s.fit(X1, y1)
    data['predicted'] = model_1s.predict(X1)

    # Second-stage regression
    X2 = np.array(data['predicted']).reshape(-1, 1)
    y2 = np.array(data[y]).reshape(-1, 1)
    model_2s = LinearRegression()
    model_2s.fit(X2, y2)
    return model_2s.coef_

# Returns array([[0.1880626]])
ivreg('lwage', 'educ', 'nearc4', card_1995)
```

```stata
* Using ivregress to perform both stages in one command
ivregress 2sls lwage (educ = nearc4), robust
```

Notice you’ll get the exact same coefficients on education as you did before.

Next Lesson

Critique of IV
