Spatial Clustering of Standard Errors in Stata | Conley SEs


Want to learn how to spatially cluster standard errors at different radii? Well, you’ve come to the right place. (You can find the Stata code and an example table at the end.)

When working on research papers, an important decision often arises: whether and how to cluster the standard errors. Abadie, Athey, Imbens, and Wooldridge address this question in a recent paper, “When Should You Adjust Standard Errors for Clustering?”.

They identify two reasons to cluster: sampling design (i.e., the sample was drawn from the population using clustered sampling) and experimental design (i.e., treatment is assigned to clusters of units).

In practice, it is common to cluster standard errors at the level of the treatment. For example, if the treatment is at the village level or state level, we often cluster errors at that level.

When working with spatial or remote sensing data (e.g., weather, internet, crime), another problem often arises that requires clustering standard errors differently from the traditional way: spatial dependence across units, meaning that nearby units are more similar than distant ones, or that the present values of the spatial units reflect characteristics of those same units in the recent or distant past.

Let’s talk about how to spatially cluster standard errors in Stata following Conley (1999) and the challenges encountered in an instrumental variable (IV) context.

What is Spatial Dependence & Why is it an Issue?

If you visually explore data (e.g., with a heatmap) on variables such as crime, income, crop production, population density, rainfall, or internet connectivity, you might notice clustering in the values across administrative units.

Figure 1: Spatial Distribution of Income by County in the US


Source: U.S. Census Bureau

Figure 2: US Crime Numbers by County


These maps suggest that there is an issue of spatial dependence across units, which violates the important assumption that error terms are independently and identically distributed (i.i.d.).

This reduces the effective number of independent observations, which may inflate the t-statistics and distort the significance levels.

Donaldson and Storeygard (2016) point out that it matters whether the issue lies in the outcome variable or in the independent variable of interest.

The violation of the i.i.d. assumption occurs when the spatial dependence is in the dependent variable.

A 2019 paper by Morgan Kelly examines 27 persistence-regression papers in four top journals (The American Economic Review, Econometrica, The Journal of Political Economy, and The Quarterly Journal of Economics) to assess the degree of spatial correlation in their results.


The author finds that when the outcome is replaced with spatial noise, only around one-quarter of the studies still deliver robust results.

I have observed that few papers implement these spatially adjusted standard errors, even when their outcomes display patterns of spatial dependence.

The advice: first run a Moran’s I test to assess the presence of spatial clustering in the error terms, and then spatially cluster the standard errors following Conley (1999) if you reject the null hypothesis of no spatial correlation in the residuals.
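A minimal sketch of this diagnostic, assuming Stata 15+’s Sp suite, a single cross-section (one observation per unit), and hypothetical variable names (hhid1 for the unit ID, lon_dd_mod and lat_dd_mod for the coordinates):

spset hhid1, coord(lon_dd_mod lat_dd_mod) // declare the point data as spatial (hypothetical ID and coordinate variables)
spmatrix create idistance W // build an inverse-distance spatial weighting matrix
regress logvalue_crop_production treated_twice // fit the model of interest
estat moran, errorlag(W) // Moran's test; H0: no spatial correlation in the residuals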

The next question is: “How do I implement these Conley standard errors?” Let’s discuss how to do it in Stata.

How to Spatially Cluster the Standard Errors at Different Cutoff Distances in Stata

One way studies often account for spatial dependence is by calculating the adjusted standard errors using the procedure developed by Timothy Conley (1999), which relies on a Generalized Method of Moments approach and computes a variance-covariance matrix with spatial weights.

These standard errors account for heteroscedasticity, unit-specific serial correlation, and cross-sectional spatial correlation.
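In stylized form (a sketch of the idea rather than Conley’s exact notation), the estimator replaces the usual middle term of the sandwich variance with a double sum that weights residual cross-products by a kernel that decays with distance:

$$\widehat{V} = (X'X)^{-1}\Big[\sum_i \sum_j K(d_{ij})\, x_i x_j'\, \hat{e}_i \hat{e}_j\Big](X'X)^{-1}, \qquad K(d_{ij}) = \max\Big(0,\ 1 - \frac{d_{ij}}{\bar{d}}\Big),$$

where $d_{ij}$ is the distance between units $i$ and $j$ and $\bar{d}$ is the cutoff radius. With a Bartlett-type kernel like the one shown, pairs of units farther apart than the cutoff receive zero weight and the estimate stays positive semi-definite.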

Thanks to the incredible work of Solomon Hsiang (2010) and Thiemo Fetzer (2014), there is a way to implement these spatially adjusted errors in Stata.

In a Linear Regression Model

Consider a (hypothetical) shock, such as a hurricane or a flood, affecting part of a country, and suppose we are interested in identifying its impact on crop production or crime.

Suppose you have a 5-year panel, but the shock occurred after the third wave. Our difference-in-differences (DiD) specification could be:

Y = a + b*treat + c*post + d*treat*post + e*X + u,

where d represents the impact of the shock on our outcome Y, X is a set of controls, and u is the error term. Treat is a dummy equal to 1 if the village was flooded and 0 otherwise; post takes the value of 1 for waves 4 and 5 and 0 otherwise.

If the treatment is constructed at the village level, for example, it is common in the literature to cluster standard errors at that level. In Stata, it would be:

reghdfe outcome treatment post post#treatment controls, cluster(village) a(village) // a(village) absorbs village fixed effects; reghdfe is available via ssc install reghdfe

However, a shock such as flooding is not always confined within a village’s administrative boundaries, and a heatmap of crop production may also show local correlation among nearby units.

So while we may find a significant negative impact of the shock on crop production when clustering at the village level, it is likely that computing spatially adjusted standard errors at cutoff radii larger than the village weakens or even eliminates that impact.

Provided we have the centroids or geocoordinates of the villages in question, the Conley-adjusted standard errors can be implemented in Stata as follows:


ols_spatial_HAC outcome treatment post post#treatment controls, lat(villagelat) lon(villagelon) timevar(year) panelvar(villageID) dist(300) lag(2) // villagelat and villagelon hold the village coordinates

In the specification above, I assume a cutoff radius of 300 kilometers (i.e., the spatial correlation between the error terms of two units is assumed to be zero beyond 300 km), and the temporal autocorrelation is cut off after 2 time periods.

One issue with Hsiang’s code is that Stata becomes slow when adding many controls. Fetzer’s reg2hdfespatial adapts Hsiang’s routine to absorb two sets of fixed effects, which speeds up estimation. An example is:

reg2hdfespatial outcome treatment post post#treatment controls, lat(villagelat) lon(villagelon) timevar(year) panelvar(villageID) dist(300) lag(2)

In an Instrumental Variable Specification

During my time as a World Bank Robert S. McNamara Fellow, one of the research projects I worked on relied on an instrumental variable strategy for identification. The issue was that the instrument also showed patterns of spatial dependence.

The codes developed by Hsiang and Fetzer do not apply to the IV setting.

Tim Foreman adapted them for the IV setting in a command called “spatial_hac_iv”.

An example would be:

spatial_hac_iv outcome controls (treatment = instrument), id(villageID) time(year) latitude(villagelat) longitude(villagelon) dist(300) lag(2)

An alternative to this command is the acreg command. An example is:

acreg outcome controls (treatment = instrument), id(villageID) time(year) latitude(villagelat) longitude(villagelon) dist(300) lag(2) spatial hac pfe1(villageID) pfe2(year)

Note: pfe1() and pfe2() partial out the village and year fixed effects.
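Note also that acreg needs to be installed first; it is available from SSC (spatial_hac_iv, to my knowledge, is distributed through the author’s own repository rather than SSC):

ssc install acreg // install the acreg package from SSC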

Challenges with the IV Setting

The main challenge I encountered with this adapted code is that it becomes very slow as the number of observations increases.

For example, on a sample of 1,000 observations it took about 1 minute to produce results; on a sample of 5,000 observations it took 30 minutes; and on a sample of 100,000 observations, my relatively “powerful” laptop crashed after days of running.

What is the Appropriate Radius Cutoff?

Another question could be: “What is the recommended cutoff distance I should use when implementing these spatially adjusted standard errors?” There does not seem to be a consensus in the literature about the exact number, and different papers have used different cutoffs. For example, Nunn and Wantchekon (2011) examine how the historical slave trade is related to the level of trust in African countries, which may affect their current economic development.

They use a cutoff of about 3 degrees for Africa (roughly 330 km, since one degree of latitude spans about 111 km). The authors find standard errors similar to the traditional standard errors clustered at the ethnicity level, or clustered at both the ethnic-group and district levels. Hsiang (2010) uses a cutoff distance of 300 km.

My suggestion: implement different cutoffs in the robustness tests to show at what cutoff distance your significance holds and at what cutoff distance the results start becoming weak or insignificant, as shown below.


Code to Spatially Cluster & Table

** http://www.trfetzer.com/wp-content/uploads/reg2hdfespatial-id1id2.ado // make sure to install this ado file
ssc install tmpdir // install this dependency
ssc install reg2hdfe // reg2hdfespatial builds on the reg2hdfe package

gl controls erosion_dummy soil01 soil02 soil03 days_past_floods_10yrs clim01 clim05 land02 landcover3 land03_agroeco_zone2 terrain_roughness3 //put controls in a global
cd "$tables/" // specify the folder where the tables will be exported

// let's start with a 10km radius.
*==== Distance cutoff 10KM======================================
eststo spatialreg_10: reg2hdfespatial logvalue_crop_production treated_twice post post_treated_twice $controls, lat(lat_dd_mod) lon(lon_dd_mod) timevar(yearcat) panelvar(hhid1) dist(10) lag(2) star //main model
estadd scalar Obs= e(N) // put # of obs in a scalar called Obs
sum value_crop_production if !missing(value_crop_production) //summarizing the outcome variable
estadd scalar MP=r(mean) // putting the mean value in a scalar called MP
quietly estadd local yearfe "Yes", replace
quietly estadd local clusterfe "Yes", replace
quietly estadd local controls "Yes", replace
quietly estadd local distcut "10", replace

*==== Distance cutoff 50KM======================================
eststo spatialreg_50: reg2hdfespatial logvalue_crop_production treated_twice post post_treated_twice $controls , lat(lat_dd_mod) lon(lon_dd_mod) timevar(yearcat) panelvar(hhid1) dist(50) lag(2) star
estadd scalar Obs= e(N)
sum value_crop_production if !missing(value_crop_production)
estadd scalar MP=r(mean)
quietly estadd local yearfe "Yes", replace
quietly estadd local clusterfe "Yes", replace
quietly estadd local controls "Yes", replace
quietly estadd local distcut "50", replace

*==== Distance cutoff 100KM======================================
eststo spatialreg_100: reg2hdfespatial logvalue_crop_production treated_twice post post_treated_twice $controls , lat(lat_dd_mod) lon(lon_dd_mod) timevar(yearcat) panelvar(hhid1) dist(100) lag(2) star
estadd scalar Obs= e(N)
sum value_crop_production if !missing(value_crop_production)
estadd scalar MP=r(mean)
quietly estadd local yearfe "Yes", replace
quietly estadd local clusterfe "Yes", replace
quietly estadd local controls "Yes", replace
quietly estadd local distcut "100", replace

// Exporting the final Table
esttab spatialreg_10 spatialreg_50 spatialreg_100 using "$tables/spatially_adjusted_ses_fetzer", ///
	keep(post_treated_twice) b(%9.3f) se(%9.3f) star(* 0.10 ** 0.05 *** 0.01) ///
	stats(yearfe clusterfe controls distcut MP Obs, fmt(0 0 0 0 0 0) ///
	label("Year FE" "HH FE" "Controls" "Cutoff Radius (km)" "Mean of Dep. Variable" "Observations")) ///
	coeflabels(post_treated_twice "Treated Twice * Post") ///
	title("Table XX: Spatially Adjusted Standard Errors at Different Cutoff Radii following the Fetzer Procedure") ///
	addnotes("Notes: Controls are similar to the ones in the main specification. Cutoff Radius is the radius at which spatial dependence is assumed to be zero. The number of lags for serial correlation is assumed to be 2.") ///
	rtf replace nogaps style(fixed) compress varwidth(30) ///
	note("Conley spatially adjusted standard errors in parentheses") ///
	mlabels("Log Crop Production" "Log Crop Production" "Log Crop Production")
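For reference, the three estimation blocks above can be collapsed into a single loop over the cutoff radii (a sketch using the same variables and stored-estimate names as above):

foreach d in 10 50 100 {
	eststo spatialreg_`d': reg2hdfespatial logvalue_crop_production treated_twice post post_treated_twice $controls, ///
		lat(lat_dd_mod) lon(lon_dd_mod) timevar(yearcat) panelvar(hhid1) dist(`d') lag(2) star
	estadd scalar Obs = e(N) // number of observations
	sum value_crop_production if !missing(value_crop_production)
	estadd scalar MP = r(mean) // mean of the outcome in levels
	quietly estadd local yearfe "Yes", replace
	quietly estadd local clusterfe "Yes", replace
	quietly estadd local controls "Yes", replace
	quietly estadd local distcut "`d'", replace
}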

Below is what the final table looks like.

[Table: coefficient on Treated Twice * Post with Conley standard errors at 10, 50, and 100 km cutoff radii]

Other interesting resources on the topic of spatial clustering of SEs

Randomly Drawn Equators? – by Florence Kondylis & John Loeser

Clustering, Spatial Correlations and Randomization Inference – by Barrios et al. (2010)

Introduction to conleyreg – by Christian Düben

Berenger Djoumessi

Berenger is a Ph.D. graduate from the University of Minnesota in Applied Economics, where he studied topics at the intersection of environmental, development, and agricultural economics. Contact: djoum003@umn.edu
