```
import pandas as pd
import numpy as np
import datetime as dt
import itertools
import linearmodels as lm
import sqlite3
```

# Fixed Effects and Clustered Standard Errors


In this chapter, we provide an intuitive introduction to the two popular concepts of *fixed effects regressions* and *clustered standard errors*. When working with regressions in empirical finance, you will sooner or later be confronted with discussions around how you deal with omitted variables bias and dependence in your residuals. The concepts we introduce in this chapter are designed to address such concerns.

We focus on a classical panel regression common to the corporate finance literature (e.g., Fazzari et al. 1988; Erickson and Whited 2012; Gulen and Ion 2016): firm investment modeled as a function that increases in firm cash flow and firm investment opportunities.

Typically, this investment regression uses quarterly balance sheet data provided via Compustat because it allows for richer dynamics in the regressors and more opportunities to construct variables. As we focus on the implementation of fixed effects and clustered standard errors, we use the annual Compustat data from our previous chapters and leave the estimation using quarterly data as an exercise. We demonstrate below that the regression based on annual data yields qualitatively similar results to estimations based on quarterly data from the literature, namely confirming the positive relationships between investment and the two regressors.

The current chapter relies on the set of packages imported above.

## Data Preparation

We use CRSP and annual Compustat as data sources from our `SQLite` database introduced in Chapters 2-4. In particular, Compustat provides balance sheet and income statement data on a firm level, while CRSP provides market valuations.

```
tidy_finance = sqlite3.connect("data/tidy_finance.sqlite")

crsp_monthly = pd.read_sql_query(
  sql="SELECT gvkey, month, mktcap FROM crsp_monthly",
  con=tidy_finance,
  parse_dates={"month": {"unit": "D", "origin": "unix"}}
)

compustat = pd.read_sql_query(
  sql="SELECT datadate, gvkey, year, at, be, capx, oancf, txdb FROM compustat",
  con=tidy_finance,
  parse_dates={"datadate": {"unit": "D", "origin": "unix"}}
)
```

The classical investment regressions model the capital investment of a firm as a function of operating cash flows and Tobin’s q, a measure of a firm’s investment opportunities. We start by constructing investment and cash flows which are usually normalized by lagged total assets of a firm. In the following code chunk, we construct a *panel* of firm-year observations, so we have both cross-sectional information on firms as well as time-series information for each firm.

```
data_investment = (compustat
  .assign(month=lambda x: (
    pd.to_datetime(x["datadate"]).dt.to_period('M').dt.to_timestamp()
  ))
  .merge(compustat.get(["gvkey", "year", "at"])
           .rename(columns={"at": "at_lag"})
           .assign(year=lambda x: x["year"] + 1),
         on=["gvkey", "year"], how="left")
  .query("at > 0 and at_lag > 0")
  .assign(investment=lambda x: x["capx"] / x["at_lag"],
          cash_flows=lambda x: x["oancf"] / x["at_lag"])
)

data_investment = (data_investment
  .merge(data_investment.get(["gvkey", "year", "investment"])
           .rename(columns={"investment": "investment_lead"})
           .assign(year=lambda x: x["year"] - 1),
         on=["gvkey", "year"], how="left")
)
```

Tobin’s q is the ratio of the market value of capital to its replacement costs. It is one of the most common regressors in corporate finance applications (e.g., Fazzari et al. 1988; Erickson and Whited 2012). We follow the implementation of Gulen and Ion (2016) and compute Tobin’s q as the market value of equity (`mktcap`) plus the book value of assets (`at`) minus the book value of equity (`be`) plus deferred taxes (`txdb`), all divided by the book value of assets (`at`). Finally, we only keep observations where all variables of interest are non-missing, and the reported book value of assets is strictly positive.

```
data_investment = (data_investment
  .merge(crsp_monthly, on=["gvkey", "month"], how="left")
  .assign(tobins_q=lambda x: (
    (x["mktcap"] + x["at"] - x["be"] + x["txdb"]) / x["at"]
  ))
  .get(["gvkey", "year", "investment_lead", "cash_flows", "tobins_q"])
  .dropna()
)
```

As the variable construction typically leads to extreme values that are most likely related to data issues (e.g., reporting errors), many papers include winsorization of the variables of interest. Winsorization involves replacing values of extreme outliers with quantiles on the respective end. The following function implements the winsorization for any percentage cut that should be applied on either end of the distributions. In the specific example, we winsorize the main variables (`investment_lead`, `cash_flows`, and `tobins_q`) at the 1 percent level.

```
def winsorize(x, cut):
    tmp_x = x.copy()
    upper_quantile = np.nanquantile(tmp_x, 1 - cut)
    lower_quantile = np.nanquantile(tmp_x, cut)
    tmp_x[tmp_x > upper_quantile] = upper_quantile
    tmp_x[tmp_x < lower_quantile] = lower_quantile
    return tmp_x

data_investment = (data_investment
  .assign(investment_lead=lambda x: winsorize(x["investment_lead"], 0.01),
          cash_flows=lambda x: winsorize(x["cash_flows"], 0.01),
          tobins_q=lambda x: winsorize(x["tobins_q"], 0.01))
)
```

Before proceeding to any estimations, we highly recommend tabulating summary statistics of the variables that enter the regression. These simple tables allow you to check the plausibility of your numerical variables, as well as spot any obvious errors or outliers. Additionally, for panel data, plotting the time series of the variable’s mean and the number of observations is a useful exercise to spot potential problems.
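As a quick sketch of that time-series check, the yearly mean and observation count can be tabulated with a simple `groupby`; the panel below is a hypothetical toy stand-in for `data_investment` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel standing in for data_investment.
rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "gvkey": np.repeat(["0001", "0002", "0003"], 4),
    "year": np.tile([2018, 2019, 2020, 2021], 3),
    "investment_lead": rng.uniform(0, 0.2, size=12),
})

# Yearly mean and observation count; sudden jumps in either series
# are a quick flag for data problems such as gaps or duplicated firms.
yearly = (panel
  .groupby("year")["investment_lead"]
  .agg(mean_investment="mean", n_obs="count")
  .reset_index()
)
print(yearly)
```

Plotting `mean_investment` and `n_obs` against `year` then reveals breaks in coverage at a glance.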

```
data_investment_summary = (data_investment
  .melt(id_vars=["gvkey", "year"],
        value_vars=["investment_lead", "cash_flows", "tobins_q"],
        var_name="measure")
  .get(["measure", "value"])
  .groupby("measure")
  .describe(percentiles=[0.05, 0.5, 0.95])
)
data_investment_summary
```

| measure | count | mean | std | min | 5% | 50% | 95% | max |
|---|---|---|---|---|---|---|---|---|
| cash_flows | 124194.0 | 0.014524 | 0.266255 | -1.495325 | -0.456624 | 0.064875 | 0.272627 | 0.480034 |
| investment_lead | 124194.0 | 0.058407 | 0.077781 | 0.000000 | 0.000727 | 0.033305 | 0.208287 | 0.466986 |
| tobins_q | 124194.0 | 1.987102 | 1.686705 | 0.571063 | 0.792256 | 1.384615 | 5.333876 | 10.849820 |

## Fixed Effects

To illustrate fixed effects regressions, we use the `linearmodels` package, which is both computationally powerful and flexible with respect to model specifications. We start out with the basic investment regression using the simple model \[ \text{Investment}_{i,t+1} = \alpha + \beta_1\text{Cash Flows}_{i,t}+\beta_2\text{Tobin's q}_{i,t}+\varepsilon_{i,t},\] where \(\varepsilon_{i,t}\) is i.i.d. normally distributed across time and firms. We use the `PanelOLS()` function to estimate the simple model so that the output has the same structure as the other regressions below.

```
model_ols = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + 1",
  data=data_investment.set_index(["gvkey", "year"]),
).fit()
model_ols.summary
```

| | | | |
|---|---|---|---|
| Dep. Variable: | investment_lead | R-squared: | 0.0445 |
| Estimator: | PanelOLS | R-squared (Between): | 0.0222 |
| No. Observations: | 124194 | R-squared (Within): | 0.0404 |
| Date: | Tue, Aug 01 2023 | R-squared (Overall): | 0.0445 |
| Time: | 15:58:56 | Log-likelihood: | 1.438e+05 |
| Cov. Estimator: | Unadjusted | | |
| | | F-statistic: | 2891.1 |
| Entities: | 13904 | P-value: | 0.0000 |
| Avg Obs: | 8.9322 | Distribution: | F(2,124191) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 34.000 | F-statistic (robust): | 2891.1 |
| | | P-value: | 0.0000 |
| Time periods: | 34 | Distribution: | F(2,124191) |
| Avg Obs: | 3652.8 | | |
| Min Obs: | 469.00 | | |
| Max Obs: | 5237.0 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| cash_flows | 0.0514 | 0.0008 | 61.588 | 0.0000 | 0.0498 | 0.0531 |
| tobins_q | 0.0077 | 0.0001 | 58.206 | 0.0000 | 0.0074 | 0.0079 |
| Intercept | 0.0424 | 0.0003 | 124.05 | 0.0000 | 0.0417 | 0.0431 |

As expected, the regression output shows significant coefficients for both variables. Higher cash flows and investment opportunities are associated with higher investment. However, the simple model most likely suffers from omitted variables, so our coefficient estimates are probably biased. As there is a lot of unexplained variation in our simple model (indicated by the rather low R-squared), the bias in our coefficients is potentially severe, and the true values could be above or below zero. Note that there are no clear cutoffs to decide when an R-squared is high or low; it depends on the context of your application and on the comparison of different models for the same data.

One way to tackle the issue of omitted variable bias is to get rid of as much unexplained variation as possible by including *fixed effects* - i.e., model parameters that are fixed for specific groups (e.g., Wooldridge 2010). In essence, each group has its own mean in fixed effects regressions. The simplest group that we can form in the investment regression is the firm level. The firm fixed effects regression is then \[ \text{Investment}_{i,t+1} = \alpha_i + \beta_1\text{Cash Flows}_{i,t}+\beta_2\text{Tobin's q}_{i,t}+\varepsilon_{i,t},\] where \(\alpha_i\) is the firm fixed effect and captures the firm-specific mean investment across all years. In fact, you could also compute firms’ investments as deviations from the firms’ average investments and estimate the model without the fixed effects. The idea of the firm fixed effect is to remove the firm’s average investment, which might be affected by firm-specific variables that you do not observe. For example, firms in a specific industry might invest more on average. Or you observe a young firm with large investments but only small concurrent cash flows, which will only happen in a few years. This sort of variation is unwanted because it is related to unobserved variables that can bias your estimates in any direction.
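The equivalence between firm fixed effects and demeaning by firm can be illustrated with a small simulation; all names and numbers below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two firms with different average investment levels.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "firm": np.repeat(["A", "B"], 50),
    "x": rng.normal(size=100),
})
firm_effect = np.where(df["firm"] == "A", 0.10, 0.30)
df["y"] = firm_effect + 0.5 * df["x"] + rng.normal(scale=0.05, size=100)

# Within transformation: demean y and x by firm.
demeaned = df.groupby("firm")[["y", "x"]].transform(lambda s: s - s.mean())

# Slope from OLS on the demeaned data ...
beta_within = (demeaned["x"] @ demeaned["y"]) / (demeaned["x"] @ demeaned["x"])

# ... equals the slope from a dummy-variable (LSDV) regression.
X = np.column_stack([
    (df["firm"] == "A").astype(float),
    (df["firm"] == "B").astype(float),
    df["x"],
])
beta_lsdv = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0][2]

print(np.isclose(beta_within, beta_lsdv))  # True
```

Both routes remove the firm-specific mean, which is exactly what \(\alpha_i\) absorbs in the fixed effects regression.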

To include the firm fixed effect, we use `gvkey` (Compustat’s firm identifier) as follows:

```
model_fe_firm = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + EntityEffects",
  data=data_investment.set_index(["gvkey", "year"]),
).fit()
model_fe_firm.summary
```

| | | | |
|---|---|---|---|
| Dep. Variable: | investment_lead | R-squared: | 0.0595 |
| Estimator: | PanelOLS | R-squared (Between): | 0.2571 |
| No. Observations: | 124194 | R-squared (Within): | 0.0595 |
| Date: | Tue, Aug 01 2023 | R-squared (Overall): | 0.2363 |
| Time: | 15:58:56 | Log-likelihood: | 1.956e+05 |
| Cov. Estimator: | Unadjusted | | |
| | | F-statistic: | 3486.4 |
| Entities: | 13904 | P-value: | 0.0000 |
| Avg Obs: | 8.9322 | Distribution: | F(2,110288) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 34.000 | F-statistic (robust): | 3486.4 |
| | | P-value: | 0.0000 |
| Time periods: | 34 | Distribution: | F(2,110288) |
| Avg Obs: | 3652.8 | | |
| Min Obs: | 469.00 | | |
| Max Obs: | 5237.0 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| cash_flows | 0.0146 | 0.0010 | 15.155 | 0.0000 | 0.0127 | 0.0165 |
| tobins_q | 0.0113 | 0.0001 | 82.633 | 0.0000 | 0.0110 | 0.0115 |

F-test for Poolability: 10.357

P-value: 0.0000

Distribution: F(13903,110288)

Included effects: Entity

The regression output shows that a lot of unexplained variation at the firm level is taken care of by the firm fixed effect: the overall R-squared rises from around 4% to about 24%. In fact, it is more interesting to look at the within R-squared, which shows the explanatory power of a firm’s cash flows and Tobin’s q *on top* of the average investment of each firm. We can also see that the coefficients changed slightly in magnitude but not in sign.

There is another source of variation that we can get rid of in our setting: average investment across firms might vary over time due to macroeconomic factors that affect all firms, such as economic crises. By including year fixed effects, we can take out the effect of unobservables that vary over time. The two-way fixed effects regression is then \[ \text{Investment}_{i,t+1} = \alpha_i + \alpha_t + \beta_1\text{Cash Flows}_{i,t}+\beta_2\text{Tobin's q}_{i,t}+\varepsilon_{i,t},\] where \(\alpha_t\) is the time fixed effect. Here you can think of higher investments during an economic expansion with simultaneously high cash flows.

```
model_fe_firmyear = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + EntityEffects + TimeEffects",
  data=data_investment.set_index(["gvkey", "year"]),
).fit()
model_fe_firmyear.summary
```

| | | | |
|---|---|---|---|
| Dep. Variable: | investment_lead | R-squared: | 0.0516 |
| Estimator: | PanelOLS | R-squared (Between): | 0.2415 |
| No. Observations: | 124194 | R-squared (Within): | 0.0588 |
| Date: | Tue, Aug 01 2023 | R-squared (Overall): | 0.2251 |
| Time: | 15:58:57 | Log-likelihood: | 1.989e+05 |
| Cov. Estimator: | Unadjusted | | |
| | | F-statistic: | 2998.7 |
| Entities: | 13904 | P-value: | 0.0000 |
| Avg Obs: | 8.9322 | Distribution: | F(2,110255) |
| Min Obs: | 1.0000 | | |
| Max Obs: | 34.000 | F-statistic (robust): | 2998.7 |
| | | P-value: | 0.0000 |
| Time periods: | 34 | Distribution: | F(2,110255) |
| Avg Obs: | 3652.8 | | |
| Min Obs: | 469.00 | | |
| Max Obs: | 5237.0 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| cash_flows | 0.0182 | 0.0009 | 19.314 | 0.0000 | 0.0163 | 0.0200 |
| tobins_q | 0.0102 | 0.0001 | 75.546 | 0.0000 | 0.0099 | 0.0105 |

F-test for Poolability: 11.296

P-value: 0.0000

Distribution: F(13936,110255)

Included effects: Entity, Time

The inclusion of time fixed effects only marginally affected the R-squared and the coefficients, which we can interpret as good news: it indicates that the coefficients are not driven by an omitted variable that varies over time.

How can we further improve the robustness of our regression results? Ideally, we want to get rid of unexplained variation at the firm-year level, which means we need to include more variables that vary across firm *and* time and are likely correlated with investment. Note that we cannot include firm-year fixed effects in our setting because then cash flows and Tobin’s q would be collinear with the fixed effects, and the estimation becomes void.

Before we discuss the properties of our estimation errors, we want to point out that regression tables are at the heart of every empirical analysis, where you compare multiple models. Fortunately, the `results.compare()` function provides a convenient way to tabulate the regression output (with many parameters to customize and even print the output in LaTeX). We recommend printing \(t\)-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects (Wasserstein and Lazar 2016). The \(t\)-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications.

```
comparison = lm.panel.results.compare(
  [model_ols, model_fe_firm, model_fe_firmyear]
)
comparison.summary
```

| | Model 0 | Model 1 | Model 2 |
|---|---|---|---|
| Dep. Variable | investment_lead | investment_lead | investment_lead |
| Estimator | PanelOLS | PanelOLS | PanelOLS |
| No. Observations | 124194 | 124194 | 124194 |
| Cov. Est. | Unadjusted | Unadjusted | Unadjusted |
| R-squared | 0.0445 | 0.0595 | 0.0516 |
| R-Squared (Within) | 0.0404 | 0.0595 | 0.0588 |
| R-Squared (Between) | 0.0222 | 0.2571 | 0.2415 |
| R-Squared (Overall) | 0.0445 | 0.2363 | 0.2251 |
| F-statistic | 2891.1 | 3486.4 | 2998.7 |
| P-value (F-stat) | 0.0000 | 0.0000 | 0.0000 |
| cash_flows | 0.0514 | 0.0146 | 0.0182 |
| | (61.588) | (15.155) | (19.314) |
| tobins_q | 0.0077 | 0.0113 | 0.0102 |
| | (58.206) | (82.633) | (75.546) |
| Intercept | 0.0424 | | |
| | (124.05) | | |
| Effects | | Entity | Entity |
| | | | Time |

T-stats reported in parentheses

## Clustering Standard Errors

Apart from biased estimators, we usually have to deal with potentially complex dependencies of our residuals with each other. Such dependencies in the residuals invalidate the i.i.d. assumption of OLS and lead to biased standard errors. With biased OLS standard errors, we cannot reliably interpret the statistical significance of our estimated coefficients.

In our setting, the residuals may be correlated across years for a given firm (time-series dependence), or, alternatively, the residuals may be correlated across different firms (cross-section dependence). One of the most common approaches to dealing with such dependence is the use of *clustered standard errors* (Petersen 2009). The idea behind clustering is that the correlation of residuals *within* a cluster can be of any form. As the number of clusters grows, the cluster-robust standard errors become consistent (Donald and Lang 2007; Wooldridge 2010). A natural requirement for clustering standard errors in practice is hence a sufficiently large number of clusters. Typically, at least 30 to 50 clusters are seen as sufficient (Cameron, Gelbach, and Miller 2011).
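To make the mechanics concrete, the one-way cluster-robust "sandwich" variance can be computed by hand for a tiny, hypothetical single-regressor example without intercept; all numbers are made up for illustration:

```python
import numpy as np

# Hypothetical toy data: four observations in two clusters.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 5.0, 7.0])
cluster = np.array([0, 0, 1, 1])

# OLS slope without intercept: beta = sum(x * y) / sum(x^2).
beta = (x @ y) / (x @ x)
residuals = y - beta * x

# Sandwich: the bread is (X'X)^-1; the meat sums squared per-cluster scores,
# which allows arbitrary residual correlation within each cluster.
bread = 1 / (x @ x)
scores = np.array([
    x[cluster == g] @ residuals[cluster == g] for g in np.unique(cluster)
])
meat = np.sum(scores**2)
se_clustered = np.sqrt(bread * meat * bread)
print(round(se_clustered, 4))  # 0.1728
```

In practice, estimators also apply finite-sample corrections, but the sandwich structure is the same.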

Instead of relying on the i.i.d. assumption, we can use the `cov_type="clustered"` option in the `fit()` function as above. The code chunk below applies both one-way clustering by firm and two-way clustering by firm and year.

```
model_cluster_firm = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + EntityEffects",
  data=data_investment.set_index(["gvkey", "year"]),
).fit(cov_type="clustered", cluster_entity=True, cluster_time=False)

model_cluster_firmyear = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + EntityEffects + TimeEffects",
  data=data_investment.set_index(["gvkey", "year"]),
).fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
```

The table below compares the different assumptions behind the standard errors. In the first column, we see highly significant coefficients on both cash flows and Tobin’s q. By clustering the standard errors on the firm level, the \(t\)-statistics of both coefficients drop by roughly half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a further drop, particularly for Tobin’s q. Even after relaxing the assumptions behind our standard errors, both coefficients remain comfortably significant, as the \(t\)-statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.

```
comparison_clustered = lm.panel.results.compare(
  [model_fe_firmyear, model_cluster_firm, model_cluster_firmyear]
)
comparison_clustered.summary
```

| | Model 0 | Model 1 | Model 2 |
|---|---|---|---|
| Dep. Variable | investment_lead | investment_lead | investment_lead |
| Estimator | PanelOLS | PanelOLS | PanelOLS |
| No. Observations | 124194 | 124194 | 124194 |
| Cov. Est. | Unadjusted | Clustered | Clustered |
| R-squared | 0.0516 | 0.0595 | 0.0516 |
| R-Squared (Within) | 0.0588 | 0.0595 | 0.0588 |
| R-Squared (Between) | 0.2415 | 0.2571 | 0.2415 |
| R-Squared (Overall) | 0.2251 | 0.2363 | 0.2251 |
| F-statistic | 2998.7 | 3486.4 | 2998.7 |
| P-value (F-stat) | 0.0000 | 0.0000 | 0.0000 |
| cash_flows | 0.0182 | 0.0146 | 0.0182 |
| | (19.314) | (8.6927) | (9.0599) |
| tobins_q | 0.0102 | 0.0113 | 0.0102 |
| | (75.546) | (38.155) | (15.673) |
| Effects | Entity | Entity | Entity |
| | Time | | Time |

T-stats reported in parentheses

Inspired by Abadie et al. (2017), we want to close this chapter by highlighting that choosing the right dimensions for clustering is a design problem. Even if the data is informative about whether clustering matters for standard errors, it does not tell you whether you *should* adjust the standard errors for clustering. Clustering at overly aggregate levels can hence lead to unnecessarily inflated standard errors.

## Exercises

- Estimate the two-way fixed effects model with two-way clustered standard errors using quarterly Compustat data from WRDS.
- Following Peters and Taylor (2017), compute Tobin’s q as the market value of outstanding equity (`mktcap`) plus the book value of debt (`dltt` + `dlc`) minus current assets (`act`), all divided by the book value of property, plant and equipment (`ppegt`). What is the correlation between the measures of Tobin’s q? What is the impact on the two-way fixed effects regressions?

## References

Abadie, Alberto, Susan Athey, Guido W. Imbens, and Jeffrey Wooldridge. 2017. "When Should You Adjust Standard Errors for Clustering?" *Working Paper*. http://www.nber.org/papers/w24003.

Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller. 2011. "Robust Inference with Multiway Clustering." *Journal of Business & Economic Statistics* 29 (2): 238–49. http://www.jstor.org/stable/25800796.

Donald, Stephen G., and Kevin Lang. 2007. "Inference with Difference-in-Differences and Other Panel Data." *The Review of Economics and Statistics* 89 (2): 221–33. https://doi.org/10.1162/rest.89.2.221.

Erickson, Timothy, and Toni M. Whited. 2012. "Treating Measurement Error in Tobin's q." *Review of Financial Studies* 25 (4): 1286–1329. https://doi.org/10.1093/rfs/hhr120.

Fazzari, Steven M., R. Glenn Hubbard, and Bruce C. Petersen. 1988. "Financing Constraints and Corporate Investment." *Brookings Papers on Economic Activity* 1988 (1): 141–206. http://www.jstor.org/stable/2534426.

Gulen, Huseyin, and Mihai Ion. 2016. "Policy Uncertainty and Corporate Investment." *Review of Financial Studies* 29 (3): 523–64. https://doi.org/10.1093/rfs/hhv050.

Peters, Ryan H., and Lucian A. Taylor. 2017. "Intangible Capital and the Measurement of Firm Growth." *Journal of Financial Economics* 123 (2): 251–72. https://doi.org/10.1016/j.jfineco.2016.03.011.

Petersen, Mitchell A. 2009. "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches." *Review of Financial Studies* 22 (1): 435–80. https://doi.org/10.1093/rfs/hhn053.

Wasserstein, Ronald L., and Nicole A. Lazar. 2016. "The ASA's Statement on p-Values: Context, Process, and Purpose." *The American Statistician* 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.

Wooldridge, Jeffrey M. 2010. *Econometric Analysis of Cross Section and Panel Data*. The MIT Press. http://www.jstor.org/stable/j.ctt5hhcfr.