```
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf
from regtabletotext import prettify_result
from statsmodels.regression.rolling import RollingOLS
from plotnine import *
from mizani.breaks import date_breaks
from mizani.formatters import percent_format, date_format
from joblib import Parallel, delayed, cpu_count
from itertools import product
```

# Beta Estimation

You are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** here.

In this chapter, we introduce an important concept in financial economics: The exposure of an individual stock to changes in the market portfolio. According to the Capital Asset Pricing Model (CAPM) of Sharpe (1964), Lintner (1965), and Mossin (1966), cross-sectional variation in expected asset returns should be a function of the covariance between the excess return of the asset and the excess return on the market portfolio. The regression coefficient of excess market returns on excess stock returns is usually called the market beta. We show an estimation procedure for the market betas. We do not go into details about the foundations of market beta but simply refer to any treatment of the CAPM for further information. Instead, we provide details about all the functions that we use to compute the results. In particular, we leverage useful computational concepts: Rolling-window estimation and parallelization.

We use the following Python packages throughout this chapter:

Compared to previous chapters, we introduce `statsmodels`

(Seabold and Perktold 2010) for regression analysis and for sliding-window regressions and `joblib`

(Team 2023) for parallelization.

## Estimating Beta Using Monthly Returns

The estimation procedure is based on a rolling-window estimation, where we may use either monthly or daily returns and different window lengths. First, let us start with loading the monthly CRSP data from our SQLite database introduced in Accessing and Managing Financial Data and WRDS, CRSP, and Compustat.

```
= sqlite3.connect(database="data/tidy_finance_python.sqlite")
tidy_finance
= (pd.read_sql_query(
crsp_monthly ="SELECT permno, month, industry, ret_excess FROM crsp_monthly",
sql=tidy_finance,
con={"month"})
parse_dates
.dropna()
)
= pd.read_sql_query(
factors_ff3_monthly ="SELECT month, mkt_excess FROM factors_ff3_monthly",
sql=tidy_finance,
con={"month"}
parse_dates
)
= (crsp_monthly
crsp_monthly ="left", on="month")
.merge(factors_ff3_monthly, how )
```

To estimate the CAPM regression coefficients \[
r_{i, t} - r_{f, t} = \alpha_i + \beta_i(r_{m, t}-r_{f,t})+\varepsilon_{i, t},
\tag{1}\] we regress stock excess returns `ret_excess`

on excess returns of the market portfolio `mkt_excess`

.

Python provides a simple solution to estimate (linear) models with the function `smf.ols()`

. The function requires a formula as input that is specified in a compact symbolic form. An expression of the form `y ~ model`

is interpreted as a specification that the response `y`

is modeled by a linear predictor specified symbolically by `model`

. Such a model consists of a series of terms separated by `+`

operators. In addition to standard linear models, `smf.ols()`

provides a lot of flexibility. You should check out the documentation for more information. To start, we restrict the data only to the time series of observations in CRSP that correspond to Apple’s stock (i.e., to `permno`

14593 for Apple) and compute \(\hat\alpha_i\) as well as \(\hat\beta_i\).

```
= (smf.ols(
model_beta ="ret_excess ~ mkt_excess",
formula=crsp_monthly.query("permno == 14593"))
data
.fit()
) prettify_result(model_beta)
```

```
OLS Model:
ret_excess ~ mkt_excess
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.010 0.005 2.004 0.046
mkt_excess 1.389 0.111 12.467 0.000
Summary statistics:
- Number of observations: 504
- R-squared: 0.236, Adjusted R-squared: 0.235
- F-statistic: 155.435 on 1 and 502 DF, p-value: 0.000
```

`smf.ols()`

returns an object of class `RegressionModel`

, which contains all the information we usually care about with linear models. `prettify_result()`

returns an overview of the estimated parameters. The output above indicates that Apple moves excessively with the market as the estimated \(\hat\beta_i\) is above one (\(\hat\beta_i \approx 1.4\)).

## Rolling-Window Estimation

After we estimated the regression coefficients on an example, we scale the estimation of \(\beta_i\) to a whole different level and perform rolling-window estimations for the entire CRSP sample.

We take a total of five years of data (`window_size`

) and require at least 48 months with return data to compute our betas (`min_obs`

). Check out the Exercises if you want to compute beta for different time periods. We first identify firm identifiers (`permno`

) for which CRSP contains sufficiently many records.

```
= 60
window_size = 48
min_obs
= (crsp_monthly
valid_permnos
.dropna()"permno")["permno"]
.groupby(
.count()="counts")
.reset_index(namef"counts > {window_size}+1")
.query( )
```

Before we proceed with the estimation, one important issue is worth emphasizing: `RollingOLS`

returns the estimated parameters of a linear regression by incorporating a window of the last `window_size`

rows. Whenever monthly returns are implicitly missing (which means there is simply no entry recorded, e.g., because a company was delisted and only traded publicly again later), such a fixed window size may cause outdated observations to influence the estimation results. We thus recommend making such implicit missing rows explicit.

We hence collect information about the first and last listing date of each `permno`

.

```
= (crsp_monthly
permno_information ="inner", on="permno")
.merge(valid_permnos, how"permno"])
.groupby([=("month", "min"),
.aggregate(first_month=("month", "max"))
last_month
.reset_index() )
```

To complete the missing observations in the CRSP sample, we obtain all possible `permno`

-`month`

combinations.

```
= crsp_monthly["permno"].unique()
unique_permno = factors_ff3_monthly["month"].unique()
unique_month
= pd.DataFrame(
all_combinations
product(unique_permno, unique_month), =["permno", "month"]
columns )
```

Finally, we expand the CRSP sample and include a row (with missing excess returns) for each possible `permno`

-`month`

observation that falls within the start and end date where the respective `permno`

has been publicly listed.

```
= (all_combinations
returns_monthly "permno", "month", "ret_excess"]),
.merge(crsp_monthly.get([="left", on=["permno", "month"])
how="left", on="permno")
.merge(permno_information, how"(month >= first_month) & (month <= last_month)")
.query(=["first_month", "last_month"])
.drop(columns"permno", "month", "industry"]),
.merge(crsp_monthly.get([="left", on=["permno", "month"])
how="left", on="month")
.merge(factors_ff3_monthly, how )
```

The following function implements the CAPM regression for a dataframe (or a part thereof) containing at least `min_obs`

observations to avoid huge fluctuations if the time series is too short. If the condition is violated (i.e., the time series is too short) the function returns a missing value.

```
def roll_capm_estimation(data, window_size, min_obs):
"""Calculate rolling CAPM estimation."""
= data.sort_values("month")
data
= (RollingOLS.from_formula(
result ="ret_excess ~ mkt_excess",
formula=data,
data=window_size,
window=min_obs,
min_nobs="drop")
missing
.fit()"mkt_excess")
.params.get(
)
= data.index
result.index
return result
```

Before we approach the whole CRSP sample, let us focus on a couple of examples for well-known firms.

```
= pd.DataFrame({
examples "permno": [14593, 10107, 93436, 17778],
"company": ["Apple", "Microsoft", "Tesla", "Berkshire Hathaway"]
})
```

It is actually quite simple to perform the rolling-window estimation for an arbitrary number of stocks, which we visualize in the following code chunk and the resulting Figure 1.

```
= (returns_monthly
beta_example ="inner", on="permno")
.merge(examples, how"permno"])
.groupby([apply(lambda x: x.assign(
.=roll_capm_estimation(x, window_size, min_obs))
beta
)=True)
.reset_index(drop
.dropna() )
```

```
= (
plot_beta
ggplot(beta_example, ="month", y="beta", color="company", linetype="company")) +
aes(x+
geom_line() ="", y="", color="", linetype="",
labs(x="Monthly beta estimates for example stocks") +
title=date_breaks("5 year"), labels=date_format("%Y"))
scale_x_datetime(breaks
) plot_beta.draw()
```

## Parallelized Rolling-Window Estimation

Next, we perform the rolling window estimation for the entire cross-section of stocks in the CRSP sample. For that purpose, we can apply the code snippet from the example above to compute rolling window regression coefficients for all stocks. This is how to do it with the `joblib`

package to use multiple cores. Note that we use `cpu_count()`

to determine the number of cores available for parallelization but keep one core free for other tasks. Some machines might freeze if all cores are busy with Python jobs.

```
def roll_capm_estimation_for_joblib(permno, group):
"""Calculate rolling CAPM estimation using joblib."""
if "date" in group.columns:
= group.sort_values(by="date")
group else:
= group.sort_values(by="month")
group
= (RollingOLS.from_formula(
beta_values ="ret_excess ~ mkt_excess",
formula=group,
data=window_size,
window=min_obs,
min_nobs="drop"
missing
)
.fit()"mkt_excess")
.params.get(
)
= pd.DataFrame(beta_values)
result = ["beta"]
result.columns "month"] = group["month"].values
result["permno"] = permno
result[
try:
"date"] = group["date"].values
result[= result[
result "month")["date"].transform("max")) == result["date"]
(result.groupby(
]except(KeyError):
pass
return result
= (returns_monthly
permno_groups ="inner", on="permno")
.merge(valid_permnos, how"permno", group_keys=False)
.groupby(
)
= cpu_count()-1
n_cores
= (
beta_monthly
pd.concat(=n_cores)
Parallel(n_jobs
(delayed(roll_capm_estimation_for_joblib)(name, group)for name, group in permno_groups)
)
.dropna()={"beta": "beta_monthly"})
.rename(columns )
```

## Estimating Beta Using Daily Returns

Before we provide some descriptive statistics of our beta estimates, we implement the estimation for the daily CRSP sample as well. Depending on the application, you might either use longer horizon beta estimates based on monthly data or shorter horizon estimates based on daily returns. As loading the full daily CRSP data requires relatively large amounts of memory, we split the beta estimation into smaller chunks.

First, we load the daily Fama-French market excess returns and extract the vector of dates.

```
= pd.read_sql_query(
factors_ff3_daily ="SELECT date, mkt_excess FROM factors_ff3_daily",
sql=tidy_finance,
con={"date"}
parse_dates
)
= factors_ff3_daily["date"].unique() unique_date
```

For the daily data, we consider around three months of data (i.e., 60 trading days), require at least 50 observations, and estimate betas in batches of 500.

```
= 60
window_size = 50
min_obs
= list(crsp_monthly["permno"].unique().astype(str))
permnos
= 500
batch_size = np.ceil(len(permnos)/batch_size).astype(int) batches
```

We then proceed to perform the same steps as with the monthly CRSP data, just in batches: Load in daily returns, transform implicit missing returns to explicit ones, keep only valid stocks with a minimum number of rows, and parallelize the beta estimation across stocks.

```
= []
beta_daily
for j in range(1, batches+1):
= permnos[
permno_batch -1)*batch_size):(min(j*batch_size, len(permnos)))
((j
]
= (
permno_batch_formatted ", ".join(f"'{permno}'" for permno in permno_batch)
)= f"({permno_batch_formatted})"
permno_string
= (
crsp_daily_sub_query "SELECT permno, month, date, ret_excess "
"FROM crsp_daily "
f"WHERE permno IN {permno_string}"
)
= pd.read_sql_query(
crsp_daily_sub =crsp_daily_sub_query,
sql=tidy_finance,
con={"permno": int},
dtype={"date", "month"}
parse_dates
)
= (crsp_daily_sub
valid_permnos "permno")["permno"]
.groupby(
.count()="counts")
.reset_index(namef"counts > {window_size}+1")
.query(="counts")
.drop(columns
)
= (crsp_daily_sub
permno_information ="inner", on="permno")
.merge(valid_permnos, how"permno"])
.groupby([=("date", "min"),
.aggregate(first_date=("date", "max"),)
last_date
.reset_index()
)
= permno_information["permno"].unique()
unique_permno
= pd.DataFrame(
all_combinations
product(unique_permno, unique_date), =["permno", "date"]
columns
)
= (crsp_daily_sub
returns_daily ="right", on=["permno", "date"])
.merge(all_combinations, how="left", on="permno")
.merge(permno_information, how"(date >= first_date) & (date <= last_date)")
.query(=["first_date", "last_date"])
.drop(columns="left", on="date")
.merge(factors_ff3_daily, how
)
= (returns_daily
permno_groups "permno", group_keys=False)
.groupby(
)
= (
beta_daily_sub
pd.concat(=n_cores)
Parallel(n_jobs
(delayed(roll_capm_estimation_for_joblib)(name, group)for name, group in permno_groups)
)
.dropna()={"beta": "beta_daily"})
.rename(columns
)
beta_daily.append(beta_daily_sub)
print(f"Batch {j} out of {batches} done ({(j/batches)*100:.2f}%)\n")
= pd.concat(beta_daily) beta_daily
```

## Comparing Beta Estimates

What is a typical value for stock betas? To get some feeling, we illustrate the dispersion of the estimated \(\hat\beta_i\) across different industries and across time below. Figure 2 shows that typical business models across industries imply different exposure to the general market economy. However, there are barely any firms that exhibit a negative exposure to the market factor.

```
= (beta_monthly
beta_industries ="inner", on=["permno", "month"])
.merge(crsp_monthly, how="beta_monthly")
.dropna(subset"industry","permno"])["beta_monthly"]
.groupby(["mean")
.aggregate(
.reset_index()
)
= (beta_industries
industry_order "industry")["beta_monthly"]
.groupby("median")
.aggregate(
.sort_values()
.index.tolist()
)
= (
plot_beta_industries
ggplot(beta_industries, ="industry", y="beta_monthly")) +
aes(x+
geom_boxplot() +
coord_flip() ="", y="",
labs(x="Firm-specific beta distributions by industry") +
title=industry_order)
scale_x_discrete(limits
) plot_beta_industries.draw()
```

Next, we illustrate the time-variation in the cross-section of estimated betas. Figure 3 shows the monthly deciles of estimated betas (based on monthly data) and indicates an interesting pattern: First, betas seem to vary over time in the sense that during some periods, there is a clear trend across all deciles. Second, the sample exhibits periods where the dispersion across stocks increases in the sense that the lower decile decreases and the upper decile increases, which indicates that for some stocks, the correlation with the market increases, while for others it decreases. Note also here: stocks with negative betas are a rare exception.

```
= (beta_monthly
beta_quantiles "month")["beta_monthly"]
.groupby(=np.arange(0.1, 1.0, 0.1))
.quantile(q
.reset_index()={"level_1": "quantile"})
.rename(columns=lambda x: (x["quantile"]*100).astype(int))
.assign(quantile
.dropna()
)
= ["-", "--", "-.", ":"]
linetypes = beta_quantiles["quantile"].nunique()
n_quantiles
= (
plot_beta_quantiles
ggplot(beta_quantiles, ="month", y="beta_monthly",
aes(x="factor(quantile)", linetype="factor(quantile)")) +
color+
geom_line() ="", y="", color="", linetype="",
labs(x="Monthly deciles of estimated betas") +
title=date_breaks("5 year"), labels=date_format("%Y")) +
scale_x_datetime(breaks
scale_linetype_manual(=[linetypes[l % len(linetypes)] for l in range(n_quantiles)]
values
)
) plot_beta_quantiles.draw()
```

To compare the difference between daily and monthly data, we combine beta estimates to a single table. Then, we use the table to plot a comparison of beta estimates for our example stocks in Figure 4.

```
= (beta_monthly
beta "permno", "month", "beta_monthly"])
.get(["permno", "month", "beta_daily"]),
.merge(beta_daily.get([="outer", on=["permno", "month"])
how
)
= (beta
beta_comparison ="permno")
.merge(examples, on=["permno", "month", "company"], var_name="name",
.melt(id_vars=["beta_monthly", "beta_daily"], value_name="value")
value_vars
.dropna()
)
= (
plot_beta_comparison
ggplot(beta_comparison,="month", y="value", color="name")) +
aes(x+
geom_line() "~company", ncol=1) +
facet_wrap(="", y="", color="",
labs(x="Comparison of beta estimates using monthly and daily data") +
title=date_breaks("10 years"),
scale_x_datetime(breaks=date_format("%Y")) +
labels=(6.4, 6.4))
theme(figure_size
) plot_beta_comparison.draw()
```

The estimates in Figure 4 look as expected. As you can see, the beta estimates really depend on the estimation window and data frequency. Nevertheless, one can observe a clear connection between daily and monthly betas in this example, in magnitude and the dynamics over time.

Finally, we write the estimates to our database so that we can use them in later chapters.

```
(beta.to_sql(="beta",
name=tidy_finance,
con="replace",
if_exists=False
index
) )
```

Whenever you perform some kind of estimation, it also makes sense to do rough plausibility tests. A possible check is to plot the share of stocks with beta estimates over time. This descriptive analysis helps us discover potential errors in our data preparation or the estimation procedure. For instance, suppose there was a gap in our output without any betas. In this case, we would have to go back and check all previous steps to find out what went wrong. Figure 5 does not indicate any troubles, so let us move on to the next check.

```
= (crsp_monthly
beta_long ="left", on=["permno", "month"])
.merge(beta, how=["permno", "month"], var_name="name",
.melt(id_vars=["beta_monthly", "beta_daily"], value_name="value")
value_vars
)
= (beta_long
beta_shares "month", "name"])
.groupby([=("value", lambda x: sum(~x.isna())/len(x)))
.aggregate(share
.reset_index()
)
= (
plot_beta_long
ggplot(beta_shares, ="month", y="share", color="name", linetype="name")) +
aes(x+
geom_line() ="", y="", color="", linetype="",
labs(x="End-of-month share of securities with beta estimates") +
title=percent_format()) +
scale_y_continuous(labels=date_breaks("10 year"), labels=date_format("%Y"))
scale_x_datetime(breaks
) plot_beta_long.draw()
```

We also encourage everyone to always look at the distributional summary statistics of variables. You can easily spot outliers or weird distributions when looking at such tables.

`"name")["value"].describe().round(2) beta_long.groupby(`

count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|

name | ||||||||

beta_daily | 3279386.0 | 0.75 | 0.94 | -44.85 | 0.20 | 0.69 | 1.23 | 61.64 |

beta_monthly | 2073073.0 | 1.10 | 0.70 | -8.96 | 0.64 | 1.03 | 1.47 | 10.35 |

The summary statistics also look plausible for the two estimation procedures.

Finally, since we have two different estimators for the same theoretical object, we expect the estimators to be at least positively correlated (although not perfectly as the estimators are based on different sample periods and frequencies).

`"beta_monthly", "beta_daily"]).corr().round(2) beta.get([`

beta_monthly | beta_daily | |
---|---|---|

beta_monthly | 1.00 | 0.32 |

beta_daily | 0.32 | 1.00 |

Indeed, we find a positive correlation between our beta estimates. In the subsequent chapters, we mainly use the estimates based on monthly data, as most readers should be able to replicate them and should not encounter potential memory limitations that might arise with the daily data.

## Exercises

- Compute beta estimates based on monthly data using one, three, and five years of data and impose a minimum number of observations of 10, 28, and 48 months with return data, respectively. How strongly correlated are the estimated betas?
- Compute beta estimates based on monthly data using five years of data and impose different numbers of minimum observations. How does the share of
`permno`

-`month`

observations with successful beta estimates vary across the different requirements? Do you find a high correlation across the estimated betas? - Instead of using
`joblib`

, perform the beta estimation in a loop (using either monthly or daily data) for a subset of 100 permnos of your choice. Verify that you get the same results as with the parallelized code from above. - Filter out the stocks with negative betas. Do these stocks frequently exhibit negative betas, or do they resemble estimation errors?
- Compute beta estimates for multi-factor models such as the Fama-French three-factor model. For that purpose, you extend your regression to \[ r_{i, t} - r_{f, t} = \alpha_i + \sum\limits_{j=1}^k\beta_{i,k}(r_{j, t}-r_{f,t})+\varepsilon_{i, t} \tag{2}\] where \(r_{i, t}\) are the \(k\) factor returns. Thus, you estimate four parameters (\(\alpha_i\) and the slope coefficients). Provide some summary statistics of the cross-section of firms and their exposure to the different factors.

## References

*The Journal of Finance*20 (4): 587–615. https://doi.org/10.1111/j.1540-6261.1965.tb02930.x.

*Econometrica*34 (4): 768–83. https://doi.org/10.2307/1910098.

*9th Python in Science Conference*.

*The Journal of Finance*19 (3): 425–42. https://doi.org/10.1111/j.1540-6261.1964.tb02865.x.