```
import pandas as pd
import numpy as np
import sqlite3
from plotnine import *
from mizani.formatters import percent_format, date_format
from mizani.breaks import date_breaks
from itertools import product
from sklearn.model_selection import (
train_test_split, GridSearchCV, TimeSeriesSplit, cross_val_score
)from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet, Lasso, Ridge
```

# Factor Selection via Machine Learning

You are reading **Tidy Finance with Python**. You can find the equivalent chapter for the sibling **Tidy Finance with R** here.

The aim of this chapter is twofold. From a data science perspective, we introduce `scikit-learn`

, a collection of packages for modeling and machine learning (ML). `scikit-learn`

comes with a handy workflow for all sorts of typical prediction tasks. From a finance perspective, we address the notion of *factor zoo* (Cochrane 2011) using ML methods. We introduce Lasso, Ridge, and Elastic Net regression as a special case of penalized regression models. Then, we explain the concept of cross-validation for model *tuning* with Elastic Net regularization as a popular example. We implement and showcase the entire cycle from model specification, training, and forecast evaluation within the `scikit-learn`

universe. While the tools can generally be applied to an abundance of interesting asset pricing problems, we apply penalized regressions for identifying macroeconomic variables and asset pricing factors that help explain a cross-section of industry portfolios.

In previous chapters, we illustrate that stock characteristics such as size provide valuable pricing information in addition to the market beta. Such findings question the usefulness of the Capital Asset Pricing Model. In fact, during the last decades, financial economists discovered a plethora of additional factors which may be correlated with the marginal utility of consumption (and would thus deserve a prominent role in pricing applications). The search for factors that explain the cross-section of expected stock returns has produced hundreds of potential candidates, as noted more recently by Harvey, Liu, and Zhu (2016), Harvey (2017), Mclean and Pontiff (2016), and Hou, Xue, and Zhang (2020). Therefore, given the multitude of proposed risk factors, the challenge these days rather is: *do we believe in the relevance of hundreds of risk factors?* During recent years, promising methods from the field of ML got applied to common finance applications. We refer to Mullainathan and Spiess (2017) for a treatment of ML from the perspective of an econometrician, Nagel (2021) for an excellent review of ML practices in asset pricing, Easley et al. (2020) for ML applications in (high-frequency) market microstructure, and Dixon, Halperin, and Bilokon (2020) for a detailed treatment of all methodological aspects.

## Brief Theoretical Background

This is a book about *doing* empirical work in a tidy manner, and we refer to any of the many excellent textbook treatments of ML methods and especially penalized regressions for some deeper discussion. Excellent material is provided, for instance, by Hastie, Tibshirani, and Friedman (2009), Gareth et al. (2013), and De Prado (2018). Instead, we briefly summarize the idea of Lasso and Ridge regressions as well as the more general Elastic Net. Then, we turn to the fascinating question on *how* to implement, tune, and use such models with the `scikit-learn`

package.

To set the stage, we start with the definition of a linear model: Suppose we have data \((y_t, x_t), t = 1,\ldots, T\), where \(x_t\) is a \((K \times 1)\) vector of regressors and \(y_t\) is the response for observation \(t\). The linear model takes the form \(y_t = \beta' x_t + \varepsilon_t\) with some error term \(\varepsilon_t\) and has been studied in abundance. For \(K\leq T\), the well-known ordinary-least square (OLS) estimator for the \((K \times 1)\) vector \(\beta\) minimizes the sum of squared residuals and is then \[\hat{\beta}^\text{ols} = \left(\sum\limits_{t=1}^T x_t'x_t\right)^{-1} \sum\limits_{t=1}^T x_t'y_t. \tag{1}\]

While we are often interested in the estimated coefficient vector \(\hat\beta^\text{ols}\), ML is about the predictive performance most of the time. For a new observation \(\tilde{x}_t\), the linear model generates predictions such that \[\hat y_t = E\left(y|x_t = \tilde x_t\right) = \hat\beta^\text{ols}{}' \tilde x_t. \tag{2}\] Is this the best we can do? Not necessarily: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators \(\hat{\beta}\) with lower variance than the estimator \(\hat\beta^\text{ols}\). At the same time, it seems appealing to restrict the set of regressors to a few meaningful ones, if possible. In other words, if \(K\) is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to *select* reasonable factors and set \(\hat\beta^{\text{ols}}_k = 0\) for some redundant factors.

It should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that \(E\left(\hat\beta\right) \neq \beta\). What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for \(y\): \[y = f(x) + \varepsilon, \quad \varepsilon \sim (0, \sigma_\varepsilon^2) \tag{3}\] We want to recover \(f(x)\), which denotes some unknown functional which maps the relationship between \(x\) and \(y\). While the properties of \(\hat\beta^\text{ols}\) as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. Alternative predictors \(\hat{f}(x)\) could be more desirable: For instance, the MSE depends on our model choice as follows: \[\begin{aligned} MSE &=E\left(\left(y-\hat{f}(x)\right)^2\right)=E\left(\left(f(x)+\epsilon-\hat{f}(x)\right)^2\right)\\ &= \underbrace{E\left(\left(f(x)-\hat{f}(x)\right)^2\right)}_{\text{total quadratic error}}+\underbrace{E\left(\epsilon^2\right)}_{\text{irreducible error}} \\ &= E\left(\hat{f}(x)^2\right)+E\left(f(x)^2\right)-2E\left(f(x)\hat{f}(x)\right)+\sigma_\varepsilon^2\\ &=E\left(\hat{f}(x)^2\right)+f(x)^2-2f(x)E\left(\hat{f}(x)\right)+\sigma_\varepsilon^2\\ &=\underbrace{\text{Var}\left(\hat{f}(x)\right)}_{\text{variance of model}}+ \underbrace{\left(E(f(x)-\hat{f}(x))\right)^2}_{\text{squared bias}} +\sigma_\varepsilon^2. \end{aligned} \tag{4}\] While no model can reduce \(\sigma_\varepsilon^2\), a biased estimator with small variance may have a lower MSE than an unbiased estimator.

### Ridge regression

One biased estimator is known as Ridge regression. Hoerl and Kennard (1970) propose to minimize the sum of squared errors *while simultaneously imposing a penalty on the \(L_2\) norm of the parameters* \(\hat\beta\). Formally, this means that for a penalty factor \(\lambda\geq 0\), the minimization problem takes the form \(\min_\beta \left(y - X\beta\right)'\left(y - X\beta\right)\text{ s.t. } \beta'\beta \leq c\). Here \(c\geq 0\) is a constant that depends on the choice of \(\lambda\). The larger \(\lambda\), the smaller \(c\) (technically speaking, there is a one-to-one relationship between \(\lambda\), which corresponds to the Lagrangian of the minimization problem above and \(c\)). Here, \(X = \left(x_1 \ldots x_T\right)'\) and \(y = \left(y_1, \ldots, y_T\right)'\). A closed-form solution for the resulting regression coefficient vector \(\beta^\text{ridge}\) exists: \[\hat{\beta}^\text{ridge} = \left(X'X + \lambda I\right)^{-1}X'y,
\tag{5}\] where \(I\) is the identity matrix of dimension \(K\). A couple of observations are worth noting: \(\hat\beta^\text{ridge} = \hat\beta^\text{ols}\) for \(\lambda = 0\) and \(\hat\beta^\text{ridge} \rightarrow 0\) for \(\lambda\rightarrow \infty\). Also for \(\lambda > 0\), \(\left(X'X + \lambda I\right)\) is non-singular even if \(X'X\) is which means that \(\hat\beta^\text{ridge}\) exists even if \(\hat\beta\) is not defined. However, note also that the Ridge estimator requires careful choice of the hyperparameter \(\lambda\) which controls the *amount of regularization*: a larger value of \(\lambda\) implies *shrinkage* of the regression coefficient toward 0; a smaller value of \(\lambda\) reduces the bias of the resulting estimator.

Note that \(X\) usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that \(y\) is simply demeaned before computing \(\hat\beta^\text{ridge}\).

What about the statistical properties of the Ridge estimator? First, the bad news is that \(\hat\beta^\text{ridge}\) is a biased estimator of \(\beta\). However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be *smaller* than the variance of the OLS estimator. We encourage you to verify these two statements in the Exercises. As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.

### Lasso

An alternative to Ridge regression is the Lasso (*l*east *a*bsolute *s*hrinkage and *s*election *o*perator). Similar to Ridge regression, the Lasso (Tibshirani 1996) is a penalized and biased estimator. The main difference to Ridge regression is that Lasso does not only *shrink* coefficients but effectively selects variables by setting coefficients for *irrelevant* variables to zero. Lasso implements a \(L_1\) penalization on the parameters such that: \[\hat\beta^\text{Lasso} = \arg\min_\beta \left(Y - X\beta\right)'\left(Y - X\beta\right)\text{ s.t. } \sum\limits_{k=1}^K|\beta_k| < c(\lambda). \tag{6}\] There is no closed-form solution for \(\hat\beta^\text{Lasso}\) in the above maximization problem, but efficient algorithms exist (e.g., the `glmnet`

package for R and Python). Like for Ridge regression, the hyperparameter \(\lambda\) has to be specified beforehand.

The corresponding Lagrangian reads as follows \[\begin{aligned}\hat\beta_\lambda^\text{Lasso} = \arg\min_\beta \left(Y - X\beta\right)'\left(Y - X\beta\right) + \lambda\sum\limits_{k=1}^K|\beta_k|.\end{aligned} \tag{7}\]

### Elastic Net

The Elastic Net (Zou and Hastie 2005) combines \(L_1\) with \(L_2\) penalization and encourages a grouping effect, where strongly correlated predictors tend to be in or out of the model together. In terms of the Lagrangian, this more general framework considers the following optimization problem: \[\hat\beta^\text{EN} = \arg\min_\beta \left(Y - X\beta\right)'\left(Y - X\beta\right) + \lambda(1-\rho)\sum\limits_{k=1}^K|\beta_k| +\frac{1}{2}\lambda\rho\sum\limits_{k=1}^K\beta_k^2 \tag{8}\] Now, we have to choose two hyperparameters: the *shrinkage* factor \(\lambda\) and the *weighting parameter* \(\rho\). The Elastic Net resembles Lasso for \(\rho = 0\) and Ridge regression for \(\rho = 1\). While the `glmnet`

package provides efficient algorithms to compute the coefficients of penalized regressions, it is a good exercise to implement Ridge and Lasso estimation on your own before you use the `scikit-learn`

back-end.

## Python Packages

To get started, we load the required packages and data. The main focus is on the workflow behind the `scikit-learn`

(Pedregosa et al. 2011) package collection.

## Data Preparation

In this analysis, we use four different data sources that we load from our SQLite database introduced in Accessing and Managing Financial Data. We start with two different sets of factor portfolio returns which have been suggested as representing practical risk factor exposure and thus should be relevant when it comes to asset pricing applications.

- The standard workhorse: monthly Fama-French 3 factor returns (market, small-minus-big, and high-minus-low book-to-market valuation sorts) defined in Fama and French (1992) and Fama and French (1993).
- Monthly q-factor returns from Hou, Xue, and Zhang (2014). The factors contain the size factor, the investment factor, the return-on-equity factor, and the expected growth factor.

Next, we include macroeconomic predictors which may predict the general stock market economy. Macroeconomic variables effectively serve as conditioning information such that their inclusion hints at the relevance of conditional models instead of unconditional asset pricing. We refer the interested reader to Cochrane (2009) on the role of conditioning information.

- Our set of macroeconomic predictors comes from Welch and Goyal (2008). The data has been updated by the authors until 2021 and contains monthly variables that have been suggested as good predictors for the equity premium. Some of the variables are the dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury bill rate, and inflation.

Finally, we need a set of *test assets*. The aim is to understand which of the plenty factors and macroeconomic variable combinations prove helpful in explaining our test assets’ cross-section of returns. In line with many existing papers, we use monthly portfolio returns from ten different industries according to the definition from Kenneth French’s homepage as test assets.

```
= sqlite3.connect(database="data/tidy_finance_python.sqlite")
tidy_finance
= (pd.read_sql_query(
factors_ff3_monthly ="SELECT * FROM factors_ff3_monthly",
sql=tidy_finance,
con={"date"})
parse_dates"factor_ff_")
.add_prefix(
)
= (pd.read_sql_query(
factors_q_monthly ="SELECT * FROM factors_q_monthly",
sql=tidy_finance,
con={"date"})
parse_dates"factor_q_")
.add_prefix(
)
= (pd.read_sql_query(
macro_predictors ="SELECT * FROM macro_predictors",
sql=tidy_finance,
con={"date"})
parse_dates"macro_")
.add_prefix(
)
= (pd.read_sql_query(
industries_ff_monthly ="SELECT * FROM industries_ff_monthly",
sql=tidy_finance,
con={"date"})
parse_dates="date", var_name="industry", value_name="ret")
.melt(id_vars )
```

We combine all the monthly observations into one dataframe.

```
= (industries_ff_monthly
data
.merge(factors_ff3_monthly, ="left", left_on="date", right_on="factor_ff_date")
how
.merge(factors_q_monthly, ="left", left_on="date", right_on="factor_q_date")
how
.merge(macro_predictors, ="left", left_on="date", right_on="macro_date")
how=lambda x: x["ret"] - x["factor_ff_rf"])
.assign(ret_excess=["ret", "factor_ff_date", "factor_q_date", "macro_date"])
.drop(columns
.dropna() )
```

Our data contains 23 columns of regressors with the 13 macro-variables and 9 factor returns for each month. Figure 1 provides summary statistics for the 10 monthly industry excess returns in percent. One can see that the dispersion in the excess returns varies widely across industries.

```
= (ggplot(data,
data_plot ="industry", y="ret_excess")) +
aes(x+
geom_boxplot() +
coord_flip() ="", y="",
labs(x="Excess return distributions by industry in percent") +
title=percent_format())
scale_y_continuous(labels
) data_plot.draw()
```

## Machine Learning Workflow

To illustrate penalized linear regressions, we employ the `scikit-learn`

collection of packages for modeling and ML. Using the ideas of Ridge and Lasso regressions, the following example guides you through (i) pre-processing the data (data split and variable mutation), (ii) building models, (iii) fitting models, and (iv) tuning models to create the “best” possible predictions.

### Pre-process data

We want to explain excess returns with all available predictors. The regression equation thus takes the form \[r_{t} = \alpha_0 + \left(\tilde f_t \otimes \tilde z_t\right)B + \varepsilon_t \tag{9}\] where \(r_t\) is the vector of industry excess returns at time \(t\), \(\otimes\) denotes the Kronecker product and \(\tilde f_t\) and \(\tilde z_t\) are the (standardized) vectors of factor portfolio returns and macroeconomic variables.

We hence perform the following pre-processing steps:

- We exclude the column
*month*from the analysis - We include all interaction terms between factors and macroeconomic predictors
- We demean and scale each regressor such that the standard deviation is one

Scaling is often necessary in machine learning applications, especially when combining variables of different magnitudes or units, or when using algorithms sensitive to feature scales (e.g., gradient descent-based algorithms). We use `ColumnTransformer()`

to scale all regressors using `StandardScaler()`

. The `remainder="drop"`

ensures that only the specified columns are retained in the output, and others are dropped. The option `verbose_feature_names_out=False`

ensures that the output feature names remain unchanged. Also note that we use the `zip()`

function to pair each element from `column_names`

with its corresponding list of values from `new_column_values`

, creating tuples, and then convert these tuples into a dictionary using `dict()`

from which we create a dataframe.

```
= data.filter(like="macro").columns
macro_variables = data.filter(like="factor").columns
factor_variables
= list(product(macro_variables, factor_variables))
column_combinations
= []
new_column_values for macro_column, factor_column in column_combinations:
* data[factor_column])
new_column_values.append(data[macro_column]
= [" x ".join(t) for t in column_combinations]
column_names = pd.DataFrame(dict(zip(column_names, new_column_values)))
new_columns
= pd.concat([data, new_columns], axis=1)
data
= ColumnTransformer(
preprocessor =[
transformers"scale", StandardScaler(),
(for col in data.columns
[col if col not in ["ret_excess", "date", "industry"]])
],="drop",
remainder=False
verbose_feature_names_out )
```

### Build a model

Next, we can build an actual model based on our pre-processed data. In line with the definition above, we estimate regression coefficients of a Lasso regression such that we get

\[\begin{aligned}\hat\beta_\lambda^\text{Lasso} = \arg\min_\beta \left(Y - X\beta\right)'\left(Y - X\beta\right) + \lambda\sum\limits_{k=1}^K|\beta_k|.\end{aligned} \tag{10}\] In the application at hand, \(X\) contains 117 columns with all possible interactions between factor returns and macroeconomic variables. We want to emphasize that the workflow for *any* model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and, later, Neural networks or Random forests with similar code. For now, we start with the linear regression model with an arbitrarily chosen value for the penalty factor \(\lambda\) (denoted as `alpha=0.007`

in the code below). In the setup below, `l1_ratio`

denotes the value of \(1-\rho\), hence setting `l1_ratio=1`

implies the Lasso.

```
= ElasticNet(
lm_model =0.007,
alpha=1,
l1_ratio=5000,
max_iter=False
fit_intercept
)
= Pipeline([
lm_pipeline "preprocessor", preprocessor),
("regressor", lm_model)
( ])
```

That’s it - we are done! The object `lm_model_pipeline`

contains the definition of our model with all required information, in particular the pre-processing steps and the regression model.

### Fit a model

With the pipeline from above, we are ready to fit it to the data. Typically, we use training data to fit the model. The training data is pre-processed according to our recipe steps, and the Lasso regression coefficients are computed. For illustrative purposes, we focus on the manufacturing industry for now.

```
= data.query("industry == 'manuf'")
data_manufacturing = "2011-12-01"
training_date
= (data_manufacturing
data_manufacturing_training f"date<'{training_date}'")
.query(
)
= lm_pipeline.fit(
lm_fit
data_manufacturing_training, "ret_excess")
data_manufacturing_training.get( )
```

First, we focus on the in-sample predicted values \(\hat{y}_t = x_t\hat\beta^\text{Lasso}.\) Figure 2 illustrates the projections for the *entire* time series of the manufacturing industry portfolio returns.

```
= (pd.DataFrame({
predicted_values "Fitted value": lm_fit.predict(data_manufacturing),
"Realization": data_manufacturing.get("ret_excess")
})= data_manufacturing["date"])
.assign(date ="date", var_name="Variable", value_name="return")
.melt(id_vars
)
= (
predicted_values_plot
ggplot(predicted_values, ="date", y="return",
aes(x="Variable", linetype="Variable")) +
color
annotate("rect",
=data_manufacturing_training["date"].max(),
xmin=data_manufacturing["date"].max(),
xmax=-np.inf, ymax=np.inf,
ymin=0.25, fill="#808080"
alpha+
) +
geom_line() ="", y="", color="", linetype="",
labs(x="Monthly realized and fitted manufacturing risk premia") +
title=date_breaks("5 years"),
scale_x_datetime(breaks=date_format("%Y")) +
labels=percent_format())
scale_y_continuous(labels
) predicted_values_plot.draw()
```

What do the estimated coefficients look like? To analyze these values, it is worth computing the coefficients \(\hat\beta^\text{Lasso}\) directly. The code below estimates the coefficients for the Lasso and Ridge regression for the processed training data sample for a grid of different \(\lambda\)’s.

```
= preprocessor.fit_transform(data_manufacturing)
x = data_manufacturing["ret_excess"]
y
= np.logspace(-5, 5, 100)
alphas
= []
coefficients_lasso for a in alphas:
= Lasso(alpha=a, fit_intercept=False)
lasso
coefficients_lasso.append(lasso.fit(x, y).coef_)
= (pd.DataFrame(coefficients_lasso)
coefficients_lasso =alphas, model="Lasso")
.assign(alpha=["alpha", "model"])
.melt(id_vars
)
= []
coefficients_ridge for a in alphas:
= Ridge(alpha=a, fit_intercept=False)
ridge
coefficients_ridge.append(ridge.fit(x, y).coef_)
= (pd.DataFrame(coefficients_ridge)
coefficients_ridge =alphas, model="Ridge")
.assign(alpha=["alpha", "model"])
.melt(id_vars )
```

The dataframes `lasso_coefficients`

and `ridge_coefficients`

contain an entire sequence of estimated coefficients for multiple values of the penalty factor \(\lambda\). Figure 3 illustrates the trajectories of the regression coefficients as a function of the penalty factor. Both Lasso and Ridge coefficients converge to zero as the penalty factor increases.

```
= (
coefficients_plot
ggplot(pd.concat([coefficients_lasso, coefficients_ridge]), ="alpha", y="value", color="variable")) +
aes(x+
geom_line() "model") +
facet_wrap(="Penalty factor (lambda)", y="",
labs(x="Estimated coefficient paths for different penalty factors") +
title+
scale_x_log10() ="none"))
theme(legend_position coefficients_plot.draw()
```

### Tune a model

To compute \(\hat\beta_\lambda^\text{Lasso}\) , we simply imposed an arbitrary value for the penalty hyperparameter \(\lambda\). Model tuning is the process of optimally selecting such hyperparameters through *cross-validation*.

The goal for choosing \(\lambda\) (or any other hyperparameter, e.g., \(\rho\) for the Elastic Net) is to find a way to produce predictors \(\hat{Y}\) for an outcome \(Y\) that minimizes the mean squared prediction error \(\text{MSPE} = E\left( \frac{1}{T}\sum_{t=1}^T (\hat{y}_t - y_t)^2 \right)\). Unfortunately, the MSPE is not directly observable. We can only compute an estimate because our data is random and because we do not observe the entire population.

Obviously, if we train an algorithm on the same data that we use to compute the error, our estimate \(\text{MSPE}\) would indicate way better predictive accuracy than what we can expect in real out-of-sample data. The result is called overfitting.

Cross-validation is a technique that allows us to alleviate this problem. We approximate the true MSPE as the average of many MSPE obtained by creating predictions for \(K\) new random samples of the data, none of them used to train the algorithm \(\frac{1}{K} \sum_{k=1}^K \frac{1}{T}\sum_{t=1}^T \left(\hat{y}_t^k - y_t^k\right)^2\). In practice, this is done by carving out a piece of our data and pretending it is an independent sample. We again divide the data into a training set and a test set. The MSPE on the test set is our measure for actual predictive ability, while we use the training set to fit models with the aim to find the *optimal* hyperparameter values. To do so, we further divide our training sample into (several) subsets, fit our model for a grid of potential hyperparameter values (e.g., \(\lambda\)), and evaluate the predictive accuracy on an *independent* sample. This works as follows:

- Specify a grid of hyperparameters
- Obtain predictors \(\hat{y}_i(\lambda)\) to denote the predictors for the used parameters \(\lambda\)
- Compute \[ \text{MSPE}(\lambda) = \frac{1}{K} \sum_{k=1}^K \frac{1}{T}\sum_{t=1}^T \left(\hat{y}_t^k(\lambda) - y_t^k\right)^2 . \tag{11}\] With K-fold cross-validation, we do this computation \(K\) times. Simply pick a validation set with \(M=T/K\) observations at random and think of these as random samples \(y_1^k, \dots, y_{\tilde{T}}^k\), with \(k=1\).

How should you pick \(K\)? Large values of \(K\) are preferable because the training data better imitates the original data. However, larger values of \(K\) will have much higher computation time. `scikit-learn`

provides all required tools to conduct \(K\)-fold cross-validation. We just have to update our model specification. In our case, we specify the penalty factor \(\lambda\) as well as the mixing factor \(\rho\) as *free* parameters.

For our sample, we consider a time-series cross-validation sample. This means that we tune our models with 20 samples of length five years with a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate \(\hat{\text{MSPE}}\) in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.

Then, we evaluate the performance for a grid of different penalty values. `scikit-learn`

provides functionalities to construct a suitable grid of hyperparameters with `GridSearchCV()`

. The code chunk below creates a \(10 \times 3\) hyperparameters grid. Then, the method `fit()`

evaluates all the models for each fold.

```
= 5
initial_years = 48
assessment_months = int(len(data_manufacturing)/assessment_months) - 1
n_splits = 12
length_of_year = np.logspace(-6, 2, 100)
alphas
= TimeSeriesSplit(
data_folds =n_splits,
n_splits=assessment_months,
test_size=initial_years * length_of_year
max_train_size
)
= {
params "regressor__alpha": alphas,
"regressor__l1_ratio": (0.0, 0.5, 1)
}
= GridSearchCV(
finder
lm_pipeline,=params,
param_grid="neg_root_mean_squared_error",
scoring=data_folds
cv
)
= finder.fit(
finder "ret_excess")
data_manufacturing, data_manufacturing.get( )
```

After the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the *optimal* model. Figure 4 illustrates the average validation set’s root mean-squared error for each value of \(\lambda\) and \(\rho\).

```
= (pd.DataFrame(finder.cv_results_)
validation
.assign(=lambda x: -x["mean_test_score"],
mspe=lambda x: pd.to_numeric(
param_regressor__alpha"param_regressor__alpha"], errors="coerce"
x[
)
)
)
= (ggplot(validation,
validation_plot ="param_regressor__alpha", y="mspe",
aes(x="param_regressor__l1_ratio",
color="param_regressor__l1_ratio",
shape="param_regressor__l1_ratio")) +
group+
geom_point() +
geom_line() ="Penalty factor (lambda)", y="Root MSPE",
labs(x ="Root MSPE for different penalty factors",
title="Proportion of Lasso Penalty",
color="Proportion of Lasso Penalty") +
shape+
scale_x_log10() ="none")
guides(linetype
) validation_plot.draw()
```

Figure 4 shows that the MSPE drops faster for Lasso and Elastic Net compared to Ridge regressions as penalty factor increases. However, for higher penalty factors, the MSPE for Ridge regressions dips below the others, which both slightly increase again above a certain threshold. Recall that the larger the regularization, the more restricted the model becomes. The best performing model yields a penalty parameter (`alpha`

) of 0.0043 and a mixture factor (\(\rho\)) of 0.5.

### Full workflow

Our starting point was the question: Which factors determine industry returns? While Avramov et al. (2023) provide a Bayesian analysis related to the research question above, we choose a simplified approach: To illustrate the entire workflow, we now run the penalized regressions for all ten industries. We want to identify relevant variables by fitting Lasso models for each industry returns time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor \(\lambda\).

First, we define the Lasso model with one tuning parameter.

```
= Lasso(fit_intercept=False, max_iter=5000)
lm_model
= {"regressor__alpha": alphas}
params
= Pipeline([
lm_pipeline "preprocessor", preprocessor),
("regressor", lm_model)
( ])
```

The following task can be easily parallelized to reduce computing time, but we use a simple loop for ease of exposition.

```
= data["industry"].drop_duplicates()
all_industries
= []
results for industry in all_industries:
print(industry)
= GridSearchCV(
finder
lm_pipeline,=params,
param_grid="neg_mean_squared_error",
scoring=data_folds
cv
)
= finder.fit(
finder "industry == @industry"),
data.query("industry == @industry").get("ret_excess")
data.query(
)
results.append(!= 0)
pd.DataFrame(finder.best_estimator_.named_steps.regressor.coef_
)
= (
selected_factors
pd.DataFrame(-1].get_feature_names_out(),
lm_pipeline[:=["variable"]
columns
)= lambda x: (
.assign(variable "variable"].str.replace("factor_|ff_|q_|macro_",""))
x[
)**dict(zip(all_industries, results)))
.assign(="variable", var_name ="industry")
.melt(id_vars"value == True")
.query( )
```

What has just happened? In principle, exactly the same as before but instead of computing the Lasso coefficients for one industry, we did it for ten sequentially. Now, we just have to do some housekeeping and keep only variables that Lasso does *not* set to zero. We illustrate the results in a heat map in Figure 5.

```
= (
selected_factors_plot
ggplot(selected_factors, ="variable", y="industry")) +
aes(x+
geom_tile() ="", y="",
labs(x="Selected variables for different industries") +
title+
coord_flip() =reversed) +
scale_x_discrete(limits=element_text(rotation=70, hjust=1),
theme(axis_text_x=(6.4, 6.4))
figure_size
) selected_factors_plot.draw()
```

The heat map in Figure 5 conveys two main insights: first, we see that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only `factor_ff_mkt_excess`

and its interaction with `macro_bm`

a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by Lasso for Utilities, many factors are selected for, e.g., Durable and Manufacturing, but the selected factors do not necessarily coincide. In other words, there seems to be a clear picture that we do not need many factors, but Lasso does not provide a factor that consistently provides pricing abilities across industries.

## Exercises

- Write a function that requires three inputs, namely,
`y`

(a \(T\) vector),`X`

(a \((T \times K)\) matrix), and`lambda`

and then returns the Ridge estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should indicate whether \(X\) contains a vector of ones as the first column, which should be exempt from the \(L_2\) penalty. - Compute the \(L_2\) norm (\(\beta'\beta\)) for the regression coefficients based on the predictive regression from the previous exercise for a range of \(\lambda\)’s and illustrate the effect of penalization in a suitable figure.
- Now, write a function that requires three inputs, namely,
`y`

(a \(T\) vector),`X`

(a \((T \times K)\) matrix), and \(\lambda\) and then returns the Lasso estimator (a \(K\) vector) for a given penalization parameter \(\lambda\). Recall that the intercept should not be penalized. Therefore, your function should indicate whether \(X\) contains a vector of ones as the first column, which should be exempt from the \(L_1\) penalty. - After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the
`glmnet`

package’s documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called*Elastic Nets*.

## References

*The Journal of Finance*78 (3): 1593–1646. https://doi.org/10.1111/jofi.13226.

*Asset pricing (revised edition)*. Princeton University Press.

*The Journal of Finance*66 (4): 1047–1108. https://doi.org/10.1111/j.1540-6261.2011.01671.x.

*Advances in financial machine learning*. John Wiley & Sons. https://doi.org/10.5555/3217448.

*Machine learning in finance*. Springer.

*Review of Financial Studies*34 (7): 3316–63. https://doi.org/10.1093/rfs/hhaa078.

*The Journal of Finance*47 (2): 427–65. https://doi.org/2329112.

*Journal of Financial Economics*33 (1): 3–56. https://doi.org/10.1016/0304-405X(93)90023-5.

*An introduction to statistical learning: With applications in R*. Springer.

*The Journal of Finance*72 (4): 1399–1440. https://doi.org/10.1111/jofi.12530.

*Review of Financial Studies*29 (1): 5–68. https://doi.org/10.1093/rfs/hhv059.

*The elements of statistical learning: Data mining, inference and prediction*. 2nd ed. Springer. https://hastie.su.domains/ElemStatLearn/.

*Technometrics*12 (1): 69–82. https://doi.org/1267352.

*Review of Financial Studies*28 (3): 650–705. https://doi.org/10.1093/rfs/hhu068.

*Review of Financial Studies*33 (5): 2019–2133. https://doi.org/10.1093/rfs/hhy131.

*The Journal of Finance*71 (1): 5–32. https://doi.org/10.1111/jofi.12365.

*Journal of Economic Perspectives*31 (2): 87–106. https://doi.org/10.1257/jep.31.2.87.

*Machine learning in asset pricing*. Princeton University Press.

*Journal of Machine Learning Research*12: 2825–30.

*Journal of the Royal Statistical Society. Series B (Methodological)*58 (1): 267–88. http://www.jstor.org/stable/2346178.

*Review of Financial Studies*21 (4): 1455–1508. https://doi.org/10.1093/rfs/hhm014.

*Journal of the Royal Statistical Society. Series B (Statistical Methodology)*67 (2): 301–20. https://www.jstor.org/stable/3647580.