```
import pandas as pd
import numpy as np
import datetime as dt
import sqlite3
```

# Value and Bivariate Sorts

You are reading the work-in-progress edition of **Tidy Finance with Python**. Code chunks and text might change over the next couple of months. We are always looking for feedback via contact@tidy-finance.org. Meanwhile, you can find the complete R version here.

In this chapter, we extend univariate portfolio analysis to bivariate sorts, which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are regularly used in the academic asset pricing literature and are the basis for the Fama and French three factors. However, some scholars also use sorts with three grouping variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.

We form portfolios on firm size and the book-to-market ratio. To calculate book-to-market ratios, accounting data is required, which necessitates additional steps during portfolio formation. In the end, we demonstrate how to form portfolios on two sorting variables using so-called independent and dependent portfolio sorts.

The current chapter relies on this set of packages.

## Data Preparation

First, we load the necessary data from our `SQLite`

-database introduced in Chapters 2-4. We conduct portfolio sorts based on the CRSP sample but keep only the necessary columns in our memory. We use the same data sources for firm size as in Chapter 8.

```
= sqlite3.connect("data/tidy_finance.sqlite")
tidy_finance
= (pd.read_sql_query(
crsp_monthly ="SELECT permno, gvkey, month, ret_excess, mktcap, mktcap_lag, exchange FROM crsp_monthly",
sql=tidy_finance,
con={"permno": np.int32},
dtype={"month": {"unit": "D","origin": "unix"}})
parse_dates
.dropna() )
```

Further, we utilize accounting data. The most common source of accounting data is Compustat. We only need book equity data in this application, which we select from our database. Additionally, we convert the variable `datadate`

to its monthly value, as we only consider monthly returns here and do not need to account for the exact date.

```
= (pd.read_sql_query(
book_equity ="SELECT gvkey, datadate, be from compustat",
sql=tidy_finance,
con={"datadate": {"unit":"D", "origin":"unix"}})
parse_dates
.dropna()= lambda x: (pd.to_datetime(x["datadate"]).dt.to_period("M").dt.to_timestamp()))
.assign(month )
```

## Book-to-Market Ratio

A fundamental problem in handling accounting data is the *look-ahead bias* - we must not include data in forming a portfolio that is not public knowledge at the time. Of course, researchers have more information when looking into the past than agents had at that moment. However, abnormal excess returns from a trading strategy should not rely on an information advantage because the differential cannot be the result of informed agents’ trades. Hence, we have to lag accounting information.

We continue to lag market capitalization and firm size by one month. Then, we compute the book-to-market ratio, which relates a firm’s book equity to its market equity. Firms with high (low) book-to-market ratio are called value (growth) firms. After matching the accounting and market equity information from the same month, we lag book-to-market by six months. This is a sufficiently conservative approach because accounting information is usually released well before six months pass. However, in the asset pricing literature, even longer lags are used as well.^{1}

Having both variables, i.e., firm size lagged by one month and book-to-market lagged by six months, we merge these sorting variables to our returns using the `sorting_date`

-column created for this purpose. The final step in our data preparation deals with differences in the frequency of our variables. Returns and firm size are recorded monthly. Yet the accounting information is only released on an annual basis. Hence, we only match book-to-market to one month per year and have eleven empty observations. To solve this frequency issue, we carry the latest book-to-market ratio of each firm to the subsequent months, i.e., we fill the missing observations with the most current report. This is done via the `fillna()`

-function after sorting by date and firm (which we identify by `permno`

and `gvkey`

) and on a firm basis (which we do by `.groupby()`

as usual). We filter out all observations with accounting data that is older than a year. As the last step, we remove all rows with missing entries because the returns cannot be matched to any annual report.

```
= (crsp_monthly
me = lambda x: x["month"] + pd.DateOffset(months=1))
.assign(sorting_date ={"mktcap": "me"})
.rename(columns"permno", "sorting_date", "me"])
.get([
)
= (book_equity
bm ="inner", on=["gvkey", "month"])
.merge(crsp_monthly, how= lambda x: x["be"] / x["mktcap"],
.assign(bm = lambda x: x["month"] + pd.DateOffset(months=6))
sorting_date = lambda x: x["sorting_date"])
.assign(comp_date "permno", "gvkey", "sorting_date", "comp_date", "bm"])
.get([
)
= (crsp_monthly
data_for_sorts
.merge(bm, ="left",
how=["permno", "gvkey", "month"],
left_on=["permno", "gvkey", "sorting_date"])
right_on
.merge(me, ="left",
how=["permno", "month"],
left_on=["permno", "sorting_date"])
right_on"permno", "gvkey", "month", "ret_excess",
.get(["mktcap_lag", "me", "bm", "exchange", "comp_date"])
)
# TODO: only fill bm and comp_date, but keep all columns
= (data_for_sorts
data_for_sorts =["permno", "gvkey", "month"])
.sort_values(by"permno", "gvkey"])
.groupby([apply(lambda x: x.fillna(method="ffill"))
.= True)
.reset_index(drop = lambda x: x["month"] - pd.DateOffset(months=12))
.assign(threshold_date "comp_date > threshold_date")
.query(=["comp_date", "threshold_date"])
.drop(columns
.dropna() )
```

The last step of preparation for the portfolio sorts is the computation of breakpoints. We continue to use the same function allowing for the specification of exchanges to use for the breakpoints. Additionally, we reintroduce the argument `sorting_variable`

into the function for defining different sorting variables.

```
def assign_portfolio(data, sorting_variable, n_portfolios, exchanges):
= (data
breakpoints "exchange in @exchanges")
.query(
.get(sorting_variable)0, 1, num=n_portfolios+1),
.quantile(np.linspace(="linear")
interpolation
.drop_duplicates()
)= pd.cut(data[sorting_variable],
assigned_portfolios =breakpoints,
bins=pd.Series(range(1, breakpoints.size)),
labels=True,
include_lowest="drop")
duplicatesreturn assigned_portfolios
```

After these data preparation steps, we present bivariate portfolio sorts on an independent and dependent basis.

## Independent Sorts

Bivariate sorts create portfolios within a two-dimensional space spanned by two sorting variables. It is then possible to assess the return impact of either sorting variable by the return differential from a trading strategy that invests in the portfolios at either end of the respective variables spectrum. We create a five-by-five matrix using book-to-market and firm size as sorting variables in our example below. We end up with 25 portfolios. Since we are interested in the *value premium* (i.e., the return differential between high and low book-to-market firms), we go long the five portfolios of the highest book-to-market firms and short the five portfolios of the lowest book-to-market firms. The five portfolios at each end are due to the size splits we employed alongside the book-to-market splits.

To implement the independent bivariate portfolio sort, we assign monthly portfolios for each of our sorting variables separately to create the variables `portfolio_bm`

and `portfolio_me`

, respectively. Then, these separate portfolios are combined to the final sort stored in `portfolio_combined`

. After assigning the portfolios, we compute the average return within each portfolio for each month. Additionally, we keep the book-to-market portfolio as it makes the computation of the value premium easier. The alternative would be to disaggregate the combined portfolio in a separate step. Notice that we weigh the stocks within each portfolio by their market capitalization, i.e., we decide to value-weight our returns.

```
= (data_for_sorts
value_portfolios "month", group_keys=False)
.groupby(apply(lambda x: x.assign(
.= assign_portfolio(data=x,
portfolio_bm ="bm",
sorting_variable=5,
n_portfolios=["NYSE"]),
exchanges= assign_portfolio(data=x,
portfolio_me ="me",
sorting_variable=5,
n_portfolios=["NYSE"])))
exchanges"month", "portfolio_bm", "portfolio_me"])
.groupby([apply(lambda x: pd.Series(
."ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])}))
{
.reset_index() )
```

Equipped with our monthly portfolio returns, we are ready to compute the value premium. However, we still have to decide how to invest in the five high and the five low book-to-market portfolios. The most common approach is to weigh these portfolios equally, but this is yet another researcher’s choice. Then, we compute the return differential between the high and low book-to-market portfolios and show the average value premium.

```
= (value_portfolios
value_premium "month", "portfolio_bm"])
.groupby(["ret": "mean"})
.aggregate({
.reset_index()"month")
.groupby(apply(lambda x: pd.Series(
."value_premium": (x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() - x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean())}))
{
.reset_index()"value_premium": "mean"})
.aggregate({
)print(value_premium * 100)
```

```
value_premium 0.381508
dtype: float64
```

The resulting annualized value premium is 4.608 percent.

## Dependent Sorts

In the previous exercise, we assigned the portfolios without considering the second variable in the assignment. This protocol is called independent portfolio sorts. The alternative, i.e., dependent sorts, creates portfolios for the second sorting variable within each bucket of the first sorting variable. In our example below, we sort firms into five size buckets, and within each of those buckets, we assign firms to five book-to-market portfolios. Hence, we have monthly breakpoints that are specific to each size group. The decision between independent and dependent portfolio sorts is another choice for the researcher. Notice that dependent sorts ensure an equal amount of stocks within each portfolio.

To implement the dependent sorts, we first create the size portfolios by calling `assign_portfolio()`

with `sorting_variable = "me"`

. Then, we group our data again by month and by the size portfolio before assigning the book-to-market portfolio. The rest of the implementation is the same as before. Finally, we compute the value premium.

```
= (data_for_sorts
value_portfolios "month", group_keys=False)
.groupby(apply(lambda x: x.assign(
.= assign_portfolio(data=x,
portfolio_me ="me",
sorting_variable=5,
n_portfolios=["NYSE"])))
exchanges"month", "portfolio_me"], group_keys=False)
.groupby([apply(lambda x: x.assign(
.= assign_portfolio(data=x,
portfolio_bm ="bm",
sorting_variable=5,
n_portfolios=["NYSE"])))
exchanges"month", "portfolio_bm", "portfolio_me"])
.groupby([apply(lambda x: pd.Series(
."ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])}))
{
.reset_index()
)
= (value_portfolios
value_premium "month", "portfolio_bm"])
.groupby(["ret": "mean"})
.aggregate({
.reset_index()"month")
.groupby(apply(lambda x: pd.Series(
."value_premium": x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() - x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean()}))
{
.reset_index()"value_premium": "mean"})
.aggregate({
)
print(value_premium * 100)
```

```
value_premium 0.315498
dtype: float64
```

The value premium from dependent sorts is 3.948 percent per year.

Overall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we sort the portfolios independently of each other. Yet we also discuss how to create dependent portfolio sorts. Along the lines of Chapter 8, we see how many choices a researcher has to make to implement portfolio sorts, and bivariate sorts increase the number of choices.

## Exercises

- In Chapter 8, we examine the distribution of market equity. Repeat this analysis for book equity and the book-to-market ratio (alongside a plot of the breakpoints, i.e., deciles).
- When we investigate the portfolios, we focus on the returns exclusively. However, it is also of interest to understand the characteristics of the portfolios. Write a function to compute the average characteristics for size and book-to-market across the 25 independently and dependently sorted portfolios.
- As for the size premium, also the value premium constructed here does not follow Fama and French (1993). Implement a p-hacking setup as in Chapter 8 to find a premium that comes closest to their HML premium.

## References

*Journal of Financial Economics*33 (1): 3–56. https://doi.org/10.1016/0304-405X(93)90023-5.

## Footnotes

The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as we describe in the Chapter 8 on p-hacking.↩︎