Replicating Fama-French Factors

Note

You are reading Tidy Finance with Python. You can find the equivalent chapter for the sibling Tidy Finance with R here.

In this chapter, we provide a replication of the famous Fama-French factor portfolios. The Fama-French factor models are a cornerstone of empirical asset pricing Fama and French (2015). On top of the market factor represented by the traditional CAPM beta, the three-factor model includes the size and value factors to explain the cross section of returns. Its successor, the five-factor model, additionally includes profitability and investment as explanatory factors.

We start with the three-factor model. We already introduced the size and value factors in Value and Bivariate Sorts, and their definition remains the same: size is the SMB factor (small-minus-big) that is long small firms and short large firms. The value factor is HML (high-minus-low) and is long in high book-to-market firms and short in low book-to-market counterparts.

After the replication of the three-factor model, we move to the five-factors by constructing the profitability factor RMW (robust-minus-weak) as the difference between the returns of firms with high and low operating profitability and the investment factor CMA (conservative-minus-aggressive) as the difference between firms with high versus low investment rates.

The current chapter relies on this set of Python packages.

import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf

from regtabletotext import prettify_result

Data Preparation

We use CRSP and Compustat as data sources, as we need exactly the same variables to compute the size and value factors in the way Fama and French do it. Hence, there is nothing new below, and we only load data from our SQLite database introduced in Accessing and Managing Financial Data and WRDS, CRSP, and Compustat.¹

tidy_finance = sqlite3.connect(
  database="data/tidy_finance_python.sqlite"
)

crsp_monthly = (pd.read_sql_query(
    sql=("SELECT permno, gvkey, date, ret_excess, mktcap, "
         "mktcap_lag, exchange FROM crsp_monthly"),
    con=tidy_finance,
    parse_dates={"date"})
  .dropna()
)

compustat = (pd.read_sql_query(
    sql="SELECT gvkey, datadate, be, op, inv FROM compustat",
    con=tidy_finance,
    parse_dates={"datadate"})
  .dropna()
)

factors_ff3_monthly = pd.read_sql_query(
  sql="SELECT date, smb, hml FROM factors_ff3_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)

factors_ff5_monthly = pd.read_sql_query(
  sql="SELECT date, smb, hml, rmw, cma FROM factors_ff5_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)

Yet when we start merging our dataset for computing the premiums, there are a few differences to Value and Bivariate Sorts. First, Fama and French form their portfolios in June of year \(t\), whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. It is then held constant until June of year \(t+1\).

Second, Fama and French also have a different protocol for computing the book-to-market ratio. They use market equity as of the end of year \(t - 1\) and the book equity reported in year \(t-1\), i.e., the datadate is within the last year. Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same time point as book equity.

To implement all these time lags, we again employ the temporary sorting_date-column. Notice that when we combine the information, we want to have a single observation per year and stock since we are only interested in computing the breakpoints held constant for the entire year. We ensure this by a call of drop_duplicates() at the end of the chunk below.

size = (crsp_monthly
  .query("date.dt.month == 6")
  .assign(sorting_date=lambda x: (x["date"]+pd.DateOffset(months=1)))
  .get(["permno", "exchange", "sorting_date", "mktcap"])
  .rename(columns={"mktcap": "size"})
)

market_equity = (crsp_monthly
  .query("date.dt.month == 12")
  .assign(sorting_date=lambda x: (x["date"]+pd.DateOffset(months=7)))
  .get(["permno", "gvkey", "sorting_date", "mktcap"])
  .rename(columns={"mktcap": "me"})
)

book_to_market = (compustat
  .assign(
    sorting_date=lambda x: (pd.to_datetime(
      (x["datadate"].dt.year+1).astype(str)+"0701", format="%Y%m%d")
    )
  )
  .merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
  .assign(bm=lambda x: x["be"]/x["me"])
  .get(["permno", "sorting_date", "me", "bm"])
)

sorting_variables = (size
  .merge(book_to_market, how="inner", on=["permno", "sorting_date"])
  .dropna()
  .drop_duplicates(subset=["permno", "sorting_date"])
 )

Portfolio Sorts

Next, we construct our portfolios with an adjusted assign_portfolio() function. Fama and French rely on NYSE-specific breakpoints to independently form two portfolios in the size dimension at the median and three portfolios in the dimension of book-to-market at the 30- and 70-percentiles. The sorts for book-to-market require an adjustment to the function in Value and Bivariate Sorts because it would not produce the right breakpoints. Instead of n_portfolios, we now specify percentiles, which takes the sequence of breakpoints as an object specified in the function’s call. Specifically, we give percentiles = [0, 0.3, 0.7, 1] to the function. Additionally, we perform a join with our return data to ensure that we only use traded stocks when computing the breakpoints as a first step.

def assign_portfolio(data, sorting_variable, percentiles):
    """Assign portfolios to a bin according to a sorting variable."""
    
    breakpoints = (data
      .query("exchange == 'NYSE'")
      .get(sorting_variable)
      .quantile(percentiles, interpolation="linear")
    )
    breakpoints.iloc[0] = -np.Inf
    breakpoints.iloc[breakpoints.size-1] = np.Inf
    
    assigned_portfolios = pd.cut(
      data[sorting_variable],
      bins=breakpoints,
      labels=pd.Series(range(1, breakpoints.size)),
      include_lowest=True,
      right=False
    )
    
    return assigned_portfolios

portfolios = (sorting_variables
  .groupby("sorting_date")
  .apply(lambda x: x
    .assign(
      portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1]),
      portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1])
    )
  )
  .reset_index(drop=True)
  .get(["permno", "sorting_date", "portfolio_size", "portfolio_bm"])
)

Alternatively, you can implement the same portfolio sorts using the assign_portfolio() function from the tidyfinance package, which we omit to avoid repeating almost the same code chunk as above.

Next, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column sorting_date in our return data by setting the date to sort on to July of \(t-1\) if the month is June (of year \(t\)) or earlier or to July of year \(t\) if the month is July or later.

portfolios = (crsp_monthly
  .assign(
    sorting_date=lambda x: (pd.to_datetime(
      x["date"].apply(lambda x: str(x.year-1)+
        "0701" if x.month <= 6 else str(x.year)+"0701")))
  )
  .merge(portfolios, how="inner", on=["permno", "sorting_date"])
)

Fama-French Three-Factor Model

Equipped with the return data and the assigned portfolios, we can now compute the value-weighted average return for each of the six portfolios. Then, we form the Fama-French factors. For the size factor (i.e., SMB), we go long in the three small portfolios and short the three large portfolios by taking an average across either group. For the value factor (i.e., HML), we go long in the two high book-to-market portfolios and short the two low book-to-market portfolios, again weighting them equally (using the mean() function).

factors_replicated = (portfolios
  .groupby(["portfolio_size", "portfolio_bm", "date"])
  .apply(lambda x: pd.Series({
    "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
   )
  .reset_index()
  .groupby("date")
  .apply(lambda x: pd.Series({
    "smb_replicated": (
      x["ret"][x["portfolio_size"] == 1].mean() - 
        x["ret"][x["portfolio_size"] == 2].mean()),
    "hml_replicated": (
      x["ret"][x["portfolio_bm"] == 3].mean() -
        x["ret"][x["portfolio_bm"] == 1].mean())
    }))
  .reset_index()
)

factors_replicated = (factors_replicated
  .merge(factors_ff3_monthly, how="inner", on="date")
  .round(4)
)

Replication Evaluation

In the previous section, we replicated the size and value premiums following the procedure outlined by Fama and French. The final question is then: how close did we get? We answer this question by looking at the two time-series estimates in a regression analysis using smf.ols(). If we did a good job, then we should see a non-significant intercept (rejecting the notion of systematic error), a coefficient close to one (indicating a high correlation), and an adjusted R-squared close to one (indicating a high proportion of explained variance).

To test the success of the SMB factor, we hence run the following regression:

model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_smb)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept         -0.000       0.000       -1.396    0.163
smb_replicated     0.989       0.004      233.832    0.000

Summary statistics:
- Number of observations: 738
- R-squared: 0.987, Adjusted R-squared: 0.987
- F-statistic: 54,677.232 on 1 and 736 DF, p-value: 0.000

The results for the SMB factor are quite convincing as all three criteria outlined above are met and the coefficient is 0.99 and R-squared are at 99 percent.

model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_hml)

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.000       0.000        1.880    0.061
hml_replicated     0.963       0.007      135.833    0.000

Summary statistics:
- Number of observations: 738
- R-squared: 0.962, Adjusted R-squared: 0.962
- F-statistic: 18,450.732 on 1 and 736 DF, p-value: 0.000

The replication of the HML factor is also a success, although at a slightly lower level with coefficient of 0.96 and R-squared around 96 percent.

The evidence allows us to conclude that we did a relatively good job in replicating the original Fama-French size and value premiums, although we do not know their underlying code. From our perspective, a perfect match is only possible with additional information from the maintainers of the original data.

Fama-French Five-Factor Model

Now, let us move to the replication of the five-factor model. We extend the other_sorting_variables table from above with the additional characteristics operating profitability op and investment inv. Note that the dropna() statement yields different sample sizes, as some firms with be values might not have op or inv values.

other_sorting_variables = (compustat
  .assign(
    sorting_date=lambda x: (pd.to_datetime(
      (x["datadate"].dt.year+1).astype(str)+"0701", format="%Y%m%d")
    )
  )
  .merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
  .assign(bm=lambda x: x["be"]/x["me"])
  .get(["permno", "sorting_date", "me", "bm", "op", "inv"])
)

sorting_variables = (size
  .merge(other_sorting_variables, how="inner", on=["permno", "sorting_date"])
  .dropna()
  .drop_duplicates(subset=["permno", "sorting_date"])
 )

In each month, we independently sort all stocks into the two size portfolios. The value, profitability, and investment portfolios, on the other hand, are the results of dependent sorts based on the size portfolios. We then merge the portfolios to the return data for the rest of the year just as above.

portfolios = (sorting_variables
  .groupby("sorting_date")
  .apply(lambda x: x
    .assign(
      portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1])
    )
  )
  .reset_index(drop=True)
  .groupby(["sorting_date", "portfolio_size"])
  .apply(lambda x: x
    .assign(
      portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1]),
      portfolio_op=assign_portfolio(x, "op", [0, 0.3, 0.7, 1]),
      portfolio_inv=assign_portfolio(x, "inv", [0, 0.3, 0.7, 1])
    )
  )
  .reset_index(drop=True)
  .get(["permno", "sorting_date", 
        "portfolio_size", "portfolio_bm",
        "portfolio_op", "portfolio_inv"])
)

portfolios = (crsp_monthly
  .assign(
    sorting_date=lambda x: (pd.to_datetime(
      x["date"].apply(lambda x: str(x.year-1)+
        "0701" if x.month <= 6 else str(x.year)+"0701")))
  )
  .merge(portfolios, how="inner", on=["permno", "sorting_date"])
)

Now, we want to construct each of the factors, but this time, the size factor actually comes last because it is the result of averaging across all other factor portfolios. This dependency is the reason why we keep the table with value-weighted portfolio returns as a separate object that we reuse later. We construct the value factor, HML, as above by going long the two portfolios with high book-to-market ratios and shorting the two portfolios with low book-to-market.

portfolios_value = (portfolios
  .groupby(["portfolio_size", "portfolio_bm", "date"])
  .apply(lambda x: pd.Series({
      "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
  )
  .reset_index()
)

factors_value = (portfolios_value
  .groupby("date")
  .apply(lambda x: pd.Series({
    "hml_replicated": (
      x["ret"][x["portfolio_bm"] == 3].mean() - 
        x["ret"][x["portfolio_bm"] == 1].mean())})
  )
  .reset_index()
)

For the profitability factor, RMW (robust-minus-weak), we take a long position in the two high profitability portfolios and a short position in the two low profitability portfolios.

portfolios_profitability = (portfolios
  .groupby(["portfolio_size", "portfolio_op", "date"])
  .apply(lambda x: pd.Series({
      "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
  )
  .reset_index()
)

factors_profitability = (portfolios_profitability
  .groupby("date")
  .apply(lambda x: pd.Series({
    "rmw_replicated": (
      x["ret"][x["portfolio_op"] == 3].mean() - 
        x["ret"][x["portfolio_op"] == 1].mean())})
  )
  .reset_index()
)

For the investment factor, CMA (conservative-minus-aggressive), we go long the two low investment portfolios and short the two high investment portfolios.

portfolios_investment = (portfolios
  .groupby(["portfolio_size", "portfolio_inv", "date"])
  .apply(lambda x: pd.Series({
      "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
    })
  )
  .reset_index()
)

factors_investment = (portfolios_investment
  .groupby("date")
  .apply(lambda x: pd.Series({
    "cma_replicated": (
      x["ret"][x["portfolio_inv"] == 1].mean() - 
        x["ret"][x["portfolio_inv"] == 3].mean())})
  )
  .reset_index()
)

Finally, the size factor, SMB, is constructed by going long the nine small portfolios and short the nine large portfolios.

factors_size = (
  pd.concat(
    [portfolios_value, portfolios_profitability, portfolios_investment], 
    ignore_index=True
  )
  .groupby("date")
  .apply(lambda x: pd.Series({
    "smb_replicated": (
      x["ret"][x["portfolio_size"] == 1].mean() - 
        x["ret"][x["portfolio_size"] == 2].mean())})
  )
  .reset_index()
)

We then join all factors together into one dataframe and construct again a suitable table to run tests for evaluating our replication.

factors_replicated = (factors_size
  .merge(factors_value, how="outer", on="date")
  .merge(factors_profitability, how="outer", on="date")
  .merge(factors_investment, how="outer", on="date")
)

factors_replicated = (factors_replicated
  .merge(factors_ff5_monthly, how="inner", on="date")
  .round(4)
)

Let us start the replication evaluation again with the size factor:

model_smb = (smf.ols(
    formula="smb ~ smb_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_smb)

OLS Model:
smb ~ smb_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept         -0.000       0.000       -1.437    0.151
smb_replicated     0.964       0.004      226.956    0.000

Summary statistics:
- Number of observations: 726
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 51,509.225 on 1 and 724 DF, p-value: 0.000

The results for the SMB factor are quite convincing, as all three criteria outlined above are met and the coefficient is 0.96 and the R-squared is at 99 percent.

model_hml = (smf.ols(
    formula="hml ~ hml_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_hml)

OLS Model:
hml ~ hml_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.001        0.00        1.965     0.05
hml_replicated     0.988        0.01       98.006     0.00

Summary statistics:
- Number of observations: 726
- R-squared: 0.930, Adjusted R-squared: 0.930
- F-statistic: 9,605.171 on 1 and 724 DF, p-value: 0.000

The replication of the HML factor is also a success, although at a slightly higher coefficient of 0.99 and an R-squared around 93 percent.

model_rmw = (smf.ols(
    formula="rmw ~ rmw_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_rmw)

OLS Model:
rmw ~ rmw_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept           0.00       0.000        0.202     0.84
rmw_replicated      0.95       0.009      107.552     0.00

Summary statistics:
- Number of observations: 726
- R-squared: 0.941, Adjusted R-squared: 0.941
- F-statistic: 11,567.457 on 1 and 724 DF, p-value: 0.000

We are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an R-squared around 94 percent.

model_cma = (smf.ols(
    formula="cma ~ cma_replicated", 
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_cma)

OLS Model:
cma ~ cma_replicated

Coefficients:
                Estimate  Std. Error  t-Statistic  p-Value
Intercept          0.001       0.000        3.821      0.0
cma_replicated     0.964       0.008      121.274      0.0

Summary statistics:
- Number of observations: 726
- R-squared: 0.953, Adjusted R-squared: 0.953
- F-statistic: 14,707.393 on 1 and 724 DF, p-value: 0.000

Finally, the CMA factor also replicates well with a coefficient of 0.96 and an R-squared around 95 percent.

Overall, our approach seems to replicate the Fama-French three and five-factor models just as well as the three-factors.

Key Takeaways

The three-factor model adds size (SMB) and value (HML) to the traditional CAPM, while the five-factor model extends this with profitability (RMW) and investment (CMA) factors.
The portfolio construction follows the original Fama-French methodology, including NYSE breakpoints, specific time lags, and sorting rules based on firm characteristics.
The quality of replication can be evaluated using regression analysis and confirms strong alignment with the original Fama-French data.

Exercises

Fama and French (1993) claim that their sample excludes firms until they have appeared in Compustat for two years. Implement this additional filter and compare the improvements of your replication effort.
On his homepage, Kenneth French provides instructions on how to construct the most common variables used for portfolio sorts. Try to replicate the univariate portfolio sort return time series for E/P (earnings/price) provided on his homepage and evaluate your replication effort using regressions.

References

Fama, Eugene F., and Kenneth R. French. 1992. “The cross-section of expected stock returns.” The Journal of Finance 47 (2): 427–65. https://doi.org/2329112.

———. 1993. “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics 33 (1): 3–56. https://doi.org/10.1016/0304-405X(93)90023-5.

———. 2015. “A Five-Factor Asset Pricing Model.” Journal of Financial Economics 116 (1): 1–22. https://doi.org/10.1016/j.jfineco.2014.10.010.

Footnotes

Note that Fama and French (1992) claim to exclude financial firms. To a large extent this happens through using industry format “INDL”, as we do in WRDS, CRSP, and Compustat. Neither the original paper, nor Ken French’s website, or the WRDS replication contains any indication that financial companies are excluded using additional filters such as industry codes.↩︎