library(tidyverse)
library(scales)
library(RSQLite)
library(dbplyr)
WRDS, CRSP, and Compustat
This chapter shows how to connect to Wharton Research Data Services (WRDS), a popular provider of financial and economic data for research applications. We use this connection to download the most commonly used data for stock and firm characteristics, CRSP and Compustat. Unfortunately, this data is not freely available, but most students and researchers typically have access to WRDS through their university libraries. Assuming that you have access to WRDS, we show you how to prepare and merge the databases and store them in the SQLite
-database introduced in the previous chapter. We conclude this chapter by providing some tips for working with the WRDS database.
First, we load the packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.
We use the same date range as in the previous chapter to ensure consistency.
<- ymd("1960-01-01")
start_date <- ymd("2021-12-31") end_date
Accessing WRDS
WRDS is the most widely used source for asset and firm-specific financial data used in academic settings. WRDS is a data platform that provides data validation, flexible delivery options, and access to many different data sources. The data at WRDS is also organized in an SQL database, although they use the PostgreSQL engine. This database engine is just as easy to handle with R as SQLite. We use the RPostgres
package to establish a connection to the WRDS database (Wickham, Ooms, and Müller 2022). Note that you could also use the odbc
package to connect to a PostgreSQL database, but then you need to install the appropriate drivers yourself. RPostgres
already contains a suitable driver.
library(RPostgres)
To establish a connection, you use the function dbConnect()
with the following arguments. Note that you need to replace the user
and password
arguments with your own credentials. We defined system variables for the purpose of this book because we obviously do not want (and are not allowed) to share our credentials with the rest of the world (these system variables are stored in an .Renviron
-file and loaded with the Sys.getenv()
function).
Additionally, you have to use multi-factor (i.e., two-factor) authentication since May 2023 when establishing a PostgreSQL
or other remote connections. You have two choices to provide the additional identification. First, if you have Duo Push enabled for your WRDS account, you will receive a push notification on your mobile phone when trying to establish a connection with the code below. Upon accepting the notification, you can continue your work. Second, you can log in to a WRDS website that requires multi-factor authentication with your username and the same IP address. Once you have successfully identified yourself on the website, your username-IP combination will be remembered for 30 days, and you can comfortably use the remote connection below.
<- dbConnect(
wrds Postgres(),
host = "wrds-pgdata.wharton.upenn.edu",
dbname = "wrds",
port = 9737,
sslmode = "require",
user = Sys.getenv("user"),
password = Sys.getenv("password")
)
The remote connection to WRDS is very useful. Yet, the database itself contains many different tables. You can check the WRDS homepage to identify the table’s name you are looking for (if you go beyond our exposition). Alternatively, you can also query the data structure with the function dbSendQuery()
. If you are interested, there is an exercise below that is based on WRDS’ tutorial on “Querying WRDS Data using R”. Furthermore, the penultimate section of this chapter shows how to investigate the structure of databases.
Downloading and Preparing CRSP
The Center for Research in Security Prices (CRSP) provides the most widely used data for US stocks. We use the wrds
connection object that we just created to first access monthly CRSP return data. Actually, we need three tables to get the desired data: (i) the CRSP monthly security file,
<- tbl(wrds, in_schema("crsp", "msf")) msf_db
- the identifying information,
<- tbl(wrds, in_schema("crsp", "msenames")) msenames_db
and (iii) the delisting information.
<- tbl(wrds, in_schema("crsp", "msedelist")) msedelist_db
We use the three remote tables to fetch the data we want to put into our local database. Just as above, the idea is that we let the WRDS database do all the work and just download the data that we actually need. We apply common filters and data selection criteria to narrow down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep only US-listed stocks as identified via share codes shrcd
10 and 11, and (iii) we keep only months within permno-specific start dates namedt
and end dates nameendt
. In addition, we add delisting codes and returns. You can read up in the great textbook of Bali, Engle, and Murray (2016) for an extensive discussion on the filters we apply in the code below.
<- msf_db |>
crsp_monthly filter(date >= start_date & date <= end_date) |>
inner_join(
|>
msenames_db filter(shrcd %in% c(10, 11)) |>
select(permno, exchcd, siccd, namedt, nameendt),
by = c("permno")
|>
) filter(date >= namedt & date <= nameendt) |>
mutate(month = floor_date(date, "month")) |>
left_join(
|>
msedelist_db select(permno, dlstdt, dlret, dlstcd) |>
mutate(month = floor_date(dlstdt, "month")),
by = c("permno", "month")
|>
) select(
# Security identifier
permno, # Date of the observation
date, # Month of the observation
month, # Return
ret, # Shares outstanding (in thousands)
shrout, # Last traded price in a month
altprc, # Exchange code
exchcd, # Industry code
siccd, # Delisting return
dlret, # Delisting code
dlstcd |>
) collect() |>
mutate(
month = ymd(month),
shrout = shrout * 1000
)
Now, we have all the relevant monthly return data in memory and proceed with preparing the data for future analyses. We perform the preparation step at the current stage since we want to avoid executing the same mutations every time we use the data in subsequent chapters.
The first additional variable we create is market capitalization (mktcap
), which is the product of the number of outstanding shares shrout
and the last traded price in a month altprc
. Note that in contrast to returns ret
, these two variables are not adjusted ex-post for any corporate actions like stock splits. Moreover, the altprc
is negative whenever the last traded price does not exist, and CRSP decides to report the mid-quote of the last available order book instead. Hence, we take the absolute value of the market cap. We also keep the market cap in millions of USD just for convenience as we do not want to print huge numbers in our figures and tables. In addition, we set zero market cap to missing as it makes conceptually little sense (i.e., the firm would be bankrupt).
<- crsp_monthly |>
crsp_monthly mutate(
mktcap = abs(shrout * altprc) / 1000000,
mktcap = na_if(mktcap, 0)
)
The next variable we frequently use is the one-month lagged market capitalization. Lagged market capitalization is typically used to compute value-weighted portfolio returns, as we demonstrate in a later chapter. The most simple and consistent way to add a column with lagged market cap values is to add one month to each observation and then join the information to our monthly CRSP data.
<- crsp_monthly |>
mktcap_lag mutate(month = month %m+% months(1)) |>
select(permno, month, mktcap_lag = mktcap)
<- crsp_monthly |>
crsp_monthly left_join(mktcap_lag, by = c("permno", "month"))
If you wonder why we do not use the lag()
function, e.g., via crsp_monthly |> group_by(permno) |> mutate(mktcap_lag = lag(mktcap))
, take a look at the exercises.
Next, we follow Bali, Engle, and Murray (2016) in transforming listing exchange codes to explicit exchange names.
<- crsp_monthly |>
crsp_monthly mutate(exchange = case_when(
%in% c(1, 31) ~ "NYSE",
exchcd %in% c(2, 32) ~ "AMEX",
exchcd %in% c(3, 33) ~ "NASDAQ",
exchcd .default = "Other"
))
Similarly, we transform industry codes to industry descriptions following Bali, Engle, and Murray (2016). Notice that there are also other categorizations of industries (e.g., Fama and French 1997) that are commonly used.
<- crsp_monthly |>
crsp_monthly mutate(industry = case_when(
>= 1 & siccd <= 999 ~ "Agriculture",
siccd >= 1000 & siccd <= 1499 ~ "Mining",
siccd >= 1500 & siccd <= 1799 ~ "Construction",
siccd >= 2000 & siccd <= 3999 ~ "Manufacturing",
siccd >= 4000 & siccd <= 4899 ~ "Transportation",
siccd >= 4900 & siccd <= 4999 ~ "Utilities",
siccd >= 5000 & siccd <= 5199 ~ "Wholesale",
siccd >= 5200 & siccd <= 5999 ~ "Retail",
siccd >= 6000 & siccd <= 6799 ~ "Finance",
siccd >= 7000 & siccd <= 8999 ~ "Services",
siccd >= 9000 & siccd <= 9999 ~ "Public",
siccd TRUE ~ "Missing"
))
We also construct returns adjusted for delistings as described by Bali, Engle, and Murray (2016). The delisting of a security usually results when a company ceases operations, declares bankruptcy, merges, does not meet listing requirements, or seeks to become private. The adjustment tries to reflect the returns of investors who bought the stock in the month before the delisting and held it until the delisting date. After this transformation, we can drop the delisting returns and codes.
<- crsp_monthly |>
crsp_monthly mutate(ret_adj = case_when(
is.na(dlstcd) ~ ret,
!is.na(dlstcd) & !is.na(dlret) ~ dlret,
%in% c(500, 520, 580, 584) |
dlstcd >= 551 & dlstcd <= 574) ~ -0.30,
(dlstcd == 100 ~ ret,
dlstcd TRUE ~ -1
|>
)) select(-c(dlret, dlstcd))
Next, we compute excess returns by subtracting the monthly risk-free rate provided by our Fama-French data. As we base all our analyses on the excess returns, we can drop adjusted returns and the risk-free rate from our tibble. Note that we ensure excess returns are bounded by -1 from below as a return less than -100% makes no sense conceptually. Before we can adjust the returns, we have to connect to our database and load the tibble factors_ff_monthly
.
<- dbConnect(
tidy_finance SQLite(),
"data/tidy_finance.sqlite",
extended_types = TRUE
)
<- tbl(tidy_finance, "factors_ff_monthly") |>
factors_ff_monthly select(month, rf) |>
collect()
<- crsp_monthly |>
crsp_monthly left_join(factors_ff_monthly,
by = "month"
|>
) mutate(
ret_excess = ret_adj - rf,
ret_excess = pmax(ret_excess, -1)
|>
) select(-ret_adj, -rf)
Since excess returns and market capitalization are crucial for all our analyses, we can safely exclude all observations with missing returns or market capitalization.
<- crsp_monthly |>
crsp_monthly drop_na(ret_excess, mktcap, mktcap_lag)
Finally, we store the monthly CRSP file in our database.
dbWriteTable(tidy_finance,
"crsp_monthly",
value = crsp_monthly,
overwrite = TRUE
)
First Glimpse of the CRSP Sample
Before we move on to other data sources, let us look at some descriptive statistics of the CRSP sample, which is our main source for stock returns.
Figure 1 shows the monthly number of securities by listing exchange over time. NYSE has the longest history in the data, but NASDAQ lists a considerably large number of stocks. The number of stocks listed on AMEX decreased steadily over the last couple of decades. By the end of 2021, there were 2,779 stocks with a primary listing on NASDAQ, 1,395 on NYSE, 145 on AMEX, and only one belonged to the other category.
|>
crsp_monthly count(exchange, date) |>
ggplot(aes(x = date, y = n, color = exchange, linetype = exchange)) +
geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Monthly number of securities by listing exchange"
+
) scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
scale_y_continuous(labels = comma)
Next, we look at the aggregate market capitalization grouped by the respective listing exchanges in Figure 2. To ensure that we look at meaningful data which is comparable over time, we adjust the nominal values for inflation. In fact, we can use the tables that are already in our database to calculate aggregate market caps by listing exchange and plotting it just as if they were in memory. All values in Figure 2 are at the end of 2021 USD to ensure intertemporal comparability. NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-listed stocks.
tbl(tidy_finance, "crsp_monthly") |>
left_join(tbl(tidy_finance, "cpi_monthly"), by = "month") |>
group_by(month, exchange) |>
summarize(
mktcap = sum(mktcap, na.rm = TRUE) / cpi,
.groups = "drop"
|>
) collect() |>
mutate(month = ymd(month)) |>
ggplot(aes(
x = month, y = mktcap / 1000,
color = exchange, linetype = exchange
+
)) geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Monthly market cap by listing exchange in billions of Dec 2021 USD"
+
) scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
scale_y_continuous(labels = comma)
Of course, performing the computation in the database is not really meaningful because we can easily pull all the required data into our memory. The code chunk above is slower than performing the same steps on tables that are already in memory. However, we just want to illustrate that you can perform many things in the database before loading the data into your memory. Before we proceed, we load the monthly CPI data.
<- tbl(tidy_finance, "cpi_monthly") |>
cpi_monthly collect()
Next, we look at the same descriptive statistics by industry. Figure 3 plots the number of stocks in the sample for each of the SIC industry classifiers. For most of the sample period, the largest share of stocks is in manufacturing, albeit the number peaked somewhere in the 90s. The number of firms associated with public administration seems to be the only category on the rise in recent years, even surpassing manufacturing at the end of our sample period.
<- crsp_monthly |>
crsp_monthly_industry left_join(cpi_monthly, by = "month") |>
group_by(month, industry) |>
summarize(
securities = n_distinct(permno),
mktcap = sum(mktcap) / mean(cpi),
.groups = "drop"
)
|>
crsp_monthly_industry ggplot(aes(
x = month,
y = securities,
color = industry,
linetype = industry
+
)) geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Monthly number of securities by industry"
+
) scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
scale_y_continuous(labels = comma)
We also compute the market cap of all stocks belonging to the respective industries and show the evolution over time in Figure 4. All values are again in terms of billions of end of 2021 USD. At all points in time, manufacturing firms comprise of the largest portion of market capitalization. Toward the end of the sample, however, financial firms and services begin to make up a substantial portion of the market cap.
|>
crsp_monthly_industry ggplot(aes(
x = month,
y = mktcap / 1000,
color = industry,
linetype = industry
+
)) geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Monthly total market cap by industry in billions as of Dec 2021 USD"
+
) scale_x_date(date_breaks = "10 years", date_labels = "%Y") +
scale_y_continuous(labels = comma)
Daily CRSP Data
Before we turn to accounting data, we provide a proposal for downloading daily CRSP data. While the monthly data from above typically fit into your memory and can be downloaded in a meaningful amount of time, this is usually not true for daily return data. The daily CRSP data file is substantially larger than monthly data and can exceed 20GB. This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to copy the entire data set to your local database), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.
There is a solution to this challenge. As with many big data problems, you can split up the big task into several smaller tasks that are easy to handle. That is, instead of downloading data about many stocks all at once, download the data in small batches for each stock consecutively. Such operations can be implemented in for()
-loops, where we download, prepare, and store the data for a single stock in each iteration. This operation might nonetheless take a couple of hours, so you have to be patient either way (we often run such code overnight). To keep track of the progress, you can use txtProgressBar()
. Eventually, we end up with more than 68 million rows of daily return data. Note that we only store the identifying information that we actually need, namely permno
, date
, and month
alongside the excess returns. We thus ensure that our local database contains only the data we actually use and that we can load the full daily data into our memory later. Notice that we also use the function dbWriteTable()
here with the option to append the new data to an existing table, when we process the second and all following batches.
<- tbl(wrds, in_schema("crsp", "dsf"))
dsf_db
<- tbl(tidy_finance, "factors_ff_daily") |>
factors_ff_daily collect()
<- tbl(tidy_finance, "crsp_monthly") |>
permnos distinct(permno) |>
pull()
<- txtProgressBar(
progress min = 0,
max = length(permnos),
initial = 0,
style = 3
)
for (j in 1:length(permnos)) {
<- permnos[j]
permno_sub <- dsf_db |>
crsp_daily_sub filter(permno == permno_sub &
>= start_date & date <= end_date) |>
date select(permno, date, ret) |>
collect() |>
drop_na()
if (nrow(crsp_daily_sub) > 0) {
<- crsp_daily_sub |>
crsp_daily_sub mutate(month = floor_date(date, "month")) |>
left_join(factors_ff_daily |>
select(date, rf), by = "date") |>
mutate(
ret_excess = ret - rf,
ret_excess = pmax(ret_excess, -1)
|>
) select(permno, date, month, ret_excess)
dbWriteTable(tidy_finance,
"crsp_daily",
value = crsp_daily_sub,
overwrite = ifelse(j == 1, TRUE, FALSE),
append = ifelse(j != 1, TRUE, FALSE)
)
}setTxtProgressBar(progress, j)
}
close(progress)
<- tbl(tidy_finance, "crsp_daily") crsp_daily_db
To the best of our knowledge, the daily CRSP data does not require any adjustments like the monthly data. The adjustment of the monthly data comes from the fact that CRSP aggregates daily data into monthly observations and has to decide which prices and returns to record if a stock gets delisted. In the daily data, there is simply no price or return after delisting, so there is also no aggregation problem.
Preparing Compustat data
Firm accounting data are an important source of information that we use in portfolio analyses in subsequent chapters. The commonly used source for firm financial information is Compustat provided by S&P Global Market Intelligence, which is a global data vendor that provides financial, statistical, and market information on active and inactive companies throughout the world. For US and Canadian companies, annual history is available back to 1950 and quarterly as well as monthly histories date back to 1962.
To access Compustat data, we can again tap WRDS, which hosts the funda
table that contains annual firm-level information on North American companies.
<- tbl(wrds, in_schema("comp", "funda")) funda_db
We follow the typical filter conventions and pull only data that we actually need: (i) we get only records in industrial data format, (ii) in the standard format (i.e., consolidated information in standard presentation), and (iii) only data in the desired time window.
<- funda_db |>
compustat filter(
== "INDL" &
indfmt == "STD" &
datafmt == "C" &
consol >= start_date & datadate <= end_date
datadate |>
) select(
# Firm identifier
gvkey, # Date of the accounting data
datadate, # Stockholders' equity
seq, # Total common/ordinary equity
ceq, # Total assets
at, # Total liabilities
lt, # Deferred taxes and investment tax credit
txditc, # Deferred taxes
txdb, # Investment tax credit
itcb, # Preferred stock redemption value
pstkrv, # Preferred stock liquidating value
pstkl, # Preferred stock par value
pstk, # Capital investment
capx, # Operating cash flow
oancf |>
) collect()
Next, we calculate the book value of preferred stock and equity inspired by the variable definition in Ken French’s data library. Note that we set negative or zero equity to missing which is a common practice when working with book-to-market ratios (see Fama and French 1992 for details).
<- compustat |>
compustat mutate(
be = coalesce(seq, ceq + pstk, at - lt) +
coalesce(txditc, txdb + itcb, 0) -
coalesce(pstkrv, pstkl, pstk, 0),
be = if_else(be <= 0, as.numeric(NA), be)
)
We keep only the last available information for each firm-year group. Note that datadate
defines the time the corresponding financial data refers to (e.g., annual report as of December 31, 2021). Therefore, datadate
is not the date when data was made available to the public. Check out the exercises for more insights into the peculiarities of datadate
.
<- compustat |>
compustat mutate(year = year(datadate)) |>
group_by(gvkey, year) |>
filter(datadate == max(datadate)) |>
ungroup()
With the last step, we are already done preparing the firm fundamentals. Thus, we can store them in our local database.
dbWriteTable(tidy_finance,
"compustat",
value = compustat,
overwrite = TRUE
)
Merging CRSP with Compustat
Unfortunately, CRSP and Compustat use different keys to identify stocks and firms. CRSP uses permno
for stocks, while Compustat uses gvkey
to identify firms. Fortunately, a curated matching table on WRDS allows us to merge CRSP and Compustat, so we create a connection to the CRSP-Compustat Merged table (provided by CRSP).
<- tbl(
ccmxpf_linktable_db
wrds,in_schema("crsp", "ccmxpf_linktable")
)
The linking table contains links between CRSP and Compustat identifiers from various approaches. However, we need to make sure that we keep only relevant and correct links, again following the description outlined in Bali, Engle, and Murray (2016). Note also that currently active links have no end date, so we just enter the current date via today()
.
<- ccmxpf_linktable_db |>
ccmxpf_linktable filter(linktype %in% c("LU", "LC") &
%in% c("P", "C") &
linkprim == 1) |>
usedflag select(permno = lpermno, gvkey, linkdt, linkenddt) |>
collect() |>
mutate(linkenddt = replace_na(linkenddt, today()))
We use these links to create a new table with a mapping between stock identifier, firm identifier, and month. We then add these links to the Compustat gvkey
to our monthly stock data.
<- crsp_monthly |>
ccm_links inner_join(ccmxpf_linktable, by = "permno", relationship = "many-to-many") |>
filter(!is.na(gvkey) & (date >= linkdt & date <= linkenddt)) |>
select(permno, gvkey, date)
<- crsp_monthly |>
crsp_monthly left_join(ccm_links, by = c("permno", "date"))
As the last step, we update the previously prepared monthly CRSP file with the linking information in our local database.
dbWriteTable(tidy_finance,
"crsp_monthly",
value = crsp_monthly,
overwrite = TRUE
)
Before we close this chapter, let us look at an interesting descriptive statistic of our data. As the book value of equity plays a crucial role in many asset pricing applications, it is interesting to know for how many of our stocks this information is available. Hence, Figure 5 plots the share of securities with book equity values for each exchange. It turns out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 60s but hovers around 80% for all periods thereafter. We can ignore the erratic coverage of securities that belong to the other category since there is only a handful of them anyway in our sample.
|>
crsp_monthly group_by(permno, year = year(month)) |>
filter(date == max(date)) |>
ungroup() |>
left_join(compustat, by = c("gvkey", "year")) |>
group_by(exchange, year) |>
summarize(
share = n_distinct(permno[!is.na(be)]) / n_distinct(permno),
.groups = "drop"
|>
) ggplot(aes(
x = year,
y = share,
color = exchange,
linetype = exchange
+
)) geom_line() +
labs(
x = NULL, y = NULL, color = NULL, linetype = NULL,
title = "Share of securities with book equity values by exchange"
+
) scale_y_continuous(labels = percent) +
coord_cartesian(ylim = c(0, 1))
Some Tricks for PostgreSQL Databases
As we mentioned above, the WRDS database runs on PostgreSQL rather than SQLite. Finding the right tables for your data needs can be tricky in the WRDS PostgreSQL instance, as the tables are organized in schemas. If you wonder what the purpose of schemas is, check out this documetation. For instance, if you want to find all tables that live in the crsp
schema, you run
dbListObjects(wrds, Id(schema = "crsp"))
This operation returns a list of all tables that belong to the crsp
family on WRDS, e.g., <Id> schema = crsp, table = msenames
. Similarly, you can fetch a list of all tables that belong to the comp
family via
dbListObjects(wrds, Id(schema = "comp"))
If you want to get all schemas, then run
dbListObjects(wrds)
Exercises
- Check out the structure of the WRDS database by sending queries in the spirit of “Querying WRDS Data using R” and verify the output with
dbListObjects()
. How many tables are associated with CRSP? Can you identify what is stored within msp500? - Compute
mkt_cap_lag
usinglag(mktcap)
rather than joins as above. Filter out all the rows where the lag-based market capitalization measure is different from the one we computed above. Why are they different? - In the main part, we look at the distribution of market capitalization across exchanges and industries. Now, plot the average market capitalization of firms for each exchange and industry. What do you find?
datadate
refers to the date to which the fiscal year of a corresponding firm refers to. Count the number of observations in Compustat by month of this date variable. What do you find? What does the finding suggest about pooling observations with the same fiscal year?- Go back to the original Compustat data in
funda_db
and extract rows where the same firm has multiple rows for the same fiscal year. What is the reason for these observations? - Repeat the analysis of market capitalization for book equity, which we computed from the Compustat data. Then, use the matched sample to plot book equity against market capitalization. How are these two variables related?