Using DuckDB with WRDS Data

Demonstrate the power of DuckDB and dbplyr with WRDS data.
Categories: Data, R
Affiliation: University of Melbourne
Published: December 22, 2023

In this short note, I show how one can use DuckDB with WRDS data stored in the PostgreSQL database provided by WRDS. I then use some simple benchmarks to show how DuckDB offers a powerful, fast analytical engine for researchers in accounting and finance.

To make the analysis concrete, I focus on data used in the excellent recent book “Tidy Finance with R”. Essentially, I combine data from CRSP’s daily stock return file (crsp.dsf) with data on factor returns from Ken French’s website and then run an aggregate query.

Summary of findings

While using DuckDB simplifies the process of collecting data from WRDS (and results in a shorter download time), the real differences come after the data sets are on your computer. Using DuckDB to load the data and run an aggregate query reduces the time taken from over two minutes using dplyr to well under one second. DuckDB from disk is faster than dplyr from RAM. Additionally, DuckDB is faster than SQLite. In fact, for many queries DuckDB would be faster collecting data from WRDS than dplyr is with data in a local SQLite database. While performance isn’t everything, gains like these likely deliver real quality-of-life benefits to data analysts.

I also show that almost all the performance benefits of DuckDB are realized even if the data are stored in parquet files. This is useful information because, while the format of DuckDB database files remains in flux, parquet files are regarded by many as the “modern CSV” and can be read by many software systems, including R and Python. I describe how one could maintain a local library of parquet files including copies of WRDS tables here.
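
As a rough sketch of that idea (this example is mine and assumes access to the relatively small crsp.msi table, which is not used elsewhere in this note), one could collect a WRDS table and write it to a local parquet file with the arrow package:

library(DBI)
library(dplyr)
library(arrow)

# Connect to WRDS using the PG* environment variables discussed below
pg <- dbConnect(RPostgres::Postgres())

# Pull a small WRDS table and add it to a local library of parquet files
if (!dir.exists("data")) dir.create("data")
msi <- tbl(pg, Id(schema = "crsp", table = "msi")) |> collect()
write_parquet(msi, "data/crsp_msi.parquet")

dbDisconnect(pg)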

This note illustrates the power of the core Tidy Finance approach. With a few tweaks, one can springboard from the SQLite-and-dplyr approach of the book to the very cutting-edge of data science tools and approaches.

Databases and tidy data

A popular way to manage and store data is with SQL databases. Tidy Finance with R uses SQLite, which “implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.” In this note, I use DuckDB, which has been described as offering “SQLite for Analytics”. DuckDB is like SQLite in not requiring a server process, but like server-based databases such as PostgreSQL in terms of support for advanced SQL features and data types.
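
A small illustration of that difference (this example is mine, not from the book): DuckDB understands DATE literals and date arithmetic directly, while SQLite has no native date type and stores dates as text or numbers.

library(DBI)

# DuckDB: a genuine DATE type and interval arithmetic
con_duck <- dbConnect(duckdb::duckdb())
dbGetQuery(con_duck, "SELECT DATE '1986-01-07' + INTERVAL 7 DAY AS one_week_later")
dbDisconnect(con_duck, shutdown = TRUE)

# SQLite: dates are stored as text (or numbers); typeof() reveals the storage class
con_lite <- dbConnect(RSQLite::SQLite())
dbGetQuery(con_lite, "SELECT typeof('1986-01-07') AS storage_class")
dbDisconnect(con_lite)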

While storing data in a DuckDB database offers some benefits of SQLite (e.g., data compression), the real benefits of using DuckDB come from using the database engine for data analytics. For the most part, Tidy Finance with R uses SQLite for storage and uses dplyr and in-memory data frames for analysis. For example, in the chapter on beta estimation, the data are read into memory immediately using collect() before any analysis is conducted. However, the dbplyr package allows many analytical tasks to be performed in the database. In this note, I demonstrate how using DuckDB and dbplyr can lead to significant performance gains.
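
To make that concrete, here is a minimal sketch (using a toy table, not the real crsp_daily) showing that a dplyr pipeline on a database table is translated into SQL and only materialized in R when we call collect():

library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())
copy_to(con, tibble(month = as.Date(c("1960-01-01", "1960-01-01", "1960-02-01")),
                    ret_excess = c(0.01, -0.02, 0.03)),
        name = "crsp_daily_demo")

query <- tbl(con, "crsp_daily_demo") |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE))

show_query(query)   # the SQL that DuckDB will run
collect(query)      # the aggregation happens in the database; only the result comes back

dbDisconnect(con, shutdown = TRUE)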

Getting data

There are two data sets that we need to collect. The first is the factor returns, which we collect from Ken French’s website using the frenchdata package. The second is daily stock returns from CRSP’s daily stock file (crsp.dsf), which we get from WRDS.

We start by loading four packages. Note that we load DBI rather than the underlying database driver package.1 In addition to these packages, you should have the duckdb and RSQLite packages installed. Use install.packages() in R to install any missing packages.

library(tidyverse)
library(DBI)
library(frenchdata)
library(arrow)

Next we set up a DuckDB database file in the data directory, creating this directory if need be. We set read_only = FALSE because we will want to write to this database connection.

if (!dir.exists("data")) dir.create("data")

tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)

Fama-French factor returns

We use the same start_date and end_date values used in “Tidy Finance with R”, and the code below is also adapted from that book. However, we use the copy_to() function from dplyr to save the table to our database.

start_date <- ymd("1960-01-01")
end_date <- ymd("2021-12-31")

factors_ff_daily_raw <- 
  download_french_data("Fama/French 3 Factors [Daily]")
New names:
• `` -> `...1`
factors_ff_daily <- 
  factors_ff_daily_raw$subsets$data[[1]] |>
  mutate(
    date = ymd(date),
    across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),
    .keep = "none"
  ) |>
  rename_with(str_to_lower) |>
  rename(mkt_excess = `mkt-rf`) |>
  filter(date >= start_date & date <= end_date) |>
  copy_to(tidy_finance,
          df = _,
          name = "factors_ff_daily",
          temporary = FALSE,
          overwrite = TRUE)

Getting daily returns from WRDS

Next, I specify the connection details as follows. I recommend using environment variables (e.g., set using Sys.setenv()), as this facilitates sharing code with others. You should not include this chunk in the code you share; rather, run it before executing the rest of your code. Note that the chunk below sets PGPASSWORD along with the other connection variables. (Hopefully it is obvious that you should use your own WRDS ID and password, not mine.)

Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
           PGPORT = 9737L,
           PGDATABASE = "wrds",
           PGUSER = Sys.getenv("WRDS_USER"),
           PGPASSWORD = Sys.getenv("WRDS_PASSWORD"))
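
One common alternative (not shown in the original note) is to store the credentials in a ~/.Renviron file, which R reads at start-up; the Sys.setenv() chunk above then picks them up via Sys.getenv(). Replace the placeholder values with your own.

# ~/.Renviron: plain KEY=value lines (no quotes, no R code)
WRDS_USER=your_wrds_id
WRDS_PASSWORD=your_wrds_password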

We then connect to the CRSP daily stock file in the WRDS PostgreSQL database.

pg <- dbConnect(RPostgres::Postgres())
dsf_db <- tbl(pg, Id(schema = "crsp", table = "dsf"))

As we can see, we have access to data in crsp.dsf.

dsf_db
# Source:   table<"crsp"."dsf"> [?? x 20]
# Database: postgres  [pweiss@wrds-pgdata.wharton.upenn.edu:9737/wrds]
  cusip    permno permco issuno hexcd hsiccd date       bidlo askhi
  <chr>     <dbl>  <dbl>  <dbl> <dbl>  <dbl> <date>     <dbl> <dbl>
1 68391610  10000   7952  10396     3   3990 1986-01-07  2.38  2.75
2 68391610  10000   7952  10396     3   3990 1986-01-08  2.38  2.62
3 68391610  10000   7952  10396     3   3990 1986-01-09  2.38  2.62
4 68391610  10000   7952  10396     3   3990 1986-01-10  2.38  2.62
5 68391610  10000   7952  10396     3   3990 1986-01-13  2.5   2.75
# ℹ more rows
# ℹ 11 more variables: prc <dbl>, vol <dbl>, ret <dbl>, bid <dbl>,
#   ask <dbl>, shrout <dbl>, cfacpr <dbl>, cfacshr <dbl>,
#   openprc <dbl>, numtrd <dbl>, retx <dbl>

Before proceeding with our first benchmark, we will make a version of system.time() that works with assignment.2

system_time <- function(x) {
  # Time the evaluation of x (a lazily evaluated argument) and print the timing
  print(system.time(x))
  # Return the computed value so the function can sit at the end of a pipe
  x
}

The following code is adapted from the Tidy Finance code here. But the original code is much more complicated and takes slightly longer to run.3

rs <- dbExecute(tidy_finance, "DROP TABLE IF EXISTS crsp_daily")

crsp_daily <- 
  dsf_db |>
  filter(between(date, start_date, end_date),
         !is.na(ret)) |>
  select(permno, date, ret) |>
  mutate(month = as.Date(floor_date(date, "month"))) |>
  copy_to(tidy_finance, df = _, name = "dsf_temp") |>
  left_join(factors_ff_daily |>
              select(date, rf), by = "date") |>
  mutate(
    ret_excess = ret - rf,
    ret_excess = pmax(ret_excess, -1, na.rm = TRUE)
  ) |>
  select(permno, date, month, ret_excess) |>
  compute(name = "crsp_daily", temporary = FALSE, overwrite = TRUE) |>
  system_time()
   user  system elapsed 
  56.53    2.64  256.12 

Saving data to SQLite

If you have been working through “Tidy Finance”, you may already have an SQLite database containing crsp_daily. If not, we can easily create one now and copy the table from our DuckDB database to SQLite.

tidy_finance_sqlite <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)
copy_to(tidy_finance_sqlite,
        crsp_daily,
        name = "crsp_daily",
        overwrite = TRUE,
        temporary = FALSE)

dbExecute(tidy_finance_sqlite, "VACUUM")

We can also save the data to a parquet file.

dbExecute(tidy_finance, 
          "COPY crsp_daily TO 'data/crsp_daily.parquet' 
          (FORMAT 'PARQUET')")

Having created our two databases, we disconnect from them. This mimics the most common “write-once, read-many” pattern for using databases.

dbDisconnect(tidy_finance_sqlite)
dbDisconnect(tidy_finance, shutdown = TRUE)

Benchmarking a simple aggregation query

The following is a simple comparison of several different ways of doing some basic data analysis with R. After running the code above, we have the table crsp_daily as described in Tidy Finance in two separate databases—a SQLite database and a DuckDB database—and in a parquet file.

The following examines the same query processed in five different ways.

  1. Using dplyr on an in-memory data frame
  2. Using dbplyr with an SQLite database
  3. Using dbplyr with a DuckDB database
  4. Using dbplyr with DuckDB and a parquet file
  5. Using dplyr with the arrow package and a parquet file

dplyr

We first need to load the data into memory.

tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)

crsp_daily <- tbl(tidy_finance, "crsp_daily")

What takes most time is simply loading nearly 2GB of data into memory.

crsp_daily_local <- 
  crsp_daily |> 
  collect() |>
  system_time()
   user  system elapsed 
 208.87    2.75  280.47 

Once the data are in memory, it is relatively quick to run a summary query.

crsp_daily_local |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  collect() |> 
  system_time()
   user  system elapsed 
   4.79    1.18    8.08 
# A tibble: 744 × 2
  month            ret
  <date>         <dbl>
1 1960-01-01 -0.00213 
2 1960-02-01  0.000325
3 1960-03-01 -0.00115 
4 1960-04-01 -0.00106 
5 1960-05-01  0.00114 
# ℹ 739 more rows
rm(crsp_daily_local)

dbplyr with SQLite

Things are faster with SQLite, though there’s no obvious way to split the time between reading the data and performing the aggregation. Note that we have a collect() at the end. This will not take a noticeable amount of time, but seems to be a reasonable step if our plan is to analyse the aggregated data in R.

crsp_daily |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  collect() |>
  system_time()
   user  system elapsed 
   38.6    11.0    63.5 
# A tibble: 744 × 2
  month            ret
  <date>         <dbl>
1 1960-01-01 -0.00213 
2 1960-02-01  0.000325
3 1960-03-01 -0.00115 
4 1960-04-01 -0.00106 
5 1960-05-01  0.00114 
# ℹ 739 more rows
dbDisconnect(tidy_finance)

dbplyr with DuckDB

Let’s consider DuckDB. Note that we are only reading the data here, so we set read_only = TRUE in connecting to the database. Apart from the connection, there is no difference between the code here and the code above using SQLite.

tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = TRUE)
crsp_daily <- tbl(tidy_finance, "crsp_daily")
crsp_daily |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  collect() |>
  system_time()
   user  system elapsed 
   0.51    0.11    0.50 
# A tibble: 744 × 2
  month           ret
  <date>        <dbl>
1 1991-12-01 0.00237 
2 1992-05-01 0.000740
3 1993-01-01 0.00413 
4 1993-05-01 0.00265 
5 2010-10-01 0.00198 
# ℹ 739 more rows

Having done our benchmarks, we can take a quick peek at the data.

crsp_daily |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  arrange(month) |>
  collect()
# A tibble: 744 × 2
  month            ret
  <date>         <dbl>
1 1960-01-01 -0.00213 
2 1960-02-01  0.000325
3 1960-03-01 -0.00115 
4 1960-04-01 -0.00106 
5 1960-05-01  0.00114 
# ℹ 739 more rows

Finally, we disconnect from the database. This happens automatically when we close R, and it matters less when read_only = TRUE (there is no lock on the file), but it keeps things tidy.

dbDisconnect(tidy_finance, shutdown = TRUE)

dbplyr with DuckDB and a parquet file

Let’s do the benchmark using the parquet data.

db <- dbConnect(duckdb::duckdb())
crsp_daily <- tbl(db, "read_parquet('data/crsp_daily.parquet')")
crsp_daily |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  collect() |>
  system_time()
   user  system elapsed 
   1.63    0.73    0.51 
# A tibble: 744 × 2
  month            ret
  <date>         <dbl>
1 1970-01-01 -0.00210 
2 1971-05-01 -0.00262 
3 1973-02-01 -0.00415 
4 1973-11-01 -0.00927 
5 1967-05-01 -0.000372
# ℹ 739 more rows
dbDisconnect(db, shutdown = TRUE)

The arrow library with a parquet file

Let’s do one more benchmark using the parquet data with the arrow library.

crsp_daily <- open_dataset("data/crsp_daily.parquet")
crsp_daily |> 
  group_by(month) |> 
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |> 
  collect() |>
  system_time()
   user  system elapsed 
   0.25    0.01    0.76 
# A tibble: 744 × 2
  month            ret
  <date>         <dbl>
1 1989-01-01  0.00285 
2 1986-06-01  0.000194
3 1986-07-01 -0.00352 
4 1986-08-01  0.00115 
5 1986-09-01 -0.00279 
# ℹ 739 more rows

Footnotes

  1. This is how it’s done in “R for Data Science”. I have read comments by Hadley Wickham that this is the right way to do it, but I can’t find those comments.

  2. If we put system.time() at the end of this pipe, then crsp_daily would hold the value returned by that function rather than the result of the pipeline preceding it. At first, the system_time() function may seem like magic, but Hadley Wickham explained to me that this works because of lazy evaluation, which is discussed in “Advanced R” here. Essentially, x is evaluated just once, inside system.time(), and its value is returned on the next line.

  3. Performance will vary according to the speed of your connection to WRDS. Note that this query does temporarily use a significant amount of RAM on my machine; it is not clear that DuckDB would use as much RAM if memory were more constrained. If necessary, you can run (say) dbExecute(tidy_finance, "SET memory_limit='1GB'") to constrain DuckDB’s memory usage; doing so has little impact on performance for this query.