Accessing and Managing Financial Data
You are reading Tidy Finance with Python. You can find the equivalent chapter for the sibling Tidy Finance with R here.
In this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when using different data formats and across different projects. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.
This chapter shows how to import different open-source datasets. Specifically, our data comes from the application programming interface (API) of Yahoo!Finance, a downloaded standard CSV file, an XLSX file stored in a public Google Drive repository, and other macroeconomic time series. We store all the data in a single database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.
First, we load the Python packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.
import pandas as pd
import numpy as np
Moreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.
start_date = "1960-01-01"
end_date = "2023-12-31"
Fama-French Data
We start by downloading some famous Fama-French factors (e.g., Fama and French 1993) and portfolio returns commonly used in empirical asset pricing. Fortunately, the pandas-datareader package provides a simple interface to read data from Kenneth French’s Data Library.
import pandas_datareader as pdr
We can use the pdr.DataReader() function of the package to download monthly Fama-French factors. The set Fama/French 3 Factors contains the return time series of the market (mkt_excess), size (smb), and value (hml) factors alongside the risk-free rates (rf). Note that we have to do some manual work to parse all the columns correctly and scale them appropriately, as the raw Fama-French data comes in a unique data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French’s finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to pandas_datareader.
factors_ff3_monthly_raw = pdr.DataReader(
  name="F-F_Research_Data_Factors",
  data_source="famafrench",
  start=start_date,
  end=end_date)[0]

factors_ff3_monthly = (factors_ff3_monthly_raw
  .divide(100)
  .reset_index(names="date")
  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
  .rename(str.lower, axis="columns")
  .rename(columns={"mkt-rf": "mkt_excess"})
)
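To see what this cleaning chain does without a network connection, the following minimal sketch applies the same steps to a tiny hand-made frame. The monthly PeriodIndex, the index name, and the raw column labels are assumptions about the shape of the downloaded data, and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the raw download: percent returns, monthly periods.
raw_demo = pd.DataFrame(
    {
        "Mkt-RF": [1.0, -2.0],
        "SMB": [0.5, 0.25],
        "HML": [0.1, 0.2],
        "RF": [0.33, 0.29],
    },
    index=pd.period_range("1960-01", periods=2, freq="M", name="Date"),
)

clean_demo = (raw_demo
  .divide(100)                      # percent -> decimal returns
  .reset_index(names="date")        # move the period index into a column
  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))  # "1960-01" -> timestamp
  .rename(str.lower, axis="columns")
  .rename(columns={"mkt-rf": "mkt_excess"})
)

print(clean_demo)
```

Converting the periods to strings first means each month is mapped to the first day of that month, which is the date convention visible in the stored tables.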
We also download the set 5 Factors (2x3), which additionally includes the return time series of the profitability (rmw) and investment (cma) factors. We demonstrate how the monthly factors are constructed in Replicating Fama and French Factors.
factors_ff5_monthly_raw = pdr.DataReader(
  name="F-F_Research_Data_5_Factors_2x3",
  data_source="famafrench",
  start=start_date,
  end=end_date)[0]

factors_ff5_monthly = (factors_ff5_monthly_raw
  .divide(100)
  .reset_index(names="date")
  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
  .rename(str.lower, axis="columns")
  .rename(columns={"mkt-rf": "mkt_excess"})
)
It is straightforward to download the corresponding daily Fama-French factors with the same function.
factors_ff3_daily_raw = pdr.DataReader(
  name="F-F_Research_Data_Factors_daily",
  data_source="famafrench",
  start=start_date,
  end=end_date)[0]

factors_ff3_daily = (factors_ff3_daily_raw
  .divide(100)
  .reset_index(names="date")
  .rename(str.lower, axis="columns")
  .rename(columns={"mkt-rf": "mkt_excess"})
)
In a subsequent chapter, we also use the monthly returns from ten industry portfolios, so let us fetch that data, too.
industries_ff_monthly_raw = pdr.DataReader(
  name="10_Industry_Portfolios",
  data_source="famafrench",
  start=start_date,
  end=end_date)[0]

industries_ff_monthly = (industries_ff_monthly_raw
  .divide(100)
  .reset_index(names="date")
  .assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
  .rename(str.lower, axis="columns")
)
It is worth taking a look at all available portfolio return time series from Kenneth French’s homepage. You should check out the other sets by calling pdr.famafrench.get_available_datasets().
q-Factors
In recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the Hou, Xue, and Zhang (2014) q-factor model. We refer to the extended background information provided by the original authors for further information. The q-factors can be downloaded directly from the authors’ homepage with pd.read_csv().
We also need to adjust this data. First, we construct a date column from the year and month columns and discard information we will not use in the remainder of the book. Then, we strip the “R_” prefix from the column names and write all column names in lowercase. We then query the data to select observations between the start and end dates. Finally, we use the double asterisk (**) notation in the assign function to apply the same transform of dividing by 100 to all four factors by iterating through them. You should always try sticking to a consistent style for naming objects, which we try to illustrate here - the emphasis is on try. You can check out style guides available online, e.g., Hadley Wickham’s tidyverse style guide.
factors_q_monthly_link = (
  "https://global-q.org/uploads/1/2/2/6/122679606/"
  "q5_factors_monthly_2023.csv"
)

factors_q_monthly = (pd.read_csv(factors_q_monthly_link)
  .assign(
    date=lambda x: (
      pd.to_datetime(x["year"].astype(str) + "-" +
        x["month"].astype(str) + "-01"))
  )
  .drop(columns=["R_F", "R_MKT", "year"])
  .rename(columns=lambda x: x.replace("R_", "").lower())
  .query(f"date >= '{start_date}' and date <= '{end_date}'")
  # Bind col as a default argument so that each lambda scales its own column
  .assign(**{col: (lambda x, col=col: x[col]/100)
             for col in ["me", "ia", "roe", "eg"]})
)
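One subtlety is worth illustrating when lambdas are created inside a dictionary comprehension and passed to assign via the double asterisk notation: Python closures look up the loop variable when the lambda is called, not when it is defined. The sketch below uses a made-up two-column frame (the values are purely illustrative) to contrast the late-binding behavior with a version that binds the column name as a default argument.

```python
import pandas as pd

# Hypothetical toy data standing in for two factor columns in percent.
toy = pd.DataFrame({"me": [1.0, 2.0], "ia": [10.0, 20.0]})
cols = ["me", "ia"]

# Late binding: every lambda sees the last value of col ("ia") when called.
late = toy.assign(**{col: lambda x: x[col] / 100 for col in cols})

# Early binding: col=col freezes the column name for each lambda.
bound = toy.assign(
    **{col: (lambda x, col=col: x[col] / 100) for col in cols}
)

print(late)   # "me" is computed from "ia"
print(bound)  # each column is divided by 100 on its own
```

Binding the name via a default argument is only one option; a plain loop over the columns or `toy[cols] / 100` would avoid the closure altogether.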
Macroeconomic Predictors
Our next data source is a set of macroeconomic variables often used as predictors for the equity premium. Welch and Goyal (2008) comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data on Amit Goyal’s website. Since the data is an XLSX file stored in a public Google Drive location, we construct a link that exports the sheet as CSV so that we can read it directly from our Python session. Usually, you need to authenticate if you interact with Google Drive directly in Python. Since the data is stored via a public link, we can proceed without any authentication.
sheet_id = "1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG"
sheet_name = "macro_predictors.xlsx"
macro_predictors_link = (
  f"https://docs.google.com/spreadsheets/d/{sheet_id}"
  f"/gviz/tq?tqx=out:csv&sheet={sheet_name}"
)
Next, we read in the new data and transform the columns into the variables that we later use:
- The dividend price ratio (dp), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices (Campbell and Shiller 1988; Campbell and Yogo 2006).
- Dividend yield (dy), the difference between the log of dividends and the log of lagged prices (Ball 1978).
- Earnings price ratio (ep), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index (Campbell and Shiller 1988).
- Dividend payout ratio (de), the difference between the log of dividends and the log of earnings (Lamont 1998).
- Stock variance (svar), the sum of squared daily returns on the S&P 500 index (Guo 2006).
- Book-to-market ratio (bm), the ratio of book value to market value for the Dow Jones Industrial Average (Kothari and Shanken 1997).
- Net equity expansion (ntis), the ratio of 12-month moving sums of net issues by NYSE listed stocks divided by the total end-of-year market capitalization of NYSE stocks (Campbell, Hilscher, and Szilagyi 2008).
- Treasury bills (tbl), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank at St. Louis (Campbell 1987).
- Long-term yield (lty), the long-term government bond yield from Ibbotson’s Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal 2008).
- Long-term rate of returns (ltr), the long-term government bond returns from Ibbotson’s Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal 2008).
- Term spread (tms), the difference between the long-term yield on government bonds and the Treasury bill (Campbell 1987).
- Default yield spread (dfy), the difference between BAA and AAA-rated corporate bond yields (Fama and French 1989).
- Inflation (infl), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics (Campbell and Vuolteenaho 2004).
For variable definitions and the required data transformations, you can consult the material on Amit Goyal’s website.
macro_predictors = (
  pd.read_csv(macro_predictors_link, thousands=",")
  .assign(
    date=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
    dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
    dy=lambda x: np.log(x["D12"])-np.log(x["Index"].shift(1)),
    ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
    de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
    tms=lambda x: x["lty"]-x["tbl"],
    dfy=lambda x: x["BAA"]-x["AAA"]
  )
  .rename(columns={"b/m": "bm"})
  .get(["date", "dp", "dy", "ep", "de", "svar", "bm",
        "ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
  .query("date >= @start_date and date <= @end_date")
  .dropna()
)
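The log-ratio definitions can be checked on a small synthetic example. The sketch below assumes raw columns named yyyymm, Index, D12, and E12, and uses invented numbers in which prices, dividends, and earnings all grow by 10 percent per month, so that dp, ep, and de are constant while dy depends on the lagged price.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: prices, 12-month dividends and earnings grow by 10%.
raw_macro = pd.DataFrame({
    "yyyymm": [196001, 196002, 196003],
    "Index": [100.0, 110.0, 121.0],
    "D12": [4.0, 4.4, 4.84],
    "E12": [8.0, 8.8, 9.68],
})

macro_demo = raw_macro.assign(
    date=lambda x: pd.to_datetime(x["yyyymm"].astype(str), format="%Y%m"),
    dp=lambda x: np.log(x["D12"]) - np.log(x["Index"]),
    # shift(1) uses last month's price, so the first observation is missing
    dy=lambda x: np.log(x["D12"]) - np.log(x["Index"].shift(1)),
    ep=lambda x: np.log(x["E12"]) - np.log(x["Index"]),
    de=lambda x: np.log(x["D12"]) - np.log(x["E12"]),
)

print(macro_demo[["date", "dp", "dy", "ep", "de"]])
```

Because all three ratios are differences of logs, the identity de = dp - ep holds row by row, which is a quick sanity check on the transformed columns.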
Other Macroeconomic Data
The Federal Reserve Bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. As an illustration, we use the already familiar pandas-datareader package to fetch consumer price index (CPI) data that can be found under the CPIAUCNS key.
cpi_monthly = (pdr.DataReader(
    name="CPIAUCNS",
    data_source="fred",
    start=start_date,
    end=end_date
  )
  .reset_index(names="date")
  .rename(columns={"CPIAUCNS": "cpi"})
  .assign(cpi=lambda x: x["cpi"]/x["cpi"].iloc[-1])
)
Note that we use the assign() method in the last line to set the current (latest) price level as the reference inflation level. To download other time series, we just have to look them up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the PCU2122212122210 key.
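The normalization by the latest price level can be illustrated with a minimal sketch on invented index values. The index name DATE and the three observations are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical CPI levels for three months (not actual FRED values).
cpi_raw_demo = pd.DataFrame(
    {"CPIAUCNS": [100.0, 150.0, 200.0]},
    index=pd.to_datetime(
        ["2023-10-01", "2023-11-01", "2023-12-01"]
    ).rename("DATE"),
)

cpi_demo = (cpi_raw_demo
  .reset_index(names="date")
  .rename(columns={"CPIAUCNS": "cpi"})
  # Divide by the last observation so the latest month equals one
  .assign(cpi=lambda x: x["cpi"] / x["cpi"].iloc[-1])
)

print(cpi_demo)
```

With this scaling, dividing a nominal amount by cpi expresses it in prices of the latest month in the sample.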
Setting Up a Database
Now that we have downloaded some (freely available) data from the web into the memory of our Python session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.
There are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an SQLite database. SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that SQL (Structured Query Language) is a standard language for accessing and manipulating databases.
import sqlite3
An SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that SQLite has no dedicated date type, which is why we parse date columns explicitly when reading data back. We will use the resulting file tidy_finance_python.sqlite in the subfolder data for all subsequent chapters to retrieve our data.
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
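Connecting to a database file requires that its parent folder already exists. The following sketch creates the folder first and then opens the connection. It works in a temporary directory with a hypothetical file name so that it does not touch your project folder.

```python
import os
import sqlite3
import tempfile

# Work in a throwaway directory; the folder and file names are illustrative.
base_dir = tempfile.mkdtemp()
data_dir = os.path.join(base_dir, "data")
os.makedirs(data_dir, exist_ok=True)  # no error if the folder already exists

db_path = os.path.join(data_dir, "demo.sqlite")
con = sqlite3.connect(database=db_path)

# Writing something ensures the file is materialized on disk.
con.execute("CREATE TABLE demo (x INTEGER)")
con.commit()
con.close()

print(os.path.exists(db_path))
```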
Next, we create a remote table with the monthly Fama-French factor data. We do so with the pandas function to_sql(), which copies the data to our SQLite database.
(factors_ff3_monthly
  .to_sql(name="factors_ff3_monthly",
          con=tidy_finance,
          if_exists="replace",
          index=False)
)
Now, if we want to have the whole table in memory, we need to call pd.read_sql_query() with the corresponding query. You will see that we regularly load the data into the memory in the next chapters.
pd.read_sql_query(
  sql="SELECT date, rf FROM factors_ff3_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)
 | date | rf |
---|---|---|
0 | 1960-01-01 | 0.0033 |
1 | 1960-02-01 | 0.0029 |
2 | 1960-03-01 | 0.0035 |
3 | 1960-04-01 | 0.0019 |
4 | 1960-05-01 | 0.0027 |
... | ... | ... |
763 | 2023-08-01 | 0.0045 |
764 | 2023-09-01 | 0.0043 |
765 | 2023-10-01 | 0.0047 |
766 | 2023-11-01 | 0.0044 |
767 | 2023-12-01 | 0.0043 |
768 rows × 2 columns
The last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages.
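The write-and-read round trip can be tried out without creating any file by using an in-memory database. The sketch below uses a made-up two-row table with a hypothetical name and contrasts reading with and without the parse_dates argument.

```python
import sqlite3
import pandas as pd

# In-memory database: nothing is written to disk.
con_demo = sqlite3.connect(":memory:")

rf_demo = pd.DataFrame({
    "date": pd.to_datetime(["1960-01-01", "1960-02-01"]),
    "rf": [0.0033, 0.0029],
})
rf_demo.to_sql(name="rf_demo", con=con_demo, if_exists="replace", index=False)

# Without parse_dates, the date column does not come back as datetime.
unparsed = pd.read_sql_query("SELECT date, rf FROM rf_demo", con=con_demo)

# With parse_dates, pandas converts the column back to datetime.
parsed = pd.read_sql_query(
    "SELECT date, rf FROM rf_demo", con=con_demo, parse_dates=["date"]
)
con_demo.close()

print(unparsed.dtypes)
print(parsed.dtypes)
```

Passing parse_dates on every read keeps the date column type consistent between the in-memory data and what comes back from the database.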
Before we move on to the next data source, let us also store the other six tables in our new SQLite database.
data_dict = {
  "factors_ff5_monthly": factors_ff5_monthly,
  "factors_ff3_daily": factors_ff3_daily,
  "industries_ff_monthly": industries_ff_monthly,
  "factors_q_monthly": factors_q_monthly,
  "macro_predictors": macro_predictors,
  "cpi_monthly": cpi_monthly
}

for key, value in data_dict.items():
    value.to_sql(name=key,
                 con=tidy_finance,
                 if_exists="replace",
                 index=False)
From now on, all you need to do to access data that is stored in the database is to follow two steps: (i) Establish the connection to the SQLite-database and (ii) execute the query to fetch the data. For your convenience, the following steps show all you need in a compact fashion.
import pandas as pd
import sqlite3
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")

factors_q_monthly = pd.read_sql_query(
  sql="SELECT * FROM factors_q_monthly",
  con=tidy_finance,
  parse_dates={"date"}
)
Managing SQLite Databases
Finally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file never shrinks on its own.
To optimize the database file, you can run the VACUUM command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the execute() function.
tidy_finance.execute("VACUUM")
The VACUUM command actually performs a couple of additional cleaning steps, which you can read about in this tutorial.
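The effect of VACUUM on the file size can be observed with a small experiment. The sketch below writes a throwaway table to a temporary database file, drops it, and compares file sizes. The table name and contents are invented, and the behavior after the drop assumes SQLite's default setting without automatic vacuuming.

```python
import os
import sqlite3
import tempfile

# Temporary database file with a hypothetical name.
vac_path = os.path.join(tempfile.mkdtemp(), "vacuum_demo.sqlite")
con_vac = sqlite3.connect(vac_path)

# Fill a table with roughly 2 MB of text.
con_vac.execute("CREATE TABLE filler (x TEXT)")
con_vac.executemany(
    "INSERT INTO filler VALUES (?)", [("a" * 1000,) for _ in range(2000)]
)
con_vac.commit()
size_full = os.path.getsize(vac_path)

# Dropping the table frees pages inside the file but keeps the file itself.
con_vac.execute("DROP TABLE filler")
con_vac.commit()
size_after_drop = os.path.getsize(vac_path)

# VACUUM rebuilds the file and releases the unused pages.
con_vac.execute("VACUUM")
size_after_vacuum = os.path.getsize(vac_path)
con_vac.close()

print(size_full, size_after_drop, size_after_vacuum)
```

Note that VACUUM cannot run inside an open transaction, which is why the sketch commits before executing it.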
Exercises
- Download the monthly Fama-French factors manually from Kenneth French’s data library and read them in via pd.read_csv(). Validate that you get the same data as via the pandas-datareader package.
- Download the daily Fama-French 5 factors using the pdr.DataReader() function. After the successful download and conversion to the column format that we used above, compare the rf, mkt_excess, smb, and hml columns of factors_ff3_daily to factors_ff5_daily. Discuss any differences you might find.