This website is the online version of Tidy Finance with R, a book currently under development and intended for eventual print release via Chapman & Hall/CRC. The book is the result of a joint effort of Christoph Scheuch, Stefan Voigt, and Patrick Weiss.
We are grateful for any kind of feedback on every aspect of the book. So please get in touch with us via email@example.com if you spot typos, discover any issues that deserve more attention, or if you have suggestions for additional chapters and sections. Additionally, let us know if you found the text helpful. We look forward to hearing from you!
Financial economics is a vibrant area of research, a central part of all business activities, and at least implicitly relevant for our everyday lives. Despite its relevance for society and the vast number of empirical studies of financial phenomena, one quickly learns that the actual implementation of models to solve problems in financial economics is typically rather opaque. As graduate students, we were particularly surprised by the lack of public code for seminal papers or even textbooks on key concepts of financial economics. The lack of transparent code not only leads to numerous replication efforts (and their failures), but also constitutes a waste of resources on problems that countless others have already solved in secrecy.
This book aims to lift the curtain on reproducible finance by providing a fully transparent code base for many common financial applications. We hope to inspire others to share their code publicly and take part in our journey towards more reproducible research in the future.
We write this book for three audiences:
- Students, from the undergraduate to the graduate level, who want to acquire the basic tools required to conduct financial research. The book's structure is simple enough for the material to be suitable for self-study.
- Instructors who look for materials to teach courses in empirical finance or financial economics. We provide plenty of examples and focus on intuitive explanations that can easily be adjusted or expanded. At the end of each chapter, we provide exercises that we hope inspire students to dig deeper.
- Data analysts or statisticians who work on issues dealing with financial data and who need practical tools to succeed.
The book is currently divided into five parts:
- Chapter 1 introduces you to important concepts around which our approach to Tidy Finance revolves.
- Chapters 2-4 provide tools to organize your data and prepare the most common datasets used in financial research. Although many important data sources are behind paywalls, we start by describing different open-source data sources and how to download them. We then move on to preparing two of the most popular datasets in financial research: CRSP and Compustat. Then, we cover corporate bond data from TRACE. We reuse the data from these chapters in all subsequent chapters. Chapter 5 contains an overview of common alternative data providers for which direct access via R packages exists.
- Chapters 6-11 deal with key concepts of empirical asset pricing such as beta estimation, portfolio sorts, performance analysis, and asset pricing regressions.
- Chapters 12-15 apply linear models to panel data and machine learning methods to problems in factor selection and option pricing.
- Chapters 16-17 provide approaches for parametric, constrained portfolio optimization, and backtesting procedures.
Each chapter is self-contained and can be read individually. Yet the data chapters provide important background necessary for the data management in all other chapters.
This book is about empirical work. While we assume only basic knowledge of statistics and econometrics, we do not provide detailed treatments of the underlying theoretical models or methods applied in this book. Instead, you will find references to the seminal academic work in journal articles or textbooks for more detailed treatments. We believe that our comparative advantage is to provide a thorough implementation of typical approaches such as portfolio sorts, backtesting procedures, regressions, machine learning methods, and other related topics in empirical finance. We enrich our implementations with discussions of the nitty-gritty choices you face while conducting empirical analyses. We hence refrain from deriving theoretical models or extensively discussing the statistical properties of well-established tools.
Our book is close in spirit to other books that provide fully reproducible code for financial applications. We view them as complementary to our work and want to highlight the differences:
- Regenstein Jr (2018) provides an excellent introduction to and discussion of different tools for standard applications in finance (e.g., how to compute returns and sample standard deviations of a time series of stock returns). Our book, in contrast, has a clear focus on state-of-the-art applications for academic research in finance. We thus fill a niche that allows aspiring researchers and instructors to rely on a well-designed code base.
- Coqueret and Guida (2020) constitutes a great compendium to our book with respect to applications related to return prediction and portfolio formation. The book primarily targets practitioners and has a hands-on focus. Our book, in contrast, relies on the typical databases used in financial research and focuses on the preparation of such datasets for academic applications. In addition, our chapter on machine learning focuses on factor selection instead of return prediction.
Although we emphasize the importance of reproducible workflow principles, we do not provide introductions to some of the core tools that we relied on to create and maintain this book:
- Version control systems such as Git are vital in managing any programming project. Originally designed to organize collaboration among software developers, even solo data analysts benefit from adopting version control. Git also makes it simple to share code publicly and allow others to reproduce your findings. We refer to Bryan (2022) for a gentle introduction to the (sometimes painful) life with Git.
- Good communication of results is a key ingredient of reproducible and transparent research. To compile this book, we heavily draw on a suite of fantastic open-source tools. First, Wickham (2016) provides a highly customizable yet easy-to-use system for creating data visualizations. Wickham and Grolemund (2016) provide an intuitive introduction to creating graphics using this approach. Second, in our daily work and to compile this book, we used the markdown-based authoring framework described in Xie, Allaire, and Grolemund (2018) and Xie, Dervieux, and Riederer (2020). Markdown documents are fully reproducible and support dozens of static and dynamic output formats. Lastly, Xie (2016) tremendously facilitates authoring markdown-based books. We do not provide introductions to these tools, as the resources above already provide easily accessible tutorials.
- Good writing is also important for the presentation of findings. We neither claim to be experts in this domain, nor do we try to sound particularly academic. On the contrary, we deliberately use a more colloquial language to describe all the methods and results presented in this book in order to allow our readers to relate more easily to the mainly technical content. For those who desire more guidance with respect to proper academic writing for financial economics, we recommend Kiesling (2003), Cochrane (2005), and Jacobsen (2014) who all provide essential tips (condensed to a few pages).
We believe that R is among the best choices for a programming language in the area of finance. Some of our favorite features include:
- R is free and open-source so that you can use it in academic and professional contexts.
- A diverse and active online community works on a broad range of tools.
- A massive set of actively maintained packages for all kinds of applications exists, e.g., data manipulation, visualization, machine learning, etc.
- Powerful tools for communication, e.g., Rmarkdown and shiny, are readily available.
- RStudio is one of the best development environments for interactive data analysis.
- Strong foundations of functional programming are provided.
- Smooth integration with other programming languages, e.g., SQL, Python, C, C++, Fortran, etc.
For more information on why R is great, we refer to Wickham et al. (2019).
As you start working with data, you quickly realize that you spend a lot of time reading, cleaning, and transforming your data. In fact, it is often said that more than 80% of data analysis is spent on preparing data. By tidying data, we want to structure data sets to facilitate further analyses. As Wickham (2014) puts it:
[T]idy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).
In its essence, tidy data follows these three principles:
- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.
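To make these principles concrete, consider a table of stock returns with one column per year: the values of the "year" variable are stuck in the column headers, violating the first principle. The tickers and numbers below are invented for illustration, and the sketch uses only base R's `reshape()` (the tidyverse offers `tidyr::pivot_longer()` for the same task):

```r
# Untidy: each year is a separate column, so the values of the
# "year" variable live in the column headers
wide <- data.frame(
  ticker   = c("AAPL", "MSFT"),
  ret_2021 = c(0.34, 0.52),
  ret_2022 = c(-0.27, -0.28)
)

# Tidy: every column is a variable (ticker, year, ret),
# every row is a single ticker-year observation
long <- reshape(
  wide,
  direction = "long",
  varying   = c("ret_2021", "ret_2022"),
  v.names   = "ret",
  timevar   = "year",
  times     = c(2021, 2022),
  idvar     = "ticker"
)
rownames(long) <- NULL
```

The long table now has one row per ticker-year pair, which makes grouped summaries and plots straightforward.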
Throughout this book, we try to follow these principles as best as we can. If you want to learn more about tidy data principles in an informal manner, we refer you to this vignette as part of Wickham and Girlich (2022).
In addition to the data layer, there are also tidy coding principles outlined in the tidy tools manifesto that we try to follow:
- Reuse existing data structures.
- Compose simple functions with the pipe.
- Embrace functional programming.
- Design for humans.
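To illustrate the third principle, functional programming replaces explicit loops with functions applied over data. A minimal base-R sketch with made-up fund returns (the tidyverse implements this style in `purrr::map()` and friends):

```r
# A list of made-up monthly return series, one per fund
monthly_returns <- list(
  fund_a = c(0.01, 0.02, -0.005),
  fund_b = c(0.00, 0.015, 0.01)
)

# sapply() applies the same anonymous function to every element,
# replacing an explicit for-loop over the list
annualized <- sapply(monthly_returns, function(r) {
  prod(1 + r)^(12 / length(r)) - 1
})
```

The transformation is stated once and applied uniformly, which keeps the code short and the intent visible.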
In particular, we heavily draw on a set of packages called the tidyverse (Wickham et al. 2019). The tidyverse is a consistent set of packages for all data analysis tasks, ranging from importing and wrangling to visualizing and modeling data with the same grammar. In addition to explicit tidy principles, the tidyverse has further benefits: (i) if you master one package, it is easier to master others, and (ii) the core packages are developed and maintained by the Public Benefit Company Posit.
The core packages contained in the tidyverse are ggplot2 (Wickham 2016), dplyr (Wickham, François, et al. 2022), tidyr (Wickham and Girlich 2022), readr (Wickham, Hester, and Bryan 2022), purrr (Henry and Wickham 2020), tibble (Müller and Wickham 2022), stringr (Wickham 2019), and forcats (Wickham 2021).
Throughout the book, we use the native pipe |>, a powerful tool to clearly express a sequence of operations. Readers familiar with the tidyverse may be used to its predecessor %>%, which is part of the magrittr package. For all our applications, the native and magrittr pipes behave identically, so we opt for the one that is simpler and part of base R. For a more thorough discussion of the subtle differences between the two pipes, we refer to the second edition of Wickham and Grolemund (2016).
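As a quick illustration of how the pipe linearizes nested calls, the following sketch computes the average absolute value of a made-up return series both ways:

```r
# A made-up series of daily returns
returns <- c(0.02, -0.01, 0.03, 0.005)

# Nested function calls read inside-out
nested <- round(mean(abs(returns)), 3)

# The native pipe (R >= 4.1) passes the left-hand side as the first
# argument of the next call, so the same logic reads left to right
piped <- returns |>
  abs() |>
  mean() |>
  round(3)
```

Both expressions yield the same value; the piped version simply mirrors the order in which the operations are applied.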
Before we continue, make sure you have all the software you need for this book:
- Install R and RStudio. To get a walk-through of the installation for every major operating system, follow the steps outlined in this summary. The whole process should be done in a few clicks. If you wonder about the difference: R is an open-source language and environment for statistical computing and graphics, free to download and use. While R runs the computations, RStudio is an integrated development environment that provides an interface by adding many convenient features and tools. We suggest doing all the coding in RStudio.
- Open RStudio and install the tidyverse. Not sure how that works? You can find helpful information on how to install packages in this brief summary.
If you are new to R, we recommend starting with the following sources:
- A very gentle and good introduction to the workings of R can be found in the weighted dice project. Once you are done setting up R on your machine, try to follow the instructions in this project.
- The main book on the tidyverse, Wickham and Grolemund (2016), is available online and for free: R for Data Science explains the majority of the tools we use in our book.
- If you are an instructor searching for materials to effectively teach R and data science methods, we recommend taking a look at the excellent data science toolbox by Mine Cetinkaya-Rundel.
- RStudio provides a range of excellent cheat sheets with extensive information on how to use the various packages.
This book is licensed to you under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).
The code samples in this book are licensed under Creative Commons CC0 1.0 Universal (CC0 1.0), i.e., public domain.
This book was written in RStudio using bookdown (Xie 2016). The website is hosted with GitHub Pages. The complete source is available from GitHub. We generated all plots in this book using ggplot2 and its classic dark-on-light theme (theme_bw()).
This version of the book was built with R version 4.2.1 (2022-06-23, Funny-Looking Kid) and the following packages: