Tidy Finance with R

Preface

This website is the online version of Tidy Finance with R, a book published via Chapman & Hall/CRC. The book is the result of a joint effort of Christoph Scheuch, Stefan Voigt, and Patrick Weiss.

We are grateful for any kind of feedback on every aspect of the book. So please get in touch with us via contact@tidy-finance.org if you spot typos, discover any issues that deserve more attention, or if you have suggestions for additional chapters and sections. Additionally, let us know if you found the text helpful. We look forward to hearing from you!

Support Tidy Finance

Buy our book via your preferred vendor or support us with coffee here.

Why Does This Book Exist?

Financial economics is a vibrant area of research, a central part of all business activities, and at least implicitly relevant to our everyday life. Despite its relevance for our society and a vast number of empirical studies of financial phenomena, one quickly learns that the actual implementation of models to solve problems in the area of financial economics is typically rather opaque. As graduate students, we were particularly surprised by the lack of public code for seminal papers or even textbooks on key concepts of financial economics. The lack of transparent code not only leads to numerous replication efforts (and their failures) but also constitutes a waste of resources on problems that countless others have already solved in secrecy.

This book aims to lift the curtain on reproducible finance by providing a fully transparent code base for many common financial applications. We hope to inspire others to share their code publicly and take part in our journey toward more reproducible research in the future.

Who Should Read This Book?

We write this book for three audiences:

Students who want to acquire the basic tools required to conduct financial research ranging from the undergraduate to graduate level. The book’s structure is simple enough that the material is sufficient for self-study.
Instructors who look for materials to teach courses in empirical finance or financial economics. We provide plenty of examples and focus on intuitive explanations that can easily be adjusted or expanded. At the end of each chapter, we provide exercises that we hope inspire students to dig deeper.
Data analysts or statisticians who work on issues dealing with financial data and who need practical tools to succeed.

What Will You Learn?

The book is currently divided into five parts:

The first part introduces you to important concepts around which our approach to Tidy Finance revolves.
The second part provides tools to organize your data and prepare the most common datasets used in financial research. Although many important data are behind paywalls, we start by describing different open-source data and how to download them. We then move on to prepare two of the most popular datasets in financial research: CRSP and Compustat. Then, we cover corporate bond data from TRACE. We reuse the data from these chapters in all subsequent chapters. The last chapter of this part contains an overview of common alternative data providers for which direct access via R packages exists.
The third part deals with key concepts of empirical asset pricing, such as beta estimation, portfolio sorts, performance analysis, and asset pricing regressions.
In the fourth part, we apply linear models to panel data and machine learning methods to problems in factor selection and option pricing.
The last part provides approaches for parametric, constrained portfolio optimization, and backtesting procedures.

Each chapter is self-contained and can be read individually. Yet, the data chapters provide an important background necessary for data management in all other chapters.

What Won’t You Learn?

This book is about empirical work. While we assume only basic knowledge of statistics and econometrics, we do not provide detailed treatments of the underlying theoretical models or methods applied in this book. Instead, you find references to the seminal academic work in journal articles or textbooks for more detailed treatments. We believe that our comparative advantage is to provide a thorough implementation of typical approaches such as portfolio sorts, backtesting procedures, regressions, machine learning methods, or other related topics in empirical finance. We enrich our implementations by discussing the nitty-gritty choices you face while conducting empirical analyses. We hence refrain from deriving theoretical models or extensively discussing the statistical properties of well-established tools.

Our book is close in spirit to other books that provide fully reproducible code for financial applications. We view them as complementary to our work and want to highlight the differences:

Regenstein Jr (2018) provides an excellent introduction and discussion of different tools for standard applications in finance (e.g., how to compute returns and sample standard deviations of a time series of stock returns). In contrast, our book clearly focuses on applications of the state-of-the-art for academic research in finance. We thus fill a niche that allows aspiring researchers or instructors to rely on a well-designed code base.
Coqueret and Guida (2020) constitute a great compendium to our book with respect to applications related to return prediction and portfolio formation. The book primarily targets practitioners and has a hands-on focus. Our book, in contrast, relies on the typical databases used in financial research and focuses on the preparation of such datasets for academic applications. In addition, our chapter on machine learning focuses on factor selection instead of return prediction.

Although we emphasize the importance of reproducible workflow principles, we do not provide introductions to some of the core tools that we relied on to create and maintain this book:

Version control systems such as Git are vital in managing any programming project. Originally designed to organize the collaboration of software developers, even solo data analysts will benefit from adopting version control. Git also makes it simple to publicly share code and allow others to reproduce your findings. We refer to Bryan (2022) for a gentle introduction to the (sometimes painful) life with Git.
Good communication of results is a key ingredient to reproducible and transparent research. To compile this book, we heavily draw on a suite of fantastic open-source tools. First, Wickham (2016) provides a highly customizable yet easy-to-use system for creating data visualizations. Wickham, Çetinkaya-Rundel, and Grolemund (2023) provide an intuitive introduction to creating graphics using this approach. Second, in our daily work and to compile this book, we used the markdown-based authoring framework described in Xie, Allaire, and Grolemund (2018) and Xie, Dervieux, and Riederer (2020). Markdown documents are fully reproducible and support dozens of static and dynamic output formats. Lastly, Xie (2016) tremendously facilitates authoring markdown-based books. We do not provide introductions to these tools, as the resources above already provide easily accessible tutorials.
Good writing is also important for the presentation of findings. We neither claim to be experts in this domain nor do we try to sound particularly academic. On the contrary, we deliberately use a more colloquial language to describe all the methods and results presented in this book in order to allow our readers to relate more easily to the rather technical content. For those who desire more guidance with respect to formal academic writing for financial economics, we recommend Kiesling (2003), Cochrane (2005), and Jacobsen (2014), who all provide essential tips (condensed to a few pages).

Why R?

We believe that R (R Core Team 2022) is among the best choices for a programming language in the area of finance. Some of our favorite features include:

R is free and open-source, so that you can use it in academic and professional contexts.
A diverse and active online community works on a broad range of tools.
A massive set of actively maintained packages exists for all kinds of applications, e.g., data manipulation, visualization, and machine learning.
Powerful tools for communication, e.g., R Markdown and shiny, are readily available.
RStudio is one of the best development environments for interactive data analysis.
Strong foundations of functional programming are provided.
Smooth integration with other programming languages, e.g., SQL, Python, C, C++, and Fortran.

For more information on why R is great, we refer to Wickham et al. (2019).

Why Tidy?

As you start working with data, you quickly realize that you spend a lot of time reading, cleaning, and transforming your data. In fact, it is often said that more than 80 percent of data analysis is spent on preparing data. By tidying data, we want to structure datasets to facilitate further analyses. As Wickham (2014) puts it:

[T]idy datasets are all alike, but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

In its essence, tidy data follows these three principles:

Every column is a variable.
Every row is an observation.
Every cell is a single value.

Throughout this book, we try to follow these principles as best as we can. If you want to learn more about tidy data principles in an informal manner, we refer you to this vignette as part of Wickham and Girlich (2022).
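To make the three principles concrete, the following minimal sketch reshapes a small made-up table of prices from a wide layout into a tidy one. The data, the column names (date, symbol, price), and the ticker symbols are purely illustrative assumptions, and pivot_longer() from tidyr is just one way to do the reshaping.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per stock, so the variable "symbol"
# is hidden in the column names.
prices_wide <- tibble(
  date = as.Date(c("2023-01-02", "2023-01-03")),
  AAA = c(10.0, 10.5),
  BBB = c(20.0, 19.5)
)

# Tidy layout: every column is a variable, every row is one
# date-symbol observation, and every cell holds a single value.
prices_tidy <- prices_wide |>
  pivot_longer(
    cols = -date,
    names_to = "symbol",
    values_to = "price"
  )

prices_tidy
```

In the tidy version, adding a third stock adds rows rather than a new column, so code written for two symbols keeps working unchanged.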

In addition to the data layer, there are also tidy coding principles outlined in the tidy tools manifesto that we try to follow:

Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.

In particular, we heavily draw on a set of packages called the tidyverse (Wickham et al. 2019). The tidyverse is a consistent set of packages for all data analysis tasks, ranging from importing and wrangling to visualizing and modeling data with the same grammar. In addition to explicit tidy principles, the tidyverse has further benefits: (i) if you master one package, it is easier to master others, and (ii) the core packages are developed and maintained by the Public Benefit Company Posit. These core packages contained in the tidyverse are: ggplot2 (Wickham 2016), dplyr (Wickham et al. 2022), tidyr (Wickham and Girlich 2022), readr (Wickham, Hester, and Bryan 2022), purrr (Henry and Wickham 2020), tibble (Müller and Wickham 2022), stringr (Wickham 2019), forcats (Wickham 2021), and lubridate (Grolemund and Wickham 2011).
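As a hedged illustration of the coding principles above, the sketch below composes small dplyr verbs with the pipe and uses a purrr function in place of an explicit loop. The returns and symbols are invented for illustration and are not taken from the datasets used in this book.

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical daily returns for two made-up symbols.
returns <- tibble(
  symbol = c("AAA", "AAA", "BBB", "BBB"),
  ret = c(0.01, 0.03, -0.02, 0.04)
)

# Reuse an existing data structure (a tibble) and compose simple verbs.
mean_returns <- returns |>
  group_by(symbol) |>
  summarize(mean_ret = mean(ret), .groups = "drop")

# Functional programming: apply one function to each element of a list
# instead of writing a loop.
ret_by_symbol <- split(returns$ret, returns$symbol)
volatility <- map_dbl(ret_by_symbol, sd)

mean_returns
volatility
```

Each step takes a data structure in and returns one, which is what makes the steps easy to chain and to read in order.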

Note

Throughout the book, we use the native pipe |>, a powerful tool to clearly express a sequence of operations. Readers familiar with the tidyverse may be used to the predecessor %>% that is part of the magrittr package. For all our applications, the native and magrittr pipes behave identically, so we opt for the one that is simpler and part of base R. For a more thorough discussion on the subtle differences between the two pipes, we refer to this blog post second edition by Hadley Wickham.
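The following base R sketch shows what the pipe does: x |> f() is another way of writing f(x), so a nested call can be read from left to right. The vector is an arbitrary example, and the native pipe requires R 4.1 or newer.

```r
# An arbitrary example vector.
x <- c(1, 4, 9)

# Nested calls are read inside-out.
nested <- sum(sqrt(x))

# The same computation expressed as a sequence of steps with the native pipe.
piped <- x |>
  sqrt() |>
  sum()

# Both forms give the same result.
identical(nested, piped)
```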

About the Authors

We met at the Vienna Graduate School of Finance from which each of us graduated with a different focus but a shared passion: coding with R. We continue to sharpen our R skills as part of our current occupations:

Christoph Scheuch is an independent business intelligence & data science expert. Previously, he was the Head of AI, Director of Product, and Head of BI & Data Science at the social trading platform wikifolio.com. He also was an external lecturer at the Vienna University of Economics and Business (WU), where he obtained his PhD in finance as part of the Vienna Graduate School of Finance (VGSF).
Stefan Voigt is an Assistant Professor of Finance at the Department of Economics at the University of Copenhagen and a research fellow at the Danish Finance Institute. His research focuses on blockchain technology, high-frequency trading, and financial econometrics. Stefan’s research has been published in the leading finance and econometrics journals. He received the Danish Finance Institute Teaching Award 2022 for his courses for students and practitioners on empirical finance based on this book.
Patrick Weiss is an Assistant Professor of Finance at Reykjavik University and an external lecturer at the Vienna University of Economics and Business. His research activity centers around the intersection of empirical asset pricing and corporate finance. Patrick is especially passionate about empirical asset pricing and has published research in leading journals in financial economics.

License

This book is licensed to you under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0). The code samples in this book are licensed under Creative Commons CC0 1.0 Universal (CC0 1.0), i.e., public domain. You can cite this project as follows:

Scheuch, C., Voigt, S., & Weiss, P. (2023). Tidy Finance with R (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b23237.

@book{Scheuch2023,
  title = {Tidy Finance with R},
  author = {Scheuch, Christoph and Voigt, Stefan and Weiss, Patrick},
  year = {2023},
  publisher = {Chapman and Hall/CRC},
  edition  = {1st},
  url = {https://tidy-finance.org/r/},
  doi = {10.1201/b23237}
}

Future Updates and Changes

This book represents a snapshot of research practices and available data at a particular time. However, time does not stop. As you read this text, there is new data, packages used here have changed, and research practices might be updated. We as authors of Tidy Finance are committed to staying up-to-date and keeping up with the newest developments. Therefore, you can expect updates to Tidy Finance on a continuous basis. The best way for you to monitor the ongoing developments is to check our online Changelog frequently.

References

Bryan, Jennifer. 2022. “Happy Git and GitHub for the useR.” https://github.com/jennybc/happy-git-with-r.

Cochrane, John H. 2005. “Writing tips for PhD students.” Note. https://www.johnhcochrane.com/research-all/writing-tips-for-phd-studentsnbsp.

Coqueret, Guillaume, and Tony Guida. 2020. Machine learning for factor investing: R version. Chapman and Hall/CRC. https://www.mlfactor.com/.

Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and times made easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.

Henry, Lionel, and Hadley Wickham. 2020. purrr: Functional programming tools. https://CRAN.R-project.org/package=purrr.

Jacobsen, Ben. 2014. “Some research and writing tips.” Note. https://albertjmenkveld.com/text/Jacobsen14.pdf.

Kiesling, Lynne. 2003. “Writing tips for economics (and pretty much anything else).” Note. https://nuwrite.northwestern.edu/communities/social-sciences/economics/docs/writing-advice-for-papers-in-economics/Kiesling%20writingguidelines.pdf.

Müller, Kirill, and Hadley Wickham. 2022. tibble: Simple data frames. https://CRAN.R-project.org/package=tibble.

R Core Team. 2022. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Regenstein Jr, Jonathan K. 2018. Reproducible finance with R: Code flows and shiny apps for portfolio analysis. Chapman and Hall/CRC. http://www.reproduciblefinance.com/start-here/.

Wickham, Hadley. 2014. “Tidy data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.

———. 2016. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2019. stringr: Simple, consistent wrappers for common string operations. https://CRAN.R-project.org/package=stringr.

———. 2021. forcats: Tools for working with categorical variables (Factors). https://CRAN.R-project.org/package=forcats.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for data science: Import, tidy, transform, visualize, and model data. 2nd ed. O’Reilly. https://r4ds.hadley.nz/.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. dplyr: A grammar of data manipulation. https://CRAN.R-project.org/package=dplyr.

Wickham, Hadley, and Maximilian Girlich. 2022. tidyr: Tidy messy data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2022. readr: Read rectangular text data. https://CRAN.R-project.org/package=readr.

Xie, Yihui. 2016. bookdown: Authoring books and technical documents with R Markdown. Chapman and Hall/CRC. https://bookdown.org/yihui/bookdown.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The definitive guide. Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R markdown cookbook. Chapman and Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.