What is Tidy Finance?

An op-ed about the motives behind Tidy Finance with R
Op-Ed
Authors

Christoph Scheuch

Stefan Voigt

Patrick Weiss

Published

January 16, 2023

Empirical finance can be tedious. Many standard tasks, such as cleaning data or forming factor portfolios, require a lot of effort. The code to produce even seminal results is typically opaque. Why should researchers have to reinvent the wheel over and over again?

Tidy Finance with R is our take on how to conduct empirical research in financial economics from scratch. Whether you are an industry professional looking to expand your quant skills, a graduate student diving into the finance world, or an academic researcher, this book shows you how to use R to master applications in asset pricing, portfolio optimization, risk management, and option pricing.

We wrote this book to provide a comprehensive guide to using R for financial analysis. Our book collects all the tools we wish we would have had at hand at the beginning of our graduate studies in finance. Without transparent code for standard procedures, numerous replication efforts (and their failures) feel like a waste of resources. We have been there, as probably everybody working with data has. Since we kicked off our careers, we have constantly updated our methods, coding styles, and workflows. Our book reflects our lessons learned. By sharing them, we aim to help others avoid dead ends.

Working on problems that countless others have already solved in secrecy is not just tedious, it even may have detrimental effects. In a recent study1 together with hundreds of research teams from across the globe, Albert J. Menkveld, the best-publishing Dutch economist according to Economentop 40, shows that without a standard path to do empirical analysis, results may vary substantially. Even if teams set out to analyze the same research question based on the same data, implementation details are important and deserve more than treatment as subtleties.

There will always be multiple acceptable ways to test relevant research questions. So why should it matter that our book lifts our curtain on reproducible finance by providing a fully transparent code base for many typical financial applications? First and foremost, we hope to inspire others to make their research truly reproducible. This is not a purely theoretical exercise: our examples start with data preparation and conclude with communicating results to get readers to do empirical analysis on real data as fast as possible. We believe that the need for precise academic writing does not stop where the code begins. Understanding and agreeing on standard procedures hopefully frees up resources to focus on what matters: a novel research project, a seminar paper, or a thorough analysis for your employer. If our book helps to provide a foundation for discussions on which determinants render code useful, we have achieved much more than we were hoping for at the beginning of this project.

Unlike typical stylized examples, our book starts with the problems of any serious research project. The often overlooked challenge behind empirical research may seem trivial at first glance: we need data to conduct our analyses. Finance is not an exception: raw data, often hidden behind proprietary financial data sources, requires cleaning before there is any hope of extracting valuable insights from it. While you can despise data cleaning, you cannot ignore it.

We describe and provide the code to prepare typical open-source and proprietary financial data sources (e.g., CRSP, Compustat, Mergent FISD, TRACE). We reuse these data in all the subsequent chapters, which we keep as self-contained as possible. The empirical applications range from key concepts of empirical asset pricing (beta estimation, portfolio sorts, performance analysis, Fama-French factors) to modeling and machine learning applications (fixed effects estimation, clustering standard errors, difference-in-difference estimators, ridge regression, Lasso, Elastic net, random forests, neural networks) and portfolio optimization techniques.

Necessarily, our book reflects our opinionated perspective on how to perform empirical analyses. From our experience as researchers and instructors, we believe in the value of the workflows we teach and apply daily. The entire book rests on two core concepts: coding principles using the tidyverse family of R packages and tidy data.

We base our book entirely on the open-source programming language R. R and the tidyverse community provide established tools to perform typical data science tasks, ranging from cleaning and manipulation to plotting and modeling. R is hence the ideal environment to develop an accessible code base for future finance students. The concept of tidy data refers to organizing financial data in a structured and consistent way, allowing for easy analysis and understanding.2 Taken together, tidy data and code help achieve the ultimate goal: to provide a fundamentally human-centered experience that makes it easier to teach, learn, and replicate the code of others – or even your own!

We are convinced that empirical research in finance is in desperate need of reproducible code to form standards for otherwise repetitive tasks. Instructors and researchers have already reached out to us with grateful words about our book. Tidy Finance finds its way into lecture halls across the globe already today. Various recent developments support our call for increased transparency. For instance, Cam Harvey, the former editor of the Journal of Finance, and a former president of the American Finance Association, openly argues that the profession needs to tackle the replication crisis.3 Top journals in financial economics increasingly adopt code and data-sharing policies to increase transparency. The industry and academia are aware and concerned (if not alarmed) about these issues, which is why we believe that the timing for publishing Tidy Finance with R could not be better.

Footnotes

  1. Menkveld, A. J. et al. (2022). “Non-standard Errors”. http://dx.doi.org/10.2139/ssrn.3961574↩︎

  2. Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10↩︎

  3. Wigglesworth, R. (2021). The hidden ‘replication crisis’ of finance. Financial Times. https://www.ft.com/content/9025393f-76da-4b8f-9436-4341485c75d0↩︎