It is no secret that I’m a big supporter of using machine learning for causal inference. My interest in this field can be traced back to a Computational Statistics course I took with Marina Khismatullina at Uni Bonn in 2021. I was intrigued by how well machine learning method peformed in predictive scenarios with a large number of covariates and enjoyed learning comparing different ML models such as LASSO, Random Forest, XGBoost, and Support Vector Machines. We also learned about model training/test, cross-validation, and the bias-variance trade-off, where ML methods tend to sacrifice a little bit of bias for lower variance in their point estimates.
But in economics, we generally care about issues of causality, where bias, variance, and valid confidence intervals are the key considerations for any estimator. So my question was, how can we re-purpose the strength and flexibility of ML in large covariate spaces for high-dimensional problems in statistical inference? This issue is becoming even more important nowadays since there has been a drastic increase in the availability and granularity of data.
Over the course of my studies, I concentrated heavily on this topic and took several classes in econometrics and statistics to better understand the potential merger between machine learning’s superior predictive capabilities on one hand and the causal framework of traditional econometrics on the other. The result of this two-year effort was my master thesis “Causal Inference Using Random Forests and LASSO,”, where I explored innovative methods leveraging machine learning algorithms to evaluate causal effects, specifically through simulations and empirical applications. In this post, I would like to show the promise of these methods in specific econometric and highlight use-cases how this could help your research.
The study was motivated by the growing need for robust statistical methods capable of handling large, complex datasets often encountered in economics. Traditional econometric approaches sometimes fall short when addressing high-dimensional data and highly complex nuisance parameters. This research focused on integrating machine learning techniques, particularly Random Forests and LASSO, to enhance causal inference in econometric models.
My research was heavily inspired by the extensive works of Susan Athey (Stanford) and Victor Chernozhukov (MIT) in the space of ML for estimation of treatment effects. To that end, I first provided a comprehensive review of existing methods and their limitations, setting the stage for the introduction of Generalized Random Forests (GRF), Double Machine Learning (DML), and Post-Double-Selection approaches. These methods were hand=picked for their potential to improve the accuracy and reliability of treatment effect estimations in complex econometric models.
As a very cursory glance at these methods:
I used a series of Monte Carlo simulations to test the performance of these estimators under different scenarios—ranging from randomized controlled trials to settings with strong confounding. These simulations were crucial for demonstrating the practical applicability and robustness of the ML methods in mimicking real-world data complexities.
The theoretical insights and simulation results were applied to two empirical cases:
401(k) Participation and Financial Assets: This study evaluated how eligibility and participation in 401(k) plans affect net financial assets, showcasing the practical implications of these ML techniques in policy evaluation.
Robustness of ML Estimators: The ML-based methods, particularly GRF and DML, showed strong performance in handling complex, high-dimensional datasets and produced more reliable causal inferences compared to traditional methods.
The empirical applications illustrated how these advanced statistical techniques could be used to inform policy decisions and improve economic outcomes.
The thesis extended existing theories in econometrics by integrating and enhancing machine learning algorithms for better causal inference, providing a new framework that can be adopted in various economic and policy research areas.
This research bridged the gap between traditional econometric approaches and modern machine learning techniques, offering substantial improvements in the estimation of causal effects. The findings not only support the use of these advanced methods in academic and policy-making contexts but also open avenues for future research in econometrics and statistics.
This thesis, supervised by Prof. Dr. Christoph Breunig at Rheinische Friedrich-Wilhelms-Universität Bonn, represents a significant step forward in the use of machine learning to enhance causal inference, demonstrating the potential of integrating cutting-edge statistical techniques in economic research.