Course syllabus#

Content#

Eight three-hour sessions presenting key concepts and methods of machine learning for econometrics. The course focuses on flexible models and causal inference in high dimensions. Most sessions combine theoretical considerations with hands-on practical applications in Python or R.

The last session is devoted to group presentations of the course evaluation project.

For now, the website covers only the following topics:

    1. Statistical learning and regularized linear models

    2. Flexible models for tabular data

    3. Reminders of potential outcomes and Directed Acyclic Graphs

    4. Event studies: Causal methods for panel data

Motivation#

High dimensions motivate the methods covered: sparsity in confounders (lasso, double lasso), nonlinearities in confounders (double machine learning), and heterogeneity of effects (generic machine learning).

Session 1 – Statistical learning and regularized linear models#

  • Reminders of statistical learning: bias-variance tradeoff, appropriate representations, over- and under-fitting

  • Regularized regression: lasso, ridge, elastic net, post-lasso

  • Practical session:

    • Common pitfalls in the interpretation of coefficients of linear models.

  • References:

    • Estève et al. (2022)

    • Hastie et al. (2017)
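A minimal illustration of these estimators with scikit-learn, on a made-up sparse simulation (not the course's official notebook): when only a few coefficients are truly nonzero, the lasso zeroes out most of the irrelevant ones, while the ridge keeps all of them small.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3, -2, 1.5, -1, 0.5]          # sparse truth: 5 active coefficients
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)  # L1 penalty: sparse solution
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)  # L2 penalty: shrinks, never zeroes

n_nonzero = int(np.sum(lasso.coef_ != 0))
print(f"lasso nonzero coefficients: {n_nonzero} / {p}")
print(f"lasso test R^2: {lasso.score(X_te, y_te):.2f}")
print(f"ridge test R^2: {ridge.score(X_te, y_te):.2f}")
```

In practice the penalty strength `alpha` would be chosen by cross-validation (e.g. `LassoCV`) rather than fixed as here.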

Session 2 – Flexible models for tabular data#

  • Trees, random forests, boosting

  • Cross-validation, nested cross-validation

  • Practical session: Hyper-parameter selection for flexible models

  • References:
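The distinction between plain and nested cross-validation can be sketched as follows with scikit-learn, on simulated data (the grid and model choices are illustrative): the inner loop selects hyper-parameters, and the outer loop evaluates the whole selection procedure on data it never saw, avoiding the optimistic bias of reporting the inner score.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Simulated nonlinear regression problem
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# Inner loop: grid search selects hyper-parameters on each training split
inner = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None], "min_samples_leaf": [1, 5]},
    cv=3,
)

# Outer loop: scores the *whole* selection procedure on held-out folds,
# so the reported performance is not inflated by the hyper-parameter search
outer_scores = cross_val_score(inner, X, y, cv=3)
print(f"nested-CV R^2: {outer_scores.mean():.2f} +/- {outer_scores.std():.2f}")
```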

Session 3 – Potential outcomes, Directed Acyclic Graphs, confounder selection#

  • Reminders on causal inference: prediction vs. causation, potential outcomes, asking a sound causal question (PICO)

  • Causal graphs, the front-door criterion, and valid adjustment sets.

  • Practical Session: DAGs, valid and invalid adjustment sets, with simple linear models and simulations. Introduction to the DoubleML package.

  • References:
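A toy simulation in the spirit of this practical session (variable names and parameter values are made up; the course notebook uses the DoubleML package): with a confounder W affecting both treatment and outcome, the empty adjustment set is invalid and the naive comparison is biased, while adjusting for W recovers the true effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 10_000
w = rng.normal(size=n)                          # confounder: W -> D and W -> Y
d = (w + rng.normal(size=n) > 0).astype(float)  # binary treatment
y = 2.0 * d + 3.0 * w + rng.normal(size=n)      # true effect of D on Y is 2

# Invalid adjustment set (empty): the raw comparison mixes in the effect of W
naive = y[d == 1].mean() - y[d == 0].mean()

# Valid adjustment set {W}: conditioning on the confounder recovers the effect
adjusted = LinearRegression().fit(np.column_stack([d, w]), y).coef_[0]

print(f"naive difference in means: {naive:.2f}")
print(f"adjusting for W:           {adjusted:.2f}")
```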

Session 4a – Event studies: Causal methods for panel data#

  • A causal approach to Difference-in-Differences

  • Synthetic controls

  • Interrupted time series analysis and state space models

  • Practical session: Comparison of different methods for panel data

  • References:
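As a warm-up for this comparison, the basic 2x2 difference-in-differences estimator can be computed by hand on simulated panel-style data (the group/period structure and effect size are made up for illustration): differencing over time removes time-invariant group differences, and differencing across groups removes the common time trend.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8_000
group = rng.integers(0, 2, n)   # 1 = treated group
post = rng.integers(0, 2, n)    # 1 = after the policy
effect = 1.5                    # true treatment effect
y = (0.5 * group                # time-invariant group difference
     + 1.0 * post               # common time trend shared by both groups
     + effect * group * post    # treatment hits only the treated group, post-policy
     + rng.normal(size=n))

cell = lambda g, t: y[(group == g) & (post == t)].mean()
did = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
print(f"difference-in-differences estimate: {did:.2f}")
```

The same number can be obtained as the interaction coefficient in the regression of y on group, post, and group x post.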

Session 4b – Double-lasso for statistical inference#
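A sketch of the partialling-out version of the double lasso on simulated data (all names and parameter values here are illustrative, not the course's notebook): residualize both the outcome and the treatment on the high-dimensional controls with a lasso, then regress the residuals on each other, following Frisch-Waugh-Lovell logic with variable selection.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.normal(size=(n, p))               # high-dimensional controls
gamma = np.zeros(p); gamma[:3] = 1.0      # sparse effect of controls on D
beta = np.zeros(p); beta[:3] = 1.0        # sparse effect of controls on Y
d = X @ gamma + rng.normal(size=n)
theta = 0.5                               # true treatment effect
y = theta * d + X @ beta + rng.normal(size=n)

# Step 1: residualize Y and D on X, with lasso doing the variable selection
ry = y - LassoCV(cv=5).fit(X, y).predict(X)
rd = d - LassoCV(cv=5).fit(X, d).predict(X)

# Step 2: OLS of residualized Y on residualized D recovers theta
theta_hat = LinearRegression().fit(rd.reshape(-1, 1), ry).coef_[0]
print(f"double-lasso estimate of theta: {theta_hat:.2f}")
```

Using the lasso twice (once for each nuisance regression) is what makes the final estimate insensitive to small selection mistakes, unlike a single naive lasso of y on (d, X).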

Session 5 – Double machine learning: Neyman-orthogonality#

  • Importance of sample splitting for double machine learning

  • Doubly robust estimation (also known as augmented inverse propensity weighting)

  • Debiased (or double) machine learning, Neyman-orthogonality, method-of-moments

  • Practical session: The Effect of Gun Ownership on Gun-Homicide Rates

  • References:
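These three ingredients can be sketched together on simulated data (a hand-rolled illustration under made-up parameter values, not the gun-ownership notebook): fit the nuisance models on one fold, evaluate the AIPW score on the other (cross-fitting), and average the scores. The AIPW moment is Neyman-orthogonal, so first-order errors in the nuisance estimates do not bias the ATE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 4_000
x = rng.normal(size=(n, 5))
e_true = 1 / (1 + np.exp(-x[:, 0]))     # true propensity score
d = rng.binomial(1, e_true)
y = 1.0 * d + x[:, 0] + np.sin(x[:, 1]) + rng.normal(size=n)  # true ATE = 1

scores = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(x):
    # Nuisances are fit on the *other* fold: this is the sample splitting
    prop = LogisticRegression().fit(x[train], d[train])
    m1 = RandomForestRegressor(random_state=0).fit(
        x[train][d[train] == 1], y[train][d[train] == 1])
    m0 = RandomForestRegressor(random_state=0).fit(
        x[train][d[train] == 0], y[train][d[train] == 0])

    e = np.clip(prop.predict_proba(x[test])[:, 1], 0.05, 0.95)
    mu1, mu0 = m1.predict(x[test]), m0.predict(x[test])
    dt, yt = d[test], y[test]
    # AIPW (doubly robust) score: outcome-model part plus weighted residuals
    scores[test] = (mu1 - mu0
                    + dt * (yt - mu1) / e
                    - (1 - dt) * (yt - mu0) / (1 - e))

ate = scores.mean()
print(f"cross-fitted AIPW estimate of the ATE: {ate:.2f}")
```

Production code would use the DoubleML package covered in the course rather than this manual loop; the loop is only meant to make the moment condition and the cross-fitting explicit.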

Session 6 – Heterogeneous treatment effects#

Session 7 – Heterogeneous treatment effects#

  • Practical session:

    • Gaillac and L’Hour (2019), Chapter 6

    • Kitagawa and Tetenov (2018)
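One simple baseline for estimating heterogeneous effects is the T-learner, sketched here on a made-up randomized experiment (not the course's exercise): fit a separate outcome model per treatment arm and take the difference of their predictions as the conditional average treatment effect (CATE).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(-1, 1, size=(n, 3))
tau = np.where(x[:, 0] > 0, 2.0, 0.0)   # effect exists only where x0 > 0
d = rng.binomial(1, 0.5, n)             # randomized treatment
y = tau * d + x[:, 1] + rng.normal(scale=0.5, size=n)

# T-learner: one outcome model per arm, CATE = difference of predictions
m1 = RandomForestRegressor(random_state=0).fit(x[d == 1], y[d == 1])
m0 = RandomForestRegressor(random_state=0).fit(x[d == 0], y[d == 0])
cate = m1.predict(x) - m0.predict(x)

print(f"mean CATE where x0 > 0:  {cate[x[:, 0] > 0].mean():.2f}")
print(f"mean CATE where x0 <= 0: {cate[x[:, 0] <= 0].mean():.2f}")
```

Causal forests (Wager and Athey, 2018) and the R-learner (Nie and Wager, 2021) covered in these sessions improve on this baseline by targeting the effect function directly rather than differencing two separate fits.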

Bibliography#

[1]

Loïc Estève, Guillaume Lemaitre, Olivier Grisel, Gael Varoquaux, Arturo Amor, Lilian, Benoit Rospars, Thomas Schmitt, Lucy Liu, Bruno P. Kinoshita, hackmd-deploy, ph4ge, Peter Steinbach, Alexandre Boucaud, Benson Muite, Jérémie du Boisberranger, Michael Notter, Pierre, Shane P, alagarrigue, Mehrdad Mohammadian, and parmentelat. Inria/scikit-learn-mooc: third mooc session. Zenodo, 2022. URL: https://doi.org/10.5281/zenodo.7220307, doi:10.5281/zenodo.7220307.

[2]

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. 2017.

[3]

Kevin P Murphy. Probabilistic machine learning: an introduction. MIT press, 2022.

[4]

Victor Chernozhukov, Christian Hansen, Nathan Kallus, Martin Spindler, and Vasilis Syrgkanis. Applied causal inference powered by ML and AI. arXiv preprint arXiv:2403.02467, 2024. URL: https://causalml-book.org.

[5]

Stefan Wager. Stats 361: causal inference. Stanford University, 2020. URL: https://web.stanford.edu/~swager/stats361.pdf.

[6]

Tyler J VanderWeele. Principles of confounder selection. European journal of epidemiology, 34:211–219, 2019.

[7]

C Gaillac and J L’Hour. Machine learning for econometrics, lecture notes ensae paris. Lecture notes, 2019.

[8]

Alberto Abadie. Using synthetic controls: feasibility, data requirements, and methodological aspects. Journal of economic literature, 59(2):391–425, 2021.

[9]

Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.

[10]

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.

[11]

Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86(2):591–616, 2018.