A special topics in Biostatistics course at UC Berkeley in Spring 2018.
Welcome! This is the course website for Targeted Learning in Biomedical Big Data, offered in the Spring 2018 semester at UC Berkeley.
The schedule of video lectures and content is publicly available here.
Our first lab for Targeted Learning in Biomedical Big Data is a simple introduction to computationally reproducible research with R, git, and GitHub. We’ll discuss the importance of computational reproducibility and set some standards for how these principles apply to the work we’ll do in this class.
The resources above can help remedy any gaps in your background knowledge of scientific computing. You may wish to consult them regularly as the course progresses.
For our second lab, we’ll continue our discussion of computationally reproducible research, digging further into the simple toolbox we began assembling in the first lab.
Welcome to our third lab. Here we’ll begin discussing a few of the foundational principles of statistical causal inference, including structural causal models (SCMs), interventions, and the notion of identifiability. We’ll draw from several references, including our two Targeted Learning texts, as well as Judea Pearl’s Causality: Models, Reasoning, and Inference, a canonical reference.
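To fix ideas, a prototypical SCM of the sort used in the Targeted Learning texts posits that baseline covariates $W$, a treatment $A$, and an outcome $Y$ are each generated by a deterministic function of their parents and exogenous errors $U$:

$$W = f_W(U_W), \qquad A = f_A(W, U_A), \qquad Y = f_Y(W, A, U_Y).$$

An intervention setting $A = a$ replaces $f_A$ with the constant $a$, defining the counterfactual outcome $Y_a = f_Y(W, a, U_Y)$. Identifiability then asks when a causal quantity such as $\mathbb{E}[Y_a]$ can be expressed as a parameter of the observed data distribution, for example via the g-computation formula $\mathbb{E}_W[\mathbb{E}(Y \mid A = a, W)]$ under randomization and positivity assumptions.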
For our fourth lab, we’re switching gears from causal inference to statistical estimation. We will go through an introduction to cross-validation, a technique that ensures we obtain honest estimates of the error of a given statistical estimation technique through data splitting. We will also discuss the very general implementation of cross-validation in the recently developed origami R package.
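As a preview of what the lab covers, here’s a minimal sketch of V-fold cross-validation with origami, in the style of the package’s introductory vignette (treat the exact function signatures as assumptions that may vary across origami versions):

```r
library(origami)

# fold-specific routine: fit on the training split, predict on the
# held-out validation split, and return the squared errors
cv_lm <- function(fold, data) {
  train_data <- training(data)    # training split for this fold
  valid_data <- validation(data)  # held-out validation split
  mod <- lm(mpg ~ ., data = train_data)
  preds <- predict(mod, newdata = valid_data)
  list(SE = (preds - valid_data$mpg)^2)
}

set.seed(57)
folds <- make_folds(mtcars, V = 5)  # 5-fold cross-validation scheme
results <- cross_validate(cv_lm, folds, data = mtcars)
mean(results$SE)  # cross-validated mean squared error
```

The fold-specific function sees only its own training and validation splits, which is exactly what makes the resulting error estimate honest.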
Now that we’ve set the stage, for our fifth lab we’ll begin discussing flexible, data-adaptive estimation (i.e., machine learning), including the asymptotically optimal ensemble learning algorithm, Super Learner. We’ll dive into how to apply the Super Learning approach using the new sl3 R package.
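By way of preview, here is a minimal sketch of the sl3 workflow, assuming the package’s core API (an sl3_Task object, Lrnr_* learner classes, and the Lrnr_sl ensemble wrapper), with the built-in mtcars data standing in for real data:

```r
library(sl3)

# define the prediction task: regress mpg on a few covariates
task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "disp", "hp", "wt"),
  outcome = "mpg"
)

# a small library of candidate learners
lrnr_glm <- Lrnr_glm$new()
lrnr_mean <- Lrnr_mean$new()

# Super Learner: cross-validates the candidates and combines them via
# the package's default metalearner
sl <- Lrnr_sl$new(learners = list(lrnr_glm, lrnr_mean))
sl_fit <- sl$train(task)
head(sl_fit$predict())  # ensemble predictions on the task's data
```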
In our sixth lab meeting, we’ll continue discussing Super Learning, an optimal approach to ensemble machine learning. Here, we’ll dive into how to construct algorithmic libraries over which to apply the cross-validation selector that forms the heart of the Super Learning procedure. We’ll also discuss how to perform variable (or observation) screening within this framework. To wrap up, we’ll apply these procedures using the sl3 R package.
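A sketch of what building a library with screening might look like in sl3, assuming its Stack and Pipeline classes and the wrapper for SuperLearner-style screeners (class and argument names follow contemporaneous sl3 and should be treated as assumptions):

```r
library(sl3)

task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "disp", "hp", "drat", "wt", "qsec"),
  outcome = "mpg"
)

# screening pipeline: a correlation-based screener (borrowed from the
# SuperLearner package) selects covariates before a GLM is fit
screener <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screened_glm <- Pipeline$new(screener, Lrnr_glm$new())

# the algorithmic library: a stack of candidates, screened and unscreened
stack <- Stack$new(screened_glm, Lrnr_glm$new(), Lrnr_mean$new())

# apply the cross-validation selector over the library
sl <- Lrnr_sl$new(learners = stack)
sl_fit <- sl$train(task)
```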
Today, we’ll switch topics and begin a discussion of kernel density estimation and nonparametric methods for estimating densities. We’ll begin by thinking about the motivations for kernel density estimation, including a brief review of the relevant theory. To make these ideas concrete, we’ll walk through how to use the condensier R package to perform density estimation by fitting generalized linear models at various discrete bins over the support of the outcome of interest. This will set the stage for a future discussion of Super Learning of densities.
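While condensier is the focus of the lab, the classical kernel density estimator itself is easy to experiment with using base R’s density(); the sketch below illustrates how the bandwidth controls the bias-variance tradeoff on simulated data:

```r
set.seed(7)
x <- rnorm(500, mean = 2, sd = 1.5)  # simulated observations

# kernel density estimates under three bandwidths: undersmoothed,
# a data-driven default, and oversmoothed
kde_small <- density(x, bw = 0.1)
kde_default <- density(x)  # Silverman-type rule-of-thumb bandwidth
kde_large <- density(x, bw = 2)

plot(kde_default, main = "KDE under three bandwidth choices")
lines(kde_small, lty = 2)
lines(kde_large, lty = 3)
curve(dnorm(x, mean = 2, sd = 1.5), add = TRUE, col = "red")  # truth
```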
Here we revisit our discussion of stochastic treatment regimes and how such rules can be implemented when performing simulations. We first review the notion of a stochastic treatment rule and how such treatment rules relate to dynamic treatment allocation strategies. We will step away from examining specific R packages and instead focus on analyzing the effects of such probabilistic interventions through rigorous simulation studies.
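To give a flavor of the simulation-based approach, here is a self-contained sketch in which data arise from a simple structural model, a stochastic intervention multiplies each unit’s odds of treatment, and the counterfactual mean is approximated by Monte Carlo (all functional forms are illustrative choices, not taken from the lab itself):

```r
set.seed(721)
n <- 1e5  # large sample, so sample means approximate true means well

# data-generating process: baseline covariate, treatment, outcome
W <- runif(n, 0, 2)
g0 <- plogis(-1 + W)  # observed treatment mechanism P(A = 1 | W)
A <- rbinom(n, 1, g0)
Y <- rnorm(n, mean = W + 2 * A * W, sd = 1)

# stochastic intervention: instead of fixing A, draw treatment from a
# shifted mechanism that multiplies each unit's odds of treatment by 2
delta <- 2
g_shift <- (delta * g0) / (delta * g0 + 1 - g0)
A_star <- rbinom(n, 1, g_shift)
Y_star <- rnorm(n, mean = W + 2 * A_star * W, sd = 1)

mean(Y)       # mean outcome under the observed mechanism
mean(Y_star)  # counterfactual mean under the stochastic intervention
```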
Today, we’ll continue our ongoing discussion of Super Learner, this time taking a look at how to perform density estimation within this framework. Having discussed the motivations behind density estimation, we’ll jump right in to considering how to optimally perform density estimation using cross-validation. To make these ideas concrete, we’ll walk through how to use the condensier R package, through its integration with the sl3 R package, to perform density estimation with arbitrary machine learning algorithms.
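A rough sketch of how this composition might look, assuming an sl3 wrapper for condensier (here called Lrnr_condensier, as in contemporaneous versions of sl3) and a density-appropriate metalearner; all class and argument names below are assumptions to check against the package documentation:

```r
library(sl3)

# conditional density task: model the density of mpg given covariates
task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "hp", "wt"),
  outcome = "mpg"
)

# candidate conditional density estimators via the condensier wrapper,
# varying the number of bins (class/argument names are assumptions)
lrn_dens_10 <- Lrnr_condensier$new(
  nbins = 10, bin_method = "equal.len",
  bin_estimator = Lrnr_glm_fast$new()
)
lrn_dens_25 <- Lrnr_condensier$new(
  nbins = 25, bin_method = "equal.len",
  bin_estimator = Lrnr_glm_fast$new()
)

# Super Learner over density estimators, with a metalearner suited to
# combining densities under negative log-likelihood loss
sl_dens <- Lrnr_sl$new(
  learners = list(lrn_dens_10, lrn_dens_25),
  metalearner = Lrnr_solnp_density$new()
)
sl_fit <- sl_dens$train(task)
```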
Today, we’ll (try to) have a self-contained discussion of the Highly Adaptive LASSO (HAL), a nonparametric regression estimator endowed with powerful optimality properties and defined over the class of functions of bounded variation norm. We’ll review some of the key theoretical properties of the estimator discussed in the lecture component of the class and then begin to apply HAL using the hal9001 R package.
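As a first taste, here is a minimal sketch of fitting HAL with hal9001’s fit_hal() on mtcars (argument names follow the package’s basic interface; treat them as assumptions if your version differs):

```r
library(hal9001)

# fit HAL: regress mpg on a few covariates from mtcars
set.seed(340)
x <- as.matrix(mtcars[, c("cyl", "hp", "wt")])
y <- mtcars$mpg
hal_fit <- fit_hal(X = x, Y = y)

# predictions from the fitted HAL model and in-sample error
preds <- predict(hal_fit, new_data = x)
mean((preds - y)^2)
```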