PH 290-011: TLBBD

Targeted Learning in Biomedical Big Data

A special topics course in Biostatistics at UC Berkeley, Spring 2018.

Welcome! This is the course website for Targeted Learning in Biomedical Big Data, offered in the Spring 2018 semester at UC Berkeley.

Course Information


The schedule of video lectures and content is publicly available here.

Sketch of Topics:

Lab 1: Reproducible Research with R, git, GitHub I

Our first lab for Targeted Learning in Biomedical Big Data is a simple introduction to computationally reproducible research with R, git, and GitHub. We’ll discuss the importance of computational reproducibility and set some standards for how these principles apply to the work we’ll do in this class.


The above resources provide a mechanism to remedy any deficiencies that may be present in background knowledge on scientific computing. You may wish to consult them regularly as the course progresses.

Lab 2: Reproducible Research with R, git, GitHub II

For our second lab, we’ll continue our discussion of computationally reproducible research and further explore the simple toolbox we’ve already started to use.

Lab Materials:

Lab 3: Structural Causal Models, Interventions, and Identifiability

Welcome to our third lab. Here we’ll begin discussing a few of the foundational principles of statistical causal inference, including structural causal models (SCMs), interventions, and the notion of identifiability. We’ll draw from several references, including our two Targeted Learning texts, as well as Judea Pearl’s Causality: Models, Reasoning, and Inference, a canonical reference.
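To make the difference between conditioning and intervening concrete, here is a minimal Python sketch. It is not taken from the lab materials: the structural equations, coefficients, and function names are illustrative assumptions. It simulates a toy SCM with a binary confounder W, compares the effect obtained by intervening on A with the naive associational contrast, and then recovers the effect from observational data alone with the g-computation formula.

```python
import random

def simulate_scm(n, set_a=None, seed=0):
    """Draw n units from a toy SCM with W -> A, W -> Y and A -> Y.

    Passing set_a replaces the structural equation for A with the
    constant set_a, i.e. the intervention do(A = set_a).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Exogenous errors are always drawn, so the same seed yields the
        # same units with and without an intervention.
        u_w, u_a, u_y = rng.random(), rng.random(), rng.gauss(0.0, 1.0)
        w = 1 if u_w < 0.5 else 0
        if set_a is None:
            a = 1 if u_a < (0.8 if w == 1 else 0.2) else 0
        else:
            a = set_a
        y = 1.0 * a + 2.0 * w + u_y  # hypothetical outcome equation
        rows.append((w, a, y))
    return rows

def mean(values):
    values = list(values)
    return sum(values) / len(values)

n = 20000
observed = simulate_scm(n)

# Causal effect read off the intervened models: E[Y | do(A=1)] - E[Y | do(A=0)].
ate_do = (mean(y for _, _, y in simulate_scm(n, set_a=1))
          - mean(y for _, _, y in simulate_scm(n, set_a=0)))

# Naive associational contrast E[Y | A=1] - E[Y | A=0], confounded by W.
naive = (mean(y for _, a, y in observed if a == 1)
         - mean(y for _, a, y in observed if a == 0))

def cond_mean(w, a):
    """Empirical E[Y | W=w, A=a] in the observational data."""
    return mean(y for wi, ai, y in observed if wi == w and ai == a)

def prob_w(w):
    """Empirical P(W=w) in the observational data."""
    return mean(1 if wi == w else 0 for wi, _, _ in observed)

# G-computation formula: average the W-specific contrasts over P(W).
ate_gcomp = sum((cond_mean(w, 1) - cond_mean(w, 0)) * prob_w(w) for w in (0, 1))
```

Under these assumed equations the true effect is 1, while the naive contrast targets 1 + 2(0.8 - 0.2) = 2.2. The g-computation estimate uses only the observational draws, which is what identifiability buys us here because W blocks the only backdoor path.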

Lab Materials:

Lab 4: Cross-Validation for Error Assessment

For our fourth lab, we’re switching gears from causal inference to statistical estimation. We will go through an introduction to cross-validation, a technique that provides honest estimates of the error of a given statistical estimation technique through data splitting. We will also discuss the very general implementation of cross-validation in the recently developed origami R package.
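As a language-agnostic illustration of the idea (this is not the origami API), the following Python sketch implements V-fold cross-validation by hand. The helper names, the simulated data, and the choice of a simple linear regression as the estimator are all assumptions made for the example.

```python
import random

def make_folds(n, v, seed=0):
    """Randomly partition indices 0..n-1 into v validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::v] for k in range(v)]

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    return lambda x: b0 + b1 * x

def cv_mse(xs, ys, fit, v=5, seed=0):
    """V-fold cross-validated mean squared error of the estimator `fit`."""
    fold_risks = []
    for fold in make_folds(len(xs), v, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training split only, then score on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_risks.append(
            sum((ys[i] - predict(xs[i])) ** 2 for i in fold) / len(fold)
        )
    return sum(fold_risks) / len(fold_risks)

# Simulated data: linear signal plus unit-variance noise.
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
risk = cv_mse(xs, ys, fit_line)
```

Because every observation is predicted by a fit that never saw it, `risk` estimates out-of-sample error; here it should land near the noise variance of 1.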

Lab Materials:

Lab 5: Introduction to the Super Learner Algorithm

Now that we’ve set the stage, for our fifth lab we’ll begin discussing flexible, data-adaptive estimation (i.e., machine learning), including the asymptotically optimal ensemble learning algorithm, Super Learner. We’ll dive into how to apply the Super Learning approach using the new sl3 R package.
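The core mechanism can be sketched in a few lines of Python. This is not the sl3 interface; the learner names, the simulated data, and the three-learner library are assumptions for illustration. The sketch shows only the discrete Super Learner, which picks the single candidate with the smallest cross-validated risk; the full procedure typically goes further and fits a weighted combination of the candidates.

```python
import random

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + b1 * (x - mx)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def cv_risks(xs, ys, library, v=5, seed=0):
    """Cross-validated mean squared error of every learner in the library."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    risks = {name: 0.0 for name in library}
    for k in range(v):
        fold = idx[k::v]
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        for name, fit in library.items():
            predict = fit(tx, ty)
            risks[name] += sum((ys[i] - predict(xs[i])) ** 2 for i in fold) / n
    return risks

# Simulated data with a quadratic signal that the linear learner cannot capture.
rng = random.Random(2)
xs = [rng.uniform(-2, 2) for _ in range(200)]
ys = [x ** 2 + rng.gauss(0, 0.3) for x in xs]

library = {"mean": fit_mean, "line": fit_line, "knn": fit_knn}
risks = cv_risks(xs, ys, library)
selected = min(risks, key=risks.get)        # cross-validation selector
super_learner = library[selected](xs, ys)   # refit the winner on all data
```

Every learner is scored on the same folds, so the comparison is fair, and the winner is refit on the full data set before being used for prediction.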

Lab Materials

Lab 6: Libraries and Screening Algorithms for Super Learning

In our sixth lab meeting, we’ll continue discussing Super Learning, an optimal approach to ensemble machine learning. Here, we’ll dive into how to construct algorithmic libraries over which to apply the cross-validation selector that forms the heart of the Super Learning procedure. We’ll also discuss how to perform variable (or observation) screening within this framework. To wrap up, we’ll apply these procedures using the sl3 R package.
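The sketch below, in plain Python rather than sl3, illustrates one way to think about a library entry that includes a screening step. All names and the simulated data are hypothetical. The key design point is that the screener lives inside the learner's fit function, so when the library is cross-validated the screening is repeated on each training split rather than done once on the full data.

```python
import random

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def screen_top_k(columns, ys, k):
    """Indices of the k covariates most correlated with the outcome."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: -abs(correlation(columns[j], ys)))
    return sorted(ranked[:k])

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + b1 * (x - mx)

def fit_mean(columns, ys):
    """Baseline learner that ignores the covariates."""
    m = sum(ys) / len(ys)
    return lambda row: m

def screen_then_fit(columns, ys):
    """Pipeline learner: screen down to one covariate, then regress on it."""
    (j,) = screen_top_k(columns, ys, k=1)
    line = fit_line(columns[j], ys)
    return lambda row: line(row[j])

# A two-entry library; each entry maps training data to a predictor.
library = {"mean": fit_mean, "screen_top1 + line": screen_then_fit}

# Simulated data: five covariates, only the third one carries signal.
rng = random.Random(4)
columns = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(5)]
ys = [3.0 * columns[2][i] + rng.gauss(0, 1) for i in range(200)]

kept = screen_top_k(columns, ys, k=1)
predict = library["screen_top1 + line"](columns, ys)
```

Here `kept` should contain only index 2, and `predict` takes a full covariate row but uses just the screened column.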

Lab Materials

Lab 7: Nonparametric Density Estimation

Today, we’ll switch topics and begin a discussion on kernel density estimation and nonparametric methods for estimating densities. We’ll begin by thinking about the motivations for kernel density estimation, including a brief review of the relevant theory. To make these ideas concrete, we’ll walk through how to use the condensier R package to perform density estimation by fitting generalized linear models at various discrete bins over the support of the outcome of interest. This will set the stage for a future discussion on Super Learning of densities.
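For reference, a classical Gaussian kernel density estimator can be written in a few lines of Python. This sketch shows plain kernel density estimation only, not the binned, regression-based approach that condensier takes; the function names and the normal-reference bandwidth rule are choices made for the example.

```python
import math
import random

def gaussian_kde(sample, bandwidth):
    """Return x -> (1 / (n h)) * sum_i K((x - x_i) / h) with a Gaussian kernel K."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )
    return density

def rule_of_thumb_bandwidth(sample):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5)."""
    n = len(sample)
    m = sum(sample) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)

# Simulated standard normal sample.
rng = random.Random(5)
sample = [rng.gauss(0, 1) for _ in range(1000)]
density_hat = gaussian_kde(sample, rule_of_thumb_bandwidth(sample))
```

The bandwidth controls the bias-variance trade-off: smaller values track the sample more closely but produce a wigglier estimate, which is why a data-driven choice of bandwidth matters.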

Lab Materials

Lab 8: Computational Causal Inference: Stochastic Treatment Regimes

Here we revisit a discussion on stochastic treatment regimes and how such rules can be implemented when performing simulations. We first review the notion of a stochastic treatment rule and how such treatment rules relate to dynamic treatment allocation strategies. We will step away from examining specific R packages and instead place our focus on analyzing the effects of such probabilistic interventions using rigorous simulation studies.
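A simulation of this kind can be sketched as follows in Python. The data-generating mechanism, the three rules, and all names are assumptions made for illustration, not the lab's actual simulation. A rule is represented by the probability of treatment given the covariate W, so a dynamic (deterministic) rule is simply the special case where that probability is always 0 or 1.

```python
import random

def mean_outcome_under_rule(treat_prob, n=50000, seed=0):
    """Monte Carlo estimate of E[Y] when A ~ Bernoulli(treat_prob(W))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = rng.uniform(-1.0, 1.0)
        a = 1 if rng.random() < treat_prob(w) else 0
        # Hypothetical outcome model: treatment helps when W > 0, harms when W < 0.
        y = 1.0 + 2.0 * a * w + rng.gauss(0.0, 1.0)
        total += y
    return total / n

def dynamic_rule(w):
    """Deterministic rule: treat exactly when W > 0."""
    return 1.0 if w > 0 else 0.0

def coin_flip_rule(w):
    """Stochastic rule that ignores W."""
    return 0.5

def tilted_rule(w):
    """Stochastic rule that treats more often as W increases."""
    return (1.0 + w) / 2.0

value_dynamic = mean_outcome_under_rule(dynamic_rule)
value_coin = mean_outcome_under_rule(coin_flip_rule)
value_tilted = mean_outcome_under_rule(tilted_rule)
```

Under this assumed mechanism the true values are 1.5 for the dynamic rule, 1 for the coin flip, and 4/3 for the tilted rule, so the simulation can be checked against closed-form answers.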

Lab 9: Super Learning of Densities

Today, we’ll continue our ongoing discussion of Super Learner, this time taking a look at how to perform density estimation within this framework. Having discussed the motivations behind density estimation, we’ll jump right into how to optimally perform density estimation using cross-validation. To make these ideas concrete, we’ll walk through how to use the condensier R package to perform density estimation with arbitrary machine learning algorithms by using the sl3 R package.
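The idea of choosing among candidate density estimators by cross-validation can be sketched in Python as below. This is not how condensier or sl3 are invoked; the candidates (Gaussian kernel estimators with three bandwidths), the helper names, and the simulated data are assumptions. The loss is the negative log-likelihood, the usual choice for densities.

```python
import math
import random

def gaussian_kde(sample, bandwidth):
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )
    return density

def cv_neg_log_lik(sample, bandwidth, v=5, seed=0):
    """V-fold cross-validated negative log-likelihood of a KDE."""
    n = len(sample)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for k in range(v):
        fold = idx[k::v]
        held_out = set(fold)
        train = [sample[i] for i in range(n) if i not in held_out]
        density = gaussian_kde(train, bandwidth)
        # Floor the density so that log(0) cannot occur for tiny bandwidths.
        total += sum(-math.log(max(density(sample[i]), 1e-300)) for i in fold)
    return total / n

# Simulated standard normal sample.
rng = random.Random(3)
sample = [rng.gauss(0, 1) for _ in range(300)]

candidates = [0.01, 0.3, 3.0]  # undersmoothed, moderate, oversmoothed
risks = {h: cv_neg_log_lik(sample, h) for h in candidates}
selected = min(risks, key=risks.get)
```

The undersmoothed candidate is punished because it assigns almost no mass to held-out points, and the oversmoothed one because it is far too flat, so the selector should settle on the moderate bandwidth.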

Lab Materials

Lab 10: The Highly Adaptive LASSO

Today, we’ll (try to) have a self-contained discussion of the Highly Adaptive LASSO (HAL), a nonparametric regression estimator endowed with powerful optimality properties and defined over the class of functions with bounded variation norm. We’ll review some of the key theoretical properties of the estimator discussed in the lecture component of the class and then begin to apply HAL by using the hal9001 R package.
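To give a feel for the construction, here is a one-dimensional Python sketch. It is a simplified stand-in, not the hal9001 implementation: it builds indicator basis functions 1{x >= knot} with a knot at every observed value and fits an L1-penalized least squares regression by coordinate descent. The function names, the fixed penalty `lam`, and the simulated data are assumptions; in practice the penalty is typically chosen by cross-validation.

```python
import random

def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def fit_hal_1d(xs, ys, lam, n_sweeps=100):
    """Lasso on indicator bases; minimizes (1/2n)||y - b0 - X b||^2 + lam * ||b||_1."""
    n = len(xs)
    knots = sorted(set(xs))
    cols = [[1.0 if x >= k else 0.0 for x in xs] for k in knots]
    col_ms = [sum(col) / n for col in cols]  # mean of squares of a 0/1 column
    beta = [0.0] * len(knots)
    intercept = 0.0
    resid = list(ys)
    for _ in range(n_sweeps):
        # The unpenalized intercept absorbs the mean of the residuals.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
        for j, col in enumerate(cols):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / col_ms[j]
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return intercept, knots, beta

def predict_hal_1d(model, x):
    intercept, knots, beta = model
    return intercept + sum(b for k, b in zip(knots, beta) if x >= k)

def mse(model, xs, ys):
    return sum((y - predict_hal_1d(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Simulated data: a step function at 0.5 plus small noise.
rng = random.Random(6)
xs = [rng.uniform(0, 1) for _ in range(60)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.1) for x in xs]

model = fit_hal_1d(xs, ys, lam=0.01)
total_variation = sum(abs(b) for b in model[2])
```

In this one-dimensional sketch the fit is a step function whose total variation equals the sum of the absolute coefficients, so the L1 penalty directly limits the variation norm of the fit. In higher dimensions the basis also includes products of indicators across covariates.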

Lab Materials