A special topics course in Biostatistics at UC Berkeley, Spring 2018.

Welcome! This is the course website for Targeted Learning in Biomedical Big Data, offered in the Spring 2018 semester at UC Berkeley.

- Faculty Instructor: Mark van der Laan
- Office Hours: Th 1:00-1:45P in Haviland 108

- Graduate Student Instructor: Nima Hejazi
- Office Hours: W 3:00-4:00P in Mulford 230

- Syllabus (PDF)
- Course Control Number: 42472
- Lectures: TuTh 11:00A-12:30P in Mulford 230
- Labs: W 2:00-3:00P in Mulford 230
- Lab materials are available on GitHub here
- Mailing list: please subscribe here

The schedule of video lectures and content is publicly available here.

- The roadmap of statistical learning, examples of data-generating experiments, traditional data analysis
- Structural causal models, causal quantities, identification, interventions, optimal interventions, and identifiability results
- Nonparametric density estimation, Super Learning of a density, Super Learning of conditional multinomial distributions or densities
- Super Learner and an oracle inequality for the general cross-validation selector, Super Learning in prediction, Super Learning of optimal individualized treatment rules
- The Highly Adaptive Lasso (HAL) and nonparametric regression estimators
- Asymptotic linearity, influence curves, and statistical inference based on influence curves
- Pathwise differentiable target parameters, gradients and canonical gradient of infinite-dimensional models
- Definition of MLEs and NP-MLEs, efficient influence curves, theorems of efficiency
- Efficient one-step estimators, online one-step estimators
- Targeted maximum likelihood estimation (TMLE), TMLEs of causal effects of multiple time-point interventions based on longitudinal data

Our first lab for Targeted Learning in Biomedical Big Data is a simple introduction to computationally reproducible research with R, git, and GitHub. We’ll discuss the importance of computational reproducibility and set some standards for how these principles apply to the work we’ll do in this class.

- Version Control with git (Software Carpentry)
- R for Reproducible Scientific Analysis (Software Carpentry)
- Tools for Reproducible Research (Karl Broman)
- Happy Git and GitHub for the useR (Jenny Bryan)

*The above resources provide a mechanism to remedy any deficiencies that may be
present in background knowledge on scientific computing. You may wish to consult
them regularly as the course progresses.*

For our second lab, we’ll continue our discussion of computationally reproducible research and further explore the simple toolbox we’ve already started to use.

Welcome to our third lab. Here we’ll begin discussing a few of the foundational
principles of statistical causal inference, including structural causal models
(SCMs), interventions, and the notion of identifiability. We’ll draw from
several references, including our two Targeted Learning texts, as well as
Judea Pearl’s *Causality: Models, Reasoning, and Inference*, a canonical
reference.
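
To make the ideas of a structural causal model and an intervention concrete, here is a minimal sketch in plain Python. The toy model (a binary confounder `W`, binary treatment `A`, continuous outcome `Y`), its coefficients, and the function names are all assumptions made for illustration; they are not taken from the course materials. An intervention replaces the structural equation for `A` with a fixed value, and the causal effect is identified from observational data by adjusting for `W`.

```python
import random


def simulate_scm(n, intervene_a=None, seed=0):
    """Draw n observations from a toy SCM with W -> A, W -> Y, and A -> Y.

    If intervene_a is given, the structural equation for A is replaced by
    that constant (a static intervention); all other equations are unchanged.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0    # baseline covariate
        u_a = rng.random()                    # exogenous error for A
        u_y = rng.gauss(0.0, 1.0)             # exogenous error for Y
        if intervene_a is None:
            a = 1 if u_a < 0.2 + 0.6 * w else 0   # observed treatment mechanism
        else:
            a = intervene_a                       # intervention on A
        y = 1.0 * a + 2.0 * w + u_y           # true effect of A on Y is 1
        data.append((w, a, y))
    return data


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def stratum_mean(data, a, w):
    """Average outcome among observations with A = a and W = w."""
    return mean(y for wi, ai, y in data if ai == a and wi == w)


n = 100_000
obs = simulate_scm(n)

# Causal quantity: difference in mean outcomes under the two interventions.
ate = (mean(y for _, _, y in simulate_scm(n, intervene_a=1))
       - mean(y for _, _, y in simulate_scm(n, intervene_a=0)))

# Naive contrast from observational data, confounded by W.
naive = (mean(y for _, a, y in obs if a == 1)
         - mean(y for _, a, y in obs if a == 0))

# Identification by adjusting for W (the g-computation formula).
p_w1 = mean(w for w, _, _ in obs)
gcomp = sum(
    (stratum_mean(obs, 1, w) - stratum_mean(obs, 0, w)) * p_w
    for w, p_w in ((1, p_w1), (0, 1.0 - p_w1))
)

print(f"intervention-based effect: {ate:.3f}")
print(f"naive contrast:            {naive:.3f}")
print(f"adjusted (g-computation):  {gcomp:.3f}")
```

In this toy model the naive contrast overstates the effect because `W` raises both the chance of treatment and the outcome, while the adjusted contrast recovers a value close to the effect defined by the interventions.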

For our fourth lab, we’re switching gears from causal inference to statistical
estimation. We will go through an introduction to cross-validation, a technique
that ensures we obtain honest estimates of the error of a given statistical
estimation technique by data splitting. We will also discuss the very
general implementation of cross-validation in the recently developed `origami`
R package.
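
As a rough illustration of the data-splitting idea, the sketch below implements V-fold cross-validation in plain Python. It does not use or mirror the `origami` API; the helper names (`make_folds`, `fit_linear`, `cv_risk`) and the simulated data are assumptions made for this example.

```python
import random


def make_folds(n, v, seed=0):
    """Randomly partition the indices 0..n-1 into v validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::v] for k in range(v)]


def fit_linear(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns a prediction function."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    return lambda x: b0 + b1 * x


def cv_risk(xs, ys, fit, v=10, seed=0):
    """Cross-validated mean squared error of the estimator `fit`."""
    n = len(xs)
    fold_risks = []
    for valid in make_folds(n, v, seed):
        held_out = set(valid)
        train = [i for i in range(n) if i not in held_out]
        # Fit on the training sample only, then evaluate on the held-out fold.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_risks.append(
            sum((ys[i] - predict(xs[i])) ** 2 for i in valid) / len(valid)
        )
    return sum(fold_risks) / v


# Simulated data (an assumption for illustration): y = 2x + noise with SD 0.5.
rng = random.Random(1)
xs = [rng.random() for _ in range(500)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

risk = cv_risk(xs, ys, fit_linear, v=10)
print(f"10-fold cross-validated MSE: {risk:.3f}")
```

Because each observation is predicted by a fit that never saw it, the averaged error is an honest estimate of out-of-sample risk; here it should land near the noise variance of 0.25.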

Now that we’ve set the stage, for our fifth lab we’ll begin discussing flexible,
data-adaptive estimation (i.e., machine learning), including the asymptotically
optimal ensemble learning algorithm, Super Learner. We’ll dive into how to apply
the Super Learning approach using the new `sl3` R package.

In our sixth lab meeting, we’ll continue discussing Super Learning, an optimal
approach to ensemble machine learning. Here, we’ll dive into how to construct
algorithmic libraries over which to apply the cross-validation selector that
forms the heart of the Super Learning procedure. We’ll also discuss how to
perform variable (or observation) screening within this framework. To wrap up,
we’ll apply these procedures using the `sl3` R package.
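
The cross-validation selector can be sketched in a few lines of plain Python. This is only the discrete Super Learner (pick the library member with the smallest cross-validated risk), not the full ensemble with a meta-learner, and it does not use the `sl3` API. The three learners, their names, and the simulated data are assumptions chosen for illustration.

```python
import math
import random


def fit_mean(xs, ys):
    """Learner 1: ignore x and predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Learner 2: simple least squares line."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    return lambda x: b0 + b1 * x


def make_knn(k):
    """Learner 3: average the outcomes of the k nearest training points."""
    def fit(xs, ys):
        pairs = list(zip(xs, ys))

        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)

        return predict
    return fit


def cv_risks(xs, ys, library, v=5, seed=0):
    """V-fold cross-validated MSE for every learner in the library."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::v] for j in range(v)]
    risks = {name: 0.0 for name in library}
    for valid in folds:
        held_out = set(valid)
        train = [i for i in idx if i not in held_out]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for name, fit in library.items():
            predict = fit(tx, ty)
            mse = sum((ys[i] - predict(xs[i])) ** 2 for i in valid) / len(valid)
            risks[name] += mse / v
    return risks


# Simulated data (an assumption): a nonlinear regression function plus noise.
rng = random.Random(2)
xs = [rng.random() for _ in range(300)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

library = {"mean": fit_mean, "linear": fit_linear, "knn_10": make_knn(10)}
risks = cv_risks(xs, ys, library)
selected = min(risks, key=risks.get)

for name, value in risks.items():
    print(f"{name:>7}: cross-validated MSE = {value:.3f}")
print("selected learner:", selected)
```

The library is just a dictionary of fitting functions, so adding a candidate algorithm (or a screened version of one) means adding one more entry; the selector itself does not change.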

Today, we’ll switch topics and begin a discussion on kernel density estimation
and nonparametric methods for estimating densities. We’ll begin by thinking
about the motivations for kernel density estimation, including a brief review of
the relevant theory. To make these ideas concrete, we’ll walk through how to use
the `condensier` R package to perform
density estimation by fitting generalized linear models at various discrete bins
over the support of the outcome of interest. This will set the stage for a
future discussion on Super Learning of densities.
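
For reference, a basic Gaussian kernel density estimator fits in a few lines of plain Python. This sketch illustrates kernel smoothing only; it is not the bin-based approach used by `condensier`, and the function name and the normal-reference bandwidth default are choices made for this example.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth=None):
    """Return a function estimating the density of `sample` with a Gaussian kernel.

    If no bandwidth is supplied, use the normal-reference rule of thumb
    h = 1.06 * sd * n^(-1/5).
    """
    n = len(sample)
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)

    def density(x):
        # Average of Gaussian kernels centered at each observation.
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


# Simulated data (an assumption): draws from a standard normal distribution.
rng = random.Random(3)
sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
density = gaussian_kde(sample)

# Riemann sum over a wide grid, as a check that the estimate integrates to ~1.
step = 0.05
grid = [-6.0 + step * i for i in range(241)]
total_mass = sum(density(x) for x in grid) * step

print(f"estimated density at 0: {density(0.0):.3f}")
print(f"approximate total mass: {total_mass:.3f}")
```

The estimate at zero should be close to, but slightly below, the true value of about 0.399, since kernel smoothing flattens the peak of the density.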

Here we revisit a discussion on stochastic treatment regimes and how such rules can be implemented when performing simulations. We first review the notion of a stochastic treatment rule and how such treatment rules relate to dynamic treatment allocation strategies. We will step away from examining specific R packages and instead place our focus on analyzing the effects of such probabilistic interventions using rigorous simulation studies.
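
A small simulation can make the distinction concrete. In the plain-Python sketch below, a treatment rule maps a covariate `W` to a probability of treatment: static and dynamic rules return only 0 or 1, while a stochastic rule returns values in between. The data-generating mechanism, the particular rules, and the function names are all assumptions made for illustration.

```python
import random


def draw_outcome(w, a, rng):
    """Hypothetical outcome mechanism: treatment helps more when w is large."""
    return w + a * (1.0 + w) + rng.gauss(0.0, 1.0)


def mean_outcome_under_rule(rule, n=50_000, seed=0):
    """Monte Carlo estimate of the mean outcome when treatment follows `rule`.

    `rule(w)` returns the probability of treatment given the covariate w.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = rng.random()                          # W ~ Uniform(0, 1)
        a = 1 if rng.random() < rule(w) else 0    # draw treatment from the rule
        total += draw_outcome(w, a, rng)
    return total / n


rules = {
    "never treat (static)": lambda w: 0.0,
    "always treat (static)": lambda w: 1.0,
    "treat if w > 0.5 (dynamic)": lambda w: 1.0 if w > 0.5 else 0.0,
    "treat w.p. 0.2 + 0.6w (stochastic)": lambda w: 0.2 + 0.6 * w,
}

results = {name: mean_outcome_under_rule(rule) for name, rule in rules.items()}
for name, value in results.items():
    print(f"{name:<36} mean outcome = {value:.3f}")
```

Writing every regime as a probability of treatment shows that deterministic rules are the special case where that probability is always 0 or 1, so one simulation routine covers static, dynamic, and stochastic interventions alike.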