A special topics in Biostatistics course at UC Berkeley in Spring 2018.
Welcome! This is the course website for Targeted Learning in Biomedical Big Data, offered in the Spring 2018 semester at UC Berkeley.
The schedule of video lectures and content is publicly available here.
Our first lab for Targeted Learning in Biomedical Big Data is a simple introduction to computationally reproducible research with R, git, and GitHub. We’ll discuss the importance of computational reproducibility and set some standards for how these principles apply to the work we’ll do in this class.
The resources above can help remedy any gaps in your background knowledge of scientific computing. You may wish to consult them regularly as the course progresses.
For our second lab, we’ll continue our discussion of computationally reproducible research, digging further into the simple toolbox we began assembling in the first lab.
Welcome to our third lab. Here we’ll begin discussing a few of the foundational principles of statistical causal inference, including structural causal models (SCMs), interventions, and the notion of identifiability. We’ll draw from several references, including our two Targeted Learning texts, as well as Judea Pearl’s Causality: Models, Reasoning, and Inference, a canonical reference.
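To fix ideas, a prototypical SCM of the sort used in the Targeted Learning texts posits that baseline covariates $W$, a treatment $A$, and an outcome $Y$ are each generated by a deterministic function of their parents and exogenous errors $U$:

$$W = f_W(U_W), \qquad A = f_A(W, U_A), \qquad Y = f_Y(W, A, U_Y).$$

An intervention setting $A = a$ replaces $f_A$ with the constant $a$, defining the counterfactual outcome $Y_a = f_Y(W, a, U_Y)$. Identifiability then asks when a causal quantity such as $\mathbb{E}[Y_a]$ can be expressed as a parameter of the observed data distribution, for example via the g-computation formula $\mathbb{E}_W[\mathbb{E}(Y \mid A = a, W)]$ under randomization and positivity assumptions.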
For our fourth lab, we’re switching gears from causal inference to statistical estimation. We will go through an introduction to cross-validation, a technique that ensures we obtain honest estimates of the error of a given statistical estimation technique through data splitting. We will also discuss the very general implementation of cross-validation in the recently developed origami R package.
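As a preview of what the lab covers, here’s a minimal sketch of V-fold cross-validation with origami, in the style of the package’s introductory vignette (treat the exact function signatures as assumptions that may vary across origami versions):

```r
library(origami)

# fold-specific routine: fit on the training split, predict on the
# held-out validation split, and return the squared errors
cv_lm <- function(fold, data) {
  train_data <- training(data)    # training split for this fold
  valid_data <- validation(data)  # held-out validation split
  mod <- lm(mpg ~ ., data = train_data)
  preds <- predict(mod, newdata = valid_data)
  list(SE = (preds - valid_data$mpg)^2)
}

set.seed(57)
folds <- make_folds(mtcars, V = 5)  # 5-fold cross-validation scheme
results <- cross_validate(cv_lm, folds, data = mtcars)
mean(results$SE)  # cross-validated mean squared error
```

The fold-specific function sees only its own training and validation splits, which is exactly what makes the resulting error estimate honest.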
Now that we’ve set the stage, for our fifth lab we’ll begin discussing flexible, data-adaptive estimation (i.e., machine learning), including the asymptotically optimal ensemble learning algorithm, Super Learner. We’ll dive into how to apply the Super Learning approach using the new sl3 R package.
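By way of preview, here is a minimal sketch of the sl3 workflow, assuming the package’s core API (an sl3_Task object, Lrnr_* learner classes, and the Lrnr_sl ensemble wrapper), with the built-in mtcars data standing in for real data:

```r
library(sl3)

# define the prediction task: regress mpg on a few covariates
task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "disp", "hp", "wt"),
  outcome = "mpg"
)

# a small library of candidate learners
lrnr_glm <- Lrnr_glm$new()
lrnr_mean <- Lrnr_mean$new()

# Super Learner: cross-validates the candidates and combines them via
# the package's default metalearner
sl <- Lrnr_sl$new(learners = list(lrnr_glm, lrnr_mean))
sl_fit <- sl$train(task)
head(sl_fit$predict())  # ensemble predictions on the task's data
```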
In our sixth lab meeting, we’ll continue discussing Super Learning, an optimal approach to ensemble machine learning. Here, we’ll dive into how to construct algorithmic libraries over which to apply the cross-validation selector that forms the heart of the Super Learning procedure. We’ll also discuss how to perform variable (or observation) screening within this framework. To wrap up, we’ll apply these procedures using the sl3 R package.
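A sketch of what building a library with screening might look like in sl3, assuming its Stack and Pipeline classes and the wrapper for SuperLearner-style screeners (class and argument names follow contemporaneous sl3 and should be treated as assumptions):

```r
library(sl3)

task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "disp", "hp", "drat", "wt", "qsec"),
  outcome = "mpg"
)

# screening pipeline: a correlation-based screener (borrowed from the
# SuperLearner package) selects covariates before a GLM is fit
screener <- Lrnr_pkg_SuperLearner_screener$new("screen.corP")
screened_glm <- Pipeline$new(screener, Lrnr_glm$new())

# the algorithmic library: a stack of candidates, screened and unscreened
stack <- Stack$new(screened_glm, Lrnr_glm$new(), Lrnr_mean$new())

# apply the cross-validation selector over the library
sl <- Lrnr_sl$new(learners = stack)
sl_fit <- sl$train(task)
```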
Today, we’ll switch topics and begin a discussion of kernel density estimation and nonparametric methods for estimating densities. We’ll begin by thinking about the motivations for kernel density estimation, including a brief review of the relevant theory. To make these ideas concrete, we’ll walk through how to use the condensier R package to perform density estimation by fitting generalized linear models at various discrete bins over the support of the outcome of interest. This will set the stage for a future discussion of Super Learning of densities.
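While condensier is the focus of the lab, the classical kernel density estimator itself is easy to experiment with using base R’s density(); the sketch below illustrates how the bandwidth controls the bias-variance tradeoff on simulated data:

```r
set.seed(7)
x <- rnorm(500, mean = 2, sd = 1.5)  # simulated observations

# kernel density estimates under three bandwidths: undersmoothed,
# a data-driven default, and oversmoothed
kde_small <- density(x, bw = 0.1)
kde_default <- density(x)  # Silverman-type rule-of-thumb bandwidth
kde_large <- density(x, bw = 2)

plot(kde_default, main = "KDE under three bandwidth choices")
lines(kde_small, lty = 2)
lines(kde_large, lty = 3)
curve(dnorm(x, mean = 2, sd = 1.5), add = TRUE, col = "red")  # truth
```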
Here we revisit our discussion of stochastic treatment regimes and how such rules can be implemented when performing simulations. We first review the notion of a stochastic treatment rule and how such treatment rules relate to dynamic treatment allocation strategies. We will step away from examining specific R packages and instead focus on analyzing the effects of such probabilistic interventions through rigorous simulation studies.
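To give a flavor of the simulation-based approach, here is a self-contained sketch in which data arise from a simple structural model, a stochastic intervention multiplies each unit’s odds of treatment, and the counterfactual mean is approximated by Monte Carlo (all functional forms are illustrative choices, not taken from the lab itself):

```r
set.seed(721)
n <- 1e5  # large sample, so sample means approximate true means well

# data-generating process: baseline covariate, treatment, outcome
W <- runif(n, 0, 2)
g0 <- plogis(-1 + W)  # observed treatment mechanism P(A = 1 | W)
A <- rbinom(n, 1, g0)
Y <- rnorm(n, mean = W + 2 * A * W, sd = 1)

# stochastic intervention: instead of fixing A, draw treatment from a
# shifted mechanism that multiplies each unit's odds of treatment by 2
delta <- 2
g_shift <- (delta * g0) / (delta * g0 + 1 - g0)
A_star <- rbinom(n, 1, g_shift)
Y_star <- rnorm(n, mean = W + 2 * A_star * W, sd = 1)

mean(Y)       # mean outcome under the observed mechanism
mean(Y_star)  # counterfactual mean under the stochastic intervention
```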
Today, we’ll continue our ongoing discussion of Super Learner, this time taking a look at how to perform density estimation within this framework. Having discussed the motivations behind density estimation, we’ll jump right in to considering how to optimally perform density estimation using cross-validation. To make these ideas concrete, we’ll walk through how to use the condensier R package, through its integration with the sl3 R package, to perform density estimation with arbitrary machine learning algorithms.
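A rough sketch of how this composition might look, assuming an sl3 wrapper for condensier (here called Lrnr_condensier, as in contemporaneous versions of sl3) and a density-appropriate metalearner; all class and argument names below are assumptions to check against the package documentation:

```r
library(sl3)

# conditional density task: model the density of mpg given covariates
task <- sl3_Task$new(
  data = mtcars,
  covariates = c("cyl", "hp", "wt"),
  outcome = "mpg"
)

# candidate conditional density estimators via the condensier wrapper,
# varying the number of bins (class/argument names are assumptions)
lrn_dens_10 <- Lrnr_condensier$new(
  nbins = 10, bin_method = "equal.len",
  bin_estimator = Lrnr_glm_fast$new()
)
lrn_dens_25 <- Lrnr_condensier$new(
  nbins = 25, bin_method = "equal.len",
  bin_estimator = Lrnr_glm_fast$new()
)

# Super Learner over density estimators, with a metalearner suited to
# combining densities under negative log-likelihood loss
sl_dens <- Lrnr_sl$new(
  learners = list(lrn_dens_10, lrn_dens_25),
  metalearner = Lrnr_solnp_density$new()
)
sl_fit <- sl_dens$train(task)
```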
Today, we’ll (try to) have a self-contained discussion of the Highly Adaptive LASSO (HAL), a nonparametric regression estimator endowed with powerful optimality properties and defined over the class of functions of bounded variation norm. We’ll review some of the key theoretical properties of the estimator discussed in the lecture component of the class and then begin to apply HAL using the hal9001 R package.
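As a first taste, here is a minimal sketch of fitting HAL with hal9001’s fit_hal() on mtcars (argument names follow the package’s basic interface; treat them as assumptions if your version differs):

```r
library(hal9001)

# fit HAL: regress mpg on a few covariates from mtcars
set.seed(340)
x <- as.matrix(mtcars[, c("cyl", "hp", "wt")])
y <- mtcars$mpg
hal_fit <- fit_hal(X = x, Y = y)

# predictions from the fitted HAL model and in-sample error
preds <- predict(hal_fit, new_data = x)
mean((preds - y)^2)
```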