Introduction to Gaussian Processes for Time Dependent Data

VectorByte Methods Training

The VectorByte Team (Parul Patil, Virginia Tech)

Gaussian Process: Introduction

  • A Gaussian Process (GP) model is a non-parametric and flexible regression model.

  • It started being used in the field of spatial statistics, where it is called kriging.

  • It is also widely used in machine learning, where its fast predictions and good uncertainty quantification make it a common choice for surrogate modeling. (Gramacy 2020)

Uses and Benefits

  • Surrogate Models: When field experiments are infeasible and computer experiments take a long time to run, we can approximate the computer experiment with a surrogate model.

  • It can be used to calibrate computer models with respect to field observations or computer simulations.

  • The ability of this model to provide good UQ makes it very useful in fields such as ecology, where data are often sparse and noisy and good uncertainty measures are therefore paramount.

What is a GP?

  • Here, we assume that the data come from a Multivariate Normal Distribution (MVN).

  • We then make predictions conditional on the data.

  • We are essentially taking a “fancy” average of the data to make predictions.

  • To understand how it works, let’s first visualize the concept and then look into the math.

Visualizing a GP

How does a GP work

  • Our “fancy” average was just using data around the new location.

  • Points that are closer to each other have responses that are more strongly correlated.

  • Now we know we are averaging the data “nearby”… but how do we define “nearby”?

  • This suggests we are using distance in some way. Where does it come in? We need some math.

Mathematically

  • Any normal distribution can be described by a mean vector \mu and a covariance matrix \Sigma.

  • Mathematically, we can write it as,

Y_{n \times 1} \sim \mathcal{N} \left( \ \mu(X)_{n \times 1} \ , \ \Sigma(X)_{n \times n} \ \right) Here, Y is the response of interest and n is the number of observations.

  • Our goal is to find the distribution of Y_p \ \vert \ Y, X, i.e. the response at new locations given the observed data.

Distance

  • Recall that in Linear Regression, you have \Sigma = \sigma^2 \mathbb{I}.
  • For a GP, the covariance matrix ( \Sigma ) is defined by a kernel.

  • Consider,

\Sigma_n = \tau^2 C_n where C_n = \exp \left( - \vert \vert x - x' \vert \vert^2 \right), and x and x' are input locations.
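To make this concrete, here is a minimal sketch in base R (the inputs are illustrative) showing how the kernel turns distances into correlations:

x <- c(0, 0.1, 1, 3)            # four 1-D input locations
D <- as.matrix(dist(x))^2       # squared Euclidean distances ||x - x'||^2
C <- exp(-D)                    # the kernel C_n
round(C, 3)                     # nearby inputs -> correlation near 1; far apart -> near 0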

Interpretation of the kernel

  • C_n = \exp \left( - \vert \vert x - x' \vert \vert^2 \right)

  • The covariance structure now depends on how close together the inputs are: if inputs are close in distance, then their responses are more highly correlated.

  • The covariance will decay at an exponential rate as x moves away from x'.

Summarizing our data

  • Now we will learn how to use a GP to make predictions at new locations.

  • As we learned, we condition on the data. The distribution below can be thought of as the prior.

\begin{equation} Y_n \ \vert \ X_n \sim \mathcal{N} \ ( \ 0 \ , \ \tau^2 \ C_n(X_n) \ ) \end{equation}

Now, consider (\mathcal{X}, \mathcal{Y}) as the predictive set.
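Before conditioning, it can help to see what functions this prior favors. Below is a minimal sketch (not from the slides' code) that draws a few sample paths from the prior, assuming \tau^2 = 1, the simple kernel above, and the mvtnorm package for the multivariate normal draws:

library(mvtnorm)                 # assumed dependency for rmvnorm()
X <- matrix(seq(0, 2*pi, length = 100), ncol = 1)
C <- exp(-as.matrix(dist(X))^2) + diag(sqrt(.Machine$double.eps), nrow(X))  # jitter for stability
Y_prior <- rmvnorm(5, sigma = C)                                            # five draws from the prior
matplot(X, t(Y_prior), type = "l", lty = 1, ylab = "Y", main = "Draws from the GP prior")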

How to make predictions

  • The goal is to find the distribution of \mathcal{Y} \ \vert X_n, Y_n which in this case is the posterior distribution.

  • By properties of Normal distribution, the posterior is also normally distributed.

We also need to write down the mean and variance of the posterior distribution so it’s ready for use.

How to make predictions

  • First we will “stack” the predictions and the data.
\begin{equation*} \begin{bmatrix} \mathcal{Y} \\ Y_n \\ \end{bmatrix} \ \sim \ \mathcal{N} \left( \; \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix}\; , \; \begin{bmatrix} \Sigma(\mathcal{X}, \mathcal{X}) & \Sigma(\mathcal{X}, X_n)\\ \Sigma({X_n, \mathcal{X}}) & \Sigma_n\\ \end{bmatrix} \; \right) \\[5pt] \end{equation*}
  • Now, let’s denote the predictive mean with \mu(\mathcal{X}) and predictive variance with \sigma^2(\mathcal{X})
\begin{equation} \begin{aligned} \mathcal{Y} \mid Y_n, X_n \sim N\left(\mu(\mathcal{X}) \ , \ \sigma^2(\mathcal{X})\right) \end{aligned} \end{equation}

Distribution of Interest!

  • We will apply the properties of conditional Normal distributions.
\begin{equation*} \begin{aligned} \mu(\mathcal{X}) & = \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} Y_n \\[10pt] \sigma^2(\mathcal{X}) & = \Sigma(\mathcal{X}, \mathcal{X}) - \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} \Sigma(X_n, \mathcal{X}) \\ \end{aligned} \end{equation*}
  • Everything we do relies on these two equations (a by-hand sketch in R follows below). Now, let’s learn more about \Sigma
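Here is the by-hand sketch referenced above: the two equations implemented directly in R, using the simple kernel from earlier with \tau^2 = 1 and a tiny jitter for numerical stability (all illustrative choices):

eps <- sqrt(.Machine$double.eps)
X  <- matrix(seq(0, 2*pi, length = 8), ncol = 1)     # training inputs X_n
y  <- sin(X)                                         # training responses Y_n
XX <- matrix(seq(0, 2*pi, length = 50), ncol = 1)    # predictive locations

Sigma_n  <- exp(-as.matrix(dist(X))^2) + diag(eps, nrow(X))   # Sigma_n
Sigma_pn <- exp(-outer(XX[, 1], X[, 1], "-")^2)               # Sigma(XX, X_n)
Sigma_pp <- exp(-as.matrix(dist(XX))^2)                       # Sigma(XX, XX)

Si    <- solve(Sigma_n)
mu_p  <- Sigma_pn %*% Si %*% y                       # predictive mean
Sig_p <- Sigma_pp - Sigma_pn %*% Si %*% t(Sigma_pn)  # predictive covariance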

Sigma Matrix

  • \Sigma_n = \tau^2 C_n where C_n is our kernel.
  • One of the most common kernels, and the one we will focus on, is the squared exponential distance kernel, written as

C_n = \exp \left( -\frac{\vert\vert x - x' \vert \vert ^2}{\theta} \right ) + g \ \mathbb{I}_n

  • What are \tau^2, g, and \theta, though? No more math; we will just go through these conceptually.

Hyper Parameters

A GP is non-parametric; however, it has a few hyperparameters (see the sketch after this list). In this case,

  • \tau^2 (Scale): This parameter can be used to adjust the amplitude of the data.

  • \theta (Length-scale): This parameter controls the rate of decay of correlation.

  • g (Nugget): This parameter controls the noise in the covariance structure (adds discontinuity)
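As a rough sketch of how all three enter the covariance (the function name and defaults are ours, not from the slides):

sq_exp_cov <- function(X, tau2 = 1, theta = 1, g = sqrt(.Machine$double.eps)) {
  D <- as.matrix(dist(X))^2                       # squared distances ||x - x'||^2
  tau2 * (exp(-D / theta) + diag(g, nrow(X)))     # scale * (correlation + nugget)
}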

Scale (Amplitude)

  • A random draw from a multivariate normal distribution with \tau^2 = 1 will typically produce values between about -2 and 2 (roughly two standard deviations).

  • Now let’s visualize what happens when we increase \tau^2 to 25.

Length-scale (Rate of decay of correlation)

  • Determines how “wiggly” a function is

  • Smaller \theta means wigglier functions i.e. visually:

Nugget (Noise)

  • Adds a discontinuity that prevents exact interpolation of the data, which in turn yields better UQ.

  • We will compare a sample with g \approx 0 (< 1e-8, for numeric stability) versus g = 0.1 to observe what actually happens, as sketched below.
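A minimal sketch of that comparison, again using mvtnorm (an assumed dependency) for the draws:

library(mvtnorm)
X <- matrix(seq(0, 2*pi, length = 100), ncol = 1)
D <- as.matrix(dist(X))^2
C_smooth <- exp(-D) + diag(1e-8, nrow(X))   # g ~ 0: smooth, interpolating draws
C_noisy  <- exp(-D) + diag(0.1, nrow(X))    # g = 0.1: visible jitter around the smooth surface
par(mfrow = c(1, 2))
matplot(X, t(rmvnorm(3, sigma = C_smooth)), type = "l", lty = 1, main = "g ~ 0")
matplot(X, t(rmvnorm(3, sigma = C_noisy)),  type = "l", lty = 1, main = "g = 0.1")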

Toy Example (1D Example)

# 100 equally spaced inputs on [0, 2*pi]
X <- matrix(seq(0, 2*pi, length = 100), ncol = 1)
n <- nrow(X)
true_y <- 5 * sin(X)                  # true (noise-free) response
obs_y <- true_y + rnorm(n, sd = 1)    # noisy observations

Toy Example (1D Example)

eps <- sqrt(.Machine$double.eps)      # tiny value used as a lower bound

# initialize the GP with starting lengthscale d and nugget g (dK = TRUE enables MLE)
gpi <- laGP::newGP(X = X, Z = obs_y, d = 0.1, g = 0.1 * var(obs_y), dK = TRUE)

# jointly estimate the lengthscale and nugget by maximum likelihood
mle <- laGP::jmleGP(gpi, drange = c(eps, 10), grange = c(eps, var(obs_y)))

# predict at a dense grid of new locations
XX <- matrix(seq(0, 2*pi, length = 1000), ncol = 1)
p <- laGP::predGP(gpi = gpi, XX = XX)
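The object p then holds the predictive summaries. A sketch of what one might do with them (the plotting choices are ours; p$mean and p$Sigma are what predGP returns when lite = FALSE, its default):

mean_p <- p$mean                          # predictive mean at XX
s2_p   <- diag(p$Sigma)                   # predictive variance at XX
plot(X, obs_y, col = "grey50", xlab = "X", ylab = "Y")
lines(XX, mean_p, lwd = 2)                                    # fitted surface
lines(XX, mean_p - 1.96 * sqrt(s2_p), lty = 2)                # approximate 95% interval
lines(XX, mean_p + 1.96 * sqrt(s2_p), lty = 2)
laGP::deleteGP(gpi)                       # free the C-side GP object when done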

Extensions

  • Anisotropic Gaussian Processes: When our data are multi-dimensional, we can use a separate length-scale (\theta_k) for each input dimension.

  • Heteroskedastic Gaussian Processes: When our data are noisy and the noise is input-dependent, we can use a different nugget for each unique input rather than a single scalar g.

Anisotropic Gaussian Processes

In this situation, we can rewrite the C_n matrix as,

C_\theta(x , x') = \exp \left( -\sum_{k=1}^{m} \frac{ (x_k - x_k')^2 }{\theta_k} \right ) + g \ \mathbb{I}_n

Here, \theta = (\theta_1, \theta_2, …, \theta_m) is a vector of length-scales, where m is the dimension of the input space.
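A by-hand sketch of this separable kernel (the function name and defaults are illustrative; laGP also offers separable counterparts such as newGPsep):

sep_cov <- function(X, tau2 = 1, theta = rep(1, ncol(X)), g = sqrt(.Machine$double.eps)) {
  n <- nrow(X)
  S <- matrix(0, n, n)
  for (k in 1:ncol(X)) {       # accumulate the per-dimension scaled squared distances
    S <- S + as.matrix(dist(X[, k, drop = FALSE]))^2 / theta[k]
  }
  tau2 * (exp(-S) + diag(g, n))   # scale * (correlation + nugget)
}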

Heteroskedastic Gaussian Processes

HetGP Setup

  • Let Y_n be the response vector of size n. 

  • Let X = (X_1, X_2, \dots, X_n) be the corresponding inputs.

Then, a regular GP is written as:

\begin{align*} Y_n \ & \ \sim GP \left( 0 \ , \tau^2 C_n \right); \ \text{where, }\\[2pt] C_n & \ = \exp \left( -\frac{\vert\vert x - x' \vert \vert ^2}{\theta} \right ) + g \ \mathbb{I}_n \end{align*}

HetGP Setup

In case of a hetGP, we have:

\begin{aligned} Y_n\ & \ \sim GP \left( 0 \ , \tau^2 C_{n, \Lambda} \right) \ \ \text{where, }\\[2pt] C_{n, \Lambda} & \ = \exp \left( -\frac{\vert\vert x - x' \vert \vert ^2}{\theta} \right ) + \Lambda_n \ \ \ \text{and, }\ \ \\[2pt] \ \ \Lambda_n \ & \ = \ \text{Diag}(\lambda_1, \lambda_2, \dots , \lambda_n) \\[2pt] \end{aligned}

  • Instead of one nugget for the GP, we have a vector of nuggets i.e. a unique nugget for each unique input.

HetGP Predictions

  • Recall, for a GP, we make predictions using the following:
\begin{equation*} \begin{aligned} \mu(\mathcal{X}) & = \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} Y_n \\ \sigma^2(\mathcal{X}) & = \Sigma(\mathcal{X}, \mathcal{X}) - \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} \Sigma(X_n, \mathcal{X}) \\ \end{aligned} \end{equation*}
  • We can make predictions using these same equations with \Sigma(X_n) \ \ = \tau^2 C_{n, \Lambda}
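The hetGP package of Binois, Gramacy, and Ludkovski (2018) implements this machinery. A hedged sketch on an illustrative toy problem, using only its basic arguments:

library(hetGP)                                            # assumed dependency
X <- matrix(seq(0, 2*pi, length = 100), ncol = 1)
y <- 5 * sin(X[, 1]) + rnorm(100, sd = 0.1 + X[, 1] / 3)  # noise level grows with X
fit <- mleHetGP(X = X, Z = y, covtype = "Gaussian")       # fit the heteroskedastic GP
XX <- matrix(seq(0, 2*pi, length = 500), ncol = 1)
p  <- predict(fit, x = XX)                                # p$mean, p$sd2 (process var), p$nugs (noise var)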

Toy Example (1D Example)

Intro to Ticks Problem

Tick Population Forecasting

Some details about the challenge:

  • Objective: Forecast tick density for 4 weeks into the future

  • Sites: The data are collected across 9 different sites; each plot is 1600 m^2 and is sampled using a drag cloth.

  • Data: Sparse and irregularly spaced. We only have ~650 observations across 10 years at 9 locations

Predictors

  • X_1 Iso-week: The week in which the tick density was recorded.

  • X_2 Sine wave: \left( \sin \left( \frac{2 \ \pi \ X_1}{106} \right) \right)^2 (constructed in the sketch below).

  • X_3 Greenness: An environmental predictor (introduced in the practical).
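A small sketch of constructing the first two predictors, assuming a data frame ticks with a Date column datetime (these column names are hypothetical):

ticks$iso_week <- as.numeric(format(ticks$datetime, "%V"))   # X1: ISO week of the year
ticks$sin_wave <- sin(2 * pi * ticks$iso_week / 106)^2       # X2: squared sine wave built from X1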

Practical

  • Set up these predictors
  • Transform the data toward normality
  • Fit a GP to the data
  • Make predictions on a testing set
  • Check how the predictions perform (a rough workflow sketch follows).
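As referenced above, a rough workflow sketch using the hypothetical ticks data frame from the previous slide (the transform, train/test split, and column names are illustrative choices, not prescribed by the practical):

train <- ticks[ticks$year <  2020, ]                  # hypothetical train/test split by year
test  <- ticks[ticks$year >= 2020, ]
X  <- as.matrix(train[, c("iso_week", "sin_wave")])
y  <- log(train$density + 1)                          # one common transform toward normality
XX <- as.matrix(test[,  c("iso_week", "sin_wave")])

eps <- sqrt(.Machine$double.eps)
gpi <- laGP::newGP(X = X, Z = y, d = 1, g = 0.1 * var(y), dK = TRUE)
laGP::jmleGP(gpi, drange = c(eps, 100), grange = c(eps, var(y)))
p <- laGP::predGP(gpi = gpi, XX = XX)
pred_density <- exp(p$mean) - 1                       # back-transform predictions to the original scale
laGP::deleteGP(gpi)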

References

Binois, Mickael, Robert B Gramacy, and Mike Ludkovski. 2018. “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics 27 (4): 808–21.
Gramacy, Robert B. 2020. Surrogates: Gaussian Process Modeling, Design, and Optimization for the Applied Sciences. Chapman; Hall/CRC.
Thomas, R Quinn, Carl Boettiger, Cayelan C Carey, Michael C Dietze, Leah R Johnson, Melissa A Kenney, Jason S Mclachlan, et al. 2022. “The NEON Ecological Forecasting Challenge.” Authorea Preprints.