Introduction to Gaussian Processes for Time Dependent Data

Virginia Tech

The VectorByte Team (Parul Vijay Patil, Leah R. Johnson, Robert B. Gramacy)

Outline

We will go through the following topics:

  1. Motivation as Ecologists

  2. Introducing Gaussian Processes (GPs)

  3. Hyper-Parameters in GPs

  4. Fitting a GP with some code

  5. Heteroskedastic GPs (HetGPs)

  6. Fitting a HetGP with some code

  7. Motivating Ecological Example: Hands-On Practice

Time-Dependent Data

  • In ecology, we often have data with features such as:

    • Sparse and irregularly sampled time series (Time-Dependent Data).

    • Patterns that vary from location to location.

    • Noisy observations arising from sampling error.

Gaussian Process: Introduction

  • Gaussian Process (GP) models are non-parametric, flexible regression models.

  • Excellent for uncertainty quantification (UQ).

  • Suppose we have n observations Y_n corresponding to inputs X_n, then

Y_{n} \sim N (\mu, \Sigma_n)

  • We wish to estimate the response at a new input X_p, i.e., Y_p \mid Y_n, X_n.

  • Note: if you set \mu = X \beta and \Sigma_n = \sigma^2 \mathbb{I}, we recover the Linear Regression (LR) setup.

Distance

  • For a GP, the covariance matrix \Sigma_n is defined by a distance-based kernel.

  • Consider,

\Sigma_n = \tau^2 C_n \quad \text{where} \quad C_n^{ij} = \exp \left( - \vert \vert x_i - x_j \vert \vert^2 \right)

  • The covariance structure now depends on how close together the inputs are.

  • The covariance will decay at an exponential rate as x moves away from x'.
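
  • As a concrete illustration, the following base-R sketch (plus the mvtnorm package; all object names are our own) builds C_n from pairwise distances and draws one random function from the resulting GP prior:

library(mvtnorm)  # for rmvnorm()

# One-dimensional inputs
X <- matrix(seq(0, 10, length = 100), ncol = 1)

# Correlation matrix C_n with C_n^{ij} = exp(-||x_i - x_j||^2),
# plus a tiny jitter on the diagonal for numerical stability
D <- as.matrix(dist(X))^2
C <- exp(-D) + diag(sqrt(.Machine$double.eps), nrow(X))

# One random function drawn from the GP prior (tau^2 = 1)
Y <- drop(rmvnorm(1, sigma = C))
plot(X, Y, type = "l", main = "A draw from the GP prior")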

Visualizing a GP

Predictions using the Data

  • We wish to find \mathcal{Y} \vert X_n, Y_n, and we know Y_n \vert X_n \sim \mathcal{N}(0, \Sigma_n) (taking \mu = 0).

  • First we will “stack” the predictions and the data.

\begin{equation*} \begin{bmatrix} \mathcal{Y} \\ Y_n \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma(\mathcal{X}, \mathcal{X}) & \Sigma(\mathcal{X}, X_n) \\ \Sigma(X_n, \mathcal{X}) & \Sigma_n \end{bmatrix} \right) \end{equation*}
  • By the properties of the Normal distribution, \mathcal{Y} \vert X_n, Y_n is also normally distributed.

  • We can write this as:

\begin{equation*} \begin{aligned} \mathcal{Y} \mid Y_n, X_n \sim \mathcal{N} \left(\mu(\mathcal{X}), \sigma^2(\mathcal{X})\right) \end{aligned} \end{equation*}

Distribution of Interest!

  • We will apply the properties of conditional Normal distributions.
\begin{equation*} \begin{aligned} \mu(\mathcal{X}) & = \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} Y_n \\[10pt] \sigma^2(\mathcal{X}) & = \Sigma(\mathcal{X}, \mathcal{X}) - \Sigma(\mathcal{X}, X_n) \Sigma_n^{-1} \Sigma(X_n, \mathcal{X}) \\ \end{aligned} \end{equation*}
  • To make predictions at a single new location x, the corresponding diagonal entry is C(x, x) + g = 1 + g, where g is a small nugget defined below.

  • We need to focus on \Sigma so we can tune our GP appropriately.
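
  • As a sanity check, these formulas can be coded directly. A minimal sketch, assuming \tau^2 = 1 and a small fixed nugget g (the tiny data set and names are our own), using the plgp package for pairwise squared distances:

library(plgp)  # distance(): pairwise squared Euclidean distances

# Tiny illustrative data set
X <- matrix(c(1, 2, 4), ncol = 1)
Yn <- c(0.5, 1.2, -0.3)
XX <- matrix(seq(0, 5, length = 50), ncol = 1)
g <- 1e-8  # small nugget for numerical stability

# Covariance blocks from the stacked normal above (tau^2 = 1)
Sn <- exp(-distance(X)) + diag(g, nrow(X))
SXn <- exp(-distance(XX, X))
SXX <- exp(-distance(XX)) + diag(g, nrow(XX))

# Conditional mean and pointwise variance of the predictions
Si <- solve(Sn)
mu <- SXn %*% Si %*% Yn
s2 <- diag(SXX - SXn %*% Si %*% t(SXn))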

Sigma Matrix

  • \Sigma_n = \tau^2 \left( C_{\theta}(X_n) + g \mathbb{I}_n \right), where C_\theta is our kernel.
  • One of the most common kernels which we will focus on is the squared exponential distance kernel written as

C_\theta(x, x') = \exp{ \left( -\frac{\vert\vert x - x' \vert \vert ^2}{\theta} \right ) }

  • What are \tau^2, g, and \theta, though? No more math; we will go through these conceptually.

Hyper-Parameters

A GP is non-parametric; however, it does have some hyper-parameters. In this case,

  • \tau^2 (Scale): This parameter can be used to adjust the amplitude of the data.

  • \theta (Length-scale): This parameter controls the rate of decay of correlation.

  • g (Nugget): This parameter controls the noise in the covariance structure (adds discontinuity)

These hyper-parameters \{ \tau^2, \theta, g \} are inferred using MLE/Bayesian schemes.
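
For example, after fitting with mleHomGP() from the hetGP package (used later in these slides), the estimates live on the fitted object; the element names below are those used by hetGP:

library(hetGP)

# A small fit just to expose the estimated hyper-parameters
X <- matrix(seq(0, 2 * pi, length = 40), ncol = 1)
y <- drop(sin(X)) + rnorm(40, sd = 0.1)
gp <- mleHomGP(X, y)

gp$theta   # length-scale estimate
gp$g       # nugget estimate
gp$nu_hat  # scale estimate (tau^2)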

Scale (Amplitude)

  • A random draw from a multivariate normal distribution with \tau^2 = 1 will produce values roughly between -2 and 2 (about two standard deviations).

  • Now let’s visualize what happens when we increase \tau^2 to 25.
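
  • A sketch of that comparison (reusing the prior-draw setup from earlier; names are our own):

library(mvtnorm)

X <- matrix(seq(0, 10, length = 100), ncol = 1)
C <- exp(-as.matrix(dist(X))^2) + diag(sqrt(.Machine$double.eps), 100)

Y1 <- drop(rmvnorm(1, sigma = 1 * C))   # tau^2 = 1: roughly within (-2, 2)
Y25 <- drop(rmvnorm(1, sigma = 25 * C)) # tau^2 = 25: roughly within (-10, 10)

plot(X, Y25, type = "l", ylab = "Y")
lines(X, Y1, col = 2)  # the tau^2 = 1 draw for comparison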

Length-scale (Rate of decay of correlation)

  • Determines how “wiggly” a function is

  • Smaller \theta means wigglier functions, i.e., visually:
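
  • A sketch of that comparison (the \theta values are our own choices):

library(mvtnorm)

X <- matrix(seq(0, 10, length = 100), ncol = 1)
D <- as.matrix(dist(X))^2
eps <- sqrt(.Machine$double.eps)

# Smaller theta: correlation decays faster, so draws are wigglier
Y_wiggly <- drop(rmvnorm(1, sigma = exp(-D / 0.1) + diag(eps, 100)))
Y_smooth <- drop(rmvnorm(1, sigma = exp(-D / 5) + diag(eps, 100)))

plot(X, Y_wiggly, type = "l", ylab = "Y")
lines(X, Y_smooth, col = 2)  # larger theta: smoother draw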

Nugget (Noise)

  • Ensures discontinuity and prevents interpolation, which in turn yields better UQ.

  • We will compare a draw with g \approx 0 (set below 1e-8 for numeric stability) vs. g = 0.1 to observe what actually happens; see the sketch below.
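
  • A sketch of that comparison (our own setup):

library(mvtnorm)

X <- matrix(seq(0, 10, length = 100), ncol = 1)
C <- exp(-as.matrix(dist(X))^2)

# g ~ 0: an essentially noise-free, smooth draw
Y_smooth <- drop(rmvnorm(1, sigma = C + diag(1e-8, 100)))
# g = 0.1: a visibly noisy, discontinuous draw
Y_noisy <- drop(rmvnorm(1, sigma = C + diag(0.1, 100)))

plot(X, Y_noisy, pch = 20, ylab = "Y")
lines(X, Y_smooth, col = 2)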

Toy Example (1D Example)


set.seed(1)  # for reproducibility

# Noisy observations of a sine curve at n = 100 inputs
X <- matrix(seq(0, 2 * pi, length = 100), ncol = 1)
n <- nrow(X)
true_y <- 5 * sin(X)
obs_y <- true_y + rnorm(n, sd = 1)

# A denser grid of new locations for prediction
XX <- matrix(seq(0, 2 * pi, length = 200), ncol = 1)

Toy Example (1D Example)


library(hetGP)  # provides mleHomGP() and mleHetGP()

# Fit a homoskedastic GP
gp <- mleHomGP(X, obs_y)

# Make predictions at the new locations
p <- predict(gp, XX)

# Predictive mean and total variance (latent GP variance + noise)
mean_gp <- p$mean
s2_gp <- p$sd2 + p$nugs
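
To see the fit, a short plotting sketch using the objects above (the bands are the usual mean \pm 2 standard deviations):

plot(X, obs_y, pch = 20, col = "grey50", ylab = "y")
lines(XX, mean_gp, col = 2, lwd = 2)                   # predictive mean
lines(XX, mean_gp + 2 * sqrt(s2_gp), col = 4, lty = 2) # upper band
lines(XX, mean_gp - 2 * sqrt(s2_gp), col = 4, lty = 2) # lower band
lines(XX, 5 * sin(XX), lty = 3)                        # true function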

Extensions: Anisotropic Gaussian Processes

  • Suppose we have a d dimensional input space, X_{n \times d}. We can have one length-scale for each dimension i.e. \mathbf{\theta} = \{ \theta_1, \dots, \theta_d \}.

  • In this situation, we can rewrite the C_n matrix as,

C_\theta(x , x') = \exp{ \left( -\sum_{k=1}^{d} \frac{ (x_k - x_k')^2 }{\theta_k} \right )}

  • This is also called a Separable GP.

  • We will explore newGPsep, mleGPsep and predGPsep (from the laGP package), as sketched below.
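
  • A hedged sketch of that workflow with the laGP package (the toy data, starting values, and settings are our own; only the function names come from laGP):

library(laGP)

# Two-dimensional toy data
X <- as.matrix(expand.grid(seq(0, 1, length = 10), seq(0, 1, length = 10)))
y <- sin(2 * pi * X[, 1]) + cos(2 * pi * X[, 2]) + rnorm(nrow(X), sd = 0.1)
XX <- matrix(runif(20), ncol = 2)

# Separable GP: one starting length-scale per dimension, plus a nugget
gpi <- newGPsep(X, y, d = rep(0.5, 2), g = 1e-4, dK = TRUE)
mle <- jmleGPsep(gpi)  # joint MLE for {theta_1, theta_2, g}

p <- predGPsep(gpi, XX, lite = TRUE)  # lite: pointwise variances only
p$mean  # predictive means
p$s2    # predictive variances

deleteGPsep(gpi)  # free the underlying C-side object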

Tick Populations: ORNL

Date         Iso-week (c)   Periodicity   Transformed Density
2014-06-23   0.49           1.00          5.72
2014-07-14   0.55           0.98          5.18
2015-05-04   0.36           0.82          3.51


# Fit a homoskedastic GP to the tick data
# (X and y are built from the predictors and densities above)
gp <- mleHomGP(X, y)

# Make predictions at the test inputs XXt
ppt <- predict(gp, XXt)


Extension: Heteroskedastic GPs (HetGP)

  • We can use a different nugget for each unique input rather than a scalar g.

HetGP Setup

  • Let X_n, Y_n be the data and \boldsymbol{\lambda}_n be the noise levels at the inputs X_n.

  • For a hetGP, we have:

\begin{align*} Y_n & \sim GP \left( 0, \tau^2 \left( C_{\theta_Y}(X_n) + \Lambda_n \right) \right) \quad \text{where} \quad \Lambda_n = \text{Diag}(\boldsymbol{\lambda}_n);\\ \log \boldsymbol{\lambda}_n & \sim GP \left( 0, \tau_\lambda^2 C_{\theta_\lambda}(X_n) \right)\\ \end{align*}

  • Note that in a regular GP, \Lambda_n = g \mathbb{I}_n, i.e., we average over the noise across the input space.

  • We must infer \{ \boldsymbol{\lambda}_n, \theta_Y, \theta_\lambda, \tau^2, \tau^2_\lambda \} using MLE/Bayesian schemes.

HetGP Predictions

HetGP Latent Layer Modeling

HetGP: Toy Example (1D Example)


# Homoskedastic fit for comparison (constant noise level)
homgp <- mleHomGP(x, y)
p_hom <- predict(object = homgp, x = XX)

mean_gp <- p_hom$mean
s2_gp <- p_hom$sd2 + p_hom$nugs

# Heteroskedastic fit: the noise level is itself modeled over the inputs
hetgp <- mleHetGP(x, y)
p_het <- predict(object = hetgp, x = XX)

mean_het <- p_het$mean
s2_het <- p_het$sd2 + p_het$nugs

HetGP: Noise Levels
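
  • A hedged sketch of how one might visualize the estimated noise levels, reusing p_hom and p_het from the previous slide (in hetGP, predict returns the noise part of the predictive variance in $nugs):

# hetGP learns a noise level that varies over the inputs;
# the homoskedastic fit uses a single constant level
plot(XX, p_het$nugs, type = "l", xlab = "x", ylab = "noise variance")
abline(h = mean(p_hom$nugs), col = 2, lty = 2)  # constant homGP noise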


Tick Population Forecasting

  • Objective: Forecast tick density for 4 weeks into the future.

  • Sites: The data is collected across 9 different NEON plots.

  • Data: Sparse and irregularly spaced.

  • n \approx 570 observations since 2014.

Predictors

  • Iso-week, X_1 = 1, 2, \dots, 53.

  • Periodicity, X_2 = \sin^2 \left( \frac{2 \pi X_1}{106} \right).

  • Mean Elevation, X_3.
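
  • A hedged sketch of building these predictors, assuming a data frame ticks with columns date and elevation (both names are our own; strftime's "%V" gives the ISO 8601 week on most platforms):

# Hypothetical input: data frame `ticks` with `date` and `elevation`
X1 <- as.integer(strftime(ticks$date, format = "%V"))  # iso-week, 1..53
X2 <- sin(2 * pi * X1 / 106)^2                         # periodicity
X3 <- ticks$elevation                                  # mean elevation
X <- cbind(X1, X2, X3)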

Practical

  • Set up these predictors.
  • Transform the data toward normality.
  • Fit a GP to the data.
  • Make predictions on a testing set.
  • Check how the predictions perform (a sketch of these steps follows below).
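
One possible end-to-end skeleton (the log transform and the 80/20 chronological split are our assumptions, not prescribed by the exercise):

library(hetGP)

# Hypothetical objects: X (predictors built above) and dens (raw densities)
y <- log(dens + 1)  # transform toward normality (assumption)

train <- seq_len(floor(0.8 * nrow(X)))  # chronological 80/20 split (assumption)
gp <- mleHetGP(X[train, , drop = FALSE], y[train])

# Predict on the held-out weeks and score with RMSE
p <- predict(gp, X[-train, , drop = FALSE])
sqrt(mean((p$mean - y[-train])^2))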

References

Binois, Mickael, Robert B Gramacy, and Mike Ludkovski. 2018. “Practical Heteroscedastic Gaussian Process Modeling for Large Simulation Experiments.” Journal of Computational and Graphical Statistics 27 (4): 808–21.
Thomas, R Quinn, Carl Boettiger, Cayelan C Carey, Michael C Dietze, Leah R Johnson, Melissa A Kenney, Jason S Mclachlan, et al. 2022. “The NEON Ecological Forecasting Challenge.” Authorea Preprints.