Vignette 2: GSPCR specification options

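Here we focus on the specification of the GSPCR model. Three arguments of cv_gspcr() should be specified carefully, and each is covered in a section below: thrs (the association measure used as a threshold type), fit_measure (the cross-validation fit measure), and npcs_range (the numbers of PCs to consider).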

In this vignette we consider a simple scenario with a continuous dependent variable and a set of continuous predictors. First, we load the required packages and store the example dataset GSPCRexdata (see the help file, ?GSPCRexdata, for details) in two separate objects:

# Load R packages
library(gspcr) # this package!
library(superpc) # alternative comparison package
library(patchwork) # combining ggplots

# Store the continuous predictors and the continuous dependent variable
X <- GSPCRexdata$X$cont
y <- GSPCRexdata$y$cont

1 Association measures

As described in the introduction, gspcr allows for the specification of different bivariate association measures. We can run gspcr using "LLS", "normalized", or "PR2" as the threshold type.

Another important aspect is the number of threshold values to evaluate, which is specified with the nthrs argument. Using the following code we can compare the solution paths obtained with the different association measures, across 20 threshold values, for a fixed number of PCs (here, 1).

# Define a vector of threshold types
threshold_types <- c("LLS", "normalized", "PR2")

# Train the GSPCR model once per threshold type
out_trhs <- lapply(
    X = threshold_types,
    FUN = function(i) {
        cv_gspcr(
            dv = y,
            ivs = X,
            thrs = i,       # threshold type
            nthrs = 20,     # number of threshold values
            npcs_range = 1, # number of PCs
            K = 10
        )
    }
)

# Plot them
plots <- lapply(out_trhs, function(i) {
    plot(
        x = i,
        y = "F",
        labels = FALSE,     # We are using a single nPC, do not need the label
        discretize = FALSE, # Makes X-axis more readable
        print = FALSE
    )
})

# Patchwork ggplots
plots[[1]] + plots[[2]] + plots[[3]]

Figure 1: Solution paths for different association measures.

As you can see, the solution paths are similar, although LLS tends to favor lower threshold values.
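
To make the roles of the association measure and of nthrs more concrete, the following base R sketch computes a bivariate association (here, the R-squared of each simple regression) for every predictor and counts how many predictors survive each threshold value. It is only an illustration on simulated data: the object names (X_sim, y_sim, thr_grid) and the equally spaced grid are assumptions made for this example, not gspcr's internal implementation.

```r
# Illustrative sketch in base R (not gspcr's internal code)
set.seed(1)
n <- 100
p <- 8
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) <- paste0("x", seq_len(p))

# Only the first three predictors are related to the outcome
y_sim <- as.numeric(X_sim[, 1:3] %*% c(1, 0.8, 0.5) + rnorm(n))

# Bivariate association: R-squared of each simple regression y ~ x_j
r2 <- apply(X_sim, 2, function(x) summary(lm(y_sim ~ x))$r.squared)

# Grid of candidate thresholds (assumption: equally spaced between 0 and
# the second-largest association, so no threshold discards every predictor)
nthrs <- 5
thr_grid <- seq(0, sort(r2, decreasing = TRUE)[2], length.out = nthrs)

# Number of predictors retained at each threshold value
n_kept <- sapply(thr_grid, function(thr) sum(r2 >= thr))
data.frame(threshold = round(thr_grid, 3), n_kept = n_kept)
```

A larger nthrs gives a finer grid and therefore more candidate subsets of predictors to cross-validate, at the cost of more model fits.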

2 Fit measures

We can use different cross-validation fit measures. See the help file for the list of options (?cv_gspcr).

# Measures
fit_measure_vec <- c("LRT", "PR2", "MSE", "F", "AIC", "BIC")

# Train the GSPCR model once per fit measure (1 PC)
out_fit_meas <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        fit_measure = i,
        thrs = "normalized",
        nthrs = 20,
        npcs_range = 1,
        K = 10
    )
})

# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas[[i]],
        y = fit_measure_vec[[i]],
        labels = FALSE,
        y_reverse = rev,
        errorBars = FALSE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])

Figure 2: Solution paths for different fit measures.

As you can see, the different fit measures return equivalent solution paths. The same holds when a larger number of PCs is used, for example 5:

# Train the GSPCR model once per fit measure (5 PCs)
out_fit_meas <- lapply(fit_measure_vec, function(i) {
    cv_gspcr(
        dv = y,
        ivs = X,
        fit_measure = i,
        thrs = "normalized",
        nthrs = 20,
        npcs_range = 5,
        K = 10
    )
})

# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
    # Reverse y?
    rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

    # Make plots
    plot(
        x = out_fit_meas[[i]],
        y = fit_measure_vec[[i]],
        labels = FALSE,
        y_reverse = rev,
        errorBars = FALSE,
        discretize = FALSE,
        print = FALSE
    )
})

# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])

Figure 3: Solution paths for different fit measures when using 5 PCs.
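
The y-axis is reversed for MSE, AIC, and BIC in the plots above because smaller values of these measures indicate better fit, whereas larger values are better for LRT, F, and PR2. The base R sketch below computes in-sample analogues of the six measures for one simulated linear model to show the direction of each. The variable pc1 is a hypothetical stand-in for a principal component, and the values are computed on the full sample rather than on validation folds, so this is not how cv_gspcr() obtains them.

```r
# Illustrative sketch in base R (in-sample, not cross-validated)
set.seed(2)
n <- 120
pc1 <- rnorm(n)                 # hypothetical stand-in for a first PC
y_sim <- 0.6 * pc1 + rnorm(n)

fit_null <- lm(y_sim ~ 1)       # intercept-only model
fit_pc <- lm(y_sim ~ pc1)       # model using the component

ll_null <- as.numeric(logLik(fit_null))
ll_pc <- as.numeric(logLik(fit_pc))

measures <- c(
    LRT = 2 * (ll_pc - ll_null),            # larger is better
    F   = anova(fit_null, fit_pc)$F[2],     # larger is better
    R2  = summary(fit_pc)$r.squared,        # larger is better
    MSE = mean(residuals(fit_pc)^2),        # smaller is better
    AIC = AIC(fit_pc),                      # smaller is better
    BIC = BIC(fit_pc)                       # smaller is better
)
round(measures, 3)
```

Because all six measures are driven by the same reduction in residual variation, it is unsurprising that they tend to rank candidate models similarly.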

3 Number of components

We can use cross-validation to select the number of PCs as well. The npcs_range argument specifies which numbers of PCs to consider.

# Train the model
out_npcs <- cv_gspcr(
    dv = y,
    ivs = X,
    npcs_range = c(2, 5, 10)
)

# Plot solution paths
plot(out_npcs)

Figure 4: Solution paths when cross-validating the number of PCs.

Given the choice of 2, 5, or 10 PCs, we would use 2 PCs with the second threshold value.
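
For readers who want to see the idea behind cross-validating the number of PCs, here is a minimal base R sketch of K-fold cross-validation for ordinary principal component regression on simulated data. It leaves out the thresholding step entirely and uses MSE as the only fit measure, so it is a simplified illustration under those assumptions rather than a reproduction of what cv_gspcr() does. All object names are made up for this example.

```r
# Illustrative sketch in base R: K-fold CV over the number of PCs
set.seed(3)
n <- 150
p <- 12
X_sim <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X_sim) <- paste0("x", seq_len(p))
y_sim <- as.numeric(X_sim[, 1:2] %*% c(1, -1) + rnorm(n))

K <- 5
npcs_candidates <- c(2, 5, 10)
fold_id <- sample(rep(seq_len(K), length.out = n))  # random fold labels

cv_mse <- sapply(npcs_candidates, function(q) {
    fold_mse <- sapply(seq_len(K), function(k) {
        train <- fold_id != k

        # Estimate the PCA on the training fold only
        pca <- prcomp(X_sim[train, ], center = TRUE, scale. = TRUE)
        pcs_train <- pca$x[, seq_len(q), drop = FALSE]

        # Project the validation fold onto the training-fold components
        pcs_valid <- predict(pca, newdata = X_sim[!train, ])[, seq_len(q), drop = FALSE]

        # Regress the outcome on the first q components
        fit <- lm(y ~ ., data = data.frame(y = y_sim[train], pcs_train))
        pred <- predict(fit, newdata = data.frame(pcs_valid))

        mean((y_sim[!train] - pred)^2)
    })
    mean(fold_mse)
})
names(cv_mse) <- npcs_candidates

cv_mse                                  # average validation MSE per candidate
npcs_candidates[which.min(cv_mse)]      # candidate with the lowest CV MSE
```

Estimating the PCA inside each training fold, rather than once on the full data, keeps the validation observations out of the component estimation.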