missForestPredict convergence criteria and error monitoring

Elena Albu

2023-12-11

Introduction

What is this document?

This document explains in detail the convergence criteria and error monitoring.

For basic usage of the missForestPredict package read the Using the missForestPredict package vignette.

Convergence criteria

missForestPredict imputes each variable in a dataset in an iterative fashion, using an adapted version of the misForest algorithm (Stekhoven and Bühlmann 2012). By default, convergence is calculated based on the OOB error, but the apparent error can be used too. At each iteration the out-of-bag (OOB) error is calculated for each variable separately. To obtain a global error the OOB errors for all variables a weighted average is used, that can be controlled by the var_weights parameter. By default the weight of each variable in the convergence criteria is set to the proportion of missing values in the dataset.

The normalized mean square error (NMSE) is used for both continuous and categorical variables. For continuous variables, it is equivalent to \(1 - R^2\). For categorical variables, it is equivalent to \(1 - BSS\) (Brier Skill Score).

Continuous variables:

\(NMSE = \frac{\sum_{i=1}^{N}(x_i - \hat{x_i})^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} = 1 - R^2\), \(i = 1, 2, ... N\)

\({x_i}\) = the true value of variable x for observation i

\(\bar{x}\) = the mean value of variable x

\(\hat{x_i}\) = prediction (imputation) of variable x for observation i

\(N\) = number of observations

Categorical variables (including ordinal factors):

\(NMSE = \frac{BS}{BSref} = 1 - BSS\)

\(BS =\frac{1}{N}\sum_{j=1}^{R}\sum_{i=1}^{N}(p_{ij} - x_{ij})^2\), \(i = 1, 2, ... N\), \(j = 1, 2, ... R\)

\(BSref =\frac{1}{N}\sum_{j=1}^{R}\sum_{i=1}^{N}(p_{j} - x_{ij})^2=1-\sum_{j=1}^{R}p_j^2\)

\({x_{ij}}\) = the true value of variable x for observation i and class j (1 if observation i has class j and 0 otherwise)

\(p_{ij}\) = prediction (probability) for observation i and class j

\(p_{j}\) = proportion of the event in class j

\(N\) = number of observations

\(R\) = number of classes

The Brier Score (\(BS\)) is calculated as the sum of square distances between the predictions (as probabilities) and the true values (0 or 1) for each class. The reference Brier Score (\(BSref\)) is calculated as the Brier Score of a predictor that predicts the proportion of the event in each class (Brier et al. 1950).

Imputation error monitoring

At each iteration, the MSE (mean square error) and NMSE (normalized mean square error) are saved and reported for each variable if verbose = TRUE.

Post imputation, the error can be checked using the evaluate_imputation_error function.

Examples

Please note that in all following examples we set the ranger parameter num.threads to 2. If you are running the code on a machine with more cores and you are willing to use them for running the code, you can remove this parameter completely.

Data used for demonstration

Iris data The iris dataset in R base contains 4 continuous variables and one categorical variable with three categories for N = 150 flowers (Anderson 1935).

Diamonds data The diamonds dataset from ggplot2 R package contains seven continuous variables and three categorical variables for N = 53940 diamonds (Wickham 2016).

Inspect convergence errors

The MSE and NMSE errors are returned at the end of the imputation as part of the return object.

library(missForestPredict)

data(iris)
set.seed(2022)
iris_mis <- produce_NA(iris, proportion = 0.5)

set.seed(2022)
missForest_object <- missForestPredict::missForest(iris_mis, verbose = TRUE, num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0.5) Sepal.Width (0.5) Petal.Length (0.5) Petal.Width (0.5) Species (0.5) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.2733877591, 0.1174921291, 0.2974377969, 0.0406175941, 0.0445078208
#>     OOB errors NMSE:           0.3835061935, 0.7412939699, 0.1024426742, 0.0699874304, 0.2010357269
#>     diff. convergence measure: 0.700346801
#>     time:                      0.2 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.071155368, 0.1117334137, 0.1350545451, 0.0317446323, 0.0368819682
#>     OOB errors NMSE:           0.0998161894, 0.7049604642, 0.0465150996, 0.054698593, 0.1665907957
#>     diff. convergence measure: 0.0851369706
#>     time:                      0.17 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.0753490553, 0.0925979462, 0.0824618803, 0.0298573717, 0.0331095627
#>     OOB errors NMSE:           0.1056990608, 0.5842289156, 0.0284012846, 0.0514466889, 0.1495513574
#>     diff. convergence measure: 0.0306507669
#>     time:                      0.21 seconds
#> 
#>   missForest iteration 4 in progress...done!
#>     OOB errors MSE:            0.0727207271, 0.0813881577, 0.0866222306, 0.0283070915, 0.0325095776
#>     OOB errors NMSE:           0.1020120627, 0.513502913, 0.0298341806, 0.0487754296, 0.1468413069
#>     diff. convergence measure: 0.0156722829
#>     time:                      0.2 seconds
#> 
#>   missForest iteration 5 in progress...done!
#>     OOB errors MSE:            0.0765218346, 0.0823391774, 0.0822525938, 0.0281909296, 0.0273295562
#>     OOB errors NMSE:           0.1073442264, 0.5195031889, 0.0283292028, 0.0485752731, 0.1234438601
#>     diff. convergence measure: 0.0027540283
#>     time:                      0.2 seconds
#> 
#>   missForest iteration 6 in progress...done!
#>     OOB errors MSE:            0.0904249301, 0.0809881436, 0.086694038, 0.028991682, 0.0295946342
#>     OOB errors NMSE:           0.1268473792, 0.510979101, 0.0298589123, 0.0499550349, 0.1336749067
#>     diff. convergence measure: -0.0048239166
#>     time:                      0.22 seconds

print(missForest_object$OOB_err)
#>    iteration     variable        MSE       NMSE        MER  macro_F1 F1_score
#> 1          0 Sepal.Length         NA 1.00000000         NA        NA       NA
#> 2          0  Sepal.Width         NA 1.00000000         NA        NA       NA
#> 3          0 Petal.Length         NA 1.00000000         NA        NA       NA
#> 4          0  Petal.Width         NA 1.00000000         NA        NA       NA
#> 5          0      Species         NA 1.00000000         NA        NA       NA
#> 6          1 Sepal.Length 0.27338776 0.38350619         NA        NA       NA
#> 7          1  Sepal.Width 0.11749213 0.74129397         NA        NA       NA
#> 8          1 Petal.Length 0.29743780 0.10244267         NA        NA       NA
#> 9          1  Petal.Width 0.04061759 0.06998743         NA        NA       NA
#> 10         1      Species 0.04450782 0.20103573 0.08000000 0.9222896       NA
#> 11         2 Sepal.Length 0.07115537 0.09981619         NA        NA       NA
#> 12         2  Sepal.Width 0.11173341 0.70496046         NA        NA       NA
#> 13         2 Petal.Length 0.13505455 0.04651510         NA        NA       NA
#> 14         2  Petal.Width 0.03174463 0.05469859         NA        NA       NA
#> 15         2      Species 0.03688197 0.16659080 0.08000000 0.9228758       NA
#> 16         3 Sepal.Length 0.07534906 0.10569906         NA        NA       NA
#> 17         3  Sepal.Width 0.09259795 0.58422892         NA        NA       NA
#> 18         3 Petal.Length 0.08246188 0.02840128         NA        NA       NA
#> 19         3  Petal.Width 0.02985737 0.05144669         NA        NA       NA
#> 20         3      Species 0.03310956 0.14955136 0.08000000 0.9228758       NA
#> 21         4 Sepal.Length 0.07272073 0.10201206         NA        NA       NA
#> 22         4  Sepal.Width 0.08138816 0.51350291         NA        NA       NA
#> 23         4 Petal.Length 0.08662223 0.02983418         NA        NA       NA
#> 24         4  Petal.Width 0.02830709 0.04877543         NA        NA       NA
#> 25         4      Species 0.03250958 0.14684131 0.06666667 0.9355050       NA
#> 26         5 Sepal.Length 0.07652183 0.10734423         NA        NA       NA
#> 27         5  Sepal.Width 0.08233918 0.51950319         NA        NA       NA
#> 28         5 Petal.Length 0.08225259 0.02832920         NA        NA       NA
#> 29         5  Petal.Width 0.02819093 0.04857527         NA        NA       NA
#> 30         5      Species 0.02732956 0.12344386 0.08000000 0.9228758       NA
#> 31         6 Sepal.Length 0.09042493 0.12684738         NA        NA       NA
#> 32         6  Sepal.Width 0.08098814 0.51097910         NA        NA       NA
#> 33         6 Petal.Length 0.08669404 0.02985891         NA        NA       NA
#> 34         6  Petal.Width 0.02899168 0.04995503         NA        NA       NA
#> 35         6      Species 0.02959463 0.13367491 0.06666667 0.9355050       NA
#> 36         7 Sepal.Length         NA         NA         NA        NA       NA
#> 37         7  Sepal.Width         NA         NA         NA        NA       NA
#> 38         7 Petal.Length         NA         NA         NA        NA       NA
#> 39         7  Petal.Width         NA         NA         NA        NA       NA
#> 40         7      Species         NA         NA         NA        NA       NA
#> 41         8 Sepal.Length         NA         NA         NA        NA       NA
#> 42         8  Sepal.Width         NA         NA         NA        NA       NA
#> 43         8 Petal.Length         NA         NA         NA        NA       NA
#> 44         8  Petal.Width         NA         NA         NA        NA       NA
#> 45         8      Species         NA         NA         NA        NA       NA
#> 46         9 Sepal.Length         NA         NA         NA        NA       NA
#> 47         9  Sepal.Width         NA         NA         NA        NA       NA
#> 48         9 Petal.Length         NA         NA         NA        NA       NA
#> 49         9  Petal.Width         NA         NA         NA        NA       NA
#> 50         9      Species         NA         NA         NA        NA       NA
#> 51        10 Sepal.Length         NA         NA         NA        NA       NA
#> 52        10  Sepal.Width         NA         NA         NA        NA       NA
#> 53        10 Petal.Length         NA         NA         NA        NA       NA
#> 54        10  Petal.Width         NA         NA         NA        NA       NA
#> 55        10      Species         NA         NA         NA        NA       NA

These can be plotted in a graph if visual inspection seems easier to understand. We will plot the errors using ggplot2 package.

library(dplyr)
library(tidyr)
library(ggplot2)

missForest_object$OOB_err %>% 
  filter(!is.na(NMSE)) %>% 
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()

Use weighted average in convergence

The convergence of the algorithm is based on the a weighted average of the OOB NMSE for each variable. The weights are proportional to the proportion of missing values in the dataset. There can be situations when this in not the optimal choice. The weights for each variable can be adjusted via the var_weights parameter. In the following example we will create different proportion of missing values for each variable and adjust the weights to be equal.

We will create missing values only on the last two variables (Petal.Width and Species) and first run the imputation with the default setting. Keep in mind that missForestPredict builds models for all variables in the dataset, regardless of the missingness rate. Models will be built also for variables that are complete. Their weight will be zero in the convergence criteria, but imputation models will still be stored for these variables and can be later used on new observations; if unexpectedly missing values will occur in your test set, these will be imputed using these learned models.

library(missForestPredict)
library(dplyr)
library(tidyr)
library(ggplot2)

data(iris)

proportion_missing <- c(0, 0, 0, 0.3, 0.3)

set.seed(2022)
iris_mis <- produce_NA(iris, proportion = proportion_missing)

set.seed(2022)
missForest_object <- missForestPredict::missForest(iris_mis, verbose = TRUE, num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0) Sepal.Width (0) Petal.Length (0) Petal.Width (0.3) Species (0.3) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1235818874, 0.0905813347, 0.1934736751, 0.0442122339, 0.0334087323
#>     OOB errors NMSE:           0.1814386367, 0.4799954852, 0.0625015372, 0.0768086337, 0.1504621215
#>     diff. convergence measure: 0.8863646224
#>     time:                      0.15 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.1125602343, 0.0852139398, 0.0616312306, 0.0374273394, 0.0319707164
#>     OOB errors NMSE:           0.1652570282, 0.4515533639, 0.0199099265, 0.0650214329, 0.1439857633
#>     diff. convergence measure: 0.0091317795
#>     time:                      0.17 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1122819459, 0.0829346456, 0.0586239043, 0.0379312945, 0.0318547888
#>     OOB errors NMSE:           0.1648484548, 0.4394752585, 0.0189384118, 0.0658969396, 0.1434636628
#>     diff. convergence measure: -0.0001767031
#>     time:                      0.16 seconds

# plot convergence
missForest_object$OOB_err %>% 
  filter(!is.na(NMSE)) %>% 
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()

We will further adapt the weights to be equal. You can observe that the results (number of iterations) will be different in this case.


set.seed(2022)
missForest_object <- missForestPredict::missForest(iris_mis, verbose = TRUE, 
                                                   var_weights = setNames(rep(1, ncol(iris_mis)), colnames(iris_mis)), num.threads = 2)
#> Imputation sequence (missing proportion):  Sepal.Length (0) Sepal.Width (0) Petal.Length (0) Petal.Width (0.3) Species (0.3) 
#>   missForest iteration 1 in progress...done!
#>     OOB errors MSE:            0.1235818874, 0.0905813347, 0.1934736751, 0.0442122339, 0.0334087323
#>     OOB errors NMSE:           0.1814386367, 0.4799954852, 0.0625015372, 0.0768086337, 0.1504621215
#>     diff. convergence measure: 0.8097587171
#>     time:                      0.16 seconds
#> 
#>   missForest iteration 2 in progress...done!
#>     OOB errors MSE:            0.1125602343, 0.0852139398, 0.0616312306, 0.0374273394, 0.0319707164
#>     OOB errors NMSE:           0.1652570282, 0.4515533639, 0.0199099265, 0.0650214329, 0.1439857633
#>     diff. convergence measure: 0.0210957799
#>     time:                      0.18 seconds
#> 
#>   missForest iteration 3 in progress...done!
#>     OOB errors MSE:            0.1122819459, 0.0829346456, 0.0586239043, 0.0379312945, 0.0318547888
#>     OOB errors NMSE:           0.1648484548, 0.4394752585, 0.0189384118, 0.0658969396, 0.1434636628
#>     diff. convergence measure: 0.0026209574
#>     time:                      0.28 seconds
#> 
#>   missForest iteration 4 in progress...done!
#>     OOB errors MSE:            0.1142524416, 0.0811637847, 0.0581306484, 0.0371718621, 0.0307816046
#>     OOB errors NMSE:           0.1677414682, 0.4300913689, 0.0187790658, 0.0645775996, 0.1386303882
#>     diff. convergence measure: 0.0025605674
#>     time:                      0.18 seconds
#> 
#>   missForest iteration 5 in progress...done!
#>     OOB errors MSE:            0.1150285581, 0.0836558658, 0.0570142275, 0.0373563133, 0.0323224142
#>     OOB errors NMSE:           0.1688809356, 0.4432970442, 0.0184184068, 0.0648980414, 0.1455696963
#>     diff. convergence measure: -0.0042488467
#>     time:                      0.15 seconds

# plot convergence
missForest_object$OOB_err %>% 
  filter(!is.na(NMSE)) %>% 
  ggplot(aes(iteration, NMSE, col = variable)) +
  geom_point() +
  geom_line()

Assess imputation error when the true values are known

Post imputation, the error can be checked using the evaluate_imputation_error function. This can be done, of course, only when the true values (passed via xtrue) are known. The errors are calculated as differences from the true values.

As Species is a categorical variable and it is imputed with one of the classes of the variable (setosa, versicolor or virginica) and not with probabilities, only the MER (missclassification error rate) can be calculated post imputation.

By default, evaluate_imputation_error returns the MSE and NMSE using only the missing values, while the OOB error in the convergence criteria is calculated using only the non-missing values.

evaluate_imputation_error(missForest_object$ximp, iris_mis, iris)
#>       variable        MSE       NMSE        MER  macro_F1 F1_score
#> 1 Petal.Length 0.00000000 0.00000000         NA        NA       NA
#> 2  Petal.Width 0.02745403 0.04811037         NA        NA       NA
#> 3 Sepal.Length 0.00000000 0.00000000         NA        NA       NA
#> 4  Sepal.Width 0.00000000 0.00000000         NA        NA       NA
#> 5      Species         NA         NA 0.02222222 0.9791463       NA

References

Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bull. Am. Iris Soc. 59: 2–5.
Brier, Glenn W et al. 1950. “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review 78 (1): 1–3.
Stekhoven, Daniel J, and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.