Abstract

In this vignette, we go through creating balanced partitions for train/test sets and balanced folds for cross-validation with the R package groupdata2. We write a simple cross-validation function and use it on some linear regression models.

The purpose of the vignette is to give a basic understanding of cross-validation. We recommend using the cvms package in combination with groupdata2 for actual cross-validation tasks.

groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning and splitting of data.

For a more extensive description of groupdata2, please see Description of groupdata2

Contact author at r-pkgs@ludvigolsen.dk

Introduction

In this vignette, we will train a couple of linear regression models on some data. We will use cross-validation with balanced folds to find the best model, and then test that model on a subset of the original data that we haven’t used for training the model.

Our first task will be to split the dataset into a training set and a test set using partition(). Then, we will create folds using the fold() function. We will code up a simple cross-validation function and put some models to the test!

While we will be writing our own code for cross-validation, it is recommended to use the cvms package for this functionality, when working with linear or logistic regression models.

groupdata2 functions in focus

partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same partition.

fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same fold. If we want to run repeated cross-validation, it can create multiple unique fold columns at once.

What is cross-validation?

The essence of cross-validation is to test a model against data that it hasn’t been trained on, i.e. estimating out-of-sample error. It is done by first dividing the data into groups called folds.

Say we choose to divide the data into 5 folds. Then, in the first iteration, we train a model on the first four folds and test it on the fifth fold. In the second iteration, we then train on folds 2,3,4,5 and test on fold 1. We continue changing which fold is the test fold until all folds have been test folds (i.e. we train and test 5 times in total). In the end we get the average performance of the models and compare these to other cross-validated models.

The model with the lowest average error is believed to be the best at predicting unseen data from the same population(s) and thus chosen for further interpretation / use.

This is a great technique, and fold() makes it even more powerful.

Why training and test sets?

Even with cross-validation, we have a risk of overfitting our models to our data. That is, while we get really good predictions for our current data, it won’t generalize to new data that we might gather in the future.

To test if this is the case, we store some of the data and don’t touch it before we feel confident that we have found the best model. Then we use the model to predict the targets / dependent variable in that data. If the results are much worse on the test set, it likely means that our model is suffering from overfitting! Then we might go back and adjust the models, until we’re pretty confident again. We might also change the number of folds or run repeated cross-validation instead.

Generally though, we should use the test set as few times (>1) as possible, as running this cycle too many times could accidentally find a model that is a good predictor of the specific datapoints in the test set, but not in general.

If the test set results are instead somewhat similar to the cross-validation results, these are the results that we report (possibly along with the cross-validation results).

The data

Let’s say, we have scored 10 participants with either of two diagnoses 'a' and 'b' on a very interesting task, that you are free to call ‘the task’.

# Attach some packages
library(groupdata2)
library(dplyr)
library(knitr) # kable()
library(lmerTest) #lmer()


# Create data frame
df <- data.frame("participant" = factor(as.integer(
                                        rep(c('1','2', '3', '4', '5', 
                                        '6', '7', '8', '9', '10'), 3))),
                "age" = rep(c(20,23,27,21,32,31,43,21,34,32), 3),
                "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 
                                           'a', 'a', 'a', 'b', 'b'), 3)),
                "score" = c(10,24,15,35,24,14,11,16,33,29,  # for 1st session
                            24,40,30,50,54,25,35,32,53,55,  # for 2nd session
                            45,67,40,78,62,30,41,44,66,81)) # for 3rd session

# Order by participant
df <- df %>% arrange(participant)

# Remove index
rownames(df) <- NULL

# Add session info
df$session <- as.integer(rep(c('1','2', '3'), 10))

# Show the data frame
kable(df, align = 'c')

participant	age	diagnosis	score	session
1	20	a	10	1
1	20	a	24	2
1	20	a	45	3
2	23	b	24	1
2	23	b	40	2
2	23	b	67	3
3	27	a	15	1
3	27	a	30	2
3	27	a	40	3
4	21	b	35	1
4	21	b	50	2
4	21	b	78	3
5	32	b	24	1
5	32	b	54	2
5	32	b	62	3
6	31	a	14	1
6	31	a	25	2
6	31	a	30	3
7	43	a	11	1
7	43	a	35	2
7	43	a	41	3
8	21	a	16	1
8	21	a	32	2
8	21	a	44	3
9	34	b	33	1
9	34	b	53	2
9	34	b	66	3
10	32	b	29	1
10	32	b	55	2
10	32	b	81	3

As we can see, there are 5 participants for each diagnosis, and they all seem to have gotten better at the task throughout the three sessions they went through.

Creating train/test sets

In this step, we will split our data into two subsets called train_set and test_set. The main point will be, that we must avoid leakage between the two sets, before it is of any use to us.

What is leakage?

If we were to split the data randomly, with 80 percent going to the training set, and 20 percent going to the test set, we would likely have some of the participants in both the training set AND the test set. But our motivation for having a test set is that we want to know how well our model predicts new, future participants, not how well it knows the ones it saw during training. Furthermore, if our model is overfitted to those participants, our test set might not warn us of this. With this approach, we could get a really low error on both the test set and the training set even though our model is useless outside of the data, we’re working with.

So, how do we deal with this? In this case, it’s as simple as making sure each participant is only in one of the data sets. We can do this with partition() and the id_col argument.


xpectr::set_test_seed(1) # For reproducibility

# Split data in 20/80 (percentage)
partition(df, p = 0.2, id_col = "participant") %>% 
  .[1] %>% # See only the test set 
  kable()  # Pretty tables :)

	participant	age	diagnosis	score	session
4	10	32	b	29	1
5	10	32	b	55	2
6	10	32	b	81	3
16	5	32	b	24	1
17	5	32	b	54	2
18	5	32	b	62	3

The above shows the test set. It contains 2 participants (10 and 5). They both have all 3 sessions in this subset, meaning they are not in the training set! Perfect!

But ehmm.. if we look at the diagnosis column, they both have the diagnosis 'b'. If we would like to know how well our future model classifies both of the diagnoses, this won’t do. It would be nicer if we have somewhat the same balance of both diagnoses in both the training set and the test set. This is luckily what the cat_col argument is for.

More specifically (behind the scenes), it first subsets the full dataset by each class in the categorical column. Then, it creates partitions in each subset and merges the partitions in the end. When using both the id_col and cat_col arguments, it first subsets by class, then partitions by the unique values in id_col. This means, that the final partition sizes might not always be exactly as specified, depending on the data. Therefore, it is good practice to look at the distribution of participants, classes, etc. in the final partitions, which we will do after trying out that cat_col argument to get our final test and train sets!


xpectr::set_test_seed(1) # For reproducibility

# Split data in 20/80 (percentage)
parts <- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')

test_set <- parts[[1]]
train_set <- parts[[2]]

# Show test_set
test_set %>% kable()

participant	age	diagnosis	score	session
8	21	a	16	1
8	21	a	32	2
8	21	a	44	3
10	32	b	29	1
10	32	b	55	2
10	32	b	81	3

Now, our test set contains 2 participants (8 and 10), and the diagnosis column contains 50 percent a’s and 50 percent b’s. Let’s count the diagnoses in the training set as well.

train_set %>% 
  count(diagnosis) %>% 
  kable(align = 'c')

diagnosis	n
a	12
b	12

There are 12 rows for each diagnosis in the training set. In this case, we know that the rest of the participants are all “fully” in this dataset. But let’s count them anyway, so we will remember to do it in the future.

train_set %>% 
  count(participant) %>% 
  kable(align = 'c')

participant	n
1	3
2	3
3	3
4	3
5	3
6	3
7	3
9	3

Creating folds for cross-validation

In this section, we will create balanced folds for cross-validation. The thought process resembles that in the previous section. fold() basically just creates a number of similarly sized partitions. Where partition() returned a list of data frames (this is optional, btw.), fold() will return the entire training set with a new column called ".folds". This will be used directly in the cross-validation function to subset the data on each iteration. Remember, that the folds take shifts at being the cross-validation test set (not to be confused with the test set we created in the previous step).

As we’re basically recreating the training / test set scenario that we discussed previously, we want to avoid leakage between the folds. We would also like somewhat balanced distributions of the diagnoses between the folds, though this depends on the context. Let’s create our balanced folds:

xpectr::set_test_seed(1) # For reproducibility

train_set <- fold(train_set, k = 4, cat_col = 'diagnosis', id_col = 'participant')

# Order by .folds
train_set <- train_set %>% arrange(.folds)

train_set %>% kable()

participant	age	diagnosis	score	session	.folds
7	43	a	11	1	1
7	43	a	35	2	1
7	43	a	41	3	1
2	23	b	24	1	1
2	23	b	40	2	1
2	23	b	67	3	1
1	20	a	10	1	2
1	20	a	24	2	2
1	20	a	45	3	2
5	32	b	24	1	2
5	32	b	54	2	2
5	32	b	62	3	2
6	31	a	14	1	3
6	31	a	25	2	3
6	31	a	30	3	3
4	21	b	35	1	3
4	21	b	50	2	3
4	21	b	78	3	3
3	27	a	15	1	4
3	27	a	30	2	4
3	27	a	40	3	4
9	34	b	33	1	4
9	34	b	53	2	4
9	34	b	66	3	4

The training set now contains the ".folds" column. As fold() groups the output data frame by this column, we can count the number of rows for each diagnosis and participant per fold like so:

train_set %>% 
  count(participant, diagnosis) %>% 
  kable(align = 'c')

.folds	participant	diagnosis	n
1	2	b	3
1	7	a	3
2	1	a	3
2	5	b	3
3	4	b	3
3	6	a	3
4	3	a	3
4	9	b	3

Fold 1 contains participants 2 and 7 with 3 rows each. Participant 2 has the diagnosis 'b' and participant 7 has the diagnosis 'a'. This pattern is the same for all folds. This dataset was created for the purpose of showing these functions, so we should expect real world data to be less perfectly distributed and remember to run some checks of the created folds and partitions. Also be aware, that the maximum number of folds is restricted by the number of participants in the classes, if using the id_col and the cat_col arguments together. Remember, that the function first subsets the dataset by cat_col and then creates groups (folds) from the unique values in id_col in each subset, before merging the subsets. So if you only have 3 participants in one of the classes, this will be the maximum number of folds you can create.

Cross-validation

What’s left, you ask? Well, now we get to have fun, training and comparing models using cross-validation! If you just came for partition() and fold(), you can skip this part. Personally, I will move on and walk you (you won’t let me walk alone, will you? pleaase?) through a simple cross-validation function for testing a set of models. The models will be predicting the scores of the participants.

Cross-validation function

We have 4 folds in our training set, so we want to train our models on 3 of the folds and test on the last fold. All the folds should be used as test fold once. While there are often faster alternatives to a for loop, we will use it to illustrate the process. We will create a simple cross-validation function where we can specify the model to test, and whether it has random effects. The performance of the model will be measured with RMSE (Root Mean Square Error).

# RMSE metric
rmse <- function(predictions, targets){
  sqrt(mean((predictions - targets)^2))
}

crossvalidate <- function(data, k, model, dependent, random = FALSE){
  # 'data' is the training set with the ".folds" column
  # 'k' is the number of folds we have
  # 'model' is a string describing a linear regression model formula
  # 'dependent' is a string with the name of the score column we want to predict
  # 'random' is a logical (TRUE/FALSE); do we have random effects in the model?
  
  # Initialize empty list for recording performances
  performances <- c()
  
  # One iteration per fold
  for (fold in 1:k){
    
    # Create training set for this iteration
    # Subset all the datapoints where .folds does not match the current fold
    training_set <- data[data$.folds != fold,]
    
    # Create test set for this iteration
    # Subset all the datapoints where .folds matches the current fold
    testing_set <- data[data$.folds == fold,]
    
    ## Train model

    # If there is a random effect,
    # use lmer() to train model
    # else use lm()

    if (isTRUE(random)){

      # Train linear mixed effects model on training set
      model <- lmer(model, training_set, REML = FALSE)

    } else {

      # Train linear model on training set
      model <- lm(model, training_set)

    }

    ## Test model

    # Predict the dependent variable in the testing_set with the trained model
    predicted <- predict(model, testing_set, allow.new.levels = TRUE)

    # Get the Root Mean Square Error between the predicted and the observed
    RMSE <- rmse(predicted, testing_set[[dependent]])

    # Add the RMSE to the performance list
    performances[fold] <- RMSE


  }

  # Return the mean of the recorded RMSEs
  return(c('RMSE' = mean(performances)))

}

Linear regression models

We could have a hypothesis that people with the diagnosis 'b' tend to be better at the experiment. This could be tested by a simple linear model.

lm(score ~ diagnosis, df) %>% 
  broom::tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic     p.value
#>   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)     27.5      4.07      6.74 0.000000253
#> 2 diagnosisb      22.6      5.76      3.92 0.000515

The linear model supports the hypothesis, as the average score of participants with diagnosis 'b' is larger than that of participants with diagnosis 'a', and as this difference is statistically significant, we can reasonably reject the null-hypothesis that there is no difference between the two diagnoses.

To improve our model, we might want to include the information we have about age and session. Perhaps, the older participants do better than the younger? And perhaps, participants with the diagnosis 'b' are better at learning over time (session)? By including such information in our model, we might be better at predicting the score. We could also use participant as a random effect, to factor out the personal differences.

Let’s list a bunch of possible model formulas that we can then compare with cross-validation. The cross-validation function needs the model to be passed in the format below (a string). Instead of looking at summaries for each model, we will find the best model with cross-validation and only look at the summary for that one. Notice that when we want to compare models, we want to keep the same random effects for all the models, so we are only comparing the combination of fixed effects.

m0 <- 'score ~ 1 + (1 | participant)'
m1 <- 'score ~ diagnosis + (1 | participant)'
m2 <- 'score ~ diagnosis + age + (1 | participant)'
m3 <- 'score ~ diagnosis + session + (1 | participant)'
m4 <- 'score ~ diagnosis * session + (1 | participant)'
m5 <- 'score ~ diagnosis * session + age + (1 | participant)'

Now, let’s cross-validate the 6 models. If you get a message saying "boundary (singular) fit", it’s because the model is too complex for our small dataset. To avoid this message, try removing the random effects from the models.

m0
#> [1] "score ~ 1 + (1 | participant)"
crossvalidate(train_set, k = 4, model = m0, dependent = 'score', random = TRUE)
#>     RMSE 
#> 18.27222

m1
#> [1] "score ~ diagnosis + (1 | participant)"
crossvalidate(train_set, k = 4, model = m1, dependent = 'score', random = TRUE)
#>    RMSE 
#> 14.7661

m2
#> [1] "score ~ diagnosis + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m2, dependent = 'score', random = TRUE)
#>     RMSE 
#> 15.28533

m3
#> [1] "score ~ diagnosis + session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m3, dependent = 'score', random = TRUE)
#>     RMSE 
#> 6.176477

m4
#> [1] "score ~ diagnosis * session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m4, dependent = 'score', random = TRUE)
#>     RMSE 
#> 5.950014

m5
#> [1] "score ~ diagnosis * session + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m5, dependent = 'score', random = TRUE)
#>     RMSE 
#> 7.004665

The model m4 has the lowest error on average and so, we assume that it is the best predictor of out-of-sample data. To make sure it doesn’t overfit the data, let’s look at the error on the test set.

# Training the model on the full training set
model_m4 <- lmer(m4, train_set, REML = FALSE)

# Predict the dependent variable in the test set with the trained model
predicted <- predict(model_m4, test_set, allow.new.levels = TRUE)

# Get the Root Mean Square Error between the predicted and the observed
RMSE <- rmse(predicted, test_set[['score']])
RMSE
#> [1] 6.418155

6.42 is a bit higher than the cross-validated average, but it’s so close that we’re not afraid of overfitting. Be aware that the scale of the RMSE is dependent on the scale of the dependent variable in our dataset, so you might find some models with much higher RMSEs than this one. What matters is the relative difference between models, and that the best model will have the lowest error.

Let’s look at its summary:

model_m4 %>%
  summary()
#> Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
#>   method [lmerModLmerTest]
#> Formula: m4
#>    Data: train_set
#> 
#>      AIC      BIC   logLik deviance df.resid 
#>    157.0    164.1    -72.5    145.0       18 
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -1.8723 -0.6998 -0.0137  0.7042  1.6556 
#> 
#> Random effects:
#>  Groups      Name        Variance Std.Dev.
#>  participant (Intercept)  3.714   1.927   
#>  Residual                21.398   4.626   
#> Number of obs: 24, groups:  participant, 8
#> 
#> Fixed effects:
#>                    Estimate Std. Error      df t value Pr(>|t|)    
#> (Intercept)          0.1667     3.6621 22.2764   0.046   0.9641    
#> diagnosisb           9.4167     5.1790 22.2764   1.818   0.0825 .  
#> session             13.2500     1.6355 16.0000   8.102 4.71e-07 ***
#> diagnosisb:session   6.3750     2.3129 16.0000   2.756   0.0141 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation of Fixed Effects:
#>             (Intr) dgnssb sessin
#> diagnosisb  -0.707              
#> session     -0.893  0.632       
#> dgnssb:sssn  0.632 -0.893 -0.707

In this model, we have a significant interaction between diagnosis and session. The interpretation of this result would be quite different from that of the first model, we tried.

Notice that the cross-validated m3 model is so close to the cross-validated m4 model that we can’t be completely sure which is better. The observed differences might as well stem from the randomness in the partition() or fold() steps. In such a case, we could turn to repeated cross-validation, where we create multiple fold columns (by setting num_fold_cols in fold()) and average their cross-validated results. If the models are still very close, we might consider reporting all three models or at least checking if the interpretations of the three models differ. These things are highly dependent on the context.

When working with linear and logistic regression, the cvms package contains functions for (repeated) cross-validation and is specifically designed to work well with groupdata2.

Cross-validation with groupdata2

Ludvig Renbo Olsen

2024-12-18