Modal counts and frequencies

In addition to finding a vector’s modes, you might be interested in some metadata about them, such as how many modes there are, how often they occur, and whether they are meaningful at all.

This vignette covers all the functions for modal metadata. At the end, it discusses a special feature of these functions, the max_unique argument.

Modal count

mode_count() computes the number of modes:

mode_count(c(5, 5, 6))
#> [1] 1
mode_count(c(5, 5, 6, 6, 7))
#> [1] 2

Even with missing values, the number of modes is sometimes known. Here, it can only be 1: even if the NA is secretly "b", then "b" would appear only twice, whereas "a" appears three times.

mode_count(c("a", "a", "a", "b", NA))
#> [1] 1

This only works if the full set of modes can be determined. Below, NA could secretly be 7, 8, or any other value. If it’s 8, then 7 and 8 are equally frequent, so both are modes. Otherwise, 7 is the only mode. Since we lack this information, the number of modes is unknown.

mode_count(c(7, 7, 7, 8, 8, NA))
#> [1] NA
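The reasoning behind these results can be sketched in base R. This is only an illustration, not moder’s actual implementation: the helper name sketch_mode_count() is hypothetical, and the sketch ignores the max_unique argument discussed later. It reports a single mode only if no other value could catch up with the most frequent one, even if every NA counted toward that other value.

```r
# Hypothetical helper, not part of moder: a simplified sketch of the
# reasoning in the text above.
sketch_mode_count <- function(x) {
  n_na <- sum(is.na(x))
  # Frequencies of the known values, highest first (table() drops NAs)
  freqs <- sort(as.vector(table(x)), decreasing = TRUE)
  if (length(freqs) == 0L) {
    return(NA_integer_)
  }
  # Number of known values tied at the top frequency
  n_top <- sum(freqs == freqs[1])
  if (n_na == 0L) {
    return(n_top)
  }
  # Frequency of the next most frequent known value (0 if there is none)
  runner_up <- if (length(freqs) > n_top) freqs[n_top + 1L] else 0L
  # With NAs present, the count is certain only if there is a single
  # leader that no other value can reach, even with all NAs on its side
  if (n_top == 1L && runner_up + n_na < freqs[1]) 1L else NA_integer_
}

sketch_mode_count(c("a", "a", "a", "b", NA))
#> [1] 1
sketch_mode_count(c(7, 7, 7, 8, 8, NA))
#> [1] NA
```

The sketch returns NA whenever a tie at the top could be broken or a runner-up could draw level with the leader.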

Use mode_count_range() in such cases. It will determine the minimal and maximal number of modes, never returning NA. For more on mode_count_range(), see below, section Maximal number of unique values.

mode_count_range(c(7, 7, 7, 8, 8, NA))
#> [1] 1 2

Modal frequency

mode_frequency() counts the instances of a vector’s modes in the vector:

mode_frequency(c(4, 4, 5))
#> [1] 2
mode_frequency(c(4, 4, 4, 5))
#> [1] 3

Missing values are an issue here, even if the mode is obvious. Each NA might be another instance of the mode, so the frequency is unknown:

mode_frequency(c(1, 1, 1, 1, 2, NA, NA))
#> [1] NA

With mode_frequency_range(), at least the minimal and maximal frequencies can be determined. It never returns NA. The minimum frequency supposes that no NAs represent the mode; the maximum frequency supposes that all of them do. Here, there are four instances of 1 if the NAs are not counted, and six if they are:

mode_frequency_range(c(1, 1, 1, 1, 2, NA, NA))
#> [1] 4 6
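The two bounds follow directly from that description. As a minimal base R sketch (the helper name sketch_frequency_range() is hypothetical, it is not how moder computes the result, and it assumes at least one non-missing value):

```r
# Hypothetical helper, not part of moder. Assumes x has at least one
# known (non-NA) value.
sketch_frequency_range <- function(x) {
  n_na <- sum(is.na(x))
  # Highest frequency among the known values (table() drops NAs)
  top <- max(table(x))
  # Lower bound: no NA is the mode. Upper bound: every NA is the mode.
  c(top, top + n_na)
}

sketch_frequency_range(c(1, 1, 1, 1, 2, NA, NA))
#> [1] 4 6
```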

Trivial modes

Related to frequencies, mode_is_trivial() flags cases where the mode is not meaningful. It returns TRUE if all values are equally frequent. Modality is trivial in this case because it is a property of all values taken together, not of some values over others.

mode_is_trivial(c("a", "b", "c"))
#> [1] TRUE
mode_is_trivial(c(1, 1, 2, 2, 3, 3))
#> [1] TRUE
mode_is_trivial(c(1, 1, 1, 2, 3))
#> [1] FALSE
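For data without missing values, the criterion amounts to checking whether all frequencies are equal. The following base R sketch illustrates this; the helper name sketch_is_trivial() is hypothetical, and the sketch deliberately refuses input with NAs, which the real function handles.

```r
# Hypothetical helper, not part of moder. Only covers complete data.
sketch_is_trivial <- function(x) {
  stopifnot(!anyNA(x))
  # Frequency of each unique value
  freqs <- as.vector(table(x))
  # Trivial if every value is as frequent as the first one
  all(freqs == freqs[1])
}

sketch_is_trivial(c("a", "b", "c"))
#> [1] TRUE
sketch_is_trivial(c(1, 1, 2, 2, 3, 3))
#> [1] TRUE
sketch_is_trivial(c(1, 1, 1, 2, 3))
#> [1] FALSE
```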

The mode is clearly not a useful concept in the first two cases (cf. Härdle, Klinke, and Rönz 2015, 40). Some authors say that the mode is not defined if each value appears only once (Manikandan 2011, 214). However, it is certainly possible for the maximal frequency to be 1, so the only way for such distributions not to have any modal values would be a specific exception in the definition of the mode. The same applies to uniformly distributed data in general. No such exception appears in any definition that I am aware of. Even if it were to be suggested, I think the more elegant solution would be to accept all values of uniformly distributed data as trivially modal.

Maximal number of unique values

All of moder’s functions for metadata, such as mode_is_trivial() and mode_count_range(), have a max_unique argument. It allows you to state how many unique values your data can have at most. Why is this important? These two functions care about possible modes beyond the known values. In other words, their results might depend on whether or not the NAs can mask modal values that don’t even occur among the known values! If that is possible, it presents an additional source of uncertainty.

The max_unique argument limits the possible number of such wildcard modes. Specify it as an integer that is the maximal number of unique values. If there can be no other values than those already known, specify max_unique as "known" instead. Always use "known" if you have factor data, or you will get a warning. (The idea behind factors is that all possible values are known at the outset.)

Note that this argument does not represent an analytical decision but simply conveys your knowledge of the data to the computer. There is no meaningful choice to make: If the maximum number of unique values is known, you must specify max_unique; if not, you must not do so. Otherwise, you risk incorrect results if any values are missing. The default is NULL because the baseline assumption is always that nothing is known about missing values except for their number.

Below is an example. If two of the NAs represent 8 and the other three stand for a third value, all values appear with the same frequency. In this case, all values would trivially be modes in the sense of mode_is_trivial(). This scenario is not certain at all, but it can’t be ruled out either, so the function returns NA. As mode_count_range() shows, there could be three modes at most. (The minimum is always one if any values are missing.)

x1 <- c(7, 7, 7, 8, NA, NA, NA, NA, NA)
mode_is_trivial(x1)
#> [1] NA
mode_count_range(x1)
#> [1] 1 3
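To see why the result is uncertain, it can help to fill in the NAs by hand and count the modes of each completed vector. The base R sketch below does this for two scenarios. The helpers fill_na() and count_modes() are hypothetical, and 9 is an arbitrary stand-in for the unknown third value.

```r
x1 <- c(7, 7, 7, 8, NA, NA, NA, NA, NA)

# Hypothetical helpers, not part of moder
fill_na <- function(x, values) {
  x[is.na(x)] <- values
  x
}
count_modes <- function(x) {
  freqs <- table(x)
  sum(freqs == max(freqs))
}

# Scenario A: two NAs are 8, three are a new value (9 chosen arbitrarily).
# Every value then appears three times, so all three are modes.
scenario_a <- fill_na(x1, c(8, 8, 9, 9, 9))
count_modes(scenario_a)
#> [1] 3

# Scenario B: all NAs are 7, which leaves 7 as the only mode.
scenario_b <- fill_na(x1, rep(7, 5))
count_modes(scenario_b)
#> [1] 1
```

Since both scenarios are consistent with the data, neither the triviality of the mode nor the exact number of modes can be determined.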

The picture is different if we know that each missing value must represent a known value, i.e., 7 or 8. Even if two NAs stand for 8, the other three can’t be evenly distributed across 7 and 8, so one of these values must be more frequent than the other one. This makes the mode nontrivial. Also, there can only be one mode, so both the minimal and maximal mode counts are 1.

x1
#> [1]  7  7  7  8 NA NA NA NA NA
mode_is_trivial(x1, max_unique = "known")
#> [1] FALSE
mode_count_range(x1, max_unique = "known")
#> [1] 1 1
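With only two possible values, this claim can be checked by brute force. The base R sketch below is an illustration rather than moder’s implementation. It tries every way of splitting the five NAs between 7 and 8 and counts the modes each time.

```r
known <- c(7, 7, 7, 8)
n_na <- 5

# For k = 0, ..., 5: suppose k NAs are 7 and the remaining ones are 8
mode_counts <- vapply(0:n_na, function(k) {
  filled <- c(known, rep(7, k), rep(8, n_na - k))
  freqs <- table(filled)
  # Number of values that share the highest frequency
  sum(freqs == max(freqs))
}, integer(1))

range(mode_counts)
#> [1] 1 1
```

Every split leaves exactly one mode because nine values cannot be divided evenly between two unique values.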

Three more functions have a max_unique parameter: mode_count(), mode_frequency(), and mode_frequency_range(). For them, however, it only matters in corner cases. See this GitHub issue.

References