library(moder)
In addition to finding a vector’s modes, you might be interested in some metadata about them:
A vector’s modal count is the number of its modes.
A vector’s modal frequency is the number of times that any single mode appears in the vector.
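For a vector without missing values, both definitions can be sketched in base R. This is only an illustration of the two concepts, not moder's implementation; the names freq, modal_frequency, and modal_count are made up for this example.

```r
# Illustrative base R sketch; assumes `x` has no missing values.
x <- c(5, 5, 6, 6, 7)

# How often each unique value occurs
freq <- table(x)

# Modal frequency: how often the most common value(s) appear
modal_frequency <- max(freq)
modal_frequency
#> [1] 2

# Modal count: how many unique values reach that frequency
modal_count <- sum(freq == modal_frequency)
modal_count
#> [1] 2
```

Missing values are deliberately left out here because they are what makes these questions hard.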
This vignette covers all the functions for modal metadata. It closes with a special feature of these functions, the max_unique argument.
mode_count() computes the number of modes:
mode_count(c(5, 5, 6))
#> [1] 1
mode_count(c(5, 5, 6, 6, 7))
#> [1] 2
Even with missing values, the number of modes is sometimes known. It can only be 1 here: even if the NA is secretly "b", "b" would appear twice, but "a" would still appear three times:
mode_count(c("a", "a", "a", "b", NA))
#> [1] 1
All of this only works if the full set of modes can be determined. Below, NA could secretly be 7, 8, or any other value. If it’s 8, both numbers are equally frequent. Otherwise, 7 is the only mode. Since we lack this information, the number of modes is unknown.
mode_count(c(7, 7, 7, 8, 8, NA))
#> [1] NA
Use mode_count_range() in such cases. It will determine the minimal and maximal number of modes, never returning NA. For more on mode_count_range(), see the section "Maximal number of unique values" below.
mode_count_range(c(7, 7, 7, 8, 8, NA))
#> [1] 1 2
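The range above can be reproduced by trying out what the single NA might stand for. The following is a brute-force base R sketch for this one vector, not how moder computes the result; count_modes() is a hypothetical helper, and 9 is an arbitrary stand-in for "any other value".

```r
x <- c(7, 7, 7, 8, 8, NA)
known <- x[!is.na(x)]

# Hypothetical helper: number of modes in a vector without NAs
count_modes <- function(v) {
  freq <- table(v)
  sum(freq == max(freq))
}

# The NA could be 7, 8, or a value not seen so far (9 as a stand-in)
candidates <- c(7, 8, 9)
counts <- vapply(
  candidates,
  function(candidate) count_modes(c(known, candidate)),
  integer(1)
)

# Smallest and largest number of modes across the scenarios
range(counts)
#> [1] 1 2
```

Only the scenario where the NA is 8 yields two modes; in the other two, 7 is the only mode.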
mode_frequency() counts the instances of a vector’s modes in the vector:
mode_frequency(c(4, 4, 5))
#> [1] 2
mode_frequency(c(4, 4, 4, 5))
#> [1] 3
Missing values are an issue here, even if the mode is obvious. Each NA might be another instance of the mode, so the frequency is unknown:
mode_frequency(c(1, 1, 1, 1, 2, NA, NA))
#> [1] NA
With mode_frequency_range(), at least the minimal and maximal frequencies can be determined. It never returns NA. The minimum frequency supposes that no NAs represent the mode; the maximum frequency supposes that all of them do. Here, there are four instances of 1 without counting the NAs, and six when counting them:
mode_frequency_range(c(1, 1, 1, 1, 2, NA, NA))
#> [1] 4 6
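The arithmetic behind these two numbers is simple enough to spell out in base R. This sketch only covers the example vector, where 1 is clearly the mode among the known values; it is not moder's implementation, and the variable names are illustrative.

```r
x <- c(1, 1, 1, 1, 2, NA, NA)

# Number of missing values
n_na <- sum(is.na(x))

# Highest frequency among the known values; table() drops NAs by default
freq_known <- max(table(x))

# Minimum: no NA is the mode. Maximum: every NA is the mode.
c(freq_known, freq_known + n_na)
#> [1] 4 6
```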
Related to frequencies, mode_is_trivial() flags cases where the mode is not meaningful. It returns TRUE if all values are equally frequent. Modality is trivial in this case because it is a property of all values taken together, not of some values over others.
mode_is_trivial(c("a", "b", "c"))
#> [1] TRUE
mode_is_trivial(c(1, 1, 2, 2, 3, 3))
#> [1] TRUE
mode_is_trivial(c(1, 1, 1, 2, 3))
#> [1] FALSE
The mode is clearly not a useful concept in the first two cases (cf. Härdle, Klinke, and Rönz 2015, 40). Some authors say that the mode is not defined if each value appears only once (Manikandan 2011, 214). However, it is certainly possible for the maximal frequency to be 1, so the only way for such distributions not to have any modal values would be a specific exception in the definition of the mode. The same applies to uniformly distributed data in general. No such exception appears in any definition that I am aware of. Even if it were to be suggested, I think the more elegant solution would be to accept all values of uniformly distributed data as trivially modal.
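The criterion for triviality, all values being equally frequent, can be written down in a few lines of base R. This is a sketch for vectors without missing values only; is_trivial() is a hypothetical helper, not moder's mode_is_trivial().

```r
# Hypothetical helper; assumes `v` has no missing values.
is_trivial <- function(v) {
  freq <- table(v)
  # All values are equally frequent if the rarest value
  # occurs as often as the most common one
  min(freq) == max(freq)
}

is_trivial(c("a", "b", "c"))
#> [1] TRUE
is_trivial(c(1, 1, 2, 2, 3, 3))
#> [1] TRUE
is_trivial(c(1, 1, 1, 2, 3))
#> [1] FALSE
```

A maximal frequency of 1 needs no special treatment here: every value reaches it, so every value is a mode.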
All of moder’s functions for metadata, such as mode_is_trivial() and mode_count_range(), have a max_unique argument. It allows you to state how many unique values your data can have at most. Why is this important? The two functions care about possible modes beyond the known values. In other words, their results might depend on whether or not the NAs can mask modal values that don’t even occur among the known values! If that is possible, it presents an additional source of uncertainty.
Conversely, max_unique limits the possible number of such wildcard modes. Specify it as an integer that is the maximal number of unique values. If there can be no other values than those already known, specify max_unique as "known" instead.
Always use "known" if you have factor data, or you will get a warning. (The idea behind factors is that all possible values are known at the outset.)
Note that this argument does not represent an analytical decision but simply conveys your knowledge of the data to the computer. There is no meaningful choice to make: If the maximum number of unique values is known, you must specify max_unique; if not, you must not do so. Otherwise, you risk incorrect results if any values are missing. The default is NULL because the baseline assumption is always that nothing is known about missing values except for their number.
Below is an example. If two of the NAs represent 8 and the other three stand for a third value, all values appear with the same frequency. In this case, all values would trivially be modes in the sense of mode_is_trivial(). This scenario is not certain at all, but it can’t be ruled out either, so the function returns NA. As mode_count_range() shows, there could be three modes at most. (The minimum is always one if any values are missing.)
x1 <- c(7, 7, 7, 8, NA, NA, NA, NA, NA)
mode_is_trivial(x1)
#> [1] NA
mode_count_range(x1)
#> [1] 1 3
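The scenario with three equally frequent values can be made concrete by filling in the NAs by hand. This base R sketch shows just one of many possible scenarios; the value 9 is an arbitrary stand-in for a third value that does not occur among the known ones.

```r
x1 <- c(7, 7, 7, 8, NA, NA, NA, NA, NA)

# One possible scenario: two NAs are 8, three are a new value
# (9 is an assumption chosen for illustration)
scenario <- x1
scenario[is.na(scenario)] <- c(8, 8, 9, 9, 9)

# Frequencies of 7, 8, and 9 in this scenario
as.vector(table(scenario))
#> [1] 3 3 3
```

With all three values at the same frequency, each of them is a mode, which is where the upper bound of three comes from.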
The picture is different if we know that each missing value must represent a known value, i.e., 7 or 8. Even if two NAs stand for 8, the other three can’t be evenly distributed across 7 and 8, so one of these values must be more frequent than the other one. This makes the mode nontrivial. Also, there can only be one mode, so both the minimal and maximal mode counts are 1.
x1
#> [1]  7  7  7  8 NA NA NA NA NA
mode_is_trivial(x1, max_unique = "known")
#> [1] FALSE
mode_count_range(x1, max_unique = "known")
#> [1] 1 1
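That 7 and 8 can never tie when every NA must be one of them can be checked by going through all possible splits of the five NAs. This is a base R sketch of the reasoning for this vector, not moder's method; the variable names are illustrative.

```r
# Known counts in x1: three 7s, one 8, and five NAs
known_7 <- 3
known_8 <- 1
n_na <- 5

# k is the number of NAs that stand for 8; the rest stand for 7
k <- 0:n_na
freq_7 <- known_7 + (n_na - k)
freq_8 <- known_8 + k

# Is there any split where both values are equally frequent?
any(freq_7 == freq_8)
#> [1] FALSE
```

The two frequencies always add up to 9, an odd number, so they cannot be equal. One value is always more frequent than the other, which leaves exactly one mode.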
Three more functions have a max_unique parameter: mode_count(), mode_frequency(), and mode_frequency_range(). However, this only matters for corner cases. See this GitHub issue.