kit provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection.
Key features include:

- Row-wise operations (psum, pmean, pfirst) using OpenMP.
- Vectorized if-else logic (iif, nif, vswitch) that preserves attributes.
- Fast versions of unique, duplicated, and count for vectors and data frames.
- Partial sorting for top-N selection (topn).
- Fast character-to-factor conversion (charToFact) and level manipulation (setlevels).

Most functions are implemented in C and support multi-threading where applicable, which can make them substantially faster than their base R equivalents on large datasets.
Computing row-wise statistics across multiple vectors or data frame
columns is a common task. While base R has pmin() and
pmax(), it lacks efficient equivalents for sum, mean, or
product. kit fills this gap.
psum(), pmean(), and pprod()
compute parallel sum, mean, and product respectively. They accept
multiple vectors or a single list/data frame.

```r
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel sum
psum(x, y, z, na.rm = TRUE)
#> [1] 6 7 8 7

# Parallel mean
pmean(x, y, z, na.rm = TRUE)
#> [1] 2.000000 3.500000 4.000000 2.333333
```

They are particularly useful for data frames, where the whole data frame can be passed as a single argument instead of listing its columns one by one.
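A minimal sketch of the data frame case follows. The scores data frame and its column names are invented for illustration; the call relies only on psum() accepting a single data frame, as described above.

```r
library(kit)

# Hypothetical data: three quiz scores per student, with some missing values
scores <- data.frame(
  q1 = c(10, 20, NA),
  q2 = c(5, NA, 15),
  q3 = c(1, 2, 3)
)

# Pass the data frame itself; each column is treated as one input vector
total <- psum(scores, na.rm = TRUE)
total
#> [1] 16 22 18

# Same values as the base R approach (rowSums adds row names, hence unname)
all.equal(total, unname(rowSums(scores, na.rm = TRUE)))
#> [1] TRUE
```

Here psum() plays the role of rowSums() without the need to convert the columns to a matrix first.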
fpmin(), fpmax(), and prange()
compute parallel minimum, maximum, and range (max - min) respectively.
They complement base R’s pmin() and pmax(),
providing greater performance and the ability to work efficiently with
data frames.

```r
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel minimum
fpmin(x, y, z, na.rm = TRUE)
#> [1] 1 3 4 1

# Parallel maximum
fpmax(x, y, z, na.rm = TRUE)
#> [1] 3 4 4 5

# Parallel range (max - min)
prange(x, y, z, na.rm = TRUE)
#> [1] 2 1 0 4
```

Like psum() and pmean(), these functions
preserve the input type when all inputs have the same type, and
automatically promote to the highest type when inputs are mixed (logical
< integer < double). prange() always returns double
to avoid integer overflow.
pfirst() and plast() return the first or
last non-missing value across a set of vectors. pfirst() is the
equivalent of the SQL COALESCE function.
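As a small illustration of the coalescing behaviour, the sketch below uses three made-up numeric vectors (x, y, and z are not from the package documentation) and picks the first and last non-missing value in each position.

```r
library(kit)

# Hypothetical inputs: each position has at least one non-missing value
x <- c(NA, 2, NA)
y <- c(1, 5, NA)
z <- c(NA, 9, 3)

# First non-missing value per position, like COALESCE(x, y, z)
first_val <- pfirst(x, y, z)
first_val
#> [1] 1 2 3

# Last non-missing value per position
last_val <- plast(x, y, z)
last_val
#> [1] 1 9 3
```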
You can check conditions or count values row-wise with
pall(), pany(), pcount(), and pcountNA().

```r
a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)

# Any TRUE per row?
pany(a, b, c, na.rm = TRUE)
#> [1] TRUE TRUE TRUE TRUE

# Count NAs per row
pcountNA(a, b, c)
#> [1] 1 1 1 0

# Count specific value (e.g., TRUE) per row
pcount(a, b, c, value = TRUE)
#> [1] 2 1 1 2
```

Fast if-else (iif)

Base R’s ifelse() is known to be slow and often strips
attributes (like Date class or factor levels).
iif() is a faster, more robust alternative that preserves
attributes from the yes argument.

```r
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))

# Base ifelse strips class
class(ifelse(dates > "2024-01-01", dates, dates - 1))
#> [1] "numeric"

# iif preserves class
class(iif(dates > "2024-01-01", dates, dates - 1))
#> [1] "Date"
```

It also supports explicit NA handling through its na argument, which sets the value returned where the test is NA.
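A minimal sketch of the NA handling, assuming the na argument of iif() accepts a single value of the same type as yes and no. The score vector and the labels are made up for illustration.

```r
library(kit)

# Hypothetical exam scores; the third student did not sit the exam
score <- c(72, 45, NA, 90)

# Where the test is NA, return the value given in 'na' instead of NA
result <- iif(score >= 50, yes = "pass", no = "fail", na = "absent")
result
#> [1] "pass"   "fail"   "absent" "pass"
```

With base ifelse(), the third element would be NA and would need a separate replacement step.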
Nested if-else (nif)

For multiple conditions, nif() offers a cleaner, more
efficient syntax than nested ifelse() calls, similar to
SQL’s CASE WHEN.
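The sketch below shows the condition/value pairing on a made-up temperature vector. Conditions are checked in order and the first one that holds determines the result, as in CASE WHEN; the labels and thresholds are purely illustrative.

```r
library(kit)

temp <- c(-5, 12, 25, 38)  # hypothetical temperatures in degrees Celsius

# Alternating condition, value pairs; 'default' covers everything else
label <- nif(
  temp < 0,  "freezing",
  temp < 20, "mild",
  temp < 30, "warm",
  default = "hot"
)
label
#> [1] "freezing" "mild"     "warm"     "hot"

# The equivalent nested ifelse() call is harder to read
nested <- ifelse(temp < 0, "freezing",
          ifelse(temp < 20, "mild",
          ifelse(temp < 30, "warm", "hot")))
identical(as.character(label), nested)
#> [1] TRUE
```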
Switch (vswitch, nswitch)

vswitch() maps input values to outputs efficiently.

```r
status_code <- c(1L, 2L, 3L, 1L, 4L)

vswitch(
  x = status_code,
  values = c(1L, 2L, 3L),
  outputs = c("pending", "approved", "rejected"),
  default = "unknown"
)
#> [1] "pending" "approved" "rejected" "pending" "unknown"
```

nswitch() expresses the same mapping with a pairwise syntax, taking each
value followed directly by its output.

```r
nswitch(status_code,
  1L, "pending",
  2L, "approved",
  3L, "rejected",
  default = "unknown"
)
#> [1] "pending" "approved" "rejected" "pending" "unknown"
```

Outputs can also be taken from other vectors (such as data frame columns), and scalar and vector outputs can be mixed in the same call.
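A minimal sketch of mixing scalar and vector outputs follows. The codes and custom_note vectors are hypothetical, and the example assumes that each output passed to nswitch() may be either length 1 or the same length as the input.

```r
library(kit)

# Hypothetical order data: code 1 maps to a fixed label,
# codes 2 and 3 take their text from another vector
codes <- c(1L, 2L, 3L, 1L)
custom_note <- c("none", "approved by A", "rejected by B", "none")

note <- nswitch(codes,
  1L, "pending",    # scalar output, used for every matching element
  2L, custom_note,  # vector output, taken at the matching positions
  3L, custom_note,
  default = "unknown"
)

# Expected elements: pending, approved by A, rejected by B, pending
note
```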
kit provides funique() and fduplicated(), optimized versions of
unique() and duplicated() that are designed to be
faster on both vectors and data frames.
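The following sketch applies funique() and fduplicated() to a small vector and a small data frame. The data are made up, and the example only illustrates the calls; it says nothing about speed.

```r
library(kit)

x <- c("a", "b", "a", "c", "b")

# Distinct values in order of first appearance
uniq_x <- funique(x)
uniq_x
#> [1] "a" "b" "c"

# TRUE marks elements that repeat an earlier one
dup_x <- fduplicated(x)
dup_x
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# Hypothetical data frame: rows are compared across all columns
df <- data.frame(id = c(1L, 1L, 2L), value = c(10, 10, 20))
dup_rows <- fduplicated(df)
dup_rows
#> [1] FALSE  TRUE FALSE
nrow(funique(df))
#> [1] 2
```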
Top N (topn)

Sorting a large vector just to get the top few elements is
inefficient. topn() uses a partial sorting algorithm to
retrieve the indices or values of the top (or bottom) N
elements.
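A short sketch of typical calls is given below. It assumes the decreasing and index arguments are available in the installed version of kit; the input vector is made up.

```r
library(kit)

x <- c(10, 50, 20, 40, 30)

# Indices of the two largest values (largest first)
top_idx <- topn(x, 2L)
top_idx
#> [1] 2 4

# The values themselves rather than their positions
top_val <- topn(x, 2L, index = FALSE)
top_val
#> [1] 50 40

# Indices of the two smallest values
bottom_idx <- topn(x, 2L, decreasing = FALSE)
bottom_idx
#> [1] 1 3

# Base R equivalent of the first call, which sorts the whole vector
order(x, decreasing = TRUE)[1:2]
#> [1] 2 4
```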
charToFact() is a fast alternative to
as.factor() for character vectors, with control over
NA levels.
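The sketch below illustrates the NA control, assuming the argument is called addNA and defaults to TRUE, so that NA becomes a level unless it is switched off. The input vector is made up.

```r
library(kit)

x <- c("b", "a", NA, "b")

# Default: NA is kept as an explicit factor level
f_with_na <- charToFact(x)
anyNA(levels(f_with_na))
#> [1] TRUE

# addNA = FALSE: NA stays a missing value, as with as.factor()
f_no_na <- charToFact(x, addNA = FALSE)
levels(f_no_na)
#> [1] "a" "b"
identical(f_no_na, as.factor(x))
#> [1] TRUE
```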
setlevels() allows you to change factor levels by
reference (in-place), avoiding object copying.
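A minimal sketch of the by-reference update, assuming the old and new arguments name the levels to rename and their replacements. The factor is made up; note that no assignment back to f is needed.

```r
library(kit)

f <- factor(c("lo", "hi", "lo", "mid"))
levels(f)
#> [1] "hi"  "lo"  "mid"

# Rename one level in place; f itself is modified
setlevels(f, old = "lo", new = "low")

levels(f)
#> [1] "hi"  "low" "mid"
as.character(f)
#> [1] "low" "hi"  "low" "mid"
```

Because the object is changed by reference, any other variable that points to the same factor sees the new levels as well, which is worth keeping in mind when the factor is shared.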

| Task | kit function | Base R equivalent |
|---|---|---|
| Row-wise sum | psum() | rowSums(cbind(...)) |
| Row-wise mean | pmean() | rowMeans(cbind(...)) |
| Row-wise min | fpmin() | pmin(...) |
| Row-wise max | fpmax() | pmax(...) |
| Row-wise range | prange() | pmax(...) - pmin(...) |
| First non-NA | pfirst() | apply(cbind(...), 1, function(x) x[!is.na(x)][1]) |
| Fast if-else | iif() | ifelse() |
| Nested if-else | nif() | Nested ifelse() |
| Switch | vswitch() | match() + indexing |
| Unique values | funique() | unique() |
| Top N indices | topn() | order(x, decreasing = TRUE)[1:n] |
| Char to Factor | charToFact() | as.factor() |

For comprehensive details and performance benchmarks, please refer to the individual function documentation.