Introduction to kit

Overview

kit provides a collection of fast utility functions implemented in C for data manipulation in R. It serves as a lightweight, high-performance toolkit for tasks that are either slow or cumbersome in base R, such as row-wise operations, vectorized conditionals, and duplicate detection.

Key features include:

Parallel statistical functions: Row-wise operations (psum, pmean, pfirst) using OpenMP.
Vectorized conditionals: Fast if-else logic (iif, nif, vswitch) that preserves attributes.
Efficient set operations: Faster unique, duplicated, and count for vectors and data frames.
Partial sorting: Retrieve top N elements without sorting the entire vector (topn).
Factor utilities: Fast character-to-factor conversion (charToFact) and level manipulation (setlevels).

Most functions are implemented in C and support multi-threading where applicable, making them significantly faster than their base R equivalents on large datasets.

Parallel Statistical Functions

Computing row-wise statistics across multiple vectors or data frame columns is a common task. While base R has pmin() and pmax(), it lacks efficient equivalents for sum, mean, or product. kit fills this gap.

Row-wise Arithmetic

psum(), pmean(), and pprod() compute parallel sum, mean, and product respectively. They accept multiple vectors or a single list/data frame.

x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel sum
psum(x, y, z, na.rm = TRUE)
#> [1] 6 7 8 7

# Parallel mean
pmean(x, y, z, na.rm = TRUE)
#> [1] 2.000000 3.500000 4.000000 2.333333

They are particularly useful for data frames:

df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))
psum(df)
#> [1] 12 15 18
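
As a sanity check, the sums and means above can be reproduced in base R by binding the vectors into a matrix. The sketch below uses only base functions and does not call kit. The row-wise product is included to illustrate what pprod() is described as computing; its values are derived from that description, not from package output.

```r
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)
m <- cbind(x, y, z)

# Base R reference for psum(x, y, z, na.rm = TRUE)
ref_sum <- rowSums(m, na.rm = TRUE)
ref_sum
#> [1] 6 7 8 7

# Base R reference for pmean(x, y, z, na.rm = TRUE)
ref_mean <- rowMeans(m, na.rm = TRUE)

# Row-wise product, skipping missing values
ref_prod <- apply(m, 1, prod, na.rm = TRUE)
ref_prod
#> [1]  6 12 16  5
```

The matrix detour is what makes the base R versions comparatively costly on wide inputs: every vector is copied into m before any arithmetic happens.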

Row-wise Min, Max, and Range

fpmin(), fpmax(), and prange() compute parallel minimum, maximum, and range (max - min) respectively. They complement base R’s pmin() and pmax(), providing greater performance and the ability to work efficiently with data frames.

x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)

# Parallel minimum
fpmin(x, y, z, na.rm = TRUE)
#> [1] 1 3 4 1

# Parallel maximum
fpmax(x, y, z, na.rm = TRUE)
#> [1] 3 4 4 5

# Parallel range (max - min)
prange(x, y, z, na.rm = TRUE)
#> [1] 2 1 0 4

Like psum() and pmean(), these functions preserve the input type when all inputs share the same type and automatically promote to the highest type when inputs are mixed (logical < integer < double). The exception is prange(), which always returns double to avoid integer overflow.

# With data frames
fpmin(df)
#> [1] 1 2 3
fpmax(df)
#> [1] 7 8 9
prange(df)
#> [1] 6 6 6
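
The promotion order and the overflow concern mentioned above can be illustrated with base R alone. This is a sketch of the underlying R semantics, not of kit's internals: it shows the same logical < integer < double hierarchy in c(), and why a range computed in integer arithmetic can fail where a double result does not.

```r
# The same coercion hierarchy applies when base R combines values
typeof(c(TRUE, 1L))   # "integer": logical is promoted to integer
typeof(c(1L, 2.5))    # "double":  integer is promoted to double

# A range is max - min; in integer arithmetic this can overflow
hi <- .Machine$integer.max
lo <- -.Machine$integer.max
int_range <- suppressWarnings(hi - lo)   # NA: integer overflow
dbl_range <- as.double(hi) - as.double(lo)
dbl_range                                # 4294967294, exact in double precision

# Base R reference for prange(x, y, z, na.rm = TRUE)
x <- c(1, 3, NA, 5)
y <- c(2, NA, 4, 1)
z <- c(3, 4, 4, 1)
ref_range <- pmax(x, y, z, na.rm = TRUE) - pmin(x, y, z, na.rm = TRUE)
ref_range
#> [1] 2 1 0 4
```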

Coalescing Values

pfirst() and plast() return the first or last non-missing value across a set of vectors. pfirst() behaves like SQL’s COALESCE function.

primary   <- c(NA, 2, NA, 4)
secondary <- c(1, NA, 3, NA)
fallback  <- c(0, 0, 0, 0)

# Take first available value
pfirst(primary, secondary, fallback)
#> [1] 1 2 3 4
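
To make the coalescing logic explicit, here is a minimal base R sketch of the same idea. The helper names first_non_na() and last_non_na() are hypothetical and exist only for illustration; they are not part of kit, and the sketch makes no claim about how pfirst() and plast() are implemented.

```r
# Hypothetical helper: fill gaps in the first vector from later ones
first_non_na <- function(...) {
  vecs <- list(...)
  out <- vecs[[1]]
  for (v in vecs[-1]) {
    miss <- is.na(out)      # positions still missing
    out[miss] <- v[miss]    # take the next candidate there
  }
  out
}

# Hypothetical helper: same logic with the argument order reversed
last_non_na <- function(...) do.call(first_non_na, rev(list(...)))

primary   <- c(NA, 2, NA, 4)
secondary <- c(1, NA, 3, NA)
fallback  <- c(0, 0, 0, 0)

first_res <- first_non_na(primary, secondary, fallback)
first_res
#> [1] 1 2 3 4

last_res <- last_non_na(primary, secondary, fallback)
last_res
#> [1] 0 0 0 0
```

Taking the last non-missing value returns the fallback everywhere because fallback has no missing entries, which is why argument order matters when coalescing.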

Logical and Count Operations

You can check conditions or count values row-wise with pall(), pany(), pcount(), and pcountNA().

a <- c(TRUE, FALSE, NA, TRUE)
b <- c(TRUE, NA, TRUE, FALSE)
c <- c(NA, TRUE, FALSE, TRUE)

# Any TRUE per row?
pany(a, b, c, na.rm = TRUE)
#> [1] TRUE TRUE TRUE TRUE

# Count NAs per row
pcountNA(a, b, c)
#> [1] 1 1 1 0

# Count specific value (e.g., TRUE) per row
pcount(a, b, c, value = TRUE)
#> [1] 2 1 1 2
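
The same row-wise results can be cross-checked in base R on a logical matrix. The sketch below uses only base functions. The all_true line shows what a row-wise "all" with missing values removed would give on this data, which is an assumption about the semantics of pall(..., na.rm = TRUE) and not verified package output.

```r
m <- cbind(
  a = c(TRUE, FALSE, NA, TRUE),
  b = c(TRUE, NA, TRUE, FALSE),
  c = c(NA, TRUE, FALSE, TRUE)
)

# Any TRUE per row, ignoring NAs
any_true <- rowSums(m, na.rm = TRUE) > 0

# All non-missing values TRUE per row
all_true <- apply(m, 1, all, na.rm = TRUE)

# Number of NAs and number of TRUEs per row
n_na   <- rowSums(is.na(m))
n_true <- rowSums(m, na.rm = TRUE)

n_na
#> [1] 1 1 1 0
n_true
#> [1] 2 1 1 2
```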

Vectorized Conditionals

Fast If-Else (`iif`)

Base R’s ifelse() is known to be slow and often strips attributes (like Date class or factor levels). iif() is a faster, more robust alternative that preserves attributes from the yes argument.

dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))

# Base ifelse strips class
class(ifelse(dates > "2024-01-01", dates, dates - 1))
#> [1] "numeric"

# iif preserves class
class(iif(dates > "2024-01-01", dates, dates - 1))
#> [1] "Date"

It also supports explicit NA handling:

x <- c(-2, -1, NA, 1, 2)
iif(x > 0, "positive", "non-positive", na = "missing")
#> [1] "non-positive" "non-positive" "missing"      "positive"     "positive"
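
For comparison, here is a sketch of what it takes to get the same two results in base R without iif(). Subassignment keeps the Date class because the result starts out as a Date vector, and explicit NA handling needs an extra ifelse() layer. The variable names are illustrative only.

```r
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-03"))
test  <- dates > as.Date("2024-01-01")

# Start from the "no" branch, then overwrite where the test holds
shifted <- dates - 1
shifted[test] <- dates[test]
class(shifted)
#> [1] "Date"

# Explicit NA handling requires a nested call
x <- c(-2, -1, NA, 1, 2)
sign_label <- ifelse(is.na(x), "missing",
                     ifelse(x > 0, "positive", "non-positive"))
sign_label
#> [1] "non-positive" "non-positive" "missing"      "positive"     "positive"
```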

Nested Conditionals (`nif`)

For multiple conditions, nif() offers a cleaner, more efficient syntax than nested ifelse() calls, similar to SQL’s CASE WHEN.

score <- c(95, 82, 67, 45, 78)

nif(
  score >= 90, "A",
  score >= 80, "B",
  score >= 70, "C",
  score >= 60, "D",
  default = "F"
)
#> [1] "A" "B" "D" "F" "C"
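
The nested ifelse() form that this replaces looks as follows. It is shown only to make the first-match-wins evaluation order explicit: a score of 95 satisfies every condition but is assigned "A" because that test comes first.

```r
score <- c(95, 82, 67, 45, 78)

# Each condition is only reached if all earlier ones were FALSE
grade <- ifelse(score >= 90, "A",
         ifelse(score >= 80, "B",
         ifelse(score >= 70, "C",
         ifelse(score >= 60, "D", "F"))))
grade
#> [1] "A" "B" "D" "F" "C"
```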

Vectorized Switch (`vswitch`, `nswitch`)

vswitch() maps input values to outputs efficiently.

status_code <- c(1L, 2L, 3L, 1L, 4L)

vswitch(
  x = status_code,
  values = c(1L, 2L, 3L),
  outputs = c("pending", "approved", "rejected"),
  default = "unknown"
)
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"
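
The mapping can also be written in base R with match() and indexing, which is a useful mental model for what a vectorized switch does. This is a sketch of the equivalent lookup and is not a description of kit's implementation.

```r
status_code <- c(1L, 2L, 3L, 1L, 4L)
values  <- c(1L, 2L, 3L)
outputs <- c("pending", "approved", "rejected")

# Position of each code in `values`; NA where nothing matches
idx <- match(status_code, values)

# Look up the output, then fill unmatched positions with the default
status <- outputs[idx]
status[is.na(idx)] <- "unknown"
status
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"
```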

nswitch() offers a pairwise syntax in which each value is followed directly by its output.

nswitch(status_code,
  1L, "pending",
  2L, "approved",
  3L, "rejected",
  default = "unknown"
)
#> [1] "pending"  "approved" "rejected" "pending"  "unknown"

Outputs can also be vectors, such as data frame columns, and scalar and vector outputs can be mixed:

df <- data.frame(
  code = c(1, 2, 1, 3, 2),
  val_a = c(10, 20, 30, 40, 50),
  val_b = c(100, 200, 300, 400, 500)
)
with(df, nswitch(code,
  1, val_a,
  2, val_b,
  3, 0,
  default = NA_real_
))
#> [1]  10 200  30   0 500

Fast Unique and Duplicates

kit provides optimized versions of unique() and duplicated() that are significantly faster for vectors and data frames.

Unique Values and Duplicates

vec <- c("a", "b", "a", "c", "b")

# Get unique values
funique(vec)
#> [1] "a" "b" "c"

# Check for duplicates
fduplicated(vec)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

uniqLen() efficiently counts the number of unique elements without allocating the unique vector itself. For a data frame, it counts unique rows:

df <- data.frame(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "b")
)
uniqLen(df)
#> [1] 2
funique(df)
#>   x y
#> 1 1 a
#> 2 2 b
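
For reference, the base R counterparts of these counts are shown below. Each one materializes an intermediate object (the unique values, or a logical vector of duplicates) before counting, which is the allocation that uniqLen() is described as avoiding.

```r
vec <- c("a", "b", "a", "c", "b")

# Builds the full unique vector, then measures it
n_unique_vec <- length(unique(vec))
n_unique_vec
#> [1] 3

df <- data.frame(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "b")
)

# Two equivalent ways to count distinct rows
n_unique_rows <- nrow(unique(df))
n_not_dup     <- sum(!duplicated(df))
n_unique_rows
#> [1] 2
```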

Counting Occurrences

countOccur() produces a frequency table (similar to table() or dplyr::count()) but returns a standard data frame.

countOccur(c("apple", "banana", "apple", "cherry"))
#>   Variable Count
#> 1    apple     2
#> 2   banana     1
#> 3   cherry     1
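
A comparable data frame can be built from table() in base R, at the cost of a conversion and a renaming step. The sketch below uses only base functions; the column names are set by hand to mirror the output above.

```r
fruit <- c("apple", "banana", "apple", "cherry")

# table() returns a "table" object; convert it and rename the columns
freq <- as.data.frame(table(fruit), stringsAsFactors = FALSE)
names(freq) <- c("Variable", "Count")
freq
#>   Variable Count
#> 1    apple     2
#> 2   banana     1
#> 3   cherry     1
```

Note that table() orders its result by sorted level, so the row order here comes from alphabetical sorting and not necessarily from order of first appearance.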

Sorting and Utilities

Partial Sorting (`topn`)

Sorting a large vector just to get the top few elements is inefficient. topn() uses a partial sorting algorithm to retrieve the top (or bottom) N indices or values.

set.seed(42)
x <- rnorm(1000)

# Get indices of top 5 values
topn(x, n = 5)
#> [1] 988 525 820 459 900

# Get the actual values (decreasing = FALSE for bottom values)
topn(x, n = 5, decreasing = FALSE, index = FALSE)
#> [1] -3.371739 -3.017933 -2.993090 -2.958780 -2.699930
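
The idea of partial sorting can be sketched in base R as well. The full-sort approach orders all 1000 elements to keep five, while sort() with the partial argument only guarantees that the requested positions hold their final values. This illustrates the general technique and is not a claim about the algorithm topn() uses internally.

```r
set.seed(42)
x <- rnorm(1000)
n <- 5

# Full sort: order every element, then keep the first n indices
top_idx <- order(x, decreasing = TRUE)[seq_len(n)]
top_val <- x[top_idx]

# Partial sort: only positions 1..n are guaranteed to be in place
bottom_val <- sort(x, partial = seq_len(n))[seq_len(n)]

# Both agree with a complete sort
stopifnot(identical(top_val, sort(x, decreasing = TRUE)[seq_len(n)]))
stopifnot(identical(bottom_val, sort(x)[seq_len(n)]))
```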

Factor Manipulation

charToFact() is a fast alternative to as.factor() for character vectors, with control over NA levels.

charToFact(c("a", "b", NA, "a"))
#> [1] a    b    <NA> a   
#> Levels: a b <NA>

setlevels() allows you to change factor levels by reference (in-place), avoiding object copying.
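
For context, the conventional base R way to relabel a factor is the replacement function levels<-, sketched below. It follows R's usual copy-on-modify semantics, which is the copying that setlevels() is described as avoiding. The call to setlevels() itself is not shown because its arguments are not given in this text.

```r
f <- factor(c("low", "high", "low", "mid"))

# Levels are sorted alphabetically: high, low, mid
old_levels <- levels(f)

# Relabel in the same order; R may copy the factor here
levels(f) <- c("H", "L", "M")
as.character(f)
#> [1] "L" "H" "L" "M"
```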

Finding Positions (`fpos`)

fpos() finds the positions at which a pattern (needle) occurs within a larger vector (haystack), returning the starting index of each match.

haystack <- c(1, 2, 3, 4, 1, 2, 5)
needle <- c(1, 2)

fpos(needle, haystack)
#> [1] 1 5
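
The search can be pictured as a sliding window over the haystack. The base R sketch below implements that idea naively for vectors without missing values. The function name find_pattern() is hypothetical and used only for illustration; it is not part of kit and says nothing about how fpos() is implemented.

```r
# Hypothetical helper: start positions where `needle` occurs in `haystack`
find_pattern <- function(needle, haystack) {
  n <- length(needle)
  h <- length(haystack)
  if (n == 0L || n > h) return(integer(0))
  starts <- seq_len(h - n + 1L)
  # Compare the needle against each window of the same length
  hits <- vapply(
    starts,
    function(i) all(haystack[i:(i + n - 1L)] == needle),
    logical(1)
  )
  starts[hits]
}

haystack <- c(1, 2, 3, 4, 1, 2, 5)
needle <- c(1, 2)

pos <- find_pattern(needle, haystack)
pos
#> [1] 1 5
```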

Summary

Task	kit function	Base R equivalent
Row-wise sum	`psum()`	`rowSums(cbind(...))`
Row-wise mean	`pmean()`	`rowMeans(cbind(...))`
Row-wise min	`fpmin()`	`pmin(...)`
Row-wise max	`fpmax()`	`pmax(...)`
Row-wise range	`prange()`	`pmax(...) - pmin(...)`
First non-NA	`pfirst()`	`apply(..., 1, function(x) x[!is.na(x)][1])`
Fast if-else	`iif()`	`ifelse()`
Nested if-else	`nif()`	Nested `ifelse()`
Switch	`vswitch()`	`match()` + indexing
Unique values	`funique()`	`unique()`
Top N indices	`topn()`	`order(x, decreasing = TRUE)[1:n]`
Char to Factor	`charToFact()`	`as.factor()`

For comprehensive details and performance benchmarks, please refer to the individual function documentation.