Fast multi-pattern string matching in R with the Aho-Corasick algorithm.
ahocorasick, powered by the Rust aho-corasick
crate, allows you to search for many substrings in one or more strings
with linear complexity in R. It builds a reusable automaton once and
then uses it for detection, counting, locating, and extraction and
replacement.
Aho-Corasick
is a multiple-pattern string matching algorithm designed to search for
many keywords in a text in a single pass. After preprocessing the
pattern set into an automaton, it finds all matches in
O(text_length + number_of_matches), instead of the naive
O(text_length × number_of_patterns) approach. This makes it
especially suitable for high-throughput tasks such as dictionary
matching, keyword extraction, rule-based text filtering, and large-scale
log or document analysis.
For more details on the Aho-Corasick algorithm, please see Aho and Corasick (1975).
Install the released version from CRAN:
install.packages("ahocorasick")Install the released version from R-universe/R-multiverse:
install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")Install the development version from source. You need Cargo and
rustc on PATH.
# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")Start with a character vector of patterns you want to search for,
then compile it into an <ac_automaton> object by
calling ac_build() that can be reused across many
documents.
library(ahocorasick)
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)You can set ascii_case_insensitive = TRUE to make the
search case-insensitive for ASCII characters (a-z and
A-Z) only.
ac_extract() returns one data frame per document, with
both the matched text and the pattern value that produced each
match.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_extract(ac, doc)
hits[[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#> 3 hello helloUse ac_extract_df() when you want one data frame for all
documents.
ac_extract_df(ac, doc)
#> doc_id matches patterns
#> 1 1 hello hello
#> 2 1 world world
#> 3 1 hello hello
#> 4 3 fish fishSet overlapping = TRUE to return overlapping matches.
This is only supported when the automaton was built with the default
match_kind = "standard".
ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 aba aba
#> 2 1 bab bab
#> 3 1 aba abaac_locate() returns one data frame per document. Offsets
are 1-based character positions, inclusive on both sides, so they can be
used with substr().
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
hits <- ac_locate(ac, doc)
hits[[1]]
#> pattern_id start end
#> 1 1 18 22
#> 2 2 24 28
#> 3 1 31 35Use ac_locate_df() when you want one data frame for all
documents.
ac_locate_df(ac, doc)
#> doc_id pattern_id start end
#> 1 1 1 18 22
#> 2 1 2 24 28
#> 3 1 1 31 35
#> 4 3 3 1 4Use ac_locate_bytes() when byte offsets are more useful
than R character positions. Byte offsets are 0-based and
byte_end is end-exclusive.
Use ac_detect() when you only need to know whether a
document contains at least one match, and ac_count() when
you need the number of matches.
patterns <- c("hello", "world", "fish")
doc <- c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
ac <- ac_build(patterns)
ac_detect(ac, doc)
#> [1] TRUE FALSE TRUE
ac_count(ac, doc)
#> [1] 3 0 1These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.
Use ac_replace() to replace every non-overlapping match
with the replacement for the matched pattern.
patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)
ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."Use one replacement string to replace all patterns with the same value.
ac_replace(ac, doc, "")
#> [1] "The ."Replacement uses the automaton’s match semantics. If you want Polars
replace_many(..., leftmost = TRUE)-style priority, build
the automaton with match_kind = "leftmost_first".
ac_leftmost <- ac_build(
c("append", "appendage", "app"),
match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"The ac_*_file() family applies the same automaton to one
or more files. Pass a vector of file paths and each function returns
results aligned with those files.
ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])
ac_detect_file(ac_files, paths)
#> [1] TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#> matches patterns
#> 1 hello hello
#> 2 world world
#>
#> [[2]]
#> [1] matches patterns
#> <0 rows> (or 0-length row.names)ac_detect_file(), ac_count_file(),
ac_extract_file(), and ac_replace_file()
support stream = TRUE when the automaton was built with the
default match_kind = "standard". This is useful for large
files when you want to avoid reading the whole file into memory at
once.
ac_locate_file() does not support
stream = TRUE. It returns R character offsets instead of
raw byte offsets, and converting streamed byte positions back into
character positions would require another pass over the file. Keeping
file location search non-streaming makes the behavior simpler and easier
to reason about.
These search functions work naturally inside
dplyr::mutate(). Scalar outputs like
ac_detect(), ac_count(), and
ac_replace() become ordinary columns, while
ac_extract() and ac_locate() can be stored as
list-columns.
patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)
docs <- tibble::tibble(
doc = c(
"this is my first hello world. hello!",
"nothing to see",
"fish and chips"
)
)
docs |>
dplyr::mutate(
detected = ac_detect(ac, doc),
n_matches = ac_count(ac, doc),
replaced = ac_replace(ac, doc, "[match]"),
extracted = ac_extract(ac, doc)
) |>
tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#> doc detected n_matches replaced matches patterns
#> <chr> <lgl> <int> <chr> <chr> <chr>
#> 1 this is my first hello world. he… TRUE 3 this is… hello hello
#> 2 this is my first hello world. he… TRUE 3 this is… world world
#> 3 this is my first hello world. he… TRUE 3 this is… hello hello
#> 4 fish and chips TRUE 1 [match]… fish fishSearch functions let you choose how NA documents
behave.
doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))
ac_detect(ac_na, doc_na, na = "false")
#> [1] TRUE FALSE TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" "" "[match]"For list-column workflows, ac_locate(..., na = "empty")
and ac_extract(..., na = "empty") treat missing documents
as no matches.
ac_build() exposes three match kinds from the Rust
library.
standard (the default)The default, match_kind = "standard", returns the match
that finishes first. It is also the only mode that supports overlapping
search.
patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"
ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#> doc_id matches patterns
#> 1 1 winter winter
#> 2 1 disc disc
#> 3 1 disco disco
#> 4 1 discontent discontent
#> 5 1 content contentleftmost_firstmatch_kind = "leftmost_first" returns the leftmost
non-overlapping match. If several patterns start at the same position,
the earlier pattern wins.
ac <- ac_build(
c("disco", "disc"),
match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 disco discoleftmost_longestmatch_kind = "leftmost_longest" returns the leftmost
non-overlapping match. If several patterns start at the same position,
the longest pattern wins.
ac <- ac_build(
c("disco", "disc", "discontent"),
match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#> doc_id matches patterns
#> 1 1 discontent discontentimplementation controls the underlying automaton
implementation:
"auto" lets the Rust crate choose a heuristic
default."noncontiguous_nfa" is fast to build and uses moderate
memory."contiguous_nfa" uses memory efficiently and is usually
faster to search than a noncontiguous NFA."dfa" can be fastest to search, but may be slow to
build and can use much more memory.ac <- ac_build(
c("disco", "disc"),
implementation = "dfa"
)
ac_info(ac)$implementation
#> [1] "dfa"See the benchmark
article for results on English, Chinese, and mixed-text workloads.
In the current benchmark, ahocorasick is
fastest for detect and count
across the tested cases. The ac_extract_df() is also the
fastest for extract while preserving a
tidy long data-frame result.
As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!
Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:
AhoCorasickTrie:
Implemented in C++. It only supports 128 ASCII characters and currently
does not support UTF-8/Unicode (such as CJK and emojis).Biostrings:
It is a special tool for bioinformatics, and its object system is
XString / DNAString / XStringSet.
It is not suitable as a drop-in multi-mode retrieval tool for general R
string vectors.r-polars:
Polars’ string expressions also use the Aho-Corasick algorithm to match
(based on the same crate as
this package). If your data is already in a DataFrame, I
recommend you use Polars for matching strings directly.It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)
Inspired by the Python package ahocorasick_rs.
Thanks for the excellent Rust aho-corasick
crate.
MIT © 2026 Yousa Mirage