README

ahocorasick, powered by the Rust aho-corasick crate, allows you to search for many substrings in one or more strings with linear complexity in R. It builds a reusable automaton once and then uses it for detection, counting, locating, and extraction and replacement.

Aho-Corasick is a multiple-pattern string matching algorithm designed to search for many keywords in a text in a single pass. After preprocessing the pattern set into an automaton, it finds all matches in O(text_length + number_of_matches), instead of the naive O(text_length × number_of_patterns) approach. This makes it especially suitable for high-throughput tasks such as dictionary matching, keyword extraction, rule-based text filtering, and large-scale log or document analysis.

Installation

install.packages("ahocorasick")

install.packages("ahocorasick", repos = "https://yousa-mirage.r-universe.dev")
# or
install.packages("ahocorasick", repos = "https://community.r-multiverse.org")

Install the development version from source. You need Cargo and rustc on PATH.

# install.packages("pak")
pak::pak("Yousa-Mirage/r-ahocorasick")

Quick Start

Start with a character vector of patterns you want to search for, then compile it into an <ac_automaton> object by calling ac_build() that can be reused across many documents.

library(ahocorasick)

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

You can set ascii_case_insensitive = TRUE to make the search case-insensitive for ASCII characters (a-z and A-Z) only.

Extracting Matches

ac_extract() returns one data frame per document, with both the matched text and the pattern value that produced each match.

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

hits <- ac_extract(ac, doc)
hits[[1]]
#>   matches patterns
#> 1   hello    hello
#> 2   world    world
#> 3   hello    hello

ac_extract_df(ac, doc)
#>   doc_id matches patterns
#> 1      1   hello    hello
#> 2      1   world    world
#> 3      1   hello    hello
#> 4      3    fish     fish

Set overlapping = TRUE to return overlapping matches. This is only supported when the automaton was built with the default match_kind = "standard".

ac_overlap <- ac_build(c("aba", "bab"))
ac_extract_df(ac_overlap, "ababa", overlapping = TRUE)
#>   doc_id matches patterns
#> 1      1     aba      aba
#> 2      1     bab      bab
#> 3      1     aba      aba

Locating Matches

ac_locate() returns one data frame per document. Offsets are 1-based character positions, inclusive on both sides, so they can be used with substr().

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

hits <- ac_locate(ac, doc)
hits[[1]]
#>   pattern_id start end
#> 1          1    18  22
#> 2          2    24  28
#> 3          1    31  35

ac_locate_df(ac, doc)
#>   doc_id pattern_id start end
#> 1      1          1    18  22
#> 2      1          2    24  28
#> 3      1          1    31  35
#> 4      3          3     1   4

Use ac_locate_bytes() when byte offsets are more useful than R character positions. Byte offsets are 0-based and byte_end is end-exclusive.

Detecting And Counting

Use ac_detect() when you only need to know whether a document contains at least one match, and ac_count() when you need the number of matches.

patterns <- c("hello", "world", "fish")
doc <- c(
  "this is my first hello world. hello!",
  "nothing to see",
  "fish and chips"
)

ac <- ac_build(patterns)

ac_detect(ac, doc)
#> [1]  TRUE FALSE  TRUE
ac_count(ac, doc)
#> [1] 3 0 1

These functions are usually the cheapest way to answer yes/no or frequency questions without materializing the matched strings or offsets.

Replacing Matches

Use ac_replace() to replace every non-overlapping match with the replacement for the matched pattern.

patterns <- c("fox", "brown", "quick")
doc <- "The quick brown fox."
ac <- ac_build(patterns)

ac_replace(ac, doc, c("sloth", "grey", "slow"))
#> [1] "The slow grey sloth."

ac_replace(ac, doc, "")
#> [1] "The   ."

Replacement uses the automaton’s match semantics. If you want Polars replace_many(..., leftmost = TRUE)-style priority, build the automaton with match_kind = "leftmost_first".

ac_leftmost <- ac_build(
  c("append", "appendage", "app"),
  match_kind = "leftmost_first"
)
ac_replace(ac_leftmost, "append the app to the appendage", c("x", "y", "z"))
#> [1] "x the z to the xage"

Searching Files

The ac_*_file() family applies the same automaton to one or more files. Pass a vector of file paths and each function returns results aligned with those files.

ac_files <- ac_build(c("hello", "world"))
paths <- c(tempfile(fileext = ".txt"), tempfile(fileext = ".txt"))
writeLines("hello world", paths[[1]])
writeLines("fish and chips", paths[[2]])

ac_detect_file(ac_files, paths)
#> [1]  TRUE FALSE
ac_count_file(ac_files, paths)
#> [1] 2 0
ac_extract_file(ac_files, paths)
#> [[1]]
#>   matches patterns
#> 1   hello    hello
#> 2   world    world
#> 
#> [[2]]
#> [1] matches  patterns
#> <0 rows> (or 0-length row.names)

ac_detect_file(), ac_count_file(), ac_extract_file(), and ac_replace_file() support stream = TRUE when the automaton was built with the default match_kind = "standard". This is useful for large files when you want to avoid reading the whole file into memory at once.

ac_locate_file() does not support stream = TRUE. It returns R character offsets instead of raw byte offsets, and converting streamed byte positions back into character positions would require another pass over the file. Keeping file location search non-streaming makes the behavior simpler and easier to reason about.

Using With Tidyverse

These search functions work naturally inside dplyr::mutate(). Scalar outputs like ac_detect(), ac_count(), and ac_replace() become ordinary columns, while ac_extract() and ac_locate() can be stored as list-columns.

patterns <- c("hello", "world", "fish")
ac <- ac_build(patterns)

docs <- tibble::tibble(
  doc = c(
    "this is my first hello world. hello!",
    "nothing to see",
    "fish and chips"
  )
)

docs |> 
  dplyr::mutate(
    detected = ac_detect(ac, doc),
    n_matches = ac_count(ac, doc),
    replaced = ac_replace(ac, doc, "[match]"),
    extracted = ac_extract(ac, doc)
  ) |> 
  tidyr::unnest(extracted)
#> # A tibble: 4 × 6
#>   doc                               detected n_matches replaced matches patterns
#>   <chr>                             <lgl>        <int> <chr>    <chr>   <chr>   
#> 1 this is my first hello world. he… TRUE             3 this is… hello   hello   
#> 2 this is my first hello world. he… TRUE             3 this is… world   world   
#> 3 this is my first hello world. he… TRUE             3 this is… hello   hello   
#> 4 fish and chips                    TRUE             1 [match]… fish    fish

Missing Values

doc_na <- c("hello", NA_character_, "world")
ac_na <- ac_build(c("hello", "world"))

ac_detect(ac_na, doc_na, na = "false")
#> [1]  TRUE FALSE  TRUE
ac_count(ac_na, doc_na, na = "zero")
#> [1] 1 0 1
ac_replace(ac_na, doc_na, "[match]", na = "empty")
#> [1] "[match]" ""        "[match]"

For list-column workflows, ac_locate(..., na = "empty") and ac_extract(..., na = "empty") treat missing documents as no matches.

Matching Semantics

standard (the default)

The default, match_kind = "standard", returns the match that finishes first. It is also the only mode that supports overlapping search.

patterns <- c("content", "disco", "disc", "discontent", "winter")
haystack <- "This is the winter of my discontent"

ac <- ac_build(patterns)
ac_extract_df(ac, haystack)
#>   doc_id matches patterns
#> 1      1  winter   winter
#> 2      1    disc     disc
ac_extract_df(ac, haystack, overlapping = TRUE)
#>   doc_id    matches   patterns
#> 1      1     winter     winter
#> 2      1       disc       disc
#> 3      1      disco      disco
#> 4      1 discontent discontent
#> 5      1    content    content

leftmost_first

match_kind = "leftmost_first" returns the leftmost non-overlapping match. If several patterns start at the same position, the earlier pattern wins.

ac <- ac_build(
  c("disco", "disc"),
  match_kind = "leftmost_first"
)
ac_extract_df(ac, "discontent")
#>   doc_id matches patterns
#> 1      1   disco    disco

leftmost_longest

match_kind = "leftmost_longest" returns the leftmost non-overlapping match. If several patterns start at the same position, the longest pattern wins.

ac <- ac_build(
  c("disco", "disc", "discontent"),
  match_kind = "leftmost_longest"
)
ac_extract_df(ac, "discontent")
#>   doc_id    matches   patterns
#> 1      1 discontent discontent

Performance Options

ac <- ac_build(
  c("disco", "disc"),
  implementation = "dfa"
)

ac_info(ac)$implementation
#> [1] "dfa"

Benchmark

See the benchmark article for results on English, Chinese, and mixed-text workloads. In the current benchmark, ahocorasick is fastest for detect and count across the tested cases. The ac_extract_df() is also the fastest for extract while preserving a tidy long data-frame result.

As with any benchmark, real-world results will differ based on your particular situation. If performance is important to your application, measure the alternatives yourself!

Related Work

Packages that developed to handle multiple string matching in R are less than in Python. These are what I found:

It’s great to see that Rust is providing more and more modern, safe, high-performance open source tools for R (and also for Python) :)

ahocorasick

Installation

Quick Start

Extracting Matches

Locating Matches

Detecting And Counting

Replacing Matches

Searching Files

Using With Tidyverse

Missing Values

Matching Semantics

`standard` (the default)

`leftmost_first`

`leftmost_longest`

Performance Options

Benchmark

Acknowledgments

ahocorasick

Installation

Quick Start

Extracting Matches

Locating Matches

Detecting And Counting

Replacing Matches

Searching Files

Using With Tidyverse

Missing Values

Matching Semantics

standard (the default)

leftmost_first

leftmost_longest

Performance Options

Benchmark

Related Work

Acknowledgments

`standard` (the default)

`leftmost_first`

`leftmost_longest`